{"id":299,"date":"2014-06-14T08:03:52","date_gmt":"2014-06-14T08:03:52","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=299"},"modified":"2014-06-15T19:13:58","modified_gmt":"2014-06-15T19:13:58","slug":"c-vs-java-xamarin-vs-oracle-performance-comparison-version-2-0","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=299","title":{"rendered":"C# vs. Java, Xamarin vs. Oracle, Performance Comparison version 2.0"},"content":{"rendered":"<p>Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the <a title=\"Language Shootout\" href=\"http:\/\/benchmarksgame.alioth.debian.org\/u64q\/program.php?test=mandelbrot&amp;lang=csharp&amp;id=6\" target=\"_blank\">language shootout webpage<\/a>.  However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions (though I was pleased that a <a href=\"http:\/\/benchmarksgame.alioth.debian.org\/u32\/fsharp.php\">quick port of my code to F#<\/a>  absolutely annihilates the OCAML version of this benchmark).\r\n<\/p>\r\n<p>I had benchmarked my code as faster than the Java code on my Mac, the CentOS cluster at my work and two Ubuntu virtual machines (Mono >3.0, LLVM compiler).  What gave? <\/p>\r\n\r\n<p> Undeterred, I thought I would try to use SIMD to improve another C# benchmark, and took a shot at the <a href=\"http:\/\/benchmarksgame.alioth.debian.org\/u64\/program.php?test=nbody&#038;lang=csharp&#038;id=7\">N-body benchmark test<\/a>.  Once again, the version I wrote, in my hands, was much faster.  But when submitted to the shoot-out and accepted, it was either much slower, or <a href=\"http:\/\/benchmarksgame.alioth.debian.org\/u32\/benchmark.php?test=nbody&#038;lang=csharp&#038;id=7&#038;data=u32\">didn&#8217;t even run<\/a>. 
My submission using the SSE instructions not only didn&#8217;t beat Java, it was actually slower than the original C# version!<\/p>\r\n\r\n<p>\r\nBelow are the timings on my top-of-the-line Mac for the two benchmarks. In both cases the C# program runs in 80-90% of the time the Java program takes.  There are several key takeaways here.\r\n<\/p>\r\n\r\n<a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png\"><img data-attachment-id=\"300\" data-permalink=\"http:\/\/evolvedmicrobe.com\/blogs\/?attachment_id=300\" data-orig-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?fit=480%2C480\" data-orig-size=\"480,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"C#JavaTimings\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?fit=300%2C300\" data-large-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?fit=480%2C480\" loading=\"lazy\" width=\"480\" height=\"480\" class=\"size-full wp-image-300\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?resize=480%2C480\" alt=\"C#JavaTimings\" w srcset=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?w=480 480w, https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?resize=150%2C150 150w, 
https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2014\/06\/tests.png?resize=300%2C300 300w\" sizes=\"(max-width: 480px) 100vw, 480px\" data-recalc-dims=\"1\" \/><\/a>\r\n<br\/>\r\n\r\n\t<h2>C# and Java have no meaningful performance differences.<\/h2> C# may use a lot less memory, but this is my optimized C# code and it is only beating the optimized Java by at most 20%.  When does that matter?\r\n\r\n<h2> C# and Java&#8217;s similar performance has different underpinnings<\/h2>\r\n\r\n<p> There are a lot of ways code can be fast, and I was surprised that Java and C# achieve similar speeds in very different ways.<\/p>\r\n\r\n<p> The key advantage of the C# code was the SIMD instructions, which in theory give a ~2X speedup.  However, the C# code only wins by ~20%. Why?<\/p>\r\n\r\n<p>I think the assembly gives the answer.  Just some quick snippets tell the whole story.  Here is some C# code for the N-Body problem compiled to assembly:  <\/p>\r\n\r\n<pre lang=\"asm\">\r\n00000034\tvsubpd\t0x18(%esi), %xmm2, %xmm2\r\n00000039\tvmulpd\t%xmm1, %xmm1, %xmm3\r\n0000003d\tvmulpd\t%xmm2, %xmm2, %xmm4\r\n00000041\tvhaddpd\t%xmm3, %xmm3, %xmm3\r\n00000045\tvhaddpd\t%xmm4, %xmm4, %xmm4\r\n00000049\tvaddpd\t%xmm4, %xmm3, %xmm3\r\n0000004d\tvsqrtpd\t%xmm3, %xmm4\r\n00000051\tvmulpd\t%xmm4, %xmm3, %xmm3\r\n<\/pre>\r\n\r\n<p> The important insights into performance from the assembly here are: <ol>\r\n\r\n<li> Similar instructions are stacked on top of each other, allowing for pipelining (e.g. one vhaddpd instruction directly follows another, and the two use different registers, so they do not depend on each other and can execute simultaneously).<\/li>\r\n<li>  The &#8220;p&#8221; in the instructions (i.e. vhadd-&#8220;p&#8221;-d) stands for &#8220;packed&#8221;, meaning each instruction operates on two doubles at once via SIMD.\r\n<\/li>\r\n<li> Only registers XMM1-XMM4 appear in the instructions.  
There are more registers available, and more pipelining possible, but the Mono\/LLVM compiler appears to use only the low-numbered\/scratch registers.  It is easier to write compilers that obey this restriction.\r\n<\/li>\r\n<\/ol>\r\n<br\/>\r\n\r\n<p>Now let&#8217;s compare that to some Java assembly emitted by the Oracle runtime for the same benchmark:<\/p>\r\n<pre brush=\"asm\">\r\n0x0000000109ce2315: vmulsd %xmm11,%xmm1,%xmm2\r\n0x0000000109ce231a: vmulsd %xmm11,%xmm9,%xmm3\r\n0x0000000109ce231f: vmulsd %xmm9,%xmm9,%xmm0\r\n0x0000000109ce2324: vmulsd %xmm1,%xmm1,%xmm5\r\n0x0000000109ce2328: vmovsd 0x38(%r12,%rdx,8),%xmm4\r\n0x0000000109ce232f: vaddsd %xmm0,%xmm5,%xmm5\r\n0x0000000109ce2333: vmovsd 0x30(%r12,%rdx,8),%xmm6\r\n0x0000000109ce233a: vmovsd 0x28(%r12,%rdx,8),%xmm7\r\n0x0000000109ce2341: vmovsd 0x20(%r12,%rbp,8),%xmm0\r\n0x0000000109ce2348: vmovsd 0x40(%r12,%rbp,8),%xmm8\r\n0x0000000109ce234f: vsubsd %xmm0,%xmm12,%xmm0\r\n0x0000000109ce2353: vmulsd %xmm8,%xmm9,%xmm9\r\n0x0000000109ce2358: vmulsd %xmm11,%xmm0,%xmm14\r\n0x0000000109ce235d: vmulsd %xmm8,%xmm0,%xmm15\r\n0x0000000109ce2362: vmulsd %xmm0,%xmm0,%xmm0\r\n0x0000000109ce2366: vmulsd %xmm8,%xmm1,%xmm1\r\n<\/pre>\r\n\r\n<p> The important insights here are: <ol>\r\n\r\n<li>Java uses the version of the instructions with &#8220;s&#8221; (i.e. vmul-&#8220;s&#8221;-d), meaning scalar: each instruction operates on a single double, so Java gets no SIMD advantage. No doubt this makes writing compilers easier.<\/li>\r\n<li>  However, Java is using lots of registers; even XMM15 shows up!<\/li>\r\n<li> As a result of using all those registers, the Java code has great pipelining.  Note that up to 5 vmulsd instructions show up in a row.  The JVM is simply brilliant at packing operations together, and this means that even though I, the programmer, was smart and used SIMD, my C# code only won by 20%, and not 2X.  It&#8217;s hard to beat Java pipelining. <\/li>\r\n<\/ol>\r\n<br\/>\r\n<p> All of which makes me wonder.  
What if we took the high-level language advantages of C# (low-overhead value types, easy interop via pointers, baked-in generics, etc.) and combined them with the low-level advantages of the JVM (better array bounds check elimination, pipelining compiler optimizations, and maybe someday the occasional stack allocation of reference type variables&#8230;)?<\/p>\r\n\r\n\r\n\r\n","protected":false},"excerpt":{"rendered":"Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the language shootout webpage. However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions (though I was pleased that [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[1],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":112,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","url_meta":{"origin":299,"position":0},"title":"Mono.Simd and the Mandlebrot Set.","date":"September 10, 2013","format":false,"excerpt":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":71,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=71","url_meta":{"origin":299,"position":1},"title":"Java vs. 
C# Performance Comparison for Parsing VCF Files","date":"May 26, 2013","format":false,"excerpt":"Making a comparison with a reasonably complex program ported between the two languages. Update 3\/10\/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.\u00a0\u00a0After this, the Java\/C# versions are indistinguishable on speed, but\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/05\/image_thumb1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":299,"position":2},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":6,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=6","url_meta":{"origin":299,"position":3},"title":"Not All Poisson Random Variables Are Created Equally","date":"January 30, 2013","format":false,"excerpt":"Spurred by a slow running program, I spent an afternoon researching what algorithms are available for generating Poisson random variables and figuring out which methods are used by R, Matlab, NumPy, the GNU Science Libraray and various other available packages. 
I learned some things that I think would be useful\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34-300x239.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":188,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=188","url_meta":{"origin":299,"position":4},"title":"The .NET Bio BAM Parser is Smoking Fast","date":"October 12, 2013","format":false,"excerpt":"The .NET Bio library has an improved version of it's BAM file\u00a0parser, which makes it significantly faster and easily competitive with the\u00a0current standard C coded SAMTools for obtaining\u00a0sequencing data and working with it. The chart below compares the time it\u00a0takes in seconds for the old version of the parser and\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":53,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=53","url_meta":{"origin":299,"position":5},"title":"How to remove the &ldquo;Trial Edition&rdquo; banner from the VisiFire open source chart kit","date":"April 13, 2013","format":false,"excerpt":"Visifire is a very good graphing component for making silverlight or WPF applications.\u00a0 The component was first released as an open source library on GoogleCode, but since then has been made a closed source proprietary and for profit project.\u00a0 The newer version contains several enhancements, but the open source version\u2026","rel":"","context":"In 
&quot;C#&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/04\/image_thumb.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/299"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=299"}],"version-history":[{"count":39,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/299\/revisions"}],"predecessor-version":[{"id":339,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/299\/revisions\/339"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=299"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=299"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=299"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}