{"id":112,"date":"2013-09-10T22:18:00","date_gmt":"2013-09-10T22:18:00","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112"},"modified":"2014-06-29T02:23:24","modified_gmt":"2014-06-29T02:23:24","slug":"mono-simd-and-the-mandlebrot-set","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","title":{"rendered":"Mono.Simd and the Mandlebrot Set."},"content":{"rendered":"<p>C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower.  This is despite the fact that the C# compiler often gets as much information about a method as the C compiler.&nbsp; It has been suggested that <a href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\">SSE\/SIMD<\/a> instructions in C# could overcome some of these issues.&nbsp; To evaluate this, I took a famous computational task, re-coded it using C# SIMD instructions, and evaluated the performance by looking at the execution time and how the emitted assembly code compared to the optimal assembly code.<\/p> <p><strong>Background<\/strong><\/p> <p>In 2009, <a href=\"http:\/\/tirania.org\/blog\/archive\/2008\/Nov-03.html\">Miguel de Icaza demonstrated a framework<\/a> that allows C# to use SSE2 intrinsic operations.This is now part of the Mono library and in theory, such methods can greatly save computational time (1.5X-16X), particularly for operations on byte arrays, a common task in bioinformatics. <\/p> <p><strong>Test Case<\/strong><\/p> <p>Computing the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Mandelbrot_set#For_programmers\">mandelbrot set<\/a> is one of the tasks of the <a href=\"http:\/\/benchmarksgame.alioth.debian.org\/u32\/performance.php?test=mandelbrot\">computer benchmarks game<\/a>, which compares program speed in different languages.&nbsp; Currently, C# is listed as being slower than Java, though the programs in each language use different techniques and neither uses an optimal algorithm (<a href=\"http:\/\/benchmarksgame.alioth.debian.org\/u32\/program.php?test=mandelbrot&amp;lang=gcc&amp;id=5\">see here for a better one<\/a>).&nbsp; However, it makes a useful test case for raw computing speed.&nbsp; The challenge is to compute the picture shown below by recursively squaring a complex number and adding a constant to it.<\/p> <p class=\"auto-style1\"><img alt=\"z_{n+1} = z_n^2 + c\" src=\"https:\/\/upload.wikimedia.org\/math\/1\/6\/8\/1686ce42df2b6ee51a3ae880613ca4d9.png\"><\/p> \r\n\r\n<p class=\"auto-style1\"><a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/image.png\"><img loading=\"lazy\" title=\"image_thumb[1]\" style=\"border-top: 0px; border-right: 0px; background-image: none; border-bottom: 0px; padding-top: 0px; padding-left: 0px; border-left: 0px; display: inline; padding-right: 0px\" border=\"0\" alt=\"image_thumb[1]\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/image_thumb11.png?resize=244%2C237\" width=\"244\" height=\"237\"  data-recalc-dims=\"1\"><\/a><\/p> \r\n\r\n<p><strong>The Algorithm<\/strong><\/p> \r\n\r\n<p>Complex numbers are a particularly challenging case because their multiplication is not a simple operation, but involves a rather annoyingly squaring of different terms and then adding\/subtracting them. <\/p> <p class=\"auto-style1\"><a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/image1.png\">\r\n\r\n<img loading=\"lazy\" title=\"image_thumb1\" style=\"border-top: 0px; border-right: 0px; background-image: none; border-bottom: 0px; padding-top: 0px; padding-left: 0px; border-left: 0px; display: inline; padding-right: 0px\" border=\"0\" alt=\"image_thumb1\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/image_thumb12.png?resize=430%2C70\" width=\"430\" height=\"70\"  data-recalc-dims=\"1\"><\/a><a href=\"http:\/\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/image1.png\"><br><\/a><\/p> <p>These are not easy vector operations, and as will be shown later one result of this is that using SSE to speed up the values for one multiplication is useless (I tried, it was worse). So, the performance improvement comes from doing two elements of the loop at a time (this is how the current Java submission is faster than the C# one, though it does not use SIMD).&nbsp; The current C# version does the inner loop of the program without SIMD instructions as follows:<\/p> \r\n\r\n<pre class=\"csharp\">\r\nint i = 49;   \r\ndo {\r\n     zi = zr * zi + zr * zi + ci; \r\n     zr = tr - ti + cr;   \r\n     tr = zr * zr; \r\n     ti = zi * zi;  \r\n} while <sup id=\"footnote_plugin_tooltip_1\" class=\"footnote_plugin_tooltip_text\" onclick=\"footnote_moveToAnchor('footnote_plugin_reference_1');\">1)<\/sup><span class=\"footnote_tooltip\" id=\"footnote_plugin_tooltip_text_1\">tr + ti <= 4.0) &#038;&#038; (--i > 0<\/span><script type=\"text\/javascript\">\tjQuery(\"#footnote_plugin_tooltip_1\").tooltip({\t\ttip: \"#footnote_plugin_tooltip_text_1\",\t\ttipClass: \"footnote_tooltip\",\t\teffect: \"fade\",\t\tfadeOutSpeed: 100,\t\tpredelay: 400,\t\tposition: \"top right\",\t\trelative: true,\t\toffset: [10, 10]\t});<\/script>; \r\n<\/pre>\r\n\r\n<br>\r\n\r\n<p>This loop is iterating until convergence (&lt;4) or until the max iterations (i&lt;0) have expired. See the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Mandelbrot_set#For_programmers\">wikipedia page <\/a>for a full explanation.&nbsp; Of interest here is that the &#8220;t&#8221; variables exist for pipelining purposes, so this is a reasonably optimized code snippet.<\/p>\r\n<p><strong>Trying to Implement the Optimal Solution Using Mono.Simd<\/strong><\/p>\r\n<p>Actually figuring out how to do the complex arthimetic with SSE2 is rather challenging.&nbsp; Fortunately Intel <a href=\"http:\/\/tempus.metal.agh.edu.pl\/~agumula\/zajecia2\/po\/po.pliki\/SSE3.Intel.Instruction.Set66715.66715.pdf\">published a document<\/a> giving the best solution as hand coded assembly, which involves using some special SSE3 instructions I was not aware of.&nbsp; Notably, the Intel optimized code is far faster than even their C code, but is only about 25% better than their optimized assembly code without SSE3.<\/p>\r\n<p>The code below shows my attempt to implement the Intel approach in Mono.&nbsp; It should be pretty fast, however it may suffer from one fatal flaw.&nbsp; The comparison at the end to check for the final conditions currently requires unpacking both values in the Vector2D (when either Length.X or Length.Y have been &lt; 4.0 at least once, then the loop stops).&nbsp; The comparison for both X and Y can be done in one operation in SIMD using the built in less than statement.&nbsp; However, I do not know how to turn that into a control loop function, as this requires a movmskps assembly function which mono does not seem to expose.<\/p>\r\n\r\n\r\n<pre class=\"csharp\">\r\nint i = 49;   \r\nint b = 0;\r\ndo {\r\n    Vector2d Zri=Zr*Zi; \/\/calculate r*i for both       \r\n    Zi=Zri+Zri+Ci;  \/\/double that and add a constant\r\n    Zr=Temp; \/\/pre-calculated on previous loop\r\n    var V0=Zr.InterleaveLow(Zi); \/\/r0,i0\r\n    var V1=Zr.InterleaveHigh(Zi); \/\/r1,i1\r\n    V0=V0*V0; \/\/r0^2,i0^2      \r\n    V1=V1*V1;      \r\n    var Length=V0.HorizontalAdd(V1); \/\/(r0^2+i0^2),(r1^2+i1^2)\r\n    Temp=V0.HorizontalSub(V1)+Cr; \/\/(r0^2-i0^2),(r1^2-i1^2)\r\n    \/\/now to determine end condition,                       \r\n    if (Length.X > 4.0) \r\n    {b|=2; if (b==3) break;}\r\n    if (Length.Y>;4.0) \r\n    {b|=1; if (b==3) break;}}\r\n while  (--i > 0); \r\n <\/pre>\r\n\r\n<p><\/p>\r\n<p><strong>Results &#8211; It&#8217;s Faster on Ubuntu<\/strong><\/p>\r\n<p class=\"auto-style1\"><a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2.gif\"><img loading=\"lazy\" title=\"img2\" style=\"border-top: 0px; border-right: 0px; background-image: none; border-bottom: 0px; padding-top: 0px; padding-left: 0px; border-left: 0px; display: inline; padding-right: 0px\" border=\"0\" alt=\"img2\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=379%2C245\" width=\"379\" height=\"245\"  data-recalc-dims=\"1\"><\/a><\/p>\r\n\r\n<p class=\"auto-style2\">Shown above are the times for three different algorithms. The first is the original best submission on the website.&nbsp; My SIMD submission is shown on the bottom.&nbsp; Because the SIMD version does two cycles within each inner loop, as the Java submission does, I tested the Java submission converted to C# as well.&nbsp; Compared to the current submission this version shows a 81% improvement, but clearly much of that is simply from doing 2 cycles in one loop.&nbsp; However, even once this change is made, the SIMD instructions still give a performance boost.<\/p>\r\n<p class=\"auto-style2\"><strong>The Assembly Generated is Still Not Optimal<\/strong><\/p>\r\n<p class=\"auto-style2\">Although the assembly generated did include lots of SSE commands, in general inspecting the assembly I noticed several things.<\/p>\r\n\r\n<ol>\r\n<li>\r\n<p class=\"auto-style2\">Never unpack the double array values!&nbsp; My first attempt tried to do the SSE2 steps with those instructions, and then unpack the single values as needed.&nbsp; However, this failed prettty miserably, as accessing the double seemed to involve a lot of stack to register movement.<\/p>\r\n<li>\r\n<p class=\"auto-style2\">All the XMM registers are not used.&nbsp; The optimal version uses all of them, the C# version uses only 0-3, and soften moves things to and from the stack. Not using registers optimally seems to be a common problem though with C#.<\/p><\/li><\/ol>\r\n<p class=\"auto-style2\"><strong>Conclusions<\/strong><\/p>\r\n<p class=\"auto-style2\">The SIMD version was indeed a fair bit faster, which is nice!&nbsp; However, in this case it was not a game-changer.&nbsp; Most importantly though, it was really easy to use, and I think I might incorporate the byte array operations at some point in future code.&nbsp; This also gave me an appreciation for assembly, which in contrast to what I had heard is easy to read and seems easy to optimize.&nbsp; I just submitted the code to the shoot-out, assuming it works there it should be up soon, and I would love for someone to show me how to fix the end statement.<\/p><div class=\"footnote_container_prepare\">\t<p><span onclick=\"footnote_expand_reference_container();\">References<\/span><span style=\"display: none;\">&nbsp;&nbsp;&nbsp;[ <a id=\"footnote_reference_container_collapse_button\" style=\"cursor:pointer;\" onclick=\"footnote_expand_collapse_reference_container();\">+<\/a> ]<\/span><\/p><\/div><div id=\"footnote_references_container\" style=\"\">\t<table class=\"footnote-reference-container\">\t\t<tbody>\t\t<tr>\t<td class=\"footnote_plugin_index\"><span id=\"footnote_plugin_reference_1\">1.<\/span><\/td>\t<td class=\"footnote_plugin_link\"><span onclick=\"footnote_moveToAnchor('footnote_plugin_tooltip_1');\">&#8593;<\/span><\/td>\t<td class=\"footnote_plugin_text\">tr + ti <= 4.0) &#038;&#038; (--i > 0<\/td><\/tr>\t\t<\/tbody>\t<\/table><\/div><script type=\"text\/javascript\">\tfunction footnote_expand_reference_container() {\t\tjQuery(\"#footnote_references_container\").show();        jQuery(\"#footnote_reference_container_collapse_button\").text(\"-\");\t}    function footnote_collapse_reference_container() {        jQuery(\"#footnote_references_container\").hide();        jQuery(\"#footnote_reference_container_collapse_button\").text(\"+\");    }\tfunction footnote_expand_collapse_reference_container() {\t\tif (jQuery(\"#footnote_references_container\").is(\":hidden\")) {            footnote_expand_reference_container();\t\t} else {            footnote_collapse_reference_container();\t\t}\t}    function footnote_moveToAnchor(p_str_TargetID) {        footnote_expand_reference_container();        var l_obj_Target = jQuery(\"#\" + p_str_TargetID);        if(l_obj_Target.length) {            jQuery('html, body').animate({                scrollTop: l_obj_Target.offset().top - window.innerHeight\/2            }, 1000);        }    }<\/script>","protected":false},"excerpt":{"rendered":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the C compiler.&nbsp; It has been [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[2,8,3],"tags":[20,15,16],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":299,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=299","url_meta":{"origin":112,"position":0},"title":"C# vs. Java, Xamarin vs. Oracle, Performance Comparison version 2.0","date":"June 14, 2014","format":false,"excerpt":"Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the language shootout webpage. However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions\u2026","rel":"","context":"Similar post","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":112,"position":1},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":35,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=35","url_meta":{"origin":112,"position":2},"title":"Comparing data structure enumeration speeds in C#","date":"March 1, 2013","format":false,"excerpt":"Determining which data structure to use for storing data involves trade-offs between how much memory they require and how long different operations, such as insertions, deletions, or searches take.\u00a0 In C#, using a linear array is the fastest way to enumerate all of the objects in a collection.\u00a0 However, the\u2026","rel":"","context":"Similar post","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":188,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=188","url_meta":{"origin":112,"position":3},"title":"The .NET Bio BAM Parser is Smoking Fast","date":"October 12, 2013","format":false,"excerpt":"The .NET Bio library has an improved version of it's BAM file\u00a0parser, which makes it significantly faster and easily competitive with the\u00a0current standard C coded SAMTools for obtaining\u00a0sequencing data and working with it. The chart below compares the time it\u00a0takes in seconds for the old version of the parser and\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":71,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=71","url_meta":{"origin":112,"position":4},"title":"Java vs. C# Performance Comparison for Parsing VCF Files","date":"May 26, 2013","format":false,"excerpt":"Making a comparison with a reasonably complex program ported between the two languages. Update 3\/10\/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.\u00a0\u00a0After this, the Java\/C# versions are indistinguishable on speed, but\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/05\/image_thumb1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":53,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=53","url_meta":{"origin":112,"position":5},"title":"How to remove the &ldquo;Trial Edition&rdquo; banner from the VisiFire open source chart kit","date":"April 13, 2013","format":false,"excerpt":"Visifire is a very good graphing component for making silverlight or WPF applications.\u00a0 The component was first released as an open source library on GoogleCode, but since then has been made a closed source proprietary and for profit project.\u00a0 The newer version contains several enhancements, but the open source version\u2026","rel":"","context":"In &quot;C#&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/04\/image_thumb.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/112"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=112"}],"version-history":[{"count":35,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/112\/revisions"}],"predecessor-version":[{"id":342,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/112\/revisions\/342"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=112"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=112"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=112"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}