{"id":376,"date":"2016-09-19T07:08:51","date_gmt":"2016-09-19T07:08:51","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=376"},"modified":"2016-09-19T17:00:14","modified_gmt":"2016-09-19T17:00:14","slug":"why-r-math-functions-on-windows-are-slow-and-how-to-fix-it","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=376","title":{"rendered":"Why R Math Functions on Windows are Slow, and How to Fix It"},"content":{"rendered":"R on Windows has much slower versions of the log, sine and cosine functions than are available on other platforms, and this can be a serious performance bottleneck for programs that call these math functions frequently.\u00a0 The reason is that the library R uses to obtain the log function on Windows (libmingwex.a) contains a log implementation that is out of date relative to more modern code and much slower than other available versions. \u00a0That the glibc implementations of common math functions\u00a0are slow\u00a0is a known issue that others have discussed on the internet. \u00a0This post uses the log function as an example to show specifically why it is slow, and then suggests some quick workarounds for these math functions.\r\n<h4>The log Function on MinGW \/ Windows<\/h4>\r\nComparing the assembly code of the log function as generated with MinGW on Windows with the assembly generated for the log function on Mac OSX shows why the code is slow. 
Below is the assembly code for the log function as disassembled from\u00a0MinGW (<a href=\"https:\/\/sourceforge.net\/p\/mingw\/mingw-org-wsl\/ci\/21762bb4a1bd0c88c38eead03f59e8d994349e83\/tree\/src\/libcrt\/math\/logl.S\">their implementation<\/a>\u00a0is basically a cut\/paste from GNU libc).\u00a0The crucial feature here is that most instructions start with an &#8220;f&#8221;, indicating that they are x87 instructions using the legacy floating point register stack, and that the logarithm itself comes from a single instruction, <code>fyl2xp1<\/code>\u00a0(or its sibling <code>fyl2x<\/code> on the other branch), which is one of the most\u00a0<a href=\"http:\/\/www.agner.org\/optimize\/instruction_tables.pdf\">expensive<\/a>\u00a0operations out there. This instruction computes a base-2 logarithm in hardware, and is known to be slower than most software approaches to calculating log on modern machines.\r\n<pre title=\"MinGW Log Function\" class=\"lang:asm decode:true\">0000000000000010 &lt;__logl_internal&gt;:\r\n  10:\td9 ed                \tfldln2 \r\n  12:\tdb 2a                \tfldt   (%rdx)\r\n  14:\td9 c0                \tfld    %st(0)\r\n  16:\tdc 25 e4 ff ff ff    \tfsubl  -0x1c(%rip)        # 0 &lt;one&gt;\r\n  1c:\td9 c0                \tfld    %st(0)\r\n  1e:\td9 e1                \tfabs   \r\n  20:\tdc 1d e2 ff ff ff    \tfcompl -0x1e(%rip)        # 8 &lt;limit&gt;\r\n  26:\tdf e0                \tfnstsw %ax\r\n  28:\t80 e4 45             \tand    $0x45,%ah\r\n  2b:\t74 12                \tje     3f &lt;__logl_internal+0x2f&gt;\r\n  2d:\tdd d9                \tfstp   %st(1)\r\n  2f:\td9 f9                \tfyl2xp1 \r\n  31:\t48 89 c8             \tmov    %rcx,%rax\r\n  34:\t48 c7 41 08 00 00 00 \tmovq   $0x0,0x8(%rcx)\r\n  3b:\t00 \r\n  3c:\tdb 39                \tfstpt  (%rcx)\r\n  3e:\tc3                   \tretq   \r\n  3f:\tdd d8                \tfstp   %st(0)\r\n  41:\td9 f1                \tfyl2x  \r\n  43:\t48 89 c8             \tmov    %rcx,%rax\r\n  46:\t48 c7 41 08 00 00 00 \tmovq   $0x0,0x8(%rcx)\r\n  4d:\t00 \r\n  4e:\tdb 39                \tfstpt  (%rcx)<\/pre>\r\nThese 
floating point register instructions were state of the art way back when, but nowadays most hardware has XMM\/YMM registers and SSE\/AVX instructions that make for much faster code to calculate logarithms (see\u00a0<a href=\"http:\/\/www.agner.org\/optimize\/optimizing_assembly.pdf\">Section 17<\/a>\u00a0of Agner Fog&#8217;s assembly optimization manual for more information on this).\r\n<h4>The log Function on Mac OSX<\/h4>\r\nThe Mac OSX implementation of log, shown below, demonstrates faster assembly code. \u00a0The key feature here is that it uses XMM registers and has a more modern and performant\u00a0implementation. \u00a0Note that many other implementations (e.g. the Microsoft compiler) also use faster versions like this. That is, <strong>Windows is not slow, R is slow on Windows.<\/strong> It&#8217;s all the same Intel chips under the hood.\r\n<pre class=\"tab-size:2 lang:asm decode:true\">+0x00\tvmovq               %xmm0, %rax\r\n+0x05\tshrq                $32, %rax\r\n+0x09\tmovl                %eax, %edx\r\n+0x0b\tsubl                $1048576, %eax\r\n+0x10\tcmpl                $2145386496, %eax\r\n+0x15\tjae                 \"0x7fff9793e100+0xe4\"\r\n+0x1b\t    andl                $1048575, %eax\r\n+0x20\t    addl                $4096, %eax\r\n+0x25\t    andl                $2088960, %eax\r\n+0x2a\t    shrl                $9, %eax\r\n+0x2d\t    addl                $3222802432, %edx\r\n+0x33\t    sarl                $20, %edx\r\n+0x36\t    vcvtsi2sdl          %edx, %xmm1, %xmm1\r\n+0x3a\t    vmovsd              112942(%rip), %xmm6\r\n+0x42\t    vandpd              %xmm6, %xmm0, %xmm0\r\n+0x46\t    vmovsd              74890(%rip), %xmm7\r\n+0x4e\t    vorpd               %xmm7, %xmm0, %xmm0\r\n+0x52\t    leaq                112999(%rip), %rdx\r\n+0x59\t    vmovss              (%rdx,%rax), %xmm3\r\n+0x5e\t    vpsllq              $32, %xmm3, %xmm3\r\n+0x63\t    vfmsub213sd         %xmm7, %xmm0, %xmm3\r\n+0x68\t    vmovddup            %xmm3, %xmm3\r\n+0x6c\t    vaddpd              -48(%rdx), %xmm3, 
%xmm4\r\n+0x71\t    vfmadd213pd         -32(%rdx), %xmm3, %xmm4\r\n+0x77\t    vmovapd             -16(%rdx), %xmm5\r\n+0x7c\t    vaddsd              %xmm3, %xmm5, %xmm5\r\n+0x80\t    vmulpd              %xmm3, %xmm5, %xmm5\r\n+0x84\t    vmulpd              %xmm5, %xmm4, %xmm4\r\n+0x88\t    vmovhlps            %xmm4, %xmm5, %xmm5\r\n+0x8c\t    vmulsd              %xmm5, %xmm4, %xmm4\r\n+0x90\t    je                  \"0x7fff9793e100+0xc7\"\r\n+0x92\t    vmovss              4(%rdx,%rax), %xmm5\r\n+0x98\t    vpsllq              $32, %xmm5, %xmm5\r\n+0x9d\t    vmulsd              112867(%rip), %xmm1, %xmm0\r\n+0xa5\t    vmulsd              112851(%rip), %xmm1, %xmm1\r\n+0xad\t    vaddsd              8(%rdx,%rax), %xmm0, %xmm0\r\n+0xb3\t    vaddsd              %xmm5, %xmm1, %xmm1\r\n+0xb7\t    vaddsd              %xmm4, %xmm0, %xmm0\r\n+0xbb\t    vaddsd              %xmm3, %xmm0, %xmm0\r\n+0xbf\t    vaddsd              %xmm1, %xmm0, %xmm0\r\n+0xc3\t    vzeroupper\r\n+0xc6\t    retq\r\n+0xc7\t    vmovss              4(%rdx,%rax), %xmm5\r\n+0xcd\t    vpsllq              $32, %xmm5, %xmm5\r\n+0xd2\t    vaddsd              %xmm5, %xmm3, %xmm3\r\n+0xd6\t    vaddsd              8(%rdx,%rax), %xmm4, %xmm0\r\n+0xdc\t    vaddsd              %xmm3, %xmm0, %xmm0\r\n+0xe0\t    vzeroupper\r\n+0xe3\t    retq\r\n<\/pre>\r\n<h4>Solving the problem<\/h4>\r\nIdeally we could just push a faster log implementation to the R core repo for Windows, but the fact of the matter is that the R core team does not accept code changes that only improve performance (and fair play to them, R is the most cross-platform software out there, and that wasn&#8217;t easy to do). \u00a0Also, realistically, the current implementation is adequate for most people&#8217;s work. So the solution is to swap in a better log function only when it matters, and when it matters, most of the library is usually already written in compiled code.\r\n\r\nWhere to get a cross-platform compatible log function? 
It seems the folks over at Julia ran into a similar problem with a slow log on different machines. They looked at two versions in Apple&#8217;s Libm (<a href=\"http:\/\/opensource.apple.com\/\/source\/Libm\/Libm-2026\/Source\/Intel\/xmm_log.c\">Libm Version #1<\/a>\u00a0and\u00a0<a href=\"http:\/\/opensource.apple.com\/\/source\/Libm\/Libm-2026\/Source\/Intel\/log_universal.h\">Libm Version #2<\/a>), and also some really crazy implementations such as this one available from\u00a0<a href=\"https:\/\/bitbucket.org\/MDukhan\/yeppp\/src\/b8db687e912bbd7e2a26dd93c348ef2d7a5febdc\/library\/headers\/yepBuiltin.h?at=default&amp;fileviewer=file-view-default#yepBuiltin.h-1128\">Yeppp!<\/a>. However, what they <a href=\"https:\/\/github.com\/JuliaLang\/openlibm\/blob\/960fdbf8bd01cc2524c1a9e1a94110d1a4f1964a\/src\/e_log.c\">settled on<\/a>\u00a0is a very good function, and we can port it over directly into R. \u00a0An example is shown in the <a href=\"https:\/\/github.com\/mlysy\/msdeTest\/commit\/8b0f0f2ce3274e58c3f0b77e78749cad6c560bbb\">pull request here<\/a>.","protected":false},"excerpt":{"rendered":"R on Windows has much slower versions of the log, sine and cosine functions than are available on other platforms, and this can be a serious performance bottleneck for programs that call these math functions frequently.\u00a0 The reason is that the library R uses to obtain the log function on Windows (libmingwex.a) contains 
[&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[1],"tags":[21,22,24],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":6,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=6","url_meta":{"origin":376,"position":0},"title":"Not All Poisson Random Variables Are Created Equally","date":"January 30, 2013","format":false,"excerpt":"Spurred by a slow running program, I spent an afternoon researching what algorithms are available for generating Poisson random variables and figuring out which methods are used by R, Matlab, NumPy, the GNU Science Libraray and various other available packages. I learned some things that I think would be useful\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34-300x239.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":359,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=359","url_meta":{"origin":376,"position":1},"title":"Profiling Rcpp package code on Windows","date":"September 3, 2016","format":false,"excerpt":"Profiling Rcpp code on Unix\/Mac is easy, but is difficult on Windows because R uses a compilation toolchain (MinGW) that produces files that are not understood by common Windows profiling programs.\u00a0 Additionally, the R build process often removes\u00a0symbols which allow profilers to produce sensible interpretations of their data. 
The following\u2026","rel":"","context":"In \"Optimization\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2016\/09\/assembly.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":376,"position":2},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":153,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=153","url_meta":{"origin":376,"position":3},"title":"Using Selectome with .NET Bio, F# and R","date":"September 16, 2013","format":false,"excerpt":"The Bio.Selectome namespace has features to query\u00a0Selectome.Selectome is a database that merges data from Ensembl\u00a0and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN\/dS)\u00a0mutations along various branches of the phylogenetic tree. 
A low dN\/dS ratio\u00a0indicates that the protein sequence is under strong selective constraint, while\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":112,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","url_meta":{"origin":376,"position":4},"title":"Mono.Simd and the Mandlebrot Set.","date":"September 10, 2013","format":false,"excerpt":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":12,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=12","url_meta":{"origin":376,"position":5},"title":"Compile Bowtie2 on Windows 64 bit.","date":"January 30, 2013","format":false,"excerpt":"Bowtie 2 is a program that efficiently aligns next generation sequence data to a reference genome. However, the version distributed by the authors only compiles on POSIX platforms. 
These instructions will allow you to compile it on windows by downloading the Mingw64 tools and editing the make file before building\u2026","rel":"","context":"In &quot;Computing&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/Capture.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/376"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=376"}],"version-history":[{"count":10,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/376\/revisions"}],"predecessor-version":[{"id":387,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/376\/revisions\/387"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=376"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=376"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=376"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}