September | 2016 | nigel delaney

R on Windows has much slower versions of the log, sine and cosine functions than are available on other platforms, and this can be a serious performance bottleneck for programs that call these math functions frequently. The reason is that the library R uses to obtain the log function on Windows (libmingwex.a) contains an implementation that is out of date relative to more modern code and much slower than other available versions. That the glibc implementations of common math functions are slow is a known issue that others have discussed on the internet. This post uses the log function as an example to show specifically why it is slow, and then suggests some quick workarounds for these math functions.
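To check whether this matters on a given machine, a rough timing loop can help. The following is a minimal sketch in standalone C++ (not Rcpp); the names LogTiming and time_log_calls are made up for illustration, and the loop measures whatever log implementation the toolchain links in.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Result of one timing run (hypothetical helper type, not from any library).
struct LogTiming {
    double seconds;   // wall-clock time spent in the loop
    double checksum;  // sum of the logs, returned so the loop cannot be optimized away
};

// Calls std::log on 1..n and reports how long the loop took.
LogTiming time_log_calls(std::size_t n) {
    assert(n > 0);
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        sum += std::log(static_cast<double>(i));
    }
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), sum};
}
```

Calling time_log_calls with a large n (say, tens of millions) and comparing the seconds field between a MinGW build and another toolchain on the same machine gives a first impression of the gap. Timings from a loop this simple are only indicative.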

The log Function on MinGW / Windows

Comparing the assembly for the log function as built by MinGW on Windows with the assembly for the log function on Mac OSX shows why the code is slow. Below is the disassembly of the log function from MinGW (their implementation is basically a cut/paste from GNU libc). The crucial feature here is that most instructions start with an “f”, indicating that they use the x87 floating point registers, and that the actual work is done by fyl2x and fyl2xp1, which are among the most expensive instructions out there. These instructions compute a base-2 logarithm in hardware (scaled by ln 2, which fldln2 loads, to give the natural log), and are known to be slower than most other ways of calculating a log on modern machines.

0000000000000010 <__logl_internal>:
  10:	d9 ed                	fldln2 
  12:	db 2a                	fldt   (%rdx)
  14:	d9 c0                	fld    %st(0)
  16:	dc 25 e4 ff ff ff    	fsubl  -0x1c(%rip)        # 0 <one>
  1c:	d9 c0                	fld    %st(0)
  1e:	d9 e1                	fabs   
  20:	dc 1d e2 ff ff ff    	fcompl -0x1e(%rip)        # 8 <limit>
  26:	df e0                	fnstsw %ax
  28:	80 e4 45             	and    $0x45,%ah
  2b:	74 12                	je     3f <__logl_internal+0x2f>
  2d:	dd d9                	fstp   %st(1)
  2f:	d9 f9                	fyl2xp1 
  31:	48 89 c8             	mov    %rcx,%rax
  34:	48 c7 41 08 00 00 00 	movq   $0x0,0x8(%rcx)
  3b:	00 
  3c:	db 39                	fstpt  (%rcx)
  3e:	c3                   	retq   
  3f:	dd d8                	fstp   %st(0)
  41:	d9 f1                	fyl2x  
  43:	48 89 c8             	mov    %rcx,%rax
  46:	48 c7 41 08 00 00 00 	movq   $0x0,0x8(%rcx)
  4d:	00 
  4e:	db 39                	fstpt  (%rcx)

These x87 floating point instructions were state of the art way back when, but nowadays most hardware has dedicated XMM/YMM registers, and the SSE/AVX instructions that operate on them make for much faster code to calculate logarithms (see Section 17 for more information on this).
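The two branches in the MinGW listing above can be sketched in portable C++. This is an illustration of the control flow only, not the library's source: the function name x87_style_log is made up, std::log1p and std::log2 stand in for the fyl2xp1 and fyl2x instructions, and the threshold value is an assumption (the listing only shows that a constant named limit is compared against |x - 1|).

```cpp
#include <cassert>
#include <cmath>

// Mirrors the branch structure of the x87 listing for positive finite x.
double x87_style_log(double x) {
    assert(x > 0.0);
    const double ln2 = 0.6931471805599453;  // the constant fldln2 pushes
    const double limit = 0.29;               // ASSUMED value of the <limit> constant
    const double d = x - 1.0;                // fsubl <one>
    if (std::fabs(d) <= limit) {
        // fyl2xp1 path: ln2 * log2(1 + d), which stays accurate when x is near 1.
        // std::log1p(d) is the portable equivalent of that product.
        return std::log1p(d);
    }
    // fyl2x path: ln2 * log2(x).
    return ln2 * std::log2(x);
}
```

The point of the sketch is that the routine is tiny: all of the real work is delegated to a single slow hardware instruction in either branch.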

The log Function on Mac OSX

The Mac OSX implementation of log, shown below, demonstrates faster assembly code. The key feature here is that it uses XMM registers and has a more modern and performant implementation. Note that many other implementations (e.g. the one that ships with the Microsoft compiler) also use faster versions like this. In other words, Windows itself is not slow; R is slow on Windows. It’s all the same Intel chips under the hood.

+0x00	vmovq               %xmm0, %rax
+0x05	shrq                $32, %rax
+0x09	movl                %eax, %edx
+0x0b	subl                $1048576, %eax
+0x10	cmpl                $2145386496, %eax
+0x15	jae                 "0x7fff9793e100+0xe4"
+0x1b	    andl                $1048575, %eax
+0x20	    addl                $4096, %eax
+0x25	    andl                $2088960, %eax
+0x2a	    shrl                $9, %eax
+0x2d	    addl                $3222802432, %edx
+0x33	    sarl                $20, %edx
+0x36	    vcvtsi2sdl          %edx, %xmm1, %xmm1
+0x3a	    vmovsd              112942(%rip), %xmm6
+0x42	    vandpd              %xmm6, %xmm0, %xmm0
+0x46	    vmovsd              74890(%rip), %xmm7
+0x4e	    vorpd               %xmm7, %xmm0, %xmm0
+0x52	    leaq                112999(%rip), %rdx
+0x59	    vmovss              (%rdx,%rax), %xmm3
+0x5e	    vpsllq              $32, %xmm3, %xmm3
+0x63	    vfmsub213sd         %xmm7, %xmm0, %xmm3
+0x68	    vmovddup            %xmm3, %xmm3
+0x6c	    vaddpd              -48(%rdx), %xmm3, %xmm4
+0x71	    vfmadd213pd         -32(%rdx), %xmm3, %xmm4
+0x77	    vmovapd             -16(%rdx), %xmm5
+0x7c	    vaddsd              %xmm3, %xmm5, %xmm5
+0x80	    vmulpd              %xmm3, %xmm5, %xmm5
+0x84	    vmulpd              %xmm5, %xmm4, %xmm4
+0x88	    vmovhlps            %xmm4, %xmm5, %xmm5
+0x8c	    vmulsd              %xmm5, %xmm4, %xmm4
+0x90	    je                  "0x7fff9793e100+0xc7"
+0x92	    vmovss              4(%rdx,%rax), %xmm5
+0x98	    vpsllq              $32, %xmm5, %xmm5
+0x9d	    vmulsd              112867(%rip), %xmm1, %xmm0
+0xa5	    vmulsd              112851(%rip), %xmm1, %xmm1
+0xad	    vaddsd              8(%rdx,%rax), %xmm0, %xmm0
+0xb3	    vaddsd              %xmm5, %xmm1, %xmm1
+0xb7	    vaddsd              %xmm4, %xmm0, %xmm0
+0xbb	    vaddsd              %xmm3, %xmm0, %xmm0
+0xbf	    vaddsd              %xmm1, %xmm0, %xmm0
+0xc3	    vzeroupper
+0xc6	    retq
+0xc7	    vmovss              4(%rdx,%rax), %xmm5
+0xcd	    vpsllq              $32, %xmm5, %xmm5
+0xd2	    vaddsd              %xmm5, %xmm3, %xmm3
+0xd6	    vaddsd              8(%rdx,%rax), %xmm4, %xmm0
+0xdc	    vaddsd              %xmm3, %xmm0, %xmm0
+0xe0	    vzeroupper
+0xe3	    retq
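The general shape of the listing above (pull the exponent out of the bit pattern with integer instructions, then evaluate a polynomial on the remaining mantissa) can be sketched as follows. This is a simplified illustration under stated assumptions, not Apple's algorithm: the name sketch_log is made up, a plain series for 2*atanh replaces the lookup table and fused multiply-adds the real code uses, and only positive, finite, normal doubles are handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// log(x) = e * ln2 + log(m), where x = m * 2^e. Sketch only:
// no handling of zero, negatives, subnormals, infinities or NaN.
double sketch_log(double x) {
    assert(x > 0.0 && std::isnormal(x));
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // Unbiased exponent from bits 52..62.
    int e = static_cast<int>((bits >> 52) & 0x7FF) - 1023;

    // Keep the mantissa bits and force the exponent to 0, giving m in [1, 2).
    bits = (bits & 0x000FFFFFFFFFFFFFULL) | 0x3FF0000000000000ULL;
    double m;
    std::memcpy(&m, &bits, sizeof m);

    // Re-center m around 1 so the series below converges quickly.
    if (m > 1.4142135623730951) {
        m *= 0.5;
        e += 1;
    }

    // log(m) = 2 * atanh(s) with s = (m - 1) / (m + 1), |s| < 0.172.
    const double s = (m - 1.0) / (m + 1.0);
    const double s2 = s * s;
    // Horner evaluation of 1 + s2/3 + s2^2/5 + ... + s2^6/13.
    double p = 1.0 / 13.0;
    p = p * s2 + 1.0 / 11.0;
    p = p * s2 + 1.0 / 9.0;
    p = p * s2 + 1.0 / 7.0;
    p = p * s2 + 1.0 / 5.0;
    p = p * s2 + 1.0 / 3.0;
    p = p * s2 + 1.0;

    const double ln2 = 0.6931471805599453;
    return e * ln2 + 2.0 * s * p;
}
```

Everything here is ordinary integer and scalar floating point arithmetic, which is why this style of implementation maps well onto XMM registers. A production version would use a table and a carefully fitted polynomial to reach full double precision; this truncated series is only good to roughly 1e-12.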

Solving the problem

Ideally we could just push a faster log implementation to the R core repo for Windows, but the fact of the matter is that the R core team does not accept code changes that only improve performance (and fair play to them, R is the most cross-platform software out there, and that wasn’t easy to do). Also, realistically, the current implementation is adequate for most people’s work. So the solution is to switch out the log function for a better one only when it matters, and if it matters, that means most of the library is already written in compiled code. Where to get a cross-platform compatible log function? It seems the folks over at Julia ran into a similar problem with slow log on different machines. They looked at two versions in libm (e.g. Libm Version #1 and Libm Version #2), and also some really crazy implementations such as this one available from Yepp. However, what they settled on is a very good function, and we can port it over directly into R. An example is shown in the pull request here.
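One way to keep such a swap local to the hot code is to pass the log function in as a parameter, so the replacement is used only where it matters. The following is a minimal sketch in standalone C++; sum_of_logs and replacement_log are hypothetical names, and replacement_log is only a placeholder body, not the Julia-derived function mentioned above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Placeholder for a ported fast log. The body here is a stand-in so the
// sketch compiles; a real port would go in its place.
inline double replacement_log(double x) {
    return 0.6931471805599453 * std::log2(x);
}

// Hot loop written against any callable log, so the implementation can be
// chosen at the call site without touching the loop itself.
template <typename LogFn>
double sum_of_logs(const std::vector<double>& xs, LogFn log_fn) {
    double total = 0.0;
    for (double x : xs) {
        total += log_fn(x);
    }
    return total;
}
```

Because the callable is a template parameter, the compiler can inline it, so the indirection should not add overhead in the loop.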

Profiling Rcpp code on Unix/Mac is easy, but it is difficult on Windows because R uses a compilation toolchain (MinGW) that produces files that common Windows profiling programs do not understand. Additionally, the R build process often strips the symbols that profilers need to produce sensible interpretations of their data. The following steps allow one to profile Rcpp code on Windows.

Change compilation settings to add in symbol settings

A default R installation typically has certain compiler settings in a file such as C:\Program Files\R\R-3.3.1\etc\x64\Makeconf (the exact path depends on your installation) that strip information needed for profiling during the Rcpp compilation process, in particular a line which reads: DLLFLAGS=-s . To override this and add some additional flags, create a folder and file in your home directory holding the necessary compilation flags. In a file at a location equivalent to C:\Users\YOURNAME\.R\Makevars on your machine (note the ‘.’ before R), add the following lines:

CXXFLAGS+=-gdwarf-2
DLLFLAGS=

You can verify this worked correctly by checking that -gdwarf-2 appears in the compilation messages, and that -s is missing in the final linker step.

Run a profiler which understands MinGW compiled code

The next key step is to run a profiler which can understand the Unix-like symbols on Windows. Two free and good options are Very Sleepy and AMD’s CodeAnalyst (which also works on Intel chips). Very Sleepy is very good at basic timings and providing stack traces, while AMD’s profiler is able to drill down to the assembly of a process. Both profilers are good, but an example with AMD is shown below.

Open the program and set up a quick session to start and run a sample R script that uses your code, such as in the example shown below.
Next, run the profiler and get ready to look at results. For example, here I can see that half the time was spent in my code, versus half in the R core’s code (generating random numbers). Digging further down, I can see at the assembly level what the biggest bottlenecks were in my code.

It’s often helpful to look at the original source files in addition to the assembly, and this can be enabled by setting directory information under Tools -> CodeAnalyst Options -> Directories.

nigel delaney

evolutionary biologist, statistician, nice guy

Monthly Archives: September 2016

Why R Math Functions on Windows are Slow, and How to Fix It


Profiling Rcpp package code on Windows
