The log Function on MinGW / Windows
Comparing the assembly for the log function generated by MinGW on Windows with the assembly for the log function on Mac OSX shows why the code is slow. Below is the assembly for the log function as disassembled from MinGW (their implementation is basically a cut/paste from GNU libc). The crucial feature is that most instructions start with an “f”, indicating that they use the x87 floating point registers, and that one instruction, fyl2xp1 (or its sibling fyl2x on the other branch), is among the most expensive operations out there. This instruction computes a base-2 logarithm in hardware, and is known to be slower than most other ways of calculating log on modern machines.
0000000000000010 <__logl_internal>:
  10: d9 ed                   fldln2
  12: db 2a                   fldt   (%rdx)
  14: d9 c0                   fld    %st(0)
  16: dc 25 e4 ff ff ff       fsubl  -0x1c(%rip)        # 0 <one>
  1c: d9 c0                   fld    %st(0)
  1e: d9 e1                   fabs
  20: dc 1d e2 ff ff ff       fcompl -0x1e(%rip)        # 8 <limit>
  26: df e0                   fnstsw %ax
  28: 80 e4 45                and    $0x45,%ah
  2b: 74 12                   je     3f <__logl_internal+0x2f>
  2d: dd d9                   fstp   %st(1)
  2f: d9 f9                   fyl2xp1
  31: 48 89 c8                mov    %rcx,%rax
  34: 48 c7 41 08 00 00 00    movq   $0x0,0x8(%rcx)
  3b: 00
  3c: db 39                   fstpt  (%rcx)
  3e: c3                      retq
  3f: dd d8                   fstp   %st(0)
  41: d9 f1                   fyl2x
  43: 48 89 c8                mov    %rcx,%rax
  46: 48 c7 41 08 00 00 00    movq   $0x0,0x8(%rcx)
  4d: 00
  4e: db 39                   fstpt  (%rcx)
The log Function on Mac OSX
The Mac OSX implementation of log, shown below, demonstrates faster assembly code. The key feature is that it uses the XMM registers and a more modern, better-performing implementation. Note that many other implementations (e.g. the Microsoft compiler) also use faster versions like this. That is, Windows is not slow; R is slow on Windows. It’s all the same Intel chips under the hood.
+0x00 vmovq %xmm0, %rax
+0x05 shrq $32, %rax
+0x09 movl %eax, %edx
+0x0b subl $1048576, %eax
+0x10 cmpl $2145386496, %eax
+0x15 jae "0x7fff9793e100+0xe4"
+0x1b andl $1048575, %eax
+0x20 addl $4096, %eax
+0x25 andl $2088960, %eax
+0x2a shrl $9, %eax
+0x2d addl $3222802432, %edx
+0x33 sarl $20, %edx
+0x36 vcvtsi2sdl %edx, %xmm1, %xmm1
+0x3a vmovsd 112942(%rip), %xmm6
+0x42 vandpd %xmm6, %xmm0, %xmm0
+0x46 vmovsd 74890(%rip), %xmm7
+0x4e vorpd %xmm7, %xmm0, %xmm0
+0x52 leaq 112999(%rip), %rdx
+0x59 vmovss (%rdx,%rax), %xmm3
+0x5e vpsllq $32, %xmm3, %xmm3
+0x63 vfmsub213sd %xmm7, %xmm0, %xmm3
+0x68 vmovddup %xmm3, %xmm3
+0x6c vaddpd -48(%rdx), %xmm3, %xmm4
+0x71 vfmadd213pd -32(%rdx), %xmm3, %xmm4
+0x77 vmovapd -16(%rdx), %xmm5
+0x7c vaddsd %xmm3, %xmm5, %xmm5
+0x80 vmulpd %xmm3, %xmm5, %xmm5
+0x84 vmulpd %xmm5, %xmm4, %xmm4
+0x88 vmovhlps %xmm4, %xmm5, %xmm5
+0x8c vmulsd %xmm5, %xmm4, %xmm4
+0x90 je "0x7fff9793e100+0xc7"
+0x92 vmovss 4(%rdx,%rax), %xmm5
+0x98 vpsllq $32, %xmm5, %xmm5
+0x9d vmulsd 112867(%rip), %xmm1, %xmm0
+0xa5 vmulsd 112851(%rip), %xmm1, %xmm1
+0xad vaddsd 8(%rdx,%rax), %xmm0, %xmm0
+0xb3 vaddsd %xmm5, %xmm1, %xmm1
+0xb7 vaddsd %xmm4, %xmm0, %xmm0
+0xbb vaddsd %xmm3, %xmm0, %xmm0
+0xbf vaddsd %xmm1, %xmm0, %xmm0
+0xc3 vzeroupper
+0xc6 retq
+0xc7 vmovss 4(%rdx,%rax), %xmm5
+0xcd vpsllq $32, %xmm5, %xmm5
+0xd2 vaddsd %xmm5, %xmm3, %xmm3
+0xd6 vaddsd 8(%rdx,%rax), %xmm4, %xmm0
+0xdc vaddsd %xmm3, %xmm0, %xmm0
+0xe0 vzeroupper
+0xe3 retq
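The listing above appears to follow the usual software recipe: pull the exponent out of the bit pattern with integer shifts, convert it to a double, then handle the mantissa with a table lookup and a short polynomial built from ordinary multiplies and adds. The C sketch below illustrates that general idea only. It is not Apple's algorithm: the name log_sketch is hypothetical, the table lookup is left out, and a truncated atanh series stands in for the tuned polynomial.

```c
#include <math.h>

static const double LN2 = 0.69314718055994530942;

/* Hypothetical sketch of a software log: write x = m * 2^e, so that
   ln(x) = e * ln(2) + ln(m), and approximate ln(m) with a polynomial. */
double log_sketch(double x)
{
    /* Special cases that the real routine handles on a separate path. */
    if (isnan(x) || x < 0.0) return NAN;
    if (x == 0.0) return -INFINITY;
    if (isinf(x)) return x;

    int e;
    double m = frexp(x, &e);            /* m in [0.5, 1), x = m * 2^e */
    if (m < 0.70710678118654752) {      /* recentre m to [sqrt(1/2), sqrt(2)) */
        m *= 2.0;
        e -= 1;
    }

    /* ln(m) = 2 * atanh(s) with s = (m - 1) / (m + 1), so |s| < 0.172. */
    double s = (m - 1.0) / (m + 1.0);
    double s2 = s * s;

    /* Horner evaluation of 1 + s^2/3 + s^4/5 + ... + s^20/21. */
    double p = 0.0;
    for (int k = 10; k >= 0; --k)
        p = p * s2 + 1.0 / (2 * k + 1);

    return e * LN2 + 2.0 * s * p;
}
```

Everything here compiles to plain SSE/AVX arithmetic with no x87 transcendental instruction. A production implementation would typically use a lookup table to shrink the polynomial, as the loads from (%rdx,%rax) in the listing suggest.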