June | 2014 | nigel delaney

Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the language shootout webpage. However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions (though I was pleased that a quick port of my code to F# absolutely annihilates the OCAML version of this benchmark).

I had benchmarked my code as faster than the Java code on my Mac, the CentOS cluster at my work and two Ubuntu virtual machines (Mono >3.0, LLVM compiler). What gave?

Undeterred, I thought I would try to use SIMD to improve another C# benchmark, and took a shot at the N-body benchmark test. Once again, the version I wrote, in my hands, was much faster. But when submitted to the shoot-out and accepted, it was either much slower, or didn’t even run. My submission using the SSE instructions not only didn’t beat Java, it was actually slower than the original C# version!

Below are the timings on my top-of-the-line Mac for the two benchmarks, in both cases we see that the C# program runs in 80-90% of the time the Java Program takes. There are several key take aways here.

C# and Java have no meaningful performance differences.

C# may use a lot less memory, but this is my optimized C# code and it is only beating the optimized Java by <= 20%. When does that matter?

C# and Java’s similar performance has different underpinnings

There are a lot of ways code can be fast, and I was surprised that Java and C# achieve similar speeds in very different ways.

The key advantage of the C# code was the SIMD instructions, which in theory give a ~2X speed up. However, they only win by ~20%. Why?

I think the assembly gives the answer. Just some quick snippets tell the whole story. Here is some C# code for the N-Body problem compiled to assembly:

00000034	vsubpd	0x18(%esi), %xmm2, %xmm2
00000039	vmulpd	%xmm1, %xmm1, %xmm3
0000003d	vmulpd	%xmm2, %xmm2, %xmm4
00000041	vhaddpd	%xmm3, %xmm3, %xmm3
00000045	vhaddpd	%xmm4, %xmm4, %xmm4
00000049	vaddpd	%xmm4, %xmm3, %xmm3
0000004d	vsqrtpd	%xmm3, %xmm4
00000051	vmulpd	%xmm4, %xmm3, %xmm3

00000034 vsubpd 0x18(%esi), %xmm2, %xmm2

00000039 vmulpd %xmm1, %xmm1, %xmm3

0000003d vmulpd %xmm2, %xmm2, %xmm4

00000041 vhaddpd %xmm3, %xmm3, %xmm3

00000045 vhaddpd %xmm4, %xmm4, %xmm4

00000049 vaddpd %xmm4, %xmm3, %xmm3

0000004d vsqrtpd %xmm3, %xmm4

00000051 vmulpd %xmm4, %xmm3, %xmm3

The important insights into performance from the assembly here are:

Similar instructions are stacked on top of each other, allowing for pipelining (e.g. the same vhaddpd instruction follows the same vhaddpd, and both use different registers, so can execute simultaneously).
The “p” in the instructions (i.e. vhadd-“p”-d). This stands for “packed” meaning we are packing/doing two operations at once via SIMD.
Only registers XMM1-XMM4 appear in the instructions. There are more registers available, and more pipelining possible, but the Mono/LLVM compiler appears to only use the low-number/scratch registers. It is easier to write compilers that obey this restriction.

Now let’s compare that to some Java assembly emitted by the Oracle runtime for the same benchmark:

0x0000000109ce2315: vmulsd %xmm11,%xmm1,%xmm2
0x0000000109ce231a: vmulsd %xmm11,%xmm9,%xmm3
0x0000000109ce231f: vmulsd %xmm9,%xmm9,%xmm0
0x0000000109ce2324: vmulsd %xmm1,%xmm1,%xmm5
0x0000000109ce2328: vmovsd 0x38(%r12,%rdx,8),%xmm4
0x0000000109ce232f: vaddsd %xmm0,%xmm5,%xmm5
0x0000000109ce2333: vmovsd 0x30(%r12,%rdx,8),%xmm6
0x0000000109ce233a: vmovsd 0x28(%r12,%rdx,8),%xmm7
0x0000000109ce2341: vmovsd 0x20(%r12,%rbp,8),%xmm0  
0x0000000109ce2348: vmovsd 0x40(%r12,%rbp,8),%xmm8 
0x0000000109ce234f: vsubsd %xmm0,%xmm12,%xmm0 
0x0000000109ce2353: vmulsd %xmm8,%xmm9,%xmm9
0x0000000109ce2358: vmulsd %xmm11,%xmm0,%xmm14
0x0000000109ce235d: vmulsd %xmm8,%xmm0,%xmm15
0x0000000109ce2362: vmulsd %xmm0,%xmm0,%xmm0
0x0000000109ce2366: vmulsd %xmm8,%xmm1,%xmm1

0x0000000109ce2315: vmulsd %xmm11,%xmm1,%xmm2

0x0000000109ce231a: vmulsd %xmm11,%xmm9,%xmm3

0x0000000109ce231f: vmulsd %xmm9,%xmm9,%xmm0

0x0000000109ce2324: vmulsd %xmm1,%xmm1,%xmm5

0x0000000109ce2328: vmovsd 0x38(%r12,%rdx,8),%xmm4

0x0000000109ce232f: vaddsd %xmm0,%xmm5,%xmm5

0x0000000109ce2333: vmovsd 0x30(%r12,%rdx,8),%xmm6

0x0000000109ce233a: vmovsd 0x28(%r12,%rdx,8),%xmm7

0x0000000109ce2341: vmovsd 0x20(%r12,%rbp,8),%xmm0

0x0000000109ce2348: vmovsd 0x40(%r12,%rbp,8),%xmm8

0x0000000109ce234f: vsubsd %xmm0,%xmm12,%xmm0

0x0000000109ce2353: vmulsd %xmm8,%xmm9,%xmm9

0x0000000109ce2358: vmulsd %xmm11,%xmm0,%xmm14

0x0000000109ce235d: vmulsd %xmm8,%xmm0,%xmm15

0x0000000109ce2362: vmulsd %xmm0,%xmm0,%xmm0

0x0000000109ce2366: vmulsd %xmm8,%xmm1,%xmm1

The important insights here are:

Java uses the version of the instructions with “s” (i.e. vmul-“s”-d) meaning single, and gets no SSE advantage. No doubt this makes writing compilers easier.
However, Java is using lots of registers, XMM15 shows up!
As a result of using all the registers used, the Java code has great pipelining. Note that up to 5 vmulsd instructions show up at once. The JVM is simply brilliant at packing operations together, and this means that even though I, the programmer, was smart and used SSE2, my C# code only won by 20%, and not 200%. It’s hard to beat Java pipelining.

All of which makes me wonder. What if we took the high-level language advantages of C# (low-overhead value types, easy interop via pointers, baked in generics, etc.). And combined them with the low-level advantages of the JVM (better array bounds check elimination, pipelining compiler optimizations, and maybe someday the occasional stack allocation of reference type variables…)

nigel delaney

evolutionary biologist, statistician, nice guy

Monthly Archives: June 2014

C# vs. Java, Xamarin vs. Oracle, Performance Comparison version 2.0

C# and Java have no meaningful performance differences.

C# and Java’s similar performance has different underpinnings