Monthly Archives: June 2014

C# vs. Java, Xamarin vs. Oracle, Performance Comparison version 2.0

Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the language shootout webpage. However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions (though I was pleased that a quick port of my code to F# absolutely annihilates the OCAML version of this benchmark).

I had benchmarked my code as faster than the Java code on my Mac, the CentOS cluster at my work and two Ubuntu virtual machines (Mono >3.0, LLVM compiler). What gave?

Undeterred, I thought I would try to use SIMD to improve another C# benchmark, and took a shot at the N-body benchmark test. Once again, the version I wrote, in my hands, was much faster. But when submitted to the shoot-out and accepted, it was either much slower, or didn’t even run. My submission using the SSE instructions not only didn’t beat Java, it was actually slower than the original C# version!

Below are the timings on my top-of-the-line Mac for the two benchmarks, in both cases we see that the C# program runs in 80-90% of the time the Java Program takes. There are several key take aways here.

C#JavaTimings

C# and Java have no meaningful performance differences.

C# may use a lot less memory, but this is my optimized C# code and it is only beating the optimized Java by <= 20%. When does that matter?

C# and Java’s similar performance has different underpinnings

There are a lot of ways code can be fast, and I was surprised that Java and C# achieve similar speeds in very different ways.

The key advantage of the C# code was the SIMD instructions, which in theory give a ~2X speed up. However, they only win by ~20%. Why?

I think the assembly gives the answer. Just some quick snippets tell the whole story. Here is some C# code for the N-Body problem compiled to assembly:

The important insights into performance from the assembly here are:

  1. Similar instructions are stacked on top of each other, allowing for pipelining (e.g. the same vhaddpd instruction follows the same vhaddpd, and both use different registers, so can execute simultaneously).
  2. The “p” in the instructions (i.e. vhadd-“p”-d). This stands for “packed” meaning we are packing/doing two operations at once via SIMD.
  3. Only registers XMM1-XMM4 appear in the instructions. There are more registers available, and more pipelining possible, but the Mono/LLVM compiler appears to only use the low-number/scratch registers. It is easier to write compilers that obey this restriction.

Now let’s compare that to some Java assembly emitted by the Oracle runtime for the same benchmark:

The important insights here are:

  1. Java uses the version of the instructions with “s” (i.e. vmul-“s”-d) meaning single, and gets no SSE advantage. No doubt this makes writing compilers easier.
  2. However, Java is using lots of registers, XMM15 shows up!
  3. As a result of using all the registers used, the Java code has great pipelining. Note that up to 5 vmulsd instructions show up at once. The JVM is simply brilliant at packing operations together, and this means that even though I, the programmer, was smart and used SSE2, my C# code only won by 20%, and not 200%. It’s hard to beat Java pipelining.

All of which makes me wonder. What if we took the high-level language advantages of C# (low-overhead value types, easy interop via pointers, baked in generics, etc.). And combined them with the low-level advantages of the JVM (better array bounds check elimination, pipelining compiler optimizations, and maybe someday the occasional stack allocation of reference type variables…)