Monthly Archives: May 2013

Java vs. C# Performance Comparison for Parsing VCF Files

Making a comparison with a reasonably complex program ported between the two languages.

Update 3/10/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.  After this, the Java/C# versions are indistinguishable on speed, but the C# code used ~88 MB of memory while the java version used >1GB.  Therefore, I now believe the winner is C# and a fast implementation of this parser (which can be over an order of magnitude faster for certain scenarios not in this test) is available here

VCF files are a popular way to store information about genotypic variation from next generation sequencing studies.  The files are essentially large matrices, where each element in the matrix represents a collection of information about the genotype of a particular person at a particular locus in the genome (in this sense, they can be considered as a multi-dimensional matrix in a flat file format). The Java Picard package is a common utility used for parsing these files.  While parsing, the challenge is to read each line (row) of the file, and construct objects for each element in that row that can then be manipulated or inspected.  I just finished translating the Java VCF parser in Picard to C#, and so thought it might be a good chance to compare the two different languages and runtimes. C# showed a number of advantages in the translation.  The translation itself was mostly a lot of deleting.  The get/set assessors in C# really allowed for the removal of seemingly endless amount of getXXXX/setXXX methods in Java.  It also seemed like every other line in Java was a call to some apache commons class to perform a simple task like get the maximum value in an array, create an empty list, or do a selection on data.  Extension methods and Linq have clear advantages for data processing here (though I have found these have a slight overhead relative to the for loop equivalents).  Yield statements in Java also would seem to be useful. At the same time, Java had some things that would have been nice in C#.  I had to implement basic collection types like immutable hashsets and dictionaries, a LinkedHashSet class as well as an OrderedGenericDictionary during the port.  These should be in the C# language.

Performance

This of course am what I am most interested in.  My main computer is broken, so I had to test on my windows desktop at home.  For the test, each program would read a gzipped VCF file for 20,000 lines, first creating an initial lazy class representing a portion of the data and then fully decoding this lazy version to represent the complete data as objects.  The test file was a VCF with >1,000 individuals, though unfortunately most of these individuals were non-calls at various positions, but its what I had on hand. Immediately after porting – After essentially re-writing the Java in C#, I ran some tests.  Both Java and C# can run in either client (low-memory) or server modes.  So I did both, here are the results:
Platform VM Options Working Set Paged Memory Time (s)
Java None 27.8 MB 41.32 MB 11.5
.NET None 28.9 MB 30.22 MB 15.1
Ratio 1 1.35 0.76
Java Server 362.2 MB 414 MB 7.4
.NET Server (GC) 126 MB 332 MB 14.7
Ratio 2.9 1.25 0.5
A couple noticeable conclusions here. First Java is smoking .NET on performance, but this is essentially Java code at this point and it wasn’t written for C#.  Second, there is a massive amount more memory used in server mode, and in Java at least one obtains a large performance win for this cost. After “Sharpening” the code – The initial port was basically Java, so I cleaned it up a bit after running it through the profiler, here are some notable changes:
  • String.Split winds up being a big cost in this parsing.  When porting I initially just recreated an array every time, after I realized I was recreating such large arrays I reused them as in the Java code.
  • In C# you can use unsafe pointers and I got a big performance win out of these.  I grabbed the System.String.Split code from the Mono framework, trimmed/altered it, and used it for all the split methods that seemed to be taking a long time.  The Java version also implements a custom Split method, though obviously can’t use pointers.
  • Some additional cleanup in the logic.
Platform VM Options Working Set Paged Memory Time (s)
Java None 27.8 MB 41.32 MB 11.6
.NET None 28.9 MB 30.22 MB 12.6
Ratio 1 1.35 0.92
Java Server 362.2 MB 414 MB 7.4
.NET Server (GC) 123 MB 181.61 MB 10.8
Ratio 3 2.27 0.68
So round 2, and once again Java is the winner for speed in server mode, though at a high cost in memory.  For the lower memory client model, it is nearly a tie between the two.

What explains the difference?

These programs are nearly identical but the bottleneck is not the same in both. It seems C# can split strings much faster and Java can allocate memory much faster.  Although I am less familiar with the Java profiler, it seems to show 66% of the time is spent on the String splits.  In contrast, in C# the methods that are taking up time have to do with allocating memory (constructors) and the GC, string splits are only ~17% of the total time. image image One the one hand, this means that the JVM is really doing a great job in server mode on memory allocations and other optimizations.  I can’t think why C# shouldn’t be able to match the JVM performance (perhaps dynamic recompilation is really killing it here).  On the other hand, it means that the C# program can still be improved, while I can’t really see how to improve the Java one.  The String.Split method has already been rewritten, and I didn’t see any reasonable improvements for it.  In contrast, both programs have several places where memory allocations can be saved.  For example, one aspect of the parser is that it relies on a factory class to create another class and so allocates twice the memory that it needs as it creates two large objects.  Simply having one object be it’s own factory would solve this.  Similarly, several empty lists are repeatedly created that could point to the same list or simply be skipped.  I wanted this parser to generally match the Java version, so did not pursue these changes, but my guess is they may shrink the difference (though again, in Java this clearly didn’t matter).

Conclusions

The JVM is the clear winner on speed, particularly in server mode where memory isn’t an issue.  C# was the clear winner on brevity, syntax and language features.  The difference was only substantial in server mode, and the C# program (and likely the Java program) were far from optimal, but it gives a rough hint at how they compare.  The next question will be how they compare when C# runs with mono on a Linux environment.

Software Patents

In the dark moments when one wonders if their research is doing enough to better the world, it’s always nice to remember you’re not a patent lawyer.  The radio program This American Life did a fantastic investigation of this issue recently, highly recommended and very entertaining, so I thought I would point it out here.  It explains how broken the system is, and although I don’t know how best to ensure people are compensated for their creative works, clearly we need better solutions.

The outcome of perceived or real patent fights basically moved me away from the Linux desktop.  In 2006 I thought the Linux desktop would become the world standard (the other options were far from awesome).  Instead the technology company that spends the least on research and development rose to prominence and essentially everyone I know who uses Linux comes from the tail end of the most affluent class of society (which probably makes them great targets for the now integrated Amazon search).  Here’s hoping this mess gets cleaned up in the next few years.