
.NET Bio is Significantly Faster on .NET Core 2.0

Summary: With the release of .NET Core 2.0, .NET Bio runs significantly faster (~2X) on Mac OSX thanks to better compilation and memory management.

The .NET Bio library provides code for genomic data processing tasks such as parsing and alignment, tasks that are too computationally intense for interpreted languages like Python or R but can be performed efficiently in typed, compiled languages like Java and C#, without the productivity burden of coding in C/C++.

Historically, .NET Bio ran on two different frameworks.  On Windows, one could leverage all the substantial engineering Microsoft invested in its CLR to create fast and stable programs.  On Mac or Linux, one could use the independently implemented Mono framework, a solid effort by a far less resourced team that performed well but never really came close to the quality of the Microsoft implementation.  In practice, as bioinformaticians are weaned on the Linux/GCC toolchain and infrastructure, Mono was the frequently used option, which led to performance and reliability issues.

All of this changed when Microsoft decided to open source .NET, releasing a version of their framework for Mac, Linux and Windows called .NET Core.  This was a watershed moment, as it allows open source authors to take advantage of all the advanced memory management and compilation techniques Microsoft developed.  Better garbage collection, advanced pre-compilation and vector instructions all allow higher-level languages like C# to perform nearly as well as aggressively optimized C/C++ for many practical applications.

To test out the new possibilities, I compared the time it takes to parse all the reads in a 1.2 GB BAM file using either the old Mono runtime or the new .NET Core runtime on my MacBook.  The test code simply counted reads after they were deserialized from the compressed format and converted into objects.

The test case was timed on each platform, with times taken in triplicate and in random order; the results are shown below.  I also benchmarked an equivalent task using the industry-standard samtools from htslib, a C library that accomplishes the same task with lots of bit operations and is about as performant as one can get [1].
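
The harness itself is nothing fancy; a minimal sketch of the kind of timing loop used is below (this is illustrative, not the exact benchmark code: CountReads is a hypothetical placeholder for the .NET Bio-based counting step, since the parser entry points differ between .NET Bio versions):

    using System;
    using System.Diagnostics;

    class BamCountBenchmark
    {
        // Hypothetical placeholder: deserialize every read in the BAM file
        // (e.g. via Bio.IO.BAM.BAMParser) and return the total count.
        static long CountReads(string bamPath)
        {
            throw new NotImplementedException();
        }

        static void Main(string[] args)
        {
            string bamPath = args[0];
            // Each platform was timed in triplicate; runs were done in random order.
            for (int rep = 0; rep < 3; rep++)
            {
                var sw = Stopwatch.StartNew();
                long n = CountReads(bamPath);
                sw.Stop();
                Console.WriteLine($"Run {rep + 1}: {n:N0} reads in {sw.Elapsed.TotalSeconds:F1} s");
            }
        }
    }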

Profiling Each Platform

Mono: The Mono platform allows us to easily profile the code using Instruments on Mac OSX.  Examining the profile of the running code, we can see where the bottlenecks are.  A portion of time is spent in libz decompressing the underlying gzipped data, but much time also appears to go to garbage-collection-related events, e.g. preparing for memory allocation with mono_handle_new.

.NET Core 2.0:  Because .NET Core 2.0 is so newly available on multiple platforms, there is no good way to profile it on Mac OSX yet.  Typical profilers cannot understand the JITted C# code, but we can look at the compiled C/C++ code, and there we see that .NET Core spends twice as large a share of its time in libz as Mono does.  Since much of this task is simply decompressing data, more time in the decompression library means the rest of the code is running faster, which is a good sign.

To further profile the managed code, I wrote a custom MacOS profiler that changed how each method was compiled, inserting hooks that make a P/Invoke call into C++ code at the entrance and exit of every managed method frame, from which I calculated the inclusive time spent in each method (a sketch of the hook mechanism follows the table below).  Although this instrumentation is expected to greatly skew the results, it showed that the majority of the managed time is spent parsing the data, and much of that is spent doing validations!

Method Name                                        Inclusive Time (s)   Call Count
Bio.IO.BAM.BAMParser.GetAlignedSequence            2.01235              65255
Bio.Util.Helper.StringContainsIllegalCharacters    0.67041              1487805
Bio.IO.SAM.SAMOptionalField.set_Tag                0.551263             711275
Bio.IO.SAM.SAMOptionalField.set_VType              0.549131             711275
Bio.IO.SAM.SAMOptionalField.IsValidTag             0.547594             711275
Bio.IO.SAM.SAMOptionalField.IsValidVType           0.54615              711275
Bio.IO.SAM.SAMOptionalField.set_Value              0.543942             711275
Bio.IO.SAM.SAMOptionalField.IsValidValue           0.543379             711275
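
For reference, the enter/exit instrumentation described above conceptually looks like this sketch (the native library and export names are illustrative, not the actual profiler):

    using System;
    using System.Runtime.InteropServices;

    static class ProfilerHooks
    {
        // Illustrative native exports; the real profiler's naming differs.
        [DllImport("libmethodtimer")]
        public static extern void MethodEnter(int methodId);

        [DllImport("libmethodtimer")]
        public static extern void MethodExit(int methodId);
    }

    class Instrumented
    {
        // After instrumentation, a managed method behaves roughly like this:
        static int ParseField(string value)
        {
            ProfilerHooks.MethodEnter(42);      // native side records an entry timestamp
            try
            {
                return value.Length;            // original method body
            }
            finally
            {
                ProfilerHooks.MethodExit(42);   // exit timestamp; inclusive time = exit - enter
            }
        }
    }
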
SAMTools: Continuing the trend that the more performant program spends more time in libz, samtools spends half of its time there, and the other half essentially converting the binary records into usable data (in sam_format1). SAMTools is optimized C code with minimal error checking, so it represents a rough upper bound on the speed that could be obtained.

Conclusions

The C# code was written without any particular concern for performance and contains plenty of validation checks and unnecessary copies, yet on .NET Core it ran only about 2X slower than high-performance C code.  Thanks to the engineering in .NET Core, the code now runs twice as fast on Mac OSX as it used to.

Although C# is still slower than C, it is the clear winner here.  The SAMTools code bottlenecks in a method that, although heavily optimized, is horrifically difficult to read.  In contrast, the .NET code bottlenecks in a method that is filled with nearly gratuitous memory allocations and a plethora of checks where useful exceptions could be thrown and meaningful error messages delivered if assumptions aren't met.  Plus, it has all the advantages of memory safety and quality tooling we've come to expect in 2017.  It's intriguing to ponder how the benchmark would look if the C# code had been written in a style as performance-oriented as the C code, by avoiding unnecessary allocations, using pointers, or using the new ref returns available in C# 7.0; I've shown previously that such changes could significantly increase the parser's performance.
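
As a small illustration of the last point, C# 7.0's ref returns let a parser hand back a reference into an existing buffer instead of copying the element, which is exactly the kind of change that trims allocations in hot loops (a minimal sketch, not .NET Bio code):

    using System;

    class ReadBuffer
    {
        private readonly byte[] data = new byte[256];

        // Returns a reference to the element rather than a copy of it.
        public ref byte ElementAt(int index) => ref data[index];
    }

    class Program
    {
        static void Main()
        {
            var buffer = new ReadBuffer();
            ref byte b = ref buffer.ElementAt(10);   // no copy; b aliases the array slot
            b = 0x42;                                // writes straight into the buffer
            Console.WriteLine(buffer.ElementAt(10)); // prints 66
        }
    }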

The .NET ecosystem is moving along well for meaningful scientific computing. Importantly, although time spent in C/C++ code will always be more performant, now that .NET Core is open source we can embed any time-critical functionality in the runtime itself, allowing us to have things like ludicrously performant suffix arrays interacting with parsers.  .NET Core 2.0 is a big step forward.

References

1. This is for methods that fully parse the BAM read. PySAM, which lazily decodes the data, can finish a simple counting task in about 15 seconds, but this isn't representative of a workflow that requires full decoding.

Mono.Simd and the Mandelbrot Set

C# and .NET are among the fastest high-level languages, but they still cannot truly compete with C/C++ for low-level speed, and C# code can be anywhere from 20% to 300% slower.  This is despite the fact that the C# compiler often gets as much information about a method as the C compiler does.  It has been suggested that SSE/SIMD instructions in C# could overcome some of these issues.  To evaluate this, I took a famous computational task, re-coded it using C# SIMD instructions, and evaluated the performance by looking at the execution time and at how the emitted assembly compared to the optimal assembly code.

Background

In 2009, Miguel de Icaza demonstrated a framework that allows C# to use SSE2 intrinsic operations. This is now part of the Mono library, and in theory such methods can save substantial computational time (1.5X-16X), particularly for operations on byte arrays, a common task in bioinformatics.

Test Case

Computing the Mandelbrot set is one of the tasks in the Computer Language Benchmarks Game, which compares program speed across languages.  Currently, C# is listed as slower than Java, though the programs in each language use different techniques and neither uses an optimal algorithm (see here for a better one).  Still, it makes a useful test case for raw computing speed.  The challenge is to compute the picture shown below by recursively squaring a complex number and adding a constant to it.

z_{n+1} = z_n^2 + c

[Figure: the rendered Mandelbrot set produced by the benchmark]

The Algorithm

Complex numbers are a particularly challenging case because their multiplication is not a simple operation; it involves a rather annoying squaring of different terms and then adding/subtracting them:

z_{n+1} = (x_n^2 - y_n^2 + c_x) + (2 x_n y_n + c_y)i, where z_n = x_n + y_n i and c = c_x + c_y i

These are not easy vector operations, and as will be shown later, one consequence is that using SSE to speed up a single complex multiplication is useless (I tried; it was worse). So the performance improvement comes from doing two elements of the loop at a time (this is how the current Java submission beats the C# one, even though it does not use SIMD).  The current C# version does the inner loop of the program without SIMD instructions as follows:


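A sketch of that inner loop, in the usual Benchmarks Game style (variable names are illustrative rather than the exact submission), looks like this:

    // Scalar inner loop: iterate z = z^2 + c for a single pixel.
    // tr and ti cache zr*zr and zi*zi so the multiplies can overlap across iterations.
    double zr = 0.0, zi = 0.0, tr = 0.0, ti = 0.0;
    int i = maxIterations;
    do
    {
        zi = 2.0 * zr * zi + ci;    // imaginary part: 2*x*y + c_y
        zr = tr - ti + cr;          // real part: x^2 - y^2 + c_x
        tr = zr * zr;
        ti = zi * zi;
    } while ((tr + ti <= 4.0) && (--i > 0));
    // If i reached 0 the point never escaped and is considered inside the set.
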
This loop iterates until the point diverges (tr + ti exceeds 4.0) or the iteration budget is exhausted, i.e. it continues while (tr + ti <= 4.0) && (--i > 0). See the Wikipedia page for a full explanation.  Of interest here is that the "t" variables exist for pipelining purposes, so this is a reasonably well optimized code snippet.

Trying to Implement the Optimal Solution Using Mono.Simd

Actually figuring out how to do the complex arithmetic with SSE2 is rather challenging.  Fortunately, Intel published a document giving the best solution as hand-coded assembly, which involves some special SSE3 instructions I was not aware of.  Notably, the Intel-optimized code is far faster than even their C code, but is only about 25% better than their optimized assembly code without SSE3.

The code below shows my attempt to implement the Intel approach in Mono.  It should be pretty fast, but it may suffer from one fatal flaw.  The comparison at the end, which checks the exit conditions, currently requires unpacking both values in the Vector2d (when either Length.X or Length.Y has been < 4.0 at least once, the loop stops).  The comparison for both X and Y could be done in one SIMD operation using the built-in less-than, but I do not know how to turn that vector result into a branch for the control loop, as this requires a movmskps assembly instruction which Mono does not seem to expose.
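
A rough sketch of the two-pixels-per-iteration idea is shown below (this is not the actual submission; it assumes Mono.Simd.Vector2d exposes the basic +, -, * operators and X/Y accessors, and it omits the per-pixel escape bookkeeping the real benchmark needs):

    using Mono.Simd;

    // Iterate two Mandelbrot points at once: lane X of each vector is pixel 0,
    // lane Y is pixel 1. cr/ci hold the two constants, zr/zi the two running values.
    static int IterateTwoPixels(Vector2d cr, Vector2d ci, int maxIterations)
    {
        Vector2d zr = new Vector2d(0.0, 0.0);
        Vector2d zi = new Vector2d(0.0, 0.0);
        Vector2d tr = new Vector2d(0.0, 0.0);
        Vector2d ti = new Vector2d(0.0, 0.0);
        int i = maxIterations;
        do
        {
            Vector2d xy = zr * zi;
            zi = xy + xy + ci;             // 2*x*y + c_y for both pixels
            zr = tr - ti + cr;             // x^2 - y^2 + c_x for both pixels
            tr = zr * zr;
            ti = zi * zi;
            // The awkward step described above: the exit test unpacks both lanes,
            // since Mono.Simd exposes no movmsk-style way to branch on a packed compare.
        } while ((tr.X + ti.X <= 4.0 || tr.Y + ti.Y <= 4.0) && --i > 0);
        return maxIterations - i;          // iterations performed (illustrative)
    }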

Results – It’s Faster on Ubuntu

[Figure: benchmark timings for the three versions]

Shown above are the times for three different algorithms.  The first is the original best submission on the website, and my SIMD submission is shown at the bottom.  Because the SIMD version does two cycles within each inner loop, as the Java submission does, I also tested the Java submission converted to C#.  Compared to the current submission, my version shows an 81% improvement, but clearly much of that comes simply from doing two cycles in one loop.  Even once that change is made, however, the SIMD instructions still give a performance boost.

The Assembly Generated is Still Not Optimal

The generated assembly did include lots of SSE instructions, but inspecting it I noticed several things.

  1. Never unpack the double array values!  My first attempt tried to do the SSE2 steps with those instructions and then unpack the single values as needed.  However, this failed pretty miserably, as accessing the doubles seemed to involve a lot of stack-to-register movement.

  2. Not all the XMM registers are used.  The optimal version uses all of them; the C# version uses only 0-3 and often moves things to and from the stack. Not using registers optimally seems to be a common problem with C#.

Conclusions

The SIMD version was indeed a fair bit faster, which is nice!  In this case it was not a game-changer, but most importantly it was really easy to use, and I might incorporate the byte array operations at some point in future code.  This exercise also gave me an appreciation for assembly, which, in contrast to what I had heard, is easy to read and seems easy to optimize.  I just submitted the code to the shoot-out; assuming it works there, it should be up soon, and I would love for someone to show me how to fix the end statement.
