Category Archives: Computing

The .NET Bio BAM Parser is Smoking Fast

The .NET Bio library has an improved version of it’s BAM file parser, which makes it significantly faster and easily competitive with the current standard C coded SAMTools for obtaining sequencing data and working with it. The chart below compares the time it takes in seconds for the old version of the parser and the current version to parse a 76 MB BAM file. The current parser can easily create ~150K validated sequence objects per second on the clunky old servers I typically run code on. Note that the windows and unix numbers are from completely different machines and not comparable. Also included is a comparison to a “custom” version of the parser that I wrote, which uses unsafe code, assumes the system architecture is always little endian, caches strings and does some other tricks to get some further performance improvements and reduce memory usage. img5 The comparison to samtools is based on the system time to parse the file using this task on the same unix server used for the C# tests. samtools view Test.bam | wc -l And is just meant to give a general idea of the performance comparison, as there are several important differences between the samtools and .NET Bio test. The C# version was not allowed to reuse memory for objects as it was supposed to be working as a data producer, while the Samtools version processes reads one at a time and does reuse memory. C# also made a lot of dictionaries to aid quick access to the read groups, which isn’t done by samtools. However, samtools had to write the files to the output pipe, while the C# version did not, which undoubtably introduces a fair bit of overhead for it. Both tools however, are clearly plenty fast and at this stage further performance improvements would come from lazy evaluation (or not sticking unnecessary information like the original quality scores in the BAM files!), and the language won’t matter much. Performance Comments One task when parsing BAMs is unpacking lots of information that is packed together in arrays.  In SAMtools and the current .NET Bio parser, this is done with lots of unpacking of bits by integer manipulations.  For example this code from SAMTools: Because C# has pointers and value-type structs however, I discovered that it is a lot more fun just to define a structure that contains those fields and unpack directly with a pointer cast in C#. Blam! Now all the data related to the position, bin read group is in the object with those three lines that copy the data very fast. So where are the bottlenecks remaining? On windows about a third of the time is spent doing the decompression. In Mono, because the decompression is done by zlib and not in managed code, it’s effectively free. Currently, the quality data and sequence data are passed around a bunch, and the code could likely be made about 10% faster by not copying that data but reusing a single byte array each time. However, it is so fast it hardly seems worth worrying about.

Using Selectome with .NET Bio, F# and R

The Bio.Selectome namespace has features to query Selectome.Selectome is a database that merges data from Ensembl and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN/dS) mutations along various branches of the phylogenetic tree. A low dN/dS ratio indicates that the protein sequence is under strong selective constraint, while a high one indicates that selective constraint is more relaxed. Selectome is also a fantastic resource to get gene trees and multiple sequence alignments. Using selectome and .NET Bio allows you to quickly investigate divergence across the vertebrate phylogeny. This page gives a walk of how to query selectome, convert the text data it returns in to objects, and the compute and plot various quantities from those objects.

Example Walk Through.


Step 0: Setup F#–If you haven’t used F# before, you can download Tsunami. Open the program and add references to Bio.Selectome using the #r command and open the relevant namespaces. (highlight and hit Alt+Enter to run in the console).


Step 1: Make Gene List – Selectome requires ensembl identifiers be used in queries.To create a set of interesting genes, I first downloaded the full set of genes from the MitoCarta website. These genes are identified by Entrez IDs, while selectome uses Ensembl IDs, so to convert between these I used the GeneID converter website to create a flatfile of the new ids. Given this flatfile, we can load it and convert it to typed classes as follows:
Step 2: Query Selectome –  All selectome data is accessed through the SelectomeDataFetcher class. This class will return a SelelctomeQueryResult that will let you know if the query was successful. Currently, the queries will only be successful for genes that exist in the database and have data available for the full vertebrate tree. If no data is available the Result will be NoResultsFound, if selectome returned data but there was no tree available for all vertebrates(but maybe just primates) the result will be NoVeterbrateTreeDataFound. We want to extract genes from query results that successfully returned data for the vertebrate tree.
Step 3: Determine how many genes show positive selection – F# makes this easy: Interestingly, roughly 33% of genes show selection, so we know not to get too excited by any one result!

Step 4: Download Multiple Sequence Alignments  – In order to decide how conserved a protein is relative to other proteins, we can download the multiple sequence alignment for each protein in this set and compare it to a particular protein of interest.  In Selectome, each protein comes with a masked and unmasked alignment for both proteins and DNA. These objects are available from the SelectomeGene class and are lazily downloaded when requested from the Selectome server.  These sequence alignment downloads are also cached for 30 days in a temporary directory to avoid waiting for downloads if you want to reload your script of interest.  Once downloaded they are converted to full-fledged .NET BIO multiple sequence alignments, meaning one can do nearly anything with them. The example below gets the DNA alignment and the BLOSUM90 alignment score for the masked amino acid alignments.


Step 5: Download the Gene Trees – The selectome gene defines a class, SelectomeTree, that provides a great set of functions to query all the interesting metadata provided by selectome. These features are most usefully investigated by using the autocomplete functionality of your editor, but there is a lot of useful information! Some examples are also shown below.

Tree queries are also cached locally to avoid going back to the server in the event of repeated requests.
Step 6: Plot distribution of interest using the R data provider – You can call R plotting functions directly from F# using the R type provider. More information is available from that site, but the code snippet below is sufficient to produce a histogram of alignment scores, no need to dump to a flat file first! Huzzah! One intermediate machine to rule them all (or at least to avoid useless glue between different libraries/APIs).

Mono.Simd and the Mandlebrot Set.

C# and .NET are some of the fastest high level languages, but still cannot truly compete with C/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the C compiler.  It has been suggested that SSE/SIMD instructions in C# could overcome some of these issues.  To evaluate this, I took a famous computational task, re-coded it using C# SIMD instructions, and evaluated the performance by looking at the execution time and how the emitted assembly code compared to the optimal assembly code.

Background

In 2009, Miguel de Icaza demonstrated a framework that allows C# to use SSE2 intrinsic operations.This is now part of the Mono library and in theory, such methods can greatly save computational time (1.5X-16X), particularly for operations on byte arrays, a common task in bioinformatics.

Test Case

Computing the mandelbrot set is one of the tasks of the computer benchmarks game, which compares program speed in different languages.  Currently, C# is listed as being slower than Java, though the programs in each language use different techniques and neither uses an optimal algorithm (see here for a better one).  However, it makes a useful test case for raw computing speed.  The challenge is to compute the picture shown below by recursively squaring a complex number and adding a constant to it.

z_{n+1} = z_n^2 + c

image_thumb[1]

The Algorithm

Complex numbers are a particularly challenging case because their multiplication is not a simple operation, but involves a rather annoyingly squaring of different terms and then adding/subtracting them.

image_thumb1

These are not easy vector operations, and as will be shown later one result of this is that using SSE to speed up the values for one multiplication is useless (I tried, it was worse). So, the performance improvement comes from doing two elements of the loop at a time (this is how the current Java submission is faster than the C# one, though it does not use SIMD).  The current C# version does the inner loop of the program without SIMD instructions as follows:


This loop is iterating until convergence (<4) or until the max iterations (i<0) have expired. See the wikipedia page for a full explanation.  Of interest here is that the “t” variables exist for pipelining purposes, so this is a reasonably optimized code snippet.

Trying to Implement the Optimal Solution Using Mono.Simd

Actually figuring out how to do the complex arthimetic with SSE2 is rather challenging.  Fortunately Intel published a document giving the best solution as hand coded assembly, which involves using some special SSE3 instructions I was not aware of.  Notably, the Intel optimized code is far faster than even their C code, but is only about 25% better than their optimized assembly code without SSE3.

The code below shows my attempt to implement the Intel approach in Mono.  It should be pretty fast, however it may suffer from one fatal flaw.  The comparison at the end to check for the final conditions currently requires unpacking both values in the Vector2D (when either Length.X or Length.Y have been < 4.0 at least once, then the loop stops).  The comparison for both X and Y can be done in one operation in SIMD using the built in less than statement.  However, I do not know how to turn that into a control loop function, as this requires a movmskps assembly function which mono does not seem to expose.

Results – It’s Faster on Ubuntu

img2

Shown above are the times for three different algorithms. The first is the original best submission on the website.  My SIMD submission is shown on the bottom.  Because the SIMD version does two cycles within each inner loop, as the Java submission does, I tested the Java submission converted to C# as well.  Compared to the current submission this version shows a 81% improvement, but clearly much of that is simply from doing 2 cycles in one loop.  However, even once this change is made, the SIMD instructions still give a performance boost.

The Assembly Generated is Still Not Optimal

Although the assembly generated did include lots of SSE commands, in general inspecting the assembly I noticed several things.

  1. Never unpack the double array values!  My first attempt tried to do the SSE2 steps with those instructions, and then unpack the single values as needed.  However, this failed prettty miserably, as accessing the double seemed to involve a lot of stack to register movement.

  2. All the XMM registers are not used.  The optimal version uses all of them, the C# version uses only 0-3, and soften moves things to and from the stack. Not using registers optimally seems to be a common problem though with C#.

Conclusions

The SIMD version was indeed a fair bit faster, which is nice!  However, in this case it was not a game-changer.  Most importantly though, it was really easy to use, and I think I might incorporate the byte array operations at some point in future code.  This also gave me an appreciation for assembly, which in contrast to what I had heard is easy to read and seems easy to optimize.  I just submitted the code to the shoot-out, assuming it works there it should be up soon, and I would love for someone to show me how to fix the end statement.

References   [ + ]

1. tr + ti <= 4.0) && (--i > 0

Accessing dbSNP with C# and the .NET Platform

NCBI Entrez can be accessed with many different platforms (python, R, etc.) , but I find .NET one of the best because the static typing makes it easy to infer what all the datafields mean, and navigate the data with much greater ease.

Documentation is sparse for this task, but here is how to access NCBI from the .NET platform.  The example steps/program show how to query dbSNP for information about particular ids.

  1. Open visual studio, start a new project and add two Service References to the project: http://eutils.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl and http://eutils.ncbi.nlm.nih.gov/soap/v2.0/efetch_snp.wsdl files. Note that the efetch depends on the database used, in this case “snp” other databases have different ones.  You should now have two references:image

More information is available here on setting up visual studio: http://www.ncbi.nlm.nih.gov/books/NBK55694/

2. Next up grab the data as shown in the example.  Each SNP has A LOT more data, which can be inspected in the IDE.

   1:  static void Main(string[] args)
   2:  {        
   3:          string dbSNPid="28357684";
   4:          efetch.eFetchRequest re = new efetch.eFetchRequest();
   5:          re.id = dbSNPid;
   6:          var serv=new efetch.eUtilsServiceSoapClient();
   7:          var exchange = serv.run_eFetch(re);
   8:          var snpData = exchange.ExchangeSet.Rs[0];
   9:          object[] dataToReport=new object[] {
  10:              snpData.Het.value,
  11:              snpData.hgvs.First(),
  12:              snpData.PrimarySequence.First().accession,
  13:          };
  14:          Console.WriteLine( String.Join("\t",dataToReport.Select(x=>x.ToString()).ToArray())); 
  15:          Console.ReadLine();
  16:  }

The following links contain other useful information for consuming the entrez webservice in C#.

Setting up visual studio: http://www.ncbi.nlm.nih.gov/books/NBK55694/

Using efetch: http://www.ncbi.nlm.nih.gov/books/NBK55694/#csharp_msvs.Call_HYPERLINK__chaptercha_8

Forming queries: http://www.biostars.org/p/3436/

More information: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

Also note that the first query takes much longer than subsequent ones, for reasons unknown to me at present.

Java vs. C# Performance Comparison for Parsing VCF Files

Making a comparison with a reasonably complex program ported between the two languages.

Update 3/10/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.  After this, the Java/C# versions are indistinguishable on speed, but the C# code used ~88 MB of memory while the java version used >1GB.  Therefore, I now believe the winner is C# and a fast implementation of this parser (which can be over an order of magnitude faster for certain scenarios not in this test) is available here

VCF files are a popular way to store information about genotypic variation from next generation sequencing studies.  The files are essentially large matrices, where each element in the matrix represents a collection of information about the genotype of a particular person at a particular locus in the genome (in this sense, they can be considered as a multi-dimensional matrix in a flat file format). The Java Picard package is a common utility used for parsing these files.  While parsing, the challenge is to read each line (row) of the file, and construct objects for each element in that row that can then be manipulated or inspected.  I just finished translating the Java VCF parser in Picard to C#, and so thought it might be a good chance to compare the two different languages and runtimes. C# showed a number of advantages in the translation.  The translation itself was mostly a lot of deleting.  The get/set assessors in C# really allowed for the removal of seemingly endless amount of getXXXX/setXXX methods in Java.  It also seemed like every other line in Java was a call to some apache commons class to perform a simple task like get the maximum value in an array, create an empty list, or do a selection on data.  Extension methods and Linq have clear advantages for data processing here (though I have found these have a slight overhead relative to the for loop equivalents).  Yield statements in Java also would seem to be useful. At the same time, Java had some things that would have been nice in C#.  I had to implement basic collection types like immutable hashsets and dictionaries, a LinkedHashSet class as well as an OrderedGenericDictionary during the port.  These should be in the C# language.

Performance

This of course am what I am most interested in.  My main computer is broken, so I had to test on my windows desktop at home.  For the test, each program would read a gzipped VCF file for 20,000 lines, first creating an initial lazy class representing a portion of the data and then fully decoding this lazy version to represent the complete data as objects.  The test file was a VCF with >1,000 individuals, though unfortunately most of these individuals were non-calls at various positions, but its what I had on hand. Immediately after porting – After essentially re-writing the Java in C#, I ran some tests.  Both Java and C# can run in either client (low-memory) or server modes.  So I did both, here are the results:
Platform VM Options Working Set Paged Memory Time (s)
Java None 27.8 MB 41.32 MB 11.5
.NET None 28.9 MB 30.22 MB 15.1
Ratio 1 1.35 0.76
Java Server 362.2 MB 414 MB 7.4
.NET Server (GC) 126 MB 332 MB 14.7
Ratio 2.9 1.25 0.5
A couple noticeable conclusions here. First Java is smoking .NET on performance, but this is essentially Java code at this point and it wasn’t written for C#.  Second, there is a massive amount more memory used in server mode, and in Java at least one obtains a large performance win for this cost. After “Sharpening” the code – The initial port was basically Java, so I cleaned it up a bit after running it through the profiler, here are some notable changes:
  • String.Split winds up being a big cost in this parsing.  When porting I initially just recreated an array every time, after I realized I was recreating such large arrays I reused them as in the Java code.
  • In C# you can use unsafe pointers and I got a big performance win out of these.  I grabbed the System.String.Split code from the Mono framework, trimmed/altered it, and used it for all the split methods that seemed to be taking a long time.  The Java version also implements a custom Split method, though obviously can’t use pointers.
  • Some additional cleanup in the logic.
Platform VM Options Working Set Paged Memory Time (s)
Java None 27.8 MB 41.32 MB 11.6
.NET None 28.9 MB 30.22 MB 12.6
Ratio 1 1.35 0.92
Java Server 362.2 MB 414 MB 7.4
.NET Server (GC) 123 MB 181.61 MB 10.8
Ratio 3 2.27 0.68
So round 2, and once again Java is the winner for speed in server mode, though at a high cost in memory.  For the lower memory client model, it is nearly a tie between the two.

What explains the difference?

These programs are nearly identical but the bottleneck is not the same in both. It seems C# can split strings much faster and Java can allocate memory much faster.  Although I am less familiar with the Java profiler, it seems to show 66% of the time is spent on the String splits.  In contrast, in C# the methods that are taking up time have to do with allocating memory (constructors) and the GC, string splits are only ~17% of the total time. image image One the one hand, this means that the JVM is really doing a great job in server mode on memory allocations and other optimizations.  I can’t think why C# shouldn’t be able to match the JVM performance (perhaps dynamic recompilation is really killing it here).  On the other hand, it means that the C# program can still be improved, while I can’t really see how to improve the Java one.  The String.Split method has already been rewritten, and I didn’t see any reasonable improvements for it.  In contrast, both programs have several places where memory allocations can be saved.  For example, one aspect of the parser is that it relies on a factory class to create another class and so allocates twice the memory that it needs as it creates two large objects.  Simply having one object be it’s own factory would solve this.  Similarly, several empty lists are repeatedly created that could point to the same list or simply be skipped.  I wanted this parser to generally match the Java version, so did not pursue these changes, but my guess is they may shrink the difference (though again, in Java this clearly didn’t matter).

Conclusions

The JVM is the clear winner on speed, particularly in server mode where memory isn’t an issue.  C# was the clear winner on brevity, syntax and language features.  The difference was only substantial in server mode, and the C# program (and likely the Java program) were far from optimal, but it gives a rough hint at how they compare.  The next question will be how they compare when C# runs with mono on a Linux environment.

How to remove the “Trial Edition” banner from the VisiFire open source chart kit

Visifire is a very good graphing component for making silverlight or WPF applications.  The component was first released as an open source library on GoogleCode, but since then has been made a closed source proprietary and for profit project.  The newer version contains several enhancements, but the open source version is still quite useful.  The greatest advantage is that while chart rendering can be impossibly slow with the Silverlight/WPF toolkit, it is very fast with visifire.

Unfortunately, it is hard to find the original open source version, it is available at: https://code.google.com/p/visifire/ .  Even after downloading the open source version, one still sees a rather annoying “Visifire Trial Edition” banner in the upper right corner (shown below), which looks unprofessional.

image

Removing this in the source code is complicated because this “Trial Edition Tag” is generated in an obfuscated fashion (presumably to prevent people from doing just that).  The text and hyperlink are encoded as byte arrays, which are unpacked in a somewhat convoluted way.  They can be found in the Visifire Control class.

private static Byte[] wmRegVal = new Byte[] { 0x56, 0x69, 0x73, 0x69, 0x66, 0x69, 0x72, 0x65, 0x20, 0x54, 0x72, 0x69, 0x61, 0x6C, 0x20, 0x45, 0x64, 0x69, 0x74, 0x69, 0x6F, 0x6E };

private Byte[] wmLinkVal = new Byte[] { 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77, 0x2E, 0x56, 0x69, 0x73, 0x69, 0x66, 0x69, 0x72, 0x65, 0x2E, 0x63, 0x6F, 0x6D, 0x2F, 0x6C, 0x69, 0x63, 0x65, 0x6E, 0x73, 0x65, 0x2E, 0x70, 0x68, 0x70 };

After downloading the open source version, simply search for these arrays, comment them out, and then comment out the one line that uses them them when directed to by the compiler error.  Rebuild, and voila!  No more banner in your open source software.

Compile Bowtie2 on Windows 64 bit.

Bowtie 2 is a program that efficiently aligns next generation sequence data to a reference genome. However, the version distributed by the authors only compiles on POSIX platforms. These instructions will allow you to compile it on windows by downloading the Mingw64 tools and editing the make file before building the program. Instructions
  1. Download the Bowtie 2 source code. Extract the code to a location on your disk.
  2. Download the mingW64 compiler tools. These tools are much easier to use if they are downloaded as a package from this site: TDM Compiler Package. Be sure to select the TDM 64 bit version of the tools.
  3. Run the installer for the package. When prompted, select all of the available packages for installation.
  4. In explorer, navigate to where you unzipped the source code for bowtie. Find the file called Makefile and edit it. A great way to do this is to install the program notepad++. If notepad++ is installed, simply write click on the file and select “edit with notepad++”.
  5. Edit the file so that it knows it is compiling on Ming and Windows. To do this, insert # marks in front of all the if/else statements, so that lines 35 to 53 of the file look like this:
    # Detect Cygwin or MinGW
    #WINDOWS = 0
    #CYGWIN = 0
    #MINGW = 0
    #ifneq (,$(findstring CYGWIN,$(shell uname)))
    #WINDOWS = 1 
    #CYGWIN = 1
    # POSIX memory-mapped files not currently supported on Windows
    #BOWTIE_MM = 0
    #BOWTIE_SHARED_MEM = 0
    #else
    #ifneq (,$(findstring MINGW,$(shell uname)))
    WINDOWS = 1
    MINGW = 1
    # POSIX memory-mapped files not currently supported on Windows
    BOWTIE_MM = 0
    BOWTIE_SHARED_MEM = 0
    #endif
    #endif
    
  6. Now edit the makefile so it points to a correct pthreads library. Edit line 76 so it reads as follows:
    PTHREAD_LIB = -lpthread
  7. Edit the file so that it compiles as 64 bit, change lines 121-132 to the following
    # Convert BITS=?? to a -m flag
    BITS=64
    #ifeq (x86_64,$(shell uname -m))
    BITS=64
    #endif
    BITS_FLAG =
    #ifeq (32,$(BITS))
    #BITS_FLAG = -m32
    #endif
    #ifeq (64,$(BITS))
    BITS_FLAG = -m64
    #endif
    
  8. Go to start->All Programs->MingGW64->MingGW Command prompt
  9. Navigate to the directory with the source code and make file by entering the cd command at the prompt, e.g.
    cd C:\Programs\bowtie2-2.0.6-source\bowtie2-2.0.6 
  10. Type “make” and hit enter.
  11. All done!
  12. Edit: One person had a comment on this, if this doesn’t work you may have to use teh MinGW shell. Edit: It has been pointed out that the BowTie2 team doesn’t use memory mapped files on windows. This might mean large genomes are less performant.