Category Archives: Bioinformatics

The .NET Bio BAM Parser is Smoking Fast

The .NET Bio library has an improved version of it’s BAM file parser, which makes it significantly faster and easily competitive with the current standard C coded SAMTools for obtaining sequencing data and working with it. The chart below compares the time it takes in seconds for the old version of the parser and the current version to parse a 76 MB BAM file. The current parser can easily create ~150K validated sequence objects per second on the clunky old servers I typically run code on. Note that the windows and unix numbers are from completely different machines and not comparable. Also included is a comparison to a “custom” version of the parser that I wrote, which uses unsafe code, assumes the system architecture is always little endian, caches strings and does some other tricks to get some further performance improvements and reduce memory usage. img5 The comparison to samtools is based on the system time to parse the file using this task on the same unix server used for the C# tests. samtools view Test.bam | wc -l And is just meant to give a general idea of the performance comparison, as there are several important differences between the samtools and .NET Bio test. The C# version was not allowed to reuse memory for objects as it was supposed to be working as a data producer, while the Samtools version processes reads one at a time and does reuse memory. C# also made a lot of dictionaries to aid quick access to the read groups, which isn’t done by samtools. However, samtools had to write the files to the output pipe, while the C# version did not, which undoubtably introduces a fair bit of overhead for it. Both tools however, are clearly plenty fast and at this stage further performance improvements would come from lazy evaluation (or not sticking unnecessary information like the original quality scores in the BAM files!), and the language won’t matter much. Performance Comments One task when parsing BAMs is unpacking lots of information that is packed together in arrays.  In SAMtools and the current .NET Bio parser, this is done with lots of unpacking of bits by integer manipulations.  For example this code from SAMTools: Because C# has pointers and value-type structs however, I discovered that it is a lot more fun just to define a structure that contains those fields and unpack directly with a pointer cast in C#. Blam! Now all the data related to the position, bin read group is in the object with those three lines that copy the data very fast. So where are the bottlenecks remaining? On windows about a third of the time is spent doing the decompression. In Mono, because the decompression is done by zlib and not in managed code, it’s effectively free. Currently, the quality data and sequence data are passed around a bunch, and the code could likely be made about 10% faster by not copying that data but reusing a single byte array each time. However, it is so fast it hardly seems worth worrying about.

Using Selectome with .NET Bio, F# and R

The Bio.Selectome namespace has features to query Selectome.Selectome is a database that merges data from Ensembl and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN/dS) mutations along various branches of the phylogenetic tree. A low dN/dS ratio indicates that the protein sequence is under strong selective constraint, while a high one indicates that selective constraint is more relaxed. Selectome is also a fantastic resource to get gene trees and multiple sequence alignments. Using selectome and .NET Bio allows you to quickly investigate divergence across the vertebrate phylogeny. This page gives a walk of how to query selectome, convert the text data it returns in to objects, and the compute and plot various quantities from those objects.

Example Walk Through.


Step 0: Setup F#–If you haven’t used F# before, you can download Tsunami. Open the program and add references to Bio.Selectome using the #r command and open the relevant namespaces. (highlight and hit Alt+Enter to run in the console).


Step 1: Make Gene List – Selectome requires ensembl identifiers be used in queries.To create a set of interesting genes, I first downloaded the full set of genes from the MitoCarta website. These genes are identified by Entrez IDs, while selectome uses Ensembl IDs, so to convert between these I used the GeneID converter website to create a flatfile of the new ids. Given this flatfile, we can load it and convert it to typed classes as follows:
Step 2: Query Selectome –  All selectome data is accessed through the SelectomeDataFetcher class. This class will return a SelelctomeQueryResult that will let you know if the query was successful. Currently, the queries will only be successful for genes that exist in the database and have data available for the full vertebrate tree. If no data is available the Result will be NoResultsFound, if selectome returned data but there was no tree available for all vertebrates(but maybe just primates) the result will be NoVeterbrateTreeDataFound. We want to extract genes from query results that successfully returned data for the vertebrate tree.
Step 3: Determine how many genes show positive selection – F# makes this easy: Interestingly, roughly 33% of genes show selection, so we know not to get too excited by any one result!

Step 4: Download Multiple Sequence Alignments  – In order to decide how conserved a protein is relative to other proteins, we can download the multiple sequence alignment for each protein in this set and compare it to a particular protein of interest.  In Selectome, each protein comes with a masked and unmasked alignment for both proteins and DNA. These objects are available from the SelectomeGene class and are lazily downloaded when requested from the Selectome server.  These sequence alignment downloads are also cached for 30 days in a temporary directory to avoid waiting for downloads if you want to reload your script of interest.  Once downloaded they are converted to full-fledged .NET BIO multiple sequence alignments, meaning one can do nearly anything with them. The example below gets the DNA alignment and the BLOSUM90 alignment score for the masked amino acid alignments.


Step 5: Download the Gene Trees – The selectome gene defines a class, SelectomeTree, that provides a great set of functions to query all the interesting metadata provided by selectome. These features are most usefully investigated by using the autocomplete functionality of your editor, but there is a lot of useful information! Some examples are also shown below.

Tree queries are also cached locally to avoid going back to the server in the event of repeated requests.
Step 6: Plot distribution of interest using the R data provider – You can call R plotting functions directly from F# using the R type provider. More information is available from that site, but the code snippet below is sufficient to produce a histogram of alignment scores, no need to dump to a flat file first! Huzzah! One intermediate machine to rule them all (or at least to avoid useless glue between different libraries/APIs).

Accessing dbSNP with C# and the .NET Platform

NCBI Entrez can be accessed with many different platforms (python, R, etc.) , but I find .NET one of the best because the static typing makes it easy to infer what all the datafields mean, and navigate the data with much greater ease.

Documentation is sparse for this task, but here is how to access NCBI from the .NET platform.  The example steps/program show how to query dbSNP for information about particular ids.

  1. Open visual studio, start a new project and add two Service References to the project: http://eutils.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl and http://eutils.ncbi.nlm.nih.gov/soap/v2.0/efetch_snp.wsdl files. Note that the efetch depends on the database used, in this case “snp” other databases have different ones.  You should now have two references:image

More information is available here on setting up visual studio: http://www.ncbi.nlm.nih.gov/books/NBK55694/

2. Next up grab the data as shown in the example.  Each SNP has A LOT more data, which can be inspected in the IDE.

   1:  static void Main(string[] args)
   2:  {        
   3:          string dbSNPid="28357684";
   4:          efetch.eFetchRequest re = new efetch.eFetchRequest();
   5:          re.id = dbSNPid;
   6:          var serv=new efetch.eUtilsServiceSoapClient();
   7:          var exchange = serv.run_eFetch(re);
   8:          var snpData = exchange.ExchangeSet.Rs[0];
   9:          object[] dataToReport=new object[] {
  10:              snpData.Het.value,
  11:              snpData.hgvs.First(),
  12:              snpData.PrimarySequence.First().accession,
  13:          };
  14:          Console.WriteLine( String.Join("\t",dataToReport.Select(x=>x.ToString()).ToArray())); 
  15:          Console.ReadLine();
  16:  }

The following links contain other useful information for consuming the entrez webservice in C#.

Setting up visual studio: http://www.ncbi.nlm.nih.gov/books/NBK55694/

Using efetch: http://www.ncbi.nlm.nih.gov/books/NBK55694/#csharp_msvs.Call_HYPERLINK__chaptercha_8

Forming queries: http://www.biostars.org/p/3436/

More information: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html

Also note that the first query takes much longer than subsequent ones, for reasons unknown to me at present.