{"id":188,"date":"2013-10-12T21:34:30","date_gmt":"2013-10-12T21:34:30","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=188"},"modified":"2014-07-09T16:42:21","modified_gmt":"2014-07-09T16:42:21","slug":"the-net-bio-bam-parser-is-smoking-fast","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=188","title":{"rendered":"The .NET Bio BAM Parser is Smoking Fast"},"content":{"rendered":"The .NET Bio library has an improved version of its BAM file\u00a0parser, which is significantly faster than the old one and easily competitive with the\u00a0current standard, the C-coded <a href=\"http:\/\/samtools.sourceforge.net\/\">SAMtools<\/a>, for obtaining\u00a0sequencing data and working with it. The chart below compares the time in seconds that the old and current versions of the parser take to\u00a0parse a 76 MB BAM file. The current parser\u00a0can easily create ~150K validated sequence objects per second on the clunky old\u00a0servers I typically run code on. Note that the Windows and Unix numbers\u00a0are from completely different machines and are not comparable. 
Also included is a comparison to a\u00a0&#8220;custom&#8221; version of the parser that I wrote, which uses unsafe code, assumes the\u00a0system architecture is always little-endian, caches strings, and uses a few other\u00a0tricks to squeeze out further performance and reduce memory usage.\r\n\r\n<a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif\"><img data-attachment-id=\"192\" data-permalink=\"http:\/\/evolvedmicrobe.com\/blogs\/?attachment_id=192\" data-orig-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?fit=816%2C491\" data-orig-size=\"816,491\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"img5\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?fit=300%2C180\" data-large-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?fit=625%2C376\" loading=\"lazy\" class=\"aligncenter size-full wp-image-192\" alt=\"img5\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?resize=625%2C376\" width=\"625\" height=\"376\" data-recalc-dims=\"1\" \/><\/a>\r\n\r\nThe comparison to SAMtools is based on the system time to parse the file with the following command on the same Unix server used for the C# tests.\r\n\r\n<code>samtools view Test.bam | wc -l<\/code>\r\n\r\nThis is just meant to give a general idea of the performance comparison, as\u00a0there are several important differences between the SAMtools and .NET Bio 
test.\u00a0The C# version was not allowed to reuse memory for objects, as it was supposed to\u00a0be working as a data producer, while the SAMtools version processes reads one at\u00a0a time and does reuse memory. The C# version also built a lot of dictionaries to aid\u00a0quick access to the read groups, which SAMtools doesn&#8217;t do. However, SAMtools had to write the records to the output pipe, while the C# version did not, which undoubtedly introduces a\u00a0fair bit of overhead for it. Both tools, however, are clearly plenty fast,\u00a0and at this stage further performance improvements would come from lazy\u00a0evaluation (or not sticking unnecessary information like the original quality\u00a0scores in the BAM files!), and the language won&#8217;t matter much.\r\n\r\n<strong>Performance Comments<\/strong>\r\n\r\nOne task when parsing BAMs is unpacking lots of information that is packed\u00a0together in arrays.\u00a0 In SAMtools and the current .NET Bio parser, this is\u00a0done with lots of bit unpacking by integer manipulation.\u00a0 For example,\u00a0this code from SAMtools:\r\n<pre class=\"brush: c\">\t\r\n\tuint32_t x[8], block_len = data_len + BAM_CORE_SIZE, y;\r\n\tint i;\r\n\tassert(BAM_CORE_SIZE == 32);\r\n\tx[0] = c->tid;\r\n\tx[1] = c->pos;\r\n\tx[2] = (uint32_t)c->bin&lt;&lt;16 | c->qual&lt;&lt;8 | c->l_qname;\r\n\tx[3] = (uint32_t)c->flag&lt;&lt;16 | c->n_cigar;\r\n\tx[4] = c->l_qseq;\r\n\tx[5] = c->mtid;\r\n\tx[6] = c->mpos;\r\n\tx[7] = c->isize;<\/pre>\r\nBecause C# has pointers and value-type structs, however, I discovered that it is a lot more fun to define a structure that contains those fields and unpack directly with a pointer cast.\r\n\r\n<pre class=\"brush: csharp\">\t\r\n\t\/\/ Pin the buffer and reinterpret its leading bytes as the struct\r\n\tAlignmentData ad;\r\n\tfixed (byte* alblck = alignmentBlock)\r\n\t{ ad = *((AlignmentData*)alblck); }\r\n<\/pre>\r\n\r\nBlam! 
Now all the data related to the position, bin, and read group is in the struct, thanks to three lines that copy the data very quickly.\r\n\r\nSo where are the remaining bottlenecks? On Windows, about a third of the time is spent on decompression. In Mono, because the decompression is done by zlib and not in managed code, it&#8217;s effectively free.\u00a0Currently, the quality data and sequence data are passed around a bunch, and the\u00a0code could likely be made about 10% faster by not copying that data but reusing a single byte array each time. However, it is so fast it hardly seems worth worrying about.","protected":false},"excerpt":{"rendered":"The .NET Bio library has an improved version of its BAM file\u00a0parser, which is significantly faster than the old one and easily competitive with the\u00a0current standard, the C-coded SAMtools, for obtaining\u00a0sequencing data and working with it. The chart below compares the time in seconds that the old and current versions of the parser take to\u00a0parse a [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[18,14,8,3,1],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":71,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=71","url_meta":{"origin":188,"position":0},"title":"Java vs. C# Performance Comparison for Parsing VCF Files","date":"May 26, 2013","format":false,"excerpt":"Making a comparison with a reasonably complex program ported between the two languages. 
Update 3\/10\/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.\u00a0\u00a0After this, the Java\/C# versions are indistinguishable on speed, but\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/05\/image_thumb1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":188,"position":1},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":112,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","url_meta":{"origin":188,"position":2},"title":"Mono.Simd and the Mandlebrot Set.","date":"September 10, 2013","format":false,"excerpt":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. 
This is despite the fact that the C# compiler often gets as much information about a method as the\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":153,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=153","url_meta":{"origin":188,"position":3},"title":"Using Selectome with .NET Bio, F# and R","date":"September 16, 2013","format":false,"excerpt":"The Bio.Selectome namespace has features to query\u00a0Selectome.Selectome is a database that merges data from Ensembl\u00a0and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN\/dS)\u00a0mutations along various branches of the phylogenetic tree. A low dN\/dS ratio\u00a0indicates that the protein sequence is under strong selective constraint, while\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":299,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=299","url_meta":{"origin":188,"position":4},"title":"C# vs. Java, Xamarin vs. Oracle, Performance Comparison version 2.0","date":"June 14, 2014","format":false,"excerpt":"Today I noticed the SIMD implementation of the Mandelbrot set algorithm I blogged about last year was successfully submitted to the language shootout webpage. However, I was a bit disappointed to see the C# version was still slower than the Java version, despite my use of the special SIMD instructions\u2026","rel":"","context":"Similar post","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=12","url_meta":{"origin":188,"position":5},"title":"Compile Bowtie2 on Windows 64 bit.","date":"January 30, 2013","format":false,"excerpt":"Bowtie 2 is a program that efficiently aligns next generation sequence data to a reference genome. 
However, the version distributed by the authors only compiles on POSIX platforms. These instructions will allow you to compile it on windows by downloading the Mingw64 tools and editing the make file before building\u2026","rel":"","context":"In &quot;Computing&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/Capture.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/188"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=188"}],"version-history":[{"count":18,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/188\/revisions"}],"predecessor-version":[{"id":222,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/188\/revisions\/222"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=188"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}