The .NET Bio library has an improved version of its BAM file parser, which is significantly faster and now easily competitive with SAMtools, the current standard written in C, for obtaining sequencing data and working with it. The chart below compares the time in seconds that the old and current versions of the parser take to parse a 76 MB BAM file. The current parser can easily create ~150K validated sequence objects per second on the clunky old servers I typically run code on. Note that the Windows and Unix numbers come from completely different machines and are not comparable. Also included is a “custom” version of the parser that I wrote, which uses unsafe code, assumes the system architecture is always little endian, caches strings, and does some other tricks to gain further performance and reduce memory usage.
The comparison to SAMtools is based on the system time needed to parse the file with the following command on the same Unix server used for the C# tests:
samtools view Test.bam | wc -l
This is only meant to give a general idea of the relative performance, as there are several important differences between the SAMtools and .NET Bio tests. The C# version was not allowed to reuse memory for objects, since it was supposed to work as a data producer, while SAMtools processes reads one at a time and does reuse memory. The C# version also built a lot of dictionaries to aid quick access to the read groups, which SAMtools does not do. However, SAMtools had to write its output to the pipe while the C# version did not, which undoubtedly adds a fair bit of overhead. Both tools are clearly plenty fast, though, and at this stage further performance improvements would come from lazy evaluation (or from not sticking unnecessary information like the original quality scores in the BAM files!), so the language won’t matter much.
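To make the lazy evaluation idea concrete, here is a minimal sketch rather than anything from .NET Bio itself. The class `LazyRead` and its members are hypothetical names. It assumes the sequence is stored in the BAM 4-bit packed encoding (two bases per byte, high nibble first) and postpones building the string until someone asks for it.

```csharp
using System;
using System.Text;

// Hypothetical illustration: keeps the packed bytes and decodes them on first access only.
public sealed class LazyRead
{
    // Nibble-to-base lookup used by the BAM 4-bit sequence encoding.
    private const string Bases = "=ACMGRSVTWYHKDBN";
    private readonly Lazy<string> sequence;

    public LazyRead(byte[] packedSequence, int length)
    {
        // The lambda does not run until Sequence is read for the first time.
        sequence = new Lazy<string>(() => Decode(packedSequence, length));
    }

    public bool IsDecoded => sequence.IsValueCreated;

    public string Sequence => sequence.Value;

    private static string Decode(byte[] packed, int length)
    {
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            int b = packed[i / 2];
            // Even positions live in the high nibble, odd positions in the low nibble.
            int code = (i % 2 == 0) ? (b >> 4) : (b & 0x0F);
            sb.Append(Bases[code]);
        }
        return sb.ToString();
    }
}
```

A caller that only filters on position or flags never touches `Sequence`, so it never pays for the decode. The cost is that the packed bytes must stay alive for as long as the read object does.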
Performance Comments
One task when parsing BAMs is unpacking lots of information that is packed together in arrays. In SAMtools and the current .NET Bio parser, this is done with a lot of bit shifting and masking on integers. For example, this code from SAMtools:
    uint32_t x[8], block_len = data_len + BAM_CORE_SIZE, y;
    int i;
    assert(BAM_CORE_SIZE == 32);
    x[0] = c->tid;
    x[1] = c->pos;
    x[2] = (uint32_t)c->bin<<16 | c->qual<<8 | c->l_qname;
    x[3] = (uint32_t)c->flag<<16 | c->n_cigar;
    x[4] = c->l_qseq;
    x[5] = c->mtid;
    x[6] = c->mpos;
    x[7] = c->isize;
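Going the other way, the packed words can be pulled apart again with shifts and masks. The following is a minimal C# sketch of that approach, not the actual .NET Bio code. `PackedFieldDemo` and its method names are made up for illustration, and the field widths (16-bit bin, 8-bit mapping quality, 8-bit name length, 16-bit flag, 16-bit CIGAR count) are taken from the BAM record layout.

```csharp
using System;

// Hypothetical helper showing shift-and-mask unpacking of the two packed core words.
public static class PackedFieldDemo
{
    // Assembles a little-endian 32-bit word byte by byte, so it works on any host.
    public static uint ReadUInt32LittleEndian(byte[] buffer, int offset)
    {
        return (uint)buffer[offset]
             | ((uint)buffer[offset + 1] << 8)
             | ((uint)buffer[offset + 2] << 16)
             | ((uint)buffer[offset + 3] << 24);
    }

    // Word layout: bin << 16 | mapping quality << 8 | read name length.
    public static (int Bin, int MapQ, int NameLength) UnpackBinMqNl(uint word)
    {
        int bin = (int)(word >> 16);           // upper 16 bits
        int mapQ = (int)((word >> 8) & 0xFF);  // middle 8 bits
        int nameLength = (int)(word & 0xFF);   // lowest 8 bits
        return (bin, mapQ, nameLength);
    }

    // Word layout: flag << 16 | number of CIGAR operations.
    public static (int Flag, int CigarOps) UnpackFlagNc(uint word)
    {
        return ((int)(word >> 16), (int)(word & 0xFFFF));
    }
}
```

This version makes no assumption about the host byte order, which is the main thing the pointer-based shortcut gives up.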
Because C# has pointers and value-type structs, however, I discovered that it is a lot more fun to define a structure that contains those fields and unpack directly with a pointer cast.
    AlignmentData ad;
    fixed (byte* alblck = alignmentBlock)
    { ad = *((AlignmentData*)alblck); }
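The snippet does not show the struct itself, so here is one way it might be declared. This is a sketch under assumptions, not the definition used in the parser. The field names are invented, the byte array is assumed to start at the reference ID (after the block-size field), and the host is assumed to be little endian. It needs to be compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layout mirroring the 32-byte BAM alignment core on a little-endian machine.
// Pack = 1 and sequential layout keep the fields in exactly this order with no padding.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct AlignmentData
{
    public int RefId;
    public int Position;
    public byte ReadNameLength;
    public byte MapQ;
    public ushort Bin;
    public ushort CigarOpCount;
    public ushort Flag;
    public int SequenceLength;
    public int MateRefId;
    public int MatePosition;
    public int TemplateLength;
}

public static class AlignmentBlockReader
{
    // Copies the first 32 bytes of the block straight into the struct.
    public static unsafe AlignmentData ReadCore(byte[] alignmentBlock)
    {
        if (alignmentBlock == null || alignmentBlock.Length < sizeof(AlignmentData))
            throw new ArgumentException("Block is shorter than the alignment core.", nameof(alignmentBlock));
        if (!BitConverter.IsLittleEndian)
            throw new PlatformNotSupportedException("This shortcut assumes a little-endian host.");

        fixed (byte* p = alignmentBlock)
        {
            return *(AlignmentData*)p;
        }
    }
}
```

Splitting the packed words into separate `byte` and `ushort` fields means no shifting is needed afterwards, but the field order only matches the file on little-endian hardware, hence the explicit check.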
Blam! With those three lines, all the data related to the position, bin, read group and so on is in the object, and the copy is very fast.
So where are the remaining bottlenecks? On Windows, about a third of the time is spent on decompression. In Mono, because the decompression is done by zlib rather than in managed code, it’s effectively free. Currently, the quality and sequence data are copied several times as they are passed around, and the code could likely be made about 10% faster by reusing a single byte array instead of copying. However, it is already so fast that it hardly seems worth worrying about.
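For what the single-array idea might look like, here is a rough sketch. `ReusableBlockBuffer` is a hypothetical class, not part of .NET Bio. It assumes each record's length is known before the read and that the caller is finished with the previous record before asking for the next one.

```csharp
using System;
using System.IO;

// Hypothetical buffer that is reused for every record and grows only when a record is larger.
public sealed class ReusableBlockBuffer
{
    private byte[] buffer = new byte[1024];

    // Reads exactly `length` bytes from the stream into the shared array.
    // The returned segment is only valid until the next call to Fill.
    public ArraySegment<byte> Fill(Stream source, int length)
    {
        if (length > buffer.Length)
            buffer = new byte[Math.Max(length, buffer.Length * 2)];

        int read = 0;
        while (read < length)
        {
            // Stream.Read may return fewer bytes than requested, so loop until done.
            int n = source.Read(buffer, read, length - read);
            if (n == 0)
                throw new EndOfStreamException("Stream ended before the record was complete.");
            read += n;
        }
        return new ArraySegment<byte>(buffer, 0, length);
    }
}
```

The catch is the same one that separates the two tests earlier in the post: a parser acting as a data producer hands out objects that outlive the next read, so anything it keeps must still be copied out of the shared array.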