bwa mem hg19.fna f.fq r.fq > simulatedData.sam
samtools view -f 4 Simulated.bam
samtools view Test.bam | wc -l
This is only meant to give a general idea of the performance comparison, as there are several important differences between the samtools and .NET Bio tests. The C# version was not allowed to reuse memory for objects, as it was supposed to be working as a data producer, while the samtools version processes reads one at a time and does reuse memory. The C# version also built a lot of dictionaries to aid quick access to the read groups, which samtools does not do. However, samtools had to write its output to the pipe, while the C# version did not, which undoubtedly adds a fair bit of overhead for samtools. Both tools, however, are clearly plenty fast, and at this stage further performance improvements would come from lazy evaluation (or from not sticking unnecessary information like the original quality scores in the BAM files!), so the language won’t matter much.
Performance Comments
One task when parsing BAMs is unpacking lots of information that is packed together in arrays. In SAMtools and the current .NET Bio parser, this is done by pulling bits apart with integer manipulations. For example, this code from SAMtools:
uint32_t x[8], block_len = data_len + BAM_CORE_SIZE, y;
int i;
assert(BAM_CORE_SIZE == 32);
x[0] = c->tid;
x[1] = c->pos;
x[2] = (uint32_t)c->bin<<16 | c->qual<<8 | c->l_qname;
x[3] = (uint32_t)c->flag<<16 | c->n_cigar;
x[4] = c->l_qseq;
x[5] = c->mtid;
x[6] = c->mpos;
x[7] = c->isize;
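Reading a record goes the other way: the parser has to pull the individual fields back out of the packed words. Here is a minimal sketch of that step in C. It is not the SAMtools code itself. It assumes the bin sits in the top 16 bits of x[2], with the mapping quality in the next 8 bits and the read-name length in the low 8, and that the flag and the CIGAR count share x[3]. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the fields packed into words 2 and 3. */
typedef struct {
    uint16_t bin;      /* top 16 bits of x[2] */
    uint8_t  qual;     /* mapping quality, bits 8..15 of x[2] */
    uint8_t  l_qname;  /* read-name length, low 8 bits of x[2] */
    uint16_t flag;     /* top 16 bits of x[3] */
    uint16_t n_cigar;  /* low 16 bits of x[3] */
} packed_fields_t;

/* Undo the packing with shifts and masks.
   x holds the eight 32-bit core words, already in host byte order. */
static packed_fields_t unpack_core_words(const uint32_t x[8])
{
    packed_fields_t f;
    assert(x != NULL);
    f.bin     = (uint16_t)(x[2] >> 16);
    f.qual    = (uint8_t)((x[2] >> 8) & 0xff);
    f.l_qname = (uint8_t)(x[2] & 0xff);
    f.flag    = (uint16_t)(x[3] >> 16);
    f.n_cigar = (uint16_t)(x[3] & 0xffff);
    return f;
}
```

Every field costs a shift, a mask, or both, which is the kind of bookkeeping a struct overlay avoids.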
AlignmentData ad;
fixed (byte* alblck = alignmentBlock)
{ ad = *((AlignmentData*)alblck); }
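The same idea can be sketched in C: if a struct's members come in the same order and with the same widths as the bytes on disk, the whole 32-byte core can be copied in one step and no shifting is needed. The struct below is a guess at such a layout based on the field order in the BAM format, not the actual .NET Bio AlignmentData definition. It assumes a little-endian machine (BAM stores its integers little-endian) and a compiler that adds no padding to this struct. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical overlay struct: one member per on-disk field, in file order. */
typedef struct {
    int32_t  tid;      /* reference sequence id */
    int32_t  pos;      /* 0-based leftmost position */
    uint8_t  l_qname;  /* length of the read name */
    uint8_t  qual;     /* mapping quality */
    uint16_t bin;      /* index bin */
    uint16_t n_cigar;  /* number of CIGAR operations */
    uint16_t flag;     /* bitwise flag */
    int32_t  l_qseq;   /* read length */
    int32_t  mtid;     /* mate reference id */
    int32_t  mpos;     /* mate position */
    int32_t  isize;    /* insert size */
} alignment_core_t;

/* Copy the first 32 bytes of an alignment block straight into the struct.
   memcpy sidesteps the alignment and aliasing problems that a raw
   pointer cast can run into in C. */
static alignment_core_t read_core(const uint8_t *block)
{
    alignment_core_t core;
    assert(sizeof core == 32);  /* layout assumption: no padding */
    memcpy(&core, block, sizeof core);
    return core;
}
```

On a big-endian machine the multi-byte members would come out byte-swapped, so this shortcut only pays off where the host byte order matches the file.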