Monthly Archives: October 2013

NuMTs, mtDNA sequencing and Aligners

There are a lot of NuMTs (nuclear copies of mitochondrial DNA) in the genome, so when the mtDNA is sequenced, some reads may align to the nuclear genome instead of the mtDNA. But how much winds up in the nuclear DNA, and where does it go? To answer this, I simulated reads from a diverse collection of mitochondria and tracked where they landed when aligned with bwa mem. The reads were simulated from the whole collection of mtDNA molecules available from phylotree. The simulated reads were 100 bp long, had a 1% error rate, and had insert sizes normally distributed around a mean of 150 bp with a standard deviation of 30 bp (but bounded at a minimum of 40 bp and a maximum of 700 bp); a sketch of how such bounded insert sizes can be drawn is given at the end of this post.

After simulating, I aligned the reads with bwa mem and discovered that almost all of them align to the mtDNA; only 3% of reads aligned elsewhere. As a result, the distribution of coverage depth across the whole genome is very bimodal. Histograms of the coverage-depth distribution at sites with data are shown below.

[Figure: coverage-depth histograms]

For the reads that did align to the nucleus, the MAPQ was typically 0, but could be as high as 60 and had an unexplained peak at 27.

[Figure: MAPQ distribution of reads aligned to the nuclear genome]

The plot below shows the normalized coverage by position across the mtDNA; clearly some regions are more affected by NuMTs than others.

[Figure: normalized coverage by position along the mtDNA]

Reads from the first and last 500 bp of the mtDNA are poorly aligned by bwa, likely because the circular molecule is represented as a linear reference, so reads spanning the junction have nowhere sensible to go. Most appear to end up on chromosome 17, but their true location is given away by their mate pair: only 0.6% of the reads from this region that map to the nuclear DNA do not have their paired read map to the mtDNA. A sketch of this mate-pair check is also given at the end of this post.

[Figure: placement of reads from the first and last 500 bp of the mtDNA]

I also wanted to see how reads representing a heteroplasmic deletion would be handled. I simulated reads that either spanned or fell inside a deletion placed at random in the mtDNA; again, virtually all of them mapped to the mitochondria, and the coverage profile looked similar to the simulation with complete reads.

Perhaps most reassuringly, almost every read is mapped somewhere: checking for unmapped reads turned up only one unaligned read out of the millions simulated, and that read had many errors compared with the original sequence it was simulated from. The result of all of this is one large BED file giving the nuclear locations to which reads of mitochondrial origin can align.
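The post does not show the simulator itself, so the following is only a minimal sketch, in C#, of how insert sizes with the stated parameters could be drawn: a normal deviate with mean 150 and standard deviation 30, generated with the Box-Muller transform and rejected whenever it falls outside the 40-700 bp bounds. The class and method names are illustrative, not from the original code.

    using System;

    // Minimal sketch (not the post's actual simulator): draw insert sizes from a
    // normal distribution with mean 150 bp and standard deviation 30 bp, rejecting
    // any draw outside the stated bounds of 40 and 700 bp.
    static class InsertSizeSampler
    {
        static readonly Random Rng = new Random();

        // Standard normal deviate via the Box-Muller transform.
        static double NextGaussian()
        {
            double u1 = 1.0 - Rng.NextDouble();   // avoid log(0)
            double u2 = Rng.NextDouble();
            return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
        }

        public static int NextInsertSize(double mean = 150, double stdDev = 30,
                                         int min = 40, int max = 700)
        {
            while (true)
            {
                int size = (int)Math.Round(mean + stdDev * NextGaussian());
                if (size >= min && size <= max)
                    return size;   // keep only draws inside the allowed range
            }
        }
    }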
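The mate-pair figure above can be approximated with a small amount of code run over the text output of samtools view. This is a sketch under assumptions, not the analysis used for the post: it ignores the restriction to reads simulated from the first and last 500 bp (which would require the simulated truth positions), and it assumes the mitochondrial contig is named MT (it is chrM in some references).

    using System;

    // Sketch: among reads aligned to the nuclear genome, count how many have a
    // mate aligned to the mtDNA. Reads SAM text on stdin (e.g. piped from
    // "samtools view aln.bam"). Column 3 is RNAME, column 7 is RNEXT; "=" in
    // RNEXT means "same contig as RNAME". "MT" is an assumed contig name.
    static class MatePairCheck
    {
        static void Main()
        {
            long nuclear = 0, nuclearWithMtMate = 0;
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                if (line.StartsWith("@")) continue;        // skip SAM header lines
                var f = line.Split('\t');
                string rname = f[2];                        // where this read mapped
                string rnext = f[6] == "=" ? rname : f[6];  // where its mate mapped
                if (rname != "MT" && rname != "*")          // read landed off the mtDNA
                {
                    nuclear++;
                    if (rnext == "MT") nuclearWithMtMate++;
                }
            }
            Console.WriteLine("{0} of {1} nuclear-mapped reads have a mate on the mtDNA",
                              nuclearWithMtMate, nuclear);
        }
    }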

The .NET Bio BAM Parser is Smoking Fast

The .NET Bio library has an improved version of its BAM file parser, which makes it significantly faster and easily competitive with samtools, the current C-coded standard, for getting sequencing data out of files and working with it. The chart below compares the time, in seconds, for the old version of the parser and the current version to parse a 76 MB BAM file. The current parser can easily create ~150K validated sequence objects per second on the clunky old servers I typically run code on. Note that the Windows and Unix numbers are from completely different machines and are not comparable. Also included is a comparison to a "custom" version of the parser that I wrote, which uses unsafe code, assumes the system architecture is always little endian, caches strings, and does some other tricks to gain further performance and reduce memory usage.

[Figure: BAM parse time, old parser vs. current parser vs. custom parser]

The comparison to samtools is based on the system time taken to parse the file with the following command on the same Unix server used for the C# tests:

samtools view Test.bam | wc -l

This is only meant to give a general idea of the performance comparison, as there are several important differences between the samtools and .NET Bio tests. The C# version was not allowed to reuse memory for objects, since it is supposed to act as a data producer, while the samtools version processes reads one at a time and does reuse memory. The C# version also built a lot of dictionaries to give quick access to the read groups, which samtools does not do. However, samtools had to write the reads to the output pipe, while the C# version did not, which undoubtedly introduces a fair bit of overhead for it. Both tools, however, are clearly plenty fast; at this stage further performance improvements would come from lazy evaluation (or from not sticking unnecessary information, like the original quality scores, into BAM files!), and the language won't matter much.

Performance Comments

One task when parsing BAMs is unpacking lots of information that is packed together in arrays. In samtools and the current .NET Bio parser, this is done with lots of bit-level unpacking by integer manipulation, in the shift-and-mask style sketched at the end of this post. Because C# has pointers and value-type structs, however, I discovered that it is a lot more fun to just define a structure containing those fields and unpack them directly with a pointer cast. Blam! Now all the data related to the position, bin, and read group is in the object, with a few lines that copy the data very fast (a sketch of this approach is also given at the end of this post).

So where are the remaining bottlenecks? On Windows, about a third of the time is spent doing the decompression. In Mono, because the decompression is done by zlib and not in managed code, it's effectively free. Currently, the quality data and sequence data get copied around a fair bit, and the code could likely be made about 10% faster by reusing a single byte array each time instead of copying. However, it is so fast that this hardly seems worth worrying about.
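The samtools snippet from the original post is not reproduced here, but the shift-and-mask style it illustrated can be sketched from the published BAM specification, which packs several fixed fields of each alignment record into two little-endian 32-bit words. The C# below is an illustration of that style, not code from samtools or .NET Bio; the buffer is assumed to hold one alignment record starting at the refID field.

    using System;

    // Sketch of shift-and-mask unpacking, following the BAM spec's packed words:
    //   bin_mq_nl = bin << 16 | mapq << 8 | l_read_name   (bytes 8-11 of the record)
    //   flag_nc   = flag << 16 | n_cigar_op               (bytes 12-15 of the record)
    static class PackedFields
    {
        public static void Unpack(byte[] block,
            out int bin, out int mapq, out int readNameLength,
            out int flag, out int cigarOpCount)
        {
            uint binMqNl = BitConverter.ToUInt32(block, 8);
            uint flagNc  = BitConverter.ToUInt32(block, 12);

            bin            = (int)(binMqNl >> 16);
            mapq           = (int)((binMqNl >> 8) & 0xFF);
            readNameLength = (int)(binMqNl & 0xFF);
            flag           = (int)(flagNc >> 16);
            cigarOpCount   = (int)(flagNc & 0xFFFF);
        }
    }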
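And here is a sketch of the struct-plus-pointer-cast alternative described above. It is not the actual .NET Bio code, just an illustration of the technique: a sequential-layout value type mirroring the fixed 32-byte prefix of a BAM alignment record, filled by a single cast of a pinned byte pointer. It needs the /unsafe compiler switch and assumes a little-endian machine.

    using System.Runtime.InteropServices;

    // Value type mirroring the fixed-size prefix of a BAM alignment record.
    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    struct BamFixedFields
    {
        public int RefID;        // reference sequence id
        public int Pos;          // 0-based leftmost position
        public uint BinMqNl;     // bin<<16 | mapq<<8 | l_read_name
        public uint FlagNc;      // flag<<16 | n_cigar_op
        public int SeqLength;    // l_seq
        public int NextRefID;    // mate's reference id
        public int NextPos;      // mate's position
        public int TemplateLen;  // observed template length
    }

    static class StructCast
    {
        // Pin the buffer and pull all eight fields out with one pointer cast,
        // instead of calling BitConverter once per field.
        public static unsafe BamFixedFields Read(byte[] block)
        {
            fixed (byte* p = block)
            {
                return *(BamFixedFields*)p;
            }
        }
    }

Fields like the mapping quality still need a shift or two afterwards, but the copy out of the buffer happens in one go.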