There are a lot of NuMTs (nuclear encoded mitochondrial sequences) in the genome, and when the mtDNA is sequenced, so reads may align to the nuclear genome instead of the mtDNA because of this. But how much winds up in the nuclear DNA and where does it go? To answer this, I simulated reads from a diverse collection
of mitochodria, and tracked where they landed when aligned with bwa mem.
The reads were simulated from the whole collection of mtDNA molecules available from phylotree, and the simulated reads were 100 bp in length, have a 1% error rate, and an insert size normally distributed around a mean of 150 bp with a std. dev. of 30 (but bounded at a minimum of 40 insert and max of 700).
After simulating, I then aligned with.
bwa mem hg19.fna f.fq r.fq > simulatedData.sam
And discovered that almost all reads align to the mtDNA, only 3% of reads aligned elsewhere. As a result, the distribution of coverage depth across the whole genome is very bi-modal. Histograms showing the coverage depth distrbution of sites with data is shown below.
For reads that did align to the nucleus, the MAPQ was typically 0, but could be as high as 60 and had an unexplained peak at 27.
And below shows the normalized coverage by positions across the mtDNA, clearly some regions are more affected by NuMTs.
Reads from the first and last 500 bp of the mtDNA are poorly aligned by bwa. It appears most go to chromosome 17, but their true location is
belied by their mate pair. In fact only 0.6% of reads in this region that
map to the nuclear DNA do not have their paired read map to the mtDNA.
I also wanted to see how reads that represent a heteroplasmic deletion would be handled. I simulated reads that either spanned or included a deletion randomly chosen to be in the mtDNA, again virtually all mapped to the mitochondria, and the coverage profile looked similar to the simulation with complete reads. Perhaps most reassuringly, almost all reads are mapped. Checking for unmapped reads with the command:
samtools view -f 4 Simulated.bam
Showed only one un-aligned read out of the simulated millions, and this read had many errors compared with the original sequence it was simulated from. The result of all of this is one large BedFile giving the location of all possible reads from elsewhere.