January 23, 2020
Title: Algorithms for Mitochondrial Genome Assembly and Haplogroup Assignment from Low-Coverage Whole-Genome Sequencing Data
Ph.D. Candidate: Fahad Alqahtani
Major Advisor: Dr. Ion Mandoiu
Associate Advisors: Dr. Mukul Bansal, Dr. Derek Aguiar
Date/Time: Thursday, January 23rd, 2020, 1:00 - 2:00 P.M.
Location: HBL Instruction 2119A (formerly Video Theater 2)
Mitochondria are cellular organelles present, with very rare exceptions, in all eukaryotic cells. In most animals, the mitochondria have their own genome, a double-stranded circular DNA molecule typically 15-20 kb in size that encodes 37 genes. The mitochondrial genome is inherited maternally and has a much higher copy number than the nuclear genome. The small size, high copy number, and the presence of both coding and regulatory regions that mutate at different rates make the mitochondrial genome an ideal genetic marker. Indeed, mitochondrial sequences have been used in applications ranging from maternal ancestry inference and tracing human migrations to forensic analysis. Mitochondrial DNA has also become the workhorse of biodiversity studies, since many non-model species do not yet have a sequenced nuclear genome. Next-generation sequencing technologies make it possible to quickly and inexpensively generate large numbers of relatively short reads from both the nuclear and mitochondrial DNA contained in a biological sample. Unfortunately, assembling such whole-genome sequencing (WGS) data with standard de novo assemblers often fails to generate high-quality mitochondrial genome sequences because of the large difference in copy number between the mitochondrial and nuclear genomes. Assembly of complete mitochondrial genome sequences is further complicated by the fact that many de novo assemblers are not designed for circular genomes, and by the presence of repeats in the mitochondrial genomes of some species.
This thesis presents several novel bioinformatic tools enabling highly accurate mitochondrial genome reconstruction from low-coverage WGS data. First, we describe the Statistical Mitogenome Assembly with RepeaTs (SMART) pipeline, which uses a seed sequence to estimate the distribution of mtDNA k-mer counts, then positively selects reads with k-mer counts matching this distribution before performing de novo assembly. Contigs produced by an initial assembly step are filtered using BLAST searches against a comprehensive mitochondrial genome database and then used as "baits" for an alignment-based filter that produces the set of reads used in a second de novo assembly and scaffolding step. In the presence of repeats, the possible paths through the assembly graph are evaluated using a maximum-likelihood model. Additionally, the assembly process is repeated a user-specified number of times on re-sampled subsets of reads, and the assembled sequences with the highest bootstrap support are selected for annotation. Experiments on WGS datasets from a variety of species show that the SMART pipeline produces complete circular mitochondrial genome sequences with a higher success rate than current state-of-the-art tools, particularly for low-coverage WGS datasets. Second, we present SMART2, an enhanced version of the SMART pipeline that can take advantage of multiple sequencing libraries when available and automatically selects the optimal number of read pairs used for assembly. Experimental results on publicly available WGS datasets show that SMART2 can assemble high-quality mitochondrial genomes from low-coverage data with minimal user intervention. Indeed, SMART2 succeeded in generating mitochondrial sequences (including 16 complete circular mitogenomes) for 27 metazoan species with no previously published mitogenomes in NCBI databases.
Finally, we present efficient algorithms for highly accurate haplogroup assignment and mitochondrial-based forensic analysis of WGS data from mixed DNA samples.
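
The seed-based read selection that opens the SMART pipeline can be illustrated with a small sketch. This is not the SMART implementation: the function names, the default k-mer size, the use of the median count per read, and the "mean plus or minus z standard deviations" acceptance window are all assumptions chosen for illustration. The idea it shows is the one stated in the abstract: k-mers shared with the seed reveal the typical mtDNA k-mer count, and reads whose k-mer counts match that level are kept.

```python
from collections import Counter
from statistics import mean, median, stdev


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def select_reads(reads, seed, k=4, z=2.0):
    """Keep reads whose median k-mer count resembles that of seed k-mers.

    Hypothetical helper: the acceptance window (mean +/- z * std, with the
    std floored at 1.0) is an illustrative assumption, not SMART's model.
    """
    # Count every k-mer across the whole read set.
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    # Counts of the seed's k-mers that were actually observed in the reads.
    seed_counts = [counts[km] for km in set(kmers(seed, k)) if counts[km] > 0]
    if not seed_counts:
        return []  # the seed shares no k-mers with the reads

    mu = mean(seed_counts)
    sigma = stdev(seed_counts) if len(seed_counts) > 1 else 0.0
    sigma = max(sigma, 1.0)  # avoid a zero-width window
    low, high = mu - z * sigma, mu + z * sigma

    # Positively select reads whose typical k-mer count falls in the window.
    selected = []
    for read in reads:
        read_counts = [counts[km] for km in kmers(read, k)]
        if read_counts and low <= median(read_counts) <= high:
            selected.append(read)
    return selected


# Toy data: ten copies of a "mitochondrial" read, one "nuclear" read.
seed_seq = "ACGTTGCAAGCT"
nuclear_read = "GGGGCCCCTTTTAAAA"
demo_reads = [seed_seq] * 10 + [nuclear_read]
demo_selected = select_reads(demo_reads, seed_seq, k=4)
```

In the toy data the high-copy reads share the seed's k-mer count level and are retained, while the single-copy read falls far below the window and is dropped. A real pipeline would use much longer k-mers and a proper statistical model of the count distribution.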