January 16, 2020
Title: Complex Genome Analysis with High-throughput Sequencing Data: Methods and Applications
Ph.D. Candidate: Xin Li
Major Advisor: Dr. Yufeng Wu
Associate Advisors: Dr. Sanguthevar Rajasekaran, Dr. Sheida Nabavi
Day/Time: Thursday, Jan 16, 2020, 1:00 pm
Location: HBL 2119A (Formerly Video Theater 2)
The genomes of most eukaryotes are large and complex, and a large proportion of non-coding sequence is a general property of the genomes of complex eukaryotes. While RNA is usually produced by linear splicing during transcription, recent studies have found that non-canonical splicing sometimes occurs. Circular RNA (circRNA) is a kind of non-coding RNA that forms a closed loop through a typical 5' to 3' phosphodiester bond created by non-canonical splicing. CircRNA was originally thought to be a low-abundance byproduct of mis-splicing. Recently, however, circRNA has come to be regarded as a new class of functional molecule, and its importance in gene regulation and its biological functions in some human diseases have started to be recognized. Several biological methods and multiple bioinformatics tools have been developed for circRNA detection in recent years. However, there is little overlap among the predictions of published circRNA detection algorithms. In addition, these tools have long running times and may miss many circRNAs in some cases. Moreover, since the study of circRNA is still at an early stage, there is currently no widely accepted benchmark data for evaluating circRNA calling. In this research work, we propose two algorithms that address these problems using high-throughput next-generation sequencing reads. To reduce running time, we design an algorithm called “CircMarker” that finds circRNA by building a k-mer table rather than by conventional read mapping. Furthermore, we create an algorithm named “CircDBG” that uses information from both the reads and the annotated genome to build a de Bruijn graph for circRNA detection, which improves accuracy and sensitivity. We also design several different methods to compare the performance of multiple tools. Both simulated data and real data from different species are used in this research work.
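To make the k-mer table idea concrete, here is a minimal Python sketch. It is not the CircMarker implementation: the function names, the toy exon sequences, and the rule "a read whose k-mers hit a later exon before an earlier one is a back-splice candidate" are all assumptions made for illustration. It also ignores strand, sequencing errors, and single-exon circles.

```python
from collections import defaultdict

def build_kmer_table(exons, k):
    """Map each k-mer to the (exon_index, offset) positions where it occurs.
    `exons` is a hypothetical list of annotated exon sequences in genomic order."""
    table = defaultdict(list)
    for idx, seq in enumerate(exons):
        for off in range(len(seq) - k + 1):
            table[seq[off:off + k]].append((idx, off))
    return table

def exon_order(read, table, k):
    """Exon indices hit by the read's k-mers, in read order, with
    consecutive repeats collapsed. Ambiguous k-mers are skipped."""
    order = []
    for i in range(len(read) - k + 1):
        hits = table.get(read[i:i + k])
        if hits and len(hits) == 1:  # keep only k-mers unique to one exon
            exon = hits[0][0]
            if not order or order[-1] != exon:
                order.append(exon)
    return order

def is_back_splice_candidate(read, table, k):
    """True if the read visits a downstream exon and then an upstream one,
    the exon order expected across a back-splice junction."""
    order = exon_order(read, table, k)
    return any(a > b for a, b in zip(order, order[1:]))

# Toy example (made-up sequences, k = 5).
exons = ["ACGTTGCAAG", "CTGATCCGTA", "GGCATTAGCC"]
table = build_kmer_table(exons, 5)
linear_read = "TGCAAGCTGATC"  # end of exon 0 followed by start of exon 1
circ_read = "TTAGCCACGTTG"    # end of exon 2 followed by start of exon 0
print(is_back_splice_candidate(linear_read, table, 5))  # False
print(is_back_splice_candidate(circ_read, table, 5))    # True
```

The table is built once from the annotation, so each read costs only hash lookups instead of a full alignment, which is the kind of saving a k-mer table offers over read mapping.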
Structural variation (SV), which ranges from 50 bp to ~3 Mb in size, is an important type of genetic variation. A deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, namely discordant read-pairs, read depth and split reads, are commonly used for SV detection from high-throughput sequence data. Many tools have been developed for detecting SVs by using one or more of these signals. In this research work, we develop a new method called “EigenDel” for detecting genomic deletions. EigenDel first uses discordant read-pairs and clipped reads to obtain initial deletion candidates, and then clusters similar candidates using unsupervised learning methods. After that, EigenDel uses a carefully designed approach to call true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low-coverage sequence data from the 1000 Genomes Project. Our results show that EigenDel outperforms other major methods by achieving a better balance between accuracy and sensitivity and by reducing bias.
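The discordant read-pair signal mentioned above can be sketched as follows. This is an illustrative Python outline only, not EigenDel itself: the input format (each pair given as the end of the left read and the start of the right read), the thresholds, and the greedy position-based grouping are assumptions, and the grouping is a simple stand-in for the unsupervised clustering that EigenDel uses.

```python
def discordant_pairs(pairs, mean_gap, sd_gap, n_sd=3.0):
    """Keep read pairs whose inner gap is much larger than expected.
    `pairs` holds (left_end, right_start) coordinates; `mean_gap` and
    `sd_gap` describe the library's expected inner gap (assumed known)."""
    cutoff = mean_gap + n_sd * sd_gap
    return [(l, r) for l, r in pairs if r - l > cutoff]

def group_candidates(candidates, max_shift=100):
    """Greedily group candidates whose left coordinates lie within
    `max_shift` bp of the previous candidate in sorted order."""
    groups = []
    for l, r in sorted(candidates):
        if groups and l - groups[-1][-1][0] <= max_shift:
            groups[-1].append((l, r))
        else:
            groups.append([(l, r)])
    return groups

def call_deletions(pairs, mean_gap, sd_gap, min_support=2):
    """Return (start, end, support) for each group with enough pairs.
    The interval is the region spanned by every pair in the group."""
    calls = []
    for group in group_candidates(discordant_pairs(pairs, mean_gap, sd_gap)):
        if len(group) >= min_support:
            start = max(l for l, _ in group)
            end = min(r for _, r in group)
            calls.append((start, end, len(group)))
    return calls

# Toy example with made-up coordinates: three pairs support one deletion,
# one isolated discordant pair is dropped, and two pairs look normal.
pairs = [(1000, 1300), (5000, 6250), (5020, 6270),
         (5050, 6290), (9000, 9310), (20000, 20900)]
print(call_deletions(pairs, mean_gap=300, sd_gap=30))  # [(5050, 6250, 3)]
```

Requiring several supporting pairs per group is one simple way to trade sensitivity for accuracy on low-coverage data. A real caller would also use clipped reads to refine the breakpoints.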