May 3, 2018
Title: Complex Genome Analysis with High-throughput Sequencing Data: Methods and Applications
PhD Candidate: Xin Li
Major Advisor: Dr. Yufeng Wu
Associate Advisors: Dr. Ion Mandoiu, Dr. Sheida Nabavi
Day/Time: Thursday, May 03, 2018 10:00am
Location: Laurel Hall 102
The genomes of most eukaryotes are large and complex, and a large amount of non-coding sequence is a general property of the genomes of complex eukaryotes. While RNA is often produced by linear splicing during transcription, recent studies have found that non-canonical splicing sometimes occurs. Circular RNA (circRNA) is a kind of non-coding RNA that forms a closed loop through a typical 5’ to 3’ phosphodiester bond created by non-canonical splicing. CircRNA was originally thought to be a byproduct of mis-splicing and was considered to be of low abundance. Recently, however, circRNA has come to be regarded as a new class of functional molecule, and its importance in gene regulation and its biological functions in some human diseases have started to be recognized. Several biological methods and multiple bioinformatics tools have been developed for circRNA detection in recent years. However, the predictions of published circRNA detection algorithms overlap very little. In addition, these tools have long running times and may miss many circRNAs in some cases. Moreover, since the study of circRNA is still at an early stage, there is currently no widely accepted benchmark data for evaluating circRNA calling. In this project, we propose multiple algorithms that address these problems using high-throughput next-generation sequencing reads. To reduce running time, we design an algorithm called “CircMarker” that finds circRNA by building a k-mer table rather than relying on conventional read mapping. Furthermore, we create an algorithm named CircDBG, which uses information from both the reads and the annotated genome to build a de Bruijn graph for circRNA detection, improving accuracy and sensitivity. We also design several different methods to compare the performance of multiple tools. Both simulated data and real data from different species are used in this project.
Finally, we propose using machine learning or deep neural networks to address two additional problems. The first is the detection of complex genomic features, such as back-splicing (e.g. circRNA) and repetitive elements. The second is long-read error correction.