Ph.D. Defense: Timothy Becker
July 19 @ 12:00 pm - 1:30 pm EDT
Title: Machine Learning Methods for Complex Structural Variation Analysis
Ph.D. Candidate: Timothy Becker
Major Advisor: Dong-Guk Shin
Associate Advisors: Ion Mandoiu, Yufeng Wu
Additional Readers: Sheida Nabavi, Derek Aguiar
Date/Time: Monday July 19th, 12:00pm
Meeting Number: 120 709 1994
Abstract: Detecting germline structural variants (SVs), variations larger than 50 nucleotide bases, in normal tissue with high accuracy remains challenging with high-throughput DNA sequencing. Somatic SV calling is even more difficult because of the heterogeneity and allele complexity associated with tumor tissues. We first show that existing germline SV calling accuracy can benefit from a supervised ensemble training method called FusorSV. This method, however, is not immediately applicable to somatic SV calling, since there are far fewer somatic SV callers than germline ones and no open-source datasets to learn from. Thus, we developed three complementary methods that are used in conjunction to form a somatic SV calling framework. We first describe a feature moment extraction system called HFM that extracts important SV signatures such as read depth, insert size, mapping quality, and orientation information from the reads. Next, we detail a supervised variable-topology neural network called TensorSV that is designed to work with HFM features and variable-heterogeneity VCF calls. To overcome the lack of gold-standard somatic SV datasets, we designed a complex somatic genome generation framework called somaCX that generates a germline genome and then simulates the somatic evolutionary process on that basis using non-uniform distributions that account for gene function. We then build a somatic training set with high SV rates and oncogene enrichment to train specialized TensorSV models for the deletion (DEL), duplication (DUP), and inversion (INV) types. We strike a balance in somatic SV calling accuracy by using four-class models: one class for no event, two for the standard germline frequencies, and one specifically designed for low-frequency events that are present only in mixed-purity tissues.
We demonstrate that these models are effective by applying them to open-source TCRB samples and find that our SV calls in the oncogene regions have higher enrichment than healthy background samples, in addition to having highly correlated differential oncogene expression patterns.