Title: Machine Learning methods for Complex Structural Variation analysis
Student: Timothy Becker
Major Advisor: Dr. Dong-Guk Shin
Associate Advisors: Dr. Yufeng Wu, Dr. Ion Mandoiu
Date/Time: Wednesday, November 1st, 2017 at 11:30am in Babbidge 1947 meeting room
Detecting structural variants (SVs), genomic variations larger than 50 nucleotide bases, with current DNA sequencing technology remains challenging in normal tissues and becomes more problematic with the increased heterogeneity and allele complexity found in tumor tissues. This dissertation focuses on three areas:
(1) Multi-input SV arbitration method
The first part of the dissertation describes a multi-input SV fusion method (FusorSV) that uses features and prior knowledge to produce a comprehensive, arbitrated call set. We show that this approach works well on deletion, duplication, and inversion call types in germline data by constructing a fully automated SV calling engine (SVE) that runs eight popular calling algorithms and uses the freely available 1000 Genomes Phase 3 high-coverage data set. Using SV type and length as features, FusorSV outperformed existing algorithms across 1000 rounds of permutation testing and had a concordantly high in vitro validation rate, in excess of 85%, for novel SV events.
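To make the idea of arbitrating calls by SV type and length concrete, the following is a minimal sketch and not FusorSV's actual algorithm. It assumes hypothetical length-bin edges, a 50% reciprocal-overlap rule for grouping calls, and a made-up table of per-caller prior weights keyed by caller, SV type, and length bin; none of these names or values come from the dissertation.

```python
from bisect import bisect_right

# Hypothetical length-bin edges in base pairs (an assumption for illustration).
LENGTH_BINS = [50, 1_000, 10_000, 100_000]


def length_bin(length):
    """Return the index of the length bin that contains `length`."""
    return bisect_right(LENGTH_BINS, length)


def reciprocal_overlap(a, b):
    """Reciprocal overlap of two half-open intervals given as (start, end)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))


def arbitrate(calls, weights, min_overlap=0.5, threshold=1.0):
    """Greedily group same-type overlapping calls and keep well-supported groups.

    calls:   list of (caller, sv_type, start, end) tuples
    weights: dict mapping (caller, sv_type, length_bin) to a prior weight
    """
    kept, used = [], set()
    for i, (_, sv_type, start, end) in enumerate(calls):
        if i in used:
            continue
        # Collect later calls of the same type that overlap the seed call.
        cluster = [i]
        for j in range(i + 1, len(calls)):
            if j in used:
                continue
            _, other_type, o_start, o_end = calls[j]
            if other_type == sv_type and reciprocal_overlap(
                (start, end), (o_start, o_end)
            ) >= min_overlap:
                cluster.append(j)
        used.update(cluster)
        # Score the group by the prior weights of its supporting callers,
        # looked up in the seed call's length bin.
        b = length_bin(end - start)
        callers = sorted({calls[k][0] for k in cluster})
        score = sum(weights.get((c, sv_type, b), 0.0) for c in callers)
        if score >= threshold:
            kept.append((sv_type, start, end, callers))
    return kept


calls = [
    ("A", "DEL", 1000, 2000),
    ("B", "DEL", 1050, 2050),
    ("C", "DUP", 5000, 5200),
]
weights = {("A", "DEL", 2): 0.6, ("B", "DEL", 2): 0.5, ("C", "DUP", 1): 0.4}
fused = arbitrate(calls, weights)
```

In this toy input, the two overlapping deletion calls reinforce each other and pass the threshold, while the duplication reported by a single low-weight caller is dropped. A real arbitration scheme would also need to merge breakpoints and learn the weights from a truth set.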
(2) Somatic genome generation method
The second area details a genome generator (soMaCX) that models somatic evolution from subclonal to cancer stem cell instances under continuous control. Joint SV distributions are constructed from SV type, size, complexity, and region controls. Gain and loss of function are modeled by considering the SV type and its positive or negative effect after transcription, thereby providing the mechanism needed to simulate selective pressure on oncogenes and on replication regions like the NHEJ pathway. To give the user control of sample purity, reads are simulated for both normal and somatic tissues, and the resulting data are randomly sampled.
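The purity control described above can be illustrated with a short sketch. It assumes reads are already available as two in-memory lists and that purity means the fraction of sampled reads drawn from somatic tissue; the function name `mix_reads` and its parameters are hypothetical and are not taken from soMaCX.

```python
import random


def mix_reads(normal_reads, somatic_reads, purity, total, seed=None):
    """Sample `total` reads so that a fraction `purity` comes from somatic tissue."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    rng = random.Random(seed)
    n_somatic = round(purity * total)
    n_normal = total - n_somatic
    # random.sample draws without replacement and raises ValueError
    # if more reads are requested than a pool contains.
    mixed = rng.sample(somatic_reads, n_somatic) + rng.sample(normal_reads, n_normal)
    rng.shuffle(mixed)
    return mixed


normal = [f"n{i}" for i in range(100)]
somatic = [f"s{i}" for i in range(100)]
mixed = mix_reads(normal, somatic, purity=0.3, total=50, seed=1)
```

Fixing the seed makes a simulated sample reproducible, which is useful when comparing callers on the same mixture.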
(3) Sequence feature extraction method and application
The final part of this dissertation comprises a sequence feature extraction framework (SAFE) and its application to somatic SV analysis. We propose a genomic signal processor framework that abstracts and transforms sequences and alignment entries into feature vectors such as read depth, split read depth, clipped read depth, supplemental read depth, strand bias, k-mer frequency, and nucleic acid proportion. Integral to this framework will be an out-of-core data structure that offers efficient random access, normalization, and indexing on large data sets. We will then demonstrate its effectiveness by applying machine learning methods in conjunction with SAFE to SV allele complexity and heterogeneity.
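As an illustration of three of the features named above (nucleic acid proportion, k-mer frequency, and read depth), here is a minimal in-memory sketch. It is not the SAFE implementation: the function names are hypothetical, alignments are simplified to half-open (start, end) intervals, and the out-of-core storage is not modeled.

```python
from collections import Counter


def base_proportions(seq):
    """Proportion of each of A, C, G, T in a sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    return {base: (counts[base] / n if n else 0.0) for base in "ACGT"}


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer observed in a sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def read_depth(alignments, region_start, region_end):
    """Per-base depth over a half-open region from (start, end) alignment intervals."""
    depth = [0] * (region_end - region_start)
    for start, end in alignments:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


props = base_proportions("ACGTAC")
kmers = kmer_frequencies("ACGTAC", 2)
depth = read_depth([(0, 5), (3, 8)], 2, 6)
```

Concatenating such per-window values yields a feature vector that a downstream classifier can consume.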