Title: Developing Novel Copy Number Variation Detection Method using Emerging Sequencing Data
Student: Fatemeh Zare
Major Advisor: Prof. Sheida Nabavi
Associate Advisors: Prof. Ion Mandoiu and Prof. Jeffrey Chuang
Date/Time: Wednesday, Dec 11, 2019 4:00-5:00 P.M.
Location: ITE 336
Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in disease susceptibility. Advances in sequencing technology have created an opportunity to detect CNVs more accurately. Whole-exome sequencing (WES) and whole-genome sequencing (WGS) have become the primary next-generation sequencing (NGS) strategies for identifying CNVs. CNV detection tools developed for WGS data are not appropriate for WES data, because in WES sequencing data are available only for exonic regions and exome capture procedures introduce additional biases and noise. It is therefore necessary to build a robust and precise model to detect CNVs from WES data. Moreover, NGS generates huge amounts of data (usually at the scale of gigabytes), which calls for computationally efficient methods. The depth of coverage (DOC) approach is the most appropriate method for identifying CNVs from WES data. In general, DOC-based CNV detection tools proceed in two major steps: 1) preprocessing and 2) segmentation. In the preprocessing step, noise and biases in the read-count signal are reduced; in the segmentation step, CNV segments are identified by merging regions with similar read-count values.
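The two DOC steps described above can be illustrated with a minimal sketch. It is not the method proposed here: the median-based GC normalization, the greedy merging rule, and all names and parameters (gc_normalize, merge_segments, n_bins, tol) are assumptions chosen only for illustration.

```python
from statistics import median


def gc_normalize(read_counts, gc_fractions, n_bins=10):
    """Preprocessing sketch: rescale each region's read count by the median
    count of regions with similar GC content (hypothetical helper)."""
    overall = median(read_counts)
    groups = {}
    for rc, gc in zip(read_counts, gc_fractions):
        key = min(int(gc * n_bins), n_bins - 1)
        groups.setdefault(key, []).append(rc)
    medians = {k: median(v) for k, v in groups.items()}
    normalized = []
    for rc, gc in zip(read_counts, gc_fractions):
        m = medians[min(int(gc * n_bins), n_bins - 1)]
        normalized.append(rc * overall / m if m > 0 else 0.0)
    return normalized


def merge_segments(values, tol=0.3):
    """Segmentation sketch: greedily extend a segment while the next value
    stays within `tol` of the running segment mean.
    Returns a list of (start_index, end_index, segment_mean)."""
    segments = []
    if not values:
        return segments
    start, total = 0, values[0]
    for i in range(1, len(values)):
        mean = total / (i - start)
        if abs(values[i] - mean) <= tol:
            total += values[i]
        else:
            segments.append((start, i - 1, mean))
            start, total = i, values[i]
    segments.append((start, len(values) - 1, total / (len(values) - start)))
    return segments


# Toy log2-ratio signal with one gained region in the middle.
signal = [0.0, 0.05, -0.05, 1.0, 0.95, 1.05, 0.0, 0.02]
for seg in merge_segments(signal):
    print(seg)
```

On the toy signal, the merging rule yields three segments, with the middle one (indices 3 to 5) carrying the elevated values. Real tools replace the greedy rule with a statistically grounded segmentation, such as circular binary segmentation.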
In this proposal, we present a method for detecting CNVs from WES data. First, we evaluate the performance of the most recent and commonly used CNV detection tools for WES data in cancer, in order to identify their limitations and provide guidelines for developing new ones. We then introduce a novel preprocessing pipeline to improve the accuracy of CNV detection in heterogeneous next-generation sequencing data such as cancer whole-exome sequencing data. The pipeline employs several normalizations to reduce biases due to GC content, mappability, and tumor contamination. We also develop a novel, efficient smoothing approach based on the Taut String method to reduce noise and increase the detection power of CNV detection methods. In addition, we propose a novel, efficient segmentation algorithm that integrates information from partially mapped (soft-clipped) reads with read depth data for more precise CNV detection. The proposed method employs an efficient implementation of Taut String, a solution to the change-point optimization problem, to smooth the read depth data and to generate piecewise constant signals as CNV segments. The segmentation algorithm is based on a modified Taut String, which allows it to detect CNVs from WES data more precisely and efficiently. The proposed method also filters out outlier read counts and identifies significant change points to reduce false positives. Using simulated and real data, we evaluate the proposed method and compare it with other commonly used CNV detection methods. The results show that the proposed segmentation method outperforms existing CNV detection methods in terms of accuracy and false discovery rate, and that it runs faster than the circular binary segmentation method.
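For readers unfamiliar with Taut String, the standard formulation can be stated as follows. This is the textbook version only; the modification used in the proposed method is not specified in this abstract. The symbols y (read-depth values), x (smoothed signal), and lambda (smoothing parameter) are notation assumed here for illustration.

```latex
% Change-point (total variation) smoothing of read-depth values y_1, ..., y_n:
\hat{x} = \arg\min_{x \in \mathbb{R}^n}
  \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i)^2
  + \lambda \sum_{i=1}^{n-1} \lvert x_{i+1} - x_i \rvert

% Equivalent taut string problem on the cumulative sums
% Y_k = \sum_{i=1}^{k} y_i, with Y_0 = 0:
\min_{S} \sum_{k=1}^{n} \sqrt{1 + (S_k - S_{k-1})^2}
\quad \text{subject to} \quad
S_0 = 0, \; S_n = Y_n, \;
\lvert S_k - Y_k \rvert \le \lambda \;\; (1 \le k \le n-1)

% The smoothed signal is the slope of the string:
\hat{x}_k = S_k - S_{k-1}
```

The shortest path through the tube of radius lambda around the cumulative sums is piecewise linear, so its slope is piecewise constant. Each constant piece corresponds to a candidate CNV segment, and each slope change to a candidate change point.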
In addition, there is high molecular heterogeneity between single cells. Next-generation sequencing has been successfully adapted to sequence complete genomes at the single-cell level, and single-cell sequencing (SCS) is a useful tool for determining somatic genomic heterogeneity. Precise identification of CNVs may help to understand some of the genetic origins of cancer and to develop targeted drugs. The general steps of CNV detection from single-cell sequencing data are binning, GC correction, mappability correction, removal of outlier bins, removal of outlier cells, segmentation, and calling absolute copy numbers.
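Two of the steps listed above, binning and removal of outlier bins, can be sketched as follows. This is an illustrative sketch only: fixed-width bins, the median absolute deviation (MAD) rule, and the names bin_reads, keep_bin_mask, and n_mads are assumptions, not part of the proposed model.

```python
from statistics import median


def bin_reads(positions, chrom_length, bin_size):
    """Count read start positions (0-based) in fixed-width bins."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        if 0 <= pos < chrom_length:
            counts[pos // bin_size] += 1
    return counts


def keep_bin_mask(counts, n_mads=5.0):
    """Return True for bins to keep: those within n_mads median absolute
    deviations of the median count. If the MAD is zero, keep every bin."""
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    if mad == 0:
        return [True] * len(counts)
    return [abs(c - med) <= n_mads * mad for c in counts]


# Toy example: six reads on a 300 bp chromosome, 100 bp bins.
print(bin_reads([5, 15, 25, 95, 105, 250], chrom_length=300, bin_size=100))

# Toy example: one bin with an extreme count is flagged for removal.
print(keep_bin_mask([10, 11, 9, 10, 12, 100, 10, 9]))
```

In the second example, only the bin with count 100 is marked for removal. In practice, the threshold would be tuned per dataset, and GC and mappability correction would be applied alongside this filter.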
In the last part of this proposal, we discuss the challenges of detecting CNVs from SCS data and propose a model for this task.