Ph.D. Defense: Sultan Al Yami
December 17, 2020 @ 1:15 pm - 2:15 pm EST
Doctoral Dissertation Oral Defense
Title: Lossless Compression Tools for Genomics Data
Ph.D. Candidate: Sultan Al Yami
Major Advisor: Dr. Chun-Hsi Huang
Associate Advisors: Dr. Reda A. Ammar, Dr. Sanguthevar Rajasekaran
Date/Time: Thursday, December 17, 2020, 1:15 PM – 2:15 PM
Meeting number: 120 973 5916
Join by phone: +1-415-655-0002 US Toll
Access code: 120 973 5916
While the rapid advancement of next-generation sequencing (NGS) technologies has significantly accelerated biomedical research and discovery, the storage, transmission, and processing of the massive amount of genomic data have become a challenge. Because NGS data are highly redundant, researchers have used data compression techniques to reduce storage space, transmission bandwidth, and processing cost. Some use general-purpose compressors such as gzip or bzip2, while other tools take advantage of properties particular to genomic data, such as the small alphabet, the presence of many exact or approximate repeats, and sequence redundancy.
In this dissertation, we investigate the use of Huffman-tree encoding for efficient compression of NGS data. First, we propose two specialized structures, the Unbalanced Huffman Tree and the Nongreedy Huffman Tree, which exploit the properties of repeats to achieve a better compression ratio. Both demonstrate promising improvements over prior results based on Huffman-tree encoding. We then propose a further structure based on the Nongreedy Huffman Tree; unlike the single-tree version, it uses multiple Nongreedy Huffman Trees to achieve a better compression ratio. All of these methods are designed to compress single- or multi-FASTA files only. NGS data come in different file formats, such as FASTA (reference genomes), FASTQ (raw data), and SAM (alignments), the latter two of which contain quality scores and metadata in addition to the biological sequences, so we also investigate compression tools for both the FASTA and FASTQ formats. We propose a specialized FASTQ compressor that achieves the best compression ratios on the LS454, PacBio, and MinION data sets compared with other state-of-the-art FASTQ compressors. Related results have been reported in JCB 2017, JCB 2019, and PLOS ONE 2019, and in a 2020 submission to PLOS ONE.
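For attendees unfamiliar with the baseline technique, the following is a minimal sketch of classic greedy Huffman coding applied to a short DNA string. It is not the Unbalanced or Nongreedy Huffman Tree from the dissertation, whose details are not given in this abstract; the function names (`huffman_code`, `encode`, `decode`) and the sample sequence are illustrative assumptions only. It shows how a skewed base distribution over a small alphabet can be coded in fewer bits than a fixed 2 bits per base.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a {symbol: bitstring} table with the classic greedy Huffman algorithm.

    Illustrative baseline only; not the dissertation's tree variants.
    """
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Greedy step: merge the two lightest subtrees.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, table):
    """Concatenate the code word of every symbol in text."""
    return "".join(table[s] for s in text)


def decode(bits, table):
    """Invert encode(); relies on Huffman codes being prefix-free."""
    inverse = {c: s for s, c in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical sample: an A-rich sequence, chosen to make the skew visible.
seq = "ACGTAAAAACCCAAGTAAAA"
table = huffman_code(seq)
bits = encode(seq, table)
assert decode(bits, table) == seq
print(len(bits), "bits with Huffman codes vs", 2 * len(seq), "bits at a fixed 2 bits/base")
```

On real genomic data the base frequencies are often much closer to uniform than in this toy string, which is one reason plain per-base Huffman coding gains little there and specialized structures that exploit repeats are of interest.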