May 4, 2020
Title: Lossless Compression Tools for Genomics Data
Ph.D. Candidate: Sultan AlYami
Major Advisor: Dr. Chun-Hsi Huang
Associate Advisors: Dr. Reda A. Ammar, Dr. Sanguthevar Rajasekaran
Date/Time: Monday, May 4, 2020, 10:00 am-11:00 am
Meeting link: https://uconn-cmr.webex.com/uconn-cmr/j.php?MTID=m2817eed93d5eb8edd20c4c2b6c99ea34
Meeting number: 610 888 101
Join by phone: +1-415-655-0002 US Toll
While the rapid advancement of next-generation sequencing (NGS) technologies has significantly accelerated biomedical research and discovery, the storage, transmission, and processing of the resulting massive volumes of genomic data have become a challenge. Because NGS data are highly redundant, researchers have used data compression techniques to reduce storage space, transmission bandwidth, and processing cost. Some use general-purpose compressors such as gzip or bzip2, while other tools take advantage of properties particular to genomic data, such as the small alphabet, the presence of many exact or approximate repeats, and sequence redundancy.
In this dissertation, we investigate the use of Huffman-Tree Encoding for efficient compression of NGS data. Two specialized structures, the Unbalanced Huffman Tree and the Nongreedy Huffman Tree, are proposed to better exploit the properties of repeats and achieve a better compression ratio. Both demonstrate promising improvements over prior results based on Huffman-Tree Encoding. These methods are designed to compress single- or multi-FASTA files. Because NGS data come in different file formats, such as FASTA (Reference Genome), FASTQ (Raw Data), and SAM (Alignments), and because FASTQ and SAM contain quality scores and metadata in addition to the biological sequences, we also propose to investigate different compression tools for the FASTA and FASTQ file formats. A specialized FASTQ compressor is proposed, which achieves the best compression ratios on the LS454, PacBio, and MinION data sets compared with other state-of-the-art FASTQ compressors. Preliminary results were reported in JCB 2017, JCB 2019, and PLOS ONE 2019.
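For readers unfamiliar with the baseline technique, the following is a minimal sketch of textbook Huffman coding applied to a nucleotide string. It is not the Unbalanced or Nongreedy Huffman Tree proposed in the dissertation, whose construction is not described in this announcement. The function names (huffman_codes, encode, decode) and the sample sequence are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(sequence):
    """Build a classic (greedy) Huffman code table for the symbols in `sequence`.

    Illustrative baseline only; not the dissertation's proposed tree variants.
    """
    freq = Counter(sequence)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}

    # Heap entries: (weight, tie_breaker, {symbol: partial_code}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree
        w2, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2]


def encode(sequence, codes):
    """Concatenate the code of every symbol into one bit string."""
    return "".join(codes[s] for s in sequence)


def decode(bits, codes):
    """Invert `encode`; works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    seq = "AAAAAAAACCCCGGT"  # hypothetical, deliberately skewed sample
    table = huffman_codes(seq)
    bits = encode(seq, table)
    print(table, len(bits), decode(bits, table) == seq)
```

On this skewed sample the frequent base A receives a one-bit code, so the encoded length falls below the 30 bits that a fixed two-bit-per-base encoding would need. On real genomic data, where the four bases are closer to equally frequent, plain per-base Huffman coding gains little over two bits per base, which is one reason specialized approaches target repeats instead.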
Ongoing work includes investigating more efficient use of approximate repeats in Huffman-Tree Encoding and adding support for SOLiD data sets by allowing our FASTQ file compressor, LfastqC, to work with color space encoding.