
Ph.D. Defense: Sultan Al Yami

December 17, 2020 @ 1:15 pm - 2:15 pm EST

Doctoral Dissertation Oral Defense

Title: Lossless Compression Tools for Genomics Data

Ph.D. Candidate: Sultan Al Yami

Major Advisor: Dr. Chun-Hsi Huang

Associate Advisors: Dr. Reda A. Ammar, Dr. Sanguthevar Rajasekaran

Date/Time: Thursday, December 17, 2020, 1:15 PM – 2:15 PM

Location: Online (WebEx)

Meeting link: https://uconn-cmr.webex.com/uconn-cmr/j.php?MTID=m6fcf3679cffeccb06c35e6d7d902f540

Meeting number: 120 973 5916
Password: VCaVjbHN533

Join by phone: +1-415-655-0002 US Toll

Access code: 120 973 5916

Abstract: 

While the rapid advancement of next-generation sequencing (NGS) technologies has significantly accelerated biomedical research and discovery, storing, transmitting, and processing the resulting massive volumes of genomic data has become a challenge. Because NGS data are highly redundant, researchers have used data compression techniques to reduce storage space, transmission bandwidth, and processing cost. Some use general-purpose compressors such as gzip or bzip2, while other tools exploit properties particular to genomic data, such as the small alphabet, the presence of many exact or approximate repeats, and sequence redundancy.
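As a rough illustration of how the small alphabet can be exploited, the sketch below packs a DNA string into 2 bits per base, a quarter of the size of one-byte-per-character text. It is a minimal example and not a tool from the dissertation; the helper names `pack_2bit` and `unpack_2bit` are hypothetical, and the code assumes the input contains only the symbols A, C, G, and T (no N or lowercase bases).

```python
# Hypothetical helpers; assumes the sequence contains only A, C, G, T.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq):
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so decoding reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Recover the first `length` bases from 2-bit packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # The original length must be stored separately to drop padding bases.
    return "".join(bases[:length])


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    packed = pack_2bit(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack_2bit(packed, len(seq)) == seq)
```

Fixed-width packing ignores repeats and skewed symbol frequencies, which is where statistical and repeat-aware methods can do better.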

In this dissertation, we investigate the use of Huffman-tree encoding for efficient compression of NGS data. First, two specialized structures, the Unbalanced Huffman Tree and the Nongreedy Huffman Tree, are proposed to better exploit the properties of repeats and improve the compression ratio. Both demonstrate promising improvements over prior results based on Huffman-tree encoding. Furthermore, another specialized structure based on the Nongreedy Huffman Tree is proposed; unlike the previous one, it uses multiple Nongreedy Huffman Trees to achieve a better compression ratio. All of these methods are designed to compress single- or multi-FASTA files only. Because NGS data come in different file formats, including FASTA (reference genomes), FASTQ (raw data), and SAM (alignments), and FASTQ and SAM files contain quality scores and metadata in addition to the biological sequences, we also propose to investigate different compression tools for the FASTA and FASTQ file formats. A specialized FASTQ compressor is proposed, which achieves the best compression ratios on the LS454, PacBio, and MinION data sets compared with other state-of-the-art FASTQ compressors. Related results have been reported in JCB 2017, JCB 2019, and PLOS ONE 2019, and in a recent submission to PLOS ONE in 2020.
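For readers unfamiliar with the baseline technique, the following is a minimal sketch of classical (greedy) Huffman coding applied to a DNA string. It is not the Unbalanced or Nongreedy Huffman Tree described in the abstract, whose construction is not given here; the function names are hypothetical, and symbols are assumed to be single characters.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a classical Huffman code table {symbol: bit string}."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(n, next(tie), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedy step: merge the two least frequent subtrees.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a single-character symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[s] for s in text)


def decode(bits, codes):
    inverse = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:  # prefix-free codes make this unambiguous
            out.append(inverse[cur])
            cur = ""
    return "".join(out)


if __name__ == "__main__":
    seq = "AAAAAAAACCCCGGT"
    table = huffman_codes(seq)
    bits = encode(seq, table)
    print(table, len(bits), "bits vs", 2 * len(seq), "bits at 2 bits/base")
    print(decode(bits, table) == seq)
```

With skewed base frequencies, the variable-length code uses fewer bits than a fixed 2 bits per base; with near-uniform frequencies the two are about equal, which is one reason repeat-aware tree structures are of interest.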

 

 
