January 14, 2019
Title: Software Tools for DNA Motif Similarity Comparison and Analysis
Name: Ngoc Tam Le Tran
Major Advisor: Prof. Chun-Hsi Huang
Associate Advisors: Prof. Sanguthevar Rajasekaran, Prof. Dong-Guk Shin
Date/Time/Location: January 14th, 2019 at 11AM, ITE 336
Binding site motifs are short, recurring sequence patterns found in DNA or protein sequences. They play an essential role in revealing the transcription factors that control gene expression. Several computational tools have been developed for finding binding site motifs. An earlier review of nine motif-finding Web tools for ChIP-Seq data indicated that the results reported by individual finders for the same datasets vary significantly, largely because the tools use different search strategies. It is therefore advisable to rely on multiple tools when finding motifs, as motifs reported by more tools are more likely to be significant. However, the results from different tools on the same inputs must then be compared to identify common instances. Existing tools for this purpose only allow motif comparisons within a dataset or between two datasets.
To compare more than two datasets, pair-wise comparisons are performed first, and the results are then checked against each other off-line to obtain common motifs. This process is time-consuming and becomes impractical for large datasets or a large number of datasets. Because search results differ from finder to finder, individual tools may not always be reliable. These limitations motivated us to develop software tools with the following features. First, they determine similarity across multiple DNA motif datasets concurrently, extracting both the common significant motifs and those reported by some finders but not by others. The results are cross-validated to demonstrate an improved prediction accuracy rate. Second, the tools efficiently compare large datasets and a large number of datasets without manual intervention. Similar motifs reported by multiple tools can be merged into new ones. Third, the results can be matched against motifs in a database to obtain similar motifs, and they can also be visualized as motif trees. Finally, users can obtain both the predictions from multiple motif finders and the comparison results without leaving the site. These features have been integrated into two Web tools: the MOTIF SIMilarity detection tool, MOTIFSIM [3-5], and the MOtif Discovery pipeline and SImilarity DEtector, MODSIDE.
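To make the idea of comparing several motif datasets at once concrete, the following minimal Python sketch finds motifs from one finder that have a close match in every other finder's output. It is only an illustration under stated assumptions, not the MOTIFSIM algorithm: the motif representation (lists of A, C, G, T frequency columns), the cosine-based column score, the ungapped alignment, the 0.9 threshold, and all function and variable names are hypothetical choices made for this example.

```python
from math import sqrt

# Assumption: a motif is a list of columns, each holding the
# frequencies of A, C, G, T at that position.

def column_similarity(p, q):
    """Cosine similarity between two frequency columns."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def motif_similarity(m1, m2):
    """Best mean column similarity over all ungapped offsets at which
    the shorter motif lies fully inside the longer one."""
    short, long_ = (m1, m2) if len(m1) <= len(m2) else (m2, m1)
    best = 0.0
    for offset in range(len(long_) - len(short) + 1):
        total = sum(column_similarity(col, long_[offset + i])
                    for i, col in enumerate(short))
        best = max(best, total / len(short))
    return best

def common_motifs(datasets, threshold=0.9):
    """Names of motifs in the first dataset that have at least one
    match scoring >= threshold in every other dataset."""
    first, *rest = datasets
    return [
        name for name, motif in first.items()
        if all(any(motif_similarity(motif, other) >= threshold
                   for other in dataset.values())
               for dataset in rest)
    ]

# Toy data (made up for illustration): three finders, one shared TATA-like motif.
tata = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
tata_noisy = [[0.1, 0, 0, 0.9], [0.9, 0.1, 0, 0], [0, 0, 0.1, 0.9], [1, 0, 0, 0]]
gcgc = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]

finder_a = {"a1": tata, "a2": gcgc}
finder_b = {"b1": tata_noisy}
finder_c = {"c1": [[0.25] * 4] + tata}  # TATA with one uninformative flanking column

print(common_motifs([finder_a, finder_b, finder_c]))  # prints ['a1']
```

In this sketch every dataset is examined in a single call, so no manual cross-checking of pair-wise result tables is needed; a production tool would additionally handle reverse complements, partial overlaps, and statistical significance.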
MOTIFSIM has been developed for several platforms. The command-line version compares motifs locally in a stand-alone mode. The cluster-based version compares motifs on-line through a user-friendly Web interface. Users can save datasets and results on-line for later retrieval. The tool is also scalable because a load balancer distributes Web traffic. The cloud-based version, developed on the Amazon Web Services (AWS) cloud, efficiently compares large datasets and a large number of datasets, scales with expandable cloud services, provides more on-line storage space for users, and performs better than the cluster-based version. The MODSIDE pipeline incorporates MOTIFSIM as well as four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. The assessment shows that MODSIDE achieves a better accuracy rate than the individual motif finders. Compared with other existing motif discovery pipelines, MODSIDE performs similarly to RSAT peak-motifs and better than MEME-ChIP. In addition, MODSIDE delivers various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, or other similar pipelines.