November 27, 2017 –

**Title: **Novel Algorithms for Some Fundamental Big Data Problems

**Student: **Abdullah-Al Mamun

**Major Advisor:** Dr. Sanguthevar Rajasekaran

**Associate Advisors:** Dr. Reda Ammar, Dr. Ion Mandoiu

**Date/Time: **Monday, Nov 27th, 2017 at 9:30 am

**Location: **ITE 125

**Abstract:**

In this digital era data sets are growing rapidly. Storing, processing, and analyzing large volume of data require efficient techniques. My proposal will focus the following three areas:

**K-mer counting problem:**

A massive number of bioinformatics applications require counting of k-length substrings in genetically important long strings. Genome assembly, repeat detection, multiple sequence alignment, error detection, and many other related applications use a k-mer counter as a building block. Very fast and efficient algorithms are necessary to count k-mers in large data sets to be useful in such applications. We propose a novel trie-based algorithm for this k-mer counting problem.

**Record linkage problem:**

Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. Here we have come up with efficient sequential and parallel algorithms for record linkage which can handle any number of datasets. Our methods employ single linkage as well as complete linkage hierarchical clustering to address this problem.

**Problems with algorithmic challenges:**

Finding minimum spanning trees (MST) in various types of networks is a well-studied problem in theory and practical applications. We have devised a very efficient algorithm which combines ideas from randomized selection, Kruskal’s algorithm and Prim’s algorithm.

Algorithms for finding the closest l-mers have been used in solving the (l, d)-motif search problem. We propose novel exact and approximate algorithms for this problem for the special case of m = 3.