- This event has passed.
Ph.D. Defense: Abdullah Baihan
August 28, 2019 @ 12:00 pm - 1:00 pm UTC-5
Title: Efficient Sequential and Parallel Algorithms for Incremental Record Linkage.
Ph.D. Candidate: Abdullah Baihan
Major Advisors: Prof. Sanguthevar Rajasekaran and Prof. Reda Ammar.
Associate Advisors: Prof. Song Han and Prof. Sheida Nabavi.
Date/Time: Wednesday, August 28nd, 2019 at 12:00 PM.
Location: HBL 1947 Conference Room
In the biomedical domain, the record linkage is considered as a crucial problem. When the number of records is very large, existing algorithms for record linkage take too much time. Often, we have to link a small set of new records with a large set of old records. This can be done by putting together the old and new records and performing a linkage on all the records. Clearly, this will call for an enormous amount of time. An alternative is to develop algorithms that perform linkage in an incremental manner. We refer to any such algorithm as an Incremental Record Linkage (IRL) algorithm.
In this thesis, we present an efficient IRL algorithm. In addition to taking large amounts of time, existing algorithms might also suffer from a chaining problem and hence introduce some errors in linking. As has been observed in the literature, this chaining problem can be solved by performing clustering under complete linkage.
This thesis makes two main contributions. Firstly, we have offer novel sequential and parallel algorithms for the critical incremental record linkage problem using single linkage. Secondly, we have come up with novel sequential and parallel algorithms for incremental record linkage using complete linkage to overcome the chaining problems.
Our algorithms can handle any number of datasets. In contrast, many of the existing algorithms can only link two datasets at a time. Our algorithms outperform previous algorithms and offer state-of-the-art solutions to the IRL problem. We have tested our algorithms on millions of records on synthetic and real datasets and shown that our algorithms outperform the best-known RLA algorithms when the number of new records is up to around 20% of the total number of old records. Our algorithms achieve a very nearly linear speedup in parallel.