May 1, 2019 –
Title: Efficient Sequential and Parallel Algorithms for Incremental Record Linkage.
Ph.D. Candidate: Abdullah Baihan
Major Advisors: Prof. Sanguthevar Rajasekaran and Prof. Reda Ammar
Associate Advisors: Prof. Song Han and Prof. Sheida Nabavi
Date/Time: Wednesday, May 1, 2019 2:00PM
Location: HBL 1947 Conference Room
Record linkage is the problem of integrating data from different sources and identifying duplicates. For instance, in the domain of health care, billions of records are stored and maintained in different data sources electronically wherein there could be multiple records for individuals. Record linkage has crucial benefits including cost saving, in analyzing and evaluating disease evolution, in the identification of disease origin and diversity, etc. Record linkage is a challenging problem since, typically, there may not be any global identifier. There are many algorithms to deal with the record linkage problem. A naive algorithm for this problem will compare every two records to find the matched records. However, this method takes a very long time. Also, most of the current efficient algorithms link only two datasets at a time, while other few algorithms link more than two datasets simultaneously. Record linkage algorithms take a long time especially when the number of records is very large. Very often, to make decisions on an individual (or a small group of individuals), the records will have to be linked with an already existing (large) database of records. The obvious way of linking the new records will be to add them to the database and perform a linkage on all of the records together. Clearly, this may take an unacceptable amount of time. Linking all the records may not be a viable solution. We need algorithms that can link the new records with the old ones efficiently. We refer to such algorithms as Incremental Record Linkage Algorithms. Unfortunately, the existing incremental algorithms have large run times. In this proposal we propose efficient sequential and parallel Incremental Record Linkage Algorithms “IRL” algorithms. Our algorithms employ agglomerative hierarchical clustering with single linkage. Also, we have extensively tested our algorithms on a large number of synthetic and real datasets. The results show that the proposed algorithms outperform previous algorithms in terms of running time and accuracy. The parallel algorithms achieve a very nearly linear speedup.