Title: Novel Techniques for Big Data Processing and Record Linkage Applications
Ph.D. Candidate: Ahmed Soliman
Major Advisor: Dr. Sanguthevar Rajasekaran
Associate Advisors: Dr. Nalini Ravishanker
Committee Members: Dr. Song Han, Dr. Sheida Nabavi
Date/Time: Wednesday, August 2nd, 2023, 10:00 am
Location: WebEx
Meeting link: https://uconn-cmr.webex.com/uconn-cmr/j.php?MTID=mb4e783d002b03fcd8e22e76cb2ef490d
Meeting number: 2623 160 9971
Password: kZpEcEys252
Abstract
Recent advances in technology have led to an unprecedented explosion of data in virtually all domains of our lives. Harnessing the information mined from these rapidly changing and ever-growing data calls for carefully designed and professionally crafted solutions. Solely relying on powerful computing resources is insufficient: even the most up-to-date computing environment lacks the scalability to handle today's big data processing efficiently unless it is coupled with novel algorithmic techniques. We offer several such techniques that yield more practical solutions for processing and linking big databases.
The first big data problem we study is the Record Linkage (RL) problem, also known as Entity Resolution (ER). Given several data sources, RL is the process of identifying and grouping all records that pertain to the same real-world entity. Such an entity might be a person, a product, or a business. The speed and quality of the record linkage process are of immense interest across many sectors, including government, health, public safety, and national security. For example, record linkage in governmental applications has significant implications for the quality of census reports and hence for the fairness of the distribution of public funds. The RL problem has been well studied in the literature. However, in today's big data era, the linkage process remains time-consuming even with well-known techniques such as blocking and filtering. Thus, there is a growing demand for more efficient algorithms. To support today's rapidly growing data, we have designed the Fast Incremental Record Linkage Algorithm (FIRLA).
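To make the blocking idea concrete, the following is a minimal, illustrative sketch (not the FIRLA algorithm itself): records are grouped under a cheap blocking key so that expensive pairwise comparisons run only within each block. The key of "first three letters of the surname plus birth year" is a hypothetical choice for illustration.

```python
from collections import defaultdict

def blocking_key(record):
    # Hypothetical blocking key: first 3 letters of surname + birth year.
    return (record["surname"][:3].lower(), record["birth_year"])

def candidate_pairs(records):
    # Group records by blocking key, then compare only within blocks.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                pairs.append((block[i], block[j]))
    return pairs

records = [
    {"id": 1, "surname": "Smith", "birth_year": 1980},
    {"id": 2, "surname": "Smyth", "birth_year": 1980},
    {"id": 3, "surname": "Smith", "birth_year": 1980},
    {"id": 4, "surname": "Jones", "birth_year": 1975},
]
pairs = candidate_pairs(records)
# Brute force would generate 6 pairs; blocking produces only 1 here.
# Note the trade-off: "Smyth" falls in a different block than "Smith",
# so that potential match is never compared.
```

Blocking trades a small loss in recall for a large reduction in the number of comparisons, which is precisely why incremental and filtering refinements of it matter at big data scale.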
The second big data problem we study is patient record linkage over time series data. Modern biomedical devices can acquire a large number of physical readings from patients, and these readings are often stored as time series. Such data can form the basis for important research that advances healthcare and well-being. Due to considerations such as data size and patient privacy, the original, full data may not be available to secondary parties or researchers; instead, suppose that a subset of the data is made available. We offer TSLINK, a novel and scalable record linkage algorithm for time series data. TSLINK enables secondary-study researchers to accurately match patient records between the original and subset databases while maintaining privacy.
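The core matching step can be pictured with a small sketch. This is not TSLINK itself, only an assumed baseline: a released subsequence is slid over each candidate series in the original database, and the record with the smallest Euclidean distance at any alignment is declared the match. All names and data below are illustrative.

```python
def min_subseq_dist(series, subseq):
    # Smallest squared Euclidean distance between subseq and any
    # equal-length window of series.
    best = float("inf")
    for start in range(len(series) - len(subseq) + 1):
        d = sum((series[start + k] - subseq[k]) ** 2 for k in range(len(subseq)))
        best = min(best, d)
    return best

def link(original_db, subseq):
    # Return the id of the original series that best explains the subset.
    return min(original_db, key=lambda pid: min_subseq_dist(original_db[pid], subseq))

original_db = {
    "patient_a": [1.0, 2.0, 3.0, 2.5, 1.5, 0.5],
    "patient_b": [0.2, 0.1, 0.4, 0.3, 0.2, 0.1],
}
released = [3.0, 2.5, 1.5]  # subset drawn from patient_a's series
match = link(original_db, released)
```

This brute-force baseline is quadratic in the series lengths, which is exactly why a scalable algorithm is needed once databases hold millions of long patient time series.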
In addition, we have studied the problem of designing an automated, unsupervised, online anomaly detection method. Many companies leverage advances in IoT technologies and increasingly employ Wireless Sensor Network (WSN)-based solutions in their monitoring services. These IoT monitoring services call for an automated anomaly prediction/detection solution. We offer IF+, a novel custom unsupervised approach for online anomaly detection that combines Isolation Forest with two generic techniques, namely data thresholding and distance-based filtering. IF+ has been successfully deployed and tested in real IoT monitoring systems.
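The two generic techniques named above can be sketched as follows. This is a minimal, assumed illustration, not the IF+ implementation: a simple distance-to-mean scorer stands in for the Isolation Forest, and both thresholds are hypothetical values chosen for the example.

```python
def score(x, mean):
    # Stand-in anomaly score (IF+ would use an Isolation Forest here):
    # absolute distance from the mean of readings accepted so far.
    return abs(x - mean)

def detect_online(stream, value_threshold=100.0, score_threshold=5.0):
    anomalies = []
    history = []
    for x in stream:
        # Data thresholding: discard physically implausible readings outright.
        if abs(x) > value_threshold:
            continue
        if history:
            mean = sum(history) / len(history)
            # Distance-based filtering: flag readings far from recent behavior.
            if score(x, mean) > score_threshold:
                anomalies.append(x)
                continue  # keep flagged readings out of the running history
        history.append(x)
    return anomalies

stream = [10.0, 11.0, 10.5, 9.8, 42.0, 10.2, 999.0, 10.1]
flagged = detect_online(stream)
# 42.0 is flagged by distance-based filtering; 999.0 is dropped by
# data thresholding before it can distort the history.
```

Keeping flagged readings out of the history is the design choice that lets an online detector stay stable after an anomaly burst, since the reference behavior is not contaminated by the anomalies themselves.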