Speaker: Dr. Eduard Dragut, Associate Professor, Computer and Information Sciences Department, Temple University
Date: Wednesday, March 4, 2020
Location: HBL 1947 room
Title: Human-in-the-Loop Entity Mention Mining from Noisy Web Data
Abstract: Recognizing entities that follow or closely resemble a regular expression (regex) pattern is an important task in information extraction. Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks, such as identifying mentions of dates in a document, become very challenging. It is reasonable to claim that it is impossible to create a regex capable of identifying such entities in web documents with perfect precision and recall. Rather than abandoning regexes as a go-to approach for entity detection, we present methods that combine the expressive power of regexes, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing regexes for a particular type of entity. Those regexes are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those regex-generated weak labels. Finally, a human expert is asked to label a set of documents, and the neural network is fine-tuned on those documents. While human effort is critical to building an entity recognition model, surprisingly little is known about how best to invest that effort given a limited time budget. Should a human’s effort be spent on writing a regex that recognizes an entity or on manually labeling entity mentions in a document corpus? When a user is allowed to choose between regex construction and manual labeling, we discover that (1) if the time budget is low, spending all of the time on regex construction is often advantageous, (2) if the time budget is high, spending all of the time on manual labeling seems to be superior, and (3) between those two extremes, writing regexes followed by manual labeling is typically the best approach.
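Illustration (not part of the talk materials): the weak-labeling step described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the two date patterns, the name DATE_PATTERNS, and the function weak_label are hypothetical and chosen only for illustration; they are not the speaker's regexes or code. The sketch marks each whitespace-separated token with 1 if it overlaps a regex match and 0 otherwise, which is the kind of noisy label a neural network could later be trained on.

```python
import re

# Hypothetical date patterns, for illustration only (not the speaker's regexes).
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
]


def weak_label(text):
    """Return (token, label) pairs; label is 1 if the token overlaps a regex match."""
    # Character spans of every match from every pattern.
    spans = [m.span() for p in DATE_PATTERNS for m in p.finditer(text)]
    labeled = []
    for tok in re.finditer(r"\S+", text):
        # A token is weakly labeled as part of a date if it overlaps any span.
        inside = any(tok.start() < end and start < tok.end() for start, end in spans)
        labeled.append((tok.group(), 1 if inside else 0))
    return labeled
```

Calling weak_label on a sentence from a web page yields token-level labels that are only as good as the regexes themselves. That limitation is why the framework in the abstract follows this step with fine-tuning on documents labeled by a human expert.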
Bio: Eduard Dragut is an Associate Professor in the Computer and Information Sciences Department at Temple University. He received his Ph.D. degree in Computer Science from the University of Illinois at Chicago. He was previously a Postdoctoral Research Associate at Purdue University, Discovery Park, Cyber Center. His main area of research is Web data management, including retrieval, extraction, representation, cleaning, analysis, and integration. He is actively pursuing projects in Entity Recognition and Linking in Social Media, Sentiment Analysis, and Cyber-Infrastructure for Scientific Research. He co-authored a book on Deep Web data integration, Deep Web Query Interface Understanding and Integration.