New England Database Society

Friday, October 21, 2005

sponsored by Sun Microsystems

NEDS

User-Centric Web Crawling

Christopher Olston
Carnegie Mellon University
Pittsburgh, PA
 

Friday, October 21, 2005, 4:00 PM
Volen 101, Brandeis University

(preceded by a wine and cheese reception at 3:00 PM)

Abstract:

Given the considerable size, dynamicity, and degree of autonomy of the Web, it is not feasible for a search engine to keep its local repository exactly synchronized with the Web. As a result, search query answers may be inaccurate. This problem can be especially pronounced for topic-specific search engines such as science portals, which do not always wield considerable computing and networking power.

We consider how to schedule Web pages for selective (re)downloading into a search engine repository. Our scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters.

We provide empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
 

Speaker Bio:

Christopher Olston is an assistant professor of computer science at Carnegie Mellon University. His research interests include data stream management and Web search. Olston received his Ph.D. in 2003 from Stanford University, where he was supported by dual fellowship awards from the National Science Foundation and the Stanford Graduate Fellowship program. Prior to attending graduate school, he received the 1998 Computing Research Association Award for Outstanding Undergraduates.


Maintained by Dina Goldin dqg AT cse.uconn.edu