New England Database Society
sponsored by Sun Microsystems
NEDS
User-Centric Web Crawling
Christopher Olston
Carnegie Mellon University
Pittsburgh, PA
Friday, October 21, 2005, 4:00 PM
Volen 101, Brandeis University
(preceded by a wine and cheese reception at 3:00 PM)
Abstract:
Given the considerable size, dynamicity, and degree of
autonomy of the Web, it is not feasible for a search engine to maintain its
local repository exactly synchronized with the Web. As a result, search query
answers may be inaccurate. This problem can be especially pronounced for
topic-specific search engines such as science portals, which do not always wield
considerable computing and networking power.
We consider how to schedule Web pages for selective (re)downloading into a
search engine repository. Our scheduling objective is to maximize the quality of
the user experience for those who query the search engine. We begin with a
quantitative characterization of the way in which the discrepancy between the
content of the repository and the current content of the live Web impacts the
quality of the user experience. This characterization leads to a user-centric
metric of the quality of a search engine's local repository. We use this metric
to derive a policy for scheduling Web page (re)downloading that is driven by
search engine usage and free of exterior tuning parameters.
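As a rough illustration of what usage-driven refresh scheduling can look like, the Python sketch below ranks pages by a simple expected-benefit score and refreshes the highest-scoring ones within a download budget. It is not the method presented in the talk. The `Page` fields, the score (usage multiplied by an estimated probability of change), and the function names are all hypothetical assumptions made for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    # Hypothetical usage signal: how often users see this page in query results per day.
    views_per_day: float
    # Hypothetical estimate that the live page now differs from the stored copy.
    change_probability: float


def expected_benefit(page: Page) -> float:
    """Assumed score: pages that are both heavily used and likely stale rank highest."""
    return page.views_per_day * page.change_probability


def pick_pages_to_refresh(pages: list[Page], budget: int) -> list[Page]:
    """Return up to `budget` pages with the highest assumed benefit, best first."""
    return heapq.nlargest(budget, pages, key=expected_benefit)


if __name__ == "__main__":
    repository = [
        Page("a", views_per_day=100.0, change_probability=0.1),
        Page("b", views_per_day=5.0, change_probability=0.9),
        Page("c", views_per_day=50.0, change_probability=0.5),
    ]
    for page in pick_pages_to_refresh(repository, budget=2):
        print(page.url, expected_benefit(page))
```

In this toy ranking, a rarely viewed page that changes often can score below a popular page that changes slowly, which is the intuition behind scheduling by user impact rather than by change rate alone.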
We provide empirical comparisons of our user-centric method against prior Web
page refresh strategies, using real Web data. Our results demonstrate that our
method requires far fewer resources to maintain the same search engine quality level
for users, leaving substantially more resources available for incorporating new
Web pages into the search repository.
Speaker Bio:
Christopher Olston is an assistant professor of computer science at Carnegie Mellon University. His research interests include data stream management and Web search. Olston received his Ph.D. in 2003 from Stanford University, where he was supported by dual fellowship awards from the National Science Foundation and the Stanford Graduate Fellowship program. Prior to attending graduate school, he received the 1998 Computing Research Association Award for Outstanding Undergraduates.
Maintained by Dina Goldin dqg AT cse.uconn.edu