2003 Digital Symposium Collection

Complex Queries over Web Repositories

Sriram Raghavan and Hector Garcia-Molina

Return to Internet/WWW (Session A1)

Abstract

Web repositories, such as the Stanford WebBase repository, manage large heterogeneous collections of Web pages and associated indexes. For effective analysis and mining, these repositories must provide a declarative query interface that supports complex expressive Web queries. Such queries have two key characteristics: (i) They view a Web repository simultaneously as a collection of text documents, as a navigable directed graph, and as a set of relational tables storing properties of Web pages (length, URL, title, etc.). (ii) The queries employ application-specific ranking and ordering relationships over pages and links to filter out and retrieve only the "best" query results. In this paper, we model a Web repository in terms of "Web relations" and describe an algebra for expressing complex Web queries. Our algebra extends traditional relational operators as well as graph navigation operators to uniformly handle plain, ranked, and ordered Web relations. In addition, we present an overview of the cost-based optimizer and execution engine that we have developed, to efficiently execute Web queries over large repositories.
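
To make the three views in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the paper's algebra, optimizer, or the WebBase interface: the toy repository (PAGES, LINKS), the function names, and the use of in-link counts as the ranking are all assumptions introduced here for illustration. It combines a text selection, a one-step graph navigation, a relational filter on a page property, and a ranked top-k cut in a single query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    title: str
    text: str


# Hypothetical toy repository: a page table plus a directed link graph.
PAGES = {
    "a.example/db": Page("a.example/db", "Databases",
                         "query optimizer for relational databases"),
    "b.example/ir": Page("b.example/ir", "Search",
                         "ranking text documents for a query"),
    "c.example/misc": Page("c.example/misc", "Misc", "cooking recipes"),
}
LINKS = {
    "a.example/db": ["b.example/ir", "c.example/misc"],
    "b.example/ir": ["a.example/db"],
    "c.example/misc": ["a.example/db"],
}


def select_text(pages, term):
    """Text view: keep pages whose text contains the term as a word."""
    return [p for p in pages if term in p.text.split()]


def navigate(pages, links, repo):
    """Graph view: follow out-links one step, without duplicates."""
    seen, out = set(), []
    for p in pages:
        for target in links.get(p.url, []):
            if target not in seen:
                seen.add(target)
                out.append(repo[target])
    return out


def rank_by_inlinks(pages, links):
    """Ranked result: score each page by in-link count, best first.

    In-link count is an assumed stand-in for an application-specific ranking.
    """
    counts = {p.url: 0 for p in pages}
    for targets in links.values():
        for t in targets:
            if t in counts:
                counts[t] += 1
    scored = [(counts[p.url], p) for p in pages]
    return sorted(scored, key=lambda sp: (-sp[0], sp[1].url))


def run_query(term, k):
    """Pages linked from pages mentioning `term`, with titles longer
    than four characters, cut to the k highest-ranked results."""
    seeds = select_text(PAGES.values(), term)
    neighbours = navigate(seeds, LINKS, PAGES)
    # Relational view: filter on a stored page property (title length).
    filtered = [p for p in neighbours if len(p.title) > 4]
    ranked = rank_by_inlinks(filtered, LINKS)
    return [(score, p.url) for score, p in ranked[:k]]


if __name__ == "__main__":
    print(run_query("query", 1))
```

In this sketch each step is applied eagerly and in a fixed order. A cost-based optimizer of the kind the abstract mentions would instead choose among equivalent plans, for example by deciding whether to filter before or after navigation.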