WIDM'03

Fine-grain web site structure discovery


Valter Crescenzi, Paolo Merialdo, and Paolo Missier


Return to Web data extraction and structure mining


Abstract

Several techniques have recently been proposed to automatically derive web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically in XML syntax. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually.

In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small, representative portion of it. The web site model we propose describes the structure of the site as a graph whose nodes are classes of pages that share a common structure, and whose edges represent links among instances of the page classes. Using this model, we have developed an algorithm that accepts the URL of an entry point to the target web site, visits a limited portion of the site, and produces an accurate model of the site structure. We also report on preliminary experiments performed on actual web sites, which have produced encouraging results.
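The site model described in the abstract (page classes as nodes, links among their instances as edges) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `build_site_model`, the input layout, and the idea of grouping pages by an identical precomputed "signature" string are all assumptions made here for illustration, standing in for whatever notion of "common structure" the paper actually uses. The sample URLs are likewise hypothetical.

```python
from collections import defaultdict


def build_site_model(pages):
    """Build a class-level graph from already-visited pages.

    pages: dict mapping url -> (signature, list of linked urls).
    The signature is a hypothetical stand-in for a page's structure.
    """
    # Nodes: pages with the same signature fall into the same class.
    classes = defaultdict(set)
    for url, (signature, _links) in pages.items():
        classes[signature].add(url)

    # Edges: class A points to class B if some instance of A
    # links to some instance of B (links to unvisited pages are skipped).
    edges = set()
    for url, (signature, links) in pages.items():
        for target in links:
            if target in pages:
                edges.add((signature, pages[target][0]))
    return dict(classes), edges


# Hypothetical sample: a tiny site with home, team, and player pages.
pages = {
    "/": ("home", ["/teams/a", "/teams/b"]),
    "/teams/a": ("team", ["/players/1", "/"]),
    "/teams/b": ("team", ["/players/2", "/"]),
    "/players/1": ("player", ["/teams/a"]),
    "/players/2": ("player", ["/teams/b"]),
}
classes, edges = build_site_model(pages)
```

In this sketch the five pages collapse into three classes, and the many page-level links collapse into a handful of class-level edges, which is the kind of compact description the abstract attributes to the proposed model.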


©2004 Association for Computing Machinery