Welcome to D
SIGMOD 2003
PODS 2003
SIGMOD-RECOR
ADBIS
CIDR 2003
CIKM 2003
DASFAA 2003
Data Enginee
DEBS
DMKD 2003
DOLAP 2003
DPDJ 2003
ER
GIS 2003
Hypertext 20
ICDE 2003
ICDM 2003
ICDT 2003
JCDL 2003
KRDB 2003
MIR 2003
MIS 2003
MMDB 2003
RIDE 2003
SBBD 2003
SIGIR 2003
SIGIR-FORUM
SIGKDD 2003
SIGKDD-EXP
SSDBM 2003
TIME 2003
TODS
VLDB 2003
VLDB Journal
WIDM 2003
<<< = WIDM'03 Pape>>>

Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites


Hasan Davulcu, S. Koduri, and S. Nagarajan

  View Paper (PDF)  

Return to Web data extraction and structure mining


Abstract

The advent of e-commerce has created a trend that brought thousands of catalogs online. Most of these websites are "taxonomy-directed". A Web site is said to be "taxonomy-directed" if it contains at least one taxonomy for organizing its contents and it presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which can automatically crawl and extract products from taxonomy-directed online catalogs. DataRover utilizes heuristic rules to discover the structural regularities among: taxonomy segments, list-of-product and single-product pages and it uses these regularities to turn the online catalogs into a database of categorized products without the need for user interaction or the wrapper maintenance burden. We provide experimental results to demonstrate the efficacy of the DataRover and point to its current limitations.


©2004 Association for Computing Machinery