
























|
 |
|
Computational Aspects of Resilient Data Extraction from Semistructured Sources
|
 |
Hasan Davulcu,
Guizhen Yang,
Michael Kifer, and
I. V. Ramakrishnan
View Paper (PDF)
Return to Semistructured Data
 |
|
Abstract
|
 |
Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so called "shopbots" that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each one of these sites.
The main problem one has to contend with when designing data extraction techniques is that the contents of a web page changes frequently, either because its data is generated dynamically, in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be "resilient", i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's ntent and layout.
In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
 |
|
References
|
 |
Note: References link to DBLP on the Web.
-
[1]
-
Serge Abiteboul
: Querying Semi-Structured Data.
ICDT 1997
: 1-18
-
[2]
-
Serge Abiteboul
,
Dallan Quass
,
Jason McHugh
,
Jennifer Widom
,
Janet L. Wiener
: The Lorel Query Language for Semistructured Data.
Int. J. on Digital Libraries 1(1)
: 68-88(1997)
-
[3]
-
Brad Adelberg
: NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents.
SIGMOD Conference 1998
: 283-294
-
[4]
-
Dana Angluin
: Finding Patterns Common to a Set of Strings (Extended Abstract).
STOC 1979
: 130-141
-
[5]
-
Dana Angluin
: Inductive Inference of Formal Languages from Positive Data.
Information and Control 45(2)
: 117-135(1980)
-
[6]
-
Naveen Ashish
,
Craig A. Knoblock
: Wrapper Generation for Semi-structured Internet Sources.
SIGMOD Record 26(4)
: 8-15(1997)
-
[7]
-
Paolo Atzeni
,
Giansalvatore Mecca
: Cut & Paste.
PODS 1997
: 144-153
-
[8]
-
Robert C. Berwick
,
Samuel F. Pilato
: Learning Syntax by Automata Induction.
Machine Learning 2(1)
: 9-38(1987)
-
[9]
-
Berthier A. Ribeiro-Neto
,
Alberto H. F. Laender
,
Altigran Soares da Silva
: Extracting Semi-Structured Data Through Examples.
CIKM 1999
: 94-101
-
[10]
-
Peter Buneman
: Semistructured Data.
PODS 1997
: 117-121
-
[11]
-
Peter Buneman
,
Susan B. Davidson
,
Gerd G. Hillebrand
,
Dan Suciu
: A Query Language and Optimization Techniques for Unstructured Data.
SIGMOD Conf. 1996
: 505-516
-
[12]
-
David W. Embley
,
Douglas M. Campbell
,
Y. S. Jiang
,
Stephen W. Liddle
,
Yiu-Kai Ng
,
Dallan Quass
,
Randy D. Smith
: Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages.
DKE 31(3)
: 227-251(1999)
-
[13]
-
Jean-Robert Gruser
,
Louiqa Raschid
,
M. E. Vidal
,
Laura Bright
: Wrapper Generation for Web Accessible Data Sources.
CoopIS 1998
: 14-23
-
[14]
-
...
-
[15]
-
Joachim Hammer
,
Hector Garcia-Molina
,
Svetlozar Nestorov
,
Ramana Yerneni
,
Markus M. Breunig
,
Vasilis Vassalos
: Template-Based Wrappers in the TSIMMIS System.
SIGMOD Conference 1997
: 532-535
-
[16]
-
John E. Hopcroft
,
Jeffrey D. Ullman
: Introduction to Automata Theory, Languages and Computation. Addison-Wesley 1979, ISBN 0-201-02888-X
-
[17]
-
...
-
[18]
-
Nicholas Kushmerick
,
Daniel S. Weld
,
Robert B. Doorenbos
: Wrapper Induction for Information Extraction.
IJCAI (1) 1997
: 729-737
-
[19]
-
Harry R. Lewis
,
Christos H. Papadimitriou
: Elements of the Theory of Computation.
Prentice-Hall
1981, ISBN 0-13-273417-6
-
[20]
-
Seung Jin Lim
,
Yiu-Kai Ng
: An Automated Approach for Retrieving Hierarchical Data from HTML Tables.
CIKM 1999
: 466-474
-
[21]
-
Mike Perkowitz
,
Robert B. Doorenbos
,
Oren Etzioni
,
Daniel S. Weld
: Learning to Understand Information on the Internet: An Example-Based Approach.
JIIS 8(2)
: 133-153(1997)
-
[22]
-
Arnaud Sahuguet
,
Fabien Azavant
: Web Ecology: Recycling HTML Pages as XML Documents Using W4F.
WebDB (Informal Proceedings) 1999
: 31-36
 |
|
BIBTEX
|
 |
@inproceedings{DBLP:conf/pods/DavulcuYKR00,
author = {Hasan Davulcu and
Guizhen Yang and
Michael Kifer and
I. V. Ramakrishnan},
title = {Computational Aspects of Resilient Data Extraction from Semistructured
Sources},
booktitle = {Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium
on Principles of Database Systems, May 15-17, 2000, Dallas, Texas,
USA},
publisher = {ACM},
year = {2000},
isbn = {1-58113-214-X},
pages = {136-144},
crossref = {DBLP:conf/pods/00},
bibsource = {DBLP, http://dblp.uni-trier.de} } },
DiSC'01 Copyright ©2002 ACM Inc.
|