NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents
Brad Adelberg

Abstract
Interesting structured or semistructured data often resides not in database systems but in HTML pages, text files, or on paper. Data in these formats cannot be used by standard query processing engines, so users need a way either to extract it from these sources into a DBMS or to write wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. The paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms developed by the author. It also describes the prototype, which is written in Java, and reports experiences parsing a variety of documents.
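
The workflow in the abstract (outline regions hierarchically, let a mining step guess the structure, then extract) can be illustrated with a minimal Python sketch. This is not NoDoSE's implementation: the prototype is written in Java and its mining algorithms are not given on this page. The `Region` class and the functions `extract`, `infer_separator`, and `propose_record` are hypothetical names, and the separator-guessing heuristic is only a stand-in for the paper's mining component.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """A user-outlined span [start, end) of the document, with a label."""
    label: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)


def extract(text: str, region: Region):
    """Leaf regions yield their text; inner regions yield (label, value) pairs."""
    if not region.children:
        return text[region.start:region.end]
    return [(child.label, extract(text, child)) for child in region.children]


def infer_separator(text: str, record: Region) -> Optional[str]:
    """Guess a field separator from the gaps between hand-marked fields.

    Returns None when there are fewer than two fields, when the gaps
    disagree, or when the fields are adjacent (empty gap).
    """
    kids = record.children
    gaps = {text[a.end:b.start] for a, b in zip(kids, kids[1:])}
    if len(gaps) != 1:
        return None
    return gaps.pop() or None


def propose_record(text: str, start: int, end: int,
                   labels: List[str], sep: str) -> Region:
    """Apply a guessed separator to another span, for the user to confirm."""
    fields, pos = [], start
    for label, piece in zip(labels, text[start:end].split(sep)):
        fields.append(Region(label, pos, pos + len(piece)))
        pos += len(piece) + len(sep)
    return Region("record", start, end, fields)


# Illustrative input only; not taken from the paper.
doc = "alice, 30, Evanston\nbob, 41, Chicago\n"

# Assumed user action: the fields of the first line are outlined by hand.
first = Region("record", 0, 19, [
    Region("name", 0, 5),
    Region("age", 7, 9),
    Region("city", 11, 19),
])

sep = infer_separator(doc, first)
labels = [child.label for child in first.children]
second = propose_record(doc, 20, 36, labels, sep)
rows = [extract(doc, first), extract(doc, second)]
```

The region tree is kept separate from the extraction step, so the same outline could be rendered into different output forms, which is the property the abstract attributes to the tool.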

BIBTEX

@inproceedings{DBLP:conf/sigmod/Adelberg98,
  author    = {Brad Adelberg},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured
               Data from Text Documents},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-955-5},
  pages     = {283-294},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}


DBLP: Copyright ©1999 by Michael Ley (ley@uni-trier.de).