Welcome to D
SIGMOD'00
 = SIGMOD'00 We
 = Plenary Talk
<<< = SIGMOD'00 Pa>>>
PODS'00
SIGMOD Recor
CIKM 2000/CI
COMAD 2000
Data Enginee
DL 2000
DPDJ
EDBT 2000
Hypertext 20
ICDE 2000
KDD 2000
KDD Explorat
KRDB 2000
SBBD 2000
SIGIR 2000
SIGIR Forum
SSDBM 2000
TODS
VLDB'00
VLDBJ

XTRACT: A System for Extracting Document Type Descriptors from XML Documents


Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, S. Seshadri, and Kyuseok Shim

  View Paper (PDF)  

Return to Research Sessions


Abstract

XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.


References


Note: References link to DBLP on the Web.

[1]
Serge Abiteboul : Querying Semi-Structured Data. ICDT 1997 : 1-18
[2]
Dana Angluin : On the Complexity of Minimum Inference of Regular Sets. Information and Control 39(3) : 337-350(1978)
[3]
Tim Bray , Jean Paoli , C. M. Sperberg-McQueen : Extensible Markup Language (XML). World Wide Web Journal 2(4) : 27-66(1997)
[4]
...
[5]
Alvis Brazma : Efficient Identification of Regular Expressions from Representative Examples. COLT 1993 : 236-242
[6]
Moses Charikar , Sudipto Guha : Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. FOCS 1999 : 378-388
[7]
Alin Deutsch , Mary F. Fernandez , Dan Suciu : Storing Semistructured Data with STORED. SIGMOD Conference 1999 : 431-442
[8]
Peter Buneman , Susan B. Davidson , Mary F. Fernandez , Dan Suciu : Adding Structure to Unstructured Data. ICDT 1997 : 336-350
[9]
...
[10]
E. Mark Gold : Language Identification in the Limit. Information and Control 10(5) : 447-474(1967)
[11]
E. Mark Gold : Complexity of Automaton Identification from Given Data. Information and Control 37(3) : 302-320(1978)
[12]
Roy Goldman , Jason McHugh , Jennifer Widom : From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. WebDB (Informal Proceedings) 1999 : 25-30
[13]
Roy Goldman , Jennifer Widom : DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997 : 436-445
[14]
...
[15]
John E. Hopcroft , Jeffrey D. Ullman : Introduction to Automata Theory, Languages and Computation. Addison-Wesley 1979, ISBN 0-201-02888-X
[16]
...
[17]
Manish Mehta , Jorma Rissanen , Rakesh Agrawal : MDL-Based Decision Tree Pruning. KDD 1995 : 216-221
[18]
Svetlozar Nestorov , Serge Abiteboul , Rajeev Motwani : Extracting Schema from Semistructured Data. SIGMOD Conference 1998 : 295-306
[19]
Leonard Pitt : Inductive Inference, DFAs, and Computational Complexity. AII 1989 : 18-44
[20]
J. Ross Quinlan , Ronald L. Rivest : Inferring Decision Trees Using the Minimum Description Length Principle. Information and Computation 80(3) : 227-248(1989)
[21]
...
[22]
...
[23]
Jayavel Shanmugasundaram , Kristin Tufte , Chun Zhang , Gang He , David J. DeWitt , Jeffrey F. Naughton : Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999 : 302-314
[24]
...
[25]
Jennifer Widom : Data Management for XML: Research Directions. IEEE Data Engineering Bulletin 22(3) : 44-52(1999)

Referenced by

  1. Gerhard Weikum : Review - XTRACT: A System for Extracting Document Type Descriptors from XML Documents. ACM SIGMOD Digital Review 2 : (2000)

BIBTEX


@inproceedings{DBLP:conf/sigmod/GarofalakisGRSS00,
  author    = {Minos N. Garofalakis and
                Aristides Gionis and
                Rajeev Rastogi and
                S. Seshadri and
                Kyuseok Shim},
   editor    = {Weidong Chen and
                Jeffrey F. Naughton and
                Philip A. Bernstein},
   title     = {XTRACT: A System for Extracting Document Type Descriptors from
                XML Documents},
   booktitle = {Proceedings of the 2000 ACM SIGMOD International Conference on
                Management of Data, May 16-18, 2000, Dallas, Texas, USA},
   journal   = {SIGMOD Record},
   publisher = {ACM},
   volume    = {29},
   number    = {2},
   year      = {2000},
   isbn      = {1-58113-218-2},
   pages     = {165-176},
   crossref  = {DBLP:conf/sigmod/2000},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },




DiSC'01 Copyright ©2002 ACM Inc.