2003 Digital Symposium Collection

QXtract: A Building Block for Efficient Information Extraction from Plain-Text Databases

Eugene Agichtein and Luis Gravano

Abstract

QXtract is a system that automatically generates queries to identify the promising database documents for extraction by an arbitrary information extraction system. By focusing only on potentially useful documents and ignoring the rest, we can dramatically improve the efficiency and scalability of information extraction. QXtract (Figure 1) discovers the characteristics of documents that are useful for extraction of a target relation by sampling the database with tuples for the relation. This document sample is then processed by the information extraction system of choice, resulting in an automatically "labeled" training sample of "useful" and "useless" documents. Machine learning and information retrieval techniques are then used to generate queries for retrieving additional useful documents, which are in turn processed by the information extraction system to extract the final relation. We demonstrate a practical and efficient information extraction architecture based on QXtract. Extracting a user-defined relation using our system involves three stages: task specification, QXtract training, and the final extraction of the target relation. Our prototype can incorporate user feedback during all stages of the process.
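The pipeline described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the six-document database, the `search` keyword interface, the regex-based `extract` stand-in for an information extraction system, the fixed `extra_ids` added to the sample so that it contains useless documents, and the term-scoring rule in `generate_queries` (document frequency in useful documents minus document frequency in useless ones), which is a simple substitute for the machine learning and information retrieval techniques the abstract mentions.

```python
import re
from collections import Counter

# Toy plain-text "database"; contents are invented for illustration.
DOCS = [
    "Microsoft is headquartered in Redmond and reported earnings",
    "Apple is headquartered in Cupertino according to filings",
    "The weather in Paris was mild this spring",
    "Intel is headquartered in Santa Clara",
    "A recipe for bread needs flour and water",
    "Oracle is headquartered in Austin",
]

def search(query):
    """Hypothetical keyword search: ids of documents containing every query term."""
    terms = query.lower().split()
    return [i for i, d in enumerate(DOCS)
            if all(t in d.lower().split() for t in terms)]

def extract(doc):
    """Stand-in extraction system for a Headquarters(company, location) relation."""
    pattern = r"(\w+) is headquartered in ((?:[A-Z]\w+ ?)+)"
    return [(c, loc.strip()) for c, loc in re.findall(pattern, doc)]

def generate_queries(useful, useless, k=1):
    """Pick the k terms that best separate useful from useless documents.

    Illustrative scoring only: document frequency among useful documents
    minus document frequency among useless documents.
    """
    df_useful = Counter(t for i in useful for t in set(DOCS[i].lower().split()))
    df_useless = Counter(t for i in useless for t in set(DOCS[i].lower().split()))
    ranked = sorted(df_useful, key=lambda t: (-(df_useful[t] - df_useless[t]), t))
    return ranked[:k]

def run(seed_tuples, extra_ids, k=1):
    # 1. Sample the database by querying with known tuples for the relation.
    sample = set(extra_ids)  # assumption: a few other documents join the sample
    for tup in seed_tuples:
        sample.update(search(" ".join(tup)))
    # 2. Label the sample automatically by running the extraction system on it.
    useful = [i for i in sample if extract(DOCS[i])]
    useless = [i for i in sample if not extract(DOCS[i])]
    # 3. Learn queries from the labeled sample.
    queries = generate_queries(useful, useless, k)
    # 4. Retrieve additional documents with those queries and extract the relation.
    retrieved = sorted({i for q in queries for i in search(q)})
    relation = sorted({t for i in retrieved for t in extract(DOCS[i])})
    return queries, relation

queries, relation = run(
    seed_tuples=[("Microsoft", "Redmond"), ("Apple", "Cupertino")],
    extra_ids=[2, 4],
)
print(queries)
print(relation)
```

In this toy run the final extraction step only touches documents matched by the learned query, so the two unrelated documents are not processed again after the sampling stage. That is the efficiency argument of the abstract in miniature.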

BIBTEX

@inproceedings{DBLP:conf/sigmod/AgichteinG03,
  author    = {Eugene Agichtein and
               Luis Gravano},
  booktitle = {SIGMOD Conference},
  title     = {QXtract: A Building Block for Efficient Information Extraction from Plain-Text Databases.},
  pages     = {663},
  year      = {2003},
  url       = {db/conf/sigmod/sigmod2003.html#AgichteinG03},
  ee        = {http://www.acm.org/sigmod/sigmod03/eproceedings/papers/dem07.pdf},
  crossref  = {conf/sigmod/2003},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}