Determining Text Databases to Search in the Internet
Weiyi Meng, King-Lup Liu, Clement T. Yu, Xiaodong Wang, Yuhsi Chang, Naphtali Rishe

Abstract
Text data on the Internet can be naturally partitioned into many databases. Efficient retrieval of desired data can be achieved if we can accurately predict the usefulness of each database, because with such information, we only need to retrieve potentially useful documents from useful databases. In this paper, we propose two new methods for estimating the usefulness of text databases. For a given query, the usefulness of a text database in this paper is defined to be the number of documents in the database that are sufficiently similar to the query. Such a usefulness measure enables naive users to make informed decisions about which databases to search. We also consider the collection fusion problem. Because local databases may employ similarity functions that are different from that used by the global database, the threshold used by a local database to determine whether a document is potentially useful may be different from that used by the global database. We provide techniques that determine the best threshold for a given local database.
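
The usefulness measure defined in the abstract can be illustrated with a small sketch. This is not one of the paper's two estimation methods: it computes the quantity exactly by scanning every document, which is the cost those methods are meant to avoid. The cosine similarity over raw term counts, the strict "greater than" comparison, and the names `cosine_similarity` and `usefulness` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine similarity over raw term counts (an assumed similarity function)."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def usefulness(query, database, threshold):
    """Count the documents whose similarity to the query exceeds the threshold.

    Brute-force illustration of the definition, not an estimator.
    """
    return sum(1 for doc in database if cosine_similarity(query, doc) > threshold)


# Hypothetical three-document database.
database = [
    "text database search",
    "cooking recipes",
    "text retrieval",
]
print(usefulness("text database", database, threshold=0.6))
```

Raising or lowering `threshold` changes how many documents count as useful, which is why a local database with a different similarity function may need a different threshold from the global one.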

References

References, where available, link to the DBLP on the World Wide Web.

[ALSF97]
...
[BuSA93]
...
[CLBC95]
James P. Callan, Zhihong Lu, W. Bruce Croft: Searching Distributed Collections with Inference Networks. SIGIR 1995: 21-28
[DuHa73]
...
[Gass69]
...
[GrGM95a]
Luis Gravano, Hector Garcia-Molina: Generalizing GIOSS to Vector-Space Databases and Broker Hierarchies. VLDB 1995: 78-89
[GrGM95b]
...
[GrGM97]
Luis Gravano, Hector Garcia-Molina: Merging Ranks from Heterogeneous Internet Sources. VLDB 1997: 196-205
[Harm93]
...
[HoDr97]
...
[KaMe91]
...
[Kost94]
Martijn Koster: ALIWEB - Archie-like Indexing in the WEB. Computer Networks and ISDN Systems 27(2): 175-182(1994)
[Kow97]
...
[LaYu82]
K. Lam, Clement T. Yu: A Clustered Search Algorithm Incorporating Arbitrary Term Dependencies. TODS 7(3): 500-508(1982)
[MaBi97]
...
[MLYW98]
...
[NCS]
...
[SaMc83]
Gerard Salton, Michael McGill: Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0
[Salt89]
Gerard Salton: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley 1989, ISBN 0-201-12227-8
[SeEt95]
...
[SeEt97]
...
[TVGJ95]
...
[VGJL95]
Ellen M. Voorhees, Narendra Kumar Gupta, Ben Johnson-Laird: Learning Collection Fusion Strategies. SIGIR 1995: 172-179
[Widd89]
...
[YaGM95]
Tak W. Yan, Hector Garcia-Molina: SIFT - a Tool for Wide-Area Information Dissemination. USENIX Winter 1995: 177-186
[YuLS78]
Clement T. Yu, W. S. Luk, M. K. Siu: On the Estimation of the Number of Desired Records with Respect to a Given Query. TODS 3(1): 41-56(1978)
[YuLe97]
Budi Yuwono, Dik Lun Lee: Server Ranking for Distributed Text Retrieval Systems on the Internet. DASFAA 1997: 41-50
BIBTEX

@inproceedings{DBLP:conf/vldb/MengLYWCR98,
author = {Weiyi Meng and
King-Lup Liu and
Clement T. Yu and
Xiaodong Wang and
Yuhsi Chang and
Naphtali Rishe},
editor = {Ashish Gupta and
Oded Shmueli and
Jennifer Widom},
title = {Determining Text Databases to Search in the Internet},
booktitle = {VLDB'98, Proceedings of 24th International Conference on Very
Large Data Bases, August 24-27, 1998, New York City, New York,
USA},
publisher = {Morgan Kaufmann},
year = {1998},
isbn = {1-55860-566-5},
pages = {14-25},
crossref = {DBLP:conf/vldb/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}


DBLP: Copyright ©1999 by Michael Ley (ley@uni-trier.de).