A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces
Roger Weber, Hans-Jorg Schek, Stephen Blott
Full Paper (PDF)

Abstract
For similarity search in high-dimensional vector spaces (or 'HDVSs'), researchers have proposed a number of new methods (or adaptations of existing methods) based, in the main, on data-space partitioning. However, the performance of these methods generally degrades as dimensionality increases. Although this phenomenon - known as the 'dimensional curse'- is well known, little or no quantitative analysis of the phenomenon is available. In this paper, we provide a detailed analysis of partitioning and clustering techniques for similarity search in HDVSs. We show formally that these methods exhibit linear complexity at high dimensionality, and that existing methods are outperformed on average by a simple sequential scan if the number of dimensions exceeds around 10. Consequently, we come up with an alternative organization based on approximations to make the unavoidable sequential scan as fast as possible. We describe a simple vector approximation scheme, called VA-file, and report on an experimental evaluation of this and of two tree-based index methods (an R*-tree and an X-tree).

References

References, where available, link to the DBLP on the World Wide Web.

[1]
Daniel Barbará, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond T. Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik: The New Jersey Data Reduction Report. Data Engineering Bulletin 20(4): 3-45(1997)
[2]
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331
[3]
Jon Louis Bentley, Jerome H. Friedman: Data Structures for Range Searching. Computing Surveys 11(4): 397-409(1979)
[4]
Stefan Berchtold, Christian Böhm, Bernhard Braunmüller, Daniel A. Keim, Hans-Peter Kriegel: Fast Similarity Search in Multimedia Databases. SIGMOD Conference 1997: 1-12
[5]
Stefan Berchtold, Christian Böhm, Daniel A. Keim, Hans-Peter Kriegel: A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space. PODS 1997: 78-86
[6]
Stefan Berchtold, Christian Böhm, Hans-Peter Kriegel: Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations. EDBT 1998: 216-230
[7]
Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree : An Index Structure for High-Dimensional Data. VLDB 1996: 28-39
[8]
Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider: Comparison of Approximations of Complex Objects Used for Approximation-based Query Processing in Spatial Database Systems. ICDE 1993: 40-49
[9]
Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: Multi-Step Processing of Spatial Joins. SIGMOD Conference 1994: 197-208
[10]
Paolo Ciaccia, Marco Patella, Pavel Zezula: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997: 426-435
[11]
...
[12]
...
[13]
...
[14]
Christos Faloutsos: Access Methods for Text. Computing Surveys 17(1): 49-74(1985)
[15]
...
[16]
Christos Faloutsos, Stavros Christodoulakis: Description and Performance Analysis of Signature File Methods for Office Filing. TOIS 5(3): 237-257(1987)
[17]
Christos Faloutsos, Ibrahim Kamel: Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. PODS 1994: 4-13
[18]
Raphael A. Finkel, Jon Louis Bentley: Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4: 1-9(1974)
[19]
Myron Flickner, Harpreet S. Sawhney, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker: Query by Image and Video Content: The QBIC System. IEEE Computer 28(9): 23-32(1995)
[20]
Jerome H. Friedman, Jon Louis Bentley, Raphael A. Finkel: An Algorithm for Finding Best Matches in Logarithmic Expected Time. TOMS 3(3): 209-226(1977)
[21]
Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57
[22]
Joseph M. Hellerstein, Elias Koutsoupias, Christos H. Papadimitriou: On the Analysis of Indexing Schemes. PODS 1997: 249-256
[23]
Gisli R. Hjaltason, Hanan Samet: Ranking in Spatial Databases. SSD 1995: 83-95
[24]
Norio Katayama, Shin'ichi Satoh: The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. SIGMOD Conference 1997: 369-380
[25]
King-Ip Lin, H. V. Jagadish, Christos Faloutsos: The TV-Tree: An Index Structure for High-Dimensional Data. VLDB Journal 3(4): 517-542(1994)
[26]
David B. Lomet, Betty Salzberg: The hB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance. TODS 15(4): 625-658(1990)
[27]
Jürg Nievergelt, Hans Hinterberger, Kenneth C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure. TODS 9(1): 38-71(1984)
[28]
John T. Robinson: The K-D-B-Tree: A Search Structure For Large Multidimensional Dynamic Indexes. SIGMOD Conference 1981: 10-18
[29]
Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
[30]
Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518
[31]
Robert F. Sproull: Refinements to Nearest-Neighbor Searching in k-Dimensional Trees. Algorithmica 6(4): 579-589(1991)
[32]
Markus A. Stricker, Markus Orengo: Similarity of Color Images. Storage and Retrieval for Image and Video Databases (SPIE) 1995: 381-392
[33]
Jeffrey D. Ullman: Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press 1988, ISBN 0-7167-8158-1
[34]
...
BIBTEX

@inproceedings{DBLP:conf/vldb/WeberSB98,
author = {Roger Weber and
Hans-J{\"o}rg Schek and
Stephen Blott},
editor = {Ashish Gupta and
Oded Shmueli and
Jennifer Widom},
title = {A Quantitative Analysis and Performance Study for Similarity-Search
Methods in High-Dimensional Spaces},
booktitle = {VLDB'98, Proceedings of 24rd International Conference on Very
Large Data Bases, August 24-27, 1998, New York City, New York,
USA},
publisher = {Morgan Kaufmann},
year = {1998},
isbn = {1-55860-566-5},
pages = {194-205},
crossref = {DBLP:conf/vldb/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}


DBLP: Copyright ©1999 by Michael Ley (ley@uni-trier.de).