Welcome to D
SIGMOD 2003
<<< = SIGMOD'03 Pa>>>
PODS 2003
SIGMOD-RECOR
ADBIS
CIDR 2003
CIKM 2003
DASFAA 2003
Data Enginee
DEBS
DMKD 2003
DOLAP 2003
DPDJ 2003
ER
GIS 2003
Hypertext 20
ICDE 2003
ICDM 2003
ICDT 2003
JCDL 2003
KRDB 2003
MIR 2003
MIS 2003
MMDB 2003
RIDE 2003
SBBD 2003
SIGIR 2003
SIGIR-FORUM
SIGKDD 2003
SIGKDD-EXP
SSDBM 2003
TIME 2003
TODS
VLDB 2003
VLDB Journal
WIDM 2003

Efficient similarity search and classification via rank aggregation


Ronald Fagin, Ravi Kumar, and D. Sivakumar

  View Paper (PDF)  

Return to Similarity Queries I


Abstract

We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of in- dependent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors. One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + epsilon)-factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. Also, we extend our approach to deal with the k nearest neighbors. We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed?similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency.

BIBTEX


@inproceedings       {DBLP:conf/sigmod/FaginKS03,
  author    = {Ronald Fagin and
                Ravi Kumar and
                D. Sivakumar},
   booktitle = {SIGMOD Conference},
   title     = {Efficient similarity search and classification via rank aggregation.},
   pages     = {301-312},
   year      = {2003},
   url       = {db/conf/sigmod/sigmod2003.html#FaginKS03},
   ee        = {http://www.acm.org/sigmod/sigmod03/eproceedings/papers/r12p01.pdf},
   crossref  = {conf/sigmod/2003},
   bibsource = {DBLP, http://dblp.uni-trier.de} 
}



©2004 Association for Computing Machinery