Digital Symposium Collection 2000  

 
 
 
 
 
 

 





















Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

Alexander Hinneburg and Daniel A. Keim

  View Paper (PDF)  

Return to High-Dimensional Queries

Abstract
Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called ``curse of dimensionality''. In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condension-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis (2) it is by far more effective than existing clustering algorithms for high-dimensional data (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.


References

Note: References link to DBLP on the Web.

[AGG+98]
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan : Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998 : 94-105
[BGRS99]
Kevin S. Beyer , Jonathan Goldstein , Raghu Ramakrishnan , Uri Shaft : When Is ''Nearest Neighbor'' Meaningful? ICDT 1999 : 217-235
[DJSGM97]
...
[EKSX96]
Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996 : 226-231
[DJSGM97]
Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : Density-Connected Sets and their Application for Trend Detection in Spatial Databases. KDD 1997 : 10-15
[EKX95]
Martin Ester , Hans-Peter Kriegel , Xiaowei Xu : Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995 : 67-82
[FL95]
Christos Faloutsos , King-Ip Lin : FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995 : 163-174
[FH75]
...
[HK98]
Alexander Hinneburg , Daniel A. Keim : An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD 1998 : 58-65
[HK99]
...
[Hub85]
...
[Jag91]
H. V. Jagadish : A Retrieval Technique for Similar Shapes. SIGMOD Conference 1991 : 208-217
[Kuk92]
Karen Kukich : Techniques for Automatically Correcting Words in Text. Computing Surveys 24(4) : 377-439(1992)
[MG95]
Rajiv Mehrotra , James E. Gary : Feature-Index-Based Similar Shape Retrieval. VDB 1995 : 46-65
[MN89]
Kurt Mehlhorn , Stefan Näher : LEDA: A Library of Efficient Data Types and Algorithms. MFCS 1989 : 88-106
[NH94]
Raymond T. Ng , Jiawei Han : Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994 : 144-155
[Sch96]
...
[Sco92]
...
[Sil86]
...
[SCZ98]
Gholamhosein Sheikholeslami , Surojit Chatterjee , Aidong Zhang : WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB 1998 : 428-439
[SH94]
...
[WW80]
...
[WYM97]
Wei Wang , Jiong Yang , Richard R. Muntz : STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB 1997 : 186-195
[XEKS98]
Xiaowei Xu , Martin Ester , Hans-Peter Kriegel , Jörg Sander : A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998 : 324-331
[ZRL96]
Tian Zhang , Raghu Ramakrishnan , Miron Livny : BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conf. 1996 : 103-114

BIBTEX

@inproceedings{DBLP:conf/vldb/KeimH99,
  author    = {Alexander Hinneburg and
                Daniel A. Keim},
   editor    = {Malcolm P. Atkinson and
                Maria E. Orlowska and
                Patrick Valduriez and
                Stanley B. Zdonik and
                Michael L. Brodie},
   title     = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
                in High-Dimensional Clustering},
   booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
                Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
                UK},
   publisher = {Morgan Kaufmann},
   year      = {1999},
   isbn      = {1-55860-615-5},
   pages     = {506-517},
   crossref  = {DBLP:conf/vldb/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM