DiSC - Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

Digital Symposium Collection 2000

Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

Alexander Hinneburg and Daniel A. Keim
View Paper (PDF)

Return to High-Dimensional Queries

Abstract

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called ``curse of dimensionality''. In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condension-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis (2) it is by far more effective than existing clustering algorithms for high-dimensional data (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.

References

Note: References link to DBLP on the Web.

[AGG+98]: Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan : Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998 : 94-105
[BGRS99]: Kevin S. Beyer , Jonathan Goldstein , Raghu Ramakrishnan , Uri Shaft : When Is ''Nearest Neighbor'' Meaningful? ICDT 1999 : 217-235
[DJSGM97]: ...
[EKSX96]: Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996 : 226-231
[DJSGM97]: Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : Density-Connected Sets and their Application for Trend Detection in Spatial Databases. KDD 1997 : 10-15
[EKX95]: Martin Ester , Hans-Peter Kriegel , Xiaowei Xu : Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995 : 67-82
[FL95]: Christos Faloutsos , King-Ip Lin : FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995 : 163-174
[FH75]: ...
[HK98]: Alexander Hinneburg , Daniel A. Keim : An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD 1998 : 58-65
[HK99]: ...
[Hub85]: ...
[Jag91]: H. V. Jagadish : A Retrieval Technique for Similar Shapes. SIGMOD Conference 1991 : 208-217
[Kuk92]: Karen Kukich : Techniques for Automatically Correcting Words in Text. Computing Surveys 24(4) : 377-439(1992)
[MG95]: Rajiv Mehrotra , James E. Gary : Feature-Index-Based Similar Shape Retrieval. VDB 1995 : 46-65
[MN89]: Kurt Mehlhorn , Stefan Näher : LEDA: A Library of Efficient Data Types and Algorithms. MFCS 1989 : 88-106
[NH94]: Raymond T. Ng , Jiawei Han : Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994 : 144-155
[Sch96]: ...
[Sco92]: ...
[Sil86]: ...
[SCZ98]: Gholamhosein Sheikholeslami , Surojit Chatterjee , Aidong Zhang : WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB 1998 : 428-439
[SH94]: ...
[WW80]: ...
[WYM97]: Wei Wang , Jiong Yang , Richard R. Muntz : STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB 1997 : 186-195
[XEKS98]: Xiaowei Xu , Martin Ester , Hans-Peter Kriegel , Jörg Sander : A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998 : 324-331
[ZRL96]: Tian Zhang , Raghu Ramakrishnan , Miron Livny : BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conf. 1996 : 103-114

BIBTEX

@inproceedings{DBLP:conf/vldb/KeimH99,
  author    = {Alexander Hinneburg and
                Daniel A. Keim},
   editor    = {Malcolm P. Atkinson and
                Maria E. Orlowska and
                Patrick Valduriez and
                Stanley B. Zdonik and
                Michael L. Brodie},
   title     = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
                in High-Dimensional Clustering},
   booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
                Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
                UK},
   publisher = {Morgan Kaufmann},
   year      = {1999},
   isbn      = {1-55860-615-5},
   pages     = {506-517},
   crossref  = {DBLP:conf/vldb/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },

Copyright(C) 2000 ACM