|




















|
|
 |
|
 |
|
Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering
|
Alexander Hinneburg and
Daniel A. Keim
View Paper (PDF)
Return to High-Dimensional Queries
Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called ``curse of dimensionality''. In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condension-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called
OptiGrid
which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are (1) it has a firm mathematical basis (2) it is by far more effective than existing clustering algorithms for high-dimensional data (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.
Note: References link to DBLP on the Web.
-
[AGG+98]
-
Rakesh Agrawal
,
Johannes Gehrke
,
Dimitrios Gunopulos
,
Prabhakar Raghavan
: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
SIGMOD Conference 1998
: 94-105
-
[BGRS99]
-
Kevin S. Beyer
,
Jonathan Goldstein
,
Raghu Ramakrishnan
,
Uri Shaft
: When Is ''Nearest Neighbor'' Meaningful?
ICDT 1999
: 217-235
-
[DJSGM97]
-
...
-
[EKSX96]
-
Martin Ester
,
Hans-Peter Kriegel
,
Jörg Sander
,
Xiaowei Xu
: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
KDD 1996
: 226-231
-
[DJSGM97]
-
Martin Ester
,
Hans-Peter Kriegel
,
Jörg Sander
,
Xiaowei Xu
: Density-Connected Sets and their Application for Trend Detection in Spatial Databases.
KDD 1997
: 10-15
-
[EKX95]
-
Martin Ester
,
Hans-Peter Kriegel
,
Xiaowei Xu
: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification.
SSD 1995
: 67-82
-
[FL95]
-
Christos Faloutsos
,
King-Ip Lin
: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets.
SIGMOD Conference 1995
: 163-174
-
[FH75]
-
...
-
[HK98]
-
Alexander Hinneburg
,
Daniel A. Keim
: An Efficient Approach to Clustering in Large Multimedia Databases with Noise.
KDD 1998
: 58-65
-
[HK99]
-
...
-
[Hub85]
-
...
-
[Jag91]
-
H. V. Jagadish
: A Retrieval Technique for Similar Shapes.
SIGMOD Conference 1991
: 208-217
-
[Kuk92]
-
Karen Kukich
: Techniques for Automatically Correcting Words in Text.
Computing Surveys 24(4)
: 377-439(1992)
-
[MG95]
-
Rajiv Mehrotra
,
James E. Gary
: Feature-Index-Based Similar Shape Retrieval.
VDB 1995
: 46-65
-
[MN89]
-
Kurt Mehlhorn
,
Stefan Näher
: LEDA: A Library of Efficient Data Types and Algorithms.
MFCS 1989
: 88-106
-
[NH94]
-
Raymond T. Ng
,
Jiawei Han
: Efficient and Effective Clustering Methods for Spatial Data Mining.
VLDB 1994
: 144-155
-
[Sch96]
-
...
-
[Sco92]
-
...
-
[Sil86]
-
...
-
[SCZ98]
-
Gholamhosein Sheikholeslami
,
Surojit Chatterjee
,
Aidong Zhang
: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases.
VLDB 1998
: 428-439
-
[SH94]
-
...
-
[WW80]
-
...
-
[WYM97]
-
Wei Wang
,
Jiong Yang
,
Richard R. Muntz
: STING: A Statistical Information Grid Approach to Spatial Data Mining.
VLDB 1997
: 186-195
-
[XEKS98]
-
Xiaowei Xu
,
Martin Ester
,
Hans-Peter Kriegel
,
Jörg Sander
: A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases.
ICDE 1998
: 324-331
-
[ZRL96]
-
Tian Zhang
,
Raghu Ramakrishnan
,
Miron Livny
: BIRCH: An Efficient Data Clustering Method for Very Large Databases.
SIGMOD Conf. 1996
: 103-114
@inproceedings{DBLP:conf/vldb/KeimH99,
author = {Alexander Hinneburg and
Daniel A. Keim},
editor = {Malcolm P. Atkinson and
Maria E. Orlowska and
Patrick Valduriez and
Stanley B. Zdonik and
Michael L. Brodie},
title = {Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality
in High-Dimensional Clustering},
booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
UK},
publisher = {Morgan Kaufmann},
year = {1999},
isbn = {1-55860-615-5},
pages = {506-517},
crossref = {DBLP:conf/vldb/99},
bibsource = {DBLP, http://dblp.uni-trier.de} } },
Copyright(C) 2000 ACM
|
|
|
|
|
|
|