DiSC - SIGMOD Paper Index

CURE: An Efficient Clustering Algorithm for Large Databases

Sudipto Guha, Rajeev Rastogi, Kyuseok Shim

Full Paper (PDF)
Slides (PDF)

Abstract

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

References

References, where available, link to the DBLP on the World Wide Web.

[BKSS90]

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331

[CLR90]

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest: Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company 1989, ISBN 0-262-03141-8 and 0-07-013143-0

[EKSX96]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231

[EKX95]

Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: A Database Interface for Clustering in Large Spatial Databases. KDD 1995: 94-99

[GRS97]

...

[JD88]

Anil K. Jain, Richard C. Dubes: Algorithms for Clustering Data. Prentice-Hall 1988

[MR95]

Rajeev Motwani, Prabhakar Raghavan: Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5

[NH94]

Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155

[Ols93]

...

[Sam89]

...

[Sam90]

Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990

[SRF87]

Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518

[Vit85]

Jeffrey Scott Vitter: Random Sampling with a Reservoir. TOMS 11(1): 37-57(1985)

[ZRL96]

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conf. 1996: 103-114

BIBTEX

@inproceedings{DBLP:conf/sigmod/GuhaRS98,
author = {Sudipto Guha and
Rajeev Rastogi and
Kyuseok Shim},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {CURE: An Efficient Clustering Algorithm for Large Databases},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-955-5},
pages = {73-84},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}