DiSC - VLDB'98 Papers

Incremental Clustering for Mining in a Data Warehousing Environment

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu


Abstract

Data warehouses provide many opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in batch mode, e.g., during the night. All patterns derived from the warehouse by some data mining algorithm then have to be updated as well. Because the databases are very large, it is highly desirable to perform these updates incrementally. In this paper, we present the first incremental clustering algorithm. Our algorithm is based on the clustering algorithm DBSCAN, which is applicable to any database containing data from a metric space, e.g., to a spatial database or to a WWW-log database. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Thus, efficient algorithms can be given for incremental insertions and deletions to an existing clustering. Based on the formal definition of clusters, it can be proven that the incremental algorithm yields the same result as DBSCAN. A performance evaluation of Incremental DBSCAN on a spatial database as well as on a WWW-log database is presented, demonstrating the efficiency of the proposed algorithm. Incremental DBSCAN yields significant speed-up factors over DBSCAN even for large numbers of daily updates in a data warehouse.
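The locality property stated in the abstract can be illustrated with a small sketch. The code below is not the paper's Incremental DBSCAN. It is a brute-force batch DBSCAN, written under the usual definitions (an object is a core object if its Eps-neighborhood, including itself, holds at least MinPts objects), plus a check of which existing objects change their core status when one object is inserted. The function names `region_query`, `core_flags`, `dbscan` and `changed_core_status` and the sample data are assumptions made for this illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def core_flags(points, eps, min_pts):
    """True for every point whose eps-neighborhood holds at least min_pts points."""
    return [len(region_query(points, i, eps)) >= min_pts
            for i in range(len(points))]

def dbscan(points, eps, min_pts):
    """Brute-force batch DBSCAN; returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # only core points expand the cluster
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

def changed_core_status(points, new_point, eps, min_pts):
    """Indices of existing points whose core status flips when new_point is inserted."""
    before = core_flags(points, eps, min_pts)
    after = core_flags(points + [new_point], eps, min_pts)
    return [i for i in range(len(points)) if before[i] != after[i]]

# Hypothetical data: two small groups on a line plus one far-away noise point.
points = [(0, 0), (6, 0), (12, 0), (30, 0), (36, 0), (42, 0), (100, 100)]
eps, min_pts = 10, 3
new_point = (21, 0)  # falls into the gap between the two groups

print(dbscan(points, eps, min_pts))                # [0, 0, 0, 1, 1, 1, -1]
print(dbscan(points + [new_point], eps, min_pts))  # [0, 0, 0, 0, 0, 0, -1, 0]

flipped = changed_core_status(points, new_point, eps, min_pts)
print(flipped)                                     # [2, 3]
# Every object whose core status changed lies in the Eps-neighborhood of the insertion.
print(all(dist(points[i], new_point) <= eps for i in flipped))  # True
```

In this example the inserted object merges the two clusters, yet the only objects whose core status changes are the two inside its Eps-neighborhood. An incremental method can therefore start its work from that neighborhood rather than rescanning the whole database, whereas this sketch simply reruns the batch algorithm to show the result.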

References


[AF 96]

...

[AS 94]

Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499

[BKSS 90]

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331

[Bou 96]

Athman Bouguettaya: On-Line Clustering. TKDE 8(2): 333-339 (1996)

[CHNW 96]

David Wai-Lok Cheung, Jiawei Han, Vincent Ng, C. Y. Wong: Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. ICDE 1996: 106-114

[CPZ 97]

Paolo Ciaccia, Marco Patella, Pavel Zezula: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997: 426-435

[EKSX 96]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231

[EKX 95]

Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995: 67-82

[EW 98]

Martin Ester, R. Wittmann: Incremental Generalization for Mining in a Data Warehousing Environment. EDBT 1998: 135-149

[FAAM 97]

Ronen Feldman, Yonatan Aumann, Amihood Amir, Heikki Mannila: Efficient Algorithms for Discovering Frequent Sets in Incremental Databases. DMKD 1997: 0-

[FPS 96]

Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth: Knowledge Discovery and Data Mining: Towards a Unifying Framework. KDD 1996: 82-88

[Gue 94]

Ralf Hartmut Güting: An Introduction to Spatial Database Systems. VLDB Journal 3(4): 357-399 (1994)

[HCC 93]

Jiawei Han, Yandong Cai, Nick Cercone: Data-Driven Discovery of Quantitative Rules in Relational Databases. TKDE 5(1): 29-40 (1993)

[Huy 97]

Nam Huyn: Multiple-View Self-Maintenance in Data Warehousing Environments. VLDB 1997: 26-35

[KR 90]

L. Kaufman, P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley 1990

[Luo 95]

...

[MJHS 96]

...

[MQM 97]

Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997: 100-111

[NH 94]

Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155

[SEKX 98]

...

[Sib 73]

R. Sibson: SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method. The Computer Journal 16(1): 30-34 (1973)

[ZRL 96]

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114

BIBTEX

@inproceedings{DBLP:conf/vldb/EsterKSWX98,
author = {Martin Ester and
Hans-Peter Kriegel and
J{\"o}rg Sander and
Michael Wimmer and
Xiaowei Xu},
editor = {Ashish Gupta and
Oded Shmueli and
Jennifer Widom},
title = {Incremental Clustering for Mining in a Data Warehousing Environment},
booktitle = {VLDB'98, Proceedings of 24th International Conference on Very
Large Data Bases, August 24-27, 1998, New York City, New York,
USA},
publisher = {Morgan Kaufmann},
year = {1998},
isbn = {1-55860-566-5},
pages = {323-333},
crossref = {DBLP:conf/vldb/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}