Welcome to D
SIGMOD 2003
PODS 2003
SIGMOD-RECOR
ADBIS
CIDR 2003
CIKM 2003
DASFAA 2003
Data Enginee
DEBS
DMKD 2003
DOLAP 2003
DPDJ 2003
ER
GIS 2003
Hypertext 20
ICDE 2003
ICDM 2003
ICDT 2003
JCDL 2003
KRDB 2003
MIR 2003
MIS 2003
MMDB 2003
RIDE 2003
SBBD 2003
SIGIR 2003
SIGIR-FORUM
SIGKDD 2003
SIGKDD-EXP
SSDBM 2003
TIME 2003
TODS
VLDB 2003
<<< = VLDB'03 Pape>>>
 = Plenary Talk
VLDB Journal
WIDM 2003

A Framework for Clustering Evolving Data Streams


Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu

  View Paper (PDF)  

Return to Data Mining/Streams (Session A2)


Abstract

The clustering problem is a diffcult problem for the data stream domain. This is because the large volumes of data arriving in a stream renders most traditional algorithms too inef- ficient. In recent years, a few one-pass clus- tering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evo- lution of the data and do not address the fol- lowing issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discov- ering and exploring clusters over different por- tions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one- pass clustering algorithms is not very use- ful from an application point of view. For example, a simple one-pass clustering algo- rithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously per- form dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally dif- ferent philosophy for data stream clustering which is guided by application-centered re- quirements. The idea is divide the clustering process into an online component which pe- riodically stores detailed summary statistics and an online component which uses only this summary statistics. The online component is utilized by the analyst who can use a wide va- riety of inputs (such as time horizon or num- ber of clusters) in order to provide a quick un- derstanding of the broad clusters in the data stream. The problems of efficient choice, stor- age, and use of this statistical data for a fast data stream turns out to be quite tricky. For this purpose, we use the concepts of a pyrami- dal time frame in conjunction with a micro- clustering approach. Our performance ex- periments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.


©2004 Association for Computing Machinery