2003 Digital Symposium Collection

A Framework for Clustering Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu
View Paper (PDF)

Return to Data Mining/Streams (Session A2)

Abstract

The clustering problem is a diffcult problem for the data stream domain. This is because the large volumes of data arriving in a stream renders most traditional algorithms too inef- ficient. In recent years, a few one-pass clus- tering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evo- lution of the data and do not address the fol- lowing issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discov- ering and exploring clusters over different por- tions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one- pass clustering algorithms is not very use- ful from an application point of view. For example, a simple one-pass clustering algo- rithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously per- form dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally dif- ferent philosophy for data stream clustering which is guided by application-centered re- quirements. The idea is divide the clustering process into an online component which pe- riodically stores detailed summary statistics and an online component which uses only this summary statistics. The online component is utilized by the analyst who can use a wide va- riety of inputs (such as time horizon or num- ber of clusters) in order to provide a quick un- derstanding of the broad clusters in the data stream. The problems of efficient choice, stor- age, and use of this statistical data for a fast data stream turns out to be quite tricky. For this purpose, we use the concepts of a pyrami- dal time frame in conjunction with a micro- clustering approach. Our performance ex- periments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.