2001 Digital Symposium Collection

SQLEM: Fast Clustering in SQL using the EM Algorithm

Carlos Ordonez and Paul Cereghini

Abstract

Clustering is one of the most important tasks performed in Data Mining applications. This paper presents an efficient SQL implementation of the EM algorithm to perform clustering in very large databases. Our version can effectively handle high dimensional data, a high number of clusters and, more importantly, a very large number of data records. We present three strategies to implement EM in SQL: horizontal, vertical, and hybrid. We expect this work to be useful for data mining programmers and users who want to cluster large data sets inside a relational DBMS.
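
As a rough illustration of what running EM inside SQL can look like, the sketch below performs one E-step for a Gaussian mixture using a "vertical" table layout, taken here to mean one row per (record, dimension) value. It is not the paper's actual SQL: the table names (y, c, r), the function name e_step_vertical, the diagonal covariance shared by all clusters, and the use of SQLite through Python are all assumptions made for the example. The per-cluster distances are computed by a join and GROUP BY in SQL; the final normalization is done in Python to keep the sketch short.

```python
import math
import sqlite3


def e_step_vertical(points, means, variances, weights):
    """One E-step of EM for a Gaussian mixture (hypothetical sketch).

    points:    list of d-dimensional tuples
    means:     list of k d-dimensional tuples, one per cluster
    variances: d-dimensional tuple, shared by all clusters (assumption)
    weights:   list of k mixture weights
    Returns {record id: [responsibility of each cluster]}.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE y (rid INTEGER, dim INTEGER, val REAL);   -- data values
        CREATE TABLE c (cid INTEGER, dim INTEGER, mean REAL);  -- cluster means
        CREATE TABLE r (dim INTEGER, var REAL);                -- variances
    """)
    conn.executemany(
        "INSERT INTO y VALUES (?, ?, ?)",
        [(i, j, v) for i, p in enumerate(points) for j, v in enumerate(p)])
    conn.executemany(
        "INSERT INTO c VALUES (?, ?, ?)",
        [(k, j, m) for k, mu in enumerate(means) for j, m in enumerate(mu)])
    conn.executemany(
        "INSERT INTO r VALUES (?, ?)", list(enumerate(variances)))

    # Squared Mahalanobis distance of every record to every cluster mean,
    # summed over dimensions by the GROUP BY.
    rows = conn.execute("""
        SELECT y.rid, c.cid,
               SUM((y.val - c.mean) * (y.val - c.mean) / r.var) AS dist
        FROM y
        JOIN c ON y.dim = c.dim
        JOIN r ON y.dim = r.dim
        GROUP BY y.rid, c.cid
        ORDER BY y.rid, c.cid
    """).fetchall()
    conn.close()

    # Unnormalized posteriors. The Gaussian normalizing constant is the same
    # for every cluster because the covariance is shared, so it cancels.
    resp = {}
    for rid, cid, dist in rows:
        resp.setdefault(rid, []).append(weights[cid] * math.exp(-0.5 * dist))

    # Normalize per record. Records very far from all means would underflow
    # to a zero total; a real implementation would need to guard against it.
    for rid, probs in resp.items():
        total = sum(probs)
        resp[rid] = [p / total for p in probs]
    return resp
```

For example, with means (0, 0) and (10, 10), unit variances, and equal weights, a record at (5, 5) is equidistant from both means and receives responsibility 0.5 for each cluster, while a record at (0, 0) is assigned almost entirely to the first cluster.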

References

[1]: Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, Jong Soo Park: Fast Algorithms for Projected Clustering. SIGMOD Conference 1999: 61-72
[2]: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998: 94-105
[3]: P. S. Bradley, Usama M. Fayyad, Cory Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15
[4]: ...
[5]: John Clear, Debbie Dunn, Brad Harvey, Michael L. Heytens, Peter Lohman, Abhay Mehta, Mark Melton, Lars Rohrberg, Ashok Savasere, Robert M. Wehrmeister, Melody Xu: NonStop SQL/MX Primitives for Knowledge Discovery. KDD 1999: 425-429
[6]: ...
[7]: Richard Dubes, Anil K. Jain: Clustering Methodologies in Exploratory Data Analysis. Advances in Computers 19: 113-228 (1980)
[8]: ...
[9]: William DuMouchel, Chris Volinsky, Theodore Johnson, Corinna Cortes, Daryl Pregibon: Squashing Flat Files Flatter. KDD 1999: 6-15
[10]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231
[11]: Alexander Hinneburg, Daniel A. Keim: Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. VLDB 1999: 506-517
[12]: ...
[13]: F. Murtagh: A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal 26(4): 354-359 (1983)
[14]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155
[15]: Carlos Ordonez, Edward Omiecinski: Discovering Association Rules Based on Image Content. ADL 1999: 38-49
[16]: ...
[17]: Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114

BIBTEX

@inproceedings{DBLP:conf/sigmod/OrdonezC00,
  author    = {Carlos Ordonez and
               Paul Cereghini},
  editor    = {Weidong Chen and
               Jeffrey F. Naughton and
               Philip A. Bernstein},
  title     = {SQLEM: Fast Clustering in SQL using the EM Algorithm},
  booktitle = {Proceedings of the 2000 ACM SIGMOD International Conference on
               Management of Data, May 16-18, 2000, Dallas, Texas, USA},
  journal   = {SIGMOD Record},
  publisher = {ACM},
  volume    = {29},
  number    = {2},
  year      = {2000},
  isbn      = {1-58113-218-2},
  pages     = {559-570},
  crossref  = {DBLP:conf/sigmod/2000},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}