
























|
 |
|
Density Biased Sampling: An Improved Method for Data Mining and Clustering
|
 |
Christopher R. Palmer and
Christos Faloutsos
View Paper (PDF)
Return to Research Sessions
 |
|
Abstract
|
 |
Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and also frequently criticized because it will miss small clusters. Many natural phenomena are known to follow Zipf's distribution and the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions finding up to a factor of six improvement over uniform sampling.
 |
|
References
|
 |
Note: References link to DBLP on the Web.
-
[1]
-
Alfred V. Aho
,
Ravi Sethi
,
Jeffrey D. Ullman
: Compilers: Princiles, Techniques, and Tools. Addison-Wesley 1986, ISBN 0-201-10088-6
-
[2]
-
P. S. Bradley
,
Usama M. Fayyad
,
Cory Reina
: Scaling Clustering Algorithms to Large Databases.
KDD 1998
: 9-15
-
[3]
-
...
-
[4]
-
Chris Buckley
,
Mandar Mitra
,
Janet A. Walz
,
Claire Cardie
: Using Clustering and SuperConcepts Within SMART: TREC 6.
TREC 1997
: 107-124
-
[5]
-
Christos Faloutsos
,
H. V. Jagadish
: On B-Tree Indices for Skewed Distributions.
VLDB 1992
: 363-374
-
[6]
-
Christos Faloutsos
,
Yossi Matias
,
Abraham Silberschatz
: Modeling Skewed Distribution Using Multifractals and the `80-20' Law.
VLDB 1996
: 307-317
-
[7]
-
Usama M. Fayyad
,
Cory Reina
,
P. S. Bradley
: Initialization of Iterative Refinement Clustering Algorithms.
KDD 1998
: 194-198
-
[8]
-
Phillip B. Gibbons
,
Yossi Matias
: New Sampling-Based Summary Statistics for Improving Approximate Query Answers.
SIGMOD Conference 1998
: 331-342
-
[9]
-
Sudipto Guha
,
Rajeev Rastogi
,
Kyuseok Shim
: CURE: An Efficient Clustering Algorithm for Large Databases.
SIGMOD Conference 1998
: 73-84
-
[10]
-
Peter J. Haas
,
Jeffrey F. Naughton
,
S. Seshadri
,
Lynne Stokes
: Sampling-Based Estimation of the Number of Distinct Values of an Attribute.
VLDB 1995
: 311-322
-
[11]
-
...
-
[12]
-
Joseph M. Hellerstein
,
Peter J. Haas
,
Helen Wang
: Online Aggregation.
SIGMOD Conference 1997
: 171-182
-
[13]
-
...
-
[14]
-
...
-
[15]
-
...
-
[16]
-
...
-
[17]
-
Frank Olken
,
Doron Rotem
,
Ping Xu
: Random Sampling from Hash Files.
SIGMOD Conference 1990
: 375-386
-
[18]
-
Edie M. Rasmussen
: Clustering Algorithms.
Information Retrieval: Data Structures & Algorithms 1992
: 419-442
-
[19]
-
...
-
[20]
-
Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. W. H. Freeman 1991
-
[21]
-
Jeffrey Scott Vitter
: Random Sampling with a Reservoir.
TOMS 11(1)
: 37-57(1985)
-
[22]
-
Oren Zamir
,
Oren Etzioni
: Web Document Clustering: A Feasibility Demonstration.
SIGIR 1998
: 46-54
-
[23]
-
Tian Zhang
,
Raghu Ramakrishnan
,
Miron Livny
: BIRCH: An Efficient Data Clustering Method for Very Large Databases.
SIGMOD Conf. 1996
: 103-114
-
[24]
-
George Kingsley Zipf: Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology. Addison-Wesley 1949
 |
|
BIBTEX
|
 |
@inproceedings{DBLP:conf/sigmod/PalmerF00,
author = {Christopher R. Palmer and
Christos Faloutsos},
editor = {Weidong Chen and
Jeffrey F. Naughton and
Philip A. Bernstein},
title = {Density Biased Sampling: An Improved Method for Data Mining and
Clustering},
booktitle = {Proceedings of the 2000 ACM SIGMOD International Conference on
Management of Data, May 16-18, 2000, Dallas, Texas, USA},
journal = {SIGMOD Record},
publisher = {ACM},
volume = {29},
number = {2},
year = {2000},
isbn = {1-58113-218-2},
pages = {82-92},
crossref = {DBLP:conf/sigmod/2000},
bibsource = {DBLP, http://dblp.uni-trier.de} } },
DiSC'01 Copyright ©2002 ACM Inc.
|