
























|
 |
|
Towards Estimation Error Guarantees for Distinct Values
|
 |
Moses Charikar,
Surajit Chaudhuri,
Rajeev Motwani, and
Vivek R. Narasayya
View Paper (PDF)
Return to Sampling
 |
|
Abstract
|
 |
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.
 |
|
References
|
 |
Note: References link to DBLP on the Web.
-
[1]
-
Noga Alon
,
Yossi Matias
,
Mario Szegedy
: The Space Complexity of Approximating the Frequency Moments.
STOC 1996
: 20-29
-
[2]
-
Morton M. Astrahan
,
Mario Schkolnick
,
Kyu-Young Whang
: Approximating the number of unique values of an attribute without sorting.
IS 12(1)
: 11-15(1987)
-
[3]
-
...
-
[4]
-
...
-
[5]
-
...
-
[6]
-
...
-
[7]
-
...
-
[8]
-
Surajit Chaudhuri
,
Rajeev Motwani
,
Vivek R. Narasayya
: Random Sampling for Histogram Construction: How much is enough?
SIGMOD Conference 1998
: 436-447
-
[9]
-
Surajit Chaudhuri
,
Rajeev Motwani
,
Vivek R. Narasayya
: On Random Sampling over Joins.
SIGMOD Conference 1999
: 263-274
-
[10]
-
Surajit Chaudhuri
,
Vivek R. Narasayya
: An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server.
VLDB 1997
: 146-155
-
[11]
-
Sheldon J. Finkelstein
,
Mario Schkolnick
,
Paolo Tiberio
: Physical Database Design for Relational Databases.
TODS 13(1)
: 91-128(1988)
-
[12]
-
Philippe Flajolet
,
G. Nigel Martin
: Probabilistic Counting.
FOCS 1983
: 76-82
-
[13]
-
...
-
[14]
-
...
-
[15]
-
Peter J. Haas
,
Jeffrey F. Naughton
,
S. Seshadri
,
Lynne Stokes
: Sampling-Based Estimation of the Number of Distinct Values of an Attribute.
VLDB 1995
: 311-322
-
[16]
-
...
-
[17]
-
...
-
[18]
-
Wen-Chi Hou
,
Gultekin Özsoyoglu
,
Baldeo K. Taneja
: Statistical Estimators for Relational Algebra Expressions.
PODS 1988
: 276-287
-
[19]
-
Wen-Chi Hou
,
Gultekin Özsoyoglu
,
Baldeo K. Taneja
: Processing Aggregate Relational Queries with Hard Time Constraints.
SIGMOD Conference 1989
: 68-77
-
[20]
-
Donald E. Knuth
: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley 1973, ISBN 0-201-03803-X
-
[21]
-
Rajeev Motwani
,
Prabhakar Raghavan
: Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5
-
[22]
-
Jeffrey F. Naughton
,
S. Seshadri
: On Estimating the Size of Projections.
ICDT 1990
: 499-513
-
[23]
-
Frank Olken
: Random Sampling from Databases. Ph.D. thesis, University of California at Berkeley LBL Technical Report 1993
-
[24]
-
...
-
[25]
-
Gultekin Özsoyoglu
,
Kaizheng Du
,
A. Tjahjana
,
Wen-Chi Hou
,
D. Y. Rowland
: On Estimating COUNT, SUM, and AVERAGE.
DEXA 1991
: 406-412
-
[26]
-
Viswanath Poosala
,
Yannis E. Ioannidis
,
Peter J. Haas
,
Eugene J. Shekita
: Improved Histograms for Selectivity Estimation of Range Predicates.
SIGMOD Conf. 1996
: 294-305
-
[27]
-
...
-
[28]
-
H. S. Sichel
: Anatomy of the Generalized Inverse Gaussian-Poisson Distribution with Special Applications to Bibliometric Studies.
Information Processing and Management 28(1)
: 5-18(1992)
-
[29]
-
...
-
[30]
-
Kyu-Young Whang
,
Brad T. Vander Zanden
,
Howard M. Taylor
: A Linear-Time Probabilistic Counting Algorithm for Database Applications.
TODS 15(2)
: 208-229(1990)
-
[31]
-
George Kingsley Zipf: Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology. Addison-Wesley 1949
 |
|
BIBTEX
|
 |
@inproceedings{DBLP:conf/pods/CharikarCMN00,
author = {Moses Charikar and
Surajit Chaudhuri and
Rajeev Motwani and
Vivek R. Narasayya},
title = {Towards Estimation Error Guarantees for Distinct Values},
booktitle = {Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium
on Principles of Database Systems, May 15-17, 2000, Dallas, Texas,
USA},
publisher = {ACM},
year = {2000},
isbn = {1-58113-214-X},
pages = {268-279},
crossref = {DBLP:conf/pods/00},
bibsource = {DBLP, http://dblp.uni-trier.de} } },
DiSC'01 Copyright ©2002 ACM Inc.
|