Welcome to D
SIGMOD'00
PODS'00
 = PODS'00 Webs
 = Plenary Talk
<<< = PODS'00 Pape>>>
SIGMOD Recor
CIKM 2000/CI
COMAD 2000
Data Enginee
DL 2000
DPDJ
EDBT 2000
Hypertext 20
ICDE 2000
KDD 2000
KDD Explorat
KRDB 2000
SBBD 2000
SIGIR 2000
SIGIR Forum
SSDBM 2000
TODS
VLDB'00
VLDBJ

Towards Estimation Error Guarantees for Distinct Values


Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya

  View Paper (PDF)  

Return to Sampling


Abstract

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.


References


Note: References link to DBLP on the Web.

[1]
Noga Alon , Yossi Matias , Mario Szegedy : The Space Complexity of Approximating the Frequency Moments. STOC 1996 : 20-29
[2]
Morton M. Astrahan , Mario Schkolnick , Kyu-Young Whang : Approximating the number of unique values of an attribute without sorting. IS 12(1) : 11-15(1987)
[3]
...
[4]
...
[5]
...
[6]
...
[7]
...
[8]
Surajit Chaudhuri , Rajeev Motwani , Vivek R. Narasayya : Random Sampling for Histogram Construction: How much is enough? SIGMOD Conference 1998 : 436-447
[9]
Surajit Chaudhuri , Rajeev Motwani , Vivek R. Narasayya : On Random Sampling over Joins. SIGMOD Conference 1999 : 263-274
[10]
Surajit Chaudhuri , Vivek R. Narasayya : An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. VLDB 1997 : 146-155
[11]
Sheldon J. Finkelstein , Mario Schkolnick , Paolo Tiberio : Physical Database Design for Relational Databases. TODS 13(1) : 91-128(1988)
[12]
Philippe Flajolet , G. Nigel Martin : Probabilistic Counting. FOCS 1983 : 76-82
[13]
...
[14]
...
[15]
Peter J. Haas , Jeffrey F. Naughton , S. Seshadri , Lynne Stokes : Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995 : 311-322
[16]
...
[17]
...
[18]
Wen-Chi Hou , Gultekin Özsoyoglu , Baldeo K. Taneja : Statistical Estimators for Relational Algebra Expressions. PODS 1988 : 276-287
[19]
Wen-Chi Hou , Gultekin Özsoyoglu , Baldeo K. Taneja : Processing Aggregate Relational Queries with Hard Time Constraints. SIGMOD Conference 1989 : 68-77
[20]
Donald E. Knuth : The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley 1973, ISBN 0-201-03803-X
[21]
Rajeev Motwani , Prabhakar Raghavan : Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5
[22]
Jeffrey F. Naughton , S. Seshadri : On Estimating the Size of Projections. ICDT 1990 : 499-513
[23]
Frank Olken : Random Sampling from Databases. Ph.D. thesis, University of California at Berkeley LBL Technical Report 1993
[24]
...
[25]
Gultekin Özsoyoglu , Kaizheng Du , A. Tjahjana , Wen-Chi Hou , D. Y. Rowland : On Estimating COUNT, SUM, and AVERAGE. DEXA 1991 : 406-412
[26]
Viswanath Poosala , Yannis E. Ioannidis , Peter J. Haas , Eugene J. Shekita : Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. 1996 : 294-305
[27]
...
[28]
H. S. Sichel : Anatomy of the Generalized Inverse Gaussian-Poisson Distribution with Special Applications to Bibliometric Studies. Information Processing and Management 28(1) : 5-18(1992)
[29]
...
[30]
Kyu-Young Whang , Brad T. Vander Zanden , Howard M. Taylor : A Linear-Time Probabilistic Counting Algorithm for Database Applications. TODS 15(2) : 208-229(1990)
[31]
George Kingsley Zipf: Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology. Addison-Wesley 1949

BIBTEX


@inproceedings{DBLP:conf/pods/CharikarCMN00,
  author    = {Moses Charikar and
                Surajit Chaudhuri and
                Rajeev Motwani and
                Vivek R. Narasayya},
   title     = {Towards Estimation Error Guarantees for Distinct Values},
   booktitle = {Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium
                on Principles of Database Systems, May 15-17, 2000, Dallas, Texas,
                USA},
   publisher = {ACM},
   year      = {2000},
   isbn      = {1-58113-214-X},
   pages     = {268-279},
   crossref  = {DBLP:conf/pods/00},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },




DiSC'01 Copyright ©2002 ACM Inc.