New Sampling-Based Summary Statistics for Improving Approximate Query Answers
Phillip B. Gibbons, Yossi Matias
Full Paper (PDF)

Abstract
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. Before DBMSs providing highly-accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.

References

References, where available, link to the DBLP on the World Wide Web.

[AMS96]
...
[Ant92]
Gennady Antoshenkov: Random Sampling from Pseudo-Ranked B+ Trees. VLDB 1992: 375-382
[AS94]
Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
[AZ96]
Gennady Antoshenkov, Mohamed Ziauddin: Query Processing and Optimization in Oracle Rdb. VLDB Journal 5(4): 229-237(1996)
[BDF+97]
Daniel Barbará, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond T. Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik: The New Jersey Data Reduction Report. Data Engineering Bulletin 20(4): 3-45(1997)
[BM96]
Roberto J. Bayardo Jr., Daniel P. Miranker: Processing Queries for First Few Answers. CIKM 1996: 45-52
[BMUT97]
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur: Dynamic Itemset Counting and Implication Rules for Market Basket Data. SIGMOD Conference 1997: 255-264
[FJS97]
Christos Faloutsos, H. V. Jagadish, Nikolaos Sidiropoulos: Recovering Information from Summary Data. VLDB 1997: 36-45
[Fla85]
Philippe Flajolet: Approximate Counting: A Detailed Analysis. BIT 25(1): 113-134(1985)
[FM83]
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting. FOCS 1983: 76-82
[FM85]
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. JCSS 31(2): 182-209(1985)
[GM95]
...
[GM97]
...
[GMP97a]
...
[GMP97b]
Phillip B. Gibbons, Yossi Matias, Viswanath Poosala: Fast Incremental Maintenance of Approximate Histograms. VLDB 1997: 466-475
[GPA+98]
...
[HHW97]
Joseph M. Hellerstein, Peter J. Haas, Helen Wang: Online Aggregation. SIGMOD Conference 1997: 171-182
[HK95]
...
[HNSS95]
Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes: Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995: 311-322
[IC93]
Yannis E. Ioannidis, Stavros Christodoulakis: Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results. TODS 18(4): 709-748(1993)
[Ioa93]
Yannis E. Ioannidis: Universality of Serial Histograms. VLDB 1993: 256-267
[IP95]
Yannis E. Ioannidis, Viswanath Poosala: Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conference 1995: 233-244
[Mat92]
...
[Mor78]
Robert Morris: Counting Large Numbers of Events in Small Registers. CACM 21(10): 840-842(1978)
[MSY96]
...
[MVN93]
...
[MVY94]
...
[OR89]
Frank Olken, Doron Rotem: Random Sampling from B+ Trees. VLDB 1989: 269-277
[OR92]
Frank Olken, Doron Rotem: Maintenance of Materialized Views of Sampling Queries. ICDE 1992: 632-641
[PIHS96]
Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, Eugene J. Shekita: Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. 1996: 294-305
[Pre97]
...
[Vit85]
Jeffrey Scott Vitter: Random Sampling with a Reservoir. TOMS 11(1): 37-57(1985)
[VL93]
Susan V. Vrbsky, Jane W.-S. Liu: APPROXIMATE - A Query Processor that Produces Monotonically Improving Approximate Answers. TKDE 5(6): 1056-1068(1993)
[WVZT90]
Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications. TODS 15(2): 208-229(1990) Referenced By:
  1. Yossi Matias, Jeffrey Scott Vitter, Min Wang: Wavelet-Based Histograms for Selectivity Estimation. SIGMOD Conference 1998: 448-459
BIBTEX

@inproceedings{DBLP:conf/sigmod/GibbonsM98,
author = {Phillip B. Gibbons and
Yossi Matias},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {New Sampling-Based Summary Statistics for Improving Approximate
Query Answers},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-955-5},
pages = {331-342},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}


DBLP: Copyright ©1999 by Michael Ley (ley@uni-trier.de).