Topics in Probabilistic and Statistical Databases, Lecture 10
TRANSCRIPT
Topics in Probabilistic and Statistical Databases
Lecture 10: Sampling and Review
Dan Suciu University of Washington
References
• Towards Estimation Error Guarantees for Distinct Values, Charikar, Chaudhuri, Motwani, Narasayya, PODS 2000
• Sampling-Based Estimation of the Number of Distinct Values of an Attribute, Haas, Naughton, Seshadri, Stokes, VLDB 1995
Distinct Values
• Problem definition:
  – Data set with n tuples
  – Column of interest has values {1, …, D}
  – Let ni = number of times value i occurs
  – n = Σi=1,D ni

• Goal: estimate D; denote the estimate Ď

• Error is Ď/D or D/Ď, whichever is ≥ 1
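The error metric above is easy to state as a one-line helper; a minimal illustration of the definition (not code from the lecture):

```python
def ratio_error(d_est, d_true):
    """Ratio error from the slide: max(Ď/D, D/Ď), always >= 1;
    equals 1 exactly when the estimate is perfect."""
    return max(d_est / d_true, d_true / d_est)
```

Note that under this metric, underestimating by a factor of 2 (Ď = 50, D = 100) and overestimating by a factor of 2 (Ď = 200, D = 100) are penalized equally.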
Negative Result
Theorem [Charikar’00] Consider any (possibly adaptive and randomized) estimator Ď for the number of distinct values D that examines at most r rows in a table with n rows. Then, for any γ > exp(−r), there exists a choice of the input data such that, with probability at least γ:

error(Ď) ≥ sqrt( (n−r)/(2r) · ln(1/γ) )
Proof in class
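To get a feel for how strong this negative result is, one can simply evaluate the bound; the concrete numbers below (a 1% sample of a million-row table) are illustrative choices, not from the slides:

```python
import math

def charikar_lower_bound(n, r, gamma):
    """Evaluate the lower bound sqrt((n-r)/(2r) * ln(1/gamma)) on the
    ratio error of any estimator that examines r of n rows."""
    assert gamma > math.exp(-r), "the theorem requires gamma > exp(-r)"
    return math.sqrt((n - r) / (2 * r) * math.log(1 / gamma))

# With probability at least 5%, any estimator reading 10,000 of
# 1,000,000 rows can be forced to a ratio error of roughly 12.
bound = charikar_lower_bound(n=1_000_000, r=10_000, gamma=0.05)
```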
Estimators
• Goodman’s unbiased estimator

• Many specialized estimators from the statistics literature (won’t discuss; see [Haas’95])

• GEE [Charikar’00]; will discuss because it matches the lower bound
Notations
• Select a random sample of size r
• d = number of distinct values in the sample
• fi = number of distinct values that occur exactly i times in the sample

• Thus:
  d = Σi=1,r fi
  r = Σi=1,r i·fi
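These quantities are straightforward to compute from an actual sample; a minimal sketch (the column data and sample size below are made up for illustration):

```python
import random
from collections import Counter

def sample_profile(column, r, seed=0):
    """Draw a uniform sample of r rows without replacement and return
    d (distinct values in the sample) and f, where f[i] is the number
    of distinct values occurring exactly i times in the sample."""
    sample = random.Random(seed).sample(column, r)
    counts = Counter(sample)        # value -> multiplicity in sample
    f = Counter(counts.values())    # multiplicity i -> f_i
    return len(counts), f

# The two identities from the slide hold by construction:
column = [i % 50 for i in range(1000)]    # 50 distinct values
d, f = sample_profile(column, r=100)
assert d == sum(f.values())                       # d = Σ f_i
assert 100 == sum(i * fi for i, fi in f.items())  # r = Σ i·f_i
```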
Goodman’s Unbiased Estimator
Goodman proved in 1949 that:
• If r ≥ max(n1, …, nD), then there exists exactly one unbiased estimator
• If r < max(n1, …, nD), then there exists no unbiased estimator

Very unstable in practice, with ratio errors as high as 20,000%
The GEE Estimator
Definition. The GEE estimator is:

Ď = sqrt(n/r)·f1 + Σi=2,r fi

Theorem. The expected ratio error of GEE is O(sqrt(n/r))
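A direct transcription of the estimator (variable names are mine):

```python
import math

def gee_estimate(n, f):
    """GEE from the slide: Ď = sqrt(n/r)·f1 + Σ_{i≥2} f_i, where n is
    the table size and f maps multiplicity i to f_i (so r = Σ i·f_i).
    Intuition: values seen exactly once may hide many unseen values,
    so f1 is scaled up; values seen twice or more are counted once."""
    r = sum(i * fi for i, fi in f.items())
    return math.sqrt(n / r) * f.get(1, 0) + sum(
        fi for i, fi in f.items() if i >= 2)
```

For example, with n = 100 and a sample in which five values appear once and three appear twice (so r = 11), the estimate is sqrt(100/11)·5 + 3 ≈ 18.1.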
Review of this Course
Three areas in Probabilistic and Statistical Databases
• Explicit probabilities
• Implicit probabilities
• Statistics
Explicit Probabilistic Data
• “Classical” probabilistic databases
• Each tuple has a probability value
  – “maybe-tuple”
  – “x-tuple”
• Possible worlds semantics
Explicit Probabilistic Data
• What are some key applications?

• What is lineage and why is it important?
Explicit Probabilistic Data
• Rule of thumb 1:
  – ProbDB = IncompleteDB + Probabilities

• Rule of thumb 2:
  – ProbDB = Disjoint/IndependentDB + Joins

• Rule of thumb 3:
  – GM Factorization = DB-normalization + prob-identities
Explicit Probabilistic Data
Query evaluation is #P-hard in general:

• General methods: Monte Carlo, OBDDs, …

• Safe queries and safe plans

• Top-k query answering
Explicit Probabilistic Data
• Major Open Research Problems [IN CLASS]
Implicit Probabilistic Data
• All tuples have the same probability
• What are the major differences from explicit probabilistic data?
Implicit Probabilistic Data
• Dense random graphs
  – Pr(t) = ½

• Fagin’s 0/1 law for FO
  – For every FO sentence φ, limn→∞ Pr(φ) = 0 or 1

• “Theory of almost certain sentences” = ?
• “THE random graph” = ?
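The flavor of the 0/1 law can be checked empirically: a fixed FO-expressible property, such as “the graph contains a triangle,” holds with probability tending to 1 on dense random graphs. A small Monte Carlo sketch (graph size and trial count are arbitrary choices):

```python
import itertools
import random

def has_triangle(n, p, rng):
    """Sample G(n, p) and test the FO property 'contains a triangle'."""
    edge = {pair: rng.random() < p
            for pair in itertools.combinations(range(n), 2)}
    # combinations yields sorted triples (a < b < c), matching the keys.
    return any(edge[(a, b)] and edge[(a, c)] and edge[(b, c)]
               for a, b, c in itertools.combinations(range(n), 3))

# With Pr(t) = 1/2, even small graphs almost surely contain a triangle.
rng = random.Random(0)
freq = sum(has_triangle(12, 0.5, rng) for _ in range(50)) / 50
```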
Implicit Probabilistic Data
• Material random graphs:
  – Pr(t) = β / n^arity(R)

• Every conjunctive query has an explicit asymptotic formula:
  – Pr(q) = C(q) / n^exp(q) + O(1/n^(exp(q)+1))
Implicit Probabilistic Data
• General random graphs: G(n,p) [WHAT IS THAT?]

• Erdős and Rényi’s theorem

• Random graphs G(n, β/n^α):
  – Threshold values for α (no 0/1 laws): 2, 1+1/2, 1+1/3, …, 1+1/k, …, 1, [rationals], 0
  – Everywhere else: 0/1 law for FO
Implicit Probabilistic Data
• The major applications today: – ?
• … but great theory!
Implicit Probabilistic Data
• Research topics: [IN CLASS]
Data Statistics
• What is their main usage in database systems?
Data Statistics
• Histograms
  – Eq-width, eq-depth, V-optimal

• Sampling
  – Sequential sampling techniques
  – Join synopses
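As a reminder of the two simplest histogram variants, here is a minimal sketch (bucket-boundary conventions vary across systems; this is an illustration, not code from the lecture):

```python
def eq_width_histogram(values, k):
    """Equi-width: k buckets spanning equal slices of the value range;
    returns the count of values falling in each bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1          # avoid zero width if all equal
    counts = [0] * k
    for v in values:
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts

def eq_depth_histogram(values, k):
    """Equi-depth: boundaries chosen so each bucket holds roughly the
    same number of values; returns the k+1 bucket boundaries."""
    s = sorted(values)
    return [s[(j * len(s)) // k] for j in range(k)] + [s[-1]]
```

On uniform data the two coincide; on skewed data equi-depth adapts its boundaries to the distribution, which is why it usually gives better selectivity estimates.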
Data Statistics
• Limitations of how data statistics are used today: [IN CLASS]
• Major research topics in data statistics: [IN CLASS]
Final Thoughts
• Computer Science in the past:
  – Driven by better algorithms

• Computer Science today:
  – Driven by massive amounts of data
  – Processed with approximate methods
  – Data itself is often imprecise

• Computer Science tomorrow:
  – Probabilistic databases