probabilistic/uncertain data management -- iv

27
1 Probabilistic/ Uncertain Data Management -- IV 1. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004. 2. Sen, Deshpande. “Representing and Querying Correlated Tuples in Probabilistic DBs”, ICDE’2007.

Upload: norina

Post on 25-Feb-2016

41 views

Category:

Documents


0 download

DESCRIPTION

Probabilistic/Uncertain Data Management -- IV. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004. Sen, Deshpande. “Representing and Querying Correlated Tuples in Probabilistic DBs”, ICDE’2007. INST = P (TUP) N = 2 M. TUP = {t 1 , t 2 , …, t M }. = all tuples. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Probabilistic/Uncertain Data Management -- IV

1

Probabilistic/Uncertain Data Management -- IV

1. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004.

2. Sen, Deshpande. “Representing and Querying Correlated Tuples in Probabilistic DBs”, ICDE’2007.

Page 2: Probabilistic/Uncertain Data Management -- IV

2

A Restricted Formalism: Explicit Independent Tuples

Pr(I) = t 2 I pr(t) £ t I (1-pr(t))

No restrictionspr : TUP ! [0,1]

Tuple independent probabilistic database

INST = P(TUP)N = 2M

TUP = {t1, t2, …, tM} = all tuples

Page 3: Probabilistic/Uncertain Data Management -- IV

3

Tuple Prob. ) Possible WorldsName City pr

John Seattle p1 = 0.8

Sue Boston p2 = 0.6

Fred Boston p3 = 0.9Ip = Name City

John Seattl

Sue Bosto

Fred Bosto

Name City

Sue Bosto

Fred Bosto

Name City

John Seattl

Fred Bosto

Name City

John Seattl

Sue Bosto

Name City

Fred Bosto

Name City

Sue Bosto

Name City

John Seattl;

I1

(1-p1)(1-p2)(1-p3)

I2

p1(1-p2)(1-p3)

I3

(1-p1)p2(1-p3)

I4

(1-p1)(1-p2)p3

I5

p1p2(1-p3)

I6

p1(1-p2)p3

I7

(1-p1)p2p3

I8

p1p2p3

= 1

J = E[ size(Ip) ] =

2.3 tuples

Page 4: Probabilistic/Uncertain Data Management -- IV

4

Tuple-Independent DBs are IncompleteName Address pr

John Seattle p1

Sue Seattle p2

Name AddressJohn SeattleSue Seattle

Name AddressJohn Seattle p1

p1p2

=Ip

; 1-p1 - p1p2

Very limited – cannot capture correlations across tuples

Not Closed• Query operators can introduce

complex correlations!

Page 5: Probabilistic/Uncertain Data Management -- IV

5

Query Evaluation on Probabilistic DBs

• Focus on possible tuple semantics– Compute likelihood of individual answer tuples

• Probability of Boolean expressions– Key operation for Intensional Query Evaluation

• Complexity of query evaluation

Page 6: Probabilistic/Uncertain Data Management -- IV

6

Complexity of Boolean Expression Probability

Theorem [Valiant:1979]For a boolean expression E, computing Pr(E) is #P-complete

NP = class of problems of the form “is there a witness ?” SAT#P = class of problems of the form “how many witnesses ?” #SAT

The decision problem for 2CNF is in PTIMEThe counting problem for 2CNF is #P-complete

[Valiant:1979]

Page 7: Probabilistic/Uncertain Data Management -- IV

7

Query Complexity

Data complexity of a query Q:• Compute Q(Ip), for probabilistic database Ip

Simplest scenario only:• Possible tuples semantics for Q• Independent tuples for Ip

Page 8: Probabilistic/Uncertain Data Management -- IV

8

Extensional Query EvaluationRelational ops compute probabilities

v p

v p

£

v1 p1

v1 v2 p1 p2

v2 p2

v p1

v p2

v 1-(1-p1)(1-p2)…

[Fuhr&Roellke:1997,Dalvi&Suciu:2004]

-

v p1

v p1(1-p2)

v p2

Unlike intensional evaluation, data complexity: PTIME

Page 9: Probabilistic/Uncertain Data Management -- IV

9

Jon Sea p1

Jon q1

Jon q2

Jon q3

SELECT DISTINCT x.CityFROM Personp x, Purchasep yWHERE x.Name = y.Cust and y.Product = ‘Gadget’

Jon Sea p1q1

Jon Sea p1q2

Jon Sea p1q3

Sea 1-(1-p1q1)(1- p1q2)(1- p1q3)

£

Jon Sea p1

Jon q1

Jon q2

Jon q3

Jon 1-(1-q1)(1-q2)(1-q3)

£

Jon Sea p1(1-(1-q1)(1-q2)(1-q3))

[Dalvi&Suciu:2004]

Wrong !

Correct

Depends on plan !!!

Page 10: Probabilistic/Uncertain Data Management -- IV

10

Query ComplexitySometimes @ correct (“safe”) extensional plan

Qbad :- R(x), S(x,y), T(y) Data complexityis #P complete

Theorem The following are equivalent• Q has PTIME data complexity• Q admits an extensional plan (and one finds it in PTIME)• Q does not have Qbad as a subquery

[Dalvi&Suciu:2004]

Page 11: Probabilistic/Uncertain Data Management -- IV

11

Computing a Safe SPJ Extensional Plan

Problem is due to projection operations• An “unsafe” extensional projection combines

tuples that are correlated assuming independence

Projection over a join that projects away at least one of the join attrs Unsafe projection!

• Intuitive: Joins create correlated output tuples

Page 12: Probabilistic/Uncertain Data Management -- IV

12

Computing a Safe SPJ Extensional Plan

Algorithm for Safe Extensional SPJ Evaluation• Apply safe projections as late as possible in the

plan• If no more safe projections exist, look for joins

where all attributes are included in the output– Recurse on the LHS, RHS of the join

Sound and complete safe SPJ evaluation algorithm• If a safe plan exists, the algo finds it!

Page 13: Probabilistic/Uncertain Data Management -- IV

13

Summary on Query ComplexityExtensional query evaluation:• Very popular• Guarantees polynomial complexity• However, result depends on query plan and

correctness not always possible!

General query complexity• #P complete (not surprising, given #SAT)• Already #P hard for very simple query (Qbad)

Probabilistic databases have high query complexity

Page 14: Probabilistic/Uncertain Data Management -- IV

14

Efficient Approximate Evaluation: Monte-Carlo Simulation

Run evaluation with no projection/dup elimination till the very final step

• Intermediate tuples carry all attributes• Each result tuple = group t1,…,tn of tuples with the

same projection attribute values– Prob(group) = Prob( C1 OR C2 OR… Cn) , where each

Ci= e1 AND e2 … AND ek

– Evaluate the probability of a large DNF expression– Can be efficiently approximated through MC

simulation (a.k.a. sampling)

Page 15: Probabilistic/Uncertain Data Management -- IV

15

Monte Carlo Simulation

[Karp,Luby&Madras:1989]

E = X1X2 Ç X1X3 Ç X2X3

Cnt à 0repeat N times randomly choose X1, X2, X3 2 {0,1} if E(X1, X2, X3) = 1 then Cnt = Cnt+1P = Cnt/Nreturn P /* ' Pr(E) */

Theorem. If N ¸ (1/ Pr(E)) £ (4ln(2/)/2) then: Pr[ | P/Pr(E) - 1 | > ] <

May be very big

X1X2 X1X3

X2X3

Naïve:

0/1-estimatortheorem

Works for any ENot in PTIME

Page 16: Probabilistic/Uncertain Data Management -- IV

16

Monte Carlo Simulation

[Karp,Luby&Madras:1989]

Cnt à 0; S à Pr(C1) + … + Pr(Cm);repeat N times randomly choose i 2 {1,2,…, m}, with prob. Pr(Ci) / S randomly choose X1, …, Xn 2 {0,1} s.t. Ci = 1 if C1=0 and C2=0 and … and Ci-1 = 0 then Cnt = Cnt+1P = Cnt/N * 1/return P /* ' Pr(E) */

Theorem. If N ¸ (1/ m) £ (4ln(2/)/2) then: Pr[ | P/Pr(E) - 1 | > ] <

E = C1 Ç C2 Ç . . . Ç Cm

Improved:

Now it’s better

Only for E in DNFIn PTIME

Page 17: Probabilistic/Uncertain Data Management -- IV

17

Summary on Monte CarloSome form of simulation is needed in probabilistic

databases, to cope with the #P-hardness bottleneck– Naïve MC: works well when Prob is big– Improved MC: needed when Prob is small

Recent work [Re,Dalvi,Suciu, ICDE’07] describes optimized MC for top-k tuple evaluation

Page 18: Probabilistic/Uncertain Data Management -- IV

18

Handling Tuple CorrelationsTuple correlations/dependencies arise naturally

– Sensor networks: Temporal/spatial correlations– During query evaluation (even starting with

independent tuples)

Need representation formalism that can capture and evaluate queries over such correlated tuples

Page 19: Probabilistic/Uncertain Data Management -- IV

19

Capturing Tuple Correlations: Basic IdeasUse key ideas of Probabilistic Graphical Models

(PGMs) – Bayes and Markov networks are special cases

Tuple-based random variables– Each tuple t corresponds to a Boolean RV Xt

Factors capturing correlations across subsets of RVs– f(X) is a function of a (small) subset X of the Xt RVs

[Sen, Deshpande:2007]

Page 20: Probabilistic/Uncertain Data Management -- IV

20

Capturing Tuple Correlations: Basic IdeasAssociate each probabilistic tuple with a binomial RV

Define PGM factors capturing correlations across subsets of tuple RVs

Probability of a possible world = product of all PGM factors– PGM = factored, economical representation of possible

worlds distribution– Closed & complete representation formalism

Page 21: Probabilistic/Uncertain Data Management -- IV

21

Example: Mutual ExclusionWant to capture mutual exclusion (XOR) between

tuples s1 and t1

Page 22: Probabilistic/Uncertain Data Management -- IV

22

Example: Positive CorrelationWant to capture positive correlation between tuples

s1 and t1

Page 23: Probabilistic/Uncertain Data Management -- IV

23

PGM RepresentationDefinition: A PGM is a graph whose nodes represent

RVs and edges represent correlations

Factors correspond to the cliques of the PGM graph– Graph structure encodes conditional independencies

– Joint pdf = clique factors – Economical representation (O(2^k), k=|max clique|)

Page 24: Probabilistic/Uncertain Data Management -- IV

24

Query Evaluation: Basic IdeasCarefully represent correlations between base, intermediate,

and result tuples to generate a PGM for the query result distribution– Each relational op generates Boolean factors capturing the

dependencies of its input/output tuples

Final model = product of all generated factors

Cast probabilistic computations in query evaluation as a probabilistic inference problem over the resulting (factored) PGM– Can import ML techniques and optimizations

Page 25: Probabilistic/Uncertain Data Management -- IV

25

Query Evaluation: Example

Page 26: Probabilistic/Uncertain Data Management -- IV

26

Probabilistic DBs: Summary• Principled framework for managing uncertainties

– Uncertainty management: ML, AI, Stats– Benefits of DB world: declarative QL, optimization,

scale to large data, physical access structs, …

• Prob DBs = “Marriage” of DBs and ML/AI/Stats– ML folks have also been moving our way: Relational

extensions to ML models (PRMs, FO models, inductive logic programming, …)

Page 27: Probabilistic/Uncertain Data Management -- IV

27

Probabilistic DBs: Future• Importing more sophisticated ML techniques and

tools inside the DBMS– Inference as queries, FO models and optimizations,

access structs for relational queries + inference, …

• More on the algorithmic front: Probabilstic DBs and possible worlds semantics brings new challenges– E.g., approximate query processing, probabilistic data

streams (e.g., sketching), …