representation formalisms for uncertain data

35
Representation Formalisms Representation Formalisms for Uncertain Data for Uncertain Data Jennifer Widom with Anish Das Sarma Omar Benjelloun Alon Halevy and other participants in the Trio Trio Project

Upload: misha

Post on 25-Feb-2016

55 views

Category:

Documents


5 download

DESCRIPTION

Representation Formalisms for Uncertain Data. Jennifer Widom with Anish Das Sarma Omar Benjelloun Alon Halevy and other participants in the Trio Project. !! Warning !!. This work is preliminary and in flux Ditto for these slides Luckily I’m among friends…. Some Context. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Representation Formalisms for    Uncertain Data

Representation Formalisms for Representation Formalisms for Uncertain DataUncertain Data

Jennifer Widomwith

Anish Das SarmaOmar Benjelloun

Alon Halevyand other participants in the TrioTrio Project

Page 2: Representation Formalisms for    Uncertain Data

2

!! Warning !!!! Warning !!

This work is preliminarypreliminary and in fluxin flux

Ditto for these slides

Luckily I’m among friends…

Page 3: Representation Formalisms for    Uncertain Data

3

Some ContextSome Context

TrioTrio Project ProjectWe’re building a new kind of DBMS in which:

1.1. DataData2.2. AccuracyAccuracy3.3. LineageLineage

are all first-class interrelated concepts

Potential applications– Scientific and sensor databases– Data cleaning and integration– Approximate query processing– And others…

Page 4: Representation Formalisms for    Uncertain Data

4

Context (cont’d)Context (cont’d)

We began by investigating the accuracyaccuracy component= uncertainty= uncertainty (more on terminology coming)

Recently we’ve made progress tying togetheruncertainty +uncertainty + lineagelineage

Page 5: Representation Formalisms for    Uncertain Data

5

Approaching a New ProblemApproaching a New Problem

1) Work in a void for a while

2) Then see what others have done

3) Adjust and proceed

Page 6: Representation Formalisms for    Uncertain Data

6

Void Part 1Void Part 1

Defined initial Trio Data ModelTrio Data Model (TDMTDM) [CIDR ’05]

Based primarily on applications and intuition

AccuracyAccuracy component of initial TDM A sub-trio:

1. Attribute-level approximationapproximation2. Tuple-level (or relation-level) confidenceconfidence3. Relation-level coveragecoverage

Page 7: Representation Formalisms for    Uncertain Data

7

Terminology WarsTerminology Wars

Page 8: Representation Formalisms for    Uncertain Data

8

TDM: Approximation TDM: Approximation (Attributes)(Attributes)

Broadly, an approximate value is a set ofpossible valuespossible values along with a probability probability distribution distribution over them

Page 9: Representation Formalisms for    Uncertain Data

9

TDM: Approximation TDM: Approximation (Attributes)(Attributes)

Specifically, each Trio attribute value is either:1) Exact value (default)2) Set of values, each with prob [0,1] (∑=1)3) Min + Max for a range (uniform distribution)4) Mean + Deviation for Gaussian distribution

Type 2 sets may include “unknown” (┴)

Independence of approximate values withina tuple

Page 10: Representation Formalisms for    Uncertain Data

10

TDM: Confidence TDM: Confidence (Tuples)(Tuples)

Each tuple t has confidenceconfidence [0,1]– Informally: chance of t correctly belonging in relation– Default: confidence=1– Can also define at relation level

Page 11: Representation Formalisms for    Uncertain Data

11

TDM: Coverage TDM: Coverage (Relations)(Relations)

Each relation R has coveragecoverage [0,1]– Informally: percentage of correct R that is present– Default: coverage=1

Page 12: Representation Formalisms for    Uncertain Data

12

Void Part 2Void Part 2

Started fiddling around with TDM accuracy– Suitability for applicationsSuitability for applications– Expressiveness in generalExpressiveness in general– Operations on dataOperations on data

Immediately encountered interesting issues– Modeling is nontrivialModeling is nontrivial– Operation behavior is nonobviousOperation behavior is nonobvious– Completeness and closureCompleteness and closure

Page 13: Representation Formalisms for    Uncertain Data

13

End VoidEnd Void

Time to…– Read up on other work– Study a simplified accuracy model– Get formal– Change terminology ☻☻

Page 14: Representation Formalisms for    Uncertain Data

14

Uncertain Database: SemanticsUncertain Database: Semantics

Definition:Definition: An uncertain database represents a set of possible (certain) databases

a.k.a. “possible worlds” “possible instances”

Example:Example: Jennifer attends workshop on Monday; Mike attends on Monday, Tuesday, or not at all

person day

Jennifer Monday

Mike Monday

person day

Jennifer Monday

Mike Tuesday

person day

Jennifer Monday

Instance1 Instance2Instance3

Page 15: Representation Formalisms for    Uncertain Data

15

Restricted TDM AccuracyRestricted TDM Accuracy

1. Approximation: or-setsor-sets

2. Confidence: maybe-tuples maybe-tuples (denoted “??”)

3. Coverage: omit

Straightforward mapping to possible-instances

person day

Jennifer Monday

Mike {Monday,Tuesday}??

maps to the threepossible-instances onprevious slide

Page 16: Representation Formalisms for    Uncertain Data

16

Properties of RepsentationsProperties of Repsentations

• Restricted-TDM is one possible representationrepresentation for uncertain databases

• A representation is well-definedwell-defined if we know how to map any database in the representation to its set of possible instances

• A representation is completecomplete if every set of possible instances can be represented

Unfortunately, TDM (restricted or not) is incomplete

Page 17: Representation Formalisms for    Uncertain Data

17

IncompletenessIncompleteness

person day

Jennifer Monday

Mike Tuesday

person day

Jennifer Monday

Instance1Instance2

person day

Mike Tuesday

Instance3

person day

Jennifer Monday

Mike Tuesday???? generates 4th instance:

empty relation

Page 18: Representation Formalisms for    Uncertain Data

18

Completeness vs. ClosureCompleteness vs. Closure

All sets-of-instances

Representablesets-of-instances

Op1

Op2

Proposition:Proposition: An incomplete representationis still interesting if it’s expressive enoughand closed under all required operations

Completeness:Completeness:blue=pink

Closure:Closure:arrow stays

in blue

Page 19: Representation Formalisms for    Uncertain Data

19

Operations: SemanticsOperations: Semantics

Easy and natural (re)definition for any standard database operation (call it Op)

Closure:Closure:up-arrow

always exists

Note: Completeness Completeness Closure Closure

D

I1, I2, …, In J1, J2, …, Jm

D’

possibleinstances

Op on eachinstance

rep. ofinstances

Op’ directimplementation

Page 20: Representation Formalisms for    Uncertain Data

20

Closure in TDMClosure in TDM

Unfortunately, TDM (restricted or not) is not closednot closed under many standard operations

Next:1. Examples of non-closure in TDM2. Suggest possible extensions to the representation

(hereafter “model”)3. Hierarchy of models based on expressiveness

Page 21: Representation Formalisms for    Uncertain Data

21

Non-Closure Under Join (Ex. 1)Non-Closure Under Join (Ex. 1)

person day

Mike {Monday,Tuesday}

day food

Monday chicken

Tuesday fish⋈

Result has two possible instances:person day food

Mike Monday chicken

Instance1

person day food

Mike Tuesday fish

Instance2

Not representable with or-sets and ?

Page 22: Representation Formalisms for    Uncertain Data

22

Need for XorNeed for Xor

Result has two possible instances:person day food

Mike Monday chicken

Instance1

person day food

Mike Tuesday fish

Instance2

Representable with Xor constraintXor constraint

person day food

Mike Monday chicken

Mike Tuesday fish

t1t2

Constraint: t1 XOR t2

Page 23: Representation Formalisms for    Uncertain Data

23

Non-Closure Under Join (Ex. 2)Non-Closure Under Join (Ex. 2)

person day

Mike {Monday,Tuesday}

day food

Monday chicken

Monday pie⋈

Result has two possible instances:person day food

Mike Monday chickenMike Monday pie

Instance1

person day food

Instance2

Not representable with or-sets and ?

Page 24: Representation Formalisms for    Uncertain Data

24

Need for IffNeed for Iff

Result has two possible instances:person day food

Mike Monday chickenMike Monday pie

Instance1

person day food

Instance2

Representable with ≡≡ (IffIff) constraintconstraint

person day food

Mike Monday chicken

Mike Monday pie

t1t2

Constraint: t1 ≡ t2

Page 25: Representation Formalisms for    Uncertain Data

25

Constraints Constraints Completeness? Completeness?

Full propositional logic: YESYES

Xor and Iff: NONO

General 2-clauses: NO NO

How about “vertical or” (tuple-sets)?NOPENOPE

Page 26: Representation Formalisms for    Uncertain Data

26

Hierarchy of Incomplete ModelsHierarchy of Incomplete Models

R relations

A or-sets

? maybe-tuples

2 2-clauses

sets tuple-sets

Page 27: Representation Formalisms for    Uncertain Data

27

ClosureClosure

But remember:– Completeness may not be necessary– Closure may be good enough

Are any of these models closed under standard relational operations?

Page 28: Representation Formalisms for    Uncertain Data

28

Closure TableClosure Table

Page 29: Representation Formalisms for    Uncertain Data

29

Closure DiagramClosure Diagram

Omitted:– Self-loops– Subsumed arrows to root

Page 30: Representation Formalisms for    Uncertain Data

30

Some Final Theory Some Final Theory (for today)(for today)

Instance membership:Instance membership: Given instance I and uncertain relation R, is I an instance of R?

Instance certainty: Instance certainty: Given instance I and uncertain relation R, is I R’s only instance?

Tuple membership:Tuple membership: Given tuple t and uncertain relation R, is t in any of R’s instances?

Tuple certainty: Tuple certainty: Given tuple t and uncertain relation R, is t in all of R’s instances?

Many of these problems are NP-Hard in completemodels but polynomial in our incomplete models

Page 31: Representation Formalisms for    Uncertain Data

31

Back to RealityBack to Reality

What does all of this mean for the Trio project?

Fundamental dilemma:– Restricted-TDM:Restricted-TDM: intuitive, understandable, incomplete– Unrestricted-TDM:Unrestricted-TDM: more complex, still incomplete– Complete models:Complete models: even more complex, nonintuitive

Page 32: Representation Formalisms for    Uncertain Data

32

Uses of Incomplete ModelsUses of Incomplete Models

Sufficient for some applicationsSufficient for some applications– Incomplete model can represent data– Closed under required operations

Two-layer approachTwo-layer approach– Underlying complete model– Incomplete “working” model for users (recall Mike’s

chicken and fish) – Challenge: approximate approximationapproximate approximation

Page 33: Representation Formalisms for    Uncertain Data

33

Lineage and UncertaintyLineage and Uncertainty

Trio:Trio: Data + Accuracy + Lineage

Surprise:Surprise: Restricted-TDM (v2) + Lineage is complete and (therefore) closed

person day

Mike {Monday,Tuesday}

day food

Monday chicken

Tuesday fish⋈

person day food

Mike Monday chicken

Mike Tuesday fish

A1 A2

?? Lineage: A2, …

Lineage: A1, …

Page 34: Representation Formalisms for    Uncertain Data

34

Current ChallengesCurrent Challenges

Pursue uncertainty+lineage

Remainder of accuracy model– Probability distributions– Intervals, Gaussians– Confidence values– Coverage

Querying uncerrtaintyEx: Find all people with ≥ 3 alternate days

Can we generalize the possible-instances semantics?

Page 35: Representation Formalisms for    Uncertain Data

Search term: Search term: stanford trio