Managing Uncertain Data

Post on 12-Jan-2016

DESCRIPTION

Managing Uncertain Data. Anish Das Sarma, Stanford University.

TRANSCRIPT

  • Managing Uncertain Data

    Anish Das Sarma
    Stanford University

  • What is Uncertain Data?

    (Certain) Data | Uncertain Data
    Temperature is 74.634589 F | Sensor reported 75 ± 0.5 F
    Bob works for Yahoo | Bob works for either Yahoo or Microsoft
    Mary sighted a Finch | Mary sighted either a Finch (80%) or a Sparrow (20%)
    It will rain in Stanford tomorrow | There is a 60% chance of rain in Stanford tomorrow
    Yahoo stock will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
    John's age is 23 | John's age is in [20,30]

  • Why Does It Arise?

    Precision of devices
    Lack of information
    Uncertainty about the future
    Anonymization

  • Applications: Information Extraction

    Restaurant | Zip
    Hard Rock Cafe | {94111, 94133, 94109}

  • Applications: Information Integration

    Schema: (name, hPhone, oPhone, hAddr, oAddr)
    Schema: (name, phone, address)
    Combined View

  • Applications: Deduplication

    Name
    John Doe
    J. Doe

    ? 80% match

  • Applications: Scientific & Medical Experiments

    Probably not cancer

  • How Do Database Management Systems (DBMS) Handle Uncertainty?

    They don't.

  • What Do (Most) Applications Do?

    Clean: turn into data that DBMSs can handle
    Loss of information
    Errors compound insidiously

    Observer | Bird-1
    Mary | Finch: 80%, Sparrow: 20%
    Susan | Dove: 70%, Sparrow: 30%
    Jane | Hummingbird: 65%, Sparrow: 35%

    After cleaning:

    Bird-1
    Finch
    Dove
    Hummingbird
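    The information loss on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names `clean` and `consistent_species` are assumptions, and the sketch reads the "Bird-1" column as three observers reporting on the same bird. Cleaning keeps each observer's most likely species, while the one species all three consider possible is discarded.

```python
# Hypothetical encoding of the slide's sightings of Bird-1.
sightings = {
    "Mary": {"Finch": 0.80, "Sparrow": 0.20},
    "Susan": {"Dove": 0.70, "Sparrow": 0.30},
    "Jane": {"Hummingbird": 0.65, "Sparrow": 0.35},
}


def clean(sightings):
    """Keep only the most likely species per observer (the 'cleaning' step)."""
    return {obs: max(alts, key=alts.get) for obs, alts in sightings.items()}


def consistent_species(sightings):
    """Species that every observer lists as a possibility."""
    candidate_sets = [set(alts) for alts in sightings.values()]
    return set.intersection(*candidate_sets)


cleaned = clean(sightings)
print(cleaned)                        # one species per observer
print(consistent_species(sightings))  # species all observers agree is possible
```

    After cleaning, the three reports name three different species and Sparrow no longer appears anywhere, even though it is the only species compatible with every report.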

  • Outline of the Talk

    Part 1: Managing Uncertainty in a DBMS (theory → systems)
    Part 2: Handling Uncertainty in Data Integration (systems → theory)
    Other Research (trailer)
    Future Plans

  • Part 1: Managing Uncertain Data

    Primarily in the context of the Trio project
    Data, Uncertainty, Lineage
    Today's focus: how lineage helps

  • Uncertain Data

    An uncertain database represents a set of possible instances (or possible worlds)
    Our work: finite sets of possible instances

    Uncertain Data
    Sensor reported 75 ± 0.5 F
    Bob works for either Yahoo or Microsoft
    Mary sighted either a Finch (80%) or a Sparrow (20%)
    There is a 60% chance of rain in Stanford tomorrow

  • Representing Uncertain Data

    20+ years of work (mostly theoretical)
    There appears to be a fundamental trade-off between expressiveness and intuitiveness
    We spent some time exploring the space of models for uncertainty

  • Hierarchy of Models [ICDE 06]

    One end of the hierarchy: + Expressive, - Complex
    Other end of the hierarchy: + Intuitive, - Inexpressive

    Next:
    Consider a model M
    Isolate inexpressiveness
    Solve problem with lineage

    [Figure: hierarchy of models. Legend: R = relations, A = or-sets, ? = maybe-tuples, 2 = 2-clauses, prop = full propositional logic, sets = tuple-sets]

  • Running Example: Crime-Solver

    Saw (witness, color, car) // may be uncertain
    Drives (person, color, car) // may be uncertain
    Suspects (person) = π_person(Saw ⋈ Drives)

  • Simple Model M

    1. Alternatives: uncertainty about value
    2. ? (Maybe) Annotations

    Saw (witness, color, car)
    Amy | (red, Honda) || (red, Toyota) || (orange, Mazda)

    Three possible instances

  • Simple Model M

    1. Alternatives
    2. ? (Maybe): uncertainty about presence

    Saw (witness, color, car)
    Amy | (red, Honda) || (red, Toyota) || (orange, Mazda)
    Betty | (blue, Acura) ?

    Six possible instances
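    The count of six possible instances can be checked by enumeration. The sketch below is a minimal illustration, not Trio's implementation: the encoding of a row as a pair (alternatives, maybe flag) and the function name `possible_instances` are assumptions made for this example.

```python
from itertools import product

# Each row of Saw: (list of alternatives, True if annotated with "?").
saw = [
    ([("Amy", "red", "Honda"),
      ("Amy", "red", "Toyota"),
      ("Amy", "orange", "Mazda")], False),
    ([("Betty", "blue", "Acura")], True),
]


def possible_instances(relation):
    """Enumerate every possible instance of a relation in model M."""
    choices = []
    for alternatives, maybe in relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "tuple is absent"
        choices.append(options)
    return [
        frozenset(t for t in combo if t is not None)
        for combo in product(*choices)
    ]


instances = possible_instances(saw)
print(len(instances))  # 3 choices for Amy x 2 choices for Betty
```

    Restricting the relation to Amy's row alone (`saw[:1]`) gives the three instances of the previous slide.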

  • Review: Relational Queries

    D → Q → S
    Q = π_person(σ_color=red)

    Saw (witness, color, car)
    Amy, red, Honda
    Betty, blue, Acura

    W (witness)
    Amy

  • Queries on Uncertain Data

    Closure: the up-arrow always exists
    Completeness: all sets of possible instances can be represented

    [Diagram: D has possible instances I1, I2, …, In; applying Q on each instance gives J1, J2, …, Jm; the direct implementation maps D to D′, the representation of instances J1, J2, …, Jm (the up-arrow)]

  • Model M is Not Closed

    Suspects = π_person(Saw ⋈ Drives)

    Saw (witness, car)
    Cathy | Honda || Mazda

    Drives (person, car)
    (Jimmy, Toyota) || (Jimmy, Mazda)
    (Billy, Honda) || (Frank, Honda)
    (Hank, Honda)

    Suspects
    Jimmy ?
    Billy || Frank ?
    Hank ?

    Does not correctly capture possible instances in the result
    CANNOT

  • Lineage to the Rescue

    Model M + Lineage = Completeness

  • Example with Lineage

    Suspects = π_person(Saw ⋈ Drives)

    ID | Saw (witness, car)
    11 | Cathy | Honda || Mazda

    ID | Drives (person, car)
    21 | (Jimmy, Toyota) || (Jimmy, Mazda)
    22 | (Billy, Honda) || (Frank, Honda)
    23 | (Hank, Honda)

    ID | Suspects
    31 | Jimmy ?
    32 | Billy || Frank ?
    33 | Hank ?

  • Example with Lineage

    Suspects = π_person(Saw ⋈ Drives)

    ID | Saw (witness, car)
    11 | Cathy | Honda || Mazda

    ID | Drives (person, car)
    21 | (Jimmy, Toyota) || (Jimmy, Mazda)
    22 | (Billy, Honda) || (Frank, Honda)
    23 | (Hank, Honda)

    ID | Suspects
    31 | Jimmy ?
    32 | Billy || Frank ?
    33 | Hank ?

    λ(31) = (11,2) ∧ (21,2)
    λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
    λ(33) = (11,1) ∧ 23
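    To see what the lineage buys, the sketch below enumerates the possible instances of Suspects twice: once by choosing an alternative for every base row and keeping a suspect only when its lineage conjunction holds, and once by treating the three Suspects rows as independent, as model M alone would. The encoding is an assumption made for illustration (pairs are (ID, alternative number), and the single alternative of row 23 is written as (23, 1)); it is not Trio's internal representation.

```python
from itertools import product

# Number of alternatives for each base row (IDs from the slide).
base_alternatives = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage from the slide: each suspect is a conjunction of base choices.
lineage = {
    "Jimmy": [(11, 2), (21, 2)],
    "Billy": [(11, 1), (22, 1)],
    "Frank": [(11, 1), (22, 2)],
    "Hank": [(11, 1), (23, 1)],
}


def instances_with_lineage():
    """Possible Suspects instances when lineage ties rows to base choices."""
    ids = sorted(base_alternatives)
    ranges = [range(1, base_alternatives[i] + 1) for i in ids]
    result = set()
    for combo in product(*ranges):
        chosen = dict(zip(ids, combo))
        result.add(frozenset(
            person for person, conj in lineage.items()
            if all(chosen[i] == alt for i, alt in conj)
        ))
    return result


def instances_without_lineage():
    """Possible instances if the Suspects rows were independent (model M)."""
    rows = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
    return {
        frozenset(p for p in combo if p is not None)
        for combo in product(*rows)
    }


with_lineage = instances_with_lineage()
without_lineage = instances_without_lineage()
print(len(with_lineage), len(without_lineage))
```

    With lineage there are four possible results: {}, {Jimmy}, {Billy, Hank} and {Frank, Hank}. Without it the table admits twelve, including impossible ones such as {Jimmy, Hank}, which would require Cathy to have seen both a Mazda and a Honda.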

  • Trio's Data Model

    1. Alternatives
    2. ? (Maybe) Annotations
    3. Confidence values (next)
    4. Lineage

    Uncertainty-Lineage Databases (ULDBs)
    Theorem: ULDBs are closed and complete [VLDB 06]
    Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]

  • Confidence Values in Trio

    Confidence values supplied with base data
    Default probabilistic interpretation
    Problem: Compute confidence values on result data [ICDE 08]
    5-minute DBClip: search "confidence computation" on YouTube.

  • Problem Description

    Cars = π_car(Saw ⋈ Drives)

    ID | Saw (witness, car)
    11 | (Amy, Honda) : 0.5
    12 | (Betty, Acura) : 0.6

    ID | Drives (person, car)
    21 | (Jimmy, Honda) : 0.9
    22 | (Billy, Honda) : 0.8
    23 | (Hank, Acura) : 1.0

    ID | Cars
    41 | Honda : ?
    42 | Acura : ?

  • Operator-by-Operator

    Plan: Saw ⋈ Drives, then π_car

    ID | Saw (witness, car)
    11 | (Amy, Honda) : 0.5
    12 | (Betty, Acura) : 0.6

    ID | Drives (person, car)
    21 | (Jimmy, Honda) : 0.9
    22 | (Billy, Honda) : 0.8
    23 | (Hank, Acura) : 1.0

    ID | Saw ⋈ Drives
    31 | (Amy, Jimmy, Honda) : 0.5*0.9 = 0.45
    32 | (Amy, Billy, Honda) : 0.4
    33 | (Betty, Hank, Acura) : 0.6

    ID | Cars
    41 | Honda : 0.45 + 0.4 - (0.45*0.4) = 0.67 (Wrong!!)
    42 | Acura

  • Operator-by-Operator

    ID | Saw (witness, car)
    11 | (Amy, Honda) : 0.5
    12 | (Betty, Acura) : 0.6

    ID | Drives (person, car)
    21 | (Jimmy, Honda) : 0.9
    22 | (Billy, Honda) : 0.8
    23 | (Hank, Acura) : 1.0

    ID | Saw ⋈ Drives
    31 | (Amy, Jimmy, Honda) : 0.45
    32 | (Amy, Billy, Honda) : 0.4
    33 | (Betty, Hank, Acura) : 0.6

    ID | Cars
    41 | Honda : 0.45 + 0.4 - (0.45*0.4)
    42 | Acura

    Tuples 31 and 32 are not independent!

  • Database Query Processing 101

    Query Q → Execution Plans → Pick and execute best plan
    Statistics, indexes

  • Operator-by-Operator Confidence Computation

    Query Q → Plans
    Can be much smaller or empty

  • Decouple Data and Confidence Computation

    Query Q → Plans
    Compute data
    Use lineage to compute confidences (on demand)
    Theorem: Arbitrary improvement. [ICDE 08]

  • Our Approach

    ID | Saw (witness, car)
    11 | (Amy, Honda) : 0.5
    12 | (Betty, Acura) : 0.6

    ID | Drives (person, car)
    21 | (Jimmy, Honda) : 0.9
    22 | (Billy, Honda) : 0.8
    23 | (Hank, Acura) : 1.0

    ID | Cars
    41 | Honda : 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49
    42 | Acura : 0.6

    λ(41) = 11 ∧ (21 ∨ 22)
    λ(42) = 12 ∧ 23

    Correct!!
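    The two numbers for Honda (0.67 and 0.49) can be reproduced directly. The sketch below assumes that the base tuples are mutually independent and that each confidence is the probability that the tuple is present; the function names are made up for this illustration. A brute-force sum over all possible worlds serves as the reference value.

```python
from itertools import product

# Confidences of the base tuples, keyed by the IDs on the slide.
conf = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}


def operator_by_operator_honda():
    """Join first, then combine 31 and 32 as if they were independent."""
    c31 = conf[11] * conf[21]          # (Amy, Jimmy, Honda)
    c32 = conf[11] * conf[22]          # (Amy, Billy, Honda)
    return c31 + c32 - c31 * c32       # wrong: both depend on tuple 11


def lineage_based_honda():
    """Evaluate the lineage formula 11 AND (21 OR 22) over base tuples."""
    p_or = conf[21] + conf[22] - conf[21] * conf[22]
    return conf[11] * p_or


def brute_force(formula):
    """Sum the weights of all possible worlds in which `formula` holds."""
    ids = sorted(conf)
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        weight = 1.0
        for i in ids:
            weight *= conf[i] if world[i] else 1.0 - conf[i]
        if formula(world):
            total += weight
    return total


honda = brute_force(lambda w: w[11] and (w[21] or w[22]))
acura = brute_force(lambda w: w[12] and w[23])
print(operator_by_operator_honda(), lineage_based_honda(), honda, acura)
```

    Up to floating-point rounding, the operator-by-operator value is 0.67, while the lineage-based value and the brute-force reference both come out at 0.49 for Honda and 0.6 for Acura.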

  • Algorithm

    [Figure: tuple t in relation R with lineage over tuples t1, t2, t4, t5, t6, t7, annotated with confidence values; λ(t) = f(t4, t5, t6, t7)]

    1. Expand lineage to base data
    2. Get confidence of base data
    3. Evaluate the probability of λ(t)

    Optimizations:
    Detecting independence
    Memoization
    Batch computation
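    The three steps can be sketched as follows. Everything concrete here is hypothetical: the lineage structure, the assignment of confidences to t4 through t7, and the nested-tuple formula encoding are invented for illustration and are not taken from the slide's figure. Memoization is shown only for lineage expansion, and the probability is evaluated by brute-force enumeration, which is exponential in the number of base tuples; detecting independence and batch computation are not attempted.

```python
from functools import lru_cache
from itertools import product

# Hypothetical base confidences and derived-tuple lineage.
base_conf = {"t4": 0.7, "t5": 0.9, "t6": 1.0, "t7": 0.4}
lineage = {
    "t": ("and", "t1", "t2"),
    "t1": ("or", "t4", "t5"),
    "t2": ("and", "t6", "t7"),
}


@lru_cache(maxsize=None)
def expand(name):
    """Step 1: rewrite the lineage of `name` over base tuples only."""
    if name in base_conf:
        return name
    op, *args = lineage[name]
    return (op, *(expand(a) for a in args))


def base_tuples(formula):
    """Collect the base tuple names that occur in an expanded formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(base_tuples(a) for a in formula[1:]))


def holds(formula, world):
    """Evaluate an expanded formula in one truth assignment."""
    if isinstance(formula, str):
        return world[formula]
    op, *args = formula
    values = [holds(a, world) for a in args]
    return all(values) if op == "and" else any(values)


def confidence(name):
    """Steps 2 and 3: look up base confidences, then sum satisfying worlds."""
    formula = expand(name)
    names = sorted(base_tuples(formula))
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        if holds(formula, world):
            weight = 1.0
            for n in names:
                weight *= base_conf[n] if world[n] else 1.0 - base_conf[n]
            total += weight
    return total


print(expand("t"))
print(confidence("t"))
```

    For this made-up lineage the result is (0.7 + 0.9 - 0.7*0.9) * 1.0 * 0.4 = 0.388, assuming independent base tuples.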

  • Some Other Trio Work

    Modifications and Versioning [TR 08]
    Stored derived relations
    Modifications, versions

    Indexes and Statistics [MUD 08]
    Specialized indexes, histograms

    Functional Dependencies & Schema Design [TR 07]
    Definitions, sound and complete axiomatization of FDs
    Lossless decomposition
    FD testing, finding, and inference

  • Related Work (sample)

    Modeling Uncertainty: Plenty, covered in textbooks
    Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

  • Part 2: Data Integration

    Reboot! (or, wake up!)

  • Traditional Data Integration: Setup

    Sources:
    Bib(title, authors, conf, year)
    Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)

    1. Mediated Schema
    Publication(title, author, conf, year)

    2. Schema Mappings
    Mapping:
        SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
        FROM Author AS A, Paper AS P, AuthoredBy AS B
        WHERE A.aid=B.aid AND P.pid=B.pid

    3. Query Answering
    Who authored the most SIGMOD papers in the 90s? Mike Carey

    Significant up-front effort
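    As a minimal sketch of how such a mapping behaves, the example below loads the Author, Paper and AuthoredBy source into an in-memory SQLite database and exposes the mapping as a view over the mediated schema. The sample rows are invented placeholders, and defining the mapping as a SQL view is an assumption made for illustration, not necessarily how an integration system would store it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author(aid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Paper(pid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);

-- Hypothetical sample rows, invented for illustration only.
INSERT INTO Author VALUES (1, 'Author A'), (2, 'Author B');
INSERT INTO Paper VALUES (10, 'Paper X', 1995), (11, 'Paper Y', 1998);
INSERT INTO AuthoredBy VALUES (1, 10), (1, 11), (2, 11);

-- The schema mapping from the slide, exposed as the mediated relation.
CREATE VIEW Publication AS
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid;
""")

rows = conn.execute(
    "SELECT title, author, conf, year FROM Publication ORDER BY title, author"
).fetchall()
for row in rows:
    print(row)
```

    Every row from this source carries a NULL conf, so a conference-specific query such as the SIGMOD question cannot be answered from this source alone.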

  • Pay-As-You-Go Data Integration

    Automated best-effort integration from the outset
    Further improve the system over time with feedback
    How advanced a starting point can we provide?