Managing Uncertain Data

Post on 12-Jan-2016

DESCRIPTION

Managing Uncertain Data. Anish Das Sarma, Stanford University. What is Uncertain Data? Why Does It Arise? Precision of devices. Lack of information. Uncertainty about the future. Anonymization. Applications: Information Extraction. Applications: Information Integration. name, hPhone, - PowerPoint PPT Presentation

TRANSCRIPT

<ul><li><p>Managing Uncertain Data</p><p>Anish Das Sarma, Stanford University</p></li>
<li><p>What is Uncertain Data?</p><p>(Certain) data versus uncertain data:</p><p>Certain: Temperature is 74.634589 F. Uncertain: Sensor reported 75 ± 0.5 F.</p><p>Certain: Bob works for Yahoo. Uncertain: Bob works for either Yahoo or Microsoft.</p><p>Certain: Mary sighted a Finch. Uncertain: Mary sighted either a Finch (80%) or a Sparrow (20%).</p><p>Certain: It will rain in Stanford tomorrow. Uncertain: There is a 60% chance of rain in Stanford tomorrow.</p><p>Certain: Yahoo stock will be at 100 in a month. Uncertain: Yahoo stock will be between 60 and 120 in a month.</p><p>Certain: John's age is 23. Uncertain: John's age is in [20,30].</p></li>
<li><p>Why Does It Arise?</p><p>Precision of devices</p><p>Lack of information</p><p>Uncertainty about the future</p><p>Anonymization</p></li>
<li><p>Applications: Information Extraction</p><p>Restaurant: Hard Rock Cafe. Zip: 94111, 94133, 94109</p></li>
<li><p>Applications: Information Integration</p><p>(name, hPhone, oPhone, hAddr, oAddr)</p><p>(name, phone, address)</p><p>Combined View</p></li>
<li><p>Applications: Deduplication</p><p>? 80% match</p><p>Name: John Doe; J. 
Doe</p></li>
<li><p>Applications: Scientific &amp; Medical Experiments</p><p>Probably not cancer</p></li>
<li><p>How Do Database Management Systems (DBMS) Handle Uncertainty?</p><p>They don't</p></li>
<li><p>What Do (Most) Applications Do?</p><p>Clean: turn into data that DBMSs can handle</p><p>Loss of information</p><p>Errors compound insidiously</p><p>Uncertain data (Observer: Bird-1): Mary: Finch 80%, Sparrow 20%. Susan: Dove 70%, Sparrow 30%. Jane: Hummingbird 65%, Sparrow 35%.</p><p>Cleaned data (Bird-1): Finch, Dove, Hummingbird</p></li>
<li><p>Outline of The Talk</p><p>Part 1: Managing Uncertainty in a DBMS (theory, systems)</p><p>Part 2: Handling Uncertainty in Data Integration (systems, theory)</p><p>Other Research (trailer)</p><p>Future Plans</p></li>
<li><p>Part 1: Managing Uncertain Data</p><p>Primarily in the context of the Trio project: Data, Uncertainty, Lineage</p><p>Today's focus: how lineage helps</p></li>
<li><p>Uncertain Data</p><p>An uncertain database represents a set of possible instances (or, possible worlds)</p><p>Our work: finite sets of possible instances</p><p>Examples of uncertain data: Sensor reported 75 ± 0.5 F. Bob works for either Yahoo or Microsoft. Mary sighted either a Finch (80%) or a Sparrow (20%). There is a 60% chance of rain in Stanford tomorrow.</p></li>
<li><p>Representing Uncertain Data</p><p>20+ years of work (mostly theoretical)</p><p>Appears to be a fundamental trade-off between expressiveness &amp; intuitiveness</p><p>We spent some time exploring the space of models for uncertainty</p></li>
<li><p>Hierarchy of Models [ICDE 06]</p><p>One end of the hierarchy: + Expressive, - Complex. Other end: + Intuitive, - Inexpressive.</p><p>Next: Consider a model M. Isolate inexpressiveness. Solve problem with lineage.</p><p>Figure legend: R = relations; A = or-sets; ? = maybe-tuples; 2 = 2-clauses; sets = tuple-sets; prop = full propositional logic</p></li>
<li><p>Running Example: Crime-Solver</p><p>Saw 
(witness, color, car) // may be uncertain</p><p>Drives (person, color, car) // may be uncertain</p><p>Suspects (person) = π_person(Saw ⋈ Drives)</p></li>
<li><p>Simple Model M</p><p>1. Alternatives: uncertainty about value</p><p>2. ? (Maybe) Annotations</p><p>Saw (witness, color, car): Amy: (red, Honda) or (red, Toyota) or (orange, Mazda)</p><p>Three possible instances</p></li>
<li><p>Simple Model M</p><p>1. Alternatives</p><p>2. ? (Maybe): uncertainty about presence</p><p>Saw (witness, color, car): Amy: (red, Honda) or (red, Toyota) or (orange, Mazda). Betty: (blue, Acura) ?</p><p>Six possible instances</p></li>
<li><p>Review: Relational Queries</p><p>Q: π_person(σ_color=red)</p><p>Saw(witness, color, car): (Amy, red, Honda); (Betty, blue, Acura)</p><p>Result W (witness): Amy</p></li>
<li><p>Queries on Uncertain Data</p><p>Closure: up-arrow always exists</p><p>Completeness: All sets of possible instances can be represented</p><p>Diagram labels: D; I1, I2, …, In; J1, J2, …, Jm; D; possible instances; Q on each instance; rep. 
of instances; direct implementation</p></li>
<li><p>Model M is Not Closed</p><p>Suspects = π_person(Saw ⋈ Drives)</p><p>Saw (witness, car): Cathy: Honda or Mazda</p><p>Drives (person, car): (Jimmy, Toyota) or (Jimmy, Mazda). (Billy, Honda) or (Frank, Honda). (Hank, Honda).</p><p>Suspects: Jimmy ?. Billy or Frank ?. Hank ?.</p><p>Does not correctly capture possible instances in the result (M cannot)</p></li>
<li><p>Lineage to the Rescue</p><p>Model M + Lineage = Completeness</p></li>
<li><p>Example with Lineage</p><p>Suspects = π_person(Saw ⋈ Drives)</p><p>Saw (witness, car): ID 11: Cathy: Honda or Mazda</p><p>Drives (person, car): ID 21: (Jimmy, Toyota) or (Jimmy, Mazda). ID 22: (Billy, Honda) or (Frank, Honda). ID 23: (Hank, Honda).</p><p>Suspects: ID 31: Jimmy ?. ID 32: Billy or Frank ?. ID 33: Hank ?.</p><p>λ(31) = (11,2) ∧ (21,2)</p><p>λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)</p><p>λ(33) = (11,1) ∧ 23</p></li>
<li><p>Trio's Data Model</p><p>Alternatives</p><p>? (Maybe) Annotations</p><p>Confidence values (next)</p><p>Lineage</p><p>Together: Uncertainty-Lineage Databases (ULDBs)</p><p>Theorem: ULDBs are closed and complete [VLDB 06]</p><p>Formally studied properties like minimization, equivalence, approximation, and membership. [VLDB 06, VLDB J. 
08]</p></li>
<li><p>Confidence Values in Trio</p><p>Confidence values supplied with base data</p><p>Default probabilistic interpretation</p><p>Problem: Compute confidence values on result data [ICDE 08]</p><p>5-minute DBClip: search "confidence computation" on YouTube.</p></li>
<li><p>Problem Description</p><p>Cars = π_car(Saw ⋈ Drives)</p><p>Saw (witness, car): ID 11: (Amy, Honda) : 0.5. ID 12: (Betty, Acura) : 0.6.</p><p>Drives (person, car): ID 21: (Jimmy, Honda) : 0.9. ID 22: (Billy, Honda) : 0.8. ID 23: (Hank, Acura) : 1.0.</p><p>Cars: ID 41: Honda : ?. ID 42: Acura : ?.</p></li>
<li><p>Operator-by-Operator</p><p>Saw ⋈ Drives: ID 31: (Amy, Jimmy, Honda) : 0.5*0.9 = 0.45. ID 32: (Amy, Billy, Honda) : 0.4. ID 33: (Betty, Hank, Acura) : 0.6.</p><p>π_car: ID 41: Honda : 0.45 + 0.4 - (0.45*0.4) = 0.67. ID 42: Acura : 0.6.</p><p>Wrong!! Not independent!</p></li>
<li><p>Database Query Processing 101</p><p>Query Q</p><p>Execution Plans</p><p>Statistics, indexes</p><p>Pick and execute best plan</p></li>
<li><p>Operator-by-Operator Confidence Computation</p><p>Query Q</p><p>Plans: Can be much smaller or empty</p></li>
<li><p>Decouple Data and Confidence Computation</p><p>Query Q</p><p>Plans: Compute data</p><p>Use lineage to compute confidences (on demand)</p><p>Theorem: Arbitrary improvement. 
[ICDE 08]</p></li>
<li><p>Our Approach</p><p>λ(41) = 11 ∧ (21 ∨ 22)</p><p>λ(42) = 12 ∧ 23</p><p>Cars: ID 41: Honda : 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49. ID 42: Acura : 0.6.</p><p>Correct!!</p></li>
<li><p>Algorithm</p><p>Figure: a result tuple t in relation R whose lineage reaches base tuples t4, t5, t6, t7 (intermediate tuples t1, t2), with confidence values on the base tuples; λ(t) = f(t4, t5, t6, t7)</p><p>1. Expand lineage to base data</p><p>2. Get confidence of base data</p><p>3. Evaluate the probability of λ(t)</p><p>Optimizations: Detecting independence. Memoization. Batch computation.</p></li>
<li><p>Some Other Trio Work</p><p>Modifications and Versioning [TR 08]: Stored derived relations. Modifications, versions.</p><p>Indexes and Statistics [MUD 08]: Specialized indexes, histograms.</p><p>Functional Dependencies &amp; Schema Design [TR 07]: Definitions, sound and complete axiomatization of FDs. Lossless decomposition. FD testing, finding, and inference.</p></li>
<li><p>Related Work (sample)</p><p>Modeling Uncertainty: Plenty, covered in textbooks</p><p>Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?</p></li>
<li><p>Part 2: Data Integration</p><p>Reboot! (or, wake up!)</p></li>
<li><p>Traditional Data Integration: Setup</p><p>Sources: Bib(title, authors, conf, year); Author(aid, name); Paper(pid, title, year); AuthoredBy(aid, pid)</p><p>1. Mediated Schema: Publication(title, author, conf, year)</p><p>2. Schema Mappings. Mapping: SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid</p><p>3. 
Query Answering. Example: Who authored the most SIGMOD papers in the 90s? Mike Carey</p><p>Significant up-front effort</p></li>
<li><p>Pay-As-You-Go Data Integration</p><p>Automated best-effort integration from the outset</p><p>Further improve the system over time with feedback</p><p>How advanced a starting point can we provide?</p></li>
<li><p>Uncertainty to the Rescue</p><p>Automatic integration: Make guesses. Model probabilities.</p><p>Specifically: Probabilistic schema mappings. Probabilistic mediated-schema.</p><p>&gt;90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]</p></li>
<li><p>Next</p><p>Probabilistic mediated schemas</p><p>Probabilistic schema mappings</p><p>Experimental results</p></li>
<li><p>Mediated Schema</p><p>S1(name, email, phone-num, address)</p><p>S2(person-name, phone, mailing-addr)</p><p>Med-S (name, email, phone, addr): {name, person-name}, {phone-num, phone}, {address, mailing-addr}, {email}</p><p>A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.</p></li>
<li><p>Example</p><p>S1(name, hPhone, oPhone, hAddr, oAddr)</p><p>S2(name, phone, address)</p><p>Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})</p><p>Q: SELECT name, hPhone, oPhone FROM Med ?</p></li>
<li><p>Example</p><p>S1(name, hPhone, oPhone, hAddr, oAddr)</p><p>S2(name, phone, address)</p><p>Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})</p><p>Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})</p><p>Q: SELECT name, phone, address FROM Med</p></li>
<li><p>Med4 ({name}, {phone, 
oPhone}, {hPhone}, {address, oAddr}, {hAddr})</p><p>Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})</p><p>Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})</p><p>Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})</p><p>Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})</p><p>S1(name, hPhone, oPhone, hAddr, oAddr)</p><p>S2(name, phone, address)</p><p>Q: SELECT name, phone, address FROM Med</p></li>
<li><p>Probabilistic Mediated Schema</p><p>S1(name, hPhone, oPhone, hAddr, oAddr)</p><p>S2(name, phone, address)</p><p>Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}), Pr=0.5</p><p>Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}), Pr=0.5</p><p>A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1, Pr(M1)), …, (Mk, Pr(Mk))} where each Mi is a med-schema; i ≠ j ⇒ Mi ≠ Mj; Pr(Mi) ∈ (0,1]; Σ Pr(Mi) = 1</p></li>
<li><p>P-Mappings</p></li>
<li><p>Expressive Power of P-Med-Schema &amp; P-Mapping</p><p>Theorem 1. 
For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mapping) &gt; (p-med-schema + mappings)</p><p>Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) &gt; (mediated schema + p-mapping)</p></li>
<li><p>Next</p><p>Creating p-med-schemas (briefly)</p><p>Creating p-mappings (briefly)</p><p>Experimental Results</p></li>
<li><p>P-med-schema Creation</p><p>1. Certain/uncertain edges</p><p>Figure: attribute graph with edge weights 1, .6, .6, .2</p></li>
<li><p>P-med-schema Creation</p><p>2. Clustering</p></li>
<li><p>P-med-schema Creation</p><p>3. Assign probabilities</p><p>Figure: four candidate schemas with Pr=1/6, Pr=1/6, Pr=1/3, Pr=1/3</p></li>
<li><p>P-mapping Creation</p><p>S=(num, pname, home-addr, office-addr)</p><p>T=(name, mailing-addr)</p><p>Figure: weighted correspondences between S and T with weights 0.8, 0.9, 0.9, 0.2</p><p>Goal: find a p-mapping that is consistent with a set of weighted correspondences</p><p>Theorem: A consistent p-mapping exists if and only if, for every source/target attribute a, the sum of the weights of all correspondences that involve a is at most 1. 
</p></li>
<li><p>Experiments</p><p>Data: tables extracted from HTML tables on the web</p><p>Columns: Domain; #Sources; Search Keywords</p><p>Movie; 161; movie, year</p><p>Car; 817; make, model</p><p>People; 49; job/title, organization/company/employer</p><p>Course; 647; course/class, instructor/teacher/lecturer, subject/department/title</p><p>Bib; 649; author, title, year, journal/conference</p></li>
<li><p>Experiments</p><p>Gold standard: manual</p><p>Approximate standard: semi-automatic</p><p>Precision, recall, F-measure for several SQL queries varying attributes, selectivities</p></li>
<li><p>Quality of Query Answering</p><p>Columns: Domain; Precision; Recall; F-measure</p><p>Golden Standard:</p><p>People; 1; .849; .918</p><p>Course; 1; .852; .92</p><p>Approximate Golden Standard:</p><p>Movie; .95; 1; .924</p><p>Car; 1; .917; .957</p><p>People; .958; .984; .971</p><p>Course; 1; 1; 1</p><p>Bib; 1; .955; .977</p></li>
<li><p>Comparison with Other Approaches</p><p>Keyword search obtained low precision and low recall.</p><p>Querying the sources directly or considering only the highest probability mapping obtained low recall.</p><p>We obtained highest F-measure in all domains.</p></li>
<li><p>Comparison with Other Mediated-Schema Generation Methods</p><p>Using p-med-schema obtained highest F-measure in all domains.</p></li>
<li><p>System Setup Time (one domain)</p></li>
<li><p>Brief Related Work</p><p>Approximate schema mappings [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]</p><p>Automatic generation of mediated schemas [He et al. 2003]</p><p>More (see paper)</p></li>
<li><p>Finally</p><p>Other Research: Data Integration (2). Deduplication (2). Quality Estimation of Sensor/RFID Streams [IQIS 06].</p><p>Future Plans</p></li>
<li><p>Data Integration</p><p>Problem: Foundations for integration of uncertain data</p><p>Solution [TR 08]: Define open- and closed-containment for uncertain data</p><p>Algorithms, complexity of consistency checking and finding...</p></li></ul>
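The Trio slides on confidence computation contrast an operator-by-operator calculation (0.67 for Honda, marked wrong) with a lineage-based one (0.49, marked correct). The following is a minimal sketch of that contrast, not Trio's actual algorithm: it assumes base tuples are independent and evaluates a lineage formula by brute-force enumeration of possible worlds, whereas the slides mention independence detection, memoization, and batch computation. The function name `confidence` and the lambda encodings of the lineage formulas are illustrative assumptions.

```python
from itertools import product


def confidence(lineage, base_conf):
    """Probability that a lineage formula holds, assuming independent base tuples.

    lineage:   function taking a dict {tuple_id: bool} and returning bool
    base_conf: dict {tuple_id: confidence of that base tuple}
    Brute force over all 2^n possible worlds; suitable for tiny examples only.
    """
    ids = sorted(base_conf)
    total = 0.0
    for world in product([False, True], repeat=len(ids)):
        present = dict(zip(ids, world))
        if lineage(present):
            weight = 1.0
            for tid, is_present in present.items():
                weight *= base_conf[tid] if is_present else 1.0 - base_conf[tid]
            total += weight
    return total


# Base data from the slides: Saw tuples 11, 12 and Drives tuples 21, 22, 23.
base = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}

# Lineage from the "Our Approach" slide:
# lambda(41) = 11 AND (21 OR 22); lambda(42) = 12 AND 23.
honda = confidence(lambda w: w[11] and (w[21] or w[22]), base)
acura = confidence(lambda w: w[12] and w[23], base)

# Operator-by-operator figure from the slides: it treats join results 31 and 32
# as independent even though both depend on base tuple 11.
naive_honda = 0.45 + 0.4 - 0.45 * 0.4

print(round(honda, 4), round(acura, 4), round(naive_honda, 4))
```

Under these assumptions the lineage-based values agree with the slide's 0.49 and 0.6, while the operator-by-operator value is 0.67 because tuple 11 is effectively counted twice.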
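The P-mapping Creation slide states that a consistent p-mapping exists exactly when, for every source or target attribute, the weights of the correspondences involving it sum to at most 1. The check below is a small sketch of that condition only; it does not construct the p-mapping. The function name `violating_attributes` and the example weights are hypothetical, because the transcript does not say which attribute pairs carry the slide's weights.

```python
from collections import defaultdict


def violating_attributes(correspondences, tol=1e-9):
    """Return attributes whose total correspondence weight exceeds 1.

    correspondences: iterable of (source_attr, target_attr, weight).
    An empty result means the condition in the slide's theorem is satisfied.
    """
    totals = defaultdict(float)
    for source_attr, target_attr, weight in correspondences:
        totals[("source", source_attr)] += weight
        totals[("target", target_attr)] += weight
    return {attr: total for attr, total in totals.items() if total > 1.0 + tol}


# Hypothetical weights over the slide's schemas S and T (not the slide's values).
ok_weights = [
    ("pname", "name", 0.7),
    ("home-addr", "mailing-addr", 0.6),
    ("office-addr", "mailing-addr", 0.3),
]
bad_weights = [
    ("pname", "name", 0.7),
    ("home-addr", "mailing-addr", 0.6),
    ("office-addr", "mailing-addr", 0.5),
]

print(violating_attributes(ok_weights))   # no attribute exceeds 1
print(violating_attributes(bad_weights))  # mailing-addr sums to more than 1
```

Keying the totals by side ("source" or "target") keeps attributes with the same name in S and T from being merged by accident.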