data fusion – resolving data conflicts in integration

Download Data Fusion –  Resolving Data Conflicts in Integration

If you can't read please download the document

Upload: lumina

Post on 25-Feb-2016

67 views

Category:

Documents


0 download

DESCRIPTION

Data Fusion – Resolving Data Conflicts in Integration. Tutorial at VLDB 2009. Xin Luna Dong – AT&T Labs-Research Felix Naumann – Hasso Plattner Institute (HPI). Scanned. Original. Origins of Data Conflicts. ACM Computing Survey [BN08]. Origins of Data Conflicts. - PowerPoint PPT Presentation

TRANSCRIPT

Data Fusion Resolving Data Conflicts in Integration

Data Fusion Resolving Data Conflicts in IntegrationXin Luna Dong AT&T Labs-ResearchFelix Naumann Hasso Plattner Institute (HPI)Tutorial at VLDB 20091Origins of Data ConflictsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann2

Original

ScannedACM Computing Survey [BN08]2Origins of Data ConflictsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann3

3Origins of Data ConflictsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann4Schering CRMBayer CRMIntegrated data

4Origins of Data Conflicts: German NamesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann5

5Origins of Data Conflicts: Difficult NamesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann6

6Origins of Inter-Source ConflictsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann8Locally consistent but globally inconsistentDifferent data typesLocal spelling variations and conventionsAddressesSt Street, Ave Avenue, etc.R.-Breitscheid-Str. 72 a Rudolf-Breitscheid.-Str. 72A128 spellings for Frankfurt am MainFrankfurt a.M., Frankfurt/M, Frankfurt, Frankfurt a. Main, NamesDr. Ing. h.c. F. Porsche AGHewlett-Packard Development Company, L.P.Numerical data10.000 = 10T EURO = 10k EUR = 10.000,00 = 10,000.- Phone numbers, birth dates, etc.

Felix Naumann, HU Berlin17.2.20048Resolution of Data Conflicts?Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann9 focus is on fusing data management and collaboration: merging multiple data sources, discussion of the data, querying, visualization, and Web publishing.The power of data is truly harnessed when you combine data from multiple sources. Fusion Tables enables you to fuse multiple sets of data when they are about the same entities. In database speak, we call this a join on a primary key but the data originates from multiple independent sources.

tables.googlelabs.com9Web IntegrationGoogle Fusion TablesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann10Allows discussion of values between users

10OverviewData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann12Data fusion in the integration processFoundations of data fusionConflict resolution strategies and functionsConflict resolution operatorsAdvanced truth-discovery techniquesExisting data fusion systemsOpen problems12Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann13Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

13Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann14Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

Schema MappingSchema Integration14Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann15Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

XQueryXQueryTransformation queries or views15Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann16Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

16Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann17Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

17Information Integration Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann18Source ASource B

Federated Database Systems Amit Sheth James Larson

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Scheth & Larson 1990

Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases Amit Sheth James Larson 1990

Preserve lineage18Completeness, Conciseness, and CorrectnessData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann19Schema Matching:Same attribute semantics19Completeness, Conciseness, and CorrectnessData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann20Duplicate detection:Same real-world entities20Completeness, Conciseness, and CorrectnessData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann21Data Fusion: Resolve uncertainties and contradictionsExtensional completenessIntensional completenessIntensional concisenessExtensional conciseness21Schema MatchingData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann22ProblemGiven two schemata, find all correspondences between their attributesDifficultiesSchematic heterogeneity (synonyms & homonyms)Data heterogeneityn:m mappingsTransformation functionsUser interactionThen: Derive a schema mapping

22Duplicate DetectionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann23ProblemGiven one or more data sets, find all sets of objects that represent the same real-world entity.DifficultiesDuplicates are not identicalSimilarity measures Levenshtein, Soundex, Jaccard, etc.Large volume, cannot compare all pairsPartitioning strategies Sorted neighborhood, Blocking, etc.CRM1CRM2CRM1 x CRM2PartitioningSimilarity measureDuplicatesNon-Duplicates???sim < 1sim > 223Ironically, Duplicate Detection has many DuplicatesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann24DoublesDuplicate detectionRecord linkageDeduplicationObject identificationObject consolidationEntity resolutionEntity clusteringReference reconciliationReference matchingHouseholdingHousehold matchingMatchFuzzy matchApproximate matchMerge/purgeHardening soft databasesIdentity uncertaintyMixed and split citation problem24Data FusionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann25ProblemGiven a duplicate, create a single object representation while resolving conflicting data values.DifficultiesNull values: Subsumption and complementationContradictions in data valuesUncertainty & truth: Discover the true value and model uncertainty in this processMetadata: Preferences, recency, correctnessLineage: Keep original values and their originImplementation in DBMS: SQL, extended SQL, UDFs, etc.25The Field of Data FusionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann26Data FusionOperatorsResolution strategiesConflict typesIgnoranceConsistent answersUncertaintyContradictionAvoidanceResolutionInstance-basedMetadata-basedResolution functionsPossible worldsJoin-basedUnion-basedInstance-basedMetadata-basedComplementationAdvanced functionsAggregationSubsumption26OverviewData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann27Data fusion in the integration processFoundations of data fusionConflict resolution strategies and functionsConflict resolution operatorsAdvanced truth-discovery techniquesExisting data fusion systemsOpen problems27Uncertainty and ContradictionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann28UncertaintyNULL value vs. non-NULL valueEasy caseContradictionNon-NULL value vs. (different) non-NULL value

Contra-dictionUncer-taintyUncer-tainty28Semantics of NULLData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann29unknownThere is a value, but I do not know it.E.g.: Unknown date-of-birthnot applicableThere is no meaningful value.E.g.: Spouse for singleswithheldThere is a value, but we are not authorized to see it.E.g.: Private phone lineFelix Naumann, HU Berlin17.2.200429Classification of StrategiesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann30conflictignoranceconflictavoidanceconflictresolutionconflict resolutionstrategiesinstancebasedinstancebasedmetadatabasedmetadatabaseddecidingmediatingdecidingmediatingPass it onTake the informationNo GossipingTrust your friendsCry with the wolvesRoll the diceMeet in the middleNothing is older than the news from yesterdayConflict Resolution FunctionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann31FunctionDescriptionExamplesMin, Max, Sum, Count, AvgStandard aggregationNumChildren, Salary, HeightRandomRandom choiceShoe sizeLongest, ShortestLongest/shortest valueFirst_nameChoose(source)Value from a particular sourceDoB (DMV), CEO (SEC)ChooseDepending(val, col)Value depends on value chosen in other columncity & zip, e-mail & employerVoteMajority decisionRatingCoalesceFirst non-null valueFirst_nameGroup, ConcatGroup or concatenate all valuesBook_reviewsMostRecentMost recent (up-to-date) valueAddressMostAbstract, MostSpecific, CommonAncestorUse a taxonomy / ontologyLocationEscalateExport conflicting valuesgender31Strategie und funktion trennenEigene folie zu den funktionenKatalog der funktionenClassification of FunctionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann32conflictignoranceconflictavoidanceconflictresolutionconflict resolutionstrategiesinstancebasedinstancebasedmetadatabasedmetadatabaseddecidingmediatingdecidingmediatingCoalesceChooseDependingConcatAVG, SUMMIN, MAXRandomVoteChooseMostRecentMostAbstractMostSpecificEscalateCommonAncestorData Fusion in MS Outlook 2007Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann33

33OverviewData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann34Data fusion in the integration processFoundations of data fusionConflict resolution strategies and functionsConflict resolution operatorsAdvanced truth-discovery techniquesExisting data fusion systemsOpen problems34Data Fusion GoalsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann35a, b, ca, b, c, dSource 1(A,B,C) a, b, dSource 2(A,B,D) a, b, c, -a, b, -, da, b, -a, b, -, -a, b, -a, b, -, -a, b, -, -a, b, ca, f(b,e), c, da, e, da, b, c, -a, e, -, da, b, ca, b, c, -a, b, -a, b, c, -a, b, -, -Identical tuplesSubsumed tuplesConflicting tuplesComplementing tuples35Relational Operators Overview Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann36Identical tuplesUNION, OUTER UNIONSubsumed tuples (uncertainty)MINIMUM UNIONComplementing tuples (uncertainty)COMPLEMENT UNION, MERGEConflicting tuples (contradiction)Relational approaches: Match, Group, Fuse, Other approachesPossible worlds, probabilistic answers, consistent answers36Minimum UnionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann37Union: Elimination of exact duplicatesMinimum Union: Elimination of subsumed tuplesOuter unionSubsumptionABCabcefgmnoABDabefhmpABCDabcabefgefhmnomp+=RA tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values.ABCDabcefgefhmnompRewriting in SQL using DWH extensions (Windows) and assuming existence of favorable ordering [RPZ04]Felix Naumann, HU Berlin17.2.200437Full DisjunctionData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann38Represents all possible combinations of source tuples Full outer join on all common attributesAll combinations for more than two sourcesMinimum union over resultsCombines complementing tuples (only inter-source)Algorithms: [GL94,RU96,CS05]ABCabcefgkokmABDabefhmpkqrABCDabcefghmpkokmkqr||=RABCDabcefghmpkokmkqr38Complement Union ProposalData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann39Elimination of complementing tuplesOuter unionComplementationNo known SQL rewritingABCabcefgmnoABDabefhmpABCDabcabefgefhmnomp+=RA tuple t1 complements a tuple t2, if it has same schema and coincides in all non-NULL-values.ABCDabcefghmnompIncludes duplicate removal and subsumptionFelix Naumann, HU Berlin17.2.200439Merge and Prioritized MergeData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann40Mixes Join and Union to a new operator [GPZ01]Idea: Build two versions for each common attribute, one favoring S1, the other favoring S2. Nulls in a source are replaced using COALESCE.Fuses complementing tuples, but only for inter-source duplicatesPriorization possible: Removes conflicting tuples from right relation.( SELECT R.A, COALESCE(R.B, S.B), R.C, S.D FROM R LEFT OUTER JOIN S ON R.A = S.A )UNION( SELECT S.A, COALESCE(S.B, R.B), R.C, S.D FROM R RIGHT OUTER JOIN S ON R.A = S.A )Felix Naumann, HU Berlin17.2.200440Merge and Prioritized MergeData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann41ABCabcefgmnomnqrsABDabefhmp|ABCabcefgmnomnqrsABDabefhmp|ABCDaCOAL(b,b)ceCOAL(f,f)ghmCOAL(n,p)omCOAL(n,p)qrsABCDaCOAL(b,b)ceCOAL(f,f)ghmCOAL(p,n)omCOAL(p,n)ABCDabcefghmnomnqrs====ABCDabcefghmpompA is real-world ID41Match JoinData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann42Context: AURORA Project [Y99]Handles columns individually using projections (with IDs)Performs UNION on each column across all sourcesReassembles using FULL OUTER JOINSUses conflict-tolerant query model to query these possible worlds.WITH OU(A,B,C,D) AS ( ( SELECT A, B, C, NULL AS D FROM U1 ) UNION ( SELECT A, B, NULL AS C, D FROM U2 ) ), B_V (A,B) AS ( SELECT DISTINCT A, B FROM OU ), C_V (A,C) AS ( SELECT DISTINCT A, C FROM OU ), D_V (A,D) AS ( SELECT DISTINCT A, D FROM OU ),SELECT A, B, C, DFROM B_V FULL OUTER JOIN C_V FULL OUTER JOIN D_VON B_V.A=C_V.A AND C_V.A=D_V.AFelix Naumann, HU Berlin17.2.200442Match JoinData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann43Conflict-tolerant query modelChooses tuples from result of MatchJoinThree semanticsHighConfidence, RandomEvidence, PossibleAtAllResolution functions SUM, AVG, MAX, MIN, ANY, DISCARD

SELECTID, Name[ANY], Age[MAX] FROMMatchJoin(U1,U2)WHEREAge>22WITHPossibleAtAll 43Grouping and AggregationData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann44Outer union then group by real-world IDAggregate all other columns using conflict resolving aggregate functionEfficient implementationsCatches inter- and intra-source duplicatesRestricted to built-in aggregate-functionsMAX, MIN, AVG, VAR, STDDEV, SUM, COUNT

WITH OU AS ( ( SELECT A, B, C, NULL AS D FROM U1 ) UNION (ALL) ( SELECT A, B, NULL AS C, D FROM U2 ) ),

SELECT A, MAX(B), MIN(C), SUM(D)FROM OUGROUP BY AFelix Naumann, HU Berlin17.2.200444FUSE BYData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann45SQL extensions to resolve uncertainties and contradictions [BN05,BBB+05]FUSE FROM implies OUTER UNIONRemoves subsumed and duplicate tuples by defaultFUSE BY declares real-world IDRESOLVE specifies conflict resolution function from catalogDefault: COALESCEImplemented on top of relational DBMS XXL

SELECTID,RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director),RESOLVE(Rating), RESOLVE(Genre, Concat)FUSE FROM IMDB, FilmdienstFUSE BY (ID) ON ORDER Year DESC45Summary of OperatorsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann46DuplicatesSubsumed tuplesComplementing tuplesContradictionsUnion, Outer UnionMinimum UnionFull Disjunction(inter-source)Complement UnionMerge (inter-source)(inter-source)MatchJoin+ CTQMGroup ByFuse By46FuSemData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann47Tool to query and fuse data from diverse data sources [BDN07]Based on HumMer project [BBB+05].http://www.hpi.uni-potsdam.de/naumann/sites/fusem/Explore data and find interesting subsetsExecute, explore and compare five different data fusion semantics, specified in their respective syntax:SQL (and extensions, such as Subsumption)MergeMatchJoinFuseByConQuer

47What else is there?Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann48Consistent Query Answering Avoid conflicts and report only certain tuplesThose that appear in every repair [FFM05]Possible worlds modelsBuild all possible solutions, annotated with likelihoodYes/No/Maybe [DeM89]Probability value [LSS94]Probabilistic databases [SD05]Extend algebra to produce probabilitiesExtend query language to query and export probabilitiesFelix Naumann, HU Berlin17.2.200448OverviewData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann49Data fusion in the integration processFoundations of data fusionConflict resolution strategies and functionsConflict resolution operators

Advanced truth-discovery techniquesExisting data fusion systemsOpen problems49OutlineData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann50Data fusion in the integration processFoundations of data fusionConflict resolution strategies and functionsConflict resolution operatorsAdvanced truth-discovery techniquesData fusion in existing integration systemsOpen problems50Basic StrategiesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann51conflictignoranceconflictavoidanceconflictresolutionconflict resolutionstrategiesinstancebasedinstancebasedmetadatabasedmetadatabaseddecidingmediatingdecidingmediatingPass it onTake the informationNo GossipingTrust your friendsCry with the wolvesRoll the diceMeet in the middleNothing is older than the news from yesterdayIntuitionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann52Data sources are of different quality and we trust data from accurate sources more

52IntuitionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann53Data sources are of different quality and we trust data from accurate sources more

53Basic StrategiesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann54conflictignoranceconflictavoidanceconflictresolutionconflict resolutionstrategiesinstancebasedinstancebasedmetadatabasedmetadatabaseddecidingmediatingdecidingmediatingPass it onTake the informationNo GossipingTrust your friendsCry with the wolvesRoll the diceMeet in the middleNothing is older than the news from yesterdayIntuitionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann55Data sources are of different quality and we trust data from accurate sources moreThe real world is dynamic and the true value often evolves over timeE.g., person affiliation, business contact phone

55IntuitionsData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann56Data sources are of different quality and we trust data from accurate sources moreThe real world is dynamic and the true value often evolves over timeE.g., person affiliation, business contact phoneData sources can copy from each other and errors can be propagated quickly

56Advanced Truth-Discovery TechniquesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann57Data sources are of different quality and we trust data from accurate sources moreThe real world is dynamic and the true value often evolves over timeE.g., person affiliation, business contact phoneData sources can copy from each other and errors can be propagated quickly57Advanced Truth-Discovery TechniquesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann58Data sources are of different quality and we trust data from accurate sources moreThe real world is dynamic and the true value often evolves over timeE.g., person affiliation, business contact phoneData sources can copy from each other and errors can be propagated quickly58Trust Accurate SourcesS1S2S3StonebrakerMITBerkeleyMITDewittMSRMSRUWiscBernsteinMSRMSRMSRCareyUCIAT&TBEAHalevyGoogleGoogleUWConsidering accuracy can often improve truth discovery (American short-story Writer and Novelist, Nobel Prize for Literature in 1949, 1897-1962)59Trust Accurate SourcesConsidering accuracy can often improve truth discovery S1S2S3StonebrakerMITBerkeleyMITDewittMSRMSRUWiscBernsteinMSRMSRMSRCareyUCIAT&TBEAHalevyGoogleGoogleUWS1 is more accurate; trusting it more can help find the correct affiliation for Carey(American short-story Writer and Novelist, Nobel Prize for Literature in 1949, 1897-1962)60Trust Accurate SourcesConsidering accuracy can often improve truth discovery S1S2S3StonebrakerMITBerkeleyMITDewittMSRMSRUWiscBernsteinMSRMSRMSRCareyUCIAT&TBEAHalevyGoogleGoogleUWS1 is more accurate; trusting it more can help find the correct affiliation for Carey(American short-story Writer and Novelist, Nobel Prize for Literature in 1949, 1897-1962)61Find Trustable Sources (I)Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann62Deciding authority based on link analysis and source popularitySurvey: Link analysis ranking: algorithms, theory, and experiments [Borodin et al., 05]PageRank [Brin and Page, 98] Authority-hub analysis [Kleinberg, 98]

S62Find Trustable Sources (II)Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann63Assign a global trust rating to each data source based on its behavior in a P2P networkTrustMe [Singh and Liu, 03]EigenTrust [Kamvar et al., 03] Peer i&j:

SS63Find Trustable Sources (III)Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann64Compute accuracy of sourcesCorroborating answers from web sources [Wu and Marian, 07] TruthFinder [Yin et al., 07]Solomon [Dong et al., 09a]

-values provided by S; P(v)-pr of value v being trueHow to compute P(v)?

64Apply Source Accuracy in Truth Discovery[Yin et al., 07] [Dong et al., 09a]Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann65Input: Object ODom(O)={v0,v1,,vn}Observation on OOutput: Pr(vi true|) for each i=0,, n (sum up to 1)According to the Bayes Rule, we need to knowPr(|vi true)Assuming independence of sources, we need to know Pr((S) |vi true)If S provides vi : Pr((S) |vi true) =A(S)If S does not provide vi : Pr((S) |vi true) =(1-A(S))/n

65Consider value similarityModel and Algorithm [Dong et al., 09a]

Continue until source accuracy converges

PropertiesA value provided by more accurate sources has a higher probability to be trueAssuming uniform accuracy, a value provided by more sources has a higher probability to be true66An ExampleAccuracyS1S2S3Round 1.69.57.45Round 2.81.63.41Round 3.87.65.40Round 4.90.64.39Round 5.93.63.40Round 6.95.62.40Round 7.96.62.40Round 8.97.61.40Value ConfidenceCareyUCIAT&TBEARound 11.611.611.61Round 22.401.891.42Round 33.052.161.26Round 43.512.231.19Round 53.862.201.18Round 64.172.151.19Round 74.472.111.20Round 84.762.091.20S1S2S3StonebrakerMITBerkeleyMITDewittMSRMSRUWiscBernsteinMSRMSRMSRCareyUCIAT&TBEAHalevyGoogleGoogleUW(American short-story Writer and Novelist, Nobel Prize for Literature in 1949, 1897-1962)67Advanced Truth-Discovery TechniquesData Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann68Data sources are of different quality and we trust data from accurate sources moreThe real world is dynamic and the true value often evolves over timeE.g., person affiliation, business contact phoneData sources can copy from each other and errors can be propagated quickly68A Dynamic WorldTrue values can evolve over timeA subtle third case: out-of-date S1S2S3Stonebraker (03, MIT)(00, Berkeley)(06, MIT)Dewitt(09, MSR)(08, MSR)(01, UWisc)Bernstein(00, MSR)(00, MSR)(01, MSR)Carey(09, UCI)(05, AT&T)(06, BEA)Halevy(07, Google)(05, Google)(06, UW)69A Dynamic WorldTrue values can evolve over timeA subtle third case: out-of-date Low-quality data can be caused by different reasonsS1S2S3Stonebraker (, Berkeley), (02, MIT)(03, MIT)(00, Berkeley)(01, Berkeley)(06, MIT)Dewitt(, UWisc), (08, MSR)(00, UWisc)(09, MSR)(00, UW)(01, UWisc)(08, MSR)(01, UWisc)Bernstein (, MSR)(00, MSR)(00, MSR)(01, MSR)Carey (, Propell),(02, BEA), (08, UCI)(04, BEA)(09, UCI)(05, AT&T)(06, BEA)Halevy(, UW), (05, Google)(00, UW)(07, Google)(00, UWisc)(02, UW)(05, Google)(01, UWisc)(06, UW)ERR!OUT-OF-DATE!OUT-OF-DATE!SLOW!OUT-OF-DATE!70Refine Accuracy of Sources [Dong et al., 09b]How many transitions are capturedHow many transitions are not mis-capturedHow quickly transitions are capturedDewittS(2000)2008200320052007UWiscMSRUWiscUWCapturableCapturableCapturableCapturableMis-capturableMis-capturableMis-capturableMis-capturableMis-capturableCapturedCoverage = #Captured/#Capturable (e.g., =.25)Mis-capturedMis-capturedExactness = 1-#Mis-Captured/#Mis-Capturable (e.g., 1-2/5=.6)Freshness() = #(Captured w. length