integrating conflicting data_pverconf_may2011
May 2011 Personal Vali
Xin Luna Dong (AT&T Labs—Research)
Laure Berti (Universite de Rennes 1, visiting AT&T)
Divesh Srivastava (AT&T Labs—Research)
The WWW is Great
False Information on the Web (I)
Maurice Jarre (1924-2009) French Conductor and Composer
“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”
2:29, 30 March 2009
False Information on the Web (II)
Posted by Andrew Breitbart in his blog
…
The Internet needs a way to help people separate rumor from real science.
– Tim Berners-Lee
We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama
Why is the Problem Hard? (A Well-Predicted Problem)
Facts and truth really don’t have much to do with each other.
– William Faulkner
             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW
Naïve voting works
Why is the Problem Hard? (A Well-Predicted Problem)
A lie told often enough becomes the truth. – Vladimir Lenin
             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    MS
Dewitt       MSR     MSR       UWisc  UWisc  UWisc
Bernstein    MSR     MSR       MSR    MSR    MSR
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW
Naïve voting works only if data sources are independent.
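A minimal sketch of naïve (majority) voting on the affiliation table, to show how the winner changes once S4 and S5 are added. The function names and the tie handling (returning every top-voted value) are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

# Affiliation claims from the table; list position i holds the value from source S(i+1).
CLAIMS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def naive_vote(values):
    """Return the set of values tied for the highest number of votes."""
    counts = Counter(values)
    top = max(counts.values())
    return {v for v, c in counts.items() if c == top}


def vote_all(num_sources):
    """Run naive voting using only the first num_sources sources."""
    return {obj: naive_vote(vals[:num_sources]) for obj, vals in CLAIMS.items()}


three = vote_all(3)  # S1-S3 only
five = vote_all(5)   # S1-S5
for obj in CLAIMS:
    print(obj, sorted(three[obj]), sorted(five[obj]))
```

With three sources, Carey's affiliation is a three-way tie; with five, the values repeated by S3, S4 and S5 win every contested row.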
Our Goal: Truth Discovery w. Awareness of Dependence Between Sources
You can fool some of the people all the time, and all of the people some of the time, but you cannot fool all of the people all the time.
– Abraham Lincoln
Challenges in Dependence Discovery
1. Sharing common data does not in itself imply copying.
2. With only a snapshot it is hard to decide which source is a copier.
3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.
High-Level Intuitions for Dependence Detection
Intuition I: decide dependence (w/o direction)
Let D1, D2 be data from two sources. D1 and D2 are dependent if
Pr(D1, D2) ≠ Pr(D1) * Pr(D2).
Dependence?
Source 1 on USA Presidents:
1st : George Washington
2nd : John Adams
3rd : Thomas Jefferson
4th : James Madison
…
41st : George H.W. Bush
42nd : William J. Clinton
43rd : George W. Bush
44th: Barack Obama
Source 2 on USA Presidents:
1st : George Washington
2nd : John Adams
3rd : Thomas Jefferson
4th : James Madison
…
41st : George H.W. Bush
42nd : William J. Clinton
43rd : George W. Bush
44th: Barack Obama
Are Source 1 and Source 2 dependent?
Not necessarily
Dependence?
Source 1 on USA Presidents:
1st : George Washington
2nd : Benjamin Franklin
3rd : Tom Jefferson
4th : Abraham Lincoln
…
41st : George W. Bush
42nd : Hillary Clinton
43rd : Mickey Mouse
44th: Barack Obama
Source 2 on USA Presidents:
1st : George Washington
2nd : Benjamin Franklin
3rd : Tom Jefferson
4th : Abraham Lincoln
…
41st : George W. Bush
42nd : Hillary Clinton
43rd : Mickey Mouse
44th: John McCain
Are Source 1 and Source 2 dependent?
-- Common Errors
Very likely
High-Level Intuitions for Dependence Detection
Intuition I: decide dependence (w/o direction)
Let D1, D2 be data from two sources. D1 and D2 are dependent if
Pr(D1, D2) ≠ Pr(D1) * Pr(D2).
Intuition II: decide copying direction
Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if
|F(D1 ∩ D2) - F(D1 - D2)| > |F(D1 ∩ D2) - F(D2 - D1)|.
Dependence?
Source 2 on USA Presidents:
1st : George Washington
2nd : Benjamin Franklin
3rd : Tom Jefferson
4th : Abraham Lincoln
…
41st : George W. Bush
42nd : Hillary Clinton
43rd : Mickey Mouse
44th: John McCain
Are Source 1 and Source 2 dependent?
-- Different Accuracy
Source 1 on USA Presidents:
1st : George Washington
2nd : John Adams
3rd : Thomas Jefferson
4th : Abraham Lincoln
…
41st : George W. Bush
42nd : Hillary Clinton
43rd : George W. Bush
44th: John McCain
S1 more likely to be a copier
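A small sketch of Intuition II on this example, taking F to be accuracy. It assumes the true values are known (copied from the correct presidents list shown earlier) and uses exact string matching, so "Tom Jefferson" counts as wrong; both are simplifications for illustration, and the helper names are hypothetical.

```python
# Assumed ground truth: the correct list from the first "Dependence?" slide.
TRUTH = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
         4: "James Madison", 41: "George H.W. Bush", 42: "William J. Clinton",
         43: "George W. Bush", 44: "Barack Obama"}

S1 = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "George W. Bush", 44: "John McCain"}

S2 = {1: "George Washington", 2: "Benjamin Franklin", 3: "Tom Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "Mickey Mouse", 44: "John McCain"}


def accuracy(claims):
    """F: fraction of claims that match TRUTH exactly (0.0 for no claims)."""
    if not claims:
        return 0.0
    return sum(TRUTH[k] == v for k, v in claims.items()) / len(claims)


def split(a, b):
    """Split a's claims into those shared with b and those unique to a."""
    shared = {k: v for k, v in a.items() if b.get(k) == v}
    own = {k: v for k, v in a.items() if b.get(k) != v}
    return shared, own


shared, own1 = split(S1, S2)
_, own2 = split(S2, S1)

# |F(D1 ∩ D2) - F(D1 - D2)| versus |F(D1 ∩ D2) - F(D2 - D1)|
gap1 = abs(accuracy(shared) - accuracy(own1))
gap2 = abs(accuracy(shared) - accuracy(own2))
likely_copier = "S1" if gap1 > gap2 else "S2"  # ties fall to S2 in this sketch
print(gap1, gap2, likely_copier)
```

The shared data is as inaccurate as Source 2's own data but far less accurate than Source 1's own data, which is why Source 1 looks like the copier.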
Outline
Motivation and intuitions for solution
Techniques
  Problem definition
  Copying detection
  Truth discovery
Experimental Results
Framework of the Solomon project
Problem Definition
INPUT
Objects: an aspect of a real-world entity
  E.g., director of a movie, author list of a book
  Each associated with one true value
Sources: each providing values for a subset of objects
OUTPUT: the true value for each object
Source Dependence
Source dependence: two sources S and T deriving the same part of data directly or transitively from a common source (can be one of S or T).
Independent source
Copier
  copying part (or all) of data from other sources
  may verify or revise some of the copied values
  may add additional values
Assumptions
  Independent values
  Independent copying
  No loop copying
Models for a Static World
Core case
Conditions
1. Same source accuracy
2. Uniform false-value distribution
3. Categorical value
Proposition: W. independent “good” sources, Naïve voting selects values with highest probability to be true.
Models
Depen
Accu: Remove Cond 1
Sim: Remove Cond 3
NonUni: Remove Cond 2
AccuPR: Consider value probabilities in dependence analysis
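Bayesian Analysis – Basic
[Figure: objects shared by S1 and S2, split into Same Values (TRUE: O.A_t, FALSE: O.A_f) and Different Values (O.A_d)]
Observation: $\Phi$
Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum up to 1), where $\perp$ denotes independence and $\sim$ denotes copying
According to the Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
Key: computing $\Pr(\Phi_{O.A} \mid S_1 \perp S_2)$, $\Pr(\Phi_{O.A} \mid S_1 \sim S_2)$ for each $O.A \in S_1 \cap S_2$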
Bayesian Analysis – Probability Computation
Pr      Independence                                               Copying
O.A_t   $(1-\varepsilon)^2$                                        $(1-\varepsilon)\,c + (1-\varepsilon)^2\,(1-c)$
O.A_f   $\frac{\varepsilon^2}{n}$                                  $\varepsilon\,c + \frac{\varepsilon^2}{n}\,(1-c)$
O.A_d   $P_d = 1 - (1-\varepsilon)^2 - \frac{\varepsilon^2}{n}$    $P_d\,(1-c)$
ε: error rate; n: #wrong values; c: copy rate
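A hedged sketch of how the per-object probabilities for the core case can be combined with Bayes' rule into a posterior probability of copying. The prior `alpha` and the default values of `eps`, `n` and `c` are illustrative assumptions, and the function name is hypothetical. The counts `k_t`, `k_f`, `k_d` presuppose that the true values are already known, which in practice they are not.

```python
import math


def copy_probability(k_t, k_f, k_d, eps=0.2, n=100, c=0.8, alpha=0.5):
    """Posterior probability that two sources are dependent (core case).

    k_t, k_f, k_d: number of shared objects on which the two sources give
    the same true value, the same false value, and different values.
    eps: error rate; n: number of wrong values per object; c: copy rate.
    alpha: prior probability of copying (an assumption of this sketch).
    """
    # Per-object probabilities under independence.
    pt_ind = (1 - eps) ** 2
    pf_ind = eps ** 2 / n
    pd_ind = 1 - pt_ind - pf_ind

    # Per-object probabilities under copying.
    pt_cp = (1 - eps) * c + pt_ind * (1 - c)
    pf_cp = eps * c + pf_ind * (1 - c)
    pd_cp = pd_ind * (1 - c)

    # Log-likelihoods (objects are assumed independent), plus log priors.
    log_ind = (math.log(1 - alpha) + k_t * math.log(pt_ind)
               + k_f * math.log(pf_ind) + k_d * math.log(pd_ind))
    log_cp = (math.log(alpha) + k_t * math.log(pt_cp)
              + k_f * math.log(pf_cp) + k_d * math.log(pd_cp))

    # Numerically stable logistic of the log-odds.
    diff = log_cp - log_ind
    if diff >= 0:
        return 1 / (1 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1 + e)


p_same_true = copy_probability(8, 0, 0)    # agree only on true values
p_common_err = copy_probability(2, 5, 1)   # share several false values
p_mostly_diff = copy_probability(2, 0, 6)  # mostly disagree
print(p_same_true, p_common_err, p_mostly_diff)
```

Shared false values move the posterior much more than shared true values, matching the "Common Errors" example.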
Considering Source Accuracy
Pr      Independence                                            S1 Copies S2                               S2 Copies S1
O.A_t   $P_t = (1-\varepsilon(S_1))(1-\varepsilon(S_2))$        $(1-\varepsilon(S_1))\,c + P_t\,(1-c)$     $(1-\varepsilon(S_2))\,c + P_t\,(1-c)$
O.A_f   $P_f = \frac{\varepsilon(S_1)\,\varepsilon(S_2)}{n}$    $\varepsilon(S_1)\,c + P_f\,(1-c)$         $\varepsilon(S_2)\,c + P_f\,(1-c)$
O.A_d   $P_d = 1 - P_t - P_f$                                   $P_d\,(1-c)$                               $P_d\,(1-c)$
II. Finding the True Value
$A(S) = \mathrm{Avg}_{v \in \bar{V}(S)} P(v)$
$A'(S) = \ln \frac{n\,A(S)}{1 - A(S)}$
$C(v) = \sum_{S \in \bar{S}(v)} A'(S)$
$P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$
Consider dependence
$C(v) = \sum_{S \in \bar{S}(v)} A'(S)\,I(S)$
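A minimal sketch of the vote-count computation for a single object. Two assumptions are made here for illustration: D(O) is taken to be the set of values that some source provides for the object, and I(S) is supplied directly as a per-source weight rather than derived from copying probabilities. The function names and the example weights are hypothetical.

```python
import math


def accuracy_score(acc, n):
    """A'(S) = ln(n * A(S) / (1 - A(S)))."""
    return math.log(n * acc / (1 - acc))


def vote_counts(claims, acc, n, indep=None):
    """C(v) = sum over sources S providing v of A'(S) * I(S).

    claims: {source: value} for one object.
    acc:    {source: accuracy A(S)}.
    indep:  {source: I(S)}; missing sources default to 1.0 (independent).
    """
    indep = indep or {}
    counts = {}
    for s, v in claims.items():
        weight = accuracy_score(acc[s], n) * indep.get(s, 1.0)
        counts[v] = counts.get(v, 0.0) + weight
    return counts


def value_probabilities(counts):
    """P(v) = exp(C(v)) / sum of exp(C(v0)) over the provided values v0."""
    m = max(counts.values())  # shift by the max for numerical stability
    exps = {v: math.exp(c - m) for v, c in counts.items()}
    total = sum(exps.values())
    return {v: e / total for v, e in exps.items()}


# Carey's affiliation as reported by S1-S5, with equal (assumed) accuracies.
claims = {"S1": "UCI", "S2": "AT&T", "S3": "BEA", "S4": "BEA", "S5": "BEA"}
acc = {s: 0.8 for s in claims}

p_plain = value_probabilities(vote_counts(claims, acc, n=100))
# Illustrative weights that discount S4 and S5 as likely copiers.
p_dep = value_probabilities(
    vote_counts(claims, acc, n=100, indep={"S4": 0.2, "S5": 0.04}))
print(p_plain["BEA"], p_dep["BEA"])
```

Discounting the votes of likely copiers lowers the probability of BEA without changing the formulas for A'(S) or P(v).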
Solution on the Motivating Example
Round 1
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .87, .2, .2, .99, .99, .99; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA, annotated (1-.99*.8=.2) and (.22)]
Solution on the Motivating Example
Round 2
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .14, .49, .49, .49, .08, .49, .49, .49; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA]
Solution on the Motivating Example
Round 3
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .12, .49, .49, .49, .06, .49, .49, .49; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA]
Solution on the Motivating Example
Round 4
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .10, .48, .49, .50, .05, .49, .48, .50; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA]
Solution on the Motivating Example
Round 5
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .09, .47, .49, .51, .04, .49, .47, .51; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA]
Solution on the Motivating Example
Round 13
[Figure: "Copying Relationship" graph over S1-S5 with edge labels .55, .49, .55, .49, .44, .44; "Truth Discovery" panel showing S1-S5 voting on Carey's values UCI, AT&T, BEA]
Combining Accuracy and Dependence
[Figure: loop of three steps (Step 1, Step 2, Step 3) connecting Truth Discovery, Source-accuracy Computation, and Dependence Detection]
Theorem: w/o accuracy, converges
Observation: w. accuracy, converges when #objs >> #srcs
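A rough sketch of the iteration, covering only two of the three steps: truth discovery and source-accuracy computation. Dependence detection is deliberately omitted to keep the example short, so this version can still be misled by copiers. The initial accuracy, the clamping bounds, the stopping tolerance, the use of only the provided values when normalizing, and the function name are all assumptions made for illustration.

```python
import math


def discover(claims, n=100, init_acc=0.8, max_rounds=50, tol=1e-4):
    """Alternate truth discovery and accuracy computation until stable.

    claims: {source: {object: value}}.
    Returns (truths, accuracies, rounds_used).
    """
    acc = {s: init_acc for s in claims}
    probs = {}
    rnd = 0
    for rnd in range(1, max_rounds + 1):
        # Truth discovery: accuracy-weighted vote count per value.
        counts = {}
        for s, vals in claims.items():
            score = math.log(n * acc[s] / (1 - acc[s]))
            for obj, v in vals.items():
                per_obj = counts.setdefault(obj, {})
                per_obj[v] = per_obj.get(v, 0.0) + score
        probs = {}
        for obj, c in counts.items():
            m = max(c.values())
            z = sum(math.exp(x - m) for x in c.values())
            probs[obj] = {v: math.exp(x - m) / z for v, x in c.items()}

        # Source accuracy: average probability of the values a source gives,
        # clamped so the logarithm above stays finite.
        new_acc = {}
        for s, vals in claims.items():
            a = sum(probs[o][v] for o, v in vals.items()) / len(vals)
            new_acc[s] = min(max(a, 0.01), 0.99)

        delta = max(abs(new_acc[s] - acc[s]) for s in claims)
        acc = new_acc
        if delta < tol:
            break

    truths = {o: max(p, key=p.get) for o, p in probs.items()}
    return truths, acc, rnd


ROWS = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}
claims = {f"S{i + 1}": {obj: vals[i] for obj, vals in ROWS.items()}
          for i in range(5)}
truths, acc, rounds = discover(claims)
print(truths, rounds)
```

Without the dependence-detection step, the values repeated by S3, S4 and S5 reinforce each other across rounds, which is the failure mode the full loop is meant to address.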
Experimental Setup
Dataset: AbeBooks
  877 bookstores
  1263 CS books
  24364 listings, w. ISBN, author-list
  After pre-cleaning, each book on avg has 19 listings and 4 author lists (ranges from 1-23)
Gold standard: 100 random books
  Manually check author list from book cover
Measure: Precision=#(Corr author lists)/#(All lists)
Naïve Voting and Types of Errors
Naïve voting has precision .71
Contributions of Various Components
Methods                 Prec  #Rnds  Time(s)
Naïve                   .71   1      .2
Only value similarity   .74   1      .2
Only source accuracy    .79   23     1.1
Only source dependence  .83   3      28.3
Depen+accu              .87   22     185.8
Depen+accu+sim          .89   18     197.5
Precision improves by 25.4% over Naïve
Considering dependence improves the results most
Reasonably fast
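The 25.4% figure is the relative improvement of the best method (Depen+accu+sim, precision .89) over naïve voting (precision .71), which can be checked directly:

```latex
\[
\frac{0.89 - 0.71}{0.71} \;=\; \frac{0.18}{0.71} \;\approx\; 0.2535 \;\approx\; 25.4\%
\]
```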
Data Integration Faces 3 Challenges
[Figure labels: Paper, Scissors, Glue]
• Schema matching
• Model management
• Query answering using views
• Information extraction
• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)
• Data fusion
• Truth discovery
Assume INDEPENDENCE of data sources
Source Copying Adds A New Dimension to Data Integration
Copying Can Be Large Scaled [VLDB’10a] (Copying of AbeBooks Data)
Data collected from AbeBooks [Yin et al., 2007]
Related Work
Data provenance [Buneman et al., PODS’08]
  Focus on effective presentation and retrieval
  Assume knowledge of provenance/lineage
Opinion pooling [Clemen&Winkler, 1985]
  Combine probability distributions from multiple experts
  Again, assume knowledge of dependence
Detect plagiarism of text, image/video, programs, etc. [Dong, Sigmod’11 tutorial]
http://www2.research.att.com/~yifanhu/SourceCopying/