Global Detection of Complex Copying Relationships Between Sources
Xin Luna Dong, AT&T Labs-Research. Joint work with Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Presented at VLDB 2010.
![Page 1: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/1.jpg)
GLOBAL DETECTION OF COMPLEX COPYING
RELATIONSHIPS BETWEEN SOURCES
Xin Luna Dong
AT&T Labs-Research
Joint work with Laure Berti-Equille, Yifan Hu, Divesh Srivastava
@VLDB’2010
![Page 2: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/2.jpg)
Information Propagation Becomes Much Easier with the Web Technologies
![Page 3: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/3.jpg)
False Information Can Be Propagated
Posted by Andrew Breitbart
In his blog
…
![Page 4: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/4.jpg)
The Internet needs a way to help people separate rumor from real science.
– Tim Berners-Lee
We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama
![Page 5: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/5.jpg)
Large-Scale Copying on Structured Data (Copying of AbeBooks Data)
Data collected from AbeBooks [Yin et al., 2007]
![Page 6: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/6.jpg)
Observation I. Intuitively Meaningful Clusters According to the Copying Relationships
![Page 7: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/7.jpg)
Observation I. Intuitively Meaningful Clusters According to the Copying Relationships
![Page 8: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/8.jpg)
Observation II. Complex Copying Relationships
Co-copying
![Page 9: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/9.jpg)
Observation II. Complex Copying Relationships
Transitive copying
Multi-source copying
![Page 10: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/10.jpg)
Understanding Complex Copying Relationships
Benefits
- Business purpose: data are valuable
- In-depth data analysis: information dissemination
- Improve data integration: truth discovery, entity resolution, schema mapping, query optimization
Current techniques make local decisions [Dong et al., 09a][Dong et al., 09b][Blanco et al., 10]
- Cannot distinguish co-copying, transitive copying, direct copying from multiple sources
![Page 11: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/11.jpg)
Our Contributions
Local Detection
- More accurate decisions on copying direction (important for global detection)
- Glean information from completeness, formatting
- Consider correlated copying: e.g., a source copying the name of a book can also copy its author list
Global Detection
- Global detection of copying
- Discovering co-copying and transitive copying
![Page 12: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/12.jpg)
Outline
- Motivation and contributions
- Problem definition and techniques
  - Local Detection
  - Global Detection
  - Intuitions, Techniques
- Experimental results
- Related work and conclusions
![Page 13: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/13.jpg)
Problem Definition: Input
Objects: a real-world entity, described by a set of attributes
- Each associated with a true value
Sources: each providing data for a subset of objects

| Src | ISBN | Name | Author |
|---|---|---|---|
| S1 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S1 | 2 | Web Usability: A User-Centered Design Approach | Lazar, Jonathan |
| S2 | 1 | IPV4: Theory, Protocol, and Practice | - |
| S2 | 2 | Web Usability: A User | Jonathan Lazar |
| S3 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S3 | 2 | Web Usability: A User | Jonathan Lazar |
| S4 | 1 | IPV6: Theory, Protocol, and Practice | Loshin |
| S4 | 2 | Web Usability: A User | Lazar |

Callouts on the table: missing values, different formats, incorrect values
![Page 14: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/14.jpg)
Problem Definition: Output
For each pair of sources S1, S2, decide the probability of S1 copying directly from S2
- A copier copies all or a subset of data
- A copier can add values and verify/modify copied values (independent contribution)
- A copier can re-format copied values (still considered as copied)
[Diagram: copying relationships among S1, S2, S3, S4]

| Src | ISBN | Name | Author |
|---|---|---|---|
| S1 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S1 | 2 | Web Usability: A User-Centered Design Approach | Lazar, Jonathan |
| S2 | 1 | IPV4: Theory, Protocol, and Practice | - |
| S2 | 2 | Web Usability: A User | Jonathan Lazar |
| S3 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S3 | 2 | Web Usability: A User | Jonathan Lazar |
| S4 | 1 | IPV6: Theory, Protocol, and Practice | Loshin |
| S4 | 2 | Web Usability: A User | Lazar |
![Page 15: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/15.jpg)
Intuitions for Local Copying Detection
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
- Overlap on unpopular values $\Rightarrow$ copying
- Changes in quality of different parts of data $\Rightarrow$ copying direction [VLDB’09]
- Consider correctness of data
![Page 16: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/16.jpg)
Correctness of Data as Evidence for Copying

| Src | ISBN | Name | Author |
|---|---|---|---|
| S1 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S1 | 2 | Web Usability: A User-Centered Design Approach | Lazar, Jonathan |
| S2 | 1 | IPV4: Theory, Protocol, and Practice | - |
| S2 | 2 | Web Usability: A User | Jonathan Lazar |
| S3 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S3 | 2 | Web Usability: A User | Jonathan Lazar |
| S4 | 1 | IPV6: Theory, Protocol, and Practice | Loshin |
| S4 | 2 | Web Usability: A User | Lazar |

[Diagram: copying relationships among S1, S2, S3, S4]
![Page 17: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/17.jpg)
Intuitions for Local Copying Detection
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
- Overlap on unpopular values $\Rightarrow$ copying
- Changes in quality of different parts of data $\Rightarrow$ copying direction [VLDB’09]
- Consider correctness of data
- Consider additional evidence
![Page 18: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/18.jpg)
Formatting as Evidence for Copying

| Src | ISBN | Name | Author |
|---|---|---|---|
| S1 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S1 | 2 | Web Usability: A User-Centered Design Approach | Lazar, Jonathan |
| S2 | 1 | IPV4: Theory, Protocol, and Practice | - |
| S2 | 2 | Web Usability: A User | Jonathan Lazar |
| S3 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S3 | 2 | Web Usability: A User | Jonathan Lazar |
| S4 | 1 | IPV6: Theory, Protocol, and Practice | Loshin |
| S4 | 2 | Web Usability: A User | Lazar |

Callouts on the table: different formats, sub-values
[Diagram: copying relationships among S1, S2, S3, S4]
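To make the formatting evidence concrete, here is a minimal Python sketch that classifies how two author strings from the example table relate. The normalization rule (lowercase, drop commas, ignore token order), the treatment of "-" as a missing value, and the category names are all assumptions made for illustration; the slides do not specify how formats or sub-values are detected.

```python
MISSING = {"", "-"}  # assumption: "-" marks a missing value, as in the example table


def tokens(value):
    # Hypothetical normalization: lowercase, drop commas, ignore token order.
    return frozenset(value.replace(",", " ").lower().split())


def compare_values(a, b):
    """Classify how two provided values relate (illustrative categories only)."""
    if a in MISSING or b in MISSING:
        return "missing"
    if a == b:
        return "same value, same format"
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return "same value, different format"
    if ta < tb or tb < ta:  # strict subset: one value is part of the other
        return "sub-value"
    return "different"


print(compare_values("Lazar, Jonathan", "Jonathan Lazar"))
print(compare_values("Loshin, Peter", "Loshin"))
print(compare_values("Loshin, Peter", "-"))
```

Under these assumptions, S1 and S2 give the same author for ISBN 2 in different formats, while S4's "Loshin" is a sub-value of S1's "Loshin, Peter".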
![Page 19: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/19.jpg)
Intuitions for Local Copying Detection
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
- Overlap on unpopular values $\Rightarrow$ copying
- Changes in quality of different parts of data $\Rightarrow$ copying direction [VLDB’09]
- Consider correctness of data
- Consider additional evidence
- Consider correlated copying
![Page 20: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/20.jpg)
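Correlated Copying
S: two sources providing the same value; D: two sources providing different values

First pattern (17 same values and 8 different values):

| | K | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| O1 | S | S | S | D | D |
| O2 | S | D | S | S | D |
| O3 | S | S | D | S | D |
| O4 | S | S | S | D | S |
| O5 | S | D | S | S | S |

Second pattern (17 same values and 8 different values):

| | K | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| O1 | S | S | S | S | S |
| O2 | S | S | S | S | S |
| O3 | S | S | S | S | S |
| O4 | S | D | D | D | D |
| O5 | S | D | D | D | D |

Slide callout: Copying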
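The two S/D patterns on this slide have identical totals, so a per-value count alone cannot tell them apart. The short Python sketch below re-counts both patterns and then looks at them per object. The helper names are hypothetical, and reading the second pattern as the one that suggests copying is an interpretation of the slide rather than something the transcript states.

```python
# Each string is one object row: key column K followed by attributes A1..A4.
first_pattern = ["SSSDD", "SDSSD", "SSDSD", "SSSDS", "SDSSS"]
second_pattern = ["SSSSS", "SSSSS", "SSSSS", "SDDDD", "SDDDD"]


def total_same(rows):
    """Number of cells where the two sources provide the same value."""
    return sum(row.count("S") for row in rows)


def total_different(rows):
    """Number of cells where the two sources provide different values."""
    return sum(row.count("D") for row in rows)


def same_attributes_per_object(rows):
    """For each object, how many non-key attributes agree."""
    return [row[1:].count("S") for row in rows]


for name, rows in [("first", first_pattern), ("second", second_pattern)]:
    print(name, total_same(rows), total_different(rows),
          same_attributes_per_object(rows))
```

Both patterns total 17 same and 8 different values, but in the second the agreement is concentrated: three objects agree on every attribute and two agree on none, which is what correlated copying of whole objects would look like.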
![Page 21: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/21.jpg)
Intuitions for Local Copying Detection
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
- Overlap on unpopular values $\Rightarrow$ copying
- Changes in quality of different parts of data $\Rightarrow$ copying direction [VLDB’09]
- Consider correctness of data
- Consider additional evidence
- Consider correlated copying
![Page 22: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/22.jpg)
Experimental Results for Local Copying Detection on Synthetic Data
![Page 23: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/23.jpg)
Outline
- Motivation and contributions
- Problem definition and techniques
  - Local Detection
  - Global Detection
  - Intuitions, Techniques
- Experimental results
- Related work and conclusions
![Page 24: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/24.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)
![Page 25: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/25.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
Local copying detection results
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)
![Page 26: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/26.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
- Looking at the copying probabilities?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)
![Page 27: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/27.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
- Counting shared values?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)
Copying probability shown on every edge, in all three scenarios: 1
![Page 28: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/28.jpg)
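Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values)
Shared-value counts shown on the edges, in every scenario: 50 between S1 and each of the other two sources, 30 between S2 and S3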
![Page 29: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/29.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}; shared value sets on the edges: V1-V50, V51-V100, V101-V130
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}; shared value sets on the edges: V1-V50, V21-V70, V21-V50
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values); shared value sets on the edges: V1-V50, {V21-V50, V81-V100}, V21-V50
![Page 30: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/30.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
X Comparing the set of shared values?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}; shared value sets on the edges: V1-V50, V51-V100, V101-V130
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}; shared value sets on the edges: V1-V50, V21-V70, V21-V50
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values); shared value sets on the edges: V1-V50, {V21-V50, V81-V100}, V21-V50
V21-V50 shared by 3 sources
We need to reason for each data item in a principled way!
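The value ranges in the three scenarios can be checked mechanically. The Python sketch below rebuilds the value sets as integer ranges and computes the pairwise shared counts, plus the number of values shared by all three sources. The helper names and the neutral labels "a" and "b" for the two non-S1 sources are assumptions for illustration.

```python
def values(*ranges):
    """Build a set of value indices from inclusive (low, high) ranges."""
    out = set()
    for low, high in ranges:
        out.update(range(low, high + 1))
    return out


S1 = values((1, 100))

# The two other sources in each scenario, labeled a and b here.
scenarios = {
    "multi-source copying": (values((51, 130)), values((1, 50), (101, 130))),
    "co-copying": (values((21, 70)), values((1, 50))),
    "transitive copying": (values((21, 50), (81, 100)), values((1, 50))),
}


def pairwise_shared_counts(a, b):
    """Counts of values shared by (S1, a), (S1, b), and (a, b)."""
    return len(S1 & a), len(S1 & b), len(a & b)


def shared_by_all_three(a, b):
    return len(S1 & a & b)


for name, (a, b) in scenarios.items():
    print(name, pairwise_shared_counts(a, b), shared_by_all_three(a, b))
```

Every scenario yields the same pairwise counts (50, 50, 30), so counting shared values cannot separate them. Co-copying and transitive copying also both have 30 values (V21-V50) shared by all three sources, which is why comparing the shared sets is not enough either.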
![Page 31: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/31.jpg)
Global Copying Detection
1. First find a set of copyings R that significantly influence the rest of the copyings. How to find such R?
2. Adjust the copying probability for the rest of the copyings: $P(S_1 \to S_2 \mid R)$. How to compute $P(S_1 \to S_2 \mid R)$?
![Page 32: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/32.jpg)
Computing $P(S_1 \to S_2 \mid R)$
Replace $\Pr(\Phi(S_1) \mid S_1 \to S_2)$ everywhere with $\Pr(\Phi(S_1) \mid S_1 \to S_2, R)$
For each O.A, consider sources associated with S1 in R:
- Sf(O.A): sources providing the same value in the same format on O.A as S1
- Sv(O.A): sources providing the same value in a different format on O.A as S1
- Pf / Pv: probability that S1 does not copy O.A from any source in Sf(O.A) / Sv(O.A)
$\Pr(\Phi_{O.A}(S_1) \mid S_1 \to S_2, R) = (1 - P_f P_v) + P_f P_v \Pr(\Phi_{O.A}(S_1) \mid S_1 \to S_2)$
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
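The adjustment on this slide has the form (1 - Pf·Pv) + Pf·Pv·(original probability). A small Python sketch of that arithmetic follows. The slide does not say how Pf and Pv are obtained; computing each as a product of (1 - copy probability) over the relevant sources, i.e. treating the copy events as independent, is an assumption made here, and the function names are hypothetical.

```python
from math import prod


def prob_copies_from_none(copy_probs):
    """Probability that S1 copies the item from none of the listed sources.

    Assumption (not from the slides): copy events are independent, so the
    probability is the product of (1 - p) over the sources.
    """
    return prod(1.0 - p for p in copy_probs)


def adjusted_probability(base, sf_copy_probs, sv_copy_probs):
    """(1 - Pf*Pv) + Pf*Pv*base, the per-item adjustment given R.

    base: the probability of S1's observation on O.A before considering R.
    sf_copy_probs / sv_copy_probs: copy probabilities toward sources in
    Sf(O.A) / Sv(O.A).
    """
    pf = prob_copies_from_none(sf_copy_probs)
    pv = prob_copies_from_none(sv_copy_probs)
    return (1.0 - pf * pv) + pf * pv * base


print(adjusted_probability(0.1, [], []))        # no sources in R share the value
print(adjusted_probability(0.1, [0.5], [0.5]))  # partial alternative explanation
print(adjusted_probability(0.1, [1.0], []))     # value fully explained by R
```

When no source in R provides the same value, the probability is unchanged; when S1 certainly copies the value from a source in R, the observation is fully explained and the adjusted probability is 1.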
![Page 33: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/33.jpg)
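Multi-Source Copying? Co-copying? Transitive Copying?
Three scenarios over sources S1, S2, S3; in each, S1 provides {V1-V100}:
- Multi-source copying: S2 and S3 provide {V51-V130} and {V1-V50, V101-V130}; shared value sets on the edges: V1-V50, V51-V100, V101-V130. $R=\{S_3 \to S_1\}$, $\Pr(\Phi(S_3)) = \Pr(\Phi(S_3) \mid R)$ for V101-V130
- Co-copying: S2 and S3 provide {V21-V70} and {V1-V50}; shared value sets on the edges: V1-V50, V21-V70, V21-V50. $R=\{S_3 \to S_1\}$, $\Pr(\Phi(S_3)) \ll \Pr(\Phi(S_3) \mid R)$ for V21-V50
- Transitive copying: S2 and S3 provide {V21-V50, V81-V100} and {V1-V50} (V81-V100 are popular values); shared value sets on the edges: V1-V50, {V21-V50, V81-V100}, V21-V50. $R=\{S_3 \to S_2\}$, $\Pr(\Phi(S_3)) \ll \Pr(\Phi(S_3) \mid R)$ for V21-V50; $\Pr(\Phi(S_3))$ is high for V81-V100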
![Page 34: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/34.jpg)
Finding R
R: the most influential copying relationships
Maximize [objective formula on the slide image]
Finding R is NP-complete (reduction from the HITTING SET problem)
We need a fast greedy algorithm
![Page 35: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/35.jpg)
Greedy Algorithm for Finding R
Goal: Maximize [objective formula on the slide image]
Intuitions
- For each source, find the most “influential” sources from which it copies
- Order the original sources by their accumulated influence on others, and iteratively add each corresponding copying to R unless one of the following holds:
  - Prune copyings that have less accumulated influence on others than the influence they receive from others
  - Prune copyings that can be significantly influenced by the already selected copyings
E.g., $P(S_4 \to S_1) - P(S_4 \to S_1 \mid S_4 \to S_3) = .8$, $P(S_4 \to S_2) - P(S_4 \to S_2 \mid S_4 \to S_3) = .8$, $P(S_4 \to S_3) - P(S_4 \to S_3 \mid S_4 \to S_1) = .5$, $P(S_4 \to S_3) - P(S_4 \to S_3 \mid S_4 \to S_2) = .5$
Accumulated influence: .8 + .8 = 1.6
[Diagram: sources S1, S2, S3, S4, with two edges marked X]
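The numbers in the example can be worked through in a few lines of Python. This is only a sketch of the first pruning rule (compare influence exerted with influence received); it does not implement the full greedy algorithm or its objective, which the transcript does not preserve. The dictionary layout and names are hypothetical.

```python
from collections import defaultdict

# influence[(a, b)] = P(b) - P(b | a): how much assuming copying a
# lowers the probability of copying b. Values are from the slide example.
influence = {
    ("S4->S3", "S4->S1"): 0.8,
    ("S4->S3", "S4->S2"): 0.8,
    ("S4->S1", "S4->S3"): 0.5,
    ("S4->S2", "S4->S3"): 0.5,
}

exerted = defaultdict(float)   # accumulated influence on other copyings
received = defaultdict(float)  # accumulated influence from other copyings
for (source, target), amount in influence.items():
    exerted[source] += amount
    received[target] += amount

copyings = sorted(set(exerted) | set(received))

# First pruning rule: drop copyings that exert less than they receive.
kept = [c for c in copyings if exerted[c] >= received[c]]
pruned = [c for c in copyings if exerted[c] < received[c]]

print("kept:", kept)
print("pruned:", pruned)
```

With these numbers S4→S3 exerts 1.6 and receives 1.0, so it is kept as a candidate for R, while S4→S1 and S4→S2 each exert .5 and receive .8, so they are pruned, which appears to match the two edges marked X on the slide.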
![Page 36: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/36.jpg)
Experimental Results for Global Detection on Synthetic Data
Sensitivity: percentage of copying relationships that are identified with the correct direction
Specificity: percentage of non-copying pairs that are identified as such
![Page 37: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/37.jpg)
Outline
- Motivation and contributions
- Problem definition and techniques
  - Local Detection
  - Global Detection
  - Intuitions, Techniques
- Experimental results
- Related work and conclusions
![Page 38: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/38.jpg)
Experimental Setup
Dataset: Weather data
- 18 weather websites
- for 30 major USA cities
- collected every 45 minutes for a day
- 33 collections, so 990 objects
- 28 distinct attributes
Challenges
- No true/false notion, only popularity
- Frequent updates: up-to-date data may not have been copied at crawling
- Complete data and standard formatting: lack evidence from completeness & formatting
![Page 39: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/39.jpg)
Golden Standard
![Page 40: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/40.jpg)
Silver Standard
![Page 41: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/41.jpg)
Results of Global Detection
![Page 42: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/42.jpg)
Results of Local Detection
![Page 43: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/43.jpg)
Experiment Results
Measure: Precision, Recall, F-measure
C: real copying; D: detected copying
$P = \frac{|C \cap D|}{|D|}, \quad R = \frac{|C \cap D|}{|C|}, \quad F = \frac{2PR}{P+R}$

| Methods | Precision | Recall | F-measure |
|---|---|---|---|
| Corr (Only correctness) | .5 | .43 | .46 |
| Enriched (More evidence) | 1 | .14 | .25 |
| Local (correlated copying) | .33 | .86 | .48 |
| Global (global detection) | .79 | .79 | .79 |

Callouts on the table:
- Transitive/co-copying not removed
- Ignoring evidence from correlated copying
- Enriched improves over Corr when true/false notion does apply
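As a quick check of the measures, the Python sketch below computes precision, recall, and F-measure from sets of copying pairs and recomputes the F-measure column from the reported precision and recall. The function names and the tiny example sets are hypothetical; only the four (precision, recall, F-measure) rows come from the slide.

```python
def precision_recall_f(real, detected):
    """P = |C∩D|/|D|, R = |C∩D|/|C|, F = 2PR/(P+R) for sets of copying pairs."""
    common = real & detected
    p = len(common) / len(detected) if detected else 0.0
    r = len(common) / len(real) if real else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def f_measure(p, r):
    return 2 * p * r / (p + r)


# Reported (precision, recall, F-measure) per method, from the slide.
reported = {
    "Corr": (0.5, 0.43, 0.46),
    "Enriched": (1.0, 0.14, 0.25),
    "Local": (0.33, 0.86, 0.48),
    "Global": (0.79, 0.79, 0.79),
}

recomputed = {name: round(f_measure(p, r), 2) for name, (p, r, _) in reported.items()}
print(recomputed)

# Hypothetical example: (copier, original) pairs.
real = {("A", "B"), ("C", "B")}
detected = {("A", "B"), ("D", "B")}
print(precision_recall_f(real, detected))
```

The recomputed F-measures agree with the reported column to two decimals, which supports reading the garbled formula residue as the standard harmonic mean of precision and recall.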
![Page 44: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/44.jpg)
Related Work
Copying detection
- Texts/Programs [Schleimer et al., 03][Buneman, 71]
- Videos [Law-To et al., 07]
- Structured sources
  - [Dong et al., 09a] [Dong et al., 09b]: Local decision
  - [Blanco et al., 10]: Assume a copier must copy all attribute values of an object
Data provenance [Buneman et al., PODS’08]
- Focus on effective presentation and retrieval
- Assume knowledge of provenance/lineage
![Page 45: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/45.jpg)
Conclusions and Future Work
Conclusions
- Improve previous techniques for pairwise copying detection by
  - plugging in different types of copying evidence
  - considering correlations between copying
- Global detection for eliminating co-copying and transitive copying
Ongoing and future work
- Categorization and summarization of the copied instances
- Visualization of copying relationships [VLDB’10 demo]
![Page 46: Global Detection of Complex Copying Relationships Between Sources](https://reader035.vdocuments.site/reader035/viewer/2022062516/56812bdd550346895d904d9a/html5/thumbnails/46.jpg)
GLOBAL DETECTION OF COMPLEX COPYING
RELATIONSHIPS BETWEEN SOURCES
http://www2.research.att.com/~yifanhu/SourceCopying/