corroborating facts from affirmative statements
TRANSCRIPT
Corroborating Information from Affirmative Statements
Minji Wu, Rutgers University
Amélie Marian, Rutgers University
Background
• Information is often untrustworthy
• Erroneous (e.g., news sites covering breaking events)
• Misleading (e.g., malicious sources)
• Biased (e.g., political domains)
• Outdated (e.g., a knowledge base that doesn’t update frequently)
• This phenomenon is amplified by the widespread
information dependency (copy-paste)
• It is difficult for the user to discern the correctness of
information and the trustworthiness of the sources
2
Conflicting Information
3
Data Corroboration
• Early Corroboration
• Frequency-based approach
• Recent work on Corroboration techniques
• Trustworthiness of sources: a measure s(s) that quantifies the precision of a source s
• Probability of information (facts): a measure s(f) that quantifies the probability that a fact f is true
• Starting with a default s(s), iteratively compute the probabilities for the facts and the trustworthiness of the sources
• Machine-Learning approaches
• Some corroboration problems can be seen as a ML classification problem
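The iterative scheme described above (start from a default s(s), then alternate between fact probabilities and source trustworthiness) can be sketched as follows. This is a toy illustration, not TwoEstimate or BayesEstimate: the averaging update rules, the function name `iterate_trust`, and the vote layout are assumptions made for the example.

```python
def iterate_trust(votes, default_trust=0.9, rounds=10):
    """votes: {fact: {source: True/False}}; returns (fact_prob, trust).

    Hypothetical update rules, chosen only for illustration.
    """
    sources = {s for v in votes.values() for s in v}
    trust = {s: default_trust for s in sources}
    prob = {}
    for _ in range(rounds):
        # Fact step: a T vote contributes the source's trust,
        # an F vote contributes its complement.
        for f, v in votes.items():
            prob[f] = sum(trust[s] if vote else 1 - trust[s]
                          for s, vote in v.items()) / len(v)
        # Source step: trust is the average probability that the
        # source's own statements are right.
        for s in sources:
            scores = [prob[f] if v[s] else 1 - prob[f]
                      for f, v in votes.items() if s in v]
            trust[s] = sum(scores) / len(scores)
    return prob, trust
```

Under this toy rule, a set of purely affirmative votes never moves away from the default trust value, because nothing in the data separates one source from another.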
4
What if there are no conflicts?
• Does the presence of information without contradictions mean it is correct?
5
Our Problem: Corroborating Information with only Affirmative Statements
• We focus on scenarios in which sources show little or no dissent
• Frequent real-world problem (rumors, hard-to-rebut
claims)
• Difficult to identify incorrect information since all
reported information is consistent
• Existing corroboration approaches do not work
well
• Rely on conflicting information to differentiate the
trustworthiness of the sources
6
Contributions
• Novel corroboration approach:
• Assigns multiple trust scores to each source
• Considers the trustworthiness of the source for a group of facts
• Corroboration algorithm incrementally evaluates facts
• Groups unknown facts based on the sources reporting
them
• Makes decisions based on information entropy
• Extensive real world and synthetic experiments that
demonstrate the benefits of our method
7
Evaluation Setting
• Corroboration task: identify legitimate restaurant listings in NYC given the listing information from a set of sources
• Sources for restaurant address: Citysearch, Foursquare, Menupages, Opentable, Yellowpages, Yelp
• Golden set
• Selected restaurants in 3 zip codes: 601 listings
• Verified their legitimacy in person (Apr 2012)
• 340 true and 261 false
8
Motivating Example
Opentable Yelp Menupages Citysearch Yellowpages Correct value
M Bar T T true
Sam’s T T T T true
27 Sunshine T T T true
Crepe Creations T T false
El Portal T T false
Holy Basil F T false
Papatzul T T true
Wine Spot T T true
Vbar T T true
Wai Cafe T T false
Tomoe Sushi T T T true
Khushie 139 F F T false
9
State-of-the-art Corroboration Strategies
Approaches
• TwoEstimate [Galland WSDM’10]
• Iteratively estimates the trust score of the sources
and the probability of the facts
• BayesEstimate [Zhao VLDB’12]
• Uses a Bayesian graphical model
• Considers two-sided errors (false positives and false negatives)
               Precision  Recall  Accuracy  Computed trust scores
TwoEstimate    .64        1       .67       (1, 1, 0.8, 0.9, 1)
BayesEstimate  .58        1       .58       (1, 0.8, 0.6, 1, 1)
The computed trust scores are used to evaluate each fact!
10
Key Observation
• Using the same trust score to judge the correctness
of all information is too coarse
• Each source may exhibit different accuracy on different groups of facts
• The corroboration result could be greatly improved if
we could derive finer-grained trust scores for each
source
11
Multi-value trust scores for sources
Trust Scores
• Single-value trust scores (s(s))
• A single measure for each source
• Each fact is evaluated using the same value from each source
• Multi-value trust scores
• A group of values assigned to each source
s(s) = < s1(s), s2(s), …>
• Each fact (or group of facts) is evaluated using one of the trust values from each source
12
Multi-Value Trust Scores
• Two major challenges
• How to calculate the trust values for each source
• How to decide which sources’ trust values to consider for each fact
• Solution: an incremental evaluation mechanism
• Select a subset of facts to process
• Update the trust values based on the already processed facts
• Facts are assigned a truth value when they are processed
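The incremental mechanism above can be sketched roughly as follows. The scoring rule (averaging a source's trust or its complement), the 0.5 threshold, the agreement-based trust update, and the name `incremental_corroborate` are all assumptions made for illustration. The order of the batches is taken as given here, whereas the slides select facts using entropy.

```python
def incremental_corroborate(votes, batches, default_trust=0.9, threshold=0.5):
    """votes: {fact: {source: bool}}; batches: lists of facts, in processing order.

    Returns (truth, history); history[i] holds the trust values used for batch i,
    so each source ends up with several trust values rather than one.
    """
    sources = sorted({s for v in votes.values() for s in v})
    trust = {s: default_trust for s in sources}
    truth, history = {}, []
    for batch in batches:
        history.append(dict(trust))  # trust values in force for this batch
        for f in batch:
            v = votes[f]
            p = sum(trust[s] if vote else 1 - trust[s]
                    for s, vote in v.items()) / len(v)
            truth[f] = p >= threshold  # fact gets its truth value when processed
        # Hypothetical update: fraction of already processed facts on which
        # the source's vote agrees with the assigned truth value.
        for s in sources:
            seen = [f for f in truth if s in votes[f]]
            if seen:
                trust[s] = sum(votes[f][s] == truth[f] for f in seen) / len(seen)
    return truth, history
```

In this sketch a fact processed late is judged with trust values learned from earlier facts, so two facts with the same votes can be evaluated differently depending on when they are processed.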
13
How to Select Facts?
• Model each fact f as a random variable
• Objective: compute the probability s(f) that f is true
• Information Entropy approach:
• Consider the entropy H(f) of each fact f
• The entropy of a random variable measures its uncertainty
• Our solution: select facts such that the entropy of the unknown facts is maximized
• Existing corroboration techniques normalize their results to attain a probability of 1 (or 0) for each fact, i.e., entropy of 0
• Reducing uncertainty leads to (too) early consensus
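As a small illustration of the entropy of a single fact, the standard binary (Shannon) entropy of a fact that is true with probability s(f) can be computed as below. The base-2 logarithm and the function name `fact_entropy` are assumptions; the slides do not state which base is used.

```python
import math

def fact_entropy(p):
    """Shannon entropy, in bits, of a fact that is true with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a fact known to be true or false carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

A fact with probability 0.5 has the maximum entropy of 1 bit, while a probability normalized to 1 or 0 gives an entropy of 0.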
14
Heuristics for Selecting Facts
• Group facts based on the votes from sources
• At each step i:
• Calculate the entropy of each fact group using si(s)
• Calculate ΔH(FG) for each fact group FG
(Represents the change of entropy if FG is selected)
• Select both the positive and the negative fact groups with the highest ΔH(FG)
• Assign positive and negative values to the same number of facts
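The first step of the heuristic, grouping facts by the votes they receive, might look like the sketch below. The data layout and the name `group_by_votes` are assumptions, and the votes in the usage note are made up for illustration rather than taken from the restaurant data.

```python
from collections import defaultdict

def group_by_votes(votes):
    """votes: {fact: {source: bool}} -> {signature: [facts]}.

    Facts that receive exactly the same votes from exactly the same
    sources share a signature and land in the same fact group.
    """
    groups = defaultdict(list)
    for fact, v in votes.items():
        signature = tuple(sorted(v.items()))
        groups[signature].append(fact)
    return dict(groups)
```

For example, with hypothetical votes `{"r1": {"Yelp": True, "Opentable": True}, "r2": {"Opentable": True, "Yelp": True}, "r3": {"Yelp": False}}`, facts r1 and r2 fall into one group and r3 into another.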
15
Revisiting the running example
• Step 1: s(S)={0.9, 0.9, 0.9, 0.9, 0.9}
Positive: {r7}, {r2}, {r3}, {r5, r8}, {r11}, {r9}, {r4, r10}, {r6}, {r1}
Negative: {r12}
Selected: F1={r7, r12}
• Step 2: s(S)={1, 1, 1, 0, 0.9}
Positive: {r3}, {r11}, {r5, r8}, {r2}, {r9}, {r1}
Negative: {r4, r10}, {r6}
Selected: F2={r3, {r4, r10}}, then F2={r3, r4}
• Step 3: s(S)={1, 1, 1, 0, 0.5}
Positive: {r9}, {r5, r8}, {r1}, {r11}, {r2}
Negative: {r10}, {r6}
Selected: F3={r9, r10}
• Step 4: s(S)={1, 1, 1, 0, 0.5}
Positive: {r5, r8}, {r1}, {r11}, {r2}
Negative: {r6}
Selected: F4={r5, r6}
• Step 5: s(S)={1, 1, 1, 0, 0.5}
Positive: {r8}, {r3}, {r11}, {r2}
Negative: (none)
• True facts: r7, r3, r9, r5, then r3, r8, r2, r11
• False facts: r12, r4, r10, r6
• Precision 0.86, Recall 1, Accuracy 0.92
16
               Precision  Recall  Accuracy  Computed trust scores
TwoEstimate    .64        1       .67       (1, 1, 0.8, 0.9, 1)
BayesEstimate  .58        1       .58       (1, 0.8, 0.6, 1, 1)
IncEstHeu      .86        1       .92       (0.9, 0.9, 0.9, 0.9, 0.9), (1, 1, 1, 0, 0.9), (1, 1, 1, 0, 0.5)
Experimental Setting
• Algorithms
• We implemented two strategies (IncEstPS, IncEstHeu) in Java
• Frequency-based: Voting and Counting
• Existing Corroboration Techniques: TwoEstimate, BayesEstimate
• Machine Learning based: ML-SVM, ML-Logistic
• 36916 listings from 6 sources
• Metrics
• Precision, Recall, Accuracy
• Mean squared error (MSE) of the trust scores
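For reference, precision, recall, and accuracy can be computed from confusion counts as in the sketch below. The function name `metrics` and the counts in the usage note are assumptions: the counts are inferred so as to be consistent with the motivating example (12 listings, 7 true and 5 false), not taken from the slides.

```python
def metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy
```

A method that labels 11 of the 12 listings as true, including all 7 true ones (`metrics(7, 4, 0, 1)`), gets a precision of about 0.64, a recall of 1, and an accuracy of about 0.67, which is consistent with the figures reported for TwoEstimate on the motivating example.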
17
Corroboration Results
Precision Recall Accuracy F-1
Voting 0.65 1.00 0.66 0.79
Counting 0.94 0.65 0.76 0.77
BayesEstimate 0.63 1.00 0.67 0.77
TwoEstimate 0.65 1.00 0.66 0.79
ML-SVM 0.98 0.74 0.77 0.84
ML-Logistic 0.86 0.85 0.82 0.82
IncEstPS 0.66 1.00 0.68 0.79
IncEstHeu 0.86 0.86 0.83 0.86
18
MSE on the sources
               Yellowpages  Foursquare  Menupages  Opentable  Citysearch  Yelp  MSE
Accuracy       0.59         0.78        0.93       0.96       0.62        0.84  -
TwoEstimate    1.00         1.00        0.98       1.00       1.00        0.98  0.063
BayesEstimate  1.00         1.00        1.00       1.00       1.00        1.00  0.066
ML-Logistic    0.62         0.85        0.98       0.92       0.65        0.95  0.004
IncEstHeu      0.51         0.70        0.90       0.93       0.51        0.89  0.005
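The MSE column appears to be the mean, over the six sources, of the squared difference between a method's trust score and the source's measured accuracy. That reading is an assumption, but it is consistent with the reported values, as the sketch below shows using the numbers from the table (the helper name `mse` is illustrative).

```python
def mse(estimated, actual):
    """Mean squared error between estimated trust scores and source accuracy."""
    return sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual)

# Order: Yellowpages, Foursquare, Menupages, Opentable, Citysearch, Yelp
accuracy = [0.59, 0.78, 0.93, 0.96, 0.62, 0.84]
ml_logistic = [0.62, 0.85, 0.98, 0.92, 0.65, 0.95]
inc_est_heu = [0.51, 0.70, 0.90, 0.93, 0.51, 0.89]

print(round(mse(ml_logistic, accuracy), 3))
print(round(mse(inc_est_heu, accuracy), 3))
```

Since the table entries are already rounded to two decimals, small deviations from the reported MSE values are to be expected for the other rows.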
19
Multi-value Trust Score
[Figure: two panels, "Simple Fact Selection" and "Entropy-based Fact Selection", each plotting trust score against time point (0 to 100) for Yellowpages, Foursquare, Menupages, Opentable, Citysearch, and Yelp]
20
Conclusion
• Proposed techniques for corroborating facts with mostly affirmative statements
• Designed a novel algorithm that adopts a multi-value trust score for the sources
• Incrementally selects facts by leveraging the information entropy of unknown facts
• Uses different sets of sources’ trust scores to evaluate each set of facts
• Performed experiments using both real world and synthetic (see paper) data
21