corroborating facts from affirmative statements

Corroborating Information from Affirmative Statements

Minji Wu, Rutgers University

Amélie Marian, Rutgers University

Background

• Information is often untrustworthy

• Erroneous (e.g., news sites during breaking events)

• Misleading (e.g., malicious sources)

• Biased (e.g., political domains)

• Outdated (e.g., a knowledge base that is not updated frequently)

• This phenomenon is amplified by widespread information dependency (copy-paste)

• It is difficult for the user to discern the correctness of information and the trustworthiness of the sources


Conflicting Information


Data Corroboration

• Early Corroboration

• Frequency-based approach

• Recent work on Corroboration techniques

• Trustworthiness of sources: a measure s(s) that quantifies the precision of a source s

• Probability of information (facts): a measure s(f) that quantifies the probability that a fact f is true

• Starting with a default s(s), iteratively compute the probabilities of the facts and the trustworthiness of the sources

• Machine-learning approaches

• Some corroboration problems can be cast as an ML classification problem
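
The iterative scheme in the bullets above can be sketched as follows. This is a minimal illustration, not the TwoEstimate or BayesEstimate algorithm: the vote encoding, the averaging update rules, and the names `corroborate`, `votes`, and `default_trust` are assumptions made for the example.

```python
def corroborate(votes, default_trust=0.9, iterations=10):
    """votes maps fact -> {source: True/False}; returns (prob, trust)."""
    sources = {s for fv in votes.values() for s in fv}
    trust = {s: default_trust for s in sources}
    prob = {}
    for _ in range(iterations):
        # Fact probability: mean support from the sources that voted on it.
        # A True vote contributes trust, a False vote contributes 1 - trust.
        for fact, fv in votes.items():
            support = [trust[s] if v else 1.0 - trust[s] for s, v in fv.items()]
            prob[fact] = sum(support) / len(support)
        # Source trust: mean agreement between its votes and the fact probabilities.
        for s in sources:
            agree = [prob[f] if fv[s] else 1.0 - prob[f]
                     for f, fv in votes.items() if s in fv]
            trust[s] = sum(agree) / len(agree)
    return prob, trust


# Hypothetical votes: source C contradicts A and B on fact f1.
conflicting = {"f1": {"A": True, "B": True, "C": False}}
# Hypothetical votes containing affirmative statements only.
affirmative = {"f1": {"A": True, "B": True}, "f2": {"A": True}}

_, trust_conflict = corroborate(conflicting)
_, trust_affirm = corroborate(affirmative)
print(trust_conflict)  # C ends up with lower trust than A and B
print(trust_affirm)    # A and B stay at the default and cannot be told apart
```

Under these assumed update rules, a conflict separates the sources, while purely affirmative votes leave every source at its default trust value.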


What if there are no conflicts?

• Does the presence of information without contradictions mean it is correct?


Our Problem: Corroborating Information with only Affirmative Statements

• We focus on scenarios in which sources have little or no dissent

• Frequent real-world problem (rumors, hard-to-rebut claims)

• Difficult to identify incorrect information since all reported information is consistent

• Existing corroboration approaches do not work well

• They rely on conflicting information to differentiate the trustworthiness of the sources


Contributions

• Novel corroboration approach:

• Assigns multiple trust scores to each source

• Considers the trustworthiness of the source for a group of facts

• Corroboration algorithm incrementally evaluates facts

• Groups unknown facts based on the sources reporting them

• Makes decisions based on information entropy

• Extensive real-world and synthetic experiments that demonstrate the benefits of our method


Evaluation Setting

• Corroboration task: identify legitimate restaurant listings in NYC given the listing information from a set of sources

• Sources for restaurant address: Citysearch, Foursquare, Menupages, Opentable, Yellowpages, Yelp

• Golden set

• Selected restaurants in 3 zip codes: 601 listings

• Verified their legitimacy in person (Apr 2012)

• 340 true and 261 false


Motivating Example

Opentable Yelp Menupages Citysearch Yellowpages Correct value

M Bar T T true

Sam’s T T T T true

27 Sunshine T T T true

Crepe Creations T T false

El Portal T T false

Holy Basil F T false

Papatzul T T true

Wine Spot T T true

Vbar T T true

Wai Cafe T T false

Tomoe Sushi T T T true

Khushie 139 F F T false


State-of-the-art Corroboration Strategies

Approaches

• TwoEstimate [Galland WSDM’10]

• Iteratively estimates the trust score of the sources and the probability of the facts

• BayesEstimate [Zhao VLDB’12]

• Uses a Bayesian graphical model

• Considers two-sided errors (false positives and false negatives)

               Precision  Recall  Accuracy  Computed trust scores
TwoEstimate    .64        1       .67       (1, 1, 0.8, 0.9, 1)
BayesEstimate  .58        1       .58       (1, 0.8, 0.6, 1, 1)

Computed trust scores: used to evaluate each fact!

Key Observation

• Using the same trust score to judge the correctness of all information is too coarse

• Each source may exhibit different accuracy for different groups of facts

• The corroboration result could be greatly improved if we could derive finer-grained trust scores for each source


Multi-value trust scores for sources

Trust Scores

• Single-value trust scores (s(s))

• A single measure for each source

• Each fact is evaluated using the same value from each source

• Multi-value trust scores

• A group of values assigned to each source

s(s) = < s1(s), s2(s), …>

• Each fact (or group of facts) is evaluated using one of the trust values from each source
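
One way to picture a multi-value trust score is as a per-source list with one value per evaluation step. The sketch below is only an illustration: the class `SourceTrust`, the helper `fact_probability`, the averaging rule, and the sources `A` and `B` are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTrust:
    """Trust history of one source: s(s) = <s1(s), s2(s), ...>."""
    values: list = field(default_factory=lambda: [0.9])  # assumed default value

    def at(self, step):
        # Steps beyond the recorded history reuse the latest known value.
        return self.values[min(step, len(self.values) - 1)]


def fact_probability(votes, trust, step):
    """votes: {source: True/False}; trust: {source: SourceTrust}.

    Assumed scoring rule: mean support, where a True vote contributes the
    source's trust at this step and a False vote contributes 1 - trust.
    """
    support = [trust[s].at(step) if v else 1.0 - trust[s].at(step)
               for s, v in votes.items()]
    return sum(support) / len(support)


# Hypothetical sources: B loses all trust after the first step.
trust = {"A": SourceTrust([0.9, 1.0]), "B": SourceTrust([0.9, 0.0])}
votes = {"A": True, "B": True}

early = fact_probability(votes, trust, step=0)  # evaluated with s1(s)
late = fact_probability(votes, trust, step=1)   # evaluated with s2(s)
print(early, late)
```

The same pair of affirmative votes yields a different probability depending on which trust value is in effect when the fact is processed.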


Multi-Value Trust Scores

• Two major challenges

• How to calculate the trust values for each source

• How to decide which sources’ trust values to consider for each fact

• Solution: an incremental evaluation mechanism

• Select a subset of facts to process

• Update the trust values based on the already processed facts

• Facts are assigned a truth value when they are processed


How to Select Facts?

• Model each fact f as a random variable

• Objective: compute the probability s(f) that f is true

• Information Entropy approach:

• Consider the entropy H(f) of each fact f

• The entropy of a random variable measures its uncertainty

• Our solution: select facts such that the entropy of the unknown facts is maximized

• Existing corroboration techniques normalize their results to attain a probability of 1 (or 0) for each fact, i.e., entropy of 0

• Reducing uncertainty leads to (too) early consensus
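
As a small illustration of the entropy H(f) mentioned above, the sketch below treats a fact as a binary random variable that is true with probability p and uses the standard Shannon entropy in bits. The function name `binary_entropy` and the choice of base-2 logarithm are assumptions for the example.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a fact that is true with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fact normalized to 0 or 1 carries no uncertainty
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


for p in (0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```

A fact with probability 0.5 is maximally uncertain (1 bit), while a fact forced to probability 1 or 0 has entropy 0, which is the early-consensus situation the slide warns about.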


Heuristics for Selecting Facts

• Group facts based on the votes from sources

• At each step i:

• Calculate the entropy of each fact group using si(s)

• Calculate ΔH(FG) for each fact group FG

(Represents the change of entropy if FG is selected)

• Select both positive and negative fact groups with highest ΔH(FG)

• Assign positive and negative values to the same number of facts
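
The grouping step in the heuristic above can be sketched as follows. The slides do not define ΔH(FG), so this example only groups facts by their votes and computes a per-group entropy under an assumed averaging score; the names `group_by_votes` and `group_entropy`, the sources `A`, `B`, `C`, and the vote data are hypothetical.

```python
import math
from collections import defaultdict


def group_by_votes(votes):
    """Group facts that received identical votes from identical sources."""
    groups = defaultdict(list)
    for fact, fv in votes.items():
        signature = frozenset(fv.items())  # hashable (source, vote) pattern
        groups[signature].append(fact)
    return dict(groups)


def group_entropy(signature, trust):
    """Entropy of a fact group under an assumed mean-support score."""
    support = [trust[s] if v else 1.0 - trust[s] for s, v in signature]
    p = sum(support) / len(support)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


# Hypothetical votes: r5 and r8 share a pattern, r12 has its own.
votes = {
    "r5": {"A": True, "B": True},
    "r8": {"A": True, "B": True},
    "r12": {"A": False, "B": False, "C": True},
}
trust = {"A": 0.9, "B": 0.9, "C": 0.9}  # assumed current trust values si(s)

groups = group_by_votes(votes)
for signature, facts in groups.items():
    print(sorted(facts), round(group_entropy(signature, trust), 3))
```

Facts in the same group are indistinguishable to the algorithm at a given step, so they share one entropy value.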


Revisiting the running example

Step 1: s(S)={0.9, 0.9, 0.9, 0.9, 0.9}
Positive: {r7}, {r2}, {r3}, {r5, r8}, {r11}, {r9}, {r4, r10}, {r6}, {r1}
Negative: {r12}
Selected: F1={r7, r12}

Step 2: s(S)={1, 1, 1, 0, 0.9}
Positive: {r3}, {r11}, {r5, r8}, {r2}, {r9}, {r1}
Negative: {r4, r10}, {r6}
Selected: F2={r3, {r4, r10}}, reduced to F2={r3, r4}

Step 3: s(S)={1, 1, 1, 0, 0.5}
Positive: {r9}, {r5, r8}, {r1}, {r11}, {r2}
Negative: {r10}, {r6}
Selected: F3={r9, r10}

Step 4: s(S)={1, 1, 1, 0, 0.5}
Positive: {r5, r8}, {r1}, {r11}, {r2}
Negative: {r6}
Selected: F4={r5, r6}

Step 5: s(S)={1, 1, 1, 0, 0.5}
Positive: {r8}, {r3}, {r11}, {r2}
Negative: (none)

True facts: r7, r3, r9, r5, r3, r8, r2, r11
False facts: r12, r4, r10, r6

Precision  Recall  Accuracy
0.86       1       0.92


               Precision  Recall  Accuracy  Computed trust scores
TwoEstimate    .64        1       .67       (1, 1, 0.8, 0.9, 1)
BayesEstimate  .58        1       .58       (1, 0.8, 0.6, 1, 1)
IncEstHeu      .86        1       .92       (0.9, 0.9, 0.9, 0.9, 0.9)
                                            (1, 1, 1, 0, 0.9)
                                            (1, 1, 1, 0, 0.5)

Experimental Setting

• Algorithms

• We implemented two strategies (IncEstPS, IncEstHeu) using Java

• Frequency-based: Voting and Counting

• Existing Corroboration Techniques: TwoEstimate, BayesEstimate

• Machine Learning based: ML-SVM, ML-Logistic

• 36916 listings from 6 sources

• Metrics

• Precision, Recall, Accuracy

• Mean squared error (MSE) of the trust scores
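
For reference, precision, recall, and accuracy can be computed from predicted and true labels as sketched below. The function `evaluate` and the "accept every listing" baseline are hypothetical; only the golden-set sizes (340 true, 261 false) come from the slides.

```python
def evaluate(predicted, truth):
    """predicted, truth: {fact: True/False}; returns (precision, recall, accuracy)."""
    tp = sum(1 for f in truth if predicted[f] and truth[f])
    fp = sum(1 for f in truth if predicted[f] and not truth[f])
    fn = sum(1 for f in truth if not predicted[f] and truth[f])
    tn = sum(1 for f in truth if not predicted[f] and not truth[f])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(truth)
    return precision, recall, accuracy


# Golden set sizes from the slides: 340 true and 261 false listings.
truth = {f"listing{i}": i < 340 for i in range(601)}
# Hypothetical baseline that accepts every listing as legitimate.
accept_all = {f: True for f in truth}

precision, recall, accuracy = evaluate(accept_all, truth)
print(round(precision, 3), recall, round(accuracy, 3))
```

Accepting everything gives perfect recall but a precision equal to the share of true listings, which is why recall alone says little in a setting dominated by affirmative statements.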


Corroboration Results

               Precision  Recall  Accuracy  F-1
Voting         0.65       1.00    0.66      0.79
Counting       0.94       0.65    0.76      0.77
BayesEstimate  0.63       1.00    0.67      0.77
TwoEstimate    0.65       1.00    0.66      0.79
ML-SVM         0.98       0.74    0.77      0.84
ML-Logistic    0.86       0.85    0.82      0.82
IncEstPS       0.66       1.00    0.68      0.79
IncEstHeu      0.86       0.86    0.83      0.86


MSE on the sources

               Yellowpages  Foursquare  Menupages  Opentable  Citysearch  Yelp  MSE
Accuracy       0.59         0.78        0.93       0.96       0.62        0.84  -
TwoEstimate    1.00         1.00        0.98       1.00       1.00        0.98  0.063
BayesEstimate  1.00         1.00        1.00       1.00       1.00        1.00  0.066
ML-Logistic    0.62         0.85        0.98       0.92       0.65        0.95  0.004
IncEstHeu      0.51         0.70        0.90       0.93       0.51        0.89  0.005
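
The MSE column can be approximated from the table above, assuming it is the mean squared difference between each method's trust scores and the "Accuracy" row. That reading, and the function name `mse`, are assumptions; the numbers are copied from the table.

```python
accuracy = {"Yellowpages": 0.59, "Foursquare": 0.78, "Menupages": 0.93,
            "Opentable": 0.96, "Citysearch": 0.62, "Yelp": 0.84}
two_estimate = {"Yellowpages": 1.00, "Foursquare": 1.00, "Menupages": 0.98,
                "Opentable": 1.00, "Citysearch": 1.00, "Yelp": 0.98}
inc_est_heu = {"Yellowpages": 0.51, "Foursquare": 0.70, "Menupages": 0.90,
               "Opentable": 0.93, "Citysearch": 0.51, "Yelp": 0.89}


def mse(estimated, actual):
    """Mean squared error between estimated trust and observed accuracy."""
    return sum((estimated[s] - actual[s]) ** 2 for s in actual) / len(actual)


print(round(mse(inc_est_heu, accuracy), 3))
print(round(mse(two_estimate, accuracy), 3))
```

Because the table entries are rounded to two decimals, recomputed values can differ from the reported MSE in the last digit.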


Multi-value Trust Score

[Figure: two panels, "Simple Fact Selection" and "Entropy-based Fact Selection", each plotting trust score against time point (0 to 100) for Yellowpages, Foursquare, Menupages, Opentable, Citysearch, and Yelp]


Conclusion

• Proposed techniques for corroborating facts with mostly affirmative statements

• Designed a novel algorithm that adopts a multi-value trust score for the sources

• Incrementally selects facts by leveraging the information entropy of unknown facts

• Uses different sets of sources’ trust scores to evaluate each set of facts

• Performed experiments using both real world and synthetic (see paper) data

