Dispute Finder
DESCRIPTION
Slides about the Dispute Finder project. Slides are taken from the talks presented at WWW 2010 and WICOW 2010.
TRANSCRIPT
Is there another side to this?
Identifying Disputed Information on the Web
Rob Ennals, Intel Research Berkeley
Work done in collaboration with:
John Mark Agosta, Dan Byler, Beth Trushkowsky, Barbara Rosario, Tad Hirsch, Tye Rattenbury
About Me: Rob Ennals
• Senior Research Scientist at Intel Research
• Represent Intel at W3C for HTML and Web Apps.
• PhD from University of Cambridge (advised by Simon Peyton Jones, Microsoft Research)
• Diverse interests: PL, Concurrency, Systems, Web, Mashups, HCI, NLP, Politics, etc.
Not everything on the web is true, balanced, and objective
People increasingly rely on the web for information
source: Pew Research
Old Model: small number of known sources
TV, Radio, Newspaper, Book Publishers
New Model: huge number of unknown sources
Blogs, random websites, foreign newspapers
Not just an issue of source credibility.
If we ignore untrusted sources then we ignore a lot of the information on the web.
Dispute Finder:
Inform users when information that they encounter in their lives is disputed by a source that they might trust.
Browser extension
Firefox extension examines every page you browse (including email, intranet pages, etc).
Highlights claims that are disputed.
Click a dispute for more information
Show sources that support or oppose the claim.
Search Engine Front-End
Built with Yahoo BOSS.
Examines text on all linked pages.
Early Work: Mobile Voice Interface
Currently an early prototype, running on a laptop, based on Dragon NaturallySpeaking.
Listen to everything people say around you. Keep a list of disputed things you may have heard.
Vibrate when you hear something disputed.
Future: Disputed Claims on TV
Future: Mail, Books, News, etc ...
People seem to like it
Covered by: NPR, New Scientist, Fast Company, Christian Science Monitor, Wall Street Journal, NY Times Bay Area, San Jose Mercury, SF Chronicle, The Guardian, ACM TechNews, CBC (Canadian Public Radio), Cnet, Sacramento Bee, + many others
TG Daily: “This is hands down, the most amazing idea I’ve ever heard of when it comes to using the web”
Paper accepted for WWW 2010 + WICOW 2010.
Overall structure:
Related Work: Social Annotation
Diigo, Videolyzer, SpinSpotter
Need to mark every instance individually
Related Work: Fact Checker Sites
Need to suspect something may be disputed.
Related Work: Source Rating
But: Non-credible sources still have useful information.
But: Credible sources still get stuff wrong.
Automatic quality metrics.
Related Work: Wiki Source Tracking
WikiTrust, WikiScanner
Who wrote this, and are they credible/biased?
Great if your content is on Wikipedia.
Compare Observed Text to Known Disputes
Glenn Beck falsely claimed that the moon is made of cheese, despite clear evidence to the contrary.
False claim: "the moon is made of cheese"
Disputed by: Huffington Post, New York Times
Context: ...
Entailment: "We should mine the moon because it is made of cheese"
Contradiction detection via dispute detection
Contradiction detection vs Dispute Detection
Contradiction detection: Does statement X logically contradict statement Y?
Hard: need lots of real-world knowledge.
Dispute detection: Does author A believe that statement X is disputed or misleading?
Humans determine what is actually disputed.
Humans determine which disputes are interesting.
Only detects contradictions that humans find.
Detects statements that are misleading without being wrong.
Once we have determined that a dispute is real, we could use contradiction detection and sentiment analysis to see who is on each side.
A statement can be misleading without being wrong
GM's misleading claim that the Chevrolet Volt gets 230 miles per gallon
deceptively claimed that fast food could be nutritious
Logical truth isn't all that interesting.
We want to know if there is a different way of looking at the subject. A different frame.
Mining claims from the web
Use Patterns to Find Disputed Claims
the false claim that Himalayan glaciers could melt away by 2035
it is not true that anyone aged over 59 cannot receive heart repairs
the misconception that everyone in the south are stupid
the delusion that scientists in different countries do science differently
into believing that Van Morrison had a new baby
the myth that we can't afford good working conditions for everyone
misleadingly claimed that unemployment is lower than the '70s
We built a simple grammar for such prefixes.
Currently 1293 patterns, identified on ~35 million web pages, of which we have downloaded and processed 2 million.
Restricting to prefixes allows us to search for them using Yahoo BOSS.
Future: automatically infer a larger grammar of patterns
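A minimal sketch of prefix-based claim mining, assuming a handful of prefixes copied from the examples on this slide rather than the full 1293-pattern grammar. The names DISPUTE_PREFIXES and extract_disputed_claims are hypothetical, and cutting the claim at the end of the sentence is a simplification, not necessarily what Dispute Finder does.

```python
import re

# Hypothetical subset of dispute prefixes, copied from the slide's examples.
# The real grammar is much larger; this list is illustrative only.
DISPUTE_PREFIXES = [
    "the false claim that",
    "it is not true that",
    "the misconception that",
    "the delusion that",
    "into believing that",
    "the myth that",
    "misleadingly claimed that",
]

# One alternation over all prefixes, longest first, followed by the text
# up to the end of the sentence (the candidate claim).
_PREFIX_RE = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(DISPUTE_PREFIXES, key=len, reverse=True))
    + r")\s+([^.!?\n]+)",
    re.IGNORECASE,
)


def extract_disputed_claims(text):
    """Return (prefix, candidate claim) pairs found in text."""
    return [(m.group(1).lower(), m.group(2).strip()) for m in _PREFIX_RE.finditer(text)]
```

Candidates extracted this way still carry trailing text and fragments, so they would need filtering before use.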
Some Disputes I Wasn’t Aware of
Estimates from Yahoo BOSS. Not all URLs downloaded.
The Niger-Iraq Uranium connection has been discredited
Medieval Europeans thought the world was flat
Dinosaurs looked sleek and reptilian.
Dietary Cholesterol is a problem.
“Wear and Tear” causes arthritis
Specific foods cause ulcers
Most Disputed Nouns
1. God
2. Iraq
3. Government
4. Obama
5. War
6. Israel
7. President
8. Women
9. Money
10. Jesus
Search for all patterns on Yahoo BOSS
Yahoo BOSS is an API for Yahoo search.
BOSS API has a limit of 1000 hits per query, so salt with year and month.
+"falsely claimed that" +2010
+"falsely claimed that" -2010 +2009
+"falsely claimed that" -2010 -2009 +2008
+"falsely claimed that" -2010 -2009 -2008 +2007
We talked to Yahoo first...
Needed for 197 patterns.
Future: get direct access to complete results for a pattern
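The salting scheme above can be sketched as a query generator. This only builds the query strings; it makes no call to the BOSS API, and the function name salted_queries is hypothetical. Each query requires one year and excludes the years already covered, so the result sets do not overlap and each has a better chance of staying under the 1000-hit limit. Only the year-level salting shown in the examples is covered here.

```python
def salted_queries(phrase, years):
    """Build one query per year: require that year, exclude the years listed before it."""
    queries = []
    for i, year in enumerate(years):
        # Years already covered by earlier queries are excluded with a leading "-".
        excluded = [f"-{y}" for y in years[:i]]
        parts = [f'+"{phrase}"'] + excluded + [f"+{year}"]
        queries.append(" ".join(parts))
    return queries


if __name__ == "__main__":
    for query in salted_queries("falsely claimed that", [2010, 2009, 2008, 2007]):
        print(query)
```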
Claims need to be filtered
the false claim that won't go away
falsely claimed that he didn't do it
wrongly think that the bill will pass
wrongly think that Great Britain doesn't
the myth that Elvis is alive has a long history
falsely claim that full commentary below
ambiguous
fragment
suffix
extraction error
Labeled data from Mechanical Turk
$0.04 to label 10 claims, two of which are known.
If a turker gets known items wrong, reject their work.
Each claim labeled by two turkers.
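A minimal sketch of the quality-control rule described above, assuming each batch is a mapping from claim ID to label and the known items are held in a separate mapping. The names accept_batch and collect_labels and the label values are hypothetical.

```python
def accept_batch(answers, gold):
    """Accept a worker's batch only if every known (gold) item is labeled correctly."""
    return all(answers.get(claim) == label for claim, label in gold.items())


def collect_labels(batches, gold):
    """Gather labels per claim from accepted batches, leaving out the gold items."""
    labels = {}
    for answers in batches:
        if not accept_batch(answers, gold):
            continue  # reject the whole batch when a known item is wrong
        for claim, label in answers.items():
            if claim not in gold:
                labels.setdefault(claim, []).append(label)
    return labels
```

With two turkers per claim, each claim that survives ends up with up to two labels, which can then be compared for agreement.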
Problem: text may not be a statement
the false claim that won't go away
the belief that works best
the lie that people fell for
Current approach: Is the first word a verb?
finds 71% of bad claims
mistakenly drops 2% of good claims
Works for first two, but not last.
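A rough sketch of the first-word check, assuming a tiny hand-made verb list in place of a real part-of-speech tagger. The slides do not say how verbs were identified, so LEADING_VERBS and looks_like_fragment are hypothetical. It reproduces the behaviour noted above: the first two examples are caught and the last is not.

```python
# Tiny hand-made verb list standing in for a part-of-speech tagger (an assumption).
LEADING_VERBS = {
    "is", "are", "was", "were", "will", "won't", "can", "can't",
    "has", "does", "doesn't", "works", "goes",
}


def looks_like_fragment(claim_text):
    """Flag extracted text whose first word is a verb, so it is unlikely to be a statement."""
    words = claim_text.lower().split()
    return bool(words) and words[0] in LEADING_VERBS
```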
Problem: ambiguous claims
If two pages say X, do they mean the same thing?
Bad:
he didn't do it
the union was a party in the proceedings
the other parent is abusive
Maybe:
our troops have committed atrocities
Good:
property taxes are regressive
Obama is a communist
Turk: 61.9% agreement - often very subjective
Future: associate claim with page topic
Wikipedia links tell us what is unambiguous
Obama is a communist
property taxes are regressive
Is this word always linked to the same thing?
Precision: 73%, Recall: 73% (vs gold data + word features)
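One way to sketch the "always linked to the same thing" test, assuming the input is a list of (anchor text, link target) pairs harvested from Wikipedia. The 0.9 threshold, the function names, and the sample data are hypothetical, and the slides do not say how this signal was combined with word features.

```python
from collections import Counter, defaultdict


def link_consistency(links):
    """Map each anchor text to the share of its links that point at its most common target."""
    targets = defaultdict(Counter)
    for anchor, target in links:
        targets[anchor.lower()][target] += 1
    return {a: max(c.values()) / sum(c.values()) for a, c in targets.items()}


def is_unambiguous(word, consistency, threshold=0.9):
    """Treat a word as unambiguous if nearly all of its links share one target."""
    return consistency.get(word.lower(), 0.0) >= threshold


if __name__ == "__main__":
    sample = [("Obama", "Barack_Obama")] * 3 + [("wood", "Wood"), ("wood", "Ronnie_Wood")]
    scores = link_consistency(sample)
    print(is_unambiguous("Obama", scores), is_unambiguous("wood", scores))
```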
Users enter claims that they disagree with
Users add paraphrases for claims
Alternative ways to phrase the same claim.
Teach Dispute Finder to recognize claims
Users add evidence to support claims
A claim will not be shown to others unless the user finds a source that argues against it.
Users identify a disputed claim on a page
Define a new disputed claim, or add paraphrase for existing disputed claim.
User Study Results
Frustrated by low number of claims that were highlighted
- motivated text mining approach
Did not appreciate that a claim should apply to multiple pages
- particularly when using context menu approach
Confused about how specific a claim should be
E.g. “Global temperatures will rise by X degrees”
Users created claims with ambiguous meanings
E.g. saying “wood” to mean “Ronnie Wood”
Confused by double-negatives when adding evidence
E.g. opposes “global warming does not exist”
Future: use users to improve mined claims
Entailment
Entailment is resource constrained
Must compare many sentences against a huge number of claims in a fraction of a second.
Simple lexical entailment
All non-stopwords present, and in the correct order.
Very simple but:
• it can be done very efficiently
• if you have a big enough corpus then it works ok
I think that global warming is just a hoax
global warming is a hoax
Future: better entailment that still scales
Future: look at context, and other places same text appears
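The simple lexical entailment rule above (all non-stopwords present, and in the correct order) amounts to a subsequence test. A minimal sketch, assuming a small illustrative stopword list that is not taken from the slides; the names STOPWORDS, content_words, and entails are hypothetical. It checks one sentence against one claim, and matching against a very large claim set would presumably need an index on top of this.

```python
import re

# Small illustrative stopword list (an assumption; the slides do not give one).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "that"}


def content_words(text):
    """Lowercase, tokenize, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def entails(sentence, claim):
    """True if the claim's non-stopwords all appear in the sentence, in the same order."""
    claim_words = content_words(claim)
    remaining = iter(content_words(sentence))
    # "word in remaining" advances the iterator, which enforces word order.
    return bool(claim_words) and all(word in remaining for word in claim_words)


if __name__ == "__main__":
    print(entails("I think that global warming is just a hoax", "global warming is a hoax"))
```

Extra words in the sentence, such as "just" in the slide's example, do not block a match, while reordered or missing content words do.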
What is Disputed?
Anything disputed by anyone?
- we get overwhelmed with claims disputed by nutcases
Anything disputed by a “reliable source”?
- what is a “reliable source”? (Wikipedia rules?)
- do we end up enforcing “orthodox” beliefs and stifling debate?
Anything disputed by a source that I would trust?
- we reinforce existing echo-chamber problem
Anything disputed by my friends?
- do I agree with my friends?
- should I be encouraged to agree with them?
Future: learn what to show a user by analyzing their behavior
Interviews: Do people want this?
Hard to change established opinions
They think they already understand the issue.
They would have to publicly back down
So focus on issues they don’t yet have an opinion on?
Hard to make someone accept the other side
Social identity in “us” vs “them”
Not willing to listen to “other side”
So give sources from their “own” side?
Sometimes people may not care
Reading just for entertainment and conversation material
Don’t care much if they are wrong
Not interested in challenging opinions of others
Focus on issues that affect them personally
Dispute Finder probably isn’t for everyone
Questions?