TRANSCRIPT
Example report:
"Voters are being threatened at the Ofafa, Nairobi precinct. My number is 234-255-567. Pls help."
Suggested annotations:
Category: Voter intimidation
Location: Ofafa, Nairobi
Private info: 234-255-567
Language: English
Duplicate of: Report #33, #87
With machine learning and natural language processing,
we’re building open-source tools to improve the
human review process of crisis reports.
In crisis situations, there’s an information gap:
Crisis mapping technologies,
like our partner Ushahidi,
gather, review, and map
incident reports to narrow this gap.
• Currently, report processing is entirely manual.
Each message is individually reviewed: anonymized,
categorized, and geo-located.
• Work is slow and tedious. It’s difficult to scale up
for large volumes or fast-paced situations.
After discussions with the crisis crowdsourcing
community, we identified the most pressing needs –
automatic suggestions for:
• Categories: Is the report describing bribery or violence?
• Private info: Any names, addresses, phone numbers, etc. that need anonymizing before sharing?
• Location & links: Where to geo-locate? Any links to pull out?
• Language: What language(s) must the reviewer know? Is translation needed?
• Duplicates: Has this report already been submitted?
• SVMs trained on thousands of manually reviewed reports
from past crisis maps, both overall and per situation.
• Text classification features: a bag-of-words
unigram frequency feature set (frequency cut-off = 5).
• Unigram TF-IDF and bigram features didn’t help.
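The unigram feature extraction can be sketched in pure Python (a minimal illustration under assumptions; the poster's actual pipeline trains SVMs on these features, which this sketch omits):

```python
from collections import Counter

def build_vocab(docs, min_count=5):
    """Unigram vocabulary with a corpus-frequency cut-off (the poster used 5)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return sorted(tok for tok, c in counts.items() if c >= min_count)

def featurize(doc, vocab):
    """Bag-of-words count vector over the fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[tok] for tok in vocab]
```

The resulting count vectors would then feed a linear SVM trainer (e.g. scikit-learn's LinearSVC).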
• Use a pre-trained named entity recognizer (NER)
to highlight names and locations.
• Use regular expressions to highlight
e-mail addresses, phone/ID numbers, and URLs.
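The regular-expression pass might look like the following sketch (the patterns are illustrative assumptions, not the project's production expressions; real deployments need locale-aware phone/ID formats):

```python
import re

# Illustrative patterns only; production use needs locale-aware phone/ID formats.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{3,4}\b"),
    "url": re.compile(r"https?://\S+"),
}

def flag_private_info(text):
    """Map each kind of sensitive span to its matches, for reviewer anonymization."""
    found = {kind: pat.findall(text) for kind, pat in PATTERNS.items()}
    return {kind: matches for kind, matches in found.items() if matches}
```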
Tested 4 plug-in language detectors on 850 reports
for agreement with manual language identification:
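As a toy illustration of the idea behind such detectors (this naive stopword-overlap guesser is an assumption for exposition, not one of the four packages tested):

```python
# Naive stopword-overlap guesser: a toy stand-in for real detectors like langid.
# The stopword sets here are tiny illustrative samples, not real lexicons.
STOPWORDS = {
    "en": {"the", "is", "at", "my", "and", "help"},
    "fr": {"le", "la", "est", "et", "mon", "aide"},
    "sw": {"na", "ya", "ni", "kwa", "wa", "msaada"},
}

def guess_language(text):
    """Return the language whose stopword set overlaps the text the most."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))
```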
SimHash: Efficiently hash each report to a 64-bit
fingerprint. (Near-)duplicates have short Hamming distances.
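A minimal SimHash sketch in pure Python (md5 as a stand-in token hash; assumed details, not the project's implementation):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: each token's hash casts a +/-1 vote per bit position."""
    votes = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing fingerprint bits: small for (near-)duplicate reports."""
    return bin(a ^ b).count("1")
```

Near-duplicate candidates are pairs whose fingerprints differ in only a few bits.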
We’re currently running a simulated-crisis experiment with live
volunteers, testing 3 versions of the review process:
• Entirely manual, without any suggestions (baseline)
• With our system’s suggestions
• With the correct answers “suggested” (upper bound)
We’ll analyze quantitative changes in reviewer speed
and accuracy, and collect qualitative feedback.
• Train adaptive classifiers (work in progress)
• Suggest new categories via topic modeling
• Create more pre-trained classifiers (e.g. for disasters)
• Estimate report urgency
• Improve tool performance
Collaborators welcome!
github.com/dssg/ushine-learning
The Cold Start Problem
Classifier performance as a function of # training reports.
BLUE: Classifier trained only on the particular election’s reports, improving over time as more reports are included in training.
RED: Classifier trained only on reports from other elections.
Bottom line: A classifier pre-trained on other elections’ reports can help solve the cold-start problem when monitoring a new election.
Language detector accuracy (agreement with manual identification):
87.2% Google Translate (not free)
85.2% langid (Python package)
84.9% cld (Python package)
72.1% guess_language (Python package)
[Panel headings: Flagging private info, locations & URLs · Detecting language · Detecting (near-)duplicate reports · Classifying report categories (e.g. Voting Issues, Security Issues)]
[Poster graphic: crisis types (natural disasters, contested elections, humanitarian abuses) and the people involved (voters, election monitors, survivors/victims, relief agencies, NGOs & journalists)]
[Figure: F1 score (0 to 1.0) vs. # training reports for four election crisis maps: Kenya 2013 (2,200 reports), Lebanon 2009 (150 reports), Philippines 2010 (240 reports), Mexico 2009 (250 reports)]