

Post on 10-Aug-2020



TRANSCRIPT

Page 1: PowerPoint Presentation - web.mit.edu/kayla/www/files/DSSG_KDD_Poster_Ushahidi.pdf

Example report and review annotations:

"Voters are being threatened at the Ofafa, Nairobi precinct. My number is 234-255-567. Pls help."

Category: Voter intimidation
Location: Ofafa, Nairobi
Private info: 234-255-567
Language: English
Duplicate of: Reports #33, #87

With machine learning and natural language processing, we're building open-source tools to improve the human review process of crisis reports. In crisis situations, there's an information gap: crisis mapping technologies, like our partner Ushahidi, gather, review, and map incident reports to narrow this gap.

• Currently, report processing is entirely manual. Each message is individually reviewed: anonymized, categorized, and geo-located.
• The work is slow and tedious, and difficult to scale up for large volumes or fast-paced situations.

After discussions with the crisis crowdsourcing community, we identified the most pressing needs - automatic suggestions for:
• Is the report describing bribery or violence?
• Are there any names, addresses, phone numbers, etc. that need anonymizing before sharing?
• Where should the report be geo-located? Are there any links to pull out?
• What language(s) must the reviewer know? Is translation needed?
• Has this report already been submitted?

• SVMs trained on 1000s of manually reviewed reports from past crisis maps - overall and per situation.
• Text classification features included a bag-of-words unigram frequency feature set (frequency cut-off = 5).
• Unigram TF-IDF and bigram features didn't help.
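A setup like the one above can be sketched with scikit-learn. The corpus and labels below are invented for illustration, and `min_df` is lowered from the poster's cut-off of 5 so the toy corpus isn't emptied; this is a sketch of the approach, not the project's actual pipeline.

```python
# Linear SVM over bag-of-words unigram counts, in the spirit of the
# poster's category classifier. Toy data; labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "voters threatened at the precinct",
    "ballot boxes stolen from polling station",
    "armed men firing near the market",
    "explosion reported downtown",
]
labels = ["voting", "voting", "security", "security"]

clf = make_pipeline(
    CountVectorizer(min_df=1),  # the poster used a cut-off of 5 on 1000s of reports
    LinearSVC(),
)
clf.fit(reports, labels)
print(clf.predict(["ballot boxes stolen at the precinct"]))
```

On real deployments the same pipeline would be fit per situation as well as on the pooled corpus, as the bullets above describe.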

• Use a pre-trained named entity recognizer (NER) to highlight names and locations.
• Use regular expressions to highlight e-mail addresses, phone/ID numbers, and URLs.
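The regex side of this step can be sketched as follows (the NER step needs a pre-trained model and is omitted). The patterns here are simplified illustrations, not the project's actual ones.

```python
import re

# Toy patterns for highlighting private info before a report is shared.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{3,4}\b"),
    "url":   re.compile(r"https?://\S+|www\.\S+"),
}

def flag_private_info(text):
    """Return (kind, matched span) pairs for the reviewer to anonymize."""
    return [(kind, m.group())
            for kind, rx in PATTERNS.items()
            for m in rx.finditer(text)]

print(flag_private_info("My number is 234-255-567. Pls help."))
```

The suggestions only highlight candidate spans; the human reviewer still decides what to redact.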

Tested 4 plug-in language detectors on 850 reports for agreement with manual language identification; agreement rates are listed in the results below.

SimHash: efficiently hash each report to a 64-bit fingerprint; (near-)duplicate reports have a small Hamming distance between fingerprints.
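A minimal SimHash sketch, assuming whitespace tokenization and MD5 as the per-token hash (the project's tokenizer and hash function may differ):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: sum signed per-token hash bits, keep the sign."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("voters are being threatened at the ofafa nairobi precinct")
b = simhash("voters are being threatened at the ofafa nairobi precinct pls help")
c = simhash("heavy flooding reported along riverbank early this morning")
print(hamming(a, b), hamming(a, c))
```

Because near-duplicates share most tokens, their fingerprints agree on most bits, so candidate duplicates can be found by thresholding the Hamming distance.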

We're currently running a simulated-crisis experiment with live volunteers, testing 3 versions of the review process:
• Entirely manual, without any suggestions (baseline)
• With our system's suggestions
• With the correct answers "suggested" (upper bound)
We'll analyze quantitative changes in reviewer speed and accuracy, and collect qualitative feedback.

• Train adaptive classifiers (work in progress)
• Suggest new categories via topic modeling
• Create more pre-trained classifiers (e.g. for disasters)
• Estimate report urgency
• Improve tool performance

Collaborators welcome! github.com/dssg/ushine-learning

The Cold Start Problem

Classifier performance as a function of the number of training reports:
BLUE: classifier trained only on the current election's reports, improving over time as more reports are included in training.
RED: classifier trained on reports from other elections.
Bottom line: a classifier pre-trained on other elections' reports can help solve the cold-start problem when monitoring begins for a new election.

Language detector agreement with manual identification:
• 87.2% Google Translate (not free)
• 85.2% langid Python package
• 84.9% cld Python package
• 72.1% guess_language Python package
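The packages above are off-the-shelf detectors. As a self-contained illustration of the underlying idea, here is a toy stopword-profile detector; it is not the project's code and is far cruder than langid or cld, and the stopword sets are small hand-picked examples.

```python
# Toy language identification via stopword overlap.
STOPWORDS = {
    "english": {"the", "is", "at", "are", "my", "and", "help"},
    "swahili": {"ni", "na", "kwa", "wa", "ya", "katika"},
    "french":  {"le", "la", "est", "et", "les", "dans"},
}

def detect_language(text):
    """Guess the language whose stopword set overlaps the report most."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

print(detect_language("voters are being threatened at the precinct"))
```

Real detectors use character n-gram models over many languages, which is why they reach the agreement rates listed above on short, noisy reports.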

Tool tasks:
• Flagging private info, locations, & URLs
• Detecting language
• Detecting (near-)duplicate reports
• Classifying report categories

[Cold-start figure panels: "Category: Voting Issues" and "Category: Security Issues".]

[Figure: crisis contexts and stakeholders - natural disasters, contested elections, humanitarian abuses; voters, election monitors, survivors, victims, relief agencies, NGOs & journalists.]

[Figure: F1 score (0 to 1.0) vs. number of training reports, one panel per deployment: Kenya 2013 election (2,200 reports), Lebanon 2009 election (150 reports), Philippines 2010 election (240 reports), Mexico 2009 election (250 reports).]