

Post on 10-Aug-2020



TRANSCRIPT

Page 1: PowerPoint Presentation - web.mit.edu/kayla/www/files/DSSG_KDD_Poster_Ushahidi.pdf

Example report and review annotations:

"Voters are being threatened at the Ofafa, Nairobi precinct. My number is 234-255-567. Pls help."

Category: Voter intimidation
Location: Ofafa, Nairobi
Private info: 234-255-567
Language: English
Duplicate of: Reports #33, #87

With machine learning and natural language processing, we're building open-source tools to improve the human review process of crisis reports. In crisis situations, there's an information gap: crisis mapping technologies, like our partner Ushahidi, gather, review, and map incident reports to narrow this gap.

• Currently, report processing is entirely manual. Each message is individually reviewed: anonymized, categorized, and geo-located.
• The work is slow and tedious, and difficult to scale up for large volumes or fast-paced situations.

After discussions with the crisis crowdsourcing community, we identified the most pressing needs - automatic suggestions for:
• Is the report describing bribery or violence?
• Are there any names, addresses, phone numbers, etc. that need anonymizing before sharing?
• Where should the report be geo-located? Are there any links to pull out?
• What language(s) must the reviewer know? Is translation needed?
• Has this report already been submitted?

• SVMs trained on 1000s of manually reviewed reports from past crisis maps - overall and per situation.
• Text classification features included a bag-of-words unigram frequency feature set (frequency cut-off = 5).
• Unigram TF-IDF and bigram features didn't help.
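A setup like the one above can be sketched with scikit-learn. The corpus and labels below are invented for illustration, and `min_df` is lowered from the poster's cut-off of 5 so the toy corpus isn't emptied; this is a sketch of the approach, not the project's actual pipeline.

```python
# Linear SVM over bag-of-words unigram counts, in the spirit of the
# poster's category classifier. Toy data; labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "voters threatened at the precinct",
    "ballot boxes stolen from polling station",
    "armed men firing near the market",
    "explosion reported downtown",
]
labels = ["voting", "voting", "security", "security"]

clf = make_pipeline(
    CountVectorizer(min_df=1),  # the poster used a cut-off of 5 on 1000s of reports
    LinearSVC(),
)
clf.fit(reports, labels)
print(clf.predict(["ballot boxes stolen at the precinct"]))
```

On real deployments the same pipeline would be fit per situation as well as on the pooled corpus, as the bullets above describe.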

• Use a pre-trained named entity recognizer (NER) to highlight names and locations.
• Use regular expressions to highlight e-mail addresses, phone/ID numbers, and URLs.
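The regex side of this step can be sketched as follows (the NER step needs a pre-trained model and is omitted). The patterns here are simplified illustrations, not the project's actual ones.

```python
import re

# Toy patterns for highlighting private info before a report is shared.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{3,4}\b"),
    "url":   re.compile(r"https?://\S+|www\.\S+"),
}

def flag_private_info(text):
    """Return (kind, matched span) pairs for the reviewer to anonymize."""
    return [(kind, m.group())
            for kind, rx in PATTERNS.items()
            for m in rx.finditer(text)]

print(flag_private_info("My number is 234-255-567. Pls help."))
```

The suggestions only highlight candidate spans; the human reviewer still decides what to redact.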

Tested 4 plug-in language detectors on 850 reports for agreement with manual language identification; agreement rates are listed in the results below.

SimHash: efficiently hash each report to a 64-bit fingerprint; (near-)duplicate reports have a small Hamming distance between fingerprints.
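A minimal SimHash sketch, assuming whitespace tokenization and MD5 as the per-token hash (the project's tokenizer and hash function may differ):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: sum signed per-token hash bits, keep the sign."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("voters are being threatened at the ofafa nairobi precinct")
b = simhash("voters are being threatened at the ofafa nairobi precinct pls help")
c = simhash("heavy flooding reported along riverbank early this morning")
print(hamming(a, b), hamming(a, c))
```

Because near-duplicates share most tokens, their fingerprints agree on most bits, so candidate duplicates can be found by thresholding the Hamming distance.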

We're currently running a simulated-crisis experiment with live volunteers, testing 3 versions of the review process:
• Entirely manual, without any suggestions (baseline)
• With our system's suggestions
• With the correct answers "suggested" (upper bound)
We'll analyze quantitative changes in reviewer speed and accuracy, and collect qualitative feedback.

• Train adaptive classifiers (work in progress)
• Suggest new categories via topic modeling
• Create more pre-trained classifiers (e.g. for disasters)
• Estimate report urgency
• Improve tool performance

Collaborators welcome! github.com/dssg/ushine-learning

The Cold Start Problem

Classifier performance as a function of the number of training reports:
BLUE: classifier trained only on the current election's reports, improving over time as more reports are included in training.
RED: classifier trained on reports from other elections.
Bottom line: a classifier pre-trained on other elections' reports can help solve the cold-start problem when monitoring begins for a new election.

Language detector agreement with manual identification:
• 87.2% Google Translate (not free)
• 85.2% langid Python package
• 84.9% cld Python package
• 72.1% guess_language Python package
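The packages above are off-the-shelf detectors. As a self-contained illustration of the underlying idea, here is a toy stopword-profile detector; it is not the project's code and is far cruder than langid or cld, and the stopword sets are small hand-picked examples.

```python
# Toy language identification via stopword overlap.
STOPWORDS = {
    "english": {"the", "is", "at", "are", "my", "and", "help"},
    "swahili": {"ni", "na", "kwa", "wa", "ya", "katika"},
    "french":  {"le", "la", "est", "et", "les", "dans"},
}

def detect_language(text):
    """Guess the language whose stopword set overlaps the report most."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

print(detect_language("voters are being threatened at the precinct"))
```

Real detectors use character n-gram models over many languages, which is why they reach the agreement rates listed above on short, noisy reports.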

Tool tasks:
• Flagging private info, locations, & URLs
• Detecting language
• Detecting (near-)duplicate reports
• Classifying report categories

[Cold-start figure panels: "Category: Voting Issues" and "Category: Security Issues".]

[Figure: crisis contexts and stakeholders - natural disasters, contested elections, humanitarian abuses; voters, election monitors, survivors, victims, relief agencies, NGOs & journalists.]

[Figure: F1 score (0 to 1.0) vs. number of training reports, one panel per deployment: Kenya 2013 election (2,200 reports), Lebanon 2009 election (150 reports), Philippines 2010 election (240 reports), Mexico 2009 election (250 reports).]