Data Science for Social Good and Ushahidi

Project Update - July 11, 2013 The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2013 www.dssg.io | [email protected]

Upload: ushahidi

Post on 09-May-2015


DESCRIPTION

The Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship 2013: Preliminary Update, July 2013

About the DSSG rock stars: http://dssg.io/ and https://twitter.com/datascifellows/

Their project: http://dssg.io/2013/07/15/ushahidi-machine-learning-for-human-rights.html

More at ushahidi.com, wiki.ushahidi.com, and blog.ushahidi.com

TRANSCRIPT

Page 1: Data Science for Social Good and Ushahidi

Project Update - July 11, 2013

The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2013

www.dssg.io | [email protected]

Page 2: Data Science for Social Good and Ushahidi

Ushahidi Workflow

Page 3: Data Science for Social Good and Ushahidi

Ushahidi Workflow + DSSG

Page 4: Data Science for Social Good and Ushahidi

Data Sets

23,000 reports from 20 datasets

• 22% English

• 35% non-English

• 43% mixed languages

Each report includes text, category, location, sometimes more data

Page 5: Data Science for Social Good and Ushahidi

Data Sets

Additional datasets are unusable for various reasons (e.g. overly formulaic language)

What is the quality of the existing "gold standard" annotation?

Working on translations of non-English texts

Page 6: Data Science for Social Good and Ushahidi

Data Set Differences

Afghanistan election (peaceful)

Kenyan election (less peaceful)

Page 7: Data Science for Social Good and Ushahidi

Current Task Status [July 11]

1) Suggest categories

2) Extract named entities (especially locations)

3) Detect language

4) Detect (near-)duplicate reports

The end of the presentation has more extensive technical details.

Page 8: Data Science for Social Good and Ushahidi

Toy Demo

http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home

Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality.

Our plan is to deliver an open-source code library, which Ushahidi will incorporate into the existing user interface.

If the link doesn't work, just look at the screenshots in the next slides. :)

Page 9: Data Science for Social Good and Ushahidi

Demo: Example #1

Page 10: Data Science for Social Good and Ushahidi

Demo: Example #2

Page 11: Data Science for Social Good and Ushahidi

Secondary Project Ideas

1. Detect private info to strip

2. Urgency assessment

3. Filtering irrelevant reports (not strictly spam)

4. Automatically proposing new [sub-]categories

5. Cluster similar (non-identical) reports

6. Hierarchical topic modelling / visualization

Page 12: Data Science for Social Good and Ushahidi

Evaluation Plans

• Tap into Ushahidi and crisis mapping communities for feedback

• Simulate a past event with our system

• Success metrics:

  o Increased annotator speed

  o Increased annotator categorization accuracy

  o Decreased annotator frustration/tedium

  o Increased citizen web report speed

Page 13: Data Science for Social Good and Ushahidi

Feedback welcome!

Contact us at dssg-[email protected]

We would love your input!

See next 4 slides for technical details on our 4 tasks...

or skip if you're happy to stay unaware... :)

Page 14: Data Science for Social Good and Ushahidi

1) Suggest categories

Currently:

• Simple bag-of-words unigram features

• 1-vs.-all classification (scikit-learn)

• Little categories merged into fewer big categories

• Performance uninspiring :(

Future:

Bigrams... word frequency filter...

balancing positive/negative examples...

topic modeling... hierarchical categories...
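A minimal sketch of the "currently" setup on this slide: bag-of-words unigram counts fed into a 1-vs.-all classifier in scikit-learn. The toy reports, the category names, the choice of `LogisticRegression` as the per-category classifier, and the `suggest_categories` helper are all assumptions for illustration, not the project's actual code or data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy reports and categories (assumption); real datasets will differ.
reports = [
    "long queue at the polling station, voting delayed",
    "polling station opened late, ballots missing",
    "gunshots heard near the market, people running",
    "armed men attacked the market this morning",
    "ballots missing and gunshots heard at polling station",
    "peaceful voting, no queue at the station",
]
labels = [
    {"voting_issue"},
    {"voting_issue"},
    {"violence"},
    {"violence"},
    {"voting_issue", "violence"},
    set(),  # a report with no category
]

# One binary column per category, so a report can carry several categories.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

model = make_pipeline(
    # Unigram counts; ngram_range=(1, 2) would add bigrams and min_df
    # would act as a word frequency filter.
    CountVectorizer(ngram_range=(1, 1)),
    # Trains one binary classifier per category (1-vs.-all).
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(reports, y)


def suggest_categories(text, top_k=2):
    """Return the top_k (category, score) pairs for one report, best first."""
    scores = model.predict_proba([text])[0]
    ranked = sorted(zip(binarizer.classes_, scores), key=lambda p: p[1], reverse=True)
    return [(str(name), float(score)) for name, score in ranked[:top_k]]


suggestions = suggest_categories("gunshots near the market")
```

Returning ranked scores rather than hard labels fits the "suggest" framing: the annotator still makes the final call.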

Page 15: Data Science for Social Good and Ushahidi

2) Extract named entities

Currently:

• NLTK's Named Entity Recognizer

• Eval: pretty good

Future:

• Train location-recognizer on datasets

• Merge types for non-location NEs

• Remember previously-confirmed NEs
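One way the "remember previously-confirmed NEs" idea could look, as a rough sketch: keep a small lookup of entities an annotator has already confirmed and match them in new reports before (or alongside) the recognizer. The `ConfirmedEntityMemory` class, its methods, and the example place names are hypothetical; this is plain Python and does not use NLTK.

```python
import re


class ConfirmedEntityMemory:
    """Remembers entities an annotator has confirmed (hypothetical helper)."""

    def __init__(self):
        # lowercase surface form -> (form as confirmed, entity type)
        self._entities = {}

    def confirm(self, text, entity_type):
        """Store an entity after a human has confirmed it."""
        self._entities[text.lower()] = (text, entity_type)

    def find(self, report):
        """Return sorted (start offset, entity, type) for remembered entities in report."""
        hits = []
        for key, (original, entity_type) in self._entities.items():
            # Whole-word, case-insensitive match of the remembered surface form.
            pattern = r"\b" + re.escape(key) + r"\b"
            for match in re.finditer(pattern, report, flags=re.IGNORECASE):
                hits.append((match.start(), original, entity_type))
        return sorted(hits)


memory = ConfirmedEntityMemory()
memory.confirm("Kibera", "LOCATION")
memory.confirm("Nairobi", "LOCATION")

report = "Fire reported in kibera, west of Nairobi"
found = memory.find(report)
```

Exact string matching will miss spelling variants, so a real version would probably need fuzzy matching; that is left out here.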

Page 16: Data Science for Social Good and Ushahidi

3) Detect Language

Currently:

• Existing packages (Bing, Python, ...)

Future:

• Evaluate quality

• Allow event-specific language bias
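A possible reading of "event-specific language bias", sketched under assumptions: take whatever per-language scores a detector returns and reweight them by a prior over the languages expected for the event, then renormalize. The function name, the `default_prior` value, and the example scores are made up for illustration; no particular detection package is assumed.

```python
def apply_language_prior(detector_scores, event_prior, default_prior=0.01):
    """Reweight detector scores by an event-specific prior and renormalize.

    detector_scores: dict of language code -> score from any detector
    event_prior: dict of language code -> expected share for this event
    default_prior: small weight for languages the event prior does not list
    """
    weighted = {
        lang: score * event_prior.get(lang, default_prior)
        for lang, score in detector_scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing survives the prior; fall back to the raw detector scores.
        return dict(detector_scores)
    return {lang: value / total for lang, value in weighted.items()}


# Made-up scores: the detector slightly prefers Indonesian ("id") for a
# short message, but the event is an election where Swahili and English dominate.
raw_scores = {"id": 0.5, "sw": 0.4, "en": 0.1}
kenya_prior = {"sw": 0.5, "en": 0.45}

adjusted = apply_language_prior(raw_scores, kenya_prior)
best_language = max(adjusted, key=adjusted.get)
```

With these numbers the prior flips the top guess from "id" to "sw", which is the kind of correction short, noisy messages might need.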

Page 17: Data Science for Social Good and Ushahidi

4) Near-Duplicate Detection

Currently:

• SimHash: hash each message's text so that similar texts get hashes a small Hamming distance apart, which makes comparisons efficient

Future:

• Evaluate quality more rigorously

• Explore other methods
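For reference, a small self-contained sketch of the SimHash idea named above: each word votes on every bit of a fingerprint, and two messages count as near-duplicates when their fingerprints differ in only a few bits. The tokenization, the use of SHA-256 as the per-token hash, the 64-bit size, and the `max_distance=3` threshold are assumptions for illustration, not the project's settings.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint from the lowercase word tokens of text.

    bits must be a multiple of 8 and at most 256 (the SHA-256 digest size).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3, bits=64):
    """True when the fingerprints differ in at most max_distance bits."""
    return hamming_distance(simhash(text_a, bits), simhash(text_b, bits)) <= max_distance


first = "Polling station closed early."
second = "polling station CLOSED early"
same_words = is_near_duplicate(first, second)
```

Comparing fixed-size fingerprints is cheap, which is where the efficiency comes from; picking a good distance threshold is part of the "evaluate quality" item above.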