Crowdsourcing in NLP


Page 1: Crowdsourcing in NLP

What is Crowdsourcing?

• "Crowdsourcing is the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined and large network of people in the form of an open call." (2006 magazine article)

• "Crowdsourcing is a type of participative online activity in which an individual, an institution, a non-profit organization, or company proposes to a group of individuals of varying knowledge, heterogeneity, and number, via a flexible open call, the voluntary undertaking of a task. The undertaking of the task, of variable complexity and modularity, and in which the crowd should participate bringing their work, money, knowledge and/or experience, always entails mutual benefit…" (Estellés-Arolas, 2012 – integrated 40 definitions from the literature, 2008 onward)


Sample tasks (difficult for computers but simple for human beings):

• Identify disease mentions in PubMed abstracts
• Classify book reviews as positive or negative

Page 2: Crowdsourcing in NLP

Amazon Mechanical Turk (AMT), launched 2005 – a crowdsourcing platform

• Requester
  – Designs the task and prepares the dataset (i.e., the task items)
  – Submits it to AMT, specifying:
    • the number of judgments per task item
    • the reward paid to each worker for each task item
    • restrictions on workers (e.g., location, accuracy on previous tasks)
  – (a minimal posting sketch follows this list)

• Workers
  – Work on task items; can work on as many or as few task items as they please
  – Get paid small amounts of money (a few cents per task item)
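A hypothetical sketch of how a requester might post such a task programmatically, assuming the boto3 MTurk client. The question XML, reward, URL, and qualification IDs below are illustrative placeholders and should be checked against the current Mechanical Turk documentation.

```python
# Hypothetical sketch of posting a task ("HIT") to AMT with the boto3 MTurk client.
# The question XML, reward, and qualification IDs are illustrative, not authoritative.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# An ExternalQuestion pointing at a (hypothetical) page that renders one task item.
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/classify-review</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

response = mturk.create_hit(
    Title="Classify a book review as positive or negative",
    Description="Read a short book review and judge its sentiment.",
    Question=question_xml,
    Reward="0.02",                      # a few cents per task item
    MaxAssignments=10,                  # number of judgments per task item
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=7 * 24 * 3600,
    QualificationRequirements=[
        # Location restriction (Worker_Locale) and approval-rate restriction
        # (PercentAssignmentsApproved); the system QualificationTypeIds below are
        # the commonly documented ones -- verify them before relying on them.
        {"QualificationTypeId": "00000000000000000071",
         "Comparator": "EqualTo",
         "LocaleValues": [{"Country": "US"}]},
        {"QualificationTypeId": "000000000000000000L0",
         "Comparator": "GreaterThanOrEqualTo",
         "IntegerValues": [95]},
    ],
)
print("Created HIT:", response["HIT"]["HITId"])
```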



Page 3: Crowdsourcing in NLP

About the AMT workers – who are they? (Ross et al. 2010)

• Age – average 31, min 18, max 71, median 27
• Gender – female 55%, male 45%
• Occupation – 38% full-time, 31% part-time, 31% unemployed
• Education – 66% college or higher, 33% students
• Salary – median $20k–30k
• Country – USA 57%, India 32%, other 11%

How do they search for tasks on AMT? (Chilton et al. 2010)

• Title
• Reward amount
• Date posted – newest to oldest
• Time allotted
• Number of task items – most to fewest
• Expiration date


Page 4: Crowdsourcing in NLP

Paper 1: Cheap and Fast – But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks

Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008

Motivation

• Large-scale annotation – vital to NLP research and to developing new algorithms
• Challenges of expert annotation – financially expensive and time consuming
• An alternative – explore non-expert annotations (crowdsourcing)


Page 5: Crowdsourcing in NLP

Overview

• 5 natural language tasks (short and simple)
  1. Affect Recognition
  2. Word Similarity
  3. Recognizing Textual Entailment (RTE)
  4. Event Temporal Ordering
  5. Word Sense Disambiguation

• Method
  – Post the tasks on Amazon Mechanical Turk
  – Request 10 independent judgments (annotations) per task item

• Evaluate the performance of non-experts
  – Compare their annotations with expert annotations
  – Compare machine learning classifier performance when trained on expert vs. non-expert data


Page 6: Crowdsourcing in NLP

Task #1: Affect Recognition
Original experiment with experts: (Strapparava and Mihalcea, 2007), SemEval

• Given a textual headline, identify and rate emotions:
  anger [0,100], disgust [0,100], fear [0,100], joy [0,100], sadness [0,100], surprise [0,100], overall valence [-100,100]

• Example headline–annotation pair:
  "Outcry at N Korea 'nuclear test'"
  (Anger, 30), (Disgust, 30), (Fear, 30), (Joy, 0), (Sadness, 20), (Surprise, 40), (Valence, -50)

• Original experiment
  – 1000 headlines extracted from the New York Times, CNN, and Google News
  – 6 expert annotators per headline

• Non-expert (crowdsourcing) experiment
  – 100-headline sample
  – 10 annotations per headline, i.e., 70 affect labels per headline (7 labels × 10 annotations)
  – Paid $2.00 to Amazon for collecting the 7000 affect labels


Page 7: Crowdsourcing in NLP

Task #1: Affect Recognition Results (Inter-Annotator Agreement, ITA)

• ITA_task = average of ITA_annotator over all annotators
• ITA_annotator = Pearson's correlation between this annotator's labels and the average of the other annotators' labels
• (a computational sketch of this measure follows)
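A minimal sketch of this agreement measure, assuming numeric labels arranged as an annotators × items array; the variable names and example ratings are illustrative.

```python
# Sketch of the ITA measure described above: each annotator's labels are correlated
# (Pearson) against the average of the other annotators' labels, and the per-annotator
# correlations are averaged to give the task-level ITA.
import numpy as np

def ita(labels: np.ndarray) -> float:
    """labels: array of shape (n_annotators, n_items) of numeric ratings."""
    n_annotators = labels.shape[0]
    correlations = []
    for a in range(n_annotators):
        others = np.delete(labels, a, axis=0).mean(axis=0)  # average of the other annotators
        r = np.corrcoef(labels[a], others)[0, 1]            # Pearson's correlation
        correlations.append(r)
    return float(np.mean(correlations))

# Example: 3 annotators rating 5 headlines for one emotion on [0, 100]
ratings = np.array([[10, 40, 0, 80, 30],
                    [20, 50, 5, 70, 25],
                    [15, 35, 10, 90, 20]])
print(round(ita(ratings), 3))
```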

Compared with the original expert experiment:

• Individual experts are better labelers than individual non-experts
• Non-expert annotations are nevertheless good enough to increase the overall quality of the task

Page 8: Crowdsourcing in NLP

Task #1: Affect Recognition Results – How many non-expert labelers are equivalent to 1 expert labeler?

• Treat n (n = 1, 2, …, 10) non-expert annotators as one meta-labeler
  – Average the labels over all possible subsets of size n
• Find the minimum number of non-experts (k) needed to rival the performance of an expert (see the sketch below)

On average, it takes 4 non-experts to produce expert-like performance. For this task, it takes $1.00 to generate 875 expert-equivalent labels.
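A sketch of this meta-labeler analysis, using the same Pearson-correlation measure against a reference; the data and the expert-ITA threshold below are illustrative.

```python
# Sketch of the k-non-experts-vs-one-expert analysis: for each n, average the labels of
# every size-n subset of non-expert annotators into one "meta-labeler" and measure its
# correlation with a reference (expert/gold) labeling.
from itertools import combinations
import numpy as np

def meta_labeler_correlation(nonexpert_labels: np.ndarray, reference: np.ndarray, n: int) -> float:
    """Average Pearson correlation of all size-n meta-labelers against the reference."""
    annotators = range(nonexpert_labels.shape[0])
    scores = []
    for subset in combinations(annotators, n):
        meta = nonexpert_labels[list(subset)].mean(axis=0)   # averaged labels = meta-labeler
        scores.append(np.corrcoef(meta, reference)[0, 1])
    return float(np.mean(scores))

def min_nonexperts_to_rival_expert(nonexpert_labels, reference, expert_ita):
    """Smallest k whose meta-labeler correlation reaches the expert-level ITA."""
    for k in range(1, nonexpert_labels.shape[0] + 1):
        if meta_labeler_correlation(nonexpert_labels, reference, k) >= expert_ita:
            return k
    return None

# Illustrative data: 10 noisy non-expert annotators labeling 20 items.
rng = np.random.default_rng(0)
gold = rng.uniform(0, 100, size=20)
nonexperts = gold + rng.normal(0, 15, size=(10, 20))
print(min_nonexperts_to_rival_expert(nonexperts, gold, expert_ita=0.95))
```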


Page 9: Crowdsourcing in NLP

Task #2: Word Similarity (original experiment with experts – Resnik, 1999)

• Given a pair of words, e.g., {boy, lad}, provide a numeric judgment [0,10] of their similarity

• Original experiment
  – 30 pairs of words
  – 10 experts
  – ITA = 0.958

• Crowdsourcing experiment
  – 30 pairs of words
  – 10 annotations per pair
  – Paid a total of $0.20 for the 300 annotations
  – The task was completed within 11 minutes of posting
  – Maximum ITA = 0.952


Page 10: Crowdsourcing in NLP

Task #3: Recognizing Textual Entailment (original experiment: Dagan et al. 2006)

• Determine whether the second sentence can be inferred from the first:

  S1: Crude Oil Prices Slump
  S2: Oil prices drop
  – Answer: true

  S1: The government announced that it plans to raise oil prices
  S2: Oil prices drop
  – Answer: false

• Original experiment
  – 800 sentence pairs
  – ITA = 0.91

• Crowdsourcing experiment (see the aggregation sketch below)
  – 800 sentence pairs
  – 10 annotations per pair
  – Responses aggregated by simple majority voting with random tie breaking
  – Maximum ITA = 0.89
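A minimal sketch of this aggregation rule (simple majority with random tie breaking); the vote data are illustrative.

```python
# Sketch of the aggregation used for the binary RTE judgments: majority voting over the
# responses collected for one sentence pair, with ties broken at random.
import random
from collections import Counter

def majority_vote(responses, rng=random.Random(0)):
    """responses: list of labels (e.g., [True, False, True, ...]) for one item."""
    counts = Counter(responses)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)

print(majority_vote([True, True, False, True, False]))   # -> True
```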


Page 11: Crowdsourcing in NLP

Task #5: Word Sense Disambiguation (original experiment – SemEval, Pradhan et al. 2007)

• "Robert E. Lyons III … was appointed president and chief operating officer…"
  – What is the sense of "president"?
    • Executive officer of a firm, corporation, or university
    • Head of a country (other than the US)
    • Head of the US, President of the United States

• Original experiment
  – ITA not reported
  – Provides the gold standard

• Crowdsourcing experiment
  – 177 examples of the noun "president" covering the 3 senses
  – 10 annotations per example
  – Results aggregated using majority voting with random tie breaking; accuracy calculated w.r.t. the gold standard
  – (In the results chart) the red line represents the best system's performance on SemEval Task 17 (Cai et al., 2007)


Page 12: Crowdsourcing in NLP

Training Classifiers: Non-Experts vs. Experts (Task #1: Affect Recognition)

• Designed a simple supervised affect recognition system
  – Training the system (equation shown on the slide)
  – Testing a new headline (equation shown on the slide)

Notation: e = emotion; t = token (word in a headline); H = headline; H_t = the set of headlines containing token t
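Only the notation above survives from the slide's equations. One plausible reading, consistent with that notation, is a token-averaging scorer: score(e, t) is the average training label of the headlines in H_t, and a new headline H is scored by averaging the scores of its tokens. The sketch below is an assumption along those lines, not a reproduction of the paper's exact model.

```python
# Hedged sketch of a bag-of-words affect scorer consistent with the notation above
# (e = emotion, t = token, H = headline, H_t = headlines containing t).
from collections import defaultdict

def train(headlines, scores):
    """headlines: list of token lists; scores: numeric labels for one emotion e.
    Returns score(e, t) = average label of the training headlines containing t (H_t)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, y in zip(headlines, scores):
        for t in set(tokens):
            sums[t] += y
            counts[t] += 1
    return {t: sums[t] / counts[t] for t in sums}

def predict(token_scores, tokens, default=0.0):
    """Score a new headline H as the average score of its known tokens."""
    known = [token_scores[t] for t in tokens if t in token_scores]
    return sum(known) / len(known) if known else default

# Illustrative usage with two toy training headlines labeled for one emotion.
model = train([["outcry", "at", "nuclear", "test"], ["markets", "rally"]], [60.0, 5.0])
print(predict(model, ["nuclear", "rally"]))   # -> 32.5
```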

• Experiments (training set: 100 headlines; test set: 900)
  – In most cases, a system trained on just one set of non-expert annotations (1-NE) trained the system better than expert annotations did
  – Possible explanation:
    • Individual labelers (experts or non-experts) tend to be biased
    • For non-experts, even a single set of annotations (1-NE) is created by multiple non-expert labelers (because of the nature of crowdsourcing), which may have reduced bias

Page 13: Crowdsourcing in NLP

Summary

• Individual experts are better than individual non-experts
• Non-experts improve the quality of annotations
• For many tasks, only a small number (on average 4) of non-experts per item is needed to equal expert performance
• Systems trained on non-expert annotations performed better, probably because multiple non-experts offer more diversity


Page 14: Crowdsourcing in NLP

Paper 2: Validating Candidate Gene-Mutation Relations in MEDLINE Abstracts via Crowdsourcing

John Burger, Emily Doughty, Sam Bayer, et al. Data Integration in the Life Sciences, 8th International Conference (DILS), 2012

Goal
• Identify relationships between genes and mutations in the biomedical literature (mutation grounding)

Challenge
• Abstracts contain multiple mentions of genes and mutations, so extracting the correct associations is challenging

Method
• Identify all mutation and gene mentions using existing tools
• Identify gene-mutation relationships using crowdsourcing with non-experts


Page 15: Crowdsourcing in NLP

Dataset

• PubMed abstracts
  – MeSH terms: Mutation, Mutation AND Polymorphism/genetic
  – Diseases (identified using MetaMap): breast cancer, prostate cancer, autism spectrum disorder

• 810 abstracts
  – Expert-curated gold standard of 1608 gene-mutation relationships

• Working dataset: 250 selected abstracts with 578 gene-mutation pairs


Page 16: Crowdsourcing in NLP

Method

Extracting mentions
• Used the existing tool EMU (Extractor of Mutations) (Doughty et al. 2011)
• Gene identification – string match against the HUGO and NCBI gene databases
• Mutation (SNP) identification – regular expressions

Extracting relationships
• Normalized all mentions
• Generated the cross-product of all gene-mutation combinations within an article
• Total of 1398 candidate gene-mutation pairs – each submitted as a task item to Amazon Mechanical Turk

(An illustrative sketch of this candidate-generation step follows.)
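EMU's actual patterns and the full gene lexicons are not reproduced here; the sketch below only illustrates the shape of the pipeline (a toy mutation regex, lexicon-based gene matching, and the within-article cross-product). All names, patterns, and example text are illustrative.

```python
# Illustrative sketch of candidate generation: a toy regular expression for
# protein-level point mutations (e.g., "V600E") and a cross-product of the gene
# and mutation mentions found in one abstract.
import re
from itertools import product

# Toy pattern: reference amino acid, position, variant amino acid.
# The real EMU tool uses a much richer set of regular expressions.
MUTATION_RE = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")

def candidate_pairs(abstract_text, gene_lexicon):
    """Return all (gene, mutation) combinations mentioned in one abstract."""
    mutations = {m.group(0) for m in MUTATION_RE.finditer(abstract_text)}
    genes = {g for g in gene_lexicon if re.search(rf"\b{re.escape(g)}\b", abstract_text)}
    return list(product(sorted(genes), sorted(mutations)))

text = "The BRAF V600E mutation and a KRAS G12D variant were observed."
print(candidate_pairs(text, gene_lexicon={"BRAF", "KRAS", "TP53"}))
```

Each such candidate pair then becomes one task item for the crowd to confirm or reject.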

Page 17: Crowdsourcing in NLP

Sample task item posted for crowdsourcing (screenshot)


Page 18: Crowdsourcing in NLP

Method: Crowdsourcing

• 5 annotations per task item (candidate association)
• 1398 task items + 467 control items
  – Control items are hidden tests (Amazon uses them to calculate a worker's rating); see the sketch below
• Restricted the task to workers
  – from the United States only
  – with a 95% rating (from previous tasks)
• Payment
  – 8 cents per abstract to each worker
  – Total: $900
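A minimal sketch of how hidden control items can be used to estimate per-worker accuracy; the data layout is an assumption for illustration, not the paper's implementation.

```python
# Sketch of scoring workers on hidden control items, which can then drive
# qualification thresholds or weighted aggregation.
from collections import defaultdict

def worker_accuracy(responses, control_answers):
    """responses: list of (worker_id, item_id, label);
    control_answers: {item_id: gold label} for the hidden control items."""
    correct, total = defaultdict(int), defaultdict(int)
    for worker, item, label in responses:
        if item in control_answers:               # only score the control items
            total[worker] += 1
            correct[worker] += int(label == control_answers[item])
    return {w: correct[w] / total[w] for w in total}

resp = [("w1", "c1", True), ("w1", "c2", False), ("w2", "c1", False)]
print(worker_accuracy(resp, {"c1": True, "c2": False}))   # {'w1': 1.0, 'w2': 0.0}
```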


Page 19: Crowdsourcing in NLP

Results

• Mutation recall: 477/550 = 86.7%
• Gene recall: 257/276 = 93.1%
• Out of 250 abstracts, 185 were "perfect" documents (100% recall of genes and mutations)

• Crowdsourcing results
  – Time: 30 hours
  – 58 workers in total
    • 12 completed only 1 item
    • 22 completed 2-10 items
    • 13 completed 11-100 items
    • 11 completed 100+ items


Page 20: Crowdsourcing in NLP

Results

Worker accuracy (per-worker accuracy shown as a chart on the slide)

Consensus accuracy (see the sketch below)
• Simple majority voting – 78.4%
• Weighted vote approach (based on workers' ratings) – 78.3%
• Naïve Bayes classifier (to estimate the probability of correctness of each response) – 82.1%
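A minimal sketch of the vote-based consensus schemes compared above; simple majority is the weighted vote with all weights equal to 1. The Naïve Bayes variant, which models the probability that each individual response is correct, is not shown. Worker weights and votes are illustrative.

```python
# Sketch of weighted-vote consensus: each response counts in proportion to the
# worker's weight (e.g., rating or estimated accuracy); equal weights reduce this
# to simple majority voting.
from collections import defaultdict

def weighted_vote(responses, worker_weight):
    """responses: list of (worker_id, label) for one item;
    worker_weight: {worker_id: weight}, defaulting to 1.0 for unknown workers."""
    totals = defaultdict(float)
    for worker, label in responses:
        totals[label] += worker_weight.get(worker, 1.0)
    return max(totals, key=totals.get)

votes = [("w1", "yes"), ("w2", "no"), ("w3", "yes")]
print(weighted_vote(votes, {"w1": 0.9, "w2": 0.95, "w3": 0.6}))   # -> "yes" (1.5 vs 0.95)
```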


Page 21: Crowdsourcing in NLP

Conclusion

• It is easy to recruit workers and achieve a fast turnaround time

• The performance of workers varies

• The task required a significant level of biomedical literacy, yet one worker gave 95% accurate responses

• It is important to find new ways of identifying qualified workers and of aggregating results
