Crowdsourcing Research Opportunities: Lessons from Natural Language Processing

Post on 17-May-2015

DESCRIPTION

How is crowdsourcing used in science? How did it impact the field of NLP? A presentation of the key points described in: Marta Sabou, Kalina Bontcheva, Arno Scharl (2012) Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. In 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW), Special Track on Research 2.0.

TRANSCRIPT

Crowdsourcing Research Opportunities:

Lessons from Natural Language Processing

Marta Sabou, Kalina Bontcheva, Arno Scharl

Crowdsourcing

Undefined and generally large group

Crowdsourcing in Science

Crowdsourcing for NLP

Challenges

Crowdsourcing in science is not new

Citizen science, from early 19th century, 60,000 – 80,000 yearly volunteers

Sir Francis Galton, “VOX POPULI”

Genre 1: Mechanised Labour

Participants (workers) paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task)

Genre 2: Games with a purpose

From 2008

240k players

Crowdsourcing via Facebook

Genre 3: Altruistic Crowdsourcing

>250K players

>670K players

Crowdsourcing in Science - Typical Use

Input | Process/Algorithm | Output | Evaluation

• Harness human intuition to prune solution space

• Form-based data collection

• Labeling, classification

• Surveys

Crowdsourcing in NLP

Papers relying on crowdsourcing in major NLP venues

Crowdsourcing Genres in NLP

Benefit 1: Affordable, Large-Scale Resources

A variety of small to medium-sized resources can be obtained for as little as $100 using Amazon Mechanical Turk (AMT)

Crowdsourcing is also cost effective for large resources (Poesio, 2012)

Method                      $/label    1 M labels ($)
Traditional, high quality   1          1,000,000
Mechanical Turk             .38        380,000 (<40%)
Game                        .19        217,000 (20%)
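The percentages quoted next to the totals can be re-derived from the one-million-label costs listed above. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and do not come from the slides.

```python
# Totals for one million labels in USD, copied from the figures above.
COST_PER_MILLION_LABELS = {
    "Traditional high quality": 1_000_000,
    "Mechanical Turk": 380_000,
    "Game": 217_000,
}


def share_of_baseline(costs, baseline="Traditional high quality"):
    """Return each method's cost as a percentage of the baseline method's cost."""
    base = costs[baseline]
    return {method: 100 * cost / base for method, cost in costs.items()}


if __name__ == "__main__":
    for method, pct in share_of_baseline(COST_PER_MILLION_LABELS).items():
        print(f"{method}: {pct:.1f}% of the traditional cost")
```

Working from the listed totals rather than the per-label rates keeps the sketch tied to the numbers the slide actually reports.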

Benefit 2: Diversification of research

Challenge 1: Contributor Selection and Training

From: prior to resource creation

To: during the resource creation

Challenge 2: Aggregation and Quality Control

From: a few experts' annotations

To: multiple, noisy annotations from non-experts

Approach 1: Statistical techniques

• Simplest (and most popular): majority voting

• More complex: machine learning model trained on various features

Approach 2: Crowdsourcing the QC process itself

• HIT1 (Create): Translate the following sentence:

• HIT2 (Verify): Which of these 5 sentences is the best translation?
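Majority voting, named above as the simplest aggregation technique, could look roughly like the sketch below. It is an illustration under stated assumptions, not the method of any particular paper: the function names, the tie-breaking rule, and the sample labels are all hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, agreement) for one item's crowd labels.

    Ties go to the label that appears first in the input, an arbitrary
    choice made for this sketch.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


def aggregate(annotations):
    """Apply majority voting to a mapping {item_id: [label, ...]}."""
    return {item: majority_vote(labels) for item, labels in annotations.items()}


if __name__ == "__main__":
    # Hypothetical data: three workers label two sentences for sentiment.
    crowd = {
        "sentence-1": ["positive", "positive", "negative"],
        "sentence-2": ["negative", "negative", "negative"],
    }
    print(aggregate(crowd))
    # Hypothetical HIT2 (Verify) votes: five workers each pick the best candidate.
    print(majority_vote(["B", "B", "A", "B", "C"]))
```

The same function covers the Verify step of the Create/Verify pattern, since choosing the best translation from worker votes is also a plurality count. The agreement ratio gives a rough signal for which items may need extra annotations.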

Conclusions (What have we learned from NLP?)

Crowdsourcing is revolutionising NLP research

• Cheaper resource acquisition

• Diversification of research agenda

But requires more complex methodologies

• For contributor management

• For quality control and data aggregation

Other findings: most popular

• Genre: mechanised labour

• Task: acquiring input data

• Problem: solving subjective tasks

Crowdsourcing in Science

Crowdsourcing for NLP

Challenges

User Motivation

Motivating users

• Motivations for scientific projects might differ

• Task granularity might impact motivation

Promoting learning and science

• Advertise STEM research to young people

• Support learning and self-improvement through participation in crowdsourcing

Legal and Ethical Issues

Acknowledging the crowd's contribution

• S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.

Ensuring privacy and wellbeing

• Mechanised labour criticised for low wages (<$2/hour), lack of worker rights

• Prevent addiction, prolonged use & user exploitation

Licensing and consent

• Some clearly state the use of Creative Commons licenses

• General failure to provide informed consent information

Technical Issues

• Scaling up to large resources

• Preventing bias

• Increasing repeatability through reuse of crowdsourcing elements (e.g., HIT templates)

uComp - Embedded Human Computation for Knowledge Extraction and Evaluation

• 3-year project, starting November 2012

• Develops a scalable and generic HC framework for knowledge creation

• Provides reusable HC elements

Thank you!
