
Crowdsourcing

Markus Rokicki

L3S Research Center

09.05.2017


Human Computation

“Automaton Chess Player” or “Mechanical Turk”.1

Crowdsourcing facilitates human computation through the web.

1 Source: https://en.wikipedia.org/wiki/The_Turk


Example: Product Categorization

Figure: Product categorization task on Amazon Mechanical Turk2

2 https://www.mturk.com


Example: Stereotype vs Gender appropriate3

3 Tolga Bolukbasi et al. “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 4349–4357.


Example: Damage Assessment during Disasters4

4 Luke Barrington et al. “Crowdsourcing earthquake damage assessment using remote sensing imagery”. In: Annals of Geophysics 54.6 (2012).


Crowdsourcing definition

Crowdsourcing, according to a literature survey5:

- Participative online activity
- An individual, institution, non-profit, or company proposes the voluntary undertaking of a task
- in a flexible open call to a group of individuals
  - of varying knowledge, heterogeneity, and number
- Users receive satisfaction of a given type of need
  - economic, social recognition, self-esteem, learning, ...
- Always entails mutual benefit

5 Enrique Estelles-Arolas and Fernando Gonzalez-Ladron-de Guevara. “Towards an integrated crowdsourcing definition”. In: Journal of Information Science 38.2 (2012), pp. 189–200.


Crowdsourcing Platforms

Some crowdsourcing platforms:

- Amazon Mechanical Turk6
  - Paid “microtasks”
- CrowdFlower7
  - Paid “microtasks”
- Topcoder8
  - Programming competitions
- Threadless9
  - Propose and vote on t-shirt designs
- Kickstarter10
  - Fund products

6 https://www.mturk.com
7 https://www.crowdflower.com
8 https://www.topcoder.com
9 https://www.threadless.com
10 https://www.kickstarter.com


Paid Microtask Crowdsourcing

Paid crowdsourcing scheme11.

11 Figure source: http://dx.doi.org/10.1155/2014/135641


Task design

Depending on the application:

- (How) should you break up the task?
- More or fewer answer options? Free-form answers?
- How should the problem be presented?

E.g.: highlighting keywords in search results12:

12 Omar Alonso and Ricardo Baeza-Yates. “Design and implementation of relevance assessments using crowdsourcing”. In: European Conference on Information Retrieval. Springer. 2011, pp. 153–164.


Task design issues: Cognitive Biases15

Anchoring effects

- “Humans start with a first approximation (anchor) and then make adjustments to that number based on additional information.”13
- Observed in crowdsourcing experiments14
  - Group A Q1: More or less than 65 African countries in the UN?
  - Group B Q1: More or less than 12 African countries in the UN?
  - Q2: How many countries are in Africa?
    - Group A mean: 42.6
    - Group B mean: 18.5

Also: order effects and many more.

13 Daniel Kahneman and Amos Tversky. “Subjective probability: A judgment of representativeness”. In: Cognitive Psychology 3.3 (1972), pp. 430–454.
14 Gabriele Paolacci, Jesse Chandler, and Panagiotis G Ipeirotis. “Running experiments on Amazon Mechanical Turk”. In: Judgment and Decision Making 5.5 (2010).
15 Adapted from https://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation


Quality: Worker Access and Qualification

Crowdsourcing platforms offer means to restrict access to tasks based on:

- Trust levels based on overall past accuracy
- Geography
- Language skills
  - Certified based on language tests
  - Classified by the platform based on geolocation, browser data, and user history16
- Qualification for specific kinds of tasks, gained through qualification/training tasks

16 https://www.crowdflower.com/crowdflower-now-offering-twelve-language-skill-groups


Qualification Tasks

Qualification tests also provide feedback about the judgments expected of the worker and thus influence quality17.

Figure on the right: accuracy depending on class imbalance in the training questions. Classes: ‘Matching’, ‘Not Matching’.

17 John Le et al. “Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution”. In: SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation. 2010, pp. 21–26.


Measuring Worker Accuracy: Gold Standard

Would like to know how reliable answers are for the task:

- Requesters may want to reject the work of unreliable workers
- In particular for ‘spammers’ who do not try to solve the task

Gold standard / honey-pot tasks:

- Add tasks for which the ground truth is already known at random positions
- Compare worker input with the ground truth (see the sketch below)
- Feedback on correctness is given to the workers
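To make the gold-standard check concrete, here is a minimal sketch of how a requester might score workers against gold tasks. The data layout (answers keyed by (worker_id, task_id), a dict of known gold labels) and the 0.6 spam threshold are illustrative assumptions, not any platform's API.

```python
def score_workers(answers, gold, spam_threshold=0.6):
    """Estimate per-worker accuracy on gold (honey-pot) tasks.

    answers: dict mapping (worker_id, task_id) -> submitted label
    gold:    dict mapping task_id -> known ground-truth label
    Returns (accuracy per worker, set of workers below the spam threshold).
    """
    hits, totals = {}, {}
    for (worker, task), label in answers.items():
        if task in gold:  # only the hidden gold tasks are scored
            totals[worker] = totals.get(worker, 0) + 1
            hits[worker] = hits.get(worker, 0) + int(label == gold[task])
    accuracy = {w: hits[w] / totals[w] for w in totals}
    spammers = {w for w, acc in accuracy.items() if acc < spam_threshold}
    return accuracy, spammers
```

A requester could then reject or re-assign work from the flagged workers, or feed the estimated accuracies into the aggregation schemes discussed later.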


Using Gold Standard18

Challenges:

- Need enough gold standard tasks, or limit the number of tasks per worker
- Workers who recognize gold standard tasks will tend to ‘spam’
- Data composition:
  - Class balance
  - Educational examples addressing likely errors
- Need to be indistinguishable from regular tasks, but also unambiguous

Standard workflow:

- Iterate ground truth creation
- Start with a small, hand-crafted ground truth
- Use annotations to create additional gold standard data

18 David Oleson et al. “Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing”. In: Human Computation 11.11 (2011).


Redundancy: Wisdom of Crowds

So far: ensuring individual annotation quality. However: even best-effort human annotations are not perfect most of the time.

Source: https://www.domo.com/learn/the-wisdom-of-crowds


Redundant Annotations and Aggregation of Results

Redundancy:

- Each task is annotated by multiple workers
- Quality can be estimated based on inter-annotator agreement (e.g. Fleiss’ kappa)
- If a certain level of agreement is not reached: obtain more annotations
- Aggregate answers by majority vote in the categorical case (see the sketch below)
  - Improves accuracy if workers are better than random
- Introduces additional cost
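A minimal sketch of majority-vote aggregation for categorical answers. The input format is an assumption for illustration; ties are broken arbitrarily by the first label encountered.

```python
from collections import Counter, defaultdict

def aggregate_majority(annotations):
    """annotations: iterable of (task_id, label) pairs, one per worker judgment.
    Returns one aggregated label per task by majority vote."""
    by_task = defaultdict(list)
    for task, label in annotations:
        by_task[task].append(label)
    # most_common(1) picks the most frequent label; on ties, the label
    # seen first for that task wins.
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in by_task.items()}

# Example: three judgments for task "t1" -> majority label "cat".
print(aggregate_majority([("t1", "cat"), ("t1", "dog"), ("t1", "cat")]))
```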


Majority Vote Accuracy20

Majority vote accuracy depending on redundancy and individual accuracy, assuming equal accuracy (p) for all workers19 (see the sketch below).

19 Figure source: https://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation

20 Ludmila I Kuncheva et al. “Limits on the majority vote accuracy in classifier fusion”. In: Pattern Analysis & Applications 6.1 (2003), pp. 22–31.
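Since the referenced figure is not reproduced in this transcript, here is a small calculation sketch of the underlying binomial model. Binary labels, odd redundancy n, and independent workers with equal accuracy p are assumptions of this simplified model.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that the majority of n independent workers, each correct
    with probability p, returns the correct binary label (odd n avoids ties)."""
    assert n % 2 == 1, "use odd redundancy to avoid ties"
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# E.g. 5 workers at 70% individual accuracy yield ~84% aggregate accuracy,
# while 5 workers at 50% (random guessing) stay at 50%.
print(round(majority_vote_accuracy(5, 0.7), 3))  # 0.837
print(round(majority_vote_accuracy(5, 0.5), 3))  # 0.5
```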


Estimating Worker Reliability21

In reality, worker accuracy varies. Solution: estimate worker reliability and take it into account when computing the labels.

Sketch of the approach (a minimal code sketch follows below):

- Treat each worker as a classifier, characterized by
  - the probability of correctly classifying each class
- Estimate iteratively using expectation maximization:
  - E-step: estimate the hidden true labels of the data given the current estimated worker reliability
  - M-step: estimate worker reliability given the current estimated labels

21 Alexander Philip Dawid and Allan M Skene. “Maximum likelihood estimation of observer error-rates using the EM algorithm”. In: Applied Statistics (1979), pp. 20–28.
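A minimal sketch of the EM iteration described above. This is a simplified variant, not Dawid and Skene’s full model: each worker gets a single accuracy parameter instead of a per-class confusion matrix, the class prior is uniform, and the integer-encoded input format is an assumption for illustration.

```python
import numpy as np

def em_aggregate(answers, n_classes, n_iter=50):
    """answers: list of (task_id, worker_id, label) with 0-based integer ids.
    Returns (posterior over true labels per task, estimated worker accuracy)."""
    assert n_classes >= 2
    tasks = sorted({t for t, _, _ in answers})
    workers = sorted({w for _, w, _ in answers})
    ti = {t: i for i, t in enumerate(tasks)}
    wi = {w: i for i, w in enumerate(workers)}

    # Initialize label posteriors from raw vote fractions (soft majority vote).
    post = np.full((len(tasks), n_classes), 1e-9)
    for t, w, l in answers:
        post[ti[t], l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: worker accuracy = expected fraction of correct answers.
        correct = np.zeros(len(workers))
        total = np.zeros(len(workers))
        for t, w, l in answers:
            correct[wi[w]] += post[ti[t], l]
            total[wi[w]] += 1.0
        acc = np.clip(correct / total, 1e-3, 1 - 1e-3)

        # E-step: recompute label posteriors given worker accuracies
        # (errors are assumed to be spread uniformly over the other classes).
        log_post = np.zeros((len(tasks), n_classes))  # uniform class prior
        for t, w, l in answers:
            a = acc[wi[w]]
            for c in range(n_classes):
                log_post[ti[t], c] += np.log(a if c == l else (1 - a) / (n_classes - 1))
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return post, acc
```

Hard labels are obtained by taking the argmax of each task’s posterior; if all estimated worker accuracies happened to be equal, this would reduce to plain majority voting.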


Incentives: Influence of Payment 1²²

Setting: Reorder a list of images taken from traffic cameras chronologically.

611 participants sorted 36,000 image sets of varying size, for varying payments.

22 Winter Mason and Duncan J Watts. “Financial incentives and the performance of crowds”. In: ACM SIGKDD Explorations Newsletter 11.2 (2010), pp. 100–108.


Findings

Figure: Accuracy (left) and number of completed tasks (right) in relation to payment


Findings

Figure: Post-hoc survey of perceived value


Influence of Payment 2²³

Setting: Find planets orbiting distant stars

23 Andrew Mao et al. “Volunteering versus work for pay: Incentives and tradeoffs in crowdsourcing”. In: First AAAI Conference on Human Computation and Crowdsourcing. 2013.


Findings

Experiments with 356 workers and 17,000 annotated light curves. Simulated transits were added to real light curves (with noisy brightness values).

Figure: Accuracy of results depending on task difficulty.


Additional Incentives: Competitions and Teamwork24

Figure: reward as a function of team rank (axes: Reward vs. Rank, ranks 1–10)

Rewards:

- Non-linear distribution among teams
- Individual share proportional to contribution

Communication:

- Team chats

24 Markus Rokicki, Sergej Zerr, and Stefan Siersdorfer. “Groupsourcing: Team competition designs for crowdsourcing”. In: Proceedings of the 24th International Conference on World Wide Web. ACM. 2015, pp. 906–915.


No Payment: Games With A Purpose25

Figure: The ESP Game

25 Luis Von Ahn and Laura Dabbish. “Labeling images with a computer game”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM. 2004, pp. 319–326.


Mechanical Turk Workers26

Surveys of 500–1000 workers on Mechanical Turk.

26 Joel Ross et al. “Who are the crowdworkers? Shifting demographics in Mechanical Turk”. In: CHI ’10 Extended Abstracts on Human Factors in Computing Systems. ACM. 2010, pp. 2863–2872.


Selected papers

- Dynamics and task types on MTurk:
  Djellel Eddine Difallah et al. “The dynamics of micro-task crowdsourcing: The case of Amazon MTurk”. In: Proceedings of the 24th International Conference on World Wide Web. ACM. 2015, pp. 238–247.
- Postprocessing results for quality:
  Vikas C Raykar et al. “Learning from crowds”. In: Journal of Machine Learning Research 11.Apr (2010), pp. 1297–1322.
- Influence of compensation and payment on quality:
  Gabriella Kazai. “In search of quality in crowdsourcing for search engine evaluation”. In: European Conference on Information Retrieval. Springer. 2011, pp. 165–176.
- Collaborative workflows:
  Michael S. Bernstein et al. “Soylent: A word processor with a crowd inside”. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 2010, pp. 313–322.
