Generative Models for Crowdsourced Data
Outline
• What is Crowdsourcing?
• Modeling the labeling process
• Example with real data
• Extensions
• Future Directions
What is Crowdsourcing?
• Human-based computation.
• Outsourcing certain steps of a computation to humans.
• “Artificial artificial intelligence.”
• Data science:
– Making an immediate decision.
– Creating a labeled data set for learning.
Immediate Decision Workflow
Labeled Data Set Workflow
An Example HIT
Funny enough …
• Not everybody agrees on the gender of a Twitter profile.
• Difficult Instances
• Worker Ability / Motivation
• Worker Bias
• Adversarial Behaviour
Difficult Instance
Worker Ability
Worker Motivation
Worker Bias
Disagreements
• When some workers say “male” and some workers say “female”, what to do?
Majority Rules Heuristic
• Assign label l to item x if a majority of workers agree.
• Otherwise item x remains unlabeled.
• Ignores prior worker data.
• Introduces bias into the labeled data.
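The heuristic above can be sketched in a few lines (an illustrative helper, not code from the talk):

```python
from collections import Counter

def majority_label(worker_labels):
    """Return the strict-majority label, or None (the item stays unlabeled)."""
    if not worker_labels:
        return None
    label, count = Counter(worker_labels).most_common(1)[0]
    return label if count * 2 > len(worker_labels) else None
```

For example, `majority_label(["male", "male", "female"])` yields `"male"`, while a 1-1 tie yields `None` and the item remains unlabeled.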
Train on all labels
• For the labeled data set workflow.
• Add all item-label pairs to the data set.
• Equivalent to a cost vector of:
– P(l | {l_w}) = (1/n_w) Σ_w 1{l = l_w}
• Ignores prior worker data.
• Models the crowd, not the “ground truth.”
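The empirical distribution P(l | {l_w}) = (1/n_w) Σ_w 1{l = l_w} is just the fraction of workers voting for each label; a minimal sketch (illustrative, not from the talk):

```python
from collections import Counter

def crowd_label_distribution(worker_labels, label_set):
    """Empirical label distribution P(l | {l_w}) = (1/n_w) * sum_w 1{l = l_w}."""
    counts = Counter(worker_labels)
    n_w = len(worker_labels)
    return {l: counts[l] / n_w for l in label_set}
```

For example, `crowd_label_distribution(["male", "male", "female"], ["male", "female"])` yields `{"male": 2/3, "female": 1/3}`: the crowd's distribution over labels, not a hard label.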
What is ground truth?
• Different theoretical approaches:
– PAC learning with noisy labels.
– Fully-adversarial active learning.
• Bayesians have been very active:
– “Easy” to posit a functional form and quickly develop inference algorithms.
– Issue of model correctness is ultimately empirical.
Bayesian Literature
• (2009) Whitehill et al. GLAD framework.
– (1979) Dawid and Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.
• (2010) Welinder et al. The Multidimensional Wisdom of Crowds.
• (2010) Raykar et al. Learning from Crowds.
Bayesian Approach
• Define ground truth via a generative model which describes how “ground truth” is related to the observed output of crowdsource workers.
• Fit to observed data.
• Extract posterior over ground truth.
• Make decision or train classifier.
Generative Model
Example: Binary Classification
• Each worker has a confusion matrix:

α = ( -1    α_01 )
    ( α_10  -1   )

• Each item has a scalar difficulty β > 0.
• P(l_w = j | z = i) = exp(-β α_ij) / Σ_k exp(-β α_ik)
• Priors: α_ij ~ N(μ_ij, 1); μ_ij ~ N(0, 1)
• log β ~ N(ρ, 1); ρ ~ N(0, 1)
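The label likelihood above is a softmax over a row of the worker's confusion matrix, scaled by the item's β; a small sketch of that computation (illustrative, not the talk's code):

```python
import math

def label_probs(alpha_row, beta):
    """P(l_w = j | z = i) = exp(-beta * alpha_ij) / sum_k exp(-beta * alpha_ik).

    alpha_row: row i of a worker's confusion matrix (diagonal fixed at -1);
    beta: the item's scalar difficulty parameter (> 0).
    """
    scores = [math.exp(-beta * a) for a in alpha_row]
    total = sum(scores)
    return [s / total for s in scores]

# Row for true class i = 0 of alpha = ((-1, 0.5), (0.5, -1)):
probs = label_probs([-1.0, 0.5], beta=2.0)  # probs[0] = P(worker reports true class)
```

Because the diagonal is fixed at -1, a larger β puts more probability on the correct label, while the off-diagonal α_ij control this worker's particular confusions.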
Other Problems
• Multiclass classification:
– Same as binary with a larger confusion matrix.
• Ordinal classification (“Hot or not”):
– Confusion matrix has a special form.
– O(L) parameters instead of O(L²).
• Multilabel classification:
– Reduce to multiclass on the power set.
– Assume a low-rank confusion matrix.
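One common way to get O(L) parameters for the ordinal case is to penalize confusion by ordinal distance, with a single scalar per class; the distance-based form here is an illustrative assumption, since the slides only say the confusion matrix "has a special form":

```python
import math

def ordinal_label_probs(true_label, alpha, num_labels):
    """Ordinal confusion with O(L) parameters: one scalar alpha[i] per class,
    P(l_w = j | z = i) proportional to exp(-alpha[i] * |i - j|).

    Labels adjacent to the true one are more likely to be reported than
    distant ones, which matches "Hot or not"-style rating tasks.
    """
    scores = [math.exp(-alpha[true_label] * abs(true_label - j))
              for j in range(num_labels)]
    total = sum(scores)
    return [s / total for s in scores]
```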
EM
• Initially all workers are assumed moderately accurate and without bias.
– Implies the initial estimate of the ground-truth distribution favors consensus.
– Disagreeing with the majority is treated as a likely error.
• Workers consistently in the minority have their confusion probabilities increase.
• Workers with higher confusion probabilities contribute less to the distribution of ground truth.
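The dynamic above can be seen in a minimal Dawid-Skene-style EM loop; this is a simplified sketch with no item difficulty, not the talk's exact model:

```python
import math

def _normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

def dawid_skene_em(labels, num_workers, num_items, num_classes=2, iters=20):
    """Simplified Dawid-Skene-style EM.

    labels: list of (item, worker, label) triples.
    Returns (q, conf): q[i] is the posterior over item i's true class;
    conf[w][i][j] estimates P(worker w says j | true class i).
    """
    # Initial E-step: per-item empirical vote distribution (consensus).
    q = [[1.0 / num_classes] * num_classes for _ in range(num_items)]
    for i, w, l in labels:
        q[i][l] += 1.0
    q = [_normalize(row) for row in q]

    for _ in range(iters):
        # M-step: class prior and smoothed worker confusion matrices.
        prior = _normalize([sum(q[i][z] for i in range(num_items))
                            for z in range(num_classes)])
        conf = [[[1.0] * num_classes for _ in range(num_classes)]  # Laplace smoothing
                for _ in range(num_workers)]
        for i, w, l in labels:
            for z in range(num_classes):
                conf[w][z][l] += q[i][z]
        conf = [[_normalize(row) for row in m] for m in conf]

        # E-step: posterior over each item's true class.
        logq = [[math.log(prior[z]) for z in range(num_classes)]
                for _ in range(num_items)]
        for i, w, l in labels:
            for z in range(num_classes):
                logq[i][z] += math.log(conf[w][z][l])
        q = [_normalize([math.exp(v - max(row)) for v in row]) for row in logq]
    return q, conf
```

A worker who keeps disagreeing with the inferred ground truth ends up with a low diagonal in their estimated confusion matrix, and so contributes less to subsequent E-steps.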
“Different” workers are marginalized
• Workers that are consistently in the minority will not contribute strongly to the posterior distribution over ground truth.– Even if they are actually more accurate.
• Can be corrected when accurate workers are paired with some inaccurate workers.
• Good for breaking ties.
• Raykar et al.
Example with real data
Online EM
• Given a set of worker-label pairs for a single item:
• (Inference) Using current α, find most likely β* and distribution q* over ground truth.
• (Training) Do an SGD update of α with respect to the EM auxiliary function evaluated at β* and q*.
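The two steps above can be sketched for a single item; this is a skeletal version with fixed difficulty, a uniform class prior, and per-worker confusion logits, all simplifying assumptions not in the slides:

```python
import math

def online_em_step(item_labels, alpha, lr=0.1, num_classes=2):
    """One online-EM step for a single item (sketch).

    item_labels: list of (worker_id, label) pairs for one item.
    alpha: dict worker_id -> confusion logits, where
           P(l_w = j | z = i) = softmax_j(alpha[w][i]).
    Returns q*, the posterior over the item's true class, after
    updating alpha in place by one SGD step.
    """
    def softmax(row):
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        return [v / s for v in e]

    # Inference (E-step): posterior over the true class under current alpha.
    logq = [0.0] * num_classes
    for w, l in item_labels:
        for z in range(num_classes):
            logq[z] += math.log(softmax(alpha[w][z])[l])
    q = softmax(logq)

    # Training (M-step by SGD): ascend E_q[log P(labels | z)] w.r.t. alpha;
    # the gradient of log softmax is (indicator - probability).
    for w, l in item_labels:
        for z in range(num_classes):
            p = softmax(alpha[w][z])
            for j in range(num_classes):
                alpha[w][z][j] += lr * q[z] * ((1.0 if j == l else 0.0) - p[j])
    return q
```

Processing items one at a time like this avoids storing the full label matrix, at the cost of a noisier estimate of α.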
Things to do with q*
• Take an immediate cost-sensitive decision:
– d* = argmin_d E_{z~q*}[f(z, d)]
• Train an (importance-weighted) classifier:
– cost vector c_d = E_{z~q*}[f(z, d)]
– e.g. 0/1 loss: c_d = 1 - q*_d
– e.g. binary 0/1 loss: |c_1 - c_0| = |1 - 2q*_1|
– No need to decide what the true label is!
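The cost vector c_d = E_{z~q*}[f(z, d)] is a one-line expectation; a minimal sketch (illustrative, not from the talk):

```python
def cost_vector(q, loss):
    """Cost vector c_d = E_{z ~ q*}[f(z, d)] over decisions d.

    q: posterior over ground truth (list of probabilities);
    loss(z, d): task loss, e.g. the 0/1 loss below.
    """
    return [sum(q[z] * loss(z, d) for z in range(len(q)))
            for d in range(len(q))]

def zero_one(z, d):
    return 0.0 if z == d else 1.0
```

With q* = [0.7, 0.3] and 0/1 loss this gives c = [0.3, 0.7], matching c_d = 1 - q*_d; the costs feed a cost-sensitive or importance-weighted learner directly, so no hard label is ever chosen.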
• Raykar et al.: why not jointly estimate the classifier and worker confusion?
Raykar et al. insight
• Cost vector is constructed by estimating worker confusion matrices.
• Subsequently, the classifier is trained; it will sometimes disagree with workers.
• Would be nice to use that disagreement to inform the worker confusion matrices.
• Circular dependency suggests joint estimation.
Generative Model
Online Joint Estimation
• Initially the classifier will output a nearly uninformative prior and therefore will be trained to follow the consensus of workers.
• Eventually workers which disagree with the classifier will have their confusion probabilities increase.
• Workers consistently in the minority can contribute strongly to the posterior if they tend to agree with the classifier.
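In the joint model, the classifier's prediction replaces the uniform prior in the E-step; a sketch under that assumption (the `classifier(features)` interface returning a probability vector is hypothetical, not from the slides):

```python
import math

def joint_e_step(features, item_labels, alpha, classifier, num_classes=2):
    """E-step of joint estimation (sketch): the classifier's prediction on
    the item's features acts as the prior over ground truth, so a minority
    worker who agrees with a confident classifier still pulls the posterior
    toward their label.

    alpha: dict worker_id -> confusion logits, with
           P(l_w = j | z = i) = softmax_j(alpha[w][i]).
    """
    def softmax(row):
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        return [v / s for v in e]

    prior = classifier(features)
    logq = [math.log(prior[z]) for z in range(num_classes)]
    for w, l in item_labels:
        for z in range(num_classes):
            logq[z] += math.log(softmax(alpha[w][z])[l])
    return softmax(logq)
```

With a uniform classifier the posterior simply follows the worker majority; with a confident classifier the same votes can resolve the other way, which is exactly the disagreement signal the joint estimation exploits.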
Additional Resources
• Software
– http://code.google.com/p/nincompoop
• Blog
– http://machinedlearnings.com/