TRANSCRIPT
Lessons Learned from Large‑Scale Crowdsourced Data Collection for ILSVRC
Jonathan Krause
Overview
• Classification
• Localization
• Detection
Classification Overview
• 1.4M images
• 1,000 classes
By hand:
• 5 sec/image
• 50% of images correct
• 12 hours worked/day
= 324 days!
Crowdsourcing
Let the crowd do the work for you!
Classification Pipeline
1. Collect candidate images for each category
2. Put candidate images on Amazon Mechanical Turk (AMT)
3. AMT workers click on images containing each class
4. Aggregate worker responses into labels
Collecting Images
Category: “Whippet”
Google Image Search: [example image results]
Problem: Limited Images
• Web searches are limited
• Solution: Query Expansion
• WordNet: Whippet: “a small slender dog of greyhound type developed in England”
→ “whippet dog”, “whippet greyhound”
→ translate into other languages
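The query-expansion steps above can be sketched as a small helper. The synonyms and translations here are illustrative assumptions, not the actual ILSVRC query lists:

```python
# Hypothetical sketch of query expansion; the synonym and translation
# lists are illustrative, not the real ILSVRC pipeline.
def expand_queries(term, synonyms, translations):
    """Build an expanded set of image-search queries for one category."""
    queries = {term}
    for syn in synonyms:        # e.g. related words pulled from WordNet
        queries.add(f"{term} {syn}")
    for t in translations:      # the term rendered in other languages
        queries.add(t)
    return sorted(queries)

print(expand_queries("whippet", ["dog", "greyhound"], ["lévrier whippet"]))
```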
Deploying on AMT
Annotate many images at once!
Make sure workers understand the classes!
Understanding Classes
Wikipedia and Google links
Understanding Classes
Give them a definition
delta: a low triangular area of alluvial deposits where a river divides before entering a larger body of water: “the Mississippi River delta”; “the Nile delta”
Understanding Classes
Test them on the definition
Understanding Classes
Give example images (if you have them)
Hard: definition only (“a small slender dog of greyhound type developed in England”)
Easy: the same definition + example images
Quality Control
Workers on AMT are:
• Fast
• Inexpensive
• Plentiful
But they are not:
• Highly trained
Solution: Multiple responses, merge results
Quality Control
Given:
• Set of (worker, image, response) triples
Want:
• P(image has label) for each image
• (Optionally) worker quality estimates
A Simple Method
• Majority vote
Q: Is this a whippet?
Responses: Yes, No, Yes, Yes, No, No, Yes
Majority vote: Yes
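As a sketch, majority vote over one image’s responses is a one-liner:

```python
from collections import Counter

def majority_vote(responses):
    """Aggregate one image's worker responses by simple plurality."""
    return Counter(responses).most_common(1)[0][0]

print(majority_vote(["Yes", "No", "Yes", "Yes", "No", "No", "Yes"]))  # Yes
```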
Majority Vote
Problems:
• Doesn’t give confidence
• Hard to measure worker quality
Responses: Yes, No, Yes, Yes, No, No, Yes
How sure are we it’s positive?
How good are these workers?
One Approach
• Annotate a subset of images with many annotations
• Majority vote to determine ground truth
• Determine confidence given fewer annotations
Deng et al. 2009
Pro & Con
Pro:
• Simple
• Gives image confidence
Con:
• Treats all workers the same
• Relies on initial majority vote
Another Approach
Model:
• Prior of label being correct
• Worker confusion matrix
Max-likelihood with EM
Dawid, Skene. 1979
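The Dawid–Skene model above can be sketched as a small EM loop. This is a minimal binary-label toy implementation for illustration, not the pipeline actually used for ILSVRC:

```python
import numpy as np

def dawid_skene(responses, n_workers, n_items, n_iter=50):
    """Binary Dawid-Skene: jointly estimate per-item label posteriors
    and per-worker confusion matrices via EM.
    responses: iterable of (worker, item, label) with label in {0, 1}."""
    # Initialize label posteriors from raw vote fractions (majority-ish)
    votes = np.zeros((n_items, 2))
    for w, i, l in responses:
        votes[i, l] += 1
    post = votes / votes.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and confusion matrices conf[w, true, said]
        prior = post.mean(axis=0)
        conf = np.full((n_workers, 2, 2), 1e-6)  # small init avoids log(0)
        for w, i, l in responses:
            conf[w, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: recompute label posteriors under the current model
        logp = np.tile(np.log(prior), (n_items, 1))
        for w, i, l in responses:
            logp[i] += np.log(conf[w, :, l])
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf

# Toy data: workers 0 and 1 agree; worker 2 always disagrees with them
resp = [(0, 0, 1), (1, 0, 1), (2, 0, 0),
        (0, 1, 0), (1, 1, 0), (2, 1, 1)]
post, conf = dawid_skene(resp, n_workers=3, n_items=2)
```

With this data, EM concentrates the posterior on the majority labels and learns that worker 2’s confusion matrix is inverted.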
Another Approach
Worker Quality:
• Compute soft label: distribution over labels given worker response
• Calculate expected cost of soft label q
Ipeirotis, Provost, Wang. 2012
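One way to score a soft label is an expected misclassification cost. The quadratic form and cost matrix below are a sketch in the spirit of this approach, not necessarily the exact formulation in the paper:

```python
import numpy as np

def expected_cost(q, c):
    """Expected misclassification cost of a soft label q.
    q: distribution over classes; c[i, j] = cost of deciding j when
    truth is i. Computed as sum_ij q_i q_j c[i, j] -- an assumed
    quadratic-form sketch of the soft-label cost idea."""
    q = np.asarray(q)
    return q @ np.asarray(c) @ q

# A confident worker (q concentrated on one class) has zero expected
# cost under 0/1 loss; a random worker (uniform q) has high cost.
zero_one = np.array([[0.0, 1.0], [1.0, 0.0]])
print(expected_cost([1.0, 0.0], zero_one))  # 0.0
print(expected_cost([0.5, 0.5], zero_one))  # 0.5
```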
Pro & Con
Pro:
• Gives image confidence
• Gives worker quality
Con:
• More complex
• Need to run optimization
Localization Overview
• Classification images
• 1,000 classes
• 600k training bounding boxes
Main Challenge: Collecting and verifying bounding boxes
Bounding Boxes
Requirements:
• Tight around object
• Around all object instances
• Not around other objects
[Figure: bounding boxes for “bottle”]
Su, Deng, Fei-Fei. 2012
Tasks
1. Draw a bounding box around a single instance
2. Quality verification of bounding box
3. Coverage verification
Drawing
Intuitively simple… but the devil is in the details
Drawing
Things vision researchers take for granted:
• Include all visible parts
• Include only visible parts
• Make the bounding box tight
• Only include a single instance
• Don’t draw over any instances that already have bounding boxes
• What if there are no unannotated objects?
→ Provide instructions and use a qualification task!
Drawing
Include all visible parts
Drawing
Include only visible parts
• Don’t try to “complete” the object
Drawing
Make the bounding box tight
• Even though loose is much faster
Drawing
Only include a single instance
Drawing
Don’t draw over instances that already have bounding boxes
• Can enforce this in the UI
Drawing
What if there are no unannotated objects?
• Give the option to annotate no bounding boxes (a “No more objects” option in the UI)
Quality Verification
Simpler than bounding box drawing, but still has some details.
Example question: “Is this bounding box good?” → YES
Quality Verification
Details:
• Workers still need to know what a good bounding box is
• Quality control
Quality Verification
Quality control:
• Embed “gold standard” images: positives via majority vote, negatives by perturbing the positives
• Reject annotations with bad answers on the gold standard
• Can be used for almost any type of task!
• (Optionally) require agreement of more than one annotator
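A gold-standard check can be sketched as below; the function name and error threshold are assumptions for illustration, not the actual ILSVRC tooling:

```python
# Hypothetical sketch: reject a worker's batch if they miss too many
# embedded gold-standard questions. Threshold is an assumption.
def accept_batch(answers, gold, max_errors=1):
    """answers: dict question_id -> response for the whole batch
    gold:    dict question_id -> known-correct response (hidden subset)"""
    errors = sum(1 for qid, truth in gold.items() if answers.get(qid) != truth)
    return errors <= max_errors

gold = {"g1": True, "g2": False, "g3": True}
print(accept_batch({"g1": True, "g2": False, "g3": True, "q7": True}, gold))   # True
print(accept_batch({"g1": False, "g2": True, "g3": False, "q7": True}, gold))  # False
```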
Coverage Verification
Similar in style to quality verification:
• Just a different question
• Still need instructions, quality control
Example question: “Any unannotated raccoons?” → Nope!
Bounding Boxes: Misc.
Provide definitions and example images!
• Especially for uncommon objects
• But also helps with common objects
• Annotators come from different cultures
Make sure objects being annotated are actually in your images
• Do the classification task first
Bounding Boxes: Misc.
• Make qualification tasks
• Verification tasks are much faster than drawing
• Corner cases: each task needs a plan for when the previous task goes wrong
Detection Overview
• 456k training images
• 61k fully-annotated val+test images
• 200 classes
Main Challenge: Annotating all 200 classes in every image.
Detection Pipeline
1. Collect images
2. Class presence annotation
3. Bounding box annotation (same as in the localization section)
Collecting Images
Need images that aren’t single-object-centric.
Additional queries:
• Compound object queries (“tiger lion”, “skunk and cat”)
• Complex scene queries (“kitchenette”, “dining table”, “orchestra”)
Class Presence Annotation
Deng, Russakovsky, Krause, Bernstein, Berg, Fei-Fei. CHI 2014
Naive approach: ask for each object
The machine asks the crowd about each class in turn:
“Is there a table?” → Yes
“Is there a chair?” → Yes
“Is there a horse?” → No
“Is there a dog?” → No
“Is there a cat?” → No
“Is there a bird?” → No

Table Chair Horse Dog Cat Bird
  +     +     −    −    −    −
  +     −     −    −    +    −
  +     +     −    −    −    −

Cost: O(NK) for N images and K objects
Sparsity: most objects are absent from any given image (rows of the label matrix are mostly −).
Correlation: labels co-occur in predictable ways.
Hierarchy: organize the labels into a tree:
Animal → Mammal → {Horse, Dog, Cat}
Animal → Bird
Furniture → {Table, Chair}
Better approach: exploit label structure
Starting with all six labels unknown:
“Is there an animal?” → No ⇒ Horse, Dog, Cat, Bird all −
“Is there furniture?” → Yes
“Is there a table?” → Yes ⇒ Table +
“Is there a chair?” → Yes ⇒ Chair +

Table Chair Horse Dog Cat Bird
  +     +     −    −    −    −

A single “No” high in the hierarchy resolves many leaf labels at once.
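The propagation step can be sketched as follows; the toy hierarchy and function names are assumptions based on the walkthrough above, not the actual CHI 2014 implementation:

```python
# Hypothetical sketch of hierarchy-based label propagation: a "No" at
# an internal node resolves every leaf class beneath it at once.
hierarchy = {                      # assumed toy hierarchy from the example
    "animal": ["mammal", "bird"],
    "mammal": ["horse", "dog", "cat"],
    "furniture": ["table", "chair"],
}

def leaves(node):
    """All leaf classes under a node (a leaf is its own only leaf)."""
    kids = hierarchy.get(node)
    if not kids:
        return [node]
    return [leaf for k in kids for leaf in leaves(k)]

def answer(labels, node, present):
    """Record a worker's yes/no answer for `node` into leaf `labels`."""
    if not present:                # "No" marks all descendant leaves negative
        for leaf in leaves(node):
            labels[leaf] = False
    elif node not in hierarchy:    # "Yes" only resolves an actual leaf class
        labels[node] = True

labels = {}
answer(labels, "animal", False)    # one question resolves 4 leaf classes
answer(labels, "table", True)
answer(labels, "chair", True)
print(labels)
```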
Selecting the Right Question
Goal: Get as much utility (new labels) as possible, for as little cost (worker time) as possible, given a desired level of accuracy.
Accuracy constraint
• User-specified accuracy threshold, e.g., 95%
• Might require only one worker, might require several, depending on the task
Cost: worker time (time = money)
Question (is there …)                                             Cost (seconds)
a thing used to open cans/bottles                                 14.4
an item that runs on electricity (plugged in or using batteries)  12.6
a stringed instrument                                              3.4
a canine                                                           2.0
(cost = expected human time to get an answer with 95% accuracy)
Utility: expected # of new labels
Example 1: “Is there a table?” resolves exactly one label either way:
Yes → Table +; No → Table −
utility = 1

Example 2: “Is there an animal?”, with Pr(Yes) = 0.5 and Pr(No) = 0.5:
Yes → 0 labels resolved; No → 4 labels resolved (Horse, Dog, Cat, Bird all −)
utility = 0.5 × 0 + 0.5 × 4 = 2
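The utility computation above is just an expectation over the two possible answers; as a sketch (the 0.5/0.5 probabilities are the illustrative values from the example):

```python
# Sketch of the expected-utility calculation for one question.
def expected_utility(p_yes, labels_if_yes, labels_if_no):
    """Expected number of newly resolved labels from asking one question."""
    return p_yes * labels_if_yes + (1 - p_yes) * labels_if_no

print(expected_utility(0.5, 1, 1))   # "Is there a table?": 1 label either way
print(expected_utility(0.5, 0, 4))   # "Is there an animal?": 0.5*0 + 0.5*4
```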
Pick the question with the most labels per second
Query: Is there a…              Utility (num labels)   Cost (secs)   Utility-Cost Ratio (labels/sec)
mammal with claws or fingers            12.0               3.0              4.0
living organism                         24.8               7.9              3.1
mammal                                  17.6               7.4              2.4
creature without legs                    5.9               2.6              2.3
land or avian creature                  20.8               9.5              2.2
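Question selection then reduces to ranking candidates by this ratio; a sketch using the utility and cost figures from the table above:

```python
# Rank candidate queries by expected labels per second of worker time.
candidates = [
    ("mammal with claws or fingers", 12.0, 3.0),
    ("living organism",              24.8, 7.9),
    ("mammal",                       17.6, 7.4),
    ("creature without legs",         5.9, 2.6),
    ("land or avian creature",       20.8, 9.5),
]
best = max(candidates, key=lambda q: q[1] / q[2])   # utility / cost
print(best[0])  # mammal with claws or fingers (4.0 labels/sec)
```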
• Dataset: 20K images from ImageNet Challenge 2013
• Labels: 200 basic categories (dog, cat, table…)
• 64 internal nodes in the hierarchy
Results
Results: accuracy
Accuracy threshold per question (parameter)   Naive approach: accuracy (F1)   Our approach: accuracy (F1)
0.95                                          99.64 (75.67)                   99.75 (76.97)
0.90                                          99.29 (60.17)                   99.62 (60.69)
Annotating 10K images with 200 objects
Results: cost
Accuracy threshold per question (parameter)   Cost saving (our approach vs. naive)
0.95                                          3.93×
0.90                                          6.18×
Annotating 10K images with 200 objects
6 times more labels per second
Final Thoughts
• Provide good instructions
• Do quality control
• Visualize results
• Listen to your workers
Questions?