Loss-based Learning with Weak Supervision (mpawankumar.info/tutorials/cvpr2013/slides/cvpr...)
TRANSCRIPT
[Slide 1]
Loss-based Learning with Weak Supervision
M. Pawan Kumar
[Slide 2]
[Chart: log(size) vs. information content of annotation – Segmentation: ~2,000]
Computer Vision Data
[Slide 3]
[Chart: log(size) vs. information – Segmentation: ~2,000; Bounding Box: ~1 M]
Computer Vision Data
[Slide 4]
[Chart: log(size) vs. information – Segmentation: ~2,000; Bounding Box: ~1 M; Image-Level ("Car", "Chair"): > 14 M]
Computer Vision Data
[Slide 5]
[Chart: log(size) vs. information – Segmentation: ~2,000; Bounding Box: ~1 M; Image-Level: > 14 M; Noisy Label: > 6 B]
Computer Vision Data
[Slide 6]
Learn with missing information (latent variables)
Detailed annotation is expensive
Sometimes annotation is impossible
Desired annotation keeps changing
Computer Vision Data
[Slide 7]
• Two Types of Problems
• Part I – Annotation Mismatch
• Part II – Output Mismatch
Outline
[Slide 8]
Annotation Mismatch
Input x
Annotation y
Latent h
x
y = “jumping”
h
Action Classification
Mismatch between desired and available annotations
Exact value of latent variable is not “important”
Desired output during test time is y
[Slide 9]
Output Mismatch
Input x
Annotation y
Latent h
x
y = “jumping”
h
Action Classification
[Slide 10]
Output Mismatch
Input x
Annotation y
Latent h
x
y = “jumping”
h
Action Detection
Mismatch between output and available annotations
Exact value of latent variable is important
Desired output during test time is (y,h)
[Slide 11]
Part I
[Slide 12]
• Latent SVM
• Optimization
• Practice
• Extensions
Outline – Annotation Mismatch
Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009
[Slide 13]
Weakly Supervised Data
Input x
Output y ∈ {-1,+1}
Hidden h
x
y = +1
h
[Slide 14]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,y,h)
x
y = +1
h
[Slide 15]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,+1,h) = [Φ(x,h); 0]
x
y = +1
h
[Slide 16]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,-1,h) = [0; Φ(x,h)]
x
y = +1
h
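In code, the block construction of the joint feature vector reads as follows (a minimal sketch; `phi` stands in for a hypothetical Φ(x,h)):

```python
import numpy as np

def joint_feature(phi, y):
    """Joint feature vector Psi(x, y, h) for binary y in {-1, +1}.

    Stacks Phi(x, h) into the block selected by the label, so a single
    weight vector w can score both classes.
    """
    zeros = np.zeros_like(phi)
    if y == +1:
        return np.concatenate([phi, zeros])   # Psi(x, +1, h) = [Phi; 0]
    return np.concatenate([zeros, phi])       # Psi(x, -1, h) = [0; Phi]

phi = np.array([0.5, -1.0, 2.0])              # a hypothetical Phi(x, h)
print(joint_feature(phi, +1))                  # [ 0.5 -1.   2.   0.   0.   0. ]
print(joint_feature(phi, -1))                  # [ 0.   0.   0.   0.5 -1.   2. ]
```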
[Slide 17]
Weakly Supervised Classification
Feature Φ(x,h)
Joint Feature Vector
Ψ(x,y,h)
Score f : Ψ(x,y,h) → (-∞, +∞)
Optimize score over all possible y and h
x
y = +1
h
[Slide 18]
Latent SVM
Parameters w
Scoring function: wᵀΨ(x,y,h)
Prediction: y(w), h(w) = argmax_{y,h} wᵀΨ(x,y,h)
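Prediction enumerates y and h and keeps the highest-scoring pair. A minimal sketch, assuming binary y and a small finite set of candidate h whose features are stacked row-wise in a hypothetical `phis` array:

```python
import numpy as np

def predict(w, phis):
    """Latent SVM prediction: maximize w^T Psi(x, y, h) over y and h.

    Because Psi stacks Phi(x, h) into the block chosen by y, the score
    of (y, h) is the dot product of Phi(x, h) with one half of w.
    """
    d = phis.shape[1]
    best = None
    for y, w_block in ((+1, w[:d]), (-1, w[d:])):
        for h, phi in enumerate(phis):
            score = float(phi @ w_block)
            if best is None or score > best[0]:
                best = (score, y, h)
    return best  # (score, y(w), h(w))

phis = np.array([[1.0, 0.0], [0.0, 1.0]])   # Phi(x, h) for two candidate h
w = np.array([2.0, 0.0, 0.0, 3.0])          # illustrative parameters
print(predict(w, phis))                      # (3.0, -1, 1)
```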
[Slide 19]
Learning Latent SVM
Training data {(x_i, y_i), i = 1, 2, …, n}
Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w))
No restriction on the loss function
Annotation mismatch
[Slide 20]
Learning Latent SVM
Empirical risk minimization: min_w Σ_i Δ(y_i, y_i(w))
Non-convex
Parameters cannot be regularized
Find a regularization-sensitive upper bound
[Slide 21]
Learning Latent SVM
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + wᵀΨ(x_i, y_i(w), h_i(w)) - wᵀΨ(x_i, y_i(w), h_i(w))
[Slide 22]
Learning Latent SVM
Δ(y_i, y_i(w)) ≤ Δ(y_i, y_i(w)) + wᵀΨ(x_i, y_i(w), h_i(w)) - max_{h_i} wᵀΨ(x_i, y_i, h_i)
since y(w), h(w) = argmax_{y,h} wᵀΨ(x,y,h)
[Slide 23]
Learning Latent SVM
min_w ||w||² + C Σ_i ξ_i
s.t. max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i
Parameters can be regularized
Is this also convex?
[Slide 24]
Learning Latent SVM
min_w ||w||² + C Σ_i ξ_i
s.t. max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i
Convex - Convex
Difference of convex (DC) program
[Slide 25]
Recap
Scoring function: wᵀΨ(x,y,h)
Prediction: y(w), h(w) = argmax_{y,h} wᵀΨ(x,y,h)
Learning:
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y, h) + Δ(y_i, y) - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i, for all y, h
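For a fixed w, the smallest feasible slack ξ_i is the left-hand side of the learning constraint: loss-augmented inference minus the best latent completion of the ground truth. A minimal sketch, assuming binary y and an enumerable set of candidate h (`phis`, `w`, and `delta` are hypothetical stand-ins):

```python
import numpy as np

def slack(w, phis, y_true, delta):
    """Upper bound xi_i on the loss for one example (binary y).

    xi_i = max_{y,h} [Delta(y_i, y) + w^T Psi(x, y, h)]
           - max_{h} w^T Psi(x, y_i, h)
    """
    d = phis.shape[1]
    blocks = {+1: w[:d], -1: w[d:]}
    # Loss-augmented inference over all (y, h).
    aug = max(delta(y_true, y) + float(phi @ blocks[y])
              for y in (+1, -1) for phi in phis)
    # Best latent completion of the ground-truth label.
    fit = max(float(phi @ blocks[y_true]) for phi in phis)
    return aug - fit

delta = lambda a, b: float(a != b)           # 0-1 loss
phis = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([2.0, 0.0, 0.0, 3.0])
print(slack(w, phis, +1, delta))             # 2.0
```

Here the prediction under w is y = -1, so the true loss is 1; the computed slack 2.0 is indeed an upper bound on it.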
[Slide 26]
• Latent SVM
• Optimization
• Practice
• Extensions
Outline – Annotation Mismatch
[Slide 27]
Learning Latent SVM
min_w ||w||² + C Σ_i ξ_i
s.t. max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i) ≤ ξ_i
Difference of convex (DC) program
[Slides 28–32]
Concave-Convex Procedure
Objective: max_{y,h} [Δ(y_i, y) + wᵀΨ(x_i, y, h)] - max_{h_i} wᵀΨ(x_i, y_i, h_i)
Repeat until convergence:
– Linearly upper-bound the concave part, -max_{h_i} wᵀΨ(x_i, y_i, h_i)
– Optimize the resulting convex upper bound
Linear upper bound?
[Slide 33]
Linear Upper Bound
Current estimate = w_t
h_i* = argmax_{h_i} w_tᵀΨ(x_i, y_i, h_i)
-wᵀΨ(x_i, y_i, h_i*) ≥ -max_{h_i} wᵀΨ(x_i, y_i, h_i)
[Slide 34]
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Repeat until convergence
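The alternation can be sketched as a short loop. The inner structural-SVM solve is abstracted behind a placeholder `solve_ssvm` callback (any cutting-plane or subgradient solver would do); `examples` and `dim` are hypothetical names:

```python
import numpy as np

def cccp(examples, dim, solve_ssvm, max_iter=20, tol=1e-4):
    """CCCP for latent SVM (sketch, binary y with enumerable h).

    `examples` is a list of (phis, y) pairs, where phis[h] = Phi(x, h).
    `solve_ssvm(examples, h_star)` must return the eps-optimal w of
      min ||w||^2 + C sum_i xi_i
      s.t. w^T Psi(x_i, y_i, h_i*) - w^T Psi(x_i, y, h) >= Delta(y_i, y) - xi_i.
    """
    w = np.zeros(2 * dim)
    for _ in range(max_iter):
        # Impute latent variables under the current w:
        # h_i* = argmax_h w^T Psi(x_i, y_i, h).
        h_star = []
        for phis, y in examples:
            block = w[:dim] if y == +1 else w[dim:]
            h_star.append(int(np.argmax(phis @ block)))
        w_new = solve_ssvm(examples, h_star)
        if np.linalg.norm(w_new - w) < tol:   # convergence check
            return w_new
        w = w_new
    return w

examples = [(np.array([[1.0, 0.0], [0.0, 1.0]]), +1)]
dummy = lambda ex, h_star: np.ones(4)        # stand-in solver for illustration
print(cccp(examples, dim=2, solve_ssvm=dummy))   # [1. 1. 1. 1.]
```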
[Slide 35]
• Latent SVM
• Optimization
• Practice
• Extensions
Outline – Annotation Mismatch
[Slide 36]
Action Classification
Input x
Output y = "Using Computer"
PASCAL VOC 2011
80/20 Train/Test Split, 5 Folds
Jumping
Phoning
Playing Instrument
Reading
Riding Bike
Riding Horse
Running
Taking Photo
Using Computer
Walking
Train Input xi Output yi
[Slide 37]
• 0-1 loss function
• Poselet-based feature vector
• 4 seeds for random initialization
• Code + Data
• Train/Test scripts with hyperparameter settings
Setup
http://www.centrale-ponts.fr/tutorials/cvpr2013/
[Slides 38–41: plots of Objective, Train Error, Test Error, and Time]
[Slide 42]
• Latent SVM
• Optimization
• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function
• Extensions
Outline – Annotation Mismatch
[Slide 43]
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Repeat until convergence
Overfitting in initial iterations
[Slide 44]
Annealing the Tolerance
Start with an initial estimate w_0 and a loose tolerance ε'
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε'-optimal solution of
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Anneal ε' ← ε'/K; repeat until convergence with ε' = ε
[Slides 45–52: plots of Objective, Train Error, Test Error, and Time]
[Slide 53]
• Latent SVM
• Optimization
• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function
• Extensions
Outline – Annotation Mismatch
[Slide 54]
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Repeat until convergence
Overfitting in initial iterations
[Slide 55]
Annealing the Regularization
Start with an initial estimate w_0 and a small C'
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C' Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Anneal C' ← C' × K; repeat until convergence with C' = C
[Slides 56–63: plots of Objective, Train Error, Test Error, and Time]
[Slide 64]
• Latent SVM
• Optimization
• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function
• Extensions
Outline – Annotation Mismatch
Kumar, Packer and Koller, NIPS 2010
[Slide 65]
CCCP for Human Learning
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
"Math is for losers!!"
FAILURE … BAD LOCAL MINIMUM
[Slide 66]
Self-Paced Learning
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
"Euler was a Genius!!"
SUCCESS … GOOD LOCAL MINIMUM
[Slide 67]
Self-Paced Learning
Start with "easy" examples, then consider "hard" ones
Determining easy vs. hard by hand is expensive
Easy for a human ≠ easy for a machine
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
[Slide 68]
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
[Slide 69]
Self-Paced Learning
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
[Slide 70]
Self-Paced Learning
min_w ||w||² + C Σ_i v_i ξ_i
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
v_i ∈ {0,1}
Trivial solution: v_i = 0 for all i
[Slide 71]
Self-Paced Learning
min_w ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
v_i ∈ {0,1}
[Illustration: examples selected for large, medium, and small K]
[Slide 72]
Self-Paced Learning
min_w ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
v_i ∈ [0,1]
Biconvex Problem → Alternating Convex Search
[Illustration: examples selected for large, medium, and small K]
[Slide 73]
SPL for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_tᵀΨ(x_i, y_i, h_i)
Update w_{t+1} as the ε-optimal solution of
min_w ||w||² + C Σ_i v_i ξ_i - Σ_i v_i/K
s.t. wᵀΨ(x_i, y_i, h_i*) - wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) - ξ_i
Decrease K ← K/µ
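With w fixed, the v-step of the biconvex objective decouples over examples and has a closed form; a minimal sketch (variable names are illustrative):

```python
def select_easy(xis, C, K):
    """Closed-form v-step of self-paced learning.

    With w (hence the slacks xi_i) fixed, the objective is linear in each
    v_i: including example i contributes C*xi_i - 1/K, so
      v_i = 1  iff  xi_i < 1 / (C * K)   ("easy" example: small loss bound)
    """
    return [1 if xi < 1.0 / (C * K) else 0 for xi in xis]

# As K decreases (K <- K/mu), the threshold 1/(C*K) grows and harder
# examples enter the training set.
print(select_easy([0.05, 0.4, 2.0], C=1.0, K=5.0))   # [1, 0, 0]
print(select_easy([0.05, 0.4, 2.0], C=1.0, K=1.0))   # [1, 1, 0]
```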
[Slides 74–81: plots of Objective, Train Error, Test Error, and Time]
![Page 82: Loss-based Learning with Weak Supervisionmpawankumar.info/tutorials/cvpr2013/slides/CVPR... · Weakly Supervised Classification Feature Φ(x,h) Joint Feature Vector Ψ(x,-1,h) 0 Φ(x,h)](https://reader034.vdocuments.site/reader034/viewer/2022051908/5ffc7e57851a3d6a65169e00/html5/thumbnails/82.jpg)
Outline – Annotation Mismatch
• Latent SVM
• Optimization
• Practice
  – Annealing the Tolerance
  – Annealing the Regularization
  – Self-Paced Learning
  – Choice of Loss Function
• Extensions
Behl, Jawahar and Kumar, In Preparation
Ranking (Rank 1 – Rank 6)

Average Precision = 1, Accuracy = 1
Average Precision = 0.92, Accuracy = 0.67
Average Precision = 0.81
Ranking

During testing, AP is frequently used.
During training, a surrogate loss is used.
This contradicts the loss-based learning philosophy.
Solution: optimize AP directly.
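The AP values on the ranking slide follow directly from the definition (the mean of the precision at each relevant position). A short sketch; the relevance patterns below are chosen to be consistent with the reported AP values, since the images behind each ranking are not recoverable here:

```python
def average_precision(labels):
    """AP of a ranked list of binary relevance labels (1 = relevant)."""
    hits, precisions = 0, []
    for rank, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# three rankings of 6 images with 3 relevant, as on the slide
print(round(average_precision([1, 1, 1, 0, 0, 0]), 2))  # 1.0
print(round(average_precision([1, 1, 0, 1, 0, 0]), 2))  # 0.92
print(round(average_precision([1, 0, 1, 1, 0, 0]), 2))  # 0.81
```

Note the last two rankings have the same accuracy but different AP, which is why training against a surrogate for accuracy can mis-order them.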
Results
Statistically significant improvement
Speed – Proximal Regularization

Start with a good initial estimate w0

Repeat until convergence:
hi* = argmaxhi∈H wtTΨ(xi,yi,hi)
Update wt+1 as the ε-optimal solution of
min ||w||2 + C∑i ξi + Ct ||w - wt||2
s.t. wTΨ(xi,yi,hi*) - wTΨ(xi,y,h) ≥ Δ(yi, y) - ξi, for all y, h
Speed – Cascades
Weiss and Taskar, AISTATS 2010 Sapp, Toshev and Taskar, ECCV 2010
Accuracy – (Self) Pacing
Pacing the sample complexity – NIPS 2010
Pacing the model complexity
Pacing the problem complexity
Building Accurate Systems

Model: 85%   Inference: 5%   Learning: 10%

Learning cannot provide huge gains without a good model
Inference cannot provide huge gains without a good model
Outline – Annotation Mismatch
• Latent SVM
• Optimization
• Practice
• Extensions
  – Latent Variable Dependent Loss
  – Max-Margin Min-Entropy Models
Yu and Joachims, ICML 2009
Latent Variable Dependent Loss

y(w),h(w) = argmaxy,h wTΨ(x,y,h)

Δ(yi, yi(w), hi(w))
  ≤ Δ(yi, yi(w), hi(w)) + wTΨ(xi,yi(w),hi(w)) - maxhi wTΨ(xi,yi,hi)
  ≤ maxy,h { Δ(yi, y, h) + wTΨ(xi,y,h) } - maxhi wTΨ(xi,yi,hi) ≤ ξi

minw ||w||2 + C Σiξi
Optimizing Precision@k

Input X = {xi, i = 1, …, n}
Annotation Y = {yi, i = 1, …, n} ∈ {-1,+1}n
Latent H = ranking
Loss Δ(Y*, Y, H) = 1 - Precision@k
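Precision@k itself is immediate to compute; a small sketch, with the ranking given as binary relevance labels in ranked order:

```python
def precision_at_k(ranked_labels, k):
    """Fraction of relevant items among the top k positions of the ranking."""
    return sum(ranked_labels[:k]) / k

def precision_at_k_loss(ranked_labels, k):
    """The loss used above: 1 - Precision@k."""
    return 1.0 - precision_at_k(ranked_labels, k)
```

For example, a ranking with relevance pattern [1, 1, 0, 1, 0, 0] has Precision@3 = 2/3 and therefore loss 1/3.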
Outline – Annotation Mismatch
• Latent SVM
• Optimization
• Practice
• Extensions
  – Latent Variable Dependent Loss
  – Max-Margin Min-Entropy (M3E) Models
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
Running vs. Jumping Classification

Score wTΨ(x,y,h) → (-∞, +∞)

wTΨ(x,y1,h): 0.00 0.00 0.25 0.00 0.25 0.00 0.25 0.00 0.00
wTΨ(x,y2,h): 0.00 0.24 0.00 0.00 0.00 0.00 0.01 0.00 0.00

Only the maximum score is used. No other useful cue?
Uncertainty in h
M3E

Scoring function
Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x), where Z(x) is the partition function

Prediction
y(w) = argminy { Hα(Pw(h|y,x)) – log Pw(y|x) }

Here Hα is the Rényi entropy and Pw(y|x) the marginalized probability; their sum is the Rényi entropy of the generalized distribution, Gα(y;x,w).
Rényi Entropy

Gα(y;x,w) = (1/(1-α)) log [ Σh Pw(y,h|x)α / Σh Pw(y,h|x) ]

α = 1: Shannon entropy of the generalized distribution
G1(y;x,w) = - Σh Pw(y,h|x) log Pw(y,h|x) / Σh Pw(y,h|x)
Rényi Entropy

Gα(y;x,w) = (1/(1-α)) log [ Σh Pw(y,h|x)α / Σh Pw(y,h|x) ]

α → ∞: minimum entropy of the generalized distribution
G∞(y;x,w) = - maxh log Pw(y,h|x)
Rényi Entropy

α → ∞: minimum entropy of the generalized distribution
G∞(y;x,w) = - maxh log Pw(y,h|x) = log Z(x) - maxh wTΨ(x,y,h)

Since log Z(x) does not depend on y, minimizing G∞ gives the same prediction as latent SVM.
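The limiting cases can be checked numerically. A sketch of Gα for a generalized (unnormalized) distribution over h; the example vectors reuse the nonzero entries of the running-vs-jumping slide (zeros dropped to keep the logarithms finite):

```python
import numpy as np

def renyi_gen_entropy(p, alpha):
    """G_alpha of a generalized distribution {Pw(y,h|x)}_h (need not sum to 1)."""
    p = np.asarray(p, dtype=float)
    if np.isinf(alpha):                       # minimum entropy
        return float(-np.log(p.max()))
    if alpha == 1.0:                          # Shannon entropy (generalized)
        return float(-(p * np.log(p)).sum() / p.sum())
    return float(np.log((p ** alpha).sum() / p.sum()) / (1.0 - alpha))

# y1: three equally likely latent values; y2: one dominant latent value
g1 = renyi_gen_entropy([0.25, 0.25, 0.25], float('inf'))  # -log 0.25
g2 = renyi_gen_entropy([0.24, 0.01], float('inf'))        # -log 0.24
# argmin over y at alpha = infinity picks y1, matching the latent SVM prediction
```

Note Gα of an unnormalized distribution absorbs the -log Pw(y|x) term, which is exactly why M3E predicts by minimizing Gα rather than a plain entropy.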
Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi,yi(w))

Highly non-convex in w; cannot regularize w to prevent overfitting
Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi,yi(w)) ≤ Δ(yi,yi(w)) + Gα(yi;xi,w) - Gα(yi(w);xi,w)
           ≤ maxy { Δ(yi,y) + Gα(yi;xi,w) - Gα(y;xi,w) }
Learning M3E

minw ||w||2 + C Σiξi
s.t. Gα(yi;xi,w) + Δ(yi,y) – Gα(y;xi,w) ≤ ξi, for all y

When α tends to infinity, M3E = latent SVM; other values of α can give better results
Motif Finding Results

Motif + Markov Background Model (Yu and Joachims, 2009)
Part II
Output Mismatch

Action Detection: Input x, Annotation y = “jumping”, Latent h

Mismatch between the output and the available annotations
The exact value of the latent variable is important
The desired output during test time is (y,h)
Outline – Output Mismatch
• Problem Formulation
• Dissimilarity Coefficient Learning
• Optimization
• Experiments
Weakly Supervised Data

Input x
Output y ∈ {0,1,…,C}
Hidden h (example image: y = 0)
Weakly Supervised Detection

Feature Φ(x,h), Joint Feature Vector Ψ(x,y,h)

Ψ(x,y,h) places Φ(x,h) in the block corresponding to label y, with zeros elsewhere:
Ψ(x,0,h) = [Φ(x,h); 0; …; 0]
Ψ(x,1,h) = [0; Φ(x,h); …; 0]
Ψ(x,C,h) = [0; …; 0; Φ(x,h)]

Score f : Ψ(x,y,h) → (-∞, +∞)
Optimize the score over all possible y and h
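The block construction of Ψ(x,y,h) is easy to write down. A minimal sketch; the feature dimension and label count below are illustrative:

```python
import numpy as np

def joint_feature(phi, y, num_labels):
    """Psi(x,y,h): copy Phi(x,h) into the block indexed by y, zeros elsewhere."""
    d = phi.shape[0]
    psi = np.zeros(num_labels * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

phi = np.array([0.7, 0.2])        # illustrative Phi(x,h)
psi = joint_feature(phi, 1, 3)    # -> [0, 0, 0.7, 0.2, 0, 0]
```

With this construction, wTΨ(x,y,h) reads out the block of w belonging to label y, so a single parameter vector scores all labels.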
Linear Model

Parameters w

Scoring function: wTΨ(x,y,h)
Prediction: y(w),h(w) = argmaxy,h wTΨ(x,y,h)
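For small label and latent spaces the prediction rule can be evaluated by exhaustive search. A sketch; the one-hot feature map and the spaces below are illustrative, not the tutorial's:

```python
import numpy as np

def predict(w, psi, x, labels, latents):
    """y(w), h(w) = argmax over (y,h) of w . Psi(x,y,h)."""
    return max(((y, h) for y in labels for h in latents),
               key=lambda yh: float(w @ psi(x, yh[0], yh[1])))

# illustrative joint feature: one-hot over the four (y,h) pairs
def psi(x, y, h):
    v = np.zeros(4)
    v[2 * y + h] = 1.0
    return v

w = np.array([0.0, 0.5, 3.0, 1.0])
y_hat, h_hat = predict(w, psi, None, labels=[0, 1], latents=[0, 1])  # (1, 0)
```

In practice the argmax over h is a structured inference problem (e.g. a search over bounding boxes) rather than a flat enumeration.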
Minimizing General Loss

minw Σi Δ(yi,hi,yi(w),hi(w)) + Σi Δ’(yi,yi(w),hi(w))
(first sum: supervised samples; second sum: weakly supervised samples)

Unknown latent variable values
Minimizing General Loss

minw Σi Σhi Pw(hi|xi,yi) Δ(yi,hi,yi(w),hi(w))

A single distribution has to achieve two objectives
Outline – Output Mismatch
• Problem Formulation
• Dissimilarity Coefficient Learning
• Optimization
• Experiments
Kumar, Packer and Koller, ICML 2012
Problem
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Solution
Model Uncertainty in Latent Variables
Model Accuracy of Latent Variable Predictions
Use two different distributions for the two different tasks
Pθ(hi|yi,xi) models the uncertainty in the latent variables
Pw(yi,hi|xi) models the accuracy of the latent variable predictions (yi(w),hi(w))
The Ideal Case: no latent variable uncertainty, correct prediction
In Practice: restrictions in the representation power of models
Our Framework: minimize the dissimilarity between the two distributions, for a user-defined dissimilarity measure
Our Framework: Minimize Rao’s Dissimilarity Coefficient

Hi(w,θ) = Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)
Hi(θ,θ) = Σh,h’ Δ(yi,h,yi,h’) Pθ(h|yi,xi) Pθ(h’|yi,xi)

The coefficient is Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w)); the last term vanishes since Δ(a,a) = 0.

minw,θ Σi { Hi(w,θ) - β Hi(θ,θ) }
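For a discrete latent space both terms can be computed directly. A toy sketch in which Δ depends only on the latent values (the matrix `delta`) and Pθ(h|yi,xi) is a vector `p`; both are illustrative:

```python
import numpy as np

def diversity(delta, p, q):
    """H(P,Q) = sum over h,h' of Delta(h,h') P(h) Q(h'): Rao's diversity."""
    return float(np.asarray(p) @ np.asarray(delta) @ np.asarray(q))

def dissimilarity(delta, p, h_pred, beta=0.5):
    """Hi(w,theta) - beta * Hi(theta,theta): diversity to the point-mass
    prediction minus beta times the self-diversity of P_theta.
    (Delta(a,a) = 0, so the prediction's own self-diversity term vanishes.)"""
    point = np.zeros(len(p)); point[h_pred] = 1.0
    return diversity(delta, p, point) - beta * diversity(delta, p, p)

delta = 1.0 - np.eye(2)     # 0/1 loss on the latent variable
p = np.array([0.5, 0.5])    # maximally uncertain P_theta
```

A certain and correct Pθ (a point mass on the predicted h) gives dissimilarity zero, which is the ideal case described above.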
Outline – Output Mismatch
• Problem Formulation
• Dissimilarity Coefficient Learning
• Optimization
• Experiments
Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0
Repeat until convergence:
  Fix w and optimize θ
  Fix θ and optimize w
End
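The alternating scheme is ordinary block-coordinate descent. A toy sketch on an illustrative objective f(w,θ) = (w - θ)² + (θ - 3)², whose block minimizers have closed form (nothing here comes from the tutorial's actual sub-solvers):

```python
def alternating_minimize(argmin_w, argmin_theta, w, theta, rounds=30):
    """Block-coordinate descent: alternately fix one block, minimize the other."""
    for _ in range(rounds):
        theta = argmin_theta(w)   # fix w and optimize theta
        w = argmin_w(theta)       # fix theta and optimize w
    return w, theta

# block minimizers of f(w,theta) = (w - theta)**2 + (theta - 3)**2:
# w*(theta) = theta and theta*(w) = (w + 3) / 2
w, theta = alternating_minimize(lambda t: t, lambda w: (w + 3) / 2, 0.0, 0.0)
# both variables converge towards the joint minimum w = theta = 3
```

In the actual framework, each block problem is itself non-trivial: the θ-step and w-step are the two sub-problems treated on the next slides.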
Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi
Case II: yi(w) ≠ yi

Solved by stochastic subgradient descent
Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

The expected loss models the uncertainty in h
The form of the optimization is similar to latent SVM; if Δ is independent of h, it reduces to latent SVM
Solved by the Concave-Convex Procedure (CCCP)
Outline – Output Mismatch
• Problem Formulation
• Dissimilarity Coefficient Learning
• Optimization
• Experiments
Action Detection

Input x, Output y = “Using Computer”, Latent Variable h

PASCAL VOC 2011, 60/40 Train/Test Split, 5 Folds
Classes: Jumping, Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking

Train: Input xi, Output yi
Results – 0/1 Loss

Average test loss over Folds 1–5, LSVM vs. Our (bar chart omitted)

Statistically significant
Results – Overlap Loss

Average test loss over Folds 1–5, LSVM vs. Our (bar chart omitted)

Statistically significant
Questions?
http://www.centrale-ponts.fr/personnel/pawan