Loss-based Learning with Latent Variables
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France
Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman


Page 1:

Loss-based Learning with Latent Variables

M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman

Page 2:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 3:

Image Classification

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f (Ψ(x,y1,h))

Page 4:

Image Classification

0.00 0.23 0.00
0.00 0.00 0.01
0.01 0.00 0.00

f (Ψ(x,y2,h))

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

f (Ψ(x,y1,h))

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

Prediction y(f)

Learn f

Page 5:

Loss-based Learning

User-defined loss function Δ(y, y(f))

f* = argminf Σi Δ(yi,yi(f))

Minimize loss between predicted and ground-truth output

No restriction on the loss function

General framework (object detection, segmentation, …)
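To make the loss functions concrete, here is a minimal Python sketch (not from the talk) of two choices used in the experiments later: the 0/1 loss on class labels and an overlap (intersection-over-union) loss on boxes. The corner-coordinate box format is an assumption.

```python
# Minimal sketch (not the authors' code) of two losses used later in the
# talk. The (x1, y1, x2, y2) box convention is an assumption.

def zero_one_loss(y_true, y_pred):
    """0/1 loss on the class label: 0 if correct, 1 otherwise."""
    return 0.0 if y_true == y_pred else 1.0

def overlap_loss(box_true, box_pred):
    """1 - intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_true[0], box_pred[0])
    iy1 = max(box_true[1], box_pred[1])
    ix2 = min(box_true[2], box_pred[2])
    iy2 = min(box_true[3], box_pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(box_true) + area(box_pred) - inter
    return 1.0 - inter / union if union > 0 else 1.0
```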

Page 6:

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Page 7:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 8:

Latent SVM

Scoring function

wTΨ(x,y,h)

Prediction

y(w), h(w) = argmaxy,h wTΨ(x,y,h)

Parameters w, Features Ψ(x,y,h)
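As a concrete illustration, a minimal sketch of the prediction rule above, assuming finite candidate sets of labels and boxes and a user-supplied joint feature map `psi` (both assumptions, not part of the slides):

```python
import numpy as np

def predict_latent_svm(w, x, labels, boxes, psi):
    """Latent SVM prediction: maximise w^T Psi(x, y, h) jointly over the
    label y and the latent box h by exhaustive search."""
    best = (-np.inf, None, None)
    for y in labels:
        for h in boxes:
            score = float(w @ psi(x, y, h))   # w^T Psi(x, y, h)
            if score > best[0]:
                best = (score, y, h)
    _, y_pred, h_pred = best
    return y_pred, h_pred
```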

Page 9:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi, yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Page 10:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi, yi(w)) = wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi(w), hi(w))

≤ wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) - maxhi wTΨ(xi, yi, hi)

≤ maxy,h { wTΨ(xi, y, h) + Δ(yi, y) } - maxhi wTΨ(xi, yi, hi)

Page 11:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h

Difference-of-convex program in w

Local minimum or saddle point solution (CCCP)

Self-Paced Learning, NIPS 2010
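The CCCP solution mentioned above alternates between linearising the concave part (imputing the latent variables at the ground-truth label) and solving the resulting convex structural SVM. A schematic sketch under those assumptions; `solve_structural_svm` is a hypothetical convex solver, not the authors' implementation:

```python
import numpy as np

def cccp_latent_svm(data, labels, boxes, psi, dim, C, n_iters=20):
    """Schematic CCCP for latent SVM learning.

    data: list of (x_i, y_i) pairs; labels/boxes: finite candidate sets;
    psi: joint feature map; dim: dimension of w. `solve_structural_svm`
    is an assumed helper that solves the convex (fully supervised) step.
    """
    w = np.zeros(dim)
    for _ in range(n_iters):
        # Concave part linearised: impute h_i = argmax_h w^T Psi(x_i, y_i, h)
        imputed = [max(boxes, key=lambda h: float(w @ psi(x, y, h)))
                   for (x, y) in data]
        # Convex step: structural SVM on the completed data (x_i, y_i, h_i)
        w = solve_structural_svm(data, imputed, labels, boxes, psi, C)
    return w
```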

Page 12:

Recap

Scoring function: wTΨ(x,y,h)

Prediction: y(w), h(w) = argmaxy,h wTΨ(x,y,h)

Learning: minw ||w||2 + C Σi ξi
such that wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h

Page 13:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 14:

Image Classification

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h)

Page 15:

Image Classification

0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

wTΨ(x,y2,h)

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y1,h)

Score wTΨ(x,y,h) ∈ (-∞, +∞)

Only the maximum score is used

No other useful cue? Uncertainty in h

Page 16:

Outline

• Latent SVM

• Max-Margin Min-Entropy (M3E) Models

• Dissimilarity Coefficient Learning

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

Page 17:

M3E

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy Hα(Pw(h|y,x)) – log Pw(y|x)

Z(x): Partition Function

Pw(y|x): Marginalized Probability

Hα: Rényi Entropy

The objective Hα(Pw(h|y,x)) - log Pw(y|x) is the Rényi entropy of the generalized distribution, Gα(y;x,w)

Page 18:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1: Shannon entropy of the generalized distribution

G1(y;x,w) = - [ Σh Pw(y,h|x) log Pw(y,h|x) ] / [ Σh Pw(y,h|x) ]

Page 19:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = ∞: minimum entropy of the generalized distribution

G∞(y;x,w) = - maxh log Pw(y,h|x)

Page 20:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = ∞: up to the constant log Z(x),

G∞(y;x,w) = - maxh wTΨ(x,y,h)

Same prediction as latent SVM
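A minimal numerical sketch of Gα covering the three regimes on the preceding slides (general α, α = 1, α = ∞); the input is the vector of log Pw(y,h|x) values for one candidate label y, assumed pre-computed:

```python
import numpy as np
from scipy.special import logsumexp

def renyi_generalized_entropy(log_p_yh, alpha):
    """G_alpha(y; x, w) = 1/(1-alpha) log( sum_h P(y,h|x)^alpha
                                           / sum_h P(y,h|x) ).

    log_p_yh: array of log P_w(y,h|x), one entry per latent value h."""
    if np.isinf(alpha):                      # minimum entropy: -max_h log P
        return -np.max(log_p_yh)
    if alpha == 1.0:                         # Shannon entropy of the
        p = np.exp(log_p_yh)                 # generalized distribution
        return -np.sum(p * log_p_yh) / np.sum(p)
    log_num = logsumexp(alpha * log_p_yh)    # log sum_h P(y,h|x)^alpha
    log_den = logsumexp(log_p_yh)            # log sum_h P(y,h|x)
    return (log_num - log_den) / (1.0 - alpha)
```

Prediction then picks y(w) = argminy Gα(y;x,w) over the candidate labels.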

Page 21:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi, yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Page 22:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi, yi(w)) = Gα(yi(w);xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Page 23:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, for all y

When α tends to infinity, M3E = Latent SVM

Other values can give better results
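For one training sample, the slack implied by the constraints above is the maximum over labels of the loss-augmented margin violation. A small sketch, assuming a finite label set and a function `G` that returns Gα(y;xi,w):

```python
def m3e_slack(G, y_true, labels, delta):
    """xi_i = max_y [ G(y_i) + Delta(y_i, y) - G(y) ], the hinge slack of
    one sample in the M3E learning problem. G maps a label to its
    generalized Renyi entropy G_alpha(y; x_i, w)."""
    return max(G(y_true) + delta(y_true, y) - G(y) for y in labels)
```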

Page 24:

Image Classification

Mammals Dataset: 271 images, 6 classes

90/10 train/test split, 5 folds

0/1 loss

Page 25:

Image Classification

HOG-Based Model. Dalal and Triggs, 2005

Page 26:

Motif Finding

UniProbe Dataset: ~40,000 sequences

Binding vs. Not-Binding

50/50 train/test split, 5 proteins, 5 folds

Page 27:

Motif Finding

Motif + Markov Background Model. Yu and Joachims, 2009

Page 28:

Recap

Scoring function: Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction: y(w) = argminy Gα(y;x,w)

Learning: minw ||w||2 + C Σi ξi
such that Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, for all y

Page 29:

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Kumar, Packer and Koller, ICML 2012

Page 30:

Object Detection

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 31:

Minimizing General Loss

minw Σi Δ(yi, hi, yi(w), hi(w))   (supervised samples)

    + Σi Δ’(yi, yi(w), hi(w))   (weakly supervised samples)

Unknown latent variable values for the weakly supervised samples

Page 32:

Minimizing General Loss

minw Σi Σhi Δ(yi, hi, yi(w), hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Page 33:

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 34:

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 35:

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

[Figure: distribution Pθ(hi|yi,xi) over latent variables hi]

Page 36:

Solution: Use two different distributions for the two different tasks

[Figure: Pθ(hi|yi,xi) over hi, and Pw(yi,hi|xi) with (yi,hi) and (yi(w),hi(w)) marked]

Page 37:

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: Pθ(hi|yi,xi) concentrated on hi(w), and Pw(yi,hi|xi) concentrated on (yi,hi(w))]

Page 38:

In Practice: Restrictions in the representation power of models

[Figure: Pθ(hi|yi,xi) spread over hi, and Pw(yi,hi|xi) with (yi,hi) and (yi(w),hi(w)) marked]

Page 39:

Our Framework: Minimize the dissimilarity between the two distributions

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

User-defined dissimilarity measure

Page 40:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 41:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Hi(w,θ) - β Σh,h’ Δ(yi,h,yi,h’) Pθ(h|yi,xi) Pθ(h’|yi,xi)

Page 42:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

The last term vanishes, since Δ(y,h,y,h) = 0.

Page 43:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
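To spell out the per-sample terms of this objective, a sketch with finite candidate sets; `delta` is the user-defined loss and `p_theta[j]` holds Pθ(h|yi,xi) for the j-th candidate (the names are illustrative, not from the talk):

```python
def disc_terms(delta, y_true, y_pred, h_pred, candidates, p_theta):
    """Per-sample terms of Rao's dissimilarity coefficient.

    H_i(w, theta)     = sum_h   Delta(y_i, h, y_i(w), h_i(w)) P_theta(h)
    H_i(theta, theta) = sum_h,h' Delta(y_i, h, y_i, h') P_theta(h) P_theta(h')
    """
    h_w_theta = sum(p_theta[j] * delta(y_true, h, y_pred, h_pred)
                    for j, h in enumerate(candidates))
    h_theta_theta = sum(p_theta[j] * p_theta[k] * delta(y_true, h, y_true, h2)
                        for j, h in enumerate(candidates)
                        for k, h2 in enumerate(candidates))
    # Per-sample objective: H_i(w, theta) - beta * H_i(theta, theta); the
    # (1 - beta) H_i(w, w) term is zero because Delta(y, h, y, h) = 0.
    return h_w_theta, h_theta_theta
```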

Page 44:

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:
    Fix w and optimize θ
    Fix θ and optimize w
End
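The loop above in schematic Python; `optimize_theta`, `optimize_w`, and `objective` stand in for the stochastic-subgradient and CCCP steps described on the following slides, and are assumed helpers rather than the authors' implementation:

```python
def disc_learning(data, w0, theta0, beta, max_iters=50, tol=1e-4):
    """Block-coordinate descent on sum_i H_i(w,theta) - beta H_i(theta,theta)."""
    w, theta = w0, theta0
    prev = float('inf')
    for _ in range(max_iters):
        theta = optimize_theta(data, w, theta, beta)  # fix w, optimize theta
        w = optimize_w(data, w, theta)                # fix theta, optimize w
        obj = objective(data, w, theta, beta)
        if prev - obj < tol:                          # converged
            break
        prev = obj
    return w, theta
```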

Page 45:

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

[Figure: Pθ(hi|yi,xi) over hi, with hi(w) marked]



Page 48:

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

[Figure: Pθ(hi|yi,xi) over hi]

Stochastic subgradient descent

Page 49:

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

If Δ is independent of h, this reduces to latent SVM

Concave-Convex Procedure (CCCP)

Page 50:

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Train: Input xi, Output yi

Page 51:

Results – 0/1 Loss

[Bar chart: average test loss (0/1 loss) for each of 5 folds, LSVM vs. our method; y-axis 0–0.9]

Statistically Significant

Page 52:

Results – Overlap Loss

[Bar chart: average test loss (overlap loss) for each of 5 folds, LSVM vs. our method; y-axis 0–0.6]

Page 53:

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train: Input xi, Output yi

Page 54:

Results – 0/1 Loss

[Bar chart: average test loss (0/1 loss) for each of 5 folds, LSVM vs. our method; y-axis 0–1.2]

Statistically Significant

Page 55:

Results – Overlap Loss

[Bar chart: average test loss (overlap loss) for each of 5 folds, LSVM vs. our method; y-axis 0.62–0.74]

Statistically Significant

Page 56:

Conclusions

• Latent SVM ignores latent variable uncertainty

• M3E handles losses that are independent of the latent variables

• DISC handles losses that depend on the latent variables

• Both are strict generalizations of Latent SVM

• Code available online

Page 57:

Questions?

http://cvc.centrale-ponts.fr/personnel/pawan