Loss-based Learning with Latent Variables
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France
Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman


Page 1:

Loss-based Learning with Latent Variables

M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman

Page 2:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 3:

Image Classification

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

f (Ψ(x,y1,h))

Page 4:

Image Classification

0.00 0.23 0.00
0.00 0.00 0.01
0.01 0.00 0.00

f (Ψ(x,y2,h))

0.00 0.00 0.00
0.00 0.75 0.00
0.00 0.00 0.00

f (Ψ(x,y1,h))

Feature Ψ(x,y,h) (e.g. HOG)

Score f : Ψ(x,y,h) → (-∞, +∞)

Prediction y(f)

Learn f

Page 5:

Loss-based Learning

User-defined loss function Δ(y, y(f))

f* = argminf Σi Δ(yi,yi(f))

Minimize loss between predicted and ground-truth output

No restriction on the loss function

General framework (object detection, segmentation, …)
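To make the loss functions concrete, here is a minimal Python sketch (not from the talk) of two choices used in the experiments later: the 0/1 loss on class labels and an overlap (intersection-over-union) loss on boxes. The corner-coordinate box format is an assumption.

```python
# Minimal sketch (not the authors' code) of two losses used later in the
# talk. The (x1, y1, x2, y2) box convention is an assumption.

def zero_one_loss(y_true, y_pred):
    """0/1 loss on the class label: 0 if correct, 1 otherwise."""
    return 0.0 if y_true == y_pred else 1.0

def overlap_loss(box_true, box_pred):
    """1 - intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_true[0], box_pred[0])
    iy1 = max(box_true[1], box_pred[1])
    ix2 = min(box_true[2], box_pred[2])
    iy2 = min(box_true[3], box_pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(box_true) + area(box_pred) - inter
    return 1.0 - inter / union if union > 0 else 1.0
```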

Page 6:

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

Page 7:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 8:

Latent SVM

Scoring function

wTΨ(x,y,h)

Prediction

y(w), h(w) = argmaxy,h wTΨ(x,y,h)

Parameters w, Features Ψ(x,y,h)
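As a concrete illustration, a minimal sketch of the prediction rule above, assuming finite candidate sets of labels and boxes and a user-supplied joint feature map `psi` (both assumptions, not part of the slides):

```python
import numpy as np

def predict_latent_svm(w, x, labels, boxes, psi):
    """Latent SVM prediction: maximise w^T Psi(x, y, h) jointly over the
    label y and the latent box h by exhaustive search."""
    best = (-np.inf, None, None)
    for y in labels:
        for h in boxes:
            score = float(w @ psi(x, y, h))   # w^T Psi(x, y, h)
            if score > best[0]:
                best = (score, y, h)
    _, y_pred, h_pred = best
    return y_pred, h_pred
```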

Page 9:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi, yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Page 10:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi, yi(w)) = wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) - wTΨ(xi, yi(w), hi(w))

≤ wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) - maxhi wTΨ(xi, yi, hi)

≤ maxy,h { wTΨ(xi, y, h) + Δ(yi, y) } - maxhi wTΨ(xi, yi, hi)

Page 11:

Learning Latent SVM

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h

Difference-of-convex program in w

Local minimum or saddle point solution (CCCP)

Self-Paced Learning, NIPS 2010
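The CCCP solution mentioned above alternates between linearising the concave part (imputing the latent variables at the ground-truth label) and solving the resulting convex structural SVM. A schematic sketch under those assumptions; `solve_structural_svm` is a hypothetical convex solver, not the authors' implementation:

```python
import numpy as np

def cccp_latent_svm(data, labels, boxes, psi, dim, C, n_iters=20):
    """Schematic CCCP for latent SVM learning.

    data: list of (x_i, y_i) pairs; labels/boxes: finite candidate sets;
    psi: joint feature map; dim: dimension of w. `solve_structural_svm`
    is an assumed helper that solves the convex (fully supervised) step.
    """
    w = np.zeros(dim)
    for _ in range(n_iters):
        # Concave part linearised: impute h_i = argmax_h w^T Psi(x_i, y_i, h)
        imputed = [max(boxes, key=lambda h: float(w @ psi(x, y, h)))
                   for (x, y) in data]
        # Convex step: structural SVM on the completed data (x_i, y_i, h_i)
        w = solve_structural_svm(data, imputed, labels, boxes, psi, C)
    return w
```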

Page 12:

Recap

Scoring function: wTΨ(x,y,h)

Prediction: y(w), h(w) = argmaxy,h wTΨ(x,y,h)

Learning: minw ||w||2 + C Σi ξi
such that wTΨ(xi,y,h) + Δ(yi,y) - maxhi wTΨ(xi,yi,hi) ≤ ξi, for all y, h

Page 13:

Image Classification

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 14:

Image Classification

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

Score wTΨ(x,y,h) ∈ (-∞, +∞)

wTΨ(x,y1,h)

Page 15:

Image Classification

0.00 0.24 0.00
0.00 0.00 0.00
0.01 0.00 0.00

wTΨ(x,y2,h)

0.00 0.00 0.25
0.00 0.25 0.00
0.25 0.00 0.00

wTΨ(x,y1,h)

Score wTΨ(x,y,h) ∈ (-∞, +∞)

Only the maximum score is used

No other useful cue? Uncertainty in h

Page 16:

Outline

• Latent SVM

• Max-Margin Min-Entropy (M3E) Models

• Dissimilarity Coefficient Learning

Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

Page 17:

M3E

Scoring function

Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction

y(w) = argminy Hα(Pw(h|y,x)) – log Pw(y|x)

Z(x): Partition Function

Pw(y|x): Marginalized Probability

Hα: Rényi Entropy

The objective Hα(Pw(h|y,x)) - log Pw(y|x) is the Rényi entropy of the generalized distribution, Gα(y;x,w)

Page 18:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = 1: Shannon entropy of the generalized distribution

G1(y;x,w) = - [ Σh Pw(y,h|x) log Pw(y,h|x) ] / [ Σh Pw(y,h|x) ]

Page 19:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = ∞: minimum entropy of the generalized distribution

G∞(y;x,w) = - maxh log Pw(y,h|x)

Page 20:

Rényi Entropy

Gα(y;x,w) = 1/(1-α) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]

α = ∞: up to the constant log Z(x),

G∞(y;x,w) = - maxh wTΨ(x,y,h)

Same prediction as latent SVM
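A minimal numerical sketch of Gα covering the three regimes on the preceding slides (general α, α = 1, α = ∞); the input is the vector of log Pw(y,h|x) values for one candidate label y, assumed pre-computed:

```python
import numpy as np
from scipy.special import logsumexp

def renyi_generalized_entropy(log_p_yh, alpha):
    """G_alpha(y; x, w) = 1/(1-alpha) log( sum_h P(y,h|x)^alpha
                                           / sum_h P(y,h|x) ).

    log_p_yh: array of log P_w(y,h|x), one entry per latent value h."""
    if np.isinf(alpha):                      # minimum entropy: -max_h log P
        return -np.max(log_p_yh)
    if alpha == 1.0:                         # Shannon entropy of the
        p = np.exp(log_p_yh)                 # generalized distribution
        return -np.sum(p * log_p_yh) / np.sum(p)
    log_num = logsumexp(alpha * log_p_yh)    # log sum_h P(y,h|x)^alpha
    log_den = logsumexp(log_p_yh)            # log sum_h P(y,h|x)
    return (log_num - log_den) / (1.0 - alpha)
```

Prediction then picks y(w) = argminy Gα(y;x,w) over the candidate labels.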

Page 21:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

w* = argminw Σi Δ(yi, yi(w))

Highly non-convex in w

Cannot regularize w to prevent overfitting

Page 22:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

Δ(yi, yi(w)) = Gα(yi(w);xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + Δ(yi, yi(w)) - Gα(yi(w);xi,w)

≤ Gα(yi;xi,w) + maxy { Δ(yi,y) - Gα(y;xi,w) }

Page 23:

Learning M3E

Training data {(xi,yi), i = 1,2,…,n}

minw ||w||2 + C Σiξi

Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, for all y

When α tends to infinity, M3E = Latent SVM

Other values can give better results
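For one training sample, the slack implied by the constraints above is the maximum over labels of the loss-augmented margin violation. A small sketch, assuming a finite label set and a function `G` that returns Gα(y;xi,w):

```python
def m3e_slack(G, y_true, labels, delta):
    """xi_i = max_y [ G(y_i) + Delta(y_i, y) - G(y) ], the hinge slack of
    one sample in the M3E learning problem. G maps a label to its
    generalized Renyi entropy G_alpha(y; x_i, w)."""
    return max(G(y_true) + delta(y_true, y) - G(y) for y in labels)
```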

Page 24:

Image Classification

Mammals Dataset: 271 images, 6 classes

90/10 train/test split, 5 folds

0/1 loss

Page 25:

Image Classification

HOG-Based Model. Dalal and Triggs, 2005

Page 26:

Motif Finding

UniProbe Dataset: ~40,000 sequences

Binding vs. Not-Binding

50/50 train/test split, 5 proteins, 5 folds

Page 27:

Motif Finding

Motif + Markov Background Model. Yu and Joachims, 2009

Page 28:

Recap

Scoring function: Pw(y,h|x) = exp(wTΨ(x,y,h))/Z(x)

Prediction: y(w) = argminy Gα(y;x,w)

Learning: minw ||w||2 + C Σi ξi
such that Gα(yi;xi,w) + Δ(yi,y) - Gα(y;xi,w) ≤ ξi, for all y

Page 29:

Outline

• Latent SVM

• Max-Margin Min-Entropy Models

• Dissimilarity Coefficient Learning

Kumar, Packer and Koller, ICML 2012

Page 30:

Object Detection

Images xi Labels yi

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Boxes hi

y = “Deer”

Image x

Page 31:

Minimizing General Loss

minw Σi Δ(yi, hi, yi(w), hi(w))   (supervised samples)

    + Σi Δ’(yi, yi(w), hi(w))   (weakly supervised samples)

Unknown latent variable values for the weakly supervised samples

Page 32:

Minimizing General Loss

minw Σi Σhi Δ(yi, hi, yi(w), hi(w)) Pw(hi|xi,yi)

A single distribution to achieve two objectives

Page 33:

Problem

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Page 34:

Solution

Model Uncertainty in Latent Variables

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

Page 35:

Solution

Model Accuracy of Latent Variable Predictions

Use two different distributions for the two different tasks

[Figure: distribution Pθ(hi|yi,xi) over latent variables hi]

Page 36:

Solution: Use two different distributions for the two different tasks

[Figure: Pθ(hi|yi,xi) over hi, and Pw(yi,hi|xi) with (yi,hi) and (yi(w),hi(w)) marked]

Page 37:

The Ideal Case: No latent variable uncertainty, correct prediction

[Figure: Pθ(hi|yi,xi) concentrated on hi(w), and Pw(yi,hi|xi) concentrated on (yi,hi(w))]

Page 38:

In Practice: Restrictions in the representation power of models

[Figure: Pθ(hi|yi,xi) spread over hi, and Pw(yi,hi|xi) with (yi,hi) and (yi(w),hi(w)) marked]

Page 39:

Our Framework: Minimize the dissimilarity between the two distributions

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

User-defined dissimilarity measure

Page 40:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Page 41:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Hi(w,θ) - β Σh,h’ Δ(yi,h,yi,h’) Pθ(h|yi,xi) Pθ(h’|yi,xi)

Page 42:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

Hi(w,θ) - β Hi(θ,θ) - (1-β) Δ(yi(w),hi(w),yi(w),hi(w))

The last term vanishes, since Δ(y,h,y,h) = 0.

Page 43:

Our Framework: Minimize Rao's Dissimilarity Coefficient

[Figure: Pθ(hi|yi,xi) and Pw(yi,hi|xi) over hi]

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)
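To spell out the per-sample terms of this objective, a sketch with finite candidate sets; `delta` is the user-defined loss and `p_theta[j]` holds Pθ(h|yi,xi) for the j-th candidate (the names are illustrative, not from the talk):

```python
def disc_terms(delta, y_true, y_pred, h_pred, candidates, p_theta):
    """Per-sample terms of Rao's dissimilarity coefficient.

    H_i(w, theta)     = sum_h   Delta(y_i, h, y_i(w), h_i(w)) P_theta(h)
    H_i(theta, theta) = sum_h,h' Delta(y_i, h, y_i, h') P_theta(h) P_theta(h')
    """
    h_w_theta = sum(p_theta[j] * delta(y_true, h, y_pred, h_pred)
                    for j, h in enumerate(candidates))
    h_theta_theta = sum(p_theta[j] * p_theta[k] * delta(y_true, h, y_true, h2)
                        for j, h in enumerate(candidates)
                        for k, h2 in enumerate(candidates))
    # Per-sample objective: H_i(w, theta) - beta * H_i(theta, theta); the
    # (1 - beta) H_i(w, w) term is zero because Delta(y, h, y, h) = 0.
    return h_w_theta, h_theta_theta
```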

Page 44:

Optimization

minw,θ Σi Hi(w,θ) - β Hi(θ,θ)

Initialize the parameters to w0 and θ0

Repeat until convergence:
    Fix w and optimize θ
    Fix θ and optimize w
End
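The loop above in schematic Python; `optimize_theta`, `optimize_w`, and `objective` stand in for the stochastic-subgradient and CCCP steps described on the following slides, and are assumed helpers rather than the authors' implementation:

```python
def disc_learning(data, w0, theta0, beta, max_iters=50, tol=1e-4):
    """Block-coordinate descent on sum_i H_i(w,theta) - beta H_i(theta,theta)."""
    w, theta = w0, theta0
    prev = float('inf')
    for _ in range(max_iters):
        theta = optimize_theta(data, w, theta, beta)  # fix w, optimize theta
        w = optimize_w(data, w, theta)                # fix theta, optimize w
        obj = objective(data, w, theta, beta)
        if prev - obj < tol:                          # converged
            break
        prev = obj
    return w, theta
```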

Page 45:

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case I: yi(w) = yi

[Figure: Pθ(hi|yi,xi) over hi, with hi(w) marked]



Page 48:

Optimization of θ

minθ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) - β Hi(θ,θ)

Case II: yi(w) ≠ yi

[Figure: Pθ(hi|yi,xi) over hi]

Stochastic subgradient descent

Page 49:

Optimization of w

minw Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi)

Expected loss, models uncertainty

Form of optimization similar to Latent SVM

If Δ is independent of h, this reduces to latent SVM

Concave-Convex Procedure (CCCP)

Page 50:

Object Detection

Bison

Deer

Elephant

Giraffe

Llama

Rhino

Input x

Output y = “Deer”

Latent Variable h

Mammals Dataset

60/40 Train/Test Split

5 Folds

Train: Input xi, Output yi

Page 51:

Results – 0/1 Loss

[Bar chart: average test loss (0/1 loss) for each of 5 folds, LSVM vs. our method; y-axis 0–0.9]

Statistically Significant

Page 52:

Results – Overlap Loss

[Bar chart: average test loss (overlap loss) for each of 5 folds, LSVM vs. our method; y-axis 0–0.6]

Page 53:

Action Detection

Input x

Output y = “Using Computer”

Latent Variable h

PASCAL VOC 2011

60/40 Train/Test Split

5 Folds

Jumping

Phoning

Playing Instrument

Reading

Riding Bike

Riding Horse

Running

Taking Photo

Using Computer

Walking

Train: Input xi, Output yi

Page 54:

Results – 0/1 Loss

[Bar chart: average test loss (0/1 loss) for each of 5 folds, LSVM vs. our method; y-axis 0–1.2]

Statistically Significant

Page 55:

Results – Overlap Loss

[Bar chart: average test loss (overlap loss) for each of 5 folds, LSVM vs. our method; y-axis 0.62–0.74]

Statistically Significant

Page 56:

Conclusions

• Latent SVM ignores latent variable uncertainty

• M3E handles losses that are independent of the latent variables

• DISC handles losses that depend on the latent variables

• Both are strict generalizations of Latent SVM

• Code available online

Page 57:

Questions?

http://cvc.centrale-ponts.fr/personnel/pawan