
Deep Learning meets Structured Prediction

Alexander G. Schwing

in collaboration with T. Hazan, M. Pollefeys and R. Urtasun
in collaboration with L.-C. Chen, A. L. Yuille and R. Urtasun

University of Toronto


Big Data and Statistical Machine Learning

Large scale problems according to the program:
• input dimensionality
• number of training samples
• number of categories


Scene understanding · Tutoring systems · Tag prediction

[Figure: three example applications; the tutoring-systems panel shows a model of skills S1–S4 at time steps t = 1 and t = 2]

• Scene understanding: x = image, s ∈ S: room layout
• Tutoring systems: x = responses, s ∈ S: student skills
• Tag prediction: x = image, s ∈ S: tags

Large scale problems:
• input dimensionality x is large
• number of training samples |D| = |{(x, s)}| is large
• number of categories |S| is large


Why is large scale a challenge?

Inference:

s∗ = arg max_{s∈S} F(s, x, w)

• Search over output space S
• Computation of F(s, x, w) from data x

Learning:

w∗ = arg max_w ∑_{(x,s)∈D} [ F(s, x, w) − ln ∑_{ŝ∈S} exp F(ŝ, x, w) ]

• Summation over output space S
• Summation over dataset D
• Computation of F(s, x, w) from data x


How do we deal with the challenges?

Inference: How to find the maximizer of F(s, x, w) given x and w?

Learning: How to find the parameters w of F(s, x, w) given D?

Inference

s∗ = arg max_{s∈S} F(s, x, w)

The domain size |S| is potentially large:
• ImageNet challenge: |S| = 1000
• Layout prediction: |S| = 50^4
• Tutoring systems: |S| = 2^(number of modeled skills)
• Tag prediction: |S| = 2^(number of tags)

Computation of F(s, x, w) for all possible s ∈ S is in general intractable.

But: interest in jointly predicting multiple variables s = (s1, . . . , sn)

Assumption: function/model decomposes additively

F(s, x, w) = F(s1, . . . , sn, x, w) = ∑_r fr(sr, x, w)

Restriction to a subset r of variables: sr = (si)_{i∈r}

Discrete domain: each factor is a table of values,

f{1,2}(s{1,2}) = f{1,2}(s1, s2) = [f{1,2}(1,1), f{1,2}(1,2), . . .]

[Figure: variables s1 and s2 connected by the pairwise factor s1,2]
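To make the additive decomposition concrete, here is a minimal sketch (an illustration, not from the slides): each fr is stored as a lookup table over its restriction sr, F(s) sums the tables, and a brute-force search over S makes the exponential cost visible. All table values are made up.

```python
import numpy as np
from itertools import product

# Additive model over two binary variables: F(s) = f1(s1) + f2(s2) + f12(s1, s2).
# The factor tables below are made-up numbers for illustration.
factors = {
    (0,):   np.array([0.2, 0.5]),                 # f1(s1)
    (1,):   np.array([0.4, 0.1]),                 # f2(s2)
    (0, 1): np.array([[0.0, 0.3], [0.3, 1.0]]),   # f12(s1, s2)
}

def F(s):
    """Score of a full assignment s = (s1, ..., sn): sum of factor table entries."""
    return sum(table[tuple(s[i] for i in r)] for r, table in factors.items())

# Brute force over S is only feasible for tiny n; its cost grows as 2^n.
best = max(product(range(2), repeat=2), key=F)
print(best, F(best))   # (1, 1) with score 1.6
```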

[Werner '07; Koller & Friedman '09]

Example

s∗ = arg max_s ∑_r fr(sr)

[Figure: variables s1 and s2 connected by the pairwise factor s1,2]

Integer linear program (ILP) equivalence, with variables br(sr):

max_{b1,b2,b12} [b1(0), b1(1), b2(0), b2(1), b12(0,0), b12(1,0), b12(0,1), b12(1,1)]⊤ · [f1(0), f1(1), f2(0), f2(1), f12(0,0), f12(1,0), f12(0,1), f12(1,1)]

s.t. br(sr) ∈ {0,1}
     ∑_{sr} br(sr) = 1
     ∑_{sp\sr} bp(sp) = br(sr)


[Werner '07; Koller & Friedman '09]

s∗ = arg max_s ∑_r fr(sr)

Integer linear program (compact form):

max_{br} ∑_{r,sr} br(sr) fr(sr)

s.t. br(sr) ∈ {0,1}                     (integrality)
     br(sr) ≥ 0, ∑_{sr} br(sr) = 1      (local probability br)
     ∑_{sp\sr} bp(sp) = br(sr)          (marginalization)

LP relaxation: drop the integrality constraint br(sr) ∈ {0,1}; the local probability and marginalization constraints remain.

Standard LP solvers are slow because of the many variables and constraints. Specifically tailored algorithms...

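For tiny models the LP relaxation above can be handed to an off-the-shelf solver. A minimal sketch, assuming SciPy's linprog and made-up factor values; for this single-edge model the relaxation is tight, so the solver returns an integral solution.

```python
import numpy as np
from scipy.optimize import linprog

# LP relaxation for two binary variables s1, s2 and a pairwise factor f12.
# Belief vector b = [b1(0), b1(1), b2(0), b2(1),
#                    b12(0,0), b12(1,0), b12(0,1), b12(1,1)]
f = np.array([0.2, 0.5,              # f1(0), f1(1)   (made-up values)
              0.4, 0.1,              # f2(0), f2(1)
              0.0, 0.3, 0.3, 1.0])   # f12(0,0), f12(1,0), f12(0,1), f12(1,1)

A_eq, b_eq = [], []
# Local probabilities: sum_{sr} br(sr) = 1 for r in {1, 2, 12}.
for idx in [(0, 1), (2, 3), (4, 5, 6, 7)]:
    row = np.zeros(8); row[list(idx)] = 1
    A_eq.append(row); b_eq.append(1)
# Marginalization: sum_{s2} b12(s1, s2) = b1(s1) and sum_{s1} b12(s1, s2) = b2(s2);
# b12 entries are ordered (s1, s2) = (0,0), (1,0), (0,1), (1,1).
for bel, pair in [(0, (4, 6)), (1, (5, 7)),    # b1(0), b1(1)
                  (2, (4, 5)), (3, (6, 7))]:   # b2(0), b2(1)
    row = np.zeros(8); row[list(pair)] = 1; row[bel] = -1
    A_eq.append(row); b_eq.append(0)

# linprog minimizes, so negate f to maximize; bounds enforce br(sr) >= 0.
res = linprog(-f, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print(res.x)   # integral optimum here: b1(1) = b2(1) = b12(1,1) = 1, i.e. s = (1, 1)
```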

[Weiss et al. '07, Globerson & Jaakkola '07, Johnson '08, Jojic et al. '10, Hazan & Shashua '10, Savchynskyy et al. '12, Ravikumar et al. '10, Martins et al. '11, Meshi & Globerson '11, Komodakis et al. '10 '12, Schwing et al. '11 '12 '14]

Graph structure defined via marginalization constraints

[Figure: variables s1 and s2 connected by the pairwise factor s1,2]

Message passing solvers
• Advantage: efficient due to analytically computable sub-problems
• Problem: special care required to find the global optimum

Subgradient methods
• Advantage: guaranteed globally convergent
• Problem: special care required to find fast algorithms

Block-coordinate ascent/message passing solvers
[Weiss et al. '07, Globerson & Jaakkola '07, Johnson '08, Jojic et al. '10, Hazan & Shashua '10, Savchynskyy et al. '12, Ravikumar et al. '10, Martins et al. '11, Meshi & Globerson '11]

Optimize w.r.t. a subset of variables
• Advantage: efficient due to analytically computable sub-problems
• Problem: getting stuck in corners
• Remedies: smoothing, proximal updates, augmented Lagrangian methods

Subgradient Methods
[Komodakis et al. '10 '12, Kappes et al. '12]

Update Lagrange multipliers via any subgradient direction
• Advantage: globally convergent
• Problem: slow and non-monotone convergence

What we like: the steepest subgradient ascent direction


Distributed Inference for Graphical Models

Goal:
• Optimize the LP relaxation objective
• Leverage the problem structure
• Distribute memory and computation requirements
• Maintain convergence and optimality guarantees

Dual decomposition extension of LP relaxation solvers via partitioning of variables.

Dual decomposition intuition:

[Figure: a 4×2 grid of variables s1, . . . , s8 with pairwise factors s1,2, s2,3, s3,4, s5,6, s6,7, s7,8, s1,5, s2,6, s3,7, s4,8, partitioned into two sub-graphs κ1 and κ2]

• Unique set of variables: b^κ_r(sr)
• Marginalization constraints and local beliefs ∀κ
• Consistency constraints: b^κ_r(sr) = br(sr)


Distributed LP Relaxation

[Figure: the grid of variables 1–8 partitioned into κ1 and κ2]

max_b ∑_κ ( ∑_{r∈κ, sr} b^κ_r(sr) fr(sr) )

s.t. ∀κ: local probabilities b^κ_r
     ∀κ: marginalization constraints
     ∀κ: consistency

Algorithm:
• Run some message passing iterations in parallel ∀κ
• Exchange information between the different κ

[Figure: the grid partitioned into κ1 and κ2, before and after the exchange of information]
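A toy illustration of this scheme (my sketch, not the authors' implementation): a three-variable chain is split into two subproblems that share s2, each subproblem is maximized independently (as if on two machines), and a subgradient step on the Lagrange multipliers pushes the two copies of s2 to agree.

```python
import numpy as np

# Chain F(s) = f12(s1, s2) + f23(s2, s3), binary states, made-up factor tables.
rng = np.random.default_rng(1)
f12, f23 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
lam = np.zeros(2)   # Lagrange multipliers on the shared variable s2

for it in range(100):
    # Each subproblem is solved independently (in parallel across partitions).
    g1 = f12 + lam[None, :]   # subproblem kappa_1: f12(s1, s2) + lambda(s2)
    g2 = f23 - lam[:, None]   # subproblem kappa_2: f23(s2, s3) - lambda(s2)
    s1, s2a = np.unravel_index(g1.argmax(), g1.shape)
    s2b, s3 = np.unravel_index(g2.argmax(), g2.shape)
    if s2a == s2b:            # copies of s2 agree -> consistent, hence optimal
        break
    # Subgradient step: lower lambda where kappa_1 chose, raise it where kappa_2 chose.
    eta = 1.0 / (1 + it)
    lam[s2a] -= eta
    lam[s2b] += eta

print((s1, s2a, s3), f12[s1, s2a] + f23[s2a, s3])
```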

State-of-the-art baselines:
• libDAI 0.2.7 [Mooij 2010]
• graphLAB [Low et al. 2010]

Method                        Runtime [s]   Efficiency [nodes/µs]   Primal
BP (libDAI)                   2617/4        0.04                    1.0241
BP (graphLAB RR)              1800          0.06                    1.0113
SplashBP (graphLAB)           689           0.06                    1.0121
Ours (General)                371           0.29                    1.0450
Ours (Dedicated)              113           0.94                    1.0450
Ours distributed (Dedicated)  18            5.8                     1.0449

Inter-machine communication

[Plots: dual energy (×10^6, ≈ 4.815 to 4.84) vs. iterations and vs. time [s], when exchanging information every 1, 5, 10, 20, 50, and 100 iterations]

Large-scale setting
• x is large (> 12 MPixel images)
• |S| is large (280^(12,000,000))
• the number of regions is large (> 12 million regions with 280 states; > 24 million regions with 80k states)

Sources publicly available at http://alexander-schwing.de

How do we deal with the challenges?

Inference: How to find the maximizer of F(s, x, w) given x and w?

Learning: How to find the parameters w of F(s, x, w) given D?

Learning

Learn good parameters from annotated examples

D = {(x, s)}

Log-linear models (CRFs, structured SVMs):

F(s, x, w) = w⊤F(s, x)

Non-linear models, e.g., CNNs (this talk):

F(s, x, w)

Inference:

s∗ = arg max_{s∈S} F(s, x, w)

Probability of a configuration s:

p(s | x, w) = (1 / Z(x, w)) exp F(s, x, w)
Z(x, w) = ∑_{ŝ∈S} exp F(ŝ, x, w)

Inference alternatively:

s∗ = arg max_{s∈S} p(s | x, w)
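For an enumerable output space, p(s | x, w) is a softmax over the scores F(s, x, w). A minimal sketch (made-up scores), using SciPy's logsumexp for a numerically stable ln Z(x, w):

```python
import numpy as np
from scipy.special import logsumexp

def log_prob(scores):
    """ln p(s | x, w) for all s, given scores F(s, x, w) over a small, enumerable S."""
    return scores - logsumexp(scores)   # subtract ln Z(x, w)

scores = np.array([1.0, 2.5, 0.3])      # made-up F(s, x, w) for |S| = 3
p = np.exp(log_prob(scores))
print(p, p.sum())                       # normalized distribution, sums to 1
print(int(np.argmax(p)))                # arg max_s p(s | x, w) = arg max_s F(s, x, w)
```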

Probability of a configuration s:

p(s | x, w) = (1 / Z(x, w)) exp F(s, x, w)
Z(x, w) = ∑_{ŝ∈S} exp F(ŝ, x, w)

Maximize the likelihood of the training data via

w∗ = arg max_w ∏_{(x,s)∈D} p(s | x, w)
   = arg max_w ∑_{(x,s)∈D} [ F(s, x, w) − ln ∑_{ŝ∈S} exp F(ŝ, x, w) ]

Maximum likelihood is equivalent to maximizing the cross-entropy:

Target distribution: p_{(x,s),tg}(ŝ) = δ(ŝ = s)

Cross-entropy:

max_w ∑_{(x,s)∈D, ŝ} p_{(x,s),tg}(ŝ) ln p(ŝ | x; w)
= max_w ∑_{(x,s)∈D} ln p(s | x; w)
= max_w ln ∏_{(x,s)∈D} p(s | x; w)

Program of interest:

max_w ∑_{(x,s)∈D, ŝ} p_{(x,s),tg}(ŝ) ln p(ŝ | x; w)

Optimize via gradient ascent:

∂/∂w ∑_{(x,s)∈D, ŝ} p_{(x,s),tg}(ŝ) ln p(ŝ | x; w) = ∑_{(x,s)∈D, ŝ} ( p_{(x,s),tg}(ŝ) − p(ŝ | x; w) ) ∂/∂w F(ŝ, x, w)

• Compute the predicted distribution p(ŝ | x; w)
• Use the chain rule to pass back the difference between prediction and observation
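A minimal sketch of one such gradient-ascent step for the log-linear special case, where ∂F/∂w is just a feature vector (here called phi; features, label, and step size are made up). The (target − prediction) weighting is exactly the formula above:

```python
import numpy as np
from scipy.special import logsumexp

# Log-linear sketch: F(s, x, w) = w . phi(x, s), with made-up features phi.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))   # one feature vector per s in S, |S| = 3
w = np.zeros(4)
s_obs = 1                       # observed label for this training pair

scores = phi @ w                               # F(s, x, w) for all s
p = np.exp(scores - logsumexp(scores))         # predicted distribution p(s | x; w)

# Gradient of ln p(s_obs | x; w): (target - predicted) weighting of dF/dw = phi.
target = np.zeros(3); target[s_obs] = 1.0
grad = (target - p) @ phi
w += 0.1 * grad                                # one gradient-ascent step
```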

Algorithm: Deep Learning

Repeat until stopping criteria:
1. Forward pass to compute F(s, x, w)
2. Compute p(s | x, w)
3. Backward pass via chain rule to obtain the gradient
4. Update parameters w

Why is large scale data a challenge?
• How do we even represent F(s, x, w) if S is large?
• How do we compute p(s | x, w)?


Domain size of typical applications:
• ImageNet challenge: |S| = 1000
• Layout prediction: |S| = 50^4
• Tutoring systems: |S| = 2^(number of modeled skills)
• Tag prediction: |S| = 2^(number of tags)

Solution: interest in jointly predicting multiple variables s = (s1, . . . , sn)

Assumption: function/model decomposes additively

F(s, x, w) = F(s1, . . . , sn, x, w) = ∑_r fr(sr, x, w)

Every fr(sr, x, w) is an arbitrary function, e.g., a CNN

How to compute the gradient:

∂/∂w ∑_{(x,s)∈D, ŝ} p_{(x,s),tg}(ŝ) ln p(ŝ | x; w)
= ∑_{(x,s)∈D, ŝ} ( p_{(x,s),tg}(ŝ) − p(ŝ | x; w) ) ∂/∂w F(ŝ, x, w)
= ∑_{(x,s)∈D, r, ŝr} ( p_{(x,s),r,tg}(ŝr) − pr(ŝr | x; w) ) ∂/∂w fr(ŝr, x, w)

How to obtain the marginals pr(ŝr | x, w)?

Approximation of the marginals via:
• Sampling methods
• Inference methods
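Before turning to approximations, a sketch of what these marginals are: for a tiny two-variable model (the same made-up factor tables as before) they can be computed exactly by enumeration. The variational methods below approximate exactly these quantities when enumeration is infeasible.

```python
import numpy as np
from itertools import product

# Exact marginals p_r(s_r | x, w) by enumeration for a tiny model:
# F(s) = f1(s1) + f2(s2) + f12(s1, s2), binary s1, s2 (made-up tables).
f1 = np.array([0.2, 0.5]); f2 = np.array([0.4, 0.1])
f12 = np.array([[0.0, 0.3], [0.3, 1.0]])

scores = {s: f1[s[0]] + f2[s[1]] + f12[s] for s in product(range(2), repeat=2)}
Z = sum(np.exp(v) for v in scores.values())
p = {s: np.exp(v) / Z for s, v in scores.items()}   # joint p(s | x, w)

p1 = np.array([sum(p[(a, b)] for b in range(2)) for a in range(2)])   # p_1(s1)
p12 = np.array([[p[(a, b)] for b in range(2)] for a in range(2)])     # p_12(s1, s2)
print(p1, p12, sep="\n")
```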

Inference approximations:

max_{b∈L} ∑_{r,sr} br(sr | x, w) fr(sr, x, w) + ∑_r cr H(br)

where the first sum is the inference (LP relaxation) objective and the local entropies H(br) are weighted by cr.

[Figure: variables s1 and s2 connected by the pairwise factor s1,2]

Typically employed variational algorithms:
• Convex belief propagation (distributed)
• Tree-reweighted message passing
• (Generalized) loopy belief propagation
• (Generalized) double-loop loopy belief propagation

Approximated Deep Structured Learning

Repeat until stopping criteria:
1. CNN forward pass to compute fr(sr, x, w) ∀r
2. Compute approximate beliefs br(sr | x, w)
3. Backward pass via chain rule to obtain the gradient g
4. Update parameters w

g = ∑_{(x,s)∈D, r, ŝr} ( p_{(x,s),r,tg}(ŝr) − br(ŝr | x, w) ) ∂/∂w fr(ŝr, x, w)
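A runnable skeleton of this loop under strong simplifications (my sketch, not the released code): tiny linear per-variable score functions stand in for CNNs, and since this toy model has unary factors only, the exact per-variable softmax stands in for the approximate beliefs.

```python
import numpy as np
from scipy.special import logsumexp

# Toy data: three (x, s) pairs with 3-dim inputs and two binary output variables.
rng = np.random.default_rng(0)
D = [(rng.normal(size=3), (s1, s2)) for s1, s2 in [(0, 1), (1, 1), (0, 0)]]
W = rng.normal(size=(2, 2, 3)) * 0.1   # W[i, si] maps x to the unary score f_i(si)

def factors(x, W):
    return W @ x                        # f_i(si, x, w), shape (2 variables, 2 states)

for _ in range(200):
    grad = np.zeros_like(W)
    for x, s in D:
        f = factors(x, W)
        # Beliefs: exact here, since unary-only factors make the variables independent.
        b = np.exp(f - logsumexp(f, axis=1, keepdims=True))
        target = np.zeros_like(b)
        target[0, s[0]] = target[1, s[1]] = 1.0
        # g = sum_{r, sr} (target_r(sr) - b_r(sr)) * d f_r(sr) / dW
        grad += (target - b)[:, :, None] * x[None, None, :]
    W += 0.1 * grad / len(D)            # gradient-ascent parameter update
```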

Dealing with a large number |D| of training examples:
• Parallelized across samples (any number of machines and GPUs)
• Usage of mini-batches

Dealing with large input dimension x:
• Usage of standard CNNs
• GPU and CPU implementation

Dealing with large output spaces S:
• Variational approximations
• Blending of learning and inference

ImageNet dataset:
• |S| = 1000
• 1.2 million training examples
• 50,000 validation examples

Model       Validation set error [%]
AlexNet     19.95
DeepNet16   10.29
DeepNet19   10.37

Different from reported results because of missing averaging, different image crops, etc.

Layout dataset:
• Given a single image x, predict a 3D parametric box that best describes the observed room layout
• |S| = 50^4
• Linear model
• 205 training examples
• 104 test examples

Pixel-wise prediction errors [%] on the layout dataset:

            OM      GC      OM + GC   Others
[Hoiem07]   -       28.9    -         -
[Hedau09]   -       21.2    -         -
[Wang10]    22.2    -       -         -
[Lee10]     24.7    22.7    18.6      -
[Pero12]    -       -       -         16.3
Ours        18.63   15.35   13.59     -

Flickr dataset:
• |S| = 2^38
• 10,000 training examples
• 10,000 test examples

Training method             Prediction error [%]
Unary only                  9.36
Piecewise                   7.70
Joint (with pre-training)   7.25

Visual results

[Images with ground-truth tags and predicted tags:
female/indoor/portrait → female/indoor/portrait
sky/plant life/tree → sky/plant life/tree
water/animals/sea → water/animals/sky
animals/dog/indoor → animals/dog
indoor/flower/plant life → ∅]

Learnt class correlations:

[Figure: learnt correlations between tag classes]

Distributed Inference
• Maintaining convergence guarantees
• Parallelized across computers

Deep Nonlinear Structured Prediction
• Nonlinearity, e.g., via CNNs in every factor
• Unifying structured prediction

Future directions:
• Applications
• Effects of approximations
• Latent variable models
• Can we include the optimization of the model hyper-parameters?
• Time series data