
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies

Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller

Stanford AI Lab

Motivation

Problem: With few instances, learned models aren't robust
Task: Shape modeling

[Figure: mean shape and principal components, shown at +1 and -1 standard deviations]

Transfer Learning

Shape is stabilized, but doesn't look like an elephant
Can we use rhinos to help elephants?

[Figure: mean shape and principal components at +1 and -1 standard deviations]

Hierarchical Bayes

[Figure: tree with root node and children Elephant and Rhino]
P(root), P(Elephant | root), P(Rhino | root)
P(Data_Elephant | Elephant), P(Data_Rhino | Rhino)
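To make the slide's factors explicit, the joint for this two-class hierarchy factorizes along the tree (a sketch in standard hierarchical-Bayes notation; the subscripts are assumed, not from the slides):

P(\Theta, D) = P(\Theta_{\text{root}}) \, P(\Theta_{\text{Elephant}} \mid \Theta_{\text{root}}) \, P(\Theta_{\text{Rhino}} \mid \Theta_{\text{root}}) \, P(D_{\text{Elephant}} \mid \Theta_{\text{Elephant}}) \, P(D_{\text{Rhino}} \mid \Theta_{\text{Rhino}})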

Goals

Transfer between related classes
Range of settings, tasks
Probabilistic motivation
Multilevel, complex hierarchies
Simple, efficient computation
Automatically learn what to transfer

Hierarchical Bayes

Compute the full posterior P(Θ | D)
P(Θ_c | Θ_root) must be conjugate to P(D | Θ_c)

Problem: Often can't perform full Bayesian computations
Approximation: Point estimation

Empirical Bayes: Point Estimation

Other approximations: posterior approximated as normal, sampling, etc.
Best parameters are good enough; don't need the full distribution

More Issues: Multiple Levels

Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
Exception: Thibaux and Jordan ('05)

More Issues: Restrictive Priors

Example: inverse-Wishart (Normal-Inverse-Wishart parameters for a Gaussian)
Pseudocount restriction: the prior pseudocounts must be >= d. If d is large and N is small, the signal from the prior overwhelms the data.
We show experiments with N = 3, d = 20
(N = # samples, d = dimension)
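To see why this is restrictive (a rough illustration using the slide's numbers, assuming the usual Normal-Inverse-Wishart update in which prior pseudocounts and sample counts add): with d = 20 the prior pseudocount must be at least 20, while only N = 3 instances are observed, so the prior carries roughly 20 / (20 + 3) ≈ 87% of the effective weight in the covariance estimate and the data contribute only about 13%.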

Alternative: Shrinkage (McCallum et al. '98)

1. Compute the maximum-likelihood estimate at each node
2. "Shrink" each node toward its parent: a linear combination of the node's own estimate and its parent's, with the mixing weight chosen by cross-validation

Pros: Simple to compute; handles multiple levels
Cons: Naive heuristic for transfer; averaging not always appropriate
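A minimal sketch of this shrinkage step, assuming Gaussian mean parameters and a single mixing weight lam chosen by cross-validation (the function name and signature are illustrative, not from McCallum et al.):

import numpy as np

def shrinkage_estimate(child_data, parent_data, lam):
    # Step 1: maximum-likelihood estimate at each node (here, sample means)
    theta_child = child_data.mean(axis=0)
    theta_parent = parent_data.mean(axis=0)
    # Step 2: "shrink" the child toward its parent via a linear combination;
    # lam in [0, 1] is the weight on the child's own estimate
    return lam * theta_child + (1.0 - lam) * theta_parent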

Undirected HB Reformulation

Defines an undirected Markov random field over the parameters Θ and the data D
Probabilistic Abstraction Hierarchies (Segal et al. '01)

Undirected Probabilistic Model

Fdata: Encourage parameters to explain the data
Divergence: Encourage parameters to be similar to their parents

[Figure: undirected model over root, Elephant, and Rhino; Fdata potentials attach the data at the leaves, Divergence potentials sit on the child-parent edges (high when parameters differ, low when they are similar)]

Purpose of Reformulation

Easy to specify: Fdata can be a likelihood, a classification objective, or another objective; Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc. No conjugacy or proper-prior restrictions.

Easy to optimize: The estimation problem is convex over Θ if Fdata is concave (e.g., a log-likelihood) and the Divergence is convex.
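Putting the two pieces together, the point estimate trades off data fit against parent-child divergence. A hedged sketch of the objective (the per-edge weights λ_c and the exact functional forms are whatever is chosen above; this is one plausible way to write it, not necessarily the paper's exact equation):

\hat{\Theta} = \arg\max_{\Theta} \; \sum_{c} F_{\text{data}}(D_c; \Theta_c) \;-\; \sum_{c} \lambda_c \, \text{Divergence}(\Theta_c, \Theta_{\text{pa}(c)})

Here a larger λ_c means a stronger pull of node c toward its parent.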

Task: Categorize Documents

Bag-of-words model
Fdata: Multinomial log-likelihood (regularized); µi represents the frequency of word i
Divergence: L2 norm

Application: Text Categorization

Newsgroup20 dataset

Baselines:
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. '98, with hierarchy)
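A minimal sketch of how Fdata and the Divergence combine for this task, assuming per-class word-count vectors and a single weight lam (the names, regularizer, and weighting are illustrative assumptions, not the paper's code):

import numpy as np

def text_objective(counts_by_class, theta_by_class, theta_parent, lam=1.0, eps=1e-6):
    """Sum of regularized multinomial log-likelihoods minus an L2 pull toward the parent."""
    total = 0.0
    for c, counts in counts_by_class.items():
        theta = theta_by_class[c]          # word-frequency vector for class c (sums to 1)
        # Fdata: multinomial log-likelihood of the observed word counts
        total += np.sum(counts * np.log(theta + eps))
        # Divergence: L2 norm between the class parameters and the parent parameters
        total -= lam * np.sum((theta - theta_parent) ** 2)
    return total

In the actual method this would be maximized jointly over all node parameters, including the parent, with each frequency vector constrained to the simplex.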

Can It Handle Multiple Levels?

[Plot: Newsgroup topic classification. Classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing Max Likelihood (no regularization), Regularized Max Likelihood, Shrinkage, and Undirected HB]

Application: Shape Modeling

Task: Learn shape (density estimation, evaluated by test likelihood)
Instances represented by 60 x-y coordinates of landmarks on the outline
Parameters: mean landmark location, covariance over landmarks, plus regularization
Divergence: L2 norm over mean and variance

Mammals dataset (Fink, '05)
[Figure: mean shape and principal components]
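A minimal sketch for the shape task, assuming a Gaussian over the 120-dimensional landmark vector (consistent with the mean-and-covariance parameters above) and L2 divergences with illustrative weights:

import numpy as np
from scipy.stats import multivariate_normal

def shape_objective(X, mean_c, cov_c, mean_p, cov_p, lam_mean=1.0, lam_cov=1.0):
    # X: (N, 120) array; each row holds 60 x-y landmark coordinates
    # Fdata: Gaussian log-likelihood of the class's instances
    fdata = multivariate_normal.logpdf(X, mean=mean_c, cov=cov_c).sum()
    # Divergence: L2 norm over mean and covariance, pulling the child toward its parent
    div = lam_mean * np.sum((mean_c - mean_p) ** 2) + lam_cov * np.sum((cov_c - cov_p) ** 2)
    return fdata - div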

Does Hierarchy Help?

[Plot: Delta log-loss per instance vs. total number of training instances (6, 10, 20, 30), relative to Regularized Max Likelihood, for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino]

Unregularized max likelihood and shrinkage: much worse, not shown

Degrees of Transfer

Not all parameters deserve equal sharing
Split the parameters into subcomponents, each with its own transfer weight
Allows different strengths for different subcomponents and child-parent pairs
How do we estimate all these parameters?

Learning Degrees of Transfer

Bootstrap approach: If a child parameter and its parent have a consistent relationship, encourage them to be similar

Hyper-prior approach (Bayesian idea): Put a prior on λ and add λ as a parameter to the optimization along with Θ
Concretely: an inverse-Gamma prior (λ is forced to be positive)

Prior on Degree of Transfer: If the likelihood is concave, the entire objective is convex!
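One plausible way to write the resulting joint objective (hedged; the exact parametrization of the degrees of transfer is an assumption based on the slides, with larger λ meaning a stronger pull toward the parent):

\max_{\Theta,\; \lambda > 0} \;\; \sum_c F_{\text{data}}(D_c; \Theta_c) \;-\; \sum_{c,i} \lambda_{c,i}\, \text{Div}\big(\theta_{c,i}, \theta_{\text{pa}(c),i}\big) \;+\; \sum_{c,i} \log \text{InvGamma}(\lambda_{c,i}; \alpha, \beta)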

Do Degrees of Transfer Help?

[Plot: Delta log-loss per instance vs. total number of training instances (6, 10, 20, 30), relative to Regularized Max Likelihood, for the Hyperprior approach across the ten mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino)]

Degrees of Transfer

[Plot: Distribution of degree-of-transfer (DOT) coefficients learned with the Hyperprior; x-axis is 1/λ from 0 to 50, ranging from stronger transfer (small 1/λ) to weaker transfer (large 1/λ)]

Summary

Transfer between related classes
Range of settings, tasks
Probabilistic motivation
Multilevel, complex hierarchies
Simple, efficient computation
Refined transfer of components

Future Work

Non-tree hierarchies (multiple inheritance); the general undirected model doesn't require a tree structure
Block degrees of transfer
Structure learning
Part discovery
Gene Ontology (GO) network, WordNet hierarchy