Crash Course on Machine Learning Part VI
Several slides from Derek Hoiem, Ben Taskar, Christopher Bishop, Lise Getoor

TRANSCRIPT

Page 1: Crash Course on Machine Learning Part VI

Crash Course on Machine Learning Part VI

Several slides from Derek Hoiem, Ben Taskar, Christopher Bishop, Lise Getoor

Page 2: Crash Course on Machine  Learning Part  VI

Graphical Models

• Representations
  – Bayes Nets
  – Markov Random Fields
  – Factor Graphs
• Inference
  – Exact
    • Variable Elimination
    • BP for trees
  – Approximate inference
    • Sampling
    • Loopy BP
    • Optimization
• Learning (today)

Page 3: Crash Course on Machine  Learning Part  VI

Summary of inference methods (rows: type of distribution; columns: graph structure)

Discrete
  Chain (online): BP = forwards; Boyen-Koller (ADF), beam search
  Low treewidth:  VarElim, Jtree, recursive conditioning
  High treewidth: Loopy BP, mean field, structured variational, EP, graphcuts, Gibbs

Gaussian
  Chain (online): BP = Kalman filter
  Low treewidth:  Jtree = sparse linear algebra
  High treewidth: Loopy BP, Gibbs

Other
  Chain (online): EKF, UKF, moment matching (ADF), particle filter
  Low treewidth:  EP, EM, VB, NBP, Gibbs
  High treewidth: EP, variational EM, VB, NBP, Gibbs

The methods range from exact, to deterministic approximations, to stochastic approximations.

Abbreviations: BP = belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree = junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP.

Slide credit: Kevin Murphy

Page 4: Crash Course on Machine  Learning Part  VI

Variable Elimination

[Figure: worked example on a network with nodes B, E, A, J, M (the classic burglary/earthquake alarm network)]
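A minimal sketch of variable elimination on this network, assuming binary variables; all CPT values below are illustrative placeholders, not numbers from the slides.

```python
# Variable elimination on the B/E/A/J/M alarm network (binary variables, placeholder CPTs).
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors; a factor is (list_of_vars, table)."""
    v1, t1 = f1
    v2, t2 = f2
    vs = list(dict.fromkeys(v1 + v2))                      # union of variables, order-preserving
    table = {}
    for assign in product([0, 1], repeat=len(vs)):
        a = dict(zip(vs, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return vs, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    vs, t = f
    i = vs.index(var)
    new_vs = vs[:i] + vs[i + 1:]
    new_t = {}
    for assign, p in t.items():
        key = assign[:i] + assign[i + 1:]
        new_t[key] = new_t.get(key, 0.0) + p
    return new_vs, new_t

def eliminate(factors, order):
    """Variable elimination: repeatedly multiply the factors mentioning a variable
    and sum that variable out, putting the reduced factor back in the pool."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

def p_alarm(b, e):
    # Placeholder values for P(A=1 | B=b, E=e).
    return [[0.001, 0.29], [0.94, 0.95]][b][e]

fB = (['B'], {(1,): 0.01, (0,): 0.99})
fE = (['E'], {(1,): 0.02, (0,): 0.98})
fA = (['A', 'B', 'E'], {(a, b, e): p_alarm(b, e) if a else 1 - p_alarm(b, e)
                        for a, b, e in product([0, 1], repeat=3)})
fJ = (['J', 'A'], {(j, a): (0.9 if a else 0.05) if j else (0.1 if a else 0.95)
                   for j, a in product([0, 1], repeat=2)})
fM = (['M', 'A'], {(m, a): (0.7 if a else 0.01) if m else (0.3 if a else 0.99)
                   for m, a in product([0, 1], repeat=2)})

# Marginal P(A): eliminate everything except A.
vs, table = eliminate([fB, fE, fA, fJ, fM], order=['J', 'M', 'E', 'B'])
print({k: round(v, 4) for k, v in table.items()})
```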

Page 5: Crash Course on Machine  Learning Part  VI

The Sum-Product Algorithm

Objective:
i.  to obtain an efficient, exact inference algorithm for finding marginals;
ii. in situations where several marginals are required, to allow computations to be shared efficiently.

Key idea: the distributive law.
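A one-line illustration of why the distributive law saves work:

\[ ab + ac \;=\; a(b + c) \]

Two multiplications and one addition become one of each. Applied inside nested sums,

\[ \sum_{x_1} \sum_{x_2} f_1(x_1)\, f_2(x_1, x_2) \;=\; \sum_{x_1} f_1(x_1) \sum_{x_2} f_2(x_1, x_2), \]

which is exactly the rearrangement the sum-product algorithm performs at every node.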

Page 6: Crash Course on Machine  Learning Part  VI

The Sum-Product Algorithm

Page 7: Crash Course on Machine  Learning Part  VI

The Sum-Product Algorithm

Page 8: Crash Course on Machine  Learning Part  VI

The Sum-Product Algorithm

• Initialization
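For reference, in Bishop's factor-graph notation (the treatment these slides appear to follow), the two message types and their initialization are:

\[ \mu_{x \to f}(x) \;=\; \prod_{f' \in \mathrm{ne}(x) \setminus \{f\}} \mu_{f' \to x}(x) \]

\[ \mu_{f \to x}(x) \;=\; \sum_{x_1, \ldots, x_M} f(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f) \setminus \{x\}} \mu_{x_m \to f}(x_m) \]

Initialization: a leaf variable node sends \( \mu_{x \to f}(x) = 1 \); a leaf factor node sends \( \mu_{f \to x}(x) = f(x) \).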

Page 9: Crash Course on Machine  Learning Part  VI

The Sum-Product Algorithm

To compute local marginals:
• Pick an arbitrary node as root.
• Compute and propagate messages from the leaf nodes to the root, storing received messages at every node.
• Compute and propagate messages from the root to the leaf nodes, storing received messages at every node.
• Compute the product of received messages at each node for which the marginal is required, and normalize if necessary.
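A minimal sketch of this two-pass schedule on a three-variable chain x1 — x2 — x3 with pairwise factors; the factor values are made up and numpy is assumed.

```python
# Two-pass sum-product on the chain x1 -- f12 -- x2 -- f23 -- x3 (binary variables).
import numpy as np

# Pairwise factors (placeholder values): f12[x1, x2], f23[x2, x3].
f12 = np.array([[1.0, 0.5], [0.5, 2.0]])
f23 = np.array([[2.0, 1.0], [1.0, 1.5]])

# Forward pass (leaves -> root, taking x3 as root).
m_x1_f12 = np.ones(2)                      # leaf variable node sends the unit message
m_f12_x2 = f12.T @ m_x1_f12                # sum over x1 of f12(x1, x2) * msg(x1)
m_x2_f23 = m_f12_x2                        # x2 has no other neighbours to multiply in
m_f23_x3 = f23.T @ m_x2_f23

# Backward pass (root -> leaves).
m_x3_f23 = np.ones(2)
m_f23_x2 = f23 @ m_x3_f23
m_x2_f12 = m_f23_x2
m_f12_x1 = f12 @ m_x2_f12

# Local marginal = product of incoming messages at each node, then normalize.
def normalize(p):
    return p / p.sum()

p_x1 = normalize(m_f12_x1)
p_x2 = normalize(m_f12_x2 * m_f23_x2)
p_x3 = normalize(m_f23_x3)
print(p_x1, p_x2, p_x3)
```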

Page 10: Crash Course on Machine  Learning Part  VI

Loopy Belief Propagation

• Sum-product applied to general graphs.
• Initial unit messages are passed across all links, after which messages are passed around until convergence (not guaranteed!).
• Approximate but tractable for large graphs.
• Sometimes works well, sometimes not at all.

Page 11: Crash Course on Machine  Learning Part  VI

Learning Bayesian networks

[Figure: an Inducer takes data + prior information and outputs a Bayes net over E, R, B, A, C together with its CPTs; the table shown is P(A | E, B), with one row per (E, B) configuration and entries such as .9/.1, .7/.3, .8/.2, .99/.01]

Page 12: Crash Course on Machine  Learning Part  VI

Known Structure -- Complete Data

[Figure: complete data over E, B, A (e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>) together with the network E → A ← B whose CPT entries are unspecified ("?") is fed to the Inducer, which outputs the same structure with the estimated CPT P(A | E, B)]

• Network structure is specified
  – Inducer needs to estimate parameters
• Data does not contain missing values

Page 13: Crash Course on Machine  Learning Part  VI

Unknown Structure -- Complete Data

[Figure: the same complete data over E, B, A, this time with no structure given; the Inducer outputs both the arcs and the estimated CPT P(A | E, B)]

• Network structure is not specified
  – Inducer needs to select arcs & estimate parameters
• Data does not contain missing values

Page 14: Crash Course on Machine  Learning Part  VI

Known Structure -- Incomplete Data

[Figure: data over E, B, A with missing entries (e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>) together with the fixed structure E → A ← B is fed to the Inducer, which outputs the estimated CPT P(A | E, B)]

• Network structure is specified
• Data contains missing values
  – We consider assignments to missing values

Page 15: Crash Course on Machine  Learning Part  VI

Known Structure / Complete Data

• Given a network structure G
  – and a choice of parametric family for P(Xi | Pai)
• Learn parameters for the network

Goal:
• Construct a network that is "closest" to the distribution that generated the data

Page 16: Crash Course on Machine  Learning Part  VI

Learning Parameters for a Bayesian Network

[Network: E and B are parents of A; A is the parent of C]

• Training data has the form:

\[ D \;=\; \bigl\{\, \langle E[1], B[1], A[1], C[1] \rangle,\; \ldots,\; \langle E[M], B[M], A[M], C[M] \rangle \,\bigr\} \]

Page 17: Crash Course on Machine  Learning Part  VI

Learning Parameters for a Bayesian Network

• Since we assume i.i.d. samples, the likelihood function is

\[ L(\Theta : D) \;=\; \prod_m P\bigl(E[m], B[m], A[m], C[m] : \Theta\bigr) \]

Page 18: Crash Course on Machine  Learning Part  VI

Learning Parameters for a Bayesian Network

• By the definition of the network, we get

\[ L(\Theta : D) \;=\; \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P\bigl(A[m] \mid B[m], E[m] : \Theta\bigr)\, P\bigl(C[m] \mid A[m] : \Theta\bigr) \]

Page 19: Crash Course on Machine  Learning Part  VI

Learning Parameters for a Bayesian Network

• Rewriting terms, we get

\[ L(\Theta : D) \;=\; \prod_m P\bigl(E[m], B[m], A[m], C[m] : \Theta\bigr) \]

\[ = \Bigl[\prod_m P(E[m] : \Theta)\Bigr] \Bigl[\prod_m P(B[m] : \Theta)\Bigr] \Bigl[\prod_m P\bigl(A[m] \mid B[m], E[m] : \Theta\bigr)\Bigr] \Bigl[\prod_m P\bigl(C[m] \mid A[m] : \Theta\bigr)\Bigr] \]

Page 20: Crash Course on Machine  Learning Part  VI

General Bayesian Networks

Generalizing for any Bayesian network:

\[ L(\Theta : D) \;=\; \prod_m P\bigl(x_1[m], \ldots, x_n[m] : \Theta\bigr) \qquad \text{(i.i.d. samples)} \]

\[ = \prod_m \prod_i P\bigl(x_i[m] \mid \mathrm{Pa}_i[m] : \Theta_i\bigr) \qquad \text{(network factorization)} \]

\[ = \prod_i \prod_m P\bigl(x_i[m] \mid \mathrm{Pa}_i[m] : \Theta_i\bigr) \;=\; \prod_i L(\Theta_i : D) \]

• The likelihood decomposes according to the structure of the network.

Page 21: Crash Course on Machine  Learning Part  VI

General Bayesian Networks (Cont.)

Decomposition ⇒ independent estimation problems

If the parameters for each family are not related, then they can be estimated independently of each other, as sketched below.
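A minimal sketch of this independent, per-family estimation for the E, B, A, C example used on the previous slides, assuming complete binary data; the records below are made up.

```python
# Independent per-family parameter estimation from complete data (E, B, A, C example).
from collections import Counter
from itertools import product

# Each record assigns 0/1 to every variable; the values below are illustrative.
data = [
    {'E': 0, 'B': 0, 'A': 0, 'C': 0},
    {'E': 0, 'B': 1, 'A': 1, 'C': 1},
    {'E': 1, 'B': 0, 'A': 1, 'C': 0},
    {'E': 0, 'B': 0, 'A': 0, 'C': 1},
    {'E': 1, 'B': 1, 'A': 1, 'C': 1},
]

structure = {'E': [], 'B': [], 'A': ['E', 'B'], 'C': ['A']}    # child -> parents

def estimate_cpt(child, parents, data):
    """MLE of P(child | parents): one multinomial per parent configuration, by counting."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    pa_counts = Counter(tuple(r[p] for p in parents) for r in data)
    cpt = {}
    for pa in product([0, 1], repeat=len(parents)):
        for x in (0, 1):
            cpt[(x,) + pa] = joint[(pa, x)] / pa_counts[pa] if pa_counts[pa] else None
    return cpt

# Each CPT is estimated independently of the others.
cpts = {x: estimate_cpt(x, pa, data) for x, pa in structure.items()}
print(cpts['A'])        # estimates of P(A | E, B), keyed by (A, E, B)
```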

Page 22: Crash Course on Machine  Learning Part  VI

From Binomial to Multinomial

• For example, suppose X can take the values 1, 2, …, K
• We want to learn the parameters θ1, θ2, …, θK

Sufficient statistics:
• N1, N2, …, NK - the number of times each outcome is observed

Likelihood function:

\[ L(\Theta : D) \;=\; \prod_{k=1}^{K} \theta_k^{N_k} \]

MLE:

\[ \hat{\theta}_k \;=\; \frac{N_k}{N}, \qquad N = \sum_k N_k \]
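As a made-up numerical example: with K = 3 and observed counts \((N_1, N_2, N_3) = (3, 5, 2)\), we have \(N = 10\) and the MLE is \(\hat{\theta} = (0.3, 0.5, 0.2)\).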

Page 23: Crash Course on Machine  Learning Part  VI

Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

\[ L(\Theta_{X_i \mid \mathrm{Pa}_i} : D) \;=\; \prod_m P\bigl(x_i[m] \mid \mathrm{pa}_i[m] : \Theta_{X_i \mid \mathrm{Pa}_i}\bigr) \;=\; \prod_{\mathrm{pa}_i} \;\prod_{m :\, \mathrm{Pa}_i[m] = \mathrm{pa}_i} P\bigl(x_i[m] \mid \mathrm{pa}_i : \Theta_{X_i \mid \mathrm{Pa}_i}\bigr) \;=\; \prod_{\mathrm{pa}_i} \prod_{x_i} \theta_{x_i \mid \mathrm{pa}_i}^{\,N(x_i, \mathrm{pa}_i)} \]

Page 24: Crash Course on Machine  Learning Part  VI

Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

\[ L(\Theta_{X_i \mid \mathrm{Pa}_i} : D) \;=\; \prod_{\mathrm{pa}_i} \prod_{x_i} \theta_{x_i \mid \mathrm{pa}_i}^{\,N(x_i, \mathrm{pa}_i)} \]

• For each value pai of the parents of Xi we get an independent multinomial problem
• The MLE is

\[ \hat{\theta}_{x_i \mid \mathrm{pa}_i} \;=\; \frac{N(x_i, \mathrm{pa}_i)}{N(\mathrm{pa}_i)} \]

Page 25: Crash Course on Machine  Learning Part  VI

Bayesian Approach: Dirichlet Priors

• Recall that the likelihood function is

\[ L(\Theta : D) \;=\; \prod_{k=1}^{K} \theta_k^{N_k} \]

• A Dirichlet prior with hyperparameters α1, …, αK is defined as

\[ P(\Theta) \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \qquad \text{for legal } \theta_1, \ldots, \theta_K \]

• Then the posterior has the same form, with hyperparameters α1 + N1, …, αK + NK:

\[ P(\Theta \mid D) \;\propto\; P(\Theta)\, P(D \mid \Theta) \;\propto\; \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} \;=\; \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1} \]

Page 26: Crash Course on Machine  Learning Part  VI

Dirichlet Priors (cont.)

• We can compute the prediction on a new event in closed form:
• If P(Θ) is Dirichlet with hyperparameters α1, …, αK, then

\[ P(X[1] = k) \;=\; \int \theta_k\, P(\Theta)\, d\Theta \;=\; \frac{\alpha_k}{\sum_{\ell} \alpha_\ell} \]

• Since the posterior is also Dirichlet, we get

\[ P(X[M+1] = k \mid D) \;=\; \int \theta_k\, P(\Theta \mid D)\, d\Theta \;=\; \frac{\alpha_k + N_k}{\sum_{\ell} (\alpha_\ell + N_\ell)} \]

Page 27: Crash Course on Machine  Learning Part  VI

Bayesian Prediction (cont.)

• Given these observations, we can compute the posterior for each multinomial Xi | pai independently
  – The posterior is Dirichlet with parameters α(Xi=1|pai)+N(Xi=1|pai), …, α(Xi=K|pai)+N(Xi=K|pai)
• The predictive distribution is then represented by the parameters

\[ \tilde{\theta}_{x_i \mid \mathrm{pa}_i} \;=\; \frac{\alpha(x_i, \mathrm{pa}_i) + N(x_i, \mathrm{pa}_i)}{\alpha(\mathrm{pa}_i) + N(\mathrm{pa}_i)} \]

Page 28: Crash Course on Machine  Learning Part  VI

Learning Parameters: Summary

• Estimation relies on sufficient statistics
  – For multinomials these are of the form N(xi, pai)
  – Parameter estimation:

    MLE: \( \hat{\theta}_{x_i \mid \mathrm{pa}_i} = \dfrac{N(x_i, \mathrm{pa}_i)}{N(\mathrm{pa}_i)} \)

    Bayesian (Dirichlet): \( \tilde{\theta}_{x_i \mid \mathrm{pa}_i} = \dfrac{\alpha(x_i, \mathrm{pa}_i) + N(x_i, \mathrm{pa}_i)}{\alpha(\mathrm{pa}_i) + N(\mathrm{pa}_i)} \)

• Bayesian methods also require a choice of priors
• Both MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics
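A small sketch contrasting the two estimators for a single multinomial (one parent configuration), with made-up counts and a symmetric Dirichlet prior.

```python
# MLE vs. Bayesian (Dirichlet) estimates for a single multinomial X | pa.
counts = {'low': 3, 'medium': 5, 'high': 2}      # N(x, pa), illustrative numbers
alpha  = {k: 1.0 for k in counts}                # symmetric Dirichlet hyperparameters

N = sum(counts.values())                         # N(pa)
A = sum(alpha.values())                          # alpha(pa)

mle      = {k: counts[k] / N for k in counts}                       # N(x,pa) / N(pa)
bayesian = {k: (alpha[k] + counts[k]) / (A + N) for k in counts}    # (alpha+N)(x,pa) / (alpha+N)(pa)

print(mle)       # {'low': 0.3, 'medium': 0.5, 'high': 0.2}
print(bayesian)  # {'low': 0.3077, 'medium': 0.4615, 'high': 0.2308} (approximately)
```

Both estimates can be updated on-line by simply incrementing the counts as new samples arrive.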

Page 29: Crash Course on Machine  Learning Part  VI

Learning Structure from Complete Data

Page 30: Crash Course on Machine  Learning Part  VI

Benefits of Learning Structure

• Efficient learning -- more accurate models with less data
  – Compare: P(A) and P(B) vs. the joint P(A,B)
• Discover structural properties of the domain
  – Ordering of events
  – Relevance
• Identifying independencies ⇒ faster inference
• Predict the effect of actions
  – Involves learning causal relationships among variables

Page 31: Crash Course on Machine  Learning Part  VI

Why Struggle for Accurate Structure?

[Figure: the true Earthquake / Burglary → Alarm Set → Sound network, alongside a version with an added arc and a version with a missing arc]

Adding an arc:
• Increases the number of parameters to be fitted
• Wrong assumptions about causality and domain structure

Missing an arc:
• Cannot be compensated for by accurate fitting of parameters
• Also misses causality and domain structure

Page 32: Crash Course on Machine  Learning Part  VI

Approaches to Learning Structure

• Constraint based
  – Perform tests of conditional independence
  – Search for a network that is consistent with the observed dependencies and independencies
• Pros & Cons
  + Intuitive, follows closely the construction of BNs
  + Separates structure learning from the form of the independence tests
  - Sensitive to errors in individual tests

Page 33: Crash Course on Machine  Learning Part  VI

Approaches to Learning Structure

• Score based
  – Define a score that evaluates how well the (in)dependencies in a structure match the observations
  – Search for a structure that maximizes the score
• Pros & Cons
  + Statistically motivated
  + Can make compromises
  + Takes the structure of conditional probabilities into account
  - Computationally hard

Page 34: Crash Course on Machine  Learning Part  VI

Likelihood Score for Structures

First-cut approach:
– Use the likelihood function

\[ L(G, \Theta_G : D) \;=\; \prod_m P\bigl(x_1[m], \ldots, x_n[m] : G, \Theta_G\bigr) \;=\; \prod_m \prod_i P\bigl(x_i[m] \mid \mathrm{Pa}_i^G[m] : G, \Theta_{i,G}\bigr) \]

• Since we know how to maximize the parameters, from now on we assume

\[ L(G : D) \;=\; \max_{\Theta_G} L(G, \Theta_G : D) \]

Page 35: Crash Course on Machine  Learning Part  VI

Likelihood Score for Structure (cont.)

Rearranging terms:
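The usual form of this rearrangement (with the mutual information and entropies taken under the empirical distribution of the data; note that the entropy term does not depend on the structure G) is

\[ \log L(G : D) \;=\; M \sum_i I\bigl(X_i ;\, \mathrm{Pa}^G_{X_i}\bigr) \;-\; M \sum_i H(X_i) \]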

where
• H(X) is the entropy of X
• I(X;Y) is the mutual information between X and Y
  – I(X;Y) measures how much "information" each variable provides about the other
  – I(X;Y) ≥ 0
  – I(X;Y) = 0 iff X and Y are independent
  – I(X;Y) = H(X) iff X is totally predictable given Y

Page 36: Crash Course on Machine  Learning Part  VI

Likelihood Score for Structure (cont.)

Good news:
• Intuitive explanation of the likelihood score:

– The larger the dependency of each variable on its parents, the higher the score

• Likelihood as a compromise among dependencies, based on their strength

Page 37: Crash Course on Machine  Learning Part  VI

Likelihood Score for Structure (cont.)

Bad news:
• Adding arcs always helps
  – I(X;Y) ≤ I(X; Y,Z)
  – Maximal score is attained by fully connected networks
  – Such networks can overfit the data --- the parameters capture the noise in the data

Page 38: Crash Course on Machine  Learning Part  VI

Avoiding Overfitting

"Classic" issue in learning. Approaches:
• Restricting the hypothesis space
  – Limits the overfitting capability of the learner
  – Example: restrict the # of parents or # of parameters
• Minimum description length
  – Description length measures complexity
  – Prefer models that compactly describe the training data
• Bayesian methods
  – Average over all possible parameter values
  – Use prior knowledge

Page 39: Crash Course on Machine  Learning Part  VI

Bayesian Inference

• Bayesian reasoning --- compute the expectation over the unknown G

\[ P\bigl(x[M+1] \mid D\bigr) \;=\; \sum_G P\bigl(x[M+1] \mid G, D\bigr)\, P(G \mid D) \]

• Assumption: the Gs are mutually exclusive and exhaustive
• We know how to compute P(x[M+1] | G, D)
  – Same as prediction with a fixed structure
• How do we compute P(G | D)?

Page 40: Crash Course on Machine  Learning Part  VI

Using Bayes rule:

\[ \underbrace{P(G \mid D)}_{\text{posterior score}} \;=\; \frac{\overbrace{P(D \mid G)}^{\text{marginal likelihood}}\;\; \overbrace{P(G)}^{\text{prior over structures}}}{\underbrace{P(D)}_{\text{probability of data}}} \]

P(D) is the same for all structures G, so it can be ignored when comparing structures.

Page 41: Crash Course on Machine  Learning Part  VI

Marginal Likelihood

• Introducing the parameters Θ and integrating them out, we have

\[ P(D \mid G) \;=\; \int \underbrace{P(D \mid G, \Theta)}_{\text{likelihood}}\; \underbrace{P(\Theta \mid G)}_{\text{prior over parameters}}\; d\Theta \]

• This integral measures sensitivity to the choice of parameters

Page 42: Crash Course on Machine  Learning Part  VI

Marginal Likelihood for General Network

The marginal likelihood has the form:

\[ P(D \mid G) \;=\; \prod_i \prod_{\mathrm{pa}_i^G} \frac{\Gamma\bigl(\alpha(\mathrm{pa}_i^G)\bigr)}{\Gamma\bigl(\alpha(\mathrm{pa}_i^G) + N(\mathrm{pa}_i^G)\bigr)} \prod_{x_i} \frac{\Gamma\bigl(\alpha(x_i, \mathrm{pa}_i^G) + N(x_i, \mathrm{pa}_i^G)\bigr)}{\Gamma\bigl(\alpha(x_i, \mathrm{pa}_i^G)\bigr)} \]

where
• N(..) are the counts from the data
• α(..) are the hyperparameters for each family given G
• each factor over a parent configuration is the Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents take that particular value
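A small sketch of the per-family term of this score using log-Gamma functions (scipy is assumed); the counts and hyperparameters below are illustrative.

```python
# Log marginal likelihood of one family X_i | Pa_i under a Dirichlet prior.
import numpy as np
from scipy.special import gammaln

def family_log_marginal(counts, alpha):
    """counts[pa][x] = N(x, pa); alpha[pa][x] = the matching hyperparameter."""
    total = 0.0
    for pa in counts:
        n = np.array(list(counts[pa].values()), dtype=float)
        a = np.array([alpha[pa][x] for x in counts[pa]], dtype=float)
        total += gammaln(a.sum()) - gammaln(a.sum() + n.sum())     # Gamma(alpha(pa)) / Gamma(alpha(pa)+N(pa))
        total += np.sum(gammaln(a + n) - gammaln(a))               # product over x_i
    return total

# One binary child with one binary parent; made-up counts, uniform prior with alpha = 1.
counts = {('pa=0',): {'x=0': 12, 'x=1': 3}, ('pa=1',): {'x=0': 2, 'x=1': 9}}
alpha  = {pa: {x: 1.0 for x in counts[pa]} for pa in counts}
print(family_log_marginal(counts, alpha))
```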

Page 43: Crash Course on Machine  Learning Part  VI

Bayesian Score

Theorem: If the prior P(Θ | G) is "well-behaved", then
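In its usual statement (the BIC approximation of the marginal likelihood), the theorem reads:

\[ \log P(D \mid G) \;=\; \log L(G : D) \;-\; \frac{\log M}{2}\,\dim(G) \;+\; O(1), \]

where M is the number of samples, \( \log L(G:D) \) is the log-likelihood at the MLE, and \( \dim(G) \) is the number of independent parameters in G.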

Page 44: Crash Course on Machine  Learning Part  VI

Asymptotic Behavior: Consequences

• The Bayesian score is consistent
  – As M → ∞, the "true" structure G* maximizes the score (almost surely)
  – For sufficiently large M, the maximal scoring structures are equivalent to G*
• Observed data eventually overrides prior information
  – Assuming that the prior assigns positive probability to all cases

Page 45: Crash Course on Machine  Learning Part  VI

Asymptotic Behavior

• This score can also be justified by the Minimum Description Length (MDL) principle
• The equation explicitly shows the tradeoff between
  – fitness to data --- the likelihood term
  – penalty for complexity --- the regularization term

Page 46: Crash Course on Machine  Learning Part  VI

Priors

• Over network structure
  – Minor role
  – Uniform,
• Over parameters
  – A separate prior for each structure is not feasible
  – Fixed Dirichlet distributions
  – BDe prior

Page 47: Crash Course on Machine  Learning Part  VI

BDe Score

Possible solution: the BDe prior
• Represent the prior using two elements: M0, B0
  – M0 - equivalent sample size
  – B0 - network representing the prior probability of events

Page 48: Crash Course on Machine  Learning Part  VI

BDe Score

Intuition: M0 prior examples distributed by B0

• Set α(xi, paiG) = M0 · P(xi, paiG | B0)
  – Compute P(xi, paiG | B0) using standard inference procedures
• Such priors have desirable theoretical properties
  – Equivalent networks are assigned the same score

Page 49: Crash Course on Machine  Learning Part  VI

Scores -- Summary

• Likelihood, MDL, and (log) BDe scores all have the same decomposable form (shown below)
• BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience
• BDe is consistent and asymptotically equivalent (up to a constant) to MDL
• All are score-equivalent
  – G equivalent to G' ⇒ Score(G) = Score(G')

Page 50: Crash Course on Machine  Learning Part  VI

Optimization Problem

Input:
– Training data
– Scoring function (including priors, if needed)
– Set of possible structures
  • Including prior knowledge about structure

Output:
– A network (or networks) that maximizes the score

Key Property:
– Decomposability: the score of a network is a sum of terms.

Page 51: Crash Course on Machine  Learning Part  VI

Heuristic Search

• Hard problem, with tractable solutions for
  – Tree structures
  – Known orderings
• Otherwise, resort to heuristic search
• Define a search space:
  – nodes are possible structures
  – edges denote adjacency of structures
• Traverse this space looking for high-scoring structures
• Search techniques:
  – Greedy hill-climbing
  – Best-first search
  – Simulated annealing
  – ...

Page 52: Crash Course on Machine  Learning Part  VI

Heuristic Search (cont.)

• Typical operations:

[Figure: a current structure over S, C, E, D and three neighbouring structures obtained by Add C → D, Reverse C → E, and Delete C → E]

Page 53: Crash Course on Machine  Learning Part  VI

Exploiting Decomposability in Local Search

• Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move

[Figure: the same S, C, E, D structures as on the previous slide]

Page 54: Crash Course on Machine  Learning Part  VI

Greedy Hill-Climbing

• Simplest heuristic local search (a minimal sketch follows below)
  – Start with a given network
    • empty network
    • best tree
    • a random network
  – At each iteration
    • Evaluate all possible changes
    • Apply the change that leads to the best improvement in score
    • Reiterate
  – Stop when no modification improves the score
• Each step requires evaluating approximately n new changes
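A compact sketch of greedy hill-climbing over DAG structures, using a simple BIC-style decomposable score computed from complete binary data; the dataset, variable names, and scoring details are all illustrative, not the slides' exact procedure.

```python
# Greedy hill-climbing over Bayes-net structures with a decomposable BIC-style score.
import math
from collections import Counter
from itertools import product

def family_score(child, parents, data):
    """Log-likelihood of the family P(child | parents) at the MLE, minus a BIC penalty."""
    M = len(data)
    parents = sorted(parents)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    pa_counts = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(n * math.log(n / pa_counts[pa]) for (pa, x), n in joint.items())
    n_params = 2 ** len(parents)           # binary child: one free parameter per parent config
    return loglik - 0.5 * math.log(M) * n_params

def score(parents_of, data):
    """Decomposability: the network score is a sum of family scores."""
    return sum(family_score(x, pa, data) for x, pa in parents_of.items())

def is_acyclic(parents_of):
    done, seen = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False                   # back edge => cycle
        seen.add(v)
        ok = all(visit(p) for p in parents_of[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents_of)

def neighbours(parents_of):
    """All structures reachable by adding, deleting, or reversing one edge."""
    vs = list(parents_of)
    for a, b in product(vs, vs):
        if a == b:
            continue
        g = {v: set(ps) for v, ps in parents_of.items()}
        if a in g[b]:                      # edge a -> b exists
            g[b].discard(a)
            yield g                        # delete a -> b
            g2 = {v: set(ps) for v, ps in g.items()}
            g2[a].add(b)
            yield g2                       # reverse a -> b
        else:
            g[b].add(a)
            yield g                        # add a -> b

def hill_climb(variables, data):
    current = {v: set() for v in variables}             # start from the empty network
    current_score = score(current, data)
    while True:
        best, best_score = None, current_score
        for cand in neighbours(current):                # evaluate all one-edge changes
            if not is_acyclic(cand):
                continue
            s = score(cand, data)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                                # no modification improves the score
            return current, current_score
        current, current_score = best, best_score       # apply the best improvement

# Tiny illustrative dataset over three binary variables (Y tends to follow X, Z is noise).
data = [{'X': x, 'Y': (x + n) % 2, 'Z': z}
        for x, z, n in product([0, 1], [0, 1], [0, 0, 0, 1])]
print(hill_climb(['X', 'Y', 'Z'], data))
```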

Page 55: Crash Course on Machine  Learning Part  VI

Greedy Hill-Climbing: Possible Pitfalls

• Greedy hill-climbing can get stuck in:
  – Local maxima:
    • All one-edge changes reduce the score
  – Plateaus:
    • Some one-edge changes leave the score unchanged
    • Happens because equivalent networks receive the same score and are neighbors in the search space
• Both occur during structure search
• Standard heuristics can escape both
  – Random restarts
  – TABU search

Page 56: Crash Course on Machine  Learning Part  VI

Search: Summary

• Discrete optimization problem
• In general, NP-hard
  – Need to resort to heuristic search
  – In practice, search is relatively fast (~100 vars in ~10 min):
    • Decomposability
    • Sufficient statistics
• In some cases, we can reduce the search problem to an easy optimization problem
  – Example: learning trees

Page 57: Crash Course on Machine  Learning Part  VI

Learning in Graphical Models

Structure   Observability   Method
Known       Full            ML or MAP estimation
Known       Partial         EM algorithm
Unknown     Full            Model selection or model averaging
Unknown     Partial         EM + model selection or model averaging

Page 58: Crash Course on Machine  Learning Part  VI

Graphical Models Summary

• Representations
  – Graphs are a cool way to put constraints on distributions, so that you can say lots of stuff without even looking at the numbers!
• Inference
  – GMs let you compute all kinds of different probabilities efficiently
• Learning
  – You can even learn them auto-magically!

Page 59: Crash Course on Machine  Learning Part  VI

Graphical Models

• Pros and cons
  – Very powerful if the dependency structure is sparse and known
  – Modular (especially Bayesian networks)
  – Flexible representation (but not as flexible as "feature passing")
  – Many inference methods
  – Recent developments in learning Markov network parameters, but still tricky