
Learning in Graphical Models

Andrea Passerini [email protected]

Machine Learning


Learning graphical models

Parameter estimation
We assume the structure of the model is given.
We are given a dataset of examples $\mathcal{D} = \{\mathbf{x}(1), \ldots, \mathbf{x}(N)\}$.
Each example $\mathbf{x}(i)$ is a configuration for all (complete data) or some (incomplete data) variables in the model.
We need to estimate the parameters of the model (conditional probability distributions) from the data.
The simplest approach consists of learning the parameters maximizing the likelihood of the data:

$$\theta^{\max} = \arg\max_\theta p(\mathcal{D}\mid\theta) = \arg\max_\theta L(\mathcal{D},\theta)$$


Learning Bayesian Networks

[Figure: plate-style Bayesian network over the examples, with variables X1(i), X2(i), X3(i) for each example i = 1, . . . , N and shared CPD parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

$$p(\mathcal{D}\mid\theta) = \prod_{i=1}^{N} p(\mathbf{x}(i)\mid\theta) \qquad \text{(examples independent given $\theta$)}$$
$$= \prod_{i=1}^{N} \prod_{j=1}^{m} p(x_j(i)\mid \mathrm{pa}_j(i), \theta) \qquad \text{(factorization for BN)}$$
$$= \prod_{i=1}^{N} \prod_{j=1}^{m} p(x_j(i)\mid \mathrm{pa}_j(i), \theta_{X_j\mid \mathrm{pa}_j}) \qquad \text{(disjoint CPD parameters)}$$


Learning graphical models

Maximum likelihood estimation, complete data
The parameters of each CPD can be estimated independently:

$$\theta^{\max}_{X_j\mid \mathrm{Pa}_j} = \arg\max_{\theta_{X_j\mid \mathrm{Pa}_j}} \underbrace{\prod_{i=1}^{N} p(x_j(i)\mid \mathrm{pa}_j(i), \theta_{X_j\mid \mathrm{Pa}_j})}_{L(\theta_{X_j\mid \mathrm{Pa}_j},\, \mathcal{D})}$$

A discrete CPD P(X | U) can be represented as a table with:
a number of rows equal to the number Val(X) of configurations of X
a number of columns equal to the number Val(U) of configurations of its parents U
each table entry $\theta_{x\mid u}$ indicating the probability of the specific configuration X = x given the configuration U = u of its parents


Learning graphical models

Maximum likelihood estimation, complete data
Replacing $p(x(i)\mid \mathrm{pa}(i))$ with $\theta_{x(i)\mid u(i)}$, the local likelihood of a single CPD becomes:

$$L(\theta_{X\mid \mathrm{Pa}}, \mathcal{D}) = \prod_{i=1}^{N} p(x(i)\mid \mathrm{pa}(i), \theta_{X\mid \mathrm{Pa}})$$
$$= \prod_{i=1}^{N} \theta_{x(i)\mid u(i)}$$
$$= \prod_{u \in \mathrm{Val}(U)} \left[ \prod_{x \in \mathrm{Val}(X)} \theta_{x\mid u}^{N_{u,x}} \right]$$

where $N_{u,x}$ is the number of times the specific configuration X = x, U = u was found in the data.


Learning graphical models

Maximum likelihood estimation, complete data
A column in the CPD table contains a multinomial distribution over the values of X for a certain configuration of the parents U.

Thus each column should sum to one: $\sum_{x} \theta_{x\mid u} = 1$

Parameters of different columns can be estimated independently.
For each multinomial distribution, setting the gradient of the likelihood to zero while enforcing the normalization constraint, we obtain:

$$\theta^{\max}_{x\mid u} = \frac{N_{u,x}}{\sum_{x} N_{u,x}}$$

The maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data.
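As an illustration of this count-and-normalize rule, here is a minimal NumPy sketch (the function name, the integer encoding of the data, and the mixed-radix indexing of parent configurations are assumptions made for the example, not part of the slides): it builds the count table $N_{u,x}$ from complete data and normalizes each parent-configuration column.

```python
import numpy as np

def ml_cpd(data, child, parents, card):
    """Maximum-likelihood CPD P(child | parents) from complete data.

    data: (N, n_vars) array of integer-coded values, one column per variable.
    child: index of X; parents: list of indices of U; card[v] = |Val(X_v)|.
    Returns a (|Val(X)|, |Val(U)|) table whose columns sum to one.
    """
    n_cfg = int(np.prod([card[p] for p in parents])) if parents else 1
    counts = np.zeros((card[child], n_cfg))
    for row in data:
        u = 0
        for p in parents:                      # mixed-radix index of parent configuration u
            u = u * card[p] + row[p]
        counts[row[child], u] += 1             # N_{u,x}
    totals = counts.sum(axis=0, keepdims=True)
    # theta_{x|u} = N_{u,x} / sum_x N_{u,x}; unseen parent configurations left uniform
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / card[child])
```

Rows of the returned table correspond to values x of the variable and columns to configurations u of the parents, matching the CPD table layout described above.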


Learning graphical models

Adding priors
ML estimation tends to overfit the training set.
Configurations not appearing in the training set will receive zero probability.
A common approach consists of combining ML with a prior probability on the parameters, achieving a maximum-a-posteriori estimate:

$$\theta^{\max} = \arg\max_\theta p(\mathcal{D}\mid\theta)\, p(\theta)$$


Learning graphical models

Dirichlet priors
The conjugate (read: natural) prior for a multinomial distribution is a Dirichlet distribution with parameters $\alpha_{x\mid u}$ for each possible value of x.

The resulting maximum-a-posteriori estimate is:

$$\theta^{\max}_{x\mid u} = \frac{N_{u,x} + \alpha_{x\mid u}}{\sum_{x} \left( N_{u,x} + \alpha_{x\mid u} \right)}$$

The prior is like having observed $\alpha_{x\mid u}$ imaginary samples with configuration X = x, U = u.
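Continuing the sketch above, the MAP estimate only changes the normalization step: the imaginary samples $\alpha_{x\mid u}$ are added to the count table before renormalizing (again an illustrative helper, assuming the same table layout as in ml_cpd):

```python
import numpy as np

def map_cpd(counts, alpha):
    """MAP estimate: add Dirichlet pseudocounts alpha_{x|u} to the counts N_{u,x}
    (rows = values of X, columns = parent configurations) and renormalize columns."""
    smoothed = np.asarray(counts, dtype=float) + alpha
    return smoothed / smoothed.sum(axis=0, keepdims=True)
```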


Learning graphical models

Incomplete data
With incomplete data, some of the examples miss evidence on some of the variables.
Counts of occurrences of different configurations cannot be computed if not all data are observed.
The full Bayesian approach of integrating over the missing variables is often intractable in practice.
We need approximate methods to deal with the problem.


Learning with missing data: Expectation-Maximization

E-M for Bayesian nets in a nutshell
Sufficient statistics (counts) cannot be computed (missing data).
Fill in the missing data by inferring them using the current parameters (solve an inference problem to get expected counts).
Compute the parameters maximizing the likelihood (or posterior) of such expected counts.
Iterate the procedure to improve the quality of the parameters.


Learning with missing data: Expectation-Maximization

Expectation-Maximization algorithm
e-step: Compute the expected sufficient statistics for the complete dataset, with the expectation taken with respect to the joint distribution for X, conditioned on the current value of $\theta$ and the known data $\mathcal{D}$:

$$E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}] = \sum_{l=1}^{n} p\bigl(X_i(l) = x^k,\ \mathrm{Pa}_i(l) = \mathrm{pa}^j \mid \mathbf{X}_l, \theta\bigr)$$

If $X_i(l)$ and $\mathrm{Pa}_i(l)$ are observed in $\mathbf{X}_l$, the term is either zero or one.
Otherwise, run Bayesian inference to compute the probability from the observed variables.


Learning with missing data: Expectation-Maximization

Expectation-Maximization algorithm

m-step: Compute the parameters maximizing the likelihood of the complete dataset $\mathcal{D}_c$ (using the expected counts):
$$\theta^{*} = \arg\max_\theta p(\mathcal{D}_c \mid \theta)$$

which for each multinomial parameter evaluates to:

$$\theta^{*}_{ijk} = \frac{E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}]}{\sum_{k=1}^{r_i} E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}]}$$

Note
ML estimation can be replaced by maximum-a-posteriori (MAP) estimation, giving:

$$\theta^{*}_{ijk} = \frac{\alpha_{ijk} + E_{p(\mathbf{x}\mid \mathcal{D},\theta,S)}[N_{ijk}]}{\sum_{k=1}^{r_i} \bigl( \alpha_{ijk} + E_{p(\mathbf{x}\mid \mathcal{D},\theta,S)}[N_{ijk}] \bigr)}$$
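To make the two steps concrete, the following is a schematic sketch of EM for discrete Bayesian networks with missing entries coded as -1. The e-step is done by exact enumeration over the missing variables of each example (used here in place of a general inference engine, so it is only feasible for small networks), and the m-step renormalizes the expected counts, optionally with Dirichlet pseudocounts. The data and structure encodings and all helper names are illustrative assumptions, not the algorithm as given on the slides.

```python
import itertools
import numpy as np

def parent_index(x, parents, card):
    """Mixed-radix index of the parent configuration for a full assignment x."""
    u = 0
    for p in parents:
        u = u * card[p] + x[p]
    return u

def joint_prob(x, structure, theta, card):
    """p(x | theta) for a fully observed assignment, using the BN factorization."""
    p = 1.0
    for i, pa in structure.items():
        p *= theta[i][x[i], parent_index(x, pa, card)]
    return p

def em_bn(data, structure, card, n_iter=20, alpha=1.0):
    """EM for BN parameters; missing entries in `data` are coded as -1.

    structure: dict node -> list of parent nodes; card[v] = number of values of X_v.
    e-step: enumerate completions of the missing variables (exact inference by
    enumeration); m-step: renormalize the expected counts (plus pseudocounts alpha).
    """
    n_vars = len(card)
    theta = {i: np.full((card[i], max(1, int(np.prod([card[p] for p in pa])))),
                        1.0 / card[i])
             for i, pa in structure.items()}                      # uniform initialization
    for _ in range(n_iter):
        counts = {i: np.full(theta[i].shape, alpha) for i in structure}
        for row in data:
            missing = [v for v in range(n_vars) if row[v] < 0]
            completions, weights = [], []
            for vals in itertools.product(*(range(card[v]) for v in missing)):
                x = np.array(row, copy=True)
                if missing:
                    x[missing] = vals
                completions.append(x)
                weights.append(joint_prob(x, structure, theta, card))
            weights = np.array(weights) / np.sum(weights)         # p(missing | observed, theta)
            for x, w in zip(completions, weights):                # accumulate expected counts
                for i, pa in structure.items():
                    counts[i][x[i], parent_index(x, pa, card)] += w
        theta = {i: c / c.sum(axis=0, keepdims=True)              # m-step
                 for i, c in counts.items()}
    return theta
```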


Learning structure of graphical models

Approaches
constraint-based: test conditional independencies on the data and construct a model satisfying them
score-based: assign a score to each possible structure, define a search procedure looking for the structure maximizing the score
model-averaging: assign a prior probability to each structure, and average the prediction over all possible structures weighted by their probabilities (full Bayesian, intractable)


Appendix: Learning the structure

Bayesian approach

Let S be the space of possible structures (DAGs) for the domain X.
Let D be a dataset of observations.
Predictions for a new instance are computed by marginalizing over both structures and parameters:

$$p(X_{N+1}\mid \mathcal{D}) = \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}, S, \theta \mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}\mid S, \theta, \mathcal{D})\, P(S, \theta \mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}\mid S, \theta)\, P(\theta\mid S, \mathcal{D})\, P(S\mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} P(S\mid \mathcal{D}) \int_{\theta} P(X_{N+1}\mid S, \theta)\, P(\theta\mid S, \mathcal{D})\, d\theta$$


Learning the structure

Problem
Averaging over all possible structures is too expensive.

Model selection
Choose a best structure $S^*$ and assume $P(S^*\mid \mathcal{D}) = 1$.
Approaches:
Score-based: assign a score to each structure; choose $S^*$ to maximize the score.
Constraint-based: test conditional independencies on the data; choose $S^*$ satisfying these independencies.


Score-based model selection

Structure scores
Maximum-likelihood score:

$$S^{*} = \arg\max_{S \in \mathcal{S}} p(\mathcal{D}\mid S)$$

Maximum-a-posteriori score:

$$S^{*} = \arg\max_{S \in \mathcal{S}} p(\mathcal{D}\mid S)\, p(S)$$


Computing P(D|S)

Maximum likelihood approximation
The easiest solution is to approximate $P(\mathcal{D}\mid S)$ with the maximum-likelihood score over the parameters:
$$P(\mathcal{D}\mid S) \approx \max_\theta P(\mathcal{D}\mid S, \theta)$$

Unfortunately, this boils down to adding a connection between two variables whenever their empirical mutual information over the training set is non-zero (proof omitted).
Because of noise, the empirical mutual information between any two variables is almost never exactly zero ⇒ fully connected network.


Computing P(D|S) ≡ P_S(D): Bayesian-Dirichlet scoring

Simple case: setting
X is a single variable with r possible realizations (an r-faced die).
S is a single node.
The probability distribution is a multinomial with Dirichlet priors $\alpha_1, \ldots, \alpha_r$.
D is a sequence of N realizations (die tosses).


Computing P_S(D): Bayesian-Dirichlet scoring

Simple case: approach
Sort $\mathcal{D}$ according to outcome:
$$\mathcal{D} = \{x^1, x^1, \ldots, x^1, x^2, \ldots, x^2, \ldots, x^r, \ldots, x^r\}$$

Its probability can be decomposed as:

$$P_S(\mathcal{D}) = \prod_{t=1}^{N} P_S\bigl(X(t)\mid \underbrace{X(t-1), \ldots, X(1)}_{\mathcal{D}(t-1)}\bigr)$$

The prediction for a new event given the past is:

$$P_S\bigl(X(t+1) = x^k \mid \mathcal{D}(t)\bigr) = E_{p_S(\theta\mid \mathcal{D}(t))}[\theta_k] = \frac{\alpha_k + N_k(t)}{\alpha + t}$$

where $N_k(t)$ is the number of times we have $X = x^k$ in the first t examples in $\mathcal{D}$ (and $\alpha = \sum_k \alpha_k$).


Computing P_S(D): Bayesian-Dirichlet scoring

Simple case: approach

$$P_S(\mathcal{D}) = \frac{\alpha_1}{\alpha} \cdot \frac{\alpha_1 + 1}{\alpha + 1} \cdots \frac{\alpha_1 + N_1 - 1}{\alpha + N_1 - 1} \cdot \frac{\alpha_2}{\alpha + N_1} \cdot \frac{\alpha_2 + 1}{\alpha + N_1 + 1} \cdots \frac{\alpha_2 + N_2 - 1}{\alpha + N_1 + N_2 - 1} \cdots \frac{\alpha_r}{\alpha + N_1 + \cdots + N_{r-1}} \cdots \frac{\alpha_r + N_r - 1}{\alpha + N - 1}$$
$$= \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{k=1}^{r} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}$$

where we used the Gamma function ($\Gamma(x+1) = x\,\Gamma(x)$):
$$\alpha(1+\alpha)\cdots(N-1+\alpha) = \frac{\Gamma(N+\alpha)}{\Gamma(\alpha)}$$
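As a numerical sanity check of this closed form, here is a small sketch that evaluates log P_S(D) from the outcome counts $N_1, \ldots, N_r$ and the Dirichlet parameters $\alpha_1, \ldots, \alpha_r$; the function name and the use of scipy.special.gammaln (log-Gamma, to avoid overflow) are my additions:

```python
import numpy as np
from scipy.special import gammaln   # log Gamma, numerically stable

def log_bd_score(counts, alphas):
    """log P_S(D) = log[ Gamma(a)/Gamma(a+N) * prod_k Gamma(a_k+N_k)/Gamma(a_k) ]."""
    counts = np.asarray(counts, dtype=float)   # N_1, ..., N_r
    alphas = np.asarray(alphas, dtype=float)   # alpha_1, ..., alpha_r
    alpha, N = alphas.sum(), counts.sum()
    return (gammaln(alpha) - gammaln(alpha + N)
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# e.g. log_bd_score([3, 1, 2], [1, 1, 1]) scores six tosses of a 3-faced die under a uniform prior
```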


Computing P_S(D): Bayesian-Dirichlet scoring

General case

$$P_S(\mathcal{D}) = \prod_{i} \prod_{j} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where
$i \in \{1, \ldots, n\}$ ranges over the nodes in the network
$j \in \{1, \ldots, q_i\}$ ranges over the configurations of $X_i$'s parents
$k \in \{1, \ldots, r_i\}$ ranges over the states of $X_i$
(with $\alpha_{ij} = \sum_k \alpha_{ijk}$ and $N_{ij} = \sum_k N_{ijk}$)

Note
The score is decomposable: it is the product of independent scores associated with the distribution of each node in the network.
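Exploiting this decomposability, here is a possible sketch of the full score as a sum of per-family log terms, assuming uniform pseudocounts $\alpha_{ijk} = \alpha$ and the same integer data encoding as in the earlier sketches (helper names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def family_counts(data, child, parents, card):
    """N_ijk laid out as a (parent configurations) x (child values) table."""
    n_cfg = max(1, int(np.prod([card[p] for p in parents])))
    counts = np.zeros((n_cfg, card[child]))
    for row in data:
        j = 0
        for p in parents:                       # index of the parent configuration
            j = j * card[p] + row[p]
        counts[j, row[child]] += 1
    return counts

def log_bd_score_network(data, structure, card, alpha=1.0):
    """log P_S(D) as a sum over nodes i and parent configurations j of
    log Gamma(a_ij)/Gamma(a_ij+N_ij) + sum_k log Gamma(a_ijk+N_ijk)/Gamma(a_ijk)."""
    total = 0.0
    for i, pa in structure.items():
        N = family_counts(data, i, pa, card)    # shape (q_i, r_i)
        A = np.full_like(N, alpha)              # alpha_ijk
        total += np.sum(gammaln(A.sum(axis=1)) - gammaln(A.sum(axis=1) + N.sum(axis=1)))
        total += np.sum(gammaln(A + N) - gammaln(A))
    return total
```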


Search strategy

Approach
Discrete search problem: NP-hard for nets whose nodes have at most k > 1 parents.
Heuristic search strategies are employed:
Search space: set of DAGs
Operators: add, remove, or reverse one arc
Initial structure: e.g. random, fully disconnected, ...
Strategies: hill climbing, best first, simulated annealing

Note
Decomposable scores allow recomputing only the local scores affected by a single move.
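A skeleton of such a hill-climbing search, as a sketch only: the parents-dictionary encoding of structures, the acyclicity check, and the reliance on an externally supplied score function (e.g. the log BD score sketched earlier) are all assumptions, and no caching of local scores is done although decomposability would allow it.

```python
import itertools

def is_dag(parents):
    """Acyclicity check for a structure given as dict: node -> set of parents."""
    color = {v: 0 for v in parents}          # 0 = unvisited, 1 = on stack, 2 = done
    def visit(v):
        color[v] = 1
        for p in parents[v]:
            if color[p] == 1 or (color[p] == 0 and not visit(p)):
                return False
        color[v] = 2
        return True
    return all(color[v] == 2 or visit(v) for v in parents)

def neighbours(parents):
    """Structures obtained by adding, removing, or reversing a single arc u -> v."""
    for u, v in itertools.permutations(parents, 2):
        s = {n: set(ps) for n, ps in parents.items()}
        if u in s[v]:
            s[v].discard(u)
            yield s                                       # remove u -> v
            rev = {n: set(ps) for n, ps in s.items()}
            rev[u].add(v)
            yield rev                                     # reverse u -> v
        else:
            s[v].add(u)
            yield s                                       # add u -> v

def hill_climb(score, data, nodes, max_steps=1000):
    """Greedy search: start fully disconnected, take the best improving move."""
    current = {v: set() for v in nodes}
    best = score(current, data)
    for _ in range(max_steps):
        scored = [(score(c, data), c) for c in neighbours(current) if is_dag(c)]
        if not scored:
            break
        new_best, cand = max(scored, key=lambda t: t[0])
        if new_best <= best:
            break
        current, best = cand, new_best
    return current, best
```

With the earlier sketches it could be called, for instance, as hill_climb(lambda s, d: log_bd_score_network(d, s, card), data, range(len(card))).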


Equivalence classes

Definition
Two Bayesian network structures for X are independence equivalent if they encode the same set of conditional-independence assumptions for X.

Verification
The undirected versions of S and S′ have the same structure (same skeleton).
S and S′ have the same v-structures.
[Figure: four structures S1, S2, S3, S4 over the variables A, B, C, all with skeleton A − B − C but with different edge orientations; S4 is the v-structure A → B ← C]

$S_1 \equiv S_2 \equiv S_3$: all encode that $C \perp A \mid B$.
$S_4 \not\equiv S_1$, as it encodes $C \not\perp A \mid B$.
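The verification criterion can be checked mechanically; below is a sketch using the same parents-dictionary encoding as the earlier structure-search sketch (and, following the usual definition, counting a v-structure only when the two parents are not themselves adjacent):

```python
def skeleton(parents):
    """Set of undirected edges of the structure."""
    return {frozenset((u, v)) for v, ps in parents.items() for u in ps}

def v_structures(parents):
    """Pairs of non-adjacent parents pointing at a common child: A -> B <- C."""
    skel = skeleton(parents)
    return {(frozenset((a, c)), b)
            for b, ps in parents.items()
            for a in ps for c in ps
            if a != c and frozenset((a, c)) not in skel}

def independence_equivalent(s1, s2):
    """Same skeleton and same v-structures."""
    return skeleton(s1) == skeleton(s2) and v_structures(s1) == v_structures(s2)
```

For the example above, a chain such as A → B → C (encoded as {'A': set(), 'B': {'A'}, 'C': {'B'}}) and its reversal are reported as equivalent, while the v-structure A → B ← C ({'A': set(), 'B': {'A', 'C'}, 'C': set()}) is not.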
