
Learning in Graphical Models

Andrea Passerini [email protected]

Machine Learning


Learning graphical models

Parameter estimation
We assume the structure of the model is given.
We are given a dataset of examples $\mathcal{D} = \{\mathbf{x}(1), \ldots, \mathbf{x}(N)\}$.
Each example $\mathbf{x}(i)$ is a configuration for all (complete data) or some (incomplete data) variables in the model.
We need to estimate the parameters of the model (conditional probability distributions) from the data.
The simplest approach consists of learning the parameters maximizing the likelihood of the data:

$$\theta^{\max} = \arg\max_\theta p(\mathcal{D}\mid\theta) = \arg\max_\theta L(\mathcal{D},\theta)$$


Learning Bayesian Networks

[Figure: plate-style Bayesian network over the examples, with variables X1(i), X2(i), X3(i) for each example i = 1, . . . , N and shared CPD parameters Θ1, Θ2|1, Θ3|1]

Maximum likelihood estimation, complete data

$$p(\mathcal{D}\mid\theta) = \prod_{i=1}^{N} p(\mathbf{x}(i)\mid\theta) \qquad \text{(examples independent given $\theta$)}$$
$$= \prod_{i=1}^{N} \prod_{j=1}^{m} p(x_j(i)\mid \mathrm{pa}_j(i), \theta) \qquad \text{(factorization for BN)}$$
$$= \prod_{i=1}^{N} \prod_{j=1}^{m} p(x_j(i)\mid \mathrm{pa}_j(i), \theta_{X_j\mid \mathrm{pa}_j}) \qquad \text{(disjoint CPD parameters)}$$


Learning graphical models

Maximum likelihood estimation, complete data
The parameters of each CPD can be estimated independently:

$$\theta^{\max}_{X_j\mid \mathrm{Pa}_j} = \arg\max_{\theta_{X_j\mid \mathrm{Pa}_j}} \underbrace{\prod_{i=1}^{N} p(x_j(i)\mid \mathrm{pa}_j(i), \theta_{X_j\mid \mathrm{Pa}_j})}_{L(\theta_{X_j\mid \mathrm{Pa}_j},\, \mathcal{D})}$$

A discrete CPD P(X | U) can be represented as a table with:
a number of rows equal to the number Val(X) of configurations of X
a number of columns equal to the number Val(U) of configurations of its parents U
each table entry $\theta_{x\mid u}$ indicating the probability of the specific configuration X = x given the configuration U = u of its parents


Learning graphical models

Maximum likelihood estimation, complete data
Replacing $p(x(i)\mid \mathrm{pa}(i))$ with $\theta_{x(i)\mid u(i)}$, the local likelihood of a single CPD becomes:

$$L(\theta_{X\mid \mathrm{Pa}}, \mathcal{D}) = \prod_{i=1}^{N} p(x(i)\mid \mathrm{pa}(i), \theta_{X\mid \mathrm{Pa}})$$
$$= \prod_{i=1}^{N} \theta_{x(i)\mid u(i)}$$
$$= \prod_{u \in \mathrm{Val}(U)} \left[ \prod_{x \in \mathrm{Val}(X)} \theta_{x\mid u}^{N_{u,x}} \right]$$

where $N_{u,x}$ is the number of times the specific configuration X = x, U = u was found in the data.


Learning graphical models

Maximum likelihood estimation, complete data
A column in the CPD table contains a multinomial distribution over the values of X for a certain configuration of the parents U.

Thus each column should sum to one: $\sum_{x} \theta_{x\mid u} = 1$

Parameters of different columns can be estimated independently.
For each multinomial distribution, setting the gradient of the likelihood to zero while enforcing the normalization constraint, we obtain:

$$\theta^{\max}_{x\mid u} = \frac{N_{u,x}}{\sum_{x} N_{u,x}}$$

The maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data.
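As an illustration of this count-and-normalize rule, here is a minimal NumPy sketch (the function name, the integer encoding of the data, and the mixed-radix indexing of parent configurations are assumptions made for the example, not part of the slides): it builds the count table $N_{u,x}$ from complete data and normalizes each parent-configuration column.

```python
import numpy as np

def ml_cpd(data, child, parents, card):
    """Maximum-likelihood CPD P(child | parents) from complete data.

    data: (N, n_vars) array of integer-coded values, one column per variable.
    child: index of X; parents: list of indices of U; card[v] = |Val(X_v)|.
    Returns a (|Val(X)|, |Val(U)|) table whose columns sum to one.
    """
    n_cfg = int(np.prod([card[p] for p in parents])) if parents else 1
    counts = np.zeros((card[child], n_cfg))
    for row in data:
        u = 0
        for p in parents:                      # mixed-radix index of parent configuration u
            u = u * card[p] + row[p]
        counts[row[child], u] += 1             # N_{u,x}
    totals = counts.sum(axis=0, keepdims=True)
    # theta_{x|u} = N_{u,x} / sum_x N_{u,x}; unseen parent configurations left uniform
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / card[child])
```

Rows of the returned table correspond to values x of the variable and columns to configurations u of the parents, matching the CPD table layout described above.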


Learning graphical models

Adding priors
ML estimation tends to overfit the training set.
Configurations not appearing in the training set will receive zero probability.
A common approach consists of combining ML with a prior probability on the parameters, achieving a maximum-a-posteriori estimate:

$$\theta^{\max} = \arg\max_\theta p(\mathcal{D}\mid\theta)\, p(\theta)$$


Learning graphical models

Dirichlet priors
The conjugate (read: natural) prior for a multinomial distribution is a Dirichlet distribution with parameters $\alpha_{x\mid u}$ for each possible value of x.

The resulting maximum-a-posteriori estimate is:

$$\theta^{\max}_{x\mid u} = \frac{N_{u,x} + \alpha_{x\mid u}}{\sum_{x} \left( N_{u,x} + \alpha_{x\mid u} \right)}$$

The prior is like having observed $\alpha_{x\mid u}$ imaginary samples with configuration X = x, U = u.
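Continuing the sketch above, the MAP estimate only changes the normalization step: the imaginary samples $\alpha_{x\mid u}$ are added to the count table before renormalizing (again an illustrative helper, assuming the same table layout as in ml_cpd):

```python
import numpy as np

def map_cpd(counts, alpha):
    """MAP estimate: add Dirichlet pseudocounts alpha_{x|u} to the counts N_{u,x}
    (rows = values of X, columns = parent configurations) and renormalize columns."""
    smoothed = np.asarray(counts, dtype=float) + alpha
    return smoothed / smoothed.sum(axis=0, keepdims=True)
```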


Learning graphical models

Incomplete data
With incomplete data, some of the examples miss evidence on some of the variables.
Counts of occurrences of different configurations cannot be computed if not all data are observed.
The full Bayesian approach of integrating over the missing variables is often intractable in practice.
We need approximate methods to deal with the problem.


Learning with missing data: Expectation-Maximization

E-M for Bayesian nets in a nutshell
Sufficient statistics (counts) cannot be computed (missing data).
Fill in the missing data by inferring them using the current parameters (solve an inference problem to get expected counts).
Compute the parameters maximizing the likelihood (or posterior) of such expected counts.
Iterate the procedure to improve the quality of the parameters.


Learning with missing data: Expectation-Maximization

Expectation-Maximization algorithm
e-step: Compute the expected sufficient statistics for the complete dataset, with the expectation taken with respect to the joint distribution for X, conditioned on the current value of $\theta$ and the known data $\mathcal{D}$:

$$E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}] = \sum_{l=1}^{n} p\bigl(X_i(l) = x^k,\ \mathrm{Pa}_i(l) = \mathrm{pa}^j \mid \mathbf{X}_l, \theta\bigr)$$

If $X_i(l)$ and $\mathrm{Pa}_i(l)$ are observed in $\mathbf{X}_l$, the term is either zero or one.
Otherwise, run Bayesian inference to compute the probability from the observed variables.


Learning with missing data: Expectation-Maximization

Expectation-Maximization algorithm

m-step: Compute the parameters maximizing the likelihood of the complete dataset $\mathcal{D}_c$ (using the expected counts):
$$\theta^{*} = \arg\max_\theta p(\mathcal{D}_c \mid \theta)$$

which for each multinomial parameter evaluates to:

$$\theta^{*}_{ijk} = \frac{E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}]}{\sum_{k=1}^{r_i} E_{p(\mathbf{x}\mid \mathcal{D},\theta)}[N_{ijk}]}$$

Note
ML estimation can be replaced by maximum-a-posteriori (MAP) estimation, giving:

$$\theta^{*}_{ijk} = \frac{\alpha_{ijk} + E_{p(\mathbf{x}\mid \mathcal{D},\theta,S)}[N_{ijk}]}{\sum_{k=1}^{r_i} \bigl( \alpha_{ijk} + E_{p(\mathbf{x}\mid \mathcal{D},\theta,S)}[N_{ijk}] \bigr)}$$
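To make the two steps concrete, the following is a schematic sketch of EM for discrete Bayesian networks with missing entries coded as -1. The e-step is done by exact enumeration over the missing variables of each example (used here in place of a general inference engine, so it is only feasible for small networks), and the m-step renormalizes the expected counts, optionally with Dirichlet pseudocounts. The data and structure encodings and all helper names are illustrative assumptions, not the algorithm as given on the slides.

```python
import itertools
import numpy as np

def parent_index(x, parents, card):
    """Mixed-radix index of the parent configuration for a full assignment x."""
    u = 0
    for p in parents:
        u = u * card[p] + x[p]
    return u

def joint_prob(x, structure, theta, card):
    """p(x | theta) for a fully observed assignment, using the BN factorization."""
    p = 1.0
    for i, pa in structure.items():
        p *= theta[i][x[i], parent_index(x, pa, card)]
    return p

def em_bn(data, structure, card, n_iter=20, alpha=1.0):
    """EM for BN parameters; missing entries in `data` are coded as -1.

    structure: dict node -> list of parent nodes; card[v] = number of values of X_v.
    e-step: enumerate completions of the missing variables (exact inference by
    enumeration); m-step: renormalize the expected counts (plus pseudocounts alpha).
    """
    n_vars = len(card)
    theta = {i: np.full((card[i], max(1, int(np.prod([card[p] for p in pa])))),
                        1.0 / card[i])
             for i, pa in structure.items()}                      # uniform initialization
    for _ in range(n_iter):
        counts = {i: np.full(theta[i].shape, alpha) for i in structure}
        for row in data:
            missing = [v for v in range(n_vars) if row[v] < 0]
            completions, weights = [], []
            for vals in itertools.product(*(range(card[v]) for v in missing)):
                x = np.array(row, copy=True)
                if missing:
                    x[missing] = vals
                completions.append(x)
                weights.append(joint_prob(x, structure, theta, card))
            weights = np.array(weights) / np.sum(weights)         # p(missing | observed, theta)
            for x, w in zip(completions, weights):                # accumulate expected counts
                for i, pa in structure.items():
                    counts[i][x[i], parent_index(x, pa, card)] += w
        theta = {i: c / c.sum(axis=0, keepdims=True)              # m-step
                 for i, c in counts.items()}
    return theta
```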


Learning structure of graphical models

Approaches
constraint-based: test conditional independencies on the data and construct a model satisfying them
score-based: assign a score to each possible structure, define a search procedure looking for the structure maximizing the score
model-averaging: assign a prior probability to each structure, and average the prediction over all possible structures weighted by their probabilities (full Bayesian, intractable)


Appendix: Learning the structure

Bayesian approach

Let S be the space of possible structures (DAGs) for the domain X.
Let D be a dataset of observations.
Predictions for a new instance are computed by marginalizing over both structures and parameters:

$$p(X_{N+1}\mid \mathcal{D}) = \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}, S, \theta \mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}\mid S, \theta, \mathcal{D})\, P(S, \theta \mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} \int_{\theta} P(X_{N+1}\mid S, \theta)\, P(\theta\mid S, \mathcal{D})\, P(S\mid \mathcal{D})\, d\theta$$
$$= \sum_{S \in \mathcal{S}} P(S\mid \mathcal{D}) \int_{\theta} P(X_{N+1}\mid S, \theta)\, P(\theta\mid S, \mathcal{D})\, d\theta$$


Learning the structure

Problem
Averaging over all possible structures is too expensive.

Model selection
Choose a best structure $S^*$ and assume $P(S^*\mid \mathcal{D}) = 1$.
Approaches:
Score-based: assign a score to each structure; choose $S^*$ to maximize the score.
Constraint-based: test conditional independencies on the data; choose $S^*$ satisfying these independencies.


Score-based model selection

Structure scores
Maximum-likelihood score:

$$S^{*} = \arg\max_{S \in \mathcal{S}} p(\mathcal{D}\mid S)$$

Maximum-a-posteriori score:

$$S^{*} = \arg\max_{S \in \mathcal{S}} p(\mathcal{D}\mid S)\, p(S)$$


Computing P(D|S)

Maximum likelihood approximation
The easiest solution is to approximate $P(\mathcal{D}\mid S)$ with the maximum-likelihood score over the parameters:
$$P(\mathcal{D}\mid S) \approx \max_\theta P(\mathcal{D}\mid S, \theta)$$

Unfortunately, this boils down to adding a connection between two variables whenever their empirical mutual information over the training set is non-zero (proof omitted).
Because of noise, the empirical mutual information between any two variables is almost never exactly zero ⇒ fully connected network.


Computing P(D|S) ≡ P_S(D): Bayesian-Dirichlet scoring

Simple case: setting
X is a single variable with r possible realizations (an r-faced die).
S is a single node.
The probability distribution is a multinomial with Dirichlet priors $\alpha_1, \ldots, \alpha_r$.
D is a sequence of N realizations (die tosses).


Computing P_S(D): Bayesian-Dirichlet scoring

Simple case: approach
Sort $\mathcal{D}$ according to outcome:
$$\mathcal{D} = \{x^1, x^1, \ldots, x^1, x^2, \ldots, x^2, \ldots, x^r, \ldots, x^r\}$$

Its probability can be decomposed as:

$$P_S(\mathcal{D}) = \prod_{t=1}^{N} P_S\bigl(X(t)\mid \underbrace{X(t-1), \ldots, X(1)}_{\mathcal{D}(t-1)}\bigr)$$

The prediction for a new event given the past is:

$$P_S\bigl(X(t+1) = x^k \mid \mathcal{D}(t)\bigr) = E_{p_S(\theta\mid \mathcal{D}(t))}[\theta_k] = \frac{\alpha_k + N_k(t)}{\alpha + t}$$

where $N_k(t)$ is the number of times we have $X = x^k$ in the first t examples in $\mathcal{D}$ (and $\alpha = \sum_k \alpha_k$).


Computing P_S(D): Bayesian-Dirichlet scoring

Simple case: approach

$$P_S(\mathcal{D}) = \frac{\alpha_1}{\alpha} \cdot \frac{\alpha_1 + 1}{\alpha + 1} \cdots \frac{\alpha_1 + N_1 - 1}{\alpha + N_1 - 1} \cdot \frac{\alpha_2}{\alpha + N_1} \cdot \frac{\alpha_2 + 1}{\alpha + N_1 + 1} \cdots \frac{\alpha_2 + N_2 - 1}{\alpha + N_1 + N_2 - 1} \cdots \frac{\alpha_r}{\alpha + N_1 + \cdots + N_{r-1}} \cdots \frac{\alpha_r + N_r - 1}{\alpha + N - 1}$$
$$= \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{k=1}^{r} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)}$$

where we used the Gamma function ($\Gamma(x+1) = x\,\Gamma(x)$):
$$\alpha(1+\alpha)\cdots(N-1+\alpha) = \frac{\Gamma(N+\alpha)}{\Gamma(\alpha)}$$
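As a numerical sanity check of this closed form, here is a small sketch that evaluates log P_S(D) from the outcome counts $N_1, \ldots, N_r$ and the Dirichlet parameters $\alpha_1, \ldots, \alpha_r$; the function name and the use of scipy.special.gammaln (log-Gamma, to avoid overflow) are my additions:

```python
import numpy as np
from scipy.special import gammaln   # log Gamma, numerically stable

def log_bd_score(counts, alphas):
    """log P_S(D) = log[ Gamma(a)/Gamma(a+N) * prod_k Gamma(a_k+N_k)/Gamma(a_k) ]."""
    counts = np.asarray(counts, dtype=float)   # N_1, ..., N_r
    alphas = np.asarray(alphas, dtype=float)   # alpha_1, ..., alpha_r
    alpha, N = alphas.sum(), counts.sum()
    return (gammaln(alpha) - gammaln(alpha + N)
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# e.g. log_bd_score([3, 1, 2], [1, 1, 1]) scores six tosses of a 3-faced die under a uniform prior
```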


Computing P_S(D): Bayesian-Dirichlet scoring

General case

$$P_S(\mathcal{D}) = \prod_{i} \prod_{j} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where
$i \in \{1, \ldots, n\}$ ranges over the nodes in the network
$j \in \{1, \ldots, q_i\}$ ranges over the configurations of $X_i$'s parents
$k \in \{1, \ldots, r_i\}$ ranges over the states of $X_i$
(with $\alpha_{ij} = \sum_k \alpha_{ijk}$ and $N_{ij} = \sum_k N_{ijk}$)

Note
The score is decomposable: it is the product of independent scores associated with the distribution of each node in the network.
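Exploiting this decomposability, here is a possible sketch of the full score as a sum of per-family log terms, assuming uniform pseudocounts $\alpha_{ijk} = \alpha$ and the same integer data encoding as in the earlier sketches (helper names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def family_counts(data, child, parents, card):
    """N_ijk laid out as a (parent configurations) x (child values) table."""
    n_cfg = max(1, int(np.prod([card[p] for p in parents])))
    counts = np.zeros((n_cfg, card[child]))
    for row in data:
        j = 0
        for p in parents:                       # index of the parent configuration
            j = j * card[p] + row[p]
        counts[j, row[child]] += 1
    return counts

def log_bd_score_network(data, structure, card, alpha=1.0):
    """log P_S(D) as a sum over nodes i and parent configurations j of
    log Gamma(a_ij)/Gamma(a_ij+N_ij) + sum_k log Gamma(a_ijk+N_ijk)/Gamma(a_ijk)."""
    total = 0.0
    for i, pa in structure.items():
        N = family_counts(data, i, pa, card)    # shape (q_i, r_i)
        A = np.full_like(N, alpha)              # alpha_ijk
        total += np.sum(gammaln(A.sum(axis=1)) - gammaln(A.sum(axis=1) + N.sum(axis=1)))
        total += np.sum(gammaln(A + N) - gammaln(A))
    return total
```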


Search strategy

Approach
Discrete search problem: NP-hard for nets whose nodes have at most k > 1 parents.
Heuristic search strategies are employed:
Search space: set of DAGs
Operators: add, remove, or reverse one arc
Initial structure: e.g. random, fully disconnected, ...
Strategies: hill climbing, best first, simulated annealing

Note
Decomposable scores allow recomputing only the local scores affected by a single move.
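A skeleton of such a hill-climbing search, as a sketch only: the parents-dictionary encoding of structures, the acyclicity check, and the reliance on an externally supplied score function (e.g. the log BD score sketched earlier) are all assumptions, and no caching of local scores is done although decomposability would allow it.

```python
import itertools

def is_dag(parents):
    """Acyclicity check for a structure given as dict: node -> set of parents."""
    color = {v: 0 for v in parents}          # 0 = unvisited, 1 = on stack, 2 = done
    def visit(v):
        color[v] = 1
        for p in parents[v]:
            if color[p] == 1 or (color[p] == 0 and not visit(p)):
                return False
        color[v] = 2
        return True
    return all(color[v] == 2 or visit(v) for v in parents)

def neighbours(parents):
    """Structures obtained by adding, removing, or reversing a single arc u -> v."""
    for u, v in itertools.permutations(parents, 2):
        s = {n: set(ps) for n, ps in parents.items()}
        if u in s[v]:
            s[v].discard(u)
            yield s                                       # remove u -> v
            rev = {n: set(ps) for n, ps in s.items()}
            rev[u].add(v)
            yield rev                                     # reverse u -> v
        else:
            s[v].add(u)
            yield s                                       # add u -> v

def hill_climb(score, data, nodes, max_steps=1000):
    """Greedy search: start fully disconnected, take the best improving move."""
    current = {v: set() for v in nodes}
    best = score(current, data)
    for _ in range(max_steps):
        scored = [(score(c, data), c) for c in neighbours(current) if is_dag(c)]
        if not scored:
            break
        new_best, cand = max(scored, key=lambda t: t[0])
        if new_best <= best:
            break
        current, best = cand, new_best
    return current, best
```

With the earlier sketches it could be called, for instance, as hill_climb(lambda s, d: log_bd_score_network(d, s, card), data, range(len(card))).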


Equivalence classes

Definition
Two Bayesian network structures for X are independence equivalent if they encode the same set of conditional-independence assumptions for X.

Verification
The undirected versions of S and S′ have the same structure (same skeleton).
S and S′ have the same v-structures.
[Figure: four structures S1, S2, S3, S4 over the variables A, B, C, all with skeleton A − B − C but with different edge orientations; S4 is the v-structure A → B ← C]

$S_1 \equiv S_2 \equiv S_3$: all encode that $C \perp A \mid B$.
$S_4 \not\equiv S_1$, as it encodes $C \not\perp A \mid B$.
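The verification criterion can be checked mechanically; below is a sketch using the same parents-dictionary encoding as the earlier structure-search sketch (and, following the usual definition, counting a v-structure only when the two parents are not themselves adjacent):

```python
def skeleton(parents):
    """Set of undirected edges of the structure."""
    return {frozenset((u, v)) for v, ps in parents.items() for u in ps}

def v_structures(parents):
    """Pairs of non-adjacent parents pointing at a common child: A -> B <- C."""
    skel = skeleton(parents)
    return {(frozenset((a, c)), b)
            for b, ps in parents.items()
            for a in ps for c in ps
            if a != c and frozenset((a, c)) not in skel}

def independence_equivalent(s1, s2):
    """Same skeleton and same v-structures."""
    return skeleton(s1) == skeleton(s2) and v_structures(s1) == v_structures(s2)
```

For the example above, a chain such as A → B → C (encoded as {'A': set(), 'B': {'A'}, 'C': {'B'}}) and its reversal are reported as equivalent, while the v-structure A → B ← C ({'A': set(), 'B': {'A', 'C'}, 'C': set()}) is not.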
