
CS B351: LEARNING PROBABILISTIC MODELS

MOTIVATION

Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net

Next few lectures: where does the Bayes net come from?

[Figures: three candidate Bayes nets for predicting Win?, of increasing complexity. The first uses only Strength and Opponent Strength. The second adds Offense strength, Defense strength, Opp. Off. Strength, and Opp. Def. Strength, with observed quantities Pass yds, Rush yds, Rush yds allowed, and Score allowed. The third further adds Strength of schedule, At Home?, Injuries?, and Opp. injuries?]

AGENDA

Learning probability distributions from example data

Influence of structure on performance

Maximum likelihood estimation (MLE)

Bayesian estimation

PROBABILISTIC ESTIMATION PROBLEM

Our setting: given a set of examples drawn from the target distribution. Each example is complete (fully observable).

Goal: produce some representation of a belief state so we can perform inferences & draw certain predictions.

DENSITY ESTIMATION

Given dataset D={d[1],…,d[M]} drawn from underlying distribution P*

Find a distribution that matches P* as "closely" as possible

High-level issues:

Usually, not enough data to get an accurate picture of P*, which forces us to approximate.

Even if we did have P*, how do we define "closeness" (both theoretically and in practice)?

How do we maximize "closeness"?

WHAT CLASS OF PROBABILITY MODELS?

For small discrete distributions, just use a tabular representation. Very efficient learning techniques exist.

For large discrete distributions or continuous ones, the choice of probability model is crucial. Increasing complexity =>

Can represent complex distributions more accurately

Need more data to learn well (risk of overfitting)

More expensive to learn and to perform inference

TWO LEARNING PROBLEMS

Parameter learning

What entries should be put into the model's probability tables?

Structure learning

Which variables should be represented / transformed for inclusion in the model?

What direct / indirect relationships between variables should be modeled?

A more "high level" problem. Once a structure is chosen, a set of (unestimated) parameters emerges; these need to be estimated using parameter learning.

LEARNING COIN FLIPS

Cherry and lime candies are in an opaque bag.

Observe that c out of N draws are cherries (data).

LEARNING COIN FLIPS

Observe that c out of N draws are cherries (data).

Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)

"Intuitive" parameter estimate: the empirical distribution P(cherry) ≈ c/N (this will be justified more thoroughly later).

STRUCTURE LEARNING EXAMPLE: HISTOGRAM BUCKET SIZES

Histograms are used to estimate distributions of continuous or large numbers of discrete values… but how fine should the buckets be?

[Figures: histograms of the same data over the range 0–200, built with four different bucket widths.]
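As an aside (my own sketch, not from the slides), NumPy's histogram function makes it easy to see how the bucket count changes the estimate; the sample data below is made up for illustration.

```python
import numpy as np

# Hypothetical sample from an unknown distribution over [0, 200)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(60, 15, 300), rng.normal(140, 25, 200)])

# Same data, three different bucket counts
for n_buckets in (4, 20, 100):
    counts, edges = np.histogram(data, bins=n_buckets, range=(0, 200))
    est = counts / counts.sum()            # empirical probability per bucket
    print(n_buckets, "buckets:", np.round(est, 2))
```

Coarse buckets give a smooth but crude estimate; very fine buckets track the sample closely but become noisy when each bucket holds few points.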

STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS

Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)

Case 1: 15 free parameters (16 entries − sum-to-1 constraint)

P(A,B,C,D) = p1
P(A,B,C,¬D) = p2
…
P(¬A,¬B,¬C,D) = p15
P(¬A,¬B,¬C,¬D) = 1 − p1 − … − p15

Case 2: 4 free parameters

P(A) = p1, P(¬A) = 1 − p1
…
P(D) = p4, P(¬D) = 1 − p4
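To make the counting concrete, here is a small sketch (my own illustration, not from the slides) comparing the free parameters of a full joint table over n binary variables with a fully factored model.

```python
def free_params_full_joint(n_binary_vars: int) -> int:
    # 2^n table entries, minus one for the sum-to-1 constraint
    return 2 ** n_binary_vars - 1

def free_params_fully_independent(n_binary_vars: int) -> int:
    # One Bernoulli parameter per variable
    return n_binary_vars

for n in (4, 10, 20):
    print(n, free_params_full_joint(n), free_params_fully_independent(n))
# n=4 gives 15 vs 4, matching the slide; n=20 gives 1,048,575 vs 20
```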

STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS

Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)

P(A,B,C,D): would be able to fit ALL relationships in the data.

P(A)P(B)P(C)P(D): inherently cannot accurately model correlations like A ~= B, leading to biased estimates: it overestimates or underestimates the true probabilities.

[Figures: bar charts comparing an original joint distribution P(X,Y) with the distribution learned under the independence assumption P(X)P(Y).]

STRUCTURE LEARNING: EXPRESSIVE POWER

Making more independence assumptions always makes a probabilistic model less expressive

If the independence relationships assumed by structure model A are a superset of those in structure B, then B can express any probability distribution that A can

[Figures: several candidate structures over variables X, Y, Z with different edge sets, and two candidate structures over a class C and features F1, F2, …, Fk — which one to choose?]

ARCS DO NOT NECESSARILY ENCODE CAUSALITY!

[Figure: two chain networks, A → B → C and C → B → A.]

Two BNs that can encode the same joint probability distribution.

READING OFF INDEPENDENCE RELATIONSHIPS

Given B, does the value of A affect the probability of C? That is, does P(C|B,A) = P(C|B)?

No, A does not affect C: C's parent (B) is given, and so C is independent of its non-descendants (A).

Independence is symmetric: C ⊥ A | B ⟹ A ⊥ C | B

[Figure: the chain A → B → C.]

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT.

[Figure: Model 1 treats X and Y as independent nodes; Model 2 adds an arc X → Y.]


Parameters estimated via the empirical distribution ("intuitive fit"):

Model 1: P(X=H) = 9/20, P(Y=H) = 8/20

Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Model 2's conditional estimates P(Y=H|X=H) and P(Y=H|X=T) are each based on fewer data points, so their errors are likely to be larger!
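A quick sketch (my own, not course code) of how these empirical estimates are computed for both models from the 20-flip dataset:

```python
from collections import Counter

# Dataset: 20 flips of two coins, as (X, Y) pairs
counts = Counter({("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6})
N = sum(counts.values())

# Model 1: X and Y independent -- marginal empirical estimates
p_x_h = sum(c for (x, _), c in counts.items() if x == "H") / N    # 9/20
p_y_h = sum(c for (_, y), c in counts.items() if y == "H") / N    # 8/20

# Model 2: X -> Y -- conditional estimates, one per parent value
n_x_h = sum(c for (x, _), c in counts.items() if x == "H")        # 9
n_x_t = N - n_x_h                                                 # 11
p_y_h_given_x_h = counts[("H", "H")] / n_x_h                      # 3/9
p_y_h_given_x_t = counts[("T", "H")] / n_x_t                      # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```

Note how each Model 2 conditional is estimated from only 9 or 11 flips rather than all 20, which is exactly why its estimates are noisier.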

STRUCTURE LEARNING: FIT VS COMPLEXITY

Must trade off fit of data vs. complexity of model

Complex models:

More parameters to learn

More expressive

More data fragmentation ⇒ greater sensitivity to noise


Typical approaches explore multiple structures while optimizing the trade-off between fit and complexity

Need a way of measuring “complexity” (e.g., number of edges, number of parameters) and “fit”

FURTHER READING ON STRUCTURE LEARNING

Structure learning with statistical independence testing

Score-based methods (e.g., Bayesian Information Criterion)

Bayesian methods with structure priors

Cross-validated model selection (more on this later)

STATISTICAL PARAMETER LEARNING

LEARNING COIN FLIPS

Observe that c out of N draws are cherries (data).

Let the unknown fraction of cherries be q (hypothesis).

Probability of drawing a cherry is q.

Assumption: draws are independent and identically distributed (i.i.d.).

LEARNING COIN FLIPS

Probability of drawing a cherry is q. Assumption: draws are independent and identically distributed (i.i.d.).

Probability of drawing 2 cherries is q·q = q^2

Probability of drawing 2 limes is (1−q)^2

Probability of drawing 1 cherry and 1 lime is q·(1−q)

LIKELIHOOD FUNCTION

Likelihood of data d = {d1, …, dN} given q:

P(d|q) = ∏_j P(dj|q) = q^c (1−q)^(N−c)

(The product form follows from the i.i.d. assumption; gather the c cherry terms together, then the N−c lime terms.)
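A small sketch (my own illustration) of this likelihood as a Python function, evaluated on a grid of q values:

```python
import numpy as np

def likelihood(q, c, N):
    """P(d | q) = q^c * (1 - q)^(N - c) under the i.i.d. assumption."""
    return q ** c * (1.0 - q) ** (N - c)

qs = np.linspace(0.0, 1.0, 101)
for c, N in [(1, 1), (2, 3), (10, 20), (50, 100)]:
    L = likelihood(qs, c, N)
    print(f"{c}/{N} cherries: likelihood peaks at q = {qs[np.argmax(L)]:.2f}")
```

The printed peaks track the observed fraction of cherries, which is the pattern the plots below illustrate.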

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1−q)^(N−c)

[Plots: P(data|q) as a function of q for observed draws of 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherries.]

MAXIMUM LIKELIHOOD

Peaks of the likelihood function seem to hover around the fraction of cherries…

Sharpness indicates some notion of certainty…

[Plot: P(data|q) as a function of q for 50/100 cherries.]

MAXIMUM LIKELIHOOD

P(d|q) is the likelihood function.

The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE).

MAXIMUM LIKELIHOOD

l(q) = log P(d|q) = log [ q^c (1−q)^(N−c) ]
     = log [ q^c ] + log [ (1−q)^(N−c) ]
     = c log q + (N−c) log (1−q)

Setting dl/dq(q) = 0 gives the maximum likelihood estimate:

dl/dq(q) = c/q − (N−c)/(1−q)

At the MLE, c/q − (N−c)/(1−q) = 0 ⟹ q = c/N
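As a sanity check (my own sketch, not from the slides), a tiny numerical maximization of the log-likelihood confirms the closed-form result q = c/N:

```python
import numpy as np

def log_likelihood(q, c, N):
    # l(q) = c log q + (N - c) log(1 - q)
    return c * np.log(q) + (N - c) * np.log(1.0 - q)

c, N = 7, 20
qs = np.linspace(0.001, 0.999, 9999)   # avoid log(0) at the endpoints
q_hat = qs[np.argmax(log_likelihood(qs, c, N))]
print(q_hat, c / N)                    # both approximately 0.35
```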

OTHER MLE RESULTS

Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (a histogram).

Continuous Gaussian distributions: mean = average of the data; standard deviation = standard deviation of the data.
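A brief sketch (my own illustration, with made-up data) of these MLE recipes:

```python
import numpy as np
from collections import Counter

# Categorical MLE: fraction of counts for each observed value
draws = ["cherry", "lime", "cherry", "cherry", "lime"]
counts = Counter(draws)
categorical_mle = {v: n / len(draws) for v, n in counts.items()}
print(categorical_mle)          # {'cherry': 0.6, 'lime': 0.4}

# Gaussian MLE: sample mean and (1/N, i.e. biased) standard deviation
x = np.array([2.1, 1.9, 2.4, 2.0, 1.8])
mu_mle = x.mean()
sigma_mle = x.std()             # np.std defaults to ddof=0, which is the MLE
print(mu_mle, sigma_mle)
```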

AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION

P(q|d) = 1/Z · P(d|q) P(q) is the posterior: the distribution of hypotheses given the data.

P(d|q) is the likelihood.

P(q) is the hypothesis prior.

[Figure: parameter node q is the parent of each data node d[1], d[2], …, d[M].]

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION

Assume P(q) is uniform. Then P(q|d) = 1/Z · P(d|q) = 1/Z · q^c (1−q)^(N−c)

What's P(Y|D), the probability that the next draw Y is a cherry?

[Figure: parameter node q is the parent of the data nodes d[1], d[2], …, d[M] and of the next draw Y.]


ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION

⟹ Z = ∫₀¹ q^c (1−q)^(N−c) dq = c! (N−c)! / (N+1)!

⟹ P(Y|D) = ∫₀¹ q · P(q|d) dq = 1/Z · (c+1)! (N−c)! / (N+2)! = (c+1) / (N+2)

Can think of this as a "correction" using "virtual counts".
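A minimal sketch (my own) of this "virtual count" prediction rule, often called Laplace's rule of succession, compared with the raw empirical estimate:

```python
def mle_estimate(c, N):
    # Empirical / maximum likelihood estimate
    return c / N

def uniform_prior_prediction(c, N):
    # Bayesian prediction with a uniform prior: (c + 1) / (N + 2),
    # i.e. one "virtual" cherry and one "virtual" lime added to the counts
    return (c + 1) / (N + 2)

for c, N in [(0, 1), (1, 1), (3, 10), (50, 100)]:
    print(c, N, mle_estimate(c, N), uniform_prior_prediction(c, N))
# With little data the two differ a lot (e.g. 0/1 -> 0.0 vs 0.33);
# with more data the correction becomes negligible.
```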

NONUNIFORM PRIORS

P(q|d) ∝ P(d|q) P(q) = q^c (1−q)^(N−c) P(q)

Define, for all q in [0, 1], the probability that I believe in q.

[Plot: an example prior density P(q) over q from 0 to 1.]

BETA DISTRIBUTION

Beta_{a,b}(q) = g · q^(a−1) (1−q)^(b−1)

a, b are hyperparameters > 0

g is a normalization constant

a = b = 1 gives the uniform distribution

POSTERIOR WITH BETA PRIOR

Posterior ∝ q^c (1−q)^(N−c) P(q) = g · q^(c+a−1) (1−q)^(N−c+b−1) = Beta_{a+c, b+N−c}(q)

Prediction = posterior mean: E[q] = (c+a) / (N+a+b)

POSTERIOR WITH BETA PRIOR

What does this mean?

The prior specifies a "virtual count" of a−1 heads and b−1 tails.

See heads ⟹ increment a; see tails ⟹ increment b.

The effect of the prior diminishes with more data.
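A small sketch (my own, assuming scipy.stats is available) of the Beta-prior update and the resulting prediction:

```python
from scipy.stats import beta

a, b = 2.0, 2.0       # prior Beta(a, b): one virtual head and one virtual tail
c, N = 7, 20          # observed: 7 heads in 20 flips

# Posterior is Beta(a + c, b + N - c); the prediction is its mean (c+a)/(N+a+b)
post = beta(a + c, b + N - c)
print(post.mean())                # 0.375 == (7 + 2) / (20 + 2 + 2)
print((c + a) / (N + a + b))      # same value from the closed form
```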

CHOOSING A PRIOR

Part of the design process; must be chosen according to your intuition

Uninformed belief ⟹ a = b = 1; strong belief ⟹ a, b high.

EXTENSIONS OF BETA PRIORS

Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior.

The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts".

[Figures: four histogram estimates over the range 0–200, labeled 0, 1, 5, and 10.]

RECAP

Learning probabilistic models

Parameter vs. structure learning

Single-parameter learning via coin flips

Maximum likelihood

Bayesian learning with a Beta prior

MAXIMUM LIKELIHOOD FOR BN

For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matching parent values.

[Figure: network with Earthquake and Burglar as parents of Alarm.]

Data: N = 1000 examples, with E observed 500 times and B observed 200 times.

P(E) = 0.5    P(B) = 0.2

A|E,B: 19/20    A|¬E,B: 188/200    A|E,¬B: 170/500    A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.95
T F 0.34
F F 0.003

FITTING CPTS

Each ML entry P(x_i | pa_{X_i}) is given by examining the counts of (x_i, pa_{X_i}) in D and normalizing across rows of the CPT.

Note that for large k = |Pa_{X_i}|, very few datapoints will share the values of pa_{X_i}: on the order of |D|/2^k on average, and some parent configurations may be even rarer. Large domains |Val(X_i)| can also be a problem. This is data fragmentation.
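A compact sketch (my own illustration, not course code) of fitting one CPT by maximum likelihood: count (parent values, child value) pairs and normalize each row. The dataset counts below are made up for demonstration.

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """ML estimate of P(child | parents) from a list of dict-valued examples."""
    joint = Counter()             # counts of (parent values, child value)
    parent_totals = Counter()     # counts of parent values alone
    for d in data:
        pa = tuple(d[p] for p in parents)
        joint[(pa, d[child])] += 1
        parent_totals[pa] += 1
    # Normalize each row of the CPT
    return {(pa, x): n / parent_totals[pa] for (pa, x), n in joint.items()}

# Tiny made-up dataset over E(arthquake), B(urglar), A(larm)
data = [{"E": e, "B": b, "A": a}
        for e, b, a, n in [(1, 1, 1, 19), (1, 1, 0, 1), (0, 1, 1, 9), (0, 1, 0, 1),
                           (1, 0, 1, 17), (1, 0, 0, 33), (0, 0, 1, 0), (0, 0, 0, 20)]
        for _ in range(n)]
cpt = fit_cpt(data, "A", ["E", "B"])
print(cpt[((1, 1), 1)])   # P(A=1 | E=1, B=1) = 19/20

# Note that the (E=1, B=1) row is estimated from only 20 of the 100 examples:
# this is the data fragmentation problem in miniature.
```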
