
Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.


Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

- Bayesian Estimation (BE)
- Bayesian Parameter Estimation: Gaussian Case
- Bayesian Parameter Estimation: General Estimation
- Problems of Dimensionality
- Computational Complexity
- Component Analysis and Discriminants
- Hidden Markov Models


Bayesian Estimation (Bayesian learning applied to pattern classification problems)

- In MLE, θ was assumed to be fixed.
- In BE, θ is a random variable.
- The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification.
- Goal: compute P(ωi | x, D). Given the sample D, Bayes formula can be written:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D)\, P(\omega_j \mid D)}$$


To demonstrate the preceding equation, use:

$$P(\mathbf{x}, \omega_i \mid D) = P(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D) \qquad \text{(from the definition of conditional probability)}$$

$$P(\mathbf{x} \mid D) = \sum_{j} P(\mathbf{x}, \omega_j \mid D) \qquad \text{(from the law of total probability)}$$

$$P(\omega_i) = P(\omega_i \mid D) \qquad \text{(the training sample provides this!)}$$

We assume that samples in different classes are independent; thus:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}$$
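As a concrete illustration of this classification rule (a minimal sketch, not from the text: the per-class densities, parameters, and priors below are hypothetical), the posteriors can be computed directly once p(x | ωj, Dj) and P(ωj) are available:

```python
from scipy.stats import norm

# Hypothetical per-class parameters, as if estimated from each class's own sample D_j
# (univariate Gaussian class-conditional densities assumed for simplicity)
class_params = {0: (0.0, 1.0), 1: (2.5, 1.0)}   # omega_j -> (mean, std)
priors       = {0: 0.4,        1: 0.6}          # P(omega_j)

def posteriors(x):
    """P(omega_j | x, D) via Bayes formula with the per-class densities."""
    likelihoods = {j: norm.pdf(x, mu, sd) for j, (mu, sd) in class_params.items()}
    evidence = sum(likelihoods[j] * priors[j] for j in class_params)
    return {j: likelihoods[j] * priors[j] / evidence for j in class_params}

print(posteriors(1.0))   # posterior probabilities summing to 1
```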


Bayesian Parameter Estimation: Gaussian Case

Goal: Estimate θ using the a-posteriori density P(θ | D)

The univariate case: p(μ | D), where μ is the only unknown parameter

(μ0 and σ0 are known!)

$$p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$


But we know, from Bayes formula, that

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

and that $p(x_k \mid \mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma_0^2)$ (from eq. 29, page 93).

Plugging in their Gaussian expressions and extracting out the factors that do not depend on μ yields:

$$p(\mu \mid D) = \alpha \exp\left[ -\frac{1}{2}\left( \left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu \right) \right]$$


Observation: p(μ|D) is an exponential of a quadratic

It is again normal! It is called a reproducing density

$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$


Identifying the coefficients in the quadratic-exponential form of p(μ | D) above with those of the generic Gaussian

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[ -\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2 \right]$$

yields expressions for μn and σn²:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.


Solving for μn and σn² yields:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

From these equations we see that as n increases:
- the variance σn² decreases monotonically
- the estimate p(μ | D) becomes more and more peaked
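A minimal numerical sketch of these updates (the prior mean μ0, prior variance σ0², known variance σ², and the sample values below are arbitrary choices, not from the text):

```python
import numpy as np

def bayes_update_gaussian_mean(samples, mu0, sigma0_sq, sigma_sq):
    """Posterior p(mu | D) = N(mu_n, sigma_n^2) for a Gaussian with known variance."""
    n = len(samples)
    mu_hat = np.mean(samples)                      # sample mean, mu_hat_n
    mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * mu_hat \
         + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

# Arbitrary illustration: true mean 1.0, known sigma^2 = 1, broad prior N(0, 10)
rng = np.random.default_rng(0)
samples = rng.normal(1.0, 1.0, size=50)
print(bayes_update_gaussian_mean(samples, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
# sigma_n^2 shrinks roughly like sigma^2 / n as n grows, so p(mu | D) becomes more peaked
```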



The univariate case: p(x | D)

- p(μ | D) has been computed (in the preceding discussion); p(x | D) remains to be computed!
- It provides:

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian:} \quad p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$

- This is the desired class-conditional density p(x | Dj, ωj): we know σ² and how to compute μn and σn².
- Therefore, using p(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j} P(\omega_j \mid x, D_j) \;\equiv\; \max_{\omega_j} \left[ p(x \mid \omega_j, D_j)\, P(\omega_j) \right]$$
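A small sketch of this rule (a hypothetical two-class setup: each class's μn and σn² are assumed to come from the updates in the preceding section, and σ² is assumed known):

```python
import numpy as np
from scipy.stats import norm

def predictive_density(x, mu_n, sigma_n_sq, sigma_sq):
    """p(x | D) ~ N(mu_n, sigma^2 + sigma_n^2): Gaussian predictive density."""
    return norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

def classify(x, class_posteriors, priors, sigma_sq):
    """Pick the class maximizing p(x | omega_j, D_j) * P(omega_j)."""
    scores = {j: predictive_density(x, mu_n, s_n_sq, sigma_sq) * priors[j]
              for j, (mu_n, s_n_sq) in class_posteriors.items()}
    return max(scores, key=scores.get)

# Hypothetical per-class posterior parameters (mu_n, sigma_n^2) and priors
class_posteriors = {0: (0.1, 0.02), 1: (2.0, 0.05)}
priors = {0: 0.5, 1: 0.5}
print(classify(1.2, class_posteriors, priors, sigma_sq=1.0))
```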


3.5 Bayesian Parameter Estimation: General Theory

The computation of p(x | D) can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:

- The form of p(x | θ) is assumed known, but the value of θ is not known exactly.
- Our knowledge about θ is assumed to be contained in a known prior density p(θ).
- The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x).


The basic problem is: “compute the posterior density p(θ | D),” then “derive p(x | D),” where

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

Using Bayes formula, we have:

$$p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$

And by the independence assumption:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
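As an illustration of these three formulas (not from the text), here is a brute-force sketch that evaluates the posterior and the predictive density on a grid of θ values, assuming a one-dimensional θ, namely the mean of a Gaussian with known σ = 1, and an arbitrary prior:

```python
import numpy as np
from scipy.stats import norm

# Grid over the unknown parameter theta (here: the mean of a Gaussian with known sigma = 1)
theta_grid = np.linspace(-5.0, 5.0, 2001)
dtheta = theta_grid[1] - theta_grid[0]
prior = norm.pdf(theta_grid, 0.0, 2.0)              # p(theta), an arbitrary broad prior

samples = np.array([0.8, 1.3, 0.5, 1.1])            # the sample set D
# p(D | theta) = prod_k p(x_k | theta)              (independence assumption)
likelihood = np.prod(norm.pdf(samples[:, None], theta_grid[None, :], 1.0), axis=0)

# p(theta | D) = p(D | theta) p(theta) / normalizing integral
posterior = likelihood * prior
posterior /= (posterior * dtheta).sum()

# p(x | D) = integral of p(x | theta) p(theta | D) d(theta), at one test point x
x = 1.0
p_x_given_D = (norm.pdf(x, theta_grid, 1.0) * posterior * dtheta).sum()
print(p_x_given_D)
```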


Convergence (from notes)


Problems of Dimensionality

- Problems involving 50 or 100 features are common (usually binary valued).
- Note: microarray data might entail ~20,000 real-valued features.
- Classification accuracy depends on:
  - dimensionality
  - amount of training data
  - discrete vs. continuous features


Case of two-class multivariate normal distributions with the same covariance

- P(x | ωj) ~ N(μj, Σ), j = 1, 2
- Statistically independent features
- If the priors are equal, then:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du \qquad \text{(Bayes error)}$$

where

$$r^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t\, \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

is the squared Mahalanobis distance, and

$$\lim_{r \to \infty} P(\text{error}) = 0$$
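A small numerical sketch of this relationship (the means and covariance below are arbitrary; the error integral is a Gaussian tail, so it can be written with the standard normal survival function):

```python
import numpy as np
from scipy.stats import norm

def bayes_error_equal_priors(mu1, mu2, cov):
    """P(error) = 1/sqrt(2*pi) * integral from r/2 to infinity of exp(-u^2/2) du
    for two equal-prior Gaussian classes with a shared covariance."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    r = np.sqrt(diff @ np.linalg.inv(cov) @ diff)   # Mahalanobis distance
    return norm.sf(r / 2.0)                          # Gaussian upper-tail probability

mu1, mu2 = [0.0, 0.0], [2.0, 1.0]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
print(bayes_error_equal_priors(mu1, mu2, cov))   # error shrinks as the means separate
```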


If the features are conditionally independent, then:

$$\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) \qquad \text{and} \qquad r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

- Do we remember what conditional independence is?
- Example for binary features: let pi = Pr[xi = 1 | ω1]; then P(x | ω1) is the product of the pi.
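A hedged sketch of the binary-feature example (note: with conditionally independent binary features the full class-conditional probability is the product of per-feature Bernoulli terms p_i^{x_i}(1 − p_i)^{1 − x_i}; the p_i values below are hypothetical):

```python
import numpy as np

def binary_class_conditional(x, p):
    """P(x | omega_1) = prod_i p_i^{x_i} (1 - p_i)^{1 - x_i} under conditional independence."""
    x, p = np.asarray(x), np.asarray(p)
    return np.prod(np.where(x == 1, p, 1.0 - p))

p = np.array([0.9, 0.2, 0.7])                   # hypothetical p_i = Pr[x_i = 1 | omega_1]
print(binary_class_conditional([1, 0, 1], p))   # 0.9 * 0.8 * 0.7 = 0.504
```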


- The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
- This doesn't require independence.
- Adding independent features helps increase r and thus reduce the error.
- Caution: adding features increases the cost and complexity of both the feature extractor and the classifier.
- It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance. Possible reasons:
  - we have the wrong model!
  - we don't have enough training data to support the additional dimensions



Computational Complexity

Our design methodology is affected by the computational difficulty

“big oh” notation: f(x) = O(h(x)), read “big oh of h(x)”, if

$$\exists\, (c_0, x_0) \in \mathbb{R}^2 \ \text{ such that } \ f(x) \le c_0\, h(x) \ \text{ for all } x > x_0$$

(an upper bound on f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x², g(x) = x², and f(x) = O(x²).


“big oh” is not unique: f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴).

“big theta” notation: f(x) = Θ(h(x)) if

$$\exists\, (x_0, c_1, c_2) \in \mathbb{R}^3 \ \text{ such that } \ \forall\, x > x_0, \quad 0 \le c_1\, g(x) \le f(x) \le c_2\, g(x)$$

For the example above, f(x) = Θ(x²) but f(x) ≠ Θ(x³).


Complexity of the ML Estimation

- Gaussian priors in d dimensions, with n training samples for each of c classes.
- For each category, we have to compute the discriminant function

$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \hat{\boldsymbol{\mu}})^t\, \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)$$

  (estimating μ̂ from the samples costs O(dn); estimating Σ̂ costs O(nd²) and dominates).
- Total = O(d²n); total for c classes = O(cd²n) ≅ O(d²n).
- The cost increases when d and n are large!
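A compact sketch of the ML plug-in discriminant (the training data below are arbitrary; the variable names mirror the formula above):

```python
import numpy as np

def ml_discriminant(X_train, prior):
    """Build g(x) from ML estimates: sample mean, sample covariance, and log prior."""
    mu_hat = X_train.mean(axis=0)                       # O(dn)
    sigma_hat = np.cov(X_train, rowvar=False)           # O(n d^2), dominates
    sigma_inv = np.linalg.inv(sigma_hat)
    _, logdet = np.linalg.slogdet(sigma_hat)
    d = X_train.shape[1]

    def g(x):
        diff = x - mu_hat
        return (-0.5 * diff @ sigma_inv @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * logdet
                + np.log(prior))
    return g

rng = np.random.default_rng(1)
g0 = ml_discriminant(rng.normal(0.0, 1.0, size=(200, 3)), prior=0.5)
g1 = ml_discriminant(rng.normal(2.0, 1.0, size=(200, 3)), prior=0.5)
x = np.array([1.8, 2.1, 1.9])
print(0 if g0(x) > g1(x) else 1)      # pick the class with the larger discriminant
```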


Overfitting

- Dimensionality of the model vs. the size of the training data.
- Issue: not enough data to support the model.
- Possible solutions:
  - Reduce the model dimensionality
  - Make (possibly incorrect) assumptions to better estimate Σ


Overfitting: estimating a better Σ

- Use data pooled from all classes (normalization issues).
- Use the pseudo-Bayesian form λΣ0 + (1 − λ)Σn.
- “Doctor” Σ by thresholding entries (reduces chance correlations).
- Assume statistical independence: zero all off-diagonal elements.


Shrinkage

- Shrinkage: a weighted combination of the common and individual covariances:

$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\, n_i \boldsymbol{\Sigma}_i + \alpha\, n \boldsymbol{\Sigma}}{(1-\alpha)\, n_i + \alpha\, n} \qquad \text{for } 0 < \alpha < 1$$

- We can also shrink the estimated common covariance toward the identity matrix:

$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\, \boldsymbol{\Sigma} + \beta\, \mathbf{I} \qquad \text{for } 0 < \beta < 1$$
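A minimal sketch of both shrinkage formulas (the per-class covariance, common covariance, and sample counts below are arbitrary):

```python
import numpy as np

def shrink_class_covariance(sigma_i, n_i, sigma_common, n, alpha):
    """Sigma_i(alpha): weighted combination of individual and common covariances."""
    return (((1 - alpha) * n_i * sigma_i + alpha * n * sigma_common)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_toward_identity(sigma, beta):
    """Sigma(beta): shrink the (common) covariance toward the identity matrix."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])

# Arbitrary illustration
sigma_i = np.array([[2.0, 0.8], [0.8, 1.5]])
sigma_common = np.array([[1.0, 0.1], [0.1, 1.0]])
print(shrink_class_covariance(sigma_i, n_i=20, sigma_common=sigma_common, n=200, alpha=0.3))
print(shrink_toward_identity(sigma_common, beta=0.2))
```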


Component Analysis and Discriminants

- Combine features in order to reduce the dimension of the feature space.
- Linear combinations are simple to compute and tractable.
- Project high-dimensional data onto a lower-dimensional space.
- Two classical approaches for finding an “optimal” linear transformation (see the sketch below):
  - PCA (Principal Component Analysis): “the projection that best represents the data in a least-squares sense”
  - MDA (Multiple Discriminant Analysis): “the projection that best separates the data in a least-squares sense”
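A brief PCA sketch (not from the slides), showing the least-squares projection via the top eigenvectors of the covariance matrix of the centered data:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k principal components (top eigenvectors
    of the covariance matrix of the centered data)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]      # k largest
    return X_centered @ top_k

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
print(pca_project(X, k=2).shape)   # (100, 2): the best 2-D representation in a least-squares sense
```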


PCA (from notes)


Hidden Markov Models: Markov Chains

- Goal: make a sequence of decisions.
- Processes that unfold in time; the state at time t is influenced by the state at time t − 1.
- Applications: speech recognition, gesture recognition, part-of-speech tagging, and DNA sequencing.
- Any temporal process without memory: ωT = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states. We might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}.
- The system can revisit a state at different steps, and not every state needs to be visited.


First-order Markov models

The production of any sequence is described by the transition probabilities:

$$P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}$$


- θ = (aij, ωT). For the example sequence ω6 above:

$$P(\omega^6 \mid \boldsymbol{\theta}) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14} \cdot P(\omega(1) = \omega_1)$$

- Example: speech recognition (“production of spoken words”).
- Production of the word “pattern”, represented by phonemes: /p/ /a/ /tt/ /er/ /n/ // (// = silent state).
- Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to a silent state.
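A small sketch of this product (the transition matrix and initial distribution below are hypothetical; the sequence is the ω6 example from the earlier slide, written with 0-based state indices):

```python
import numpy as np

def sequence_probability(seq, A, pi):
    """P(omega^T | theta) = P(omega(1)) * product of transition probabilities a_ij."""
    prob = pi[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        prob *= A[prev, cur]
    return prob

# Hypothetical 4-state transition matrix (rows sum to 1) and initial distribution
A = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.1, 0.3, 0.2],
              [0.2, 0.4, 0.3, 0.1],
              [0.3, 0.3, 0.2, 0.2]])
pi = np.array([0.25, 0.25, 0.25, 0.25])

# omega^6 = {omega_1, omega_4, omega_2, omega_2, omega_1, omega_4} -> 0-based indices
seq = [0, 3, 1, 1, 0, 3]
print(sequence_probability(seq, A, pi))   # = pi[0] * a14 * a42 * a22 * a21 * a14 (1-based labels)
```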