Parameter Estimation
Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University


TRANSCRIPT

Page 1: Parameter Estimation
Shyh-Kang Jeng, Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University

Page 2: Typical Classification Problem
Rarely know the complete probabilistic structure of the problem
Have vague, general knowledge
Have a number of design samples or training data as representatives of patterns for classification
Find some way to use this information to design or train the classifier

Page 3: Estimating Probabilities
Not difficult to estimate prior probabilities
Hard to estimate class-conditional densities
– Number of available samples always seems too small
– Serious when dimensionality is large

Page 4: Estimating Parameters
Many problems permit us to parameterize the conditional densities
This simplifies the problem from estimating an unknown function to estimating the parameters
– e.g., the mean vector and covariance matrix for a multivariate normal distribution

Page 5: Maximum-Likelihood Estimation
View the parameters as quantities whose values are fixed but unknown
The best estimate is the one that maximizes the probability of obtaining the samples actually observed
Nearly always has good convergence properties as the number of samples increases
Often simpler than alternative methods

Page 6: I. I. D. Random Variables
Separate the data into D_1, ..., D_c
Samples in D_j are drawn independently according to p(x|ω_j)
Such samples are independent and identically distributed (i.i.d.) random variables
Let p(x|ω_j) have a known parametric form, determined uniquely by a parameter vector θ_j, i.e., p(x|ω_j) = p(x|ω_j, θ_j)

Page 7: Simplification Assumptions
Samples in D_i give no information about θ_j if i is not equal to j
Can work with each class separately
Have c separate problems of the same form:
– Use a set D of i.i.d. samples from p(x|θ) to estimate the unknown parameter vector θ

Page 8: Maximum-Likelihood Estimate

Let D contain n i.i.d. samples x_1, ..., x_n. The likelihood of θ with respect to D is

  p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)

The maximum-likelihood estimate \hat{\theta} is the value that maximizes p(D|\theta).

Page 9: Maximum-Likelihood Estimation (figure)

Page 10: A Note
The likelihood p(D|θ) as a function of θ is not a probability density function of θ
Its area over the θ-domain has no significance
The likelihood p(D|θ) can be regarded as the probability of D for a given θ

Page 11: Analytical Approach

Log-likelihood function:

  l(\theta) = \ln p(D|\theta), \qquad \hat{\theta} = \arg\max_{\theta} l(\theta)

  l(\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta), \qquad \nabla_{\theta} l = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k|\theta)

where \theta = (\theta_1, \ldots, \theta_p)^t and \nabla_{\theta} = [\partial/\partial\theta_1, \ldots, \partial/\partial\theta_p]^t.

Necessary condition for \hat{\theta}: \nabla_{\theta} l = 0

Page 12: MAP Estimators

A maximum a posteriori (MAP) estimator finds the \theta that maximizes l(\theta) + \ln p(\theta), where p(\theta) is the prior probability of different parameter values.

The maximum-likelihood (ML) estimator is a MAP estimator for the uniform prior.

Page 13: Gaussian Case: Unknown μ

  \ln p(x_k|\mu) = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1} (x_k - \mu)

  \nabla_{\mu} \ln p(x_k|\mu) = \Sigma^{-1}(x_k - \mu)

Setting \sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0 gives

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k

Page 14: Univariate Gaussian Case: Unknown μ and σ²

  \theta = (\theta_1, \theta_2)^t = (\mu, \sigma^2)^t

  \ln p(x_k|\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2

  \nabla_{\theta} l = \sum_{k=1}^{n} \begin{bmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix} = 0

  \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2

Page 15: Multivariate Gaussian Case: Unknown μ and Σ

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t
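
As a concrete illustration, here is a minimal NumPy sketch of these two ML estimates; the function and variable names are my own, not from the slides:

```python
import numpy as np

def gaussian_ml_estimates(X):
    """ML estimates for a multivariate Gaussian; X is an (n, d) sample matrix."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                  # (1/n) sum_k x_k
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / n    # (1/n) sum_k (x_k - mu)(x_k - mu)^t
    return mu_hat, sigma_hat

# The estimates approach the true parameters as n grows
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=5000)
mu_hat, sigma_hat = gaussian_ml_estimates(X)
print(mu_hat)
print(sigma_hat)
```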

Page 16: Bias, Absolutely Unbiased, and Asymptotically Unbiased

  E\left[\hat{\sigma}^2\right] = \frac{n-1}{n}\sigma^2 \ne \sigma^2

so the ML estimator for σ² is a biased estimate.

An (absolutely) unbiased estimator for the covariance matrix:

  C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t

Since \hat{\Sigma} = \frac{n-1}{n} C, the ML estimator of Σ is asymptotically unbiased.
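
A quick numeric check of this bias, using NumPy's ddof argument to switch between the 1/n (ML) and 1/(n-1) (unbiased) divisors; the experiment setup is my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.normal(loc=0.0, scale=2.0, size=(100000, n))  # many size-5 samples, true sigma^2 = 4

sigma2_ml = x.var(axis=1, ddof=0).mean()        # divides by n: mean near (n-1)/n * 4 = 3.2
sigma2_unbiased = x.var(axis=1, ddof=1).mean()  # divides by n-1: mean near 4.0
print(sigma2_ml, sigma2_unbiased)
```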

Page 17: Model Error
For a reliable model, the ML classifier can give excellent results
If the model is wrong, the ML classifier cannot get the best results, even within the assumed set of models

Page 18: Bayesian Estimation (Bayesian Learning)
The answers obtained are in general nearly identical to those obtained by maximum likelihood
Basic conceptual difference:
– The parameter vector θ is a random variable
– Use the training data to convert a distribution on this variable into a posterior probability density

Page 19: Central Problem

Given the sample D,

  P(\omega_i|x, D) = \frac{p(x|\omega_i, D)\, P(\omega_i|D)}{\sum_{j=1}^{c} p(x|\omega_j, D)\, P(\omega_j|D)}

Assume the prior probabilities are easy to find: P(\omega_i|D) = P(\omega_i)

Let D be separated into D_1, ..., D_c. Samples in D_i do not affect θ_j if i ≠ j, so

  p(x|\omega_i, D) = p(x|\omega_i, D_i)

and each class can be treated independently.

Central problem of Bayesian learning: use a set D of samples drawn independently according to the fixed but unknown p(x) to determine p(x|D).

Page 20: Parameter Distribution
Assume p(x) has a known parametric form with parameter vector θ of unknown value
Thus p(x|θ) is completely known
Information about θ prior to observing the samples is contained in a known prior density p(θ)
Observations convert p(θ) to p(θ|D), which should be sharply peaked about the true value of θ

Page 21: Parameter Distribution

  p(x|D) = \int p(x, \theta|D)\, d\theta

  p(x, \theta|D) = p(x|\theta, D)\, p(\theta|D) = p(x|\theta)\, p(\theta|D)

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

If p(\theta|D) peaks very sharply about some \hat{\theta}:

  p(x|D) \approx p(x|\hat{\theta})

Page 22: Univariate Gaussian Case: p(μ|D)

  p(x|\mu) \sim N(\mu, \sigma^2), with μ the only unknown

Assume p(\mu) \sim N(\mu_0, \sigma_0^2), with \mu_0 and \sigma_0^2 known
(\mu_0: best guess of μ; \sigma_0^2: uncertainty about this guess)

Given D = \{x_1, \ldots, x_n\}:

  p(\mu|D) = \frac{p(D|\mu)\, p(\mu)}{\int p(D|\mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu)

  = \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu - x_k}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right]

  = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

Page 23: Reproducing Density

  p(\mu|D) \sim N(\mu_n, \sigma_n^2) [reproducing density; cf. p(\mu): conjugate prior]

  \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k

  \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}
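
A small sketch of this closed-form posterior update under the stated assumptions (known σ², Gaussian prior on μ); the names are my own:

```python
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of N(mu, sigma2) with prior N(mu0, sigma0_2)."""
    n = len(x)
    mu_hat = np.mean(x)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=50)   # true mu = 3, known sigma^2 = 1
mu_n, sigma_n2 = bayes_update_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0)
print(mu_n, sigma_n2)   # mean near 3; variance shrinks roughly like sigma^2/n
```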

Page 24: Bayesian Learning (figure)

Page 25: Dogmatism

μ_n is a linear combination of \hat{\mu}_n and μ_0, and always lies somewhere between them.
The relative balance between prior knowledge and empirical data is set by the ratio of σ² to σ_0² (the dogmatism).
When the dogmatism is finite, μ_n will converge to \hat{\mu}_n as n grows, no matter what μ_0 and σ_0² are.

Page 26: Univariate Gaussian Case: p(x|D)

  p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu

  = \frac{1}{2\pi\sigma\sigma_n}\exp\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right] f(\sigma, \sigma_n)

  f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right] d\mu

  p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)

Page 27: Multivariate Gaussian Case

  p(x|\mu) \sim N(\mu, \Sigma), \qquad p(\mu) \sim N(\mu_0, \Sigma_0)

  p(\mu|D) = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu) = \alpha' \exp\left[-\frac{1}{2}\left(\mu^t\left(n\Sigma^{-1}+\Sigma_0^{-1}\right)\mu - 2\mu^t\left(\Sigma^{-1}\, n\hat{\mu}_n + \Sigma_0^{-1}\mu_0\right)\right)\right]

with \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k.

Page 28: Multivariate Gaussian Case

  p(\mu|D) \sim N(\mu_n, \Sigma_n):

  \mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0

  \Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma

[using the identity A(A+B)^{-1}B = B(A+B)^{-1}A]

  p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu \sim N(\mu_n, \Sigma + \Sigma_n)

or, by letting y \sim N(0, \Sigma) and x = \mu + y with p(\mu|D) \sim N(\mu_n, \Sigma_n), again p(x|D) \sim N(\mu_n, \Sigma + \Sigma_n).

Page 29: Multivariate Bayesian Learning (figure)

Page 30: General Bayesian Estimation

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

  p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{\int p(D|\theta)\, p(\theta)\, d\theta}

  p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)

Page 31: Recursive Bayesian Learning

  D^n = \{x_1, \ldots, x_n\}, \qquad p(D^n|\theta) = p(x_n|\theta)\, p(D^{n-1}|\theta)

  p(\theta|D^n) = \frac{p(D^n|\theta)\, p(\theta)}{\int p(D^n|\theta)\, p(\theta)\, d\theta} = \frac{p(x_n|\theta)\, p(D^{n-1}|\theta)\, p(\theta)}{\int p(x_n|\theta)\, p(D^{n-1}|\theta)\, p(\theta)\, d\theta}

  = \frac{p(x_n|\theta)\, p(\theta|D^{n-1})}{\int p(x_n|\theta)\, p(\theta|D^{n-1})\, d\theta}, \qquad p(\theta|D^0) = p(\theta)

Page 32: Example 1: Recursive Bayes Learning

  p(x|\theta) \sim U(0, \theta): \; 1/\theta \text{ for } 0 \le x \le \theta, \; 0 \text{ otherwise}

  p(\theta) \sim U(0, 10), \qquad D = \{4, 7, 2, 8\}

  p(\theta|D^0) = p(\theta) \sim U(0, 10)

  p(\theta|D^1) \propto p(x_1|\theta)\, p(\theta|D^0) = 1/\theta \text{ for } 4 \le \theta \le 10, \; 0 \text{ otherwise}

  p(\theta|D^2) \propto p(x_2|\theta)\, p(\theta|D^1) = 1/\theta^2 \text{ for } 7 \le \theta \le 10, \; 0 \text{ otherwise}

  p(\theta|D^n) \propto 1/\theta^n \text{ for } \max_k x_k \le \theta \le 10
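
A grid-based sketch of this recursive update for Example 1; the discretization is my own illustration, not part of the slides:

```python
import numpy as np

# Grid-based recursive Bayes for Example 1: p(x|theta) = U(0, theta), prior U(0, 10)
theta = np.linspace(0.01, 10.0, 1000)       # discretized parameter axis
posterior = np.ones_like(theta)             # p(theta|D^0): uniform prior

for x in [4, 7, 2, 8]:                      # the sample D = {4, 7, 2, 8}
    likelihood = np.where(theta >= x, 1.0 / theta, 0.0)
    posterior *= likelihood                 # p(theta|D^n) ~ p(x_n|theta) p(theta|D^{n-1})
    posterior /= posterior.sum()            # renormalize on the grid

print(theta[np.argmax(posterior)])          # posterior peaks at max_k x_k = 8
```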

Page 33: Example 1: Recursive Bayes Learning (figure)

Page 34: Example 1: Bayes vs. ML (figure)

Page 35: Identifiability
p(x|θ) is identifiable:
– The sequence of posterior densities p(θ|D^n) converges to a delta function
– Only one θ causes p(x|θ) to fit the data
On some occasions, more than one value of θ may yield the same p(x|θ):
– p(θ|D^n) will peak near all θ that explain the data
– The ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable

Page 36: ML vs. Bayes Methods
Computational complexity
Interpretability
Confidence in prior information
– Form of the underlying distribution p(x|θ)
Results differ when p(θ|D) is broad or asymmetric around the estimated θ
– Bayes methods would exploit such information whereas ML would not

Page 37: Classification Errors
Bayes or indistinguishability error
Model error
Estimation error
– Parameters are estimated from a finite sample
– Vanishes in the limit of infinite training data (ML and Bayes would have the same total classification error)

Page 38: Invariance and Non-informative Priors
Guidance in creating priors
Invariance
– Translation invariance
– Scale invariance
Non-informative with respect to an invariance
– Much better than accommodating an arbitrary transformation in a MAP estimator
– Of great use in Bayesian estimation

Page 39: Gibbs Algorithm

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

Pick a \theta_0 according to p(\theta|D); let p(x|D) \approx p(x|\theta_0) [Gibbs algorithm]

Given weak assumptions, the misclassification error is at most twice the expected error of the Bayes optimal classifier.

Page 40: Sufficient Statistics
Statistic:
– Any function of the samples
Sufficient statistic s of samples D:
– s contains all information relevant to estimating some parameter θ
– Definition: p(D|s, θ) is independent of θ
– If θ can be regarded as a random variable:

  p(\theta|s, D) = \frac{p(D|s, \theta)\, p(\theta|s)}{p(D|s)} = p(\theta|s)

Page 41: Factorization Theorem
A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product

  P(D|\theta) = g(s, \theta)\, h(D)

for some functions g(·,·) and h(·).

Page 42: Example: Multivariate Gaussian

  p(x|\mu) \sim N(\mu, \Sigma)

  p(D|\mu) = \frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right]

  = \exp\left[-\frac{n}{2}\,\mu^t\Sigma^{-1}\mu + \mu^t\Sigma^{-1}\sum_{k=1}^{n}x_k\right]\cdot\frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}x_k^t\Sigma^{-1}x_k\right]

Thus \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k (equivalently s = \sum_{k=1}^{n}x_k) is sufficient for μ.

Page 43: Proof of Factorization Theorem: The "Only if" Part

Suppose s is sufficient for θ: P(D|s, θ) is independent of θ. Then

  P(D|\theta) = P(D, s|\theta) = P(D|s, \theta)\, P(s|\theta) = P(D|s)\, P(s|\theta) = h(D)\, g(s, \theta)

Page 44: Proof of Factorization Theorem: The "if" Part

Suppose P(D|\theta) = g(s, \theta)\, h(D). Let \bar{D} range over all sample sets with the same statistic value s, i.e., \varphi(\bar{D}) = s:

  P(s|\theta) = \sum_{\bar{D}:\,\varphi(\bar{D})=s} P(\bar{D}|\theta) = \sum_{\bar{D}} g(s,\theta)\, h(\bar{D}) = g(s,\theta)\sum_{\bar{D}} h(\bar{D})

  P(D|s,\theta) = \frac{P(D|\theta)}{P(s|\theta)} = \frac{g(s,\theta)\, h(D)}{g(s,\theta)\sum_{\bar{D}} h(\bar{D})} = \frac{h(D)}{\sum_{\bar{D}} h(\bar{D})}

which is independent of θ, so s is sufficient for θ.

Page 45: Kernel Density
Factoring of P(D|θ) into g(s,θ)h(D) is not unique
– If f(s) is any function, g'(s,θ) = f(s)g(s,θ) and h'(D) = h(D)/f(s) are equivalent factors
The ambiguity is removed by defining the kernel density, invariant to such scaling:

  \bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}

Page 46: Example: Multivariate Gaussian

  p(x|\mu) \sim N(\mu, \Sigma)

  p(D|\mu) = \exp\left[-\frac{n}{2}\,\mu^t\Sigma^{-1}\mu + \mu^t\Sigma^{-1}\, n\hat{\mu}_n\right] h(D), \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k

  g(\hat{\mu}_n, \mu) = \exp\left[-\frac{n}{2}\left(\mu^t\Sigma^{-1}\mu - 2\mu^t\Sigma^{-1}\hat{\mu}_n\right)\right]

Kernel density:

  \bar{g}(\hat{\mu}_n, \mu) = \frac{1}{(2\pi)^{d/2}\left|\tfrac{1}{n}\Sigma\right|^{1/2}}\exp\left[-\frac{1}{2}(\mu-\hat{\mu}_n)^t\left(\tfrac{1}{n}\Sigma\right)^{-1}(\mu-\hat{\mu}_n)\right]

Page 47: Kernel Density and Parameter Estimation
Maximum-likelihood: maximization of g(s, θ)
Bayesian:

  p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{\int p(D|\theta)\, p(\theta)\, d\theta} = \frac{g(s,\theta)\, p(\theta)}{\int g(s,\theta)\, p(\theta)\, d\theta}

– If prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
– If p(x|θ) is identifiable, g(s,θ) peaks sharply at some value, and p(θ) is continuous as well as non-zero there, then p(θ|D) approaches the kernel density

Page 48: Sufficient Statistics for Exponential Family

  p(x|\theta) = \alpha(x)\exp\left[a(\theta) + b(\theta)^t c(x)\right]

  p(D|\theta) = \left[\prod_{k=1}^{n}\alpha(x_k)\right]\exp\left[n\, a(\theta) + b(\theta)^t\sum_{k=1}^{n}c(x_k)\right]

  g(s, \theta) = \exp\left[n\left(a(\theta) + b(\theta)^t s\right)\right], \qquad s = \frac{1}{n}\sum_{k=1}^{n}c(x_k), \qquad h(D) = \prod_{k=1}^{n}\alpha(x_k)

Page 49: Error Rate and Dimensionality

Consider the two-class multivariate normal case p(x|\omega_j) \sim N(\mu_j, \Sigma), j = 1, 2.
With equal prior probabilities, the Bayes error rate is

  P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2)

Suppose the features are statistically independent (the conditionally independent case), \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2):

  r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2
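
Since the slide's integral is the Gaussian tail function Q(r/2), it can be evaluated with the complementary error function; a small sketch (my own illustration):

```python
import math

def bayes_error_rate(r):
    """P(e) = (1/sqrt(2 pi)) * integral from r/2 to infinity of exp(-u^2/2) du."""
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))   # Gaussian tail Q(r/2)

for r in [0.5, 1.0, 2.0, 4.0]:   # error falls as the Mahalanobis distance r grows
    print(r, bayes_error_rate(r))
```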

Page 50: Accuracy and Dimensionality (figure)

Page 51: Effects of Additional Features
In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance
Sources of difficulty:
– Wrong models
– Number of design or training samples is finite, and thus the distributions are not estimated accurately

Page 52: Computational Complexity for Maximum-Likelihood Estimation

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k : O(nd)

  \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t : O(nd^2)

Find the inverse of a d × d matrix: O(d^3)
Find the determinant of a d × d matrix: O(d^3)

  g(x) = -\frac{1}{2}(x-\hat{\mu})^t\hat{\Sigma}^{-1}(x-\hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

Total: O(nd) + O(nd^2) + O(d^3) + O(1) = O(d^2 n), since n > d.

Page 53: Computational Complexity for Classification

Given x:
Compute (x - \hat{\mu}): O(d)
Multiply by the inverse covariance matrix: O(d^2)
Decision by \max_i g_i(x): O(c)
Total for classification: O(d^2), simpler than learning.

Page 54: Approaches for Inadequate Samples
Reduce dimensionality
– Redesign the feature extractor
– Select an appropriate subset of features
– Combine the existing features
– Pool the available data by assuming all classes share the same covariance matrix
Look for a better estimate for Σ
– Use the Bayesian estimate with a diagonal Σ_0
– Threshold the sample covariance matrix
– Assume statistical independence

Page 55: Shrinkage (Regularized Discriminant Analysis)

i is an index on the categories in question; Σ is estimated by assuming the same covariance matrix.

"Shrink" the individual covariance matrices to the common one:

  \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1-\alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1

or, "shrink" Σ toward the identity matrix:

  \Sigma(\beta) = (1-\beta)\Sigma + \beta I, \qquad 0 < \beta < 1
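
A direct transcription of the two shrinkage formulas as NumPy helpers; the function names are my own:

```python
import numpy as np

def shrink_covariance(sigma_i, sigma, n_i, n, alpha):
    """Shrink one class covariance toward the pooled covariance, 0 < alpha < 1."""
    return (((1 - alpha) * n_i * sigma_i + alpha * n * sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(sigma, beta):
    """Shrink a covariance matrix toward the identity, 0 < beta < 1."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])
```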

Page 56: Concept of Overfitting (figure)

Page 57: Best Representative Point

Given x_1, ..., x_n, find x_0 such that

  J_0(x_0) = \sum_{k=1}^{n}\|x_0 - x_k\|^2

is minimized.

  J_0(x_0) = \sum_{k=1}^{n}\|(x_0 - m) - (x_k - m)\|^2
  = n\|x_0 - m\|^2 - 2(x_0 - m)^t\sum_{k=1}^{n}(x_k - m) + \sum_{k=1}^{n}\|x_k - m\|^2
  = n\|x_0 - m\|^2 + \sum_{k=1}^{n}\|x_k - m\|^2

since \sum_{k=1}^{n}(x_k - m) = 0. The sample mean m = \frac{1}{n}\sum_{k=1}^{n}x_k minimizes J_0(x_0).

Page 58: Projection Along a Line (figure)

Page 59: Best Projection to a Line Through the Sample Mean

Line: x = m + a e, with \|e\| = 1.
Represent x_k by m + a_k e; to minimize the squared error

  J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n}\|(m + a_k e) - x_k\|^2
  = \sum_{k=1}^{n}a_k^2\|e\|^2 - 2\sum_{k=1}^{n}a_k e^t(x_k - m) + \sum_{k=1}^{n}\|x_k - m\|^2

Setting \partial J_1/\partial a_k = 0:

  a_k = e^t(x_k - m)

Page 60: Best Representative Direction

Find e to minimize

  J_1(e) = \sum_{k=1}^{n}a_k^2 - 2\sum_{k=1}^{n}a_k^2 + \sum_{k=1}^{n}\|x_k - m\|^2 = -\sum_{k=1}^{n}\left(e^t(x_k - m)\right)^2 + \sum_{k=1}^{n}\|x_k - m\|^2

  = -e^t S e + \sum_{k=1}^{n}\|x_k - m\|^2, \qquad S = \sum_{k=1}^{n}(x_k - m)(x_k - m)^t \text{ (scatter matrix)}

Maximize e^t S e subject to \|e\| = 1.
Lagrange method: maximize u = e^t S e - \lambda(e^t e - 1):

  \nabla_e u = 2Se - 2\lambda e = 0 \Rightarrow Se = \lambda e

Page 61: Principal Component Analysis (PCA)

Projection space: x = m + \sum_{i=1}^{d'} a_i e_i

Find e_i and a_{ki}, i = 1, \ldots, d', to minimize

  J_{d'} = \sum_{k=1}^{n}\left\|\left(m + \sum_{i=1}^{d'}a_{ki}e_i\right) - x_k\right\|^2

e_1, \ldots, e_{d'} are the eigenvectors of S having the largest eigenvalues.
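
A compact NumPy sketch of this construction: eigenvectors of the scatter matrix, coefficients a_ki = e_i^t(x_k - m), and the minimized criterion. The names are my own:

```python
import numpy as np

def pca(X, d_prime):
    """Top-d' scatter-matrix eigenvectors and projection coefficients for (n, d) samples X."""
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m)                 # scatter matrix
    _, eigvecs = np.linalg.eigh(S)          # eigh: ascending eigenvalues for symmetric S
    E = eigvecs[:, ::-1][:, :d_prime]       # e_1, ..., e_d' (largest eigenvalues first)
    A = (X - m) @ E                         # a_ki = e_i^t (x_k - m)
    return m, E, A

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=200)
m, E, A = pca(X, d_prime=2)
X_recon = m + A @ E.T                       # best rank-2 reconstruction of the samples
print(np.sum((X - X_recon) ** 2))           # the minimized criterion J_2
```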

Page 62: Concept of Fisher Linear Discriminant (figure)

Page 63: Fisher Linear Discriminant Analysis

Find w to get maximal separation on y = w^t x.

  m_i = \frac{1}{n_i}\sum_{x \in D_i} x, \quad i = 1, 2

  \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = w^t m_i, \qquad |\tilde{m}_1 - \tilde{m}_2| = |w^t(m_1 - m_2)|

Within-class scatter: \tilde{s}_i^2 = \sum_{y \in Y_i}(y - \tilde{m}_i)^2

To maximize

  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

Page 64: Fisher Linear Discriminant Analysis

  \tilde{s}_i^2 = \sum_{x \in D_i}(w^t x - w^t m_i)^2 = w^t S_i w, \qquad S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^t

  \tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w, \qquad S_W = S_1 + S_2

  (\tilde{m}_1 - \tilde{m}_2)^2 = \left(w^t(m_1 - m_2)\right)^2 = w^t S_B w, \qquad S_B = (m_1 - m_2)(m_1 - m_2)^t

Page 65: Fisher Linear Discriminant Analysis

  J(w) = \frac{w^t S_B w}{w^t S_W w}, the generalized Rayleigh quotient

J(w) is maximized when S_B w = \lambda S_W w (generalized eigenvalue problem).
S_B w = (m_1 - m_2)(m_1 - m_2)^t w is always in the direction of (m_1 - m_2), so [ignoring scales]

  w = S_W^{-1}(m_1 - m_2)

Page 66: Fisher Linear Discriminant Analysis for Multivariate Normal

Assume the same covariance matrix Σ; the optimal decision boundary is

  w^t x + w_0 = 0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2)

With estimation for \mu_1, \mu_2, and \Sigma:

  w = S_W^{-1}(m_1 - m_2) [solution to Fisher linear discriminant analysis]
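
A minimal sketch of the closed-form Fisher direction w = S_W^{-1}(m_1 - m_2); the two-class test data are my own illustration:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """w = S_W^{-1} (m1 - m2) for two (n_i, d) sample matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m1 - m2)    # solve instead of an explicit inverse

rng = np.random.default_rng(4)
cov = [[1.0, 0.5], [0.5, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=100)
w = fisher_lda_direction(X1, X2)
print(X1.mean(axis=0) @ w, X2.mean(axis=0) @ w)   # well-separated projected means
```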

Page 67: Concept of Multidimensional Discriminant Analysis (figure)

Page 68: Multiple Discriminant Analysis

Consider the c-class problem.
Projection from the d-dimensional space to a (c-1)-dimensional subspace:

  y_i = w_i^t x, \quad i = 1, \ldots, c-1, \qquad y = W^t x

  S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^t, \qquad m_i = \frac{1}{n_i}\sum_{x \in D_i} x

  \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = W^t m_i, \qquad \tilde{S}_W = \sum_{i=1}^{c}\sum_{y \in Y_i}(y - \tilde{m}_i)(y - \tilde{m}_i)^t = W^t S_W W

Page 69: Multiple Discriminant Analysis

  m = \frac{1}{n}\sum_{x} x = \frac{1}{n}\sum_{i=1}^{c} n_i m_i

  S_T = \sum_{x}(x - m)(x - m)^t = \sum_{i=1}^{c}\sum_{x \in D_i}(x - m_i + m_i - m)(x - m_i + m_i - m)^t

  = \sum_{i=1}^{c}\sum_{x \in D_i}(x - m_i)(x - m_i)^t + \sum_{i=1}^{c} n_i(m_i - m)(m_i - m)^t = S_W + S_B

  S_B = \sum_{i=1}^{c} n_i(m_i - m)(m_i - m)^t

Page 70: Multiple Discriminant Analysis

Seek a transformation W to maximize the ratio of the between-class scatter to the within-class scatter.
A simple scalar measure of scatter is the determinant of the scatter matrix (equivalent to the product of the variances in the principal directions).

  \tilde{m} = \frac{1}{n}\sum_{i=1}^{c} n_i \tilde{m}_i, \qquad \tilde{S}_B = \sum_{i=1}^{c} n_i(\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^t = W^t S_B W

  J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^t S_B W|}{|W^t S_W W|}

Page 71: Multiple Discriminant Analysis
Columns of the optimal W satisfy

  S_B w_i = \lambda_i S_W w_i

and w_i is the generalized eigenvector related to the largest eigenvalues.
The optimal W is not unique, since it can be multiplied by rotation or scaling matrices, etc.

Page 72: Expectation-Maximization (EM)
Finding the maximum-likelihood estimate of the parameters of an underlying distribution
– from a given data set when the data is incomplete or has missing values
Two main applications:
– When the data indeed has missing values
– When optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters

Page 73: Expectation-Maximization (EM)
Full sample D = {x_1, ..., x_n}, with x_k = {x_kg, x_kb}
Separate the individual features into good features D_g and missing (bad) features D_b
– D is the union of D_g and D_b
Form the function

  Q(\theta; \theta^i) = E_{D_b}\left[\ln p(D_g, D_b; \theta)\, \big|\, D_g; \theta^i\right]

Page 74: Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^{i+1} ← arg max_θ Q(θ; θ^i)
  until Q(θ^{i+1}; θ^i) - Q(θ^i; θ^{i-1}) ≤ T
  return θ̂ ← θ^{i+1}
end

Page 75: Expectation-Maximization (EM) (figure)

Page 76: Example: 2D Model

  D = \{x_1, x_2, x_3, x_4\} = \left\{\binom{0}{2}, \binom{1}{0}, \binom{2}{2}, \binom{*}{4}\right\}

where the * component x_{41} is missing.

Assume a 2D Gaussian model with diagonal covariance matrix:

  \theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^t, \qquad \theta^0 = (0, 0, 1, 1)^t

Page 77: Example: 2D Model

  Q(\theta; \theta^0) = E_{x_{41}}\left[\ln p(x_g, x_b; \theta)\, \big|\, \theta^0; D_g\right]

  = \sum_{k=1}^{3}\ln p(x_k|\theta) + \int_{-\infty}^{\infty} \frac{p\!\left(\binom{x_{41}}{4}\Big|\,\theta^0\right)}{K}\,\ln p\!\left(\binom{x_{41}}{4}\Big|\,\theta\right) dx_{41}

where K = \int_{-\infty}^{\infty} p\!\left(\binom{x_{41}}{4}\Big|\,\theta^0\right) dx_{41} is a normalization constant.

Page 78: Example: 2D Model

For \theta^0 = (0, 0, 1, 1)^t the conditional density of x_{41} given x_{42} = 4 is N(0, 1) (the covariance is diagonal), so

  Q(\theta; \theta^0) = \sum_{k=1}^{3}\ln p(x_k|\theta) + \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x_{41}^2}{2}\right]\ln p\!\left(\binom{x_{41}}{4}\Big|\,\theta\right) dx_{41}

Maximizing Q(\theta; \theta^0) gives

  \theta^1 = (0.75,\; 2.0,\; 0.938,\; 2.0)^t

7979

Example: 2D ModelExample: 2D Model

0.200667.0

0.20.1

at converges algorithm the,iterations 3After

Σ
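
A compact sketch of this EM iteration for the diagonal-Gaussian model with the single missing component x_41: the E step needs only E[x_41] and E[x_41^2] (marginal moments, since the covariance is diagonal), and the M step plugs these expected sufficient statistics into the usual ML formulas. Details beyond the slides are my own:

```python
import numpy as np

X_good = np.array([[0.0, 2.0], [1.0, 0.0], [2.0, 2.0]])   # x_1, x_2, x_3 fully observed
x42 = 4.0                                                  # observed part of x_4
mu, var = np.array([0.0, 0.0]), np.array([1.0, 1.0])       # theta^0 = (0, 0, 1, 1)^t

for i in range(10):
    # E step: diagonal covariance, so x_41 | x_42 = 4 has the marginal moments of x_41
    e_x41, e_x41_sq = mu[0], var[0] + mu[0] ** 2
    # M step: ML formulas with the expected sufficient statistics filled in
    x1_vals = np.append(X_good[:, 0], e_x41)
    x2_vals = np.append(X_good[:, 1], x42)
    mu = np.array([x1_vals.mean(), x2_vals.mean()])
    var = np.array([
        (np.sum((X_good[:, 0] - mu[0]) ** 2)
         + e_x41_sq - 2.0 * mu[0] * e_x41 + mu[0] ** 2) / 4.0,
        np.mean((x2_vals - mu[1]) ** 2),
    ])

print(mu, var)   # approaches mu = (1.0, 2.0), diag(Sigma) = (0.667, 2.0)
```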

Page 80: Generalized Expectation-Maximization (GEM)
Instead of maximizing Q(θ; θ^i), we find some θ^{i+1} such that Q(θ^{i+1}; θ^i) > Q(θ^i; θ^i), which is also guaranteed to converge
Convergence will not be as rapid
Offers great freedom to choose computationally simpler steps
– e.g., using the maximum-likelihood value of the unknown values, if they lead to a greater likelihood

Page 81: Hidden Markov Model (HMM)
Used for problems of making a series of decisions
– e.g., speech or gesture recognition
Problem states at time t are influenced directly by a state at t-1
More reference:
– L. A. Rabiner and B. W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.

Page 82: First Order Markov Models

Sequence of states: \omega^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}

e.g., \omega^6 = \{\omega_1, \omega_3, \omega_2, \omega_2, \omega_1, \omega_3\}:

  P(\omega^6|\theta) = a_{13}\, a_{32}\, a_{22}\, a_{21}\, a_{13}

Page 83: First Order Hidden Markov Models

Sequence of visible states: V^T = \{v(1), v(2), \ldots, v(T)\}

e.g., V^6 = \{v_4, v_1, v_1, v_4, v_2, v_3\}, with emission probabilities

  b_{jk} = P(v_k(t)|\omega_j(t))

Page 84: Hidden Markov Model Probabilities

Transition probability: a_{ij} = P(\omega_j(t+1)|\omega_i(t))
Final or absorbing state \omega_0: a_{00} = 1
Probability of emission of a visible state: b_{jk} = P(v_k(t)|\omega_j(t))
Normalization: \sum_j a_{ij} = 1, \qquad \sum_k b_{jk} = 1

Page 85: Hidden Markov Model Computation
Evaluation problem
– Given a_ij and b_jk, determine P(V^T|θ)
Decoding problem
– Given V^T, determine the most likely sequence of hidden states that leads to V^T
Learning problem
– Given training observations of visible symbols and the coarse structure but not the probabilities, determine a_ij and b_jk

Page 86: Evaluation

  P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T|\omega_r^T)\, P(\omega_r^T)

  P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t)|\omega(t-1))

  P(V^T|\omega_r^T) = \prod_{t=1}^{T} P(v(t)|\omega(t))

  P(V^T) = \sum_{r=1}^{r_{\max}}\prod_{t=1}^{T} P(v(t)|\omega(t))\, P(\omega(t)|\omega(t-1))

Page 87: HMM Forward

  P(V^T) = \sum_{r=1}^{r_{\max}}\prod_{t=1}^{T} P(v(t)|\omega(t))\, P(\omega(t)|\omega(t-1))

Define \alpha_j(t) = P(v(1), \ldots, v(t), \omega(t) = \omega_j):

  \alpha_j(0) = 1 for the initial state, 0 otherwise

  \alpha_j(t) = \left[\sum_{i=1}^{c}\alpha_i(t-1)\, a_{ij}\right] b_{j\,v(t)}

  P(V^T) = P(V^T, \omega(T) = \omega_0) = \alpha_0(T)

Page 88: HMM Forward and Trellis (figure)

Page 89: HMM Forward
begin initialize t ← 0, a_ij, b_jk, visible sequence V^T, α_j(0)
  for t ← t + 1
    α_j(t) ← [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}
  until t = T
  return P(V^T) ← α_0(T) for the final state
end
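
A sketch of the forward recursion in NumPy, written against the a_ij, b_jk conventions above; the test sequence and the use of Example 3's matrices (Page 93 below) are my own illustration:

```python
import numpy as np

def hmm_forward(A, B, obs, init_state=1):
    """alpha recursion; A[i, j] = a_ij, B[j, k] = b_jk, obs = indices of visible symbols."""
    alpha = np.zeros(A.shape[0])
    alpha[init_state] = 1.0                  # alpha_j(0): all mass on the initial state
    for v_t in obs:
        alpha = (alpha @ A) * B[:, v_t]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_{j v(t)}
    return alpha[0]                          # alpha_0(T): mass in the final state omega_0

# Example 3's matrices; the observation sequence ends with the null symbol v_0
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])
B = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])
print(hmm_forward(A, B, obs=[1, 3, 2, 0]))
```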

Page 90: HMM Backward

Define \beta_i(t) analogously for the remainder of the sequence:

  \beta_i(T) = 1 for the final state \omega_0, 0 otherwise

  \beta_i(t) = \sum_{j=1}^{c}\beta_j(t+1)\, a_{ij}\, b_{j\,v(t+1)}

  P(V^T) = P(V^T, \omega(0) = \omega_{\mathrm{init}}) = \beta_{\mathrm{init}}(0)

Page 91: HMM Backward
begin initialize β_j(T), t ← T, a_ij, b_jk, visible sequence V^T
  for t ← t - 1
    β_i(t) ← Σ_{j=1}^{c} β_j(t+1) a_ij b_{j v(t+1)}
  until t = 0
  return P(V^T) ← β_i(0) for the initial state
end

Page 92: Example 3: Hidden Markov Model (figure)

Page 93: Example 3: Hidden Markov Model

  a_{ij} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.0 & 0.1 \end{pmatrix}, \qquad b_{jk} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.3 & 0.4 & 0.1 & 0.2 \\ 0 & 0.1 & 0.1 & 0.7 & 0.1 \\ 0 & 0.5 & 0.2 & 0.1 & 0.2 \end{pmatrix}

Page 94: Example 3: Hidden Markov Model (figure)

Page 95: Left-to-Right Models for Speech

  P(\omega^T|V^T) = \frac{P(V^T|\omega^T)\, P(\omega^T)}{P(V^T)}

Page 96: HMM Decoding (figure)

Page 97: Problem of Local Optimization
This decoding algorithm depends only on the single previous time step, not the full sequence
It does not guarantee that the path is indeed allowable

Page 98: HMM Decoding
begin initialize Path ← {}, t ← 0
  for t ← t + 1
    j ← 0
    for j ← j + 1
      α_j(t) ← [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}
    until j = c
    j' ← arg max_j α_j(t)
    append ω_{j'} to Path
  until t = T
  return Path
end
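
A sketch of this per-step (greedy) decoder; note, per Page 97, that it is a local method and not the full Viterbi algorithm. The names are my own:

```python
import numpy as np

def hmm_decode_greedy(A, B, obs, init_state=1):
    """At each step keep the state maximizing alpha_j(t), as in the algorithm above."""
    alpha = np.zeros(A.shape[0])
    alpha[init_state] = 1.0
    path = []
    for v_t in obs:
        alpha = (alpha @ A) * B[:, v_t]
        path.append(int(np.argmax(alpha)))   # locally most probable state at step t
    return path
```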

Page 99: Example 4: HMM Decoding (figure)

Page 100: Forward-Backward Algorithm
Determines the model parameters a_ij and b_jk from an ensemble of training samples
An instance of a generalized expectation-maximization algorithm
There is no known method for obtaining the optimal or most likely set of parameters from data

Page 101: Probability of Transition

  \gamma_{ij}(t) = P(\omega_i(t-1), \omega_j(t)\, |\, V^T, \theta) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{j\,v(t)}\, \beta_j(t)}{P(V^T|\theta)}

Page 102: Improved Estimate for a_ij

Expected number of transitions between state ω_i(t-1) and ω_j(t) at any time in the sequence:

  \sum_{t=1}^{T}\gamma_{ij}(t)

Total expected number of any transitions from ω_i:

  \sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)

Estimate of a_ij:

  \hat{a}_{ij} = \frac{\sum_{t=1}^{T}\gamma_{ij}(t)}{\sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)}

Page 103: Improved Estimate for b_jk

  \hat{b}_{jk} = \frac{\sum_{t=1,\; v(t)=v_k}^{T}\sum_{l}\gamma_{jl}(t)}{\sum_{t=1}^{T}\sum_{l}\gamma_{jl}(t)}
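
A sketch of one re-estimation pass built from γ_ij(t) and the two update formulas, assuming forward values alphas[0..T] and backward values betas[0..T] have been precomputed (all names hypothetical; zero-probability rows would need guarding in practice):

```python
import numpy as np

def reestimate(A, B, obs, alphas, betas, pV):
    """One Baum-Welch pass: gamma_ij(t) = alpha_i(t-1) a_ij b_{j v(t)} beta_j(t) / P(V^T)."""
    T, c = len(obs), A.shape[0]
    gamma = np.zeros((T, c, c))
    for t in range(T):   # gamma for time step t+1 in the slides' 1-based notation
        gamma[t] = (alphas[t][:, None] * A * B[:, obs[t]][None, :]
                    * betas[t + 1][None, :]) / pV
    A_new = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array([v == k for v in obs])
        B_new[:, k] = gamma[mask].sum(axis=(0, 2)) / gamma.sum(axis=(0, 2))
    return A_new, B_new
```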

Page 104: Forward-Backward Algorithm (Baum-Welch Algorithm)
begin initialize a_ij, b_jk, training sequence V^T, convergence criterion θ, z ← 0
  do z ← z + 1
    compute â(z) from a(z-1) and b(z-1)
    compute b̂(z) from a(z-1) and b(z-1)
    a_ij(z) ← â_ij(z)
    b_jk(z) ← b̂_jk(z)
  until max_{i,j,k} [a_ij(z) - a_ij(z-1), b_jk(z) - b_jk(z-1)] < θ
  return a_ij ← a_ij(z); b_jk ← b_jk(z)
end