Parameter Estimation
Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University


TRANSCRIPT

Page 1: Parameter Estimation
Shyh-Kang Jeng, Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University

Page 2: Typical Classification Problem
Rarely know the complete probabilistic structure of the problem
Have vague, general knowledge
Have a number of design samples or training data as representatives of patterns for classification
Find some way to use this information to design or train the classifier

Page 3: Estimating Probabilities
Not difficult to estimate prior probabilities
Hard to estimate class-conditional densities
– Number of available samples always seems too small
– Serious when dimensionality is large

Page 4: Estimating Parameters
Many problems permit us to parameterize the conditional densities
This simplifies the problem from estimating an unknown function to estimating the parameters
– e.g., the mean vector and covariance matrix for a multivariate normal distribution

Page 5: Maximum-Likelihood Estimation
View the parameters as quantities whose values are fixed but unknown
The best estimate is the one that maximizes the probability of obtaining the samples actually observed
Nearly always has good convergence properties as the number of samples increases
Often simpler than alternative methods

Page 6: I. I. D. Random Variables
Separate the data into D_1, ..., D_c
Samples in D_j are drawn independently according to p(x|ω_j)
Such samples are independent and identically distributed (i.i.d.) random variables
Let p(x|ω_j) have a known parametric form, determined uniquely by a parameter vector θ_j, i.e., p(x|ω_j) = p(x|ω_j, θ_j)

Page 7: Simplification Assumptions
Samples in D_i give no information about θ_j if i is not equal to j
Can work with each class separately
Have c separate problems of the same form:
– Use a set D of i.i.d. samples from p(x|θ) to estimate the unknown parameter vector θ

Page 8: Maximum-Likelihood Estimate

Let D contain n i.i.d. samples x_1, ..., x_n. The likelihood of θ with respect to D is

  p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)

The maximum-likelihood estimate \hat{\theta} is the value that maximizes p(D|\theta).

Page 9: Maximum-Likelihood Estimation (figure)

Page 10: A Note
The likelihood p(D|θ) as a function of θ is not a probability density function of θ
Its area over the θ-domain has no significance
The likelihood p(D|θ) can be regarded as the probability of D for a given θ

Page 11: Analytical Approach

Log-likelihood function:

  l(\theta) = \ln p(D|\theta), \qquad \hat{\theta} = \arg\max_{\theta} l(\theta)

  l(\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta), \qquad \nabla_{\theta} l = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k|\theta)

where \theta = (\theta_1, \ldots, \theta_p)^t and \nabla_{\theta} = [\partial/\partial\theta_1, \ldots, \partial/\partial\theta_p]^t.

Necessary condition for \hat{\theta}: \nabla_{\theta} l = 0

Page 12: MAP Estimators

A maximum a posteriori (MAP) estimator finds the \theta that maximizes l(\theta) + \ln p(\theta), where p(\theta) is the prior probability of different parameter values.

The maximum-likelihood (ML) estimator is a MAP estimator for the uniform prior.

Page 13: Gaussian Case: Unknown μ

  \ln p(x_k|\mu) = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1} (x_k - \mu)

  \nabla_{\mu} \ln p(x_k|\mu) = \Sigma^{-1}(x_k - \mu)

Setting \sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0 gives

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k

Page 14: Univariate Gaussian Case: Unknown μ and σ²

  \theta = (\theta_1, \theta_2)^t = (\mu, \sigma^2)^t

  \ln p(x_k|\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2

  \nabla_{\theta} l = \sum_{k=1}^{n} \begin{bmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix} = 0

  \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2

Page 15: Multivariate Gaussian Case: Unknown μ and Σ

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t
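
As a concrete illustration, here is a minimal NumPy sketch of these two ML estimates; the function and variable names are my own, not from the slides:

```python
import numpy as np

def gaussian_ml_estimates(X):
    """ML estimates for a multivariate Gaussian; X is an (n, d) sample matrix."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                  # (1/n) sum_k x_k
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / n    # (1/n) sum_k (x_k - mu)(x_k - mu)^t
    return mu_hat, sigma_hat

# The estimates approach the true parameters as n grows
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=5000)
mu_hat, sigma_hat = gaussian_ml_estimates(X)
print(mu_hat)
print(sigma_hat)
```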

Page 16: Bias, Absolutely Unbiased, and Asymptotically Unbiased

  E\left[\hat{\sigma}^2\right] = \frac{n-1}{n}\sigma^2 \ne \sigma^2

so the ML estimator for σ² is a biased estimate.

An (absolutely) unbiased estimator for the covariance matrix:

  C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t

Since \hat{\Sigma} = \frac{n-1}{n} C, the ML estimator of Σ is asymptotically unbiased.
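
A quick numeric check of this bias, using NumPy's ddof argument to switch between the 1/n (ML) and 1/(n-1) (unbiased) divisors; the experiment setup is my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
x = rng.normal(loc=0.0, scale=2.0, size=(100000, n))  # many size-5 samples, true sigma^2 = 4

sigma2_ml = x.var(axis=1, ddof=0).mean()        # divides by n: mean near (n-1)/n * 4 = 3.2
sigma2_unbiased = x.var(axis=1, ddof=1).mean()  # divides by n-1: mean near 4.0
print(sigma2_ml, sigma2_unbiased)
```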

Page 17: Model Error
For a reliable model, the ML classifier can give excellent results
If the model is wrong, the ML classifier cannot get the best results, even within the assumed set of models

Page 18: Bayesian Estimation (Bayesian Learning)
The answers obtained are in general nearly identical to those obtained by maximum likelihood
Basic conceptual difference:
– The parameter vector θ is a random variable
– Use the training data to convert a distribution on this variable into a posterior probability density

Page 19: Central Problem

Given the sample D,

  P(\omega_i|x, D) = \frac{p(x|\omega_i, D)\, P(\omega_i|D)}{\sum_{j=1}^{c} p(x|\omega_j, D)\, P(\omega_j|D)}

Assume the prior probabilities are easy to find: P(\omega_i|D) = P(\omega_i)

Let D be separated into D_1, ..., D_c. Samples in D_i do not affect θ_j if i ≠ j, so

  p(x|\omega_i, D) = p(x|\omega_i, D_i)

and each class can be treated independently.

Central problem of Bayesian learning: use a set D of samples drawn independently according to the fixed but unknown p(x) to determine p(x|D).

Page 20: Parameter Distribution
Assume p(x) has a known parametric form with parameter vector θ of unknown value
Thus p(x|θ) is completely known
Information about θ prior to observing the samples is contained in a known prior density p(θ)
Observations convert p(θ) to p(θ|D), which should be sharply peaked about the true value of θ

Page 21: Parameter Distribution

  p(x|D) = \int p(x, \theta|D)\, d\theta

  p(x, \theta|D) = p(x|\theta, D)\, p(\theta|D) = p(x|\theta)\, p(\theta|D)

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

If p(\theta|D) peaks very sharply about some \hat{\theta}:

  p(x|D) \approx p(x|\hat{\theta})

Page 22: Univariate Gaussian Case: p(μ|D)

  p(x|\mu) \sim N(\mu, \sigma^2), with μ the only unknown

Assume p(\mu) \sim N(\mu_0, \sigma_0^2), with \mu_0 and \sigma_0^2 known
(\mu_0: best guess of μ; \sigma_0^2: uncertainty about this guess)

Given D = \{x_1, \ldots, x_n\}:

  p(\mu|D) = \frac{p(D|\mu)\, p(\mu)}{\int p(D|\mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu)

  = \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu - x_k}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right]

  = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

Page 23: Reproducing Density

  p(\mu|D) \sim N(\mu_n, \sigma_n^2) [reproducing density; cf. p(\mu): conjugate prior]

  \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k

  \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}
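
A small sketch of this closed-form posterior update under the stated assumptions (known σ², Gaussian prior on μ); the names are my own:

```python
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of N(mu, sigma2) with prior N(mu0, sigma0_2)."""
    n = len(x)
    mu_hat = np.mean(x)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.0, size=50)   # true mu = 3, known sigma^2 = 1
mu_n, sigma_n2 = bayes_update_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0)
print(mu_n, sigma_n2)   # mean near 3; variance shrinks roughly like sigma^2/n
```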

Page 24: Bayesian Learning (figure)

Page 25: Dogmatism

μ_n is a linear combination of \hat{\mu}_n and μ_0, and always lies somewhere between them.
The relative balance between prior knowledge and empirical data is set by the ratio of σ² to σ_0² (the dogmatism).
When the dogmatism is finite, μ_n will converge to \hat{\mu}_n as n grows, no matter what μ_0 and σ_0² are.

Page 26: Univariate Gaussian Case: p(x|D)

  p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu

  = \frac{1}{2\pi\sigma\sigma_n}\exp\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right] f(\sigma, \sigma_n)

  f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right] d\mu

  p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)

Page 27: Multivariate Gaussian Case

  p(x|\mu) \sim N(\mu, \Sigma), \qquad p(\mu) \sim N(\mu_0, \Sigma_0)

  p(\mu|D) = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu) = \alpha' \exp\left[-\frac{1}{2}\left(\mu^t\left(n\Sigma^{-1}+\Sigma_0^{-1}\right)\mu - 2\mu^t\left(\Sigma^{-1}\, n\hat{\mu}_n + \Sigma_0^{-1}\mu_0\right)\right)\right]

with \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k.

Page 28: Multivariate Gaussian Case

  p(\mu|D) \sim N(\mu_n, \Sigma_n):

  \mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0

  \Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma

[using the identity A(A+B)^{-1}B = B(A+B)^{-1}A]

  p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu \sim N(\mu_n, \Sigma + \Sigma_n)

or, by letting y \sim N(0, \Sigma) and x = \mu + y with p(\mu|D) \sim N(\mu_n, \Sigma_n), again p(x|D) \sim N(\mu_n, \Sigma + \Sigma_n).

Page 29: Multivariate Bayesian Learning (figure)

Page 30: General Bayesian Estimation

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

  p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{\int p(D|\theta)\, p(\theta)\, d\theta}

  p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)

Page 31: Recursive Bayesian Learning

  D^n = \{x_1, \ldots, x_n\}, \qquad p(D^n|\theta) = p(x_n|\theta)\, p(D^{n-1}|\theta)

  p(\theta|D^n) = \frac{p(D^n|\theta)\, p(\theta)}{\int p(D^n|\theta)\, p(\theta)\, d\theta} = \frac{p(x_n|\theta)\, p(D^{n-1}|\theta)\, p(\theta)}{\int p(x_n|\theta)\, p(D^{n-1}|\theta)\, p(\theta)\, d\theta}

  = \frac{p(x_n|\theta)\, p(\theta|D^{n-1})}{\int p(x_n|\theta)\, p(\theta|D^{n-1})\, d\theta}, \qquad p(\theta|D^0) = p(\theta)

Page 32: Example 1: Recursive Bayes Learning

  p(x|\theta) \sim U(0, \theta): \; 1/\theta \text{ for } 0 \le x \le \theta, \; 0 \text{ otherwise}

  p(\theta) \sim U(0, 10), \qquad D = \{4, 7, 2, 8\}

  p(\theta|D^0) = p(\theta) \sim U(0, 10)

  p(\theta|D^1) \propto p(x_1|\theta)\, p(\theta|D^0) = 1/\theta \text{ for } 4 \le \theta \le 10, \; 0 \text{ otherwise}

  p(\theta|D^2) \propto p(x_2|\theta)\, p(\theta|D^1) = 1/\theta^2 \text{ for } 7 \le \theta \le 10, \; 0 \text{ otherwise}

  p(\theta|D^n) \propto 1/\theta^n \text{ for } \max_k x_k \le \theta \le 10
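
A grid-based sketch of this recursive update for Example 1; the discretization is my own illustration, not part of the slides:

```python
import numpy as np

# Grid-based recursive Bayes for Example 1: p(x|theta) = U(0, theta), prior U(0, 10)
theta = np.linspace(0.01, 10.0, 1000)       # discretized parameter axis
posterior = np.ones_like(theta)             # p(theta|D^0): uniform prior

for x in [4, 7, 2, 8]:                      # the sample D = {4, 7, 2, 8}
    likelihood = np.where(theta >= x, 1.0 / theta, 0.0)
    posterior *= likelihood                 # p(theta|D^n) ~ p(x_n|theta) p(theta|D^{n-1})
    posterior /= posterior.sum()            # renormalize on the grid

print(theta[np.argmax(posterior)])          # posterior peaks at max_k x_k = 8
```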

Page 33: Example 1: Recursive Bayes Learning (figure)

Page 34: Example 1: Bayes vs. ML (figure)

Page 35: Identifiability
p(x|θ) is identifiable:
– The sequence of posterior densities p(θ|D^n) converges to a delta function
– Only one θ causes p(x|θ) to fit the data
On some occasions, more than one value of θ may yield the same p(x|θ):
– p(θ|D^n) will peak near all θ that explain the data
– The ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable

Page 36: ML vs. Bayes Methods
Computational complexity
Interpretability
Confidence in prior information
– Form of the underlying distribution p(x|θ)
Results differ when p(θ|D) is broad or asymmetric around the estimated θ
– Bayes methods would exploit such information whereas ML would not

Page 37: Classification Errors
Bayes or indistinguishability error
Model error
Estimation error
– Parameters are estimated from a finite sample
– Vanishes in the limit of infinite training data (ML and Bayes would have the same total classification error)

Page 38: Invariance and Non-informative Priors
Guidance in creating priors
Invariance
– Translation invariance
– Scale invariance
Non-informative with respect to an invariance
– Much better than accommodating an arbitrary transformation in a MAP estimator
– Of great use in Bayesian estimation

Page 39: Gibbs Algorithm

  p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta

Pick a \theta_0 according to p(\theta|D); let p(x|D) \approx p(x|\theta_0) [Gibbs algorithm]

Given weak assumptions, the misclassification error is at most twice the expected error of the Bayes optimal classifier.

Page 40: Sufficient Statistics
Statistic:
– Any function of the samples
Sufficient statistic s of samples D:
– s contains all information relevant to estimating some parameter θ
– Definition: p(D|s, θ) is independent of θ
– If θ can be regarded as a random variable:

  p(\theta|s, D) = \frac{p(D|s, \theta)\, p(\theta|s)}{p(D|s)} = p(\theta|s)

Page 41: Factorization Theorem
A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product

  P(D|\theta) = g(s, \theta)\, h(D)

for some functions g(·,·) and h(·).

Page 42: Example: Multivariate Gaussian

  p(x|\mu) \sim N(\mu, \Sigma)

  p(D|\mu) = \frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}(x_k-\mu)^t\Sigma^{-1}(x_k-\mu)\right]

  = \exp\left[-\frac{n}{2}\,\mu^t\Sigma^{-1}\mu + \mu^t\Sigma^{-1}\sum_{k=1}^{n}x_k\right]\cdot\frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\left[-\frac{1}{2}\sum_{k=1}^{n}x_k^t\Sigma^{-1}x_k\right]

Thus \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k (equivalently s = \sum_{k=1}^{n}x_k) is sufficient for μ.

Page 43: Proof of Factorization Theorem: The "Only if" Part

Suppose s is sufficient for θ: P(D|s, θ) is independent of θ. Then

  P(D|\theta) = P(D, s|\theta) = P(D|s, \theta)\, P(s|\theta) = P(D|s)\, P(s|\theta) = h(D)\, g(s, \theta)

Page 44: Proof of Factorization Theorem: The "if" Part

Suppose P(D|\theta) = g(s, \theta)\, h(D). Let \bar{D} range over all sample sets with the same statistic value s, i.e., \varphi(\bar{D}) = s:

  P(s|\theta) = \sum_{\bar{D}:\,\varphi(\bar{D})=s} P(\bar{D}|\theta) = \sum_{\bar{D}} g(s,\theta)\, h(\bar{D}) = g(s,\theta)\sum_{\bar{D}} h(\bar{D})

  P(D|s,\theta) = \frac{P(D|\theta)}{P(s|\theta)} = \frac{g(s,\theta)\, h(D)}{g(s,\theta)\sum_{\bar{D}} h(\bar{D})} = \frac{h(D)}{\sum_{\bar{D}} h(\bar{D})}

which is independent of θ, so s is sufficient for θ.

Page 45: Kernel Density
Factoring of P(D|θ) into g(s,θ)h(D) is not unique
– If f(s) is any function, g'(s,θ) = f(s)g(s,θ) and h'(D) = h(D)/f(s) are equivalent factors
The ambiguity is removed by defining the kernel density, invariant to such scaling:

  \bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}

Page 46: Example: Multivariate Gaussian

  p(x|\mu) \sim N(\mu, \Sigma)

  p(D|\mu) = \exp\left[-\frac{n}{2}\,\mu^t\Sigma^{-1}\mu + \mu^t\Sigma^{-1}\, n\hat{\mu}_n\right] h(D), \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k

  g(\hat{\mu}_n, \mu) = \exp\left[-\frac{n}{2}\left(\mu^t\Sigma^{-1}\mu - 2\mu^t\Sigma^{-1}\hat{\mu}_n\right)\right]

Kernel density:

  \bar{g}(\hat{\mu}_n, \mu) = \frac{1}{(2\pi)^{d/2}\left|\tfrac{1}{n}\Sigma\right|^{1/2}}\exp\left[-\frac{1}{2}(\mu-\hat{\mu}_n)^t\left(\tfrac{1}{n}\Sigma\right)^{-1}(\mu-\hat{\mu}_n)\right]

Page 47: Kernel Density and Parameter Estimation
Maximum-likelihood: maximization of g(s, θ)
Bayesian:

  p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{\int p(D|\theta)\, p(\theta)\, d\theta} = \frac{g(s,\theta)\, p(\theta)}{\int g(s,\theta)\, p(\theta)\, d\theta}

– If prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
– If p(x|θ) is identifiable, g(s,θ) peaks sharply at some value, and p(θ) is continuous as well as non-zero there, then p(θ|D) approaches the kernel density

Page 48: Sufficient Statistics for Exponential Family

  p(x|\theta) = \alpha(x)\exp\left[a(\theta) + b(\theta)^t c(x)\right]

  p(D|\theta) = \left[\prod_{k=1}^{n}\alpha(x_k)\right]\exp\left[n\, a(\theta) + b(\theta)^t\sum_{k=1}^{n}c(x_k)\right]

  g(s, \theta) = \exp\left[n\left(a(\theta) + b(\theta)^t s\right)\right], \qquad s = \frac{1}{n}\sum_{k=1}^{n}c(x_k), \qquad h(D) = \prod_{k=1}^{n}\alpha(x_k)

Page 49: Error Rate and Dimensionality

Consider the two-class multivariate normal case p(x|\omega_j) \sim N(\mu_j, \Sigma), j = 1, 2.
With equal prior probabilities, the Bayes error rate is

  P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1-\mu_2)^t\Sigma^{-1}(\mu_1-\mu_2)

Suppose the features are statistically independent (the conditionally independent case), \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2):

  r^2 = \sum_{i=1}^{d}\left(\frac{\mu_{i1}-\mu_{i2}}{\sigma_i}\right)^2
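
Since the slide's integral is the Gaussian tail function Q(r/2), it can be evaluated with the complementary error function; a small sketch (my own illustration):

```python
import math

def bayes_error_rate(r):
    """P(e) = (1/sqrt(2 pi)) * integral from r/2 to infinity of exp(-u^2/2) du."""
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))   # Gaussian tail Q(r/2)

for r in [0.5, 1.0, 2.0, 4.0]:   # error falls as the Mahalanobis distance r grows
    print(r, bayes_error_rate(r))
```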

Page 50: Accuracy and Dimensionality (figure)

Page 51: Effects of Additional Features
In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance
Sources of difficulty:
– Wrong models
– Number of design or training samples is finite, and thus the distributions are not estimated accurately

Page 52: Computational Complexity for Maximum-Likelihood Estimation

  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k : O(nd)

  \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^t : O(nd^2)

Find the inverse of a d × d matrix: O(d^3)
Find the determinant of a d × d matrix: O(d^3)

  g(x) = -\frac{1}{2}(x-\hat{\mu})^t\hat{\Sigma}^{-1}(x-\hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

Total: O(nd) + O(nd^2) + O(d^3) + O(1) = O(d^2 n), since n > d.

Page 53: Computational Complexity for Classification

Given x:
Compute (x - \hat{\mu}): O(d)
Multiply by the inverse covariance matrix: O(d^2)
Decision by \max_i g_i(x): O(c)
Total for classification: O(d^2), simpler than learning.

Page 54: Approaches for Inadequate Samples
Reduce dimensionality
– Redesign the feature extractor
– Select an appropriate subset of features
– Combine the existing features
– Pool the available data by assuming all classes share the same covariance matrix
Look for a better estimate for Σ
– Use the Bayesian estimate with a diagonal Σ_0
– Threshold the sample covariance matrix
– Assume statistical independence

Page 55: Shrinkage (Regularized Discriminant Analysis)

i is an index on the categories in question; Σ is estimated by assuming the same covariance matrix.

"Shrink" the individual covariance matrices to the common one:

  \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1-\alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1

or, "shrink" Σ toward the identity matrix:

  \Sigma(\beta) = (1-\beta)\Sigma + \beta I, \qquad 0 < \beta < 1
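
A direct transcription of the two shrinkage formulas as NumPy helpers; the function names are my own:

```python
import numpy as np

def shrink_covariance(sigma_i, sigma, n_i, n, alpha):
    """Shrink one class covariance toward the pooled covariance, 0 < alpha < 1."""
    return (((1 - alpha) * n_i * sigma_i + alpha * n * sigma)
            / ((1 - alpha) * n_i + alpha * n))

def shrink_to_identity(sigma, beta):
    """Shrink a covariance matrix toward the identity, 0 < beta < 1."""
    return (1 - beta) * sigma + beta * np.eye(sigma.shape[0])
```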

Page 56: Concept of Overfitting (figure)

Page 57: Best Representative Point

Given x_1, ..., x_n, find x_0 such that

  J_0(x_0) = \sum_{k=1}^{n}\|x_0 - x_k\|^2

is minimized.

  J_0(x_0) = \sum_{k=1}^{n}\|(x_0 - m) - (x_k - m)\|^2
  = n\|x_0 - m\|^2 - 2(x_0 - m)^t\sum_{k=1}^{n}(x_k - m) + \sum_{k=1}^{n}\|x_k - m\|^2
  = n\|x_0 - m\|^2 + \sum_{k=1}^{n}\|x_k - m\|^2

since \sum_{k=1}^{n}(x_k - m) = 0. The sample mean m = \frac{1}{n}\sum_{k=1}^{n}x_k minimizes J_0(x_0).

Page 58: Projection Along a Line (figure)

Page 59: Best Projection to a Line Through the Sample Mean

Line: x = m + a e, with \|e\| = 1.
Represent x_k by m + a_k e; to minimize the squared error

  J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n}\|(m + a_k e) - x_k\|^2
  = \sum_{k=1}^{n}a_k^2\|e\|^2 - 2\sum_{k=1}^{n}a_k e^t(x_k - m) + \sum_{k=1}^{n}\|x_k - m\|^2

Setting \partial J_1/\partial a_k = 0:

  a_k = e^t(x_k - m)

Page 60: Best Representative Direction

Find e to minimize

  J_1(e) = \sum_{k=1}^{n}a_k^2 - 2\sum_{k=1}^{n}a_k^2 + \sum_{k=1}^{n}\|x_k - m\|^2 = -\sum_{k=1}^{n}\left(e^t(x_k - m)\right)^2 + \sum_{k=1}^{n}\|x_k - m\|^2

  = -e^t S e + \sum_{k=1}^{n}\|x_k - m\|^2, \qquad S = \sum_{k=1}^{n}(x_k - m)(x_k - m)^t \text{ (scatter matrix)}

Maximize e^t S e subject to \|e\| = 1.
Lagrange method: maximize u = e^t S e - \lambda(e^t e - 1):

  \nabla_e u = 2Se - 2\lambda e = 0 \Rightarrow Se = \lambda e

Page 61: Principal Component Analysis (PCA)

Projection space: x = m + \sum_{i=1}^{d'} a_i e_i

Find e_i and a_{ki}, i = 1, \ldots, d', to minimize

  J_{d'} = \sum_{k=1}^{n}\left\|\left(m + \sum_{i=1}^{d'}a_{ki}e_i\right) - x_k\right\|^2

e_1, \ldots, e_{d'} are the eigenvectors of S having the largest eigenvalues.
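
A compact NumPy sketch of this construction: eigenvectors of the scatter matrix, coefficients a_ki = e_i^t(x_k - m), and the minimized criterion. The names are my own:

```python
import numpy as np

def pca(X, d_prime):
    """Top-d' scatter-matrix eigenvectors and projection coefficients for (n, d) samples X."""
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m)                 # scatter matrix
    _, eigvecs = np.linalg.eigh(S)          # eigh: ascending eigenvalues for symmetric S
    E = eigvecs[:, ::-1][:, :d_prime]       # e_1, ..., e_d' (largest eigenvalues first)
    A = (X - m) @ E                         # a_ki = e_i^t (x_k - m)
    return m, E, A

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=200)
m, E, A = pca(X, d_prime=2)
X_recon = m + A @ E.T                       # best rank-2 reconstruction of the samples
print(np.sum((X - X_recon) ** 2))           # the minimized criterion J_2
```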

Page 62: Concept of Fisher Linear Discriminant (figure)

Page 63: Fisher Linear Discriminant Analysis

Find w to get maximal separation on y = w^t x.

  m_i = \frac{1}{n_i}\sum_{x \in D_i} x, \quad i = 1, 2

  \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = w^t m_i, \qquad |\tilde{m}_1 - \tilde{m}_2| = |w^t(m_1 - m_2)|

Within-class scatter: \tilde{s}_i^2 = \sum_{y \in Y_i}(y - \tilde{m}_i)^2

To maximize

  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

Page 64: Fisher Linear Discriminant Analysis

  \tilde{s}_i^2 = \sum_{x \in D_i}(w^t x - w^t m_i)^2 = w^t S_i w, \qquad S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^t

  \tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w, \qquad S_W = S_1 + S_2

  (\tilde{m}_1 - \tilde{m}_2)^2 = \left(w^t(m_1 - m_2)\right)^2 = w^t S_B w, \qquad S_B = (m_1 - m_2)(m_1 - m_2)^t

Page 65: Fisher Linear Discriminant Analysis

  J(w) = \frac{w^t S_B w}{w^t S_W w}, the generalized Rayleigh quotient

J(w) is maximized when S_B w = \lambda S_W w (generalized eigenvalue problem).
S_B w = (m_1 - m_2)(m_1 - m_2)^t w is always in the direction of (m_1 - m_2), so [ignoring scales]

  w = S_W^{-1}(m_1 - m_2)

Page 66: Fisher Linear Discriminant Analysis for Multivariate Normal

Assume the same covariance matrix Σ; the optimal decision boundary is

  w^t x + w_0 = 0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2)

With estimation for \mu_1, \mu_2, and \Sigma:

  w = S_W^{-1}(m_1 - m_2) [solution to Fisher linear discriminant analysis]
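
A minimal sketch of the closed-form Fisher direction w = S_W^{-1}(m_1 - m_2); the two-class test data are my own illustration:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """w = S_W^{-1} (m1 - m2) for two (n_i, d) sample matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_W, m1 - m2)    # solve instead of an explicit inverse

rng = np.random.default_rng(4)
cov = [[1.0, 0.5], [0.5, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=100)
w = fisher_lda_direction(X1, X2)
print(X1.mean(axis=0) @ w, X2.mean(axis=0) @ w)   # well-separated projected means
```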

Page 67: Concept of Multidimensional Discriminant Analysis (figure)

Page 68: Multiple Discriminant Analysis

Consider the c-class problem.
Projection from the d-dimensional space to a (c-1)-dimensional subspace:

  y_i = w_i^t x, \quad i = 1, \ldots, c-1, \qquad y = W^t x

  S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i}(x - m_i)(x - m_i)^t, \qquad m_i = \frac{1}{n_i}\sum_{x \in D_i} x

  \tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = W^t m_i, \qquad \tilde{S}_W = \sum_{i=1}^{c}\sum_{y \in Y_i}(y - \tilde{m}_i)(y - \tilde{m}_i)^t = W^t S_W W

Page 69: Multiple Discriminant Analysis

  m = \frac{1}{n}\sum_{x} x = \frac{1}{n}\sum_{i=1}^{c} n_i m_i

  S_T = \sum_{x}(x - m)(x - m)^t = \sum_{i=1}^{c}\sum_{x \in D_i}(x - m_i + m_i - m)(x - m_i + m_i - m)^t

  = \sum_{i=1}^{c}\sum_{x \in D_i}(x - m_i)(x - m_i)^t + \sum_{i=1}^{c} n_i(m_i - m)(m_i - m)^t = S_W + S_B

  S_B = \sum_{i=1}^{c} n_i(m_i - m)(m_i - m)^t

Page 70: Multiple Discriminant Analysis

Seek a transformation W to maximize the ratio of the between-class scatter to the within-class scatter.
A simple scalar measure of scatter is the determinant of the scatter matrix (equivalent to the product of the variances in the principal directions).

  \tilde{m} = \frac{1}{n}\sum_{i=1}^{c} n_i \tilde{m}_i, \qquad \tilde{S}_B = \sum_{i=1}^{c} n_i(\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^t = W^t S_B W

  J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^t S_B W|}{|W^t S_W W|}

Page 71: Multiple Discriminant Analysis
Columns of the optimal W satisfy

  S_B w_i = \lambda_i S_W w_i

and w_i is the generalized eigenvector related to the largest eigenvalues.
The optimal W is not unique, since it can be multiplied by rotation or scaling matrices, etc.

Page 72: Expectation-Maximization (EM)
Finding the maximum-likelihood estimate of the parameters of an underlying distribution
– from a given data set when the data is incomplete or has missing values
Two main applications:
– When the data indeed has missing values
– When optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters

Page 73: Expectation-Maximization (EM)
Full sample D = {x_1, ..., x_n}, with x_k = {x_kg, x_kb}
Separate the individual features into good features D_g and missing (bad) features D_b
– D is the union of D_g and D_b
Form the function

  Q(\theta; \theta^i) = E_{D_b}\left[\ln p(D_g, D_b; \theta)\, \big|\, D_g; \theta^i\right]

Page 74: Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^{i+1} ← arg max_θ Q(θ; θ^i)
  until Q(θ^{i+1}; θ^i) - Q(θ^i; θ^{i-1}) ≤ T
  return θ̂ ← θ^{i+1}
end

Page 75: Expectation-Maximization (EM) (figure)

Page 76: Example: 2D Model

  D = \{x_1, x_2, x_3, x_4\} = \left\{\binom{0}{2}, \binom{1}{0}, \binom{2}{2}, \binom{*}{4}\right\}

where the * component x_{41} is missing.

Assume a 2D Gaussian model with diagonal covariance matrix:

  \theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^t, \qquad \theta^0 = (0, 0, 1, 1)^t

Page 77: Example: 2D Model

  Q(\theta; \theta^0) = E_{x_{41}}\left[\ln p(x_g, x_b; \theta)\, \big|\, \theta^0; D_g\right]

  = \sum_{k=1}^{3}\ln p(x_k|\theta) + \int_{-\infty}^{\infty} \frac{p\!\left(\binom{x_{41}}{4}\Big|\,\theta^0\right)}{K}\,\ln p\!\left(\binom{x_{41}}{4}\Big|\,\theta\right) dx_{41}

where K = \int_{-\infty}^{\infty} p\!\left(\binom{x_{41}}{4}\Big|\,\theta^0\right) dx_{41} is a normalization constant.

Page 78: Example: 2D Model

For \theta^0 = (0, 0, 1, 1)^t the conditional density of x_{41} given x_{42} = 4 is N(0, 1) (the covariance is diagonal), so

  Q(\theta; \theta^0) = \sum_{k=1}^{3}\ln p(x_k|\theta) + \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x_{41}^2}{2}\right]\ln p\!\left(\binom{x_{41}}{4}\Big|\,\theta\right) dx_{41}

Maximizing Q(\theta; \theta^0) gives

  \theta^1 = (0.75,\; 2.0,\; 0.938,\; 2.0)^t

7979

Example: 2D ModelExample: 2D Model

0.200667.0

0.20.1

at converges algorithm the,iterations 3After

Σ
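
A compact sketch of this EM iteration for the diagonal-Gaussian model with the single missing component x_41: the E step needs only E[x_41] and E[x_41^2] (marginal moments, since the covariance is diagonal), and the M step plugs these expected sufficient statistics into the usual ML formulas. Details beyond the slides are my own:

```python
import numpy as np

X_good = np.array([[0.0, 2.0], [1.0, 0.0], [2.0, 2.0]])   # x_1, x_2, x_3 fully observed
x42 = 4.0                                                  # observed part of x_4
mu, var = np.array([0.0, 0.0]), np.array([1.0, 1.0])       # theta^0 = (0, 0, 1, 1)^t

for i in range(10):
    # E step: diagonal covariance, so x_41 | x_42 = 4 has the marginal moments of x_41
    e_x41, e_x41_sq = mu[0], var[0] + mu[0] ** 2
    # M step: ML formulas with the expected sufficient statistics filled in
    x1_vals = np.append(X_good[:, 0], e_x41)
    x2_vals = np.append(X_good[:, 1], x42)
    mu = np.array([x1_vals.mean(), x2_vals.mean()])
    var = np.array([
        (np.sum((X_good[:, 0] - mu[0]) ** 2)
         + e_x41_sq - 2.0 * mu[0] * e_x41 + mu[0] ** 2) / 4.0,
        np.mean((x2_vals - mu[1]) ** 2),
    ])

print(mu, var)   # approaches mu = (1.0, 2.0), diag(Sigma) = (0.667, 2.0)
```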

Page 80: Generalized Expectation-Maximization (GEM)
Instead of maximizing Q(θ; θ^i), we find some θ^{i+1} such that Q(θ^{i+1}; θ^i) > Q(θ^i; θ^i), which is also guaranteed to converge
Convergence will not be as rapid
Offers great freedom to choose computationally simpler steps
– e.g., using the maximum-likelihood value of the unknown values, if they lead to a greater likelihood

Page 81: Hidden Markov Model (HMM)
Used for problems of making a series of decisions
– e.g., speech or gesture recognition
Problem states at time t are influenced directly by a state at t-1
More reference:
– L. A. Rabiner and B. W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.

Page 82: First Order Markov Models

Sequence of states: \omega^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}

e.g., \omega^6 = \{\omega_1, \omega_3, \omega_2, \omega_2, \omega_1, \omega_3\}:

  P(\omega^6|\theta) = a_{13}\, a_{32}\, a_{22}\, a_{21}\, a_{13}

Page 83: First Order Hidden Markov Models

Sequence of visible states: V^T = \{v(1), v(2), \ldots, v(T)\}

e.g., V^6 = \{v_4, v_1, v_1, v_4, v_2, v_3\}, with emission probabilities

  b_{jk} = P(v_k(t)|\omega_j(t))

Page 84: Hidden Markov Model Probabilities

Transition probability: a_{ij} = P(\omega_j(t+1)|\omega_i(t))
Final or absorbing state \omega_0: a_{00} = 1
Probability of emission of a visible state: b_{jk} = P(v_k(t)|\omega_j(t))
Normalization: \sum_j a_{ij} = 1, \qquad \sum_k b_{jk} = 1

Page 85: Hidden Markov Model Computation
Evaluation problem
– Given a_ij and b_jk, determine P(V^T|θ)
Decoding problem
– Given V^T, determine the most likely sequence of hidden states that leads to V^T
Learning problem
– Given training observations of visible symbols and the coarse structure but not the probabilities, determine a_ij and b_jk

Page 86: Evaluation

  P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T|\omega_r^T)\, P(\omega_r^T)

  P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t)|\omega(t-1))

  P(V^T|\omega_r^T) = \prod_{t=1}^{T} P(v(t)|\omega(t))

  P(V^T) = \sum_{r=1}^{r_{\max}}\prod_{t=1}^{T} P(v(t)|\omega(t))\, P(\omega(t)|\omega(t-1))

Page 87: HMM Forward

  P(V^T) = \sum_{r=1}^{r_{\max}}\prod_{t=1}^{T} P(v(t)|\omega(t))\, P(\omega(t)|\omega(t-1))

Define \alpha_j(t) = P(v(1), \ldots, v(t), \omega(t) = \omega_j):

  \alpha_j(0) = 1 for the initial state, 0 otherwise

  \alpha_j(t) = \left[\sum_{i=1}^{c}\alpha_i(t-1)\, a_{ij}\right] b_{j\,v(t)}

  P(V^T) = P(V^T, \omega(T) = \omega_0) = \alpha_0(T)

Page 88: HMM Forward and Trellis (figure)

Page 89: HMM Forward
begin initialize t ← 0, a_ij, b_jk, visible sequence V^T, α_j(0)
  for t ← t + 1
    α_j(t) ← [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}
  until t = T
  return P(V^T) ← α_0(T) for the final state
end
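
A sketch of the forward recursion in NumPy, written against the a_ij, b_jk conventions above; the test sequence and the use of Example 3's matrices (Page 93 below) are my own illustration:

```python
import numpy as np

def hmm_forward(A, B, obs, init_state=1):
    """alpha recursion; A[i, j] = a_ij, B[j, k] = b_jk, obs = indices of visible symbols."""
    alpha = np.zeros(A.shape[0])
    alpha[init_state] = 1.0                  # alpha_j(0): all mass on the initial state
    for v_t in obs:
        alpha = (alpha @ A) * B[:, v_t]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_{j v(t)}
    return alpha[0]                          # alpha_0(T): mass in the final state omega_0

# Example 3's matrices; the observation sequence ends with the null symbol v_0
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])
B = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])
print(hmm_forward(A, B, obs=[1, 3, 2, 0]))
```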

Page 90: HMM Backward

Define \beta_i(t) analogously for the remainder of the sequence:

  \beta_i(T) = 1 for the final state \omega_0, 0 otherwise

  \beta_i(t) = \sum_{j=1}^{c}\beta_j(t+1)\, a_{ij}\, b_{j\,v(t+1)}

  P(V^T) = P(V^T, \omega(0) = \omega_{\mathrm{init}}) = \beta_{\mathrm{init}}(0)

Page 91: HMM Backward
begin initialize β_j(T), t ← T, a_ij, b_jk, visible sequence V^T
  for t ← t - 1
    β_i(t) ← Σ_{j=1}^{c} β_j(t+1) a_ij b_{j v(t+1)}
  until t = 0
  return P(V^T) ← β_i(0) for the initial state
end

Page 92: Example 3: Hidden Markov Model (figure)

Page 93: Example 3: Hidden Markov Model

  a_{ij} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.0 & 0.1 \end{pmatrix}, \qquad b_{jk} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.3 & 0.4 & 0.1 & 0.2 \\ 0 & 0.1 & 0.1 & 0.7 & 0.1 \\ 0 & 0.5 & 0.2 & 0.1 & 0.2 \end{pmatrix}

Page 94: Example 3: Hidden Markov Model (figure)

Page 95: Left-to-Right Models for Speech

  P(\omega^T|V^T) = \frac{P(V^T|\omega^T)\, P(\omega^T)}{P(V^T)}

Page 96: HMM Decoding (figure)

Page 97: Problem of Local Optimization
This decoding algorithm depends only on the single previous time step, not the full sequence
It does not guarantee that the path is indeed allowable

Page 98: HMM Decoding
begin initialize Path ← {}, t ← 0
  for t ← t + 1
    j ← 0
    for j ← j + 1
      α_j(t) ← [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}
    until j = c
    j' ← arg max_j α_j(t)
    append ω_{j'} to Path
  until t = T
  return Path
end
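
A sketch of this per-step (greedy) decoder; note, per Page 97, that it is a local method and not the full Viterbi algorithm. The names are my own:

```python
import numpy as np

def hmm_decode_greedy(A, B, obs, init_state=1):
    """At each step keep the state maximizing alpha_j(t), as in the algorithm above."""
    alpha = np.zeros(A.shape[0])
    alpha[init_state] = 1.0
    path = []
    for v_t in obs:
        alpha = (alpha @ A) * B[:, v_t]
        path.append(int(np.argmax(alpha)))   # locally most probable state at step t
    return path
```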

Page 99: Example 4: HMM Decoding (figure)

Page 100: Forward-Backward Algorithm
Determines the model parameters a_ij and b_jk from an ensemble of training samples
An instance of a generalized expectation-maximization algorithm
There is no known method for obtaining the optimal or most likely set of parameters from data

Page 101: Probability of Transition

  \gamma_{ij}(t) = P(\omega_i(t-1), \omega_j(t)\, |\, V^T, \theta) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{j\,v(t)}\, \beta_j(t)}{P(V^T|\theta)}

Page 102: Improved Estimate for a_ij

Expected number of transitions between state ω_i(t-1) and ω_j(t) at any time in the sequence:

  \sum_{t=1}^{T}\gamma_{ij}(t)

Total expected number of any transitions from ω_i:

  \sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)

Estimate of a_ij:

  \hat{a}_{ij} = \frac{\sum_{t=1}^{T}\gamma_{ij}(t)}{\sum_{t=1}^{T}\sum_{k}\gamma_{ik}(t)}

Page 103: Improved Estimate for b_jk

  \hat{b}_{jk} = \frac{\sum_{t=1,\; v(t)=v_k}^{T}\sum_{l}\gamma_{jl}(t)}{\sum_{t=1}^{T}\sum_{l}\gamma_{jl}(t)}
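
A sketch of one re-estimation pass built from γ_ij(t) and the two update formulas, assuming forward values alphas[0..T] and backward values betas[0..T] have been precomputed (all names hypothetical; zero-probability rows would need guarding in practice):

```python
import numpy as np

def reestimate(A, B, obs, alphas, betas, pV):
    """One Baum-Welch pass: gamma_ij(t) = alpha_i(t-1) a_ij b_{j v(t)} beta_j(t) / P(V^T)."""
    T, c = len(obs), A.shape[0]
    gamma = np.zeros((T, c, c))
    for t in range(T):   # gamma for time step t+1 in the slides' 1-based notation
        gamma[t] = (alphas[t][:, None] * A * B[:, obs[t]][None, :]
                    * betas[t + 1][None, :]) / pV
    A_new = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array([v == k for v in obs])
        B_new[:, k] = gamma[mask].sum(axis=(0, 2)) / gamma.sum(axis=(0, 2))
    return A_new, B_new
```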

Page 104: Forward-Backward Algorithm (Baum-Welch Algorithm)
begin initialize a_ij, b_jk, training sequence V^T, convergence criterion θ, z ← 0
  do z ← z + 1
    compute â(z) from a(z-1) and b(z-1)
    compute b̂(z) from a(z-1) and b(z-1)
    a_ij(z) ← â_ij(z)
    b_jk(z) ← b̂_jk(z)
  until max_{i,j,k} [a_ij(z) - a_ij(z-1), b_jk(z) - b_jk(z-1)] < θ
  return a_ij ← a_ij(z); b_jk ← b_jk(z)
end