machine learning & data mining - yisong yue · machine learning & data mining cs/cns/ee 155...

Machine Learning & Data Mining CS/CNS/EE 155

Lecture 11: Hidden Markov Models

Kaggle Competition Part 1

Kaggle Competition Part 2

Announcements

•  Updated Kaggle Report Due Date:
   – 9pm on Monday Feb 13th

•  Next Recitation Tuesday Feb 14th
   – 7pm-8pm
   – Recap of concepts covered today

Sequence Prediction (POS Tagging)

•  x = “Fish Sleep”
•  y = (N, V)

•  x = “The Dog Ate My Homework”
•  y = (D, N, V, D, N)

•  x = “The Fox Jumped Over The Fence”
•  y = (D, N, V, P, D, N)

Challenges

•  Multivariable Output
   – Make multiple predictions simultaneously

•  Variable Length Input/Output
   – Sentence lengths not fixed

Multivariate Outputs

•  x = “Fish Sleep”
•  y = (N, V)
•  Multiclass prediction:

POS Tags: Det, Noun, Verb, Adj, Adv, Prep

Replicate Weights:

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}$$

Score All Classes:

$$f(x \mid w, b) = \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix}$$

Predict via Largest Score:

$$\arg\max_k \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix}$$

•  How many classes?

Multiclass Prediction

•  x = “Fish Sleep”
•  y = (N, V)
•  Multiclass prediction:
   – All possible length-M sequences as different class
   – (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), …

•  $L^M$ classes!
   – Length 2: $6^2 = 36$! (L = 6)

Exponential Explosion in #Classes! (Not Tractable for Sequence Prediction)

Why is Naïve Multiclass Intractable?

x = “I fish often”
POS Tags: Det, Noun, Verb, Adj, Adv, Prep

Assume pronouns are nouns for simplicity.

–  (D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
–  (D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
–  (D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
–  …
–  (N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
–  (N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
–  …

Treats Every Combination As Different Class (Learn model for each combination)

Exponentially Large Representation! (Exponential Time to Consider Every Class) (Exponential Storage)

Independent Classification

x = “I fish often”
POS Tags: Det, Noun, Verb, Adj, Adv, Prep

•  Treat each word independently (assumption)
   – Independent multiclass prediction per word
   – Predict for x = “I” independently
   – Predict for x = “fish” independently
   – Predict for x = “often” independently
   – Concatenate predictions.

#Classes = #POS Tags (6 in our example)

Solvable using standard multiclass prediction.

P(y|x)      x=“I”    x=“fish”    x=“often”
y=“Det”     0.0      0.0         0.0
y=“Noun”    1.0      0.75        0.0
y=“Verb”    0.0      0.25        0.0
y=“Adj”     0.0      0.0         0.4
y=“Adv”     0.0      0.0         0.6
y=“Prep”    0.0      0.0         0.0

Prediction: (N, N, Adv)

Correct: (N, V, Adv)

Why the mistake?
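The independent prediction in the table can be reproduced with a few lines of Python. This is only a sketch: the dictionary `P_TAG_GIVEN_WORD` and the function `predict_independently` are illustrative names, and the numbers are copied from the P(y|x) table.

```python
# Per-word posteriors P(y | x), copied from the P(y|x) table.
TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]
P_TAG_GIVEN_WORD = {
    "I":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "fish":  [0.0, 0.75, 0.25, 0.0, 0.0, 0.0],
    "often": [0.0, 0.0, 0.0, 0.4, 0.6, 0.0],
}

def predict_independently(words):
    """Pick the highest-probability tag for each word, ignoring its neighbours."""
    prediction = []
    for word in words:
        probs = P_TAG_GIVEN_WORD[word]
        best = max(range(len(TAGS)), key=lambda k: probs[k])
        prediction.append(TAGS[best])
    return prediction

print(predict_independently(["I", "fish", "often"]))  # ['Noun', 'Noun', 'Adv']
```

Because each column is maximized on its own, “fish” is always tagged Noun, whatever the previous tag was.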

Context Between Words

•  Independent Predictions Ignore Word Pairs
   – In Isolation:
      •  “Fish” is more likely to be a Noun
   – But Conditioned on Following a (pro)Noun…
      •  “Fish” is more likely to be a Verb!
   – “1st Order” Dependence (Model All Pairs)
      •  2nd Order Considers All Triplets
      •  Arbitrary Order = Exponential Size (Naïve Multiclass)

1st Order Hidden Markov Model

•  x = (x1, x2, x3, x4, …, xM) (sequence of words)
•  y = (y1, y2, y3, y4, …, yM) (sequence of POS tags)

•  P(xi | yi): Probability of state yi generating xi
•  P(yi+1 | yi): Probability of state yi transitioning to yi+1
•  P(y1 | y0): y0 is defined to be the Start state
•  P(End | yM): Prior probability of yM being the final state
   – Not always used

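Graphical Model Representation

[Figure: chain-structured graphical model. Hidden states Y0 → Y1 → Y2 → … → YM → YEnd; each Yi emits the observation Xi.]

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

“Joint Distribution” (the $P(\text{End} \mid y_M)$ term is Optional)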

$$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$$

“Conditional Distribution on x given y”

Given a POS Tag Sequence y: Can compute each P(xi | y) independently! (xi conditionally independent given yi)

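In the joint distribution $P(x, y)$:

•  $\prod_i P(y_i \mid y_{i-1})$: Models All State-State Pairs (all POS Tag-Tag pairs)
   – Additional Complexity of (#POS Tags)$^2$
•  $\prod_i P(x_i \mid y_i)$: Models All State-Observation Pairs (all Tag-Word pairs)
   – Same Complexity as Independent Multiclass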
Relationship to Naïve Bayes

[Figure: the chain-structured graphical model Y0 → Y1 → … → YM → YEnd with emissions Xi, shown together with the joint distribution $P(x, y)$.]

Reduces to a sequence of disjoint Naïve Bayes models (if we ignore transition probabilities)

P(word | state/tag)

•  Two-word language: “fish” and “sleep”
•  Two-tag language: “Noun” and “Verb”

P(x|y)       y=“Noun”    y=“Verb”
x=“fish”     0.8         0.5
x=“sleep”    0.2         0.5

Given Tag Sequence y:

P(“fish sleep” | (N,V)) = 0.8 * 0.5
P(“fish fish” | (N,V)) = 0.8 * 0.5
P(“sleep fish” | (V,V)) = 0.5 * 0.5
P(“sleep sleep” | (N,N)) = 0.2 * 0.2

Slides borrowed from Ralph Grishman
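A minimal sketch of how P(x | y) is evaluated from the P(x|y) table, one factor per word. The names `EMISSION` and `conditional_likelihood` are illustrative and not part of the lecture.

```python
# Emission probabilities P(word | tag) from the two-word, two-tag table.
EMISSION = {
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def conditional_likelihood(words, tags):
    """P(x | y) = prod_i P(x_i | y_i): each word depends only on its own tag."""
    prob = 1.0
    for word, tag in zip(words, tags):
        prob *= EMISSION[tag][word]
    return prob

print(conditional_likelihood(["fish", "sleep"], ["Noun", "Verb"]))
print(conditional_likelihood(["sleep", "fish"], ["Verb", "Verb"]))
```

No transition probabilities appear here: once the tag sequence is fixed, the words are generated independently of one another.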

Sampling

•  HMMs are “generative” models
   – Models joint distribution P(x,y)
   – Can generate samples from this distribution
   – First consider conditional distribution P(x|y)

Given Tag Sequence y = (N,V):

Sample each word independently:
Sample P(x1|N) (0.8 Fish, 0.2 Sleep)
Sample P(x2|V) (0.5 Fish, 0.5 Sleep)

   – What about sampling from P(x,y)?

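Forward Sampling of P(y,x)

[Figure: A Simple POS HMM. States start, noun, verb, end, with transition probabilities start→noun 0.8, start→verb 0.2, noun→noun 0.1, noun→verb 0.8, noun→end 0.1, verb→noun 0.2, verb→verb 0.1, verb→end 0.7. Emission probabilities: P(fish|noun) = 0.8, P(sleep|noun) = 0.2, P(fish|verb) = 0.5, P(sleep|verb) = 0.5.]

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start
Initialize i = 0
1.  i = i + 1
2.  Sample yi from P(yi | yi-1)
3.  If yi == End: Quit
4.  Sample xi from P(xi | yi)
5.  Goto Step 1

Exploits Conditional Ind.
Requires P(End | yi)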
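The sampling loop can be sketched in Python as follows. The transition and emission values are those of the Simple POS HMM as they are used in the lecture's numerical example. The helper names `sample_from` and `forward_sample`, and the fixed seed, are assumptions made for this illustration.

```python
import random

# P(next state | current state) for the Simple POS HMM (with an end state).
TRANSITION = {
    "start": {"noun": 0.8, "verb": 0.2},
    "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
    "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7},
}
# P(word | state).
EMISSION = {
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def sample_from(distribution, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def forward_sample(rng):
    """Sample (x, y) from P(x, y) by walking the chain until End is drawn."""
    words, tags = [], []
    state = "start"                                       # y0 = Start
    while True:
        state = sample_from(TRANSITION[state], rng)       # y_i ~ P(y_i | y_{i-1})
        if state == "end":                                # stop once End is sampled
            return words, tags
        tags.append(state)
        words.append(sample_from(EMISSION[state], rng))   # x_i ~ P(x_i | y_i)

rng = random.Random(0)  # arbitrary seed, only for repeatability
for _ in range(3):
    print(forward_sample(rng))
```

Each draw only looks at the previous state or the current state, which is the conditional independence the procedure exploits.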

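Forward Sampling of P(y,x|M)

[Figure: A Simple POS HMM without an end state. States start, noun, verb, with transition probabilities start→noun 0.8, start→verb 0.2, noun→noun 0.09, noun→verb 0.91, verb→noun 0.667, verb→verb 0.333. Emission probabilities: P(fish|noun) = 0.8, P(sleep|noun) = 0.2, P(fish|verb) = 0.5, P(sleep|verb) = 0.5.]

$$P(x, y \mid M) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

Initialize y0 = Start
Initialize i = 0
1.  i = i + 1
2.  If (i == M + 1): Quit
3.  Sample yi from P(yi | yi-1)
4.  Sample xi from P(xi | yi)
5.  Goto Step 1

Exploits Conditional Ind.
Assumes no P(End | yi)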

$$P(x_{k+1:M}, y_{k+1:M} \mid x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} \mid y_k)$$

“Memory-less Model”: only needs $y_k$ to model the rest of the sequence

Viterbi Algorithm

Most Common Prediction Problem

•  Given input sentence, predict POS Tag seq.

$$\arg\max_y P(y \mid x)$$

•  Naïve approach:
   – Try all possible y’s
   – Choose one with highest probability
   – Exponential time: $L^M$ possible y’s

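Dynamic Programming

•  Input: x = (x1, x2, x3, …, xM)
•  Computed: best length-k prefix ending in each Tag:
   – Examples:

$$\hat{Y}^k(V) = \Big( \arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus V, x_{1:k}) \Big) \oplus V \qquad \hat{Y}^k(N) = \Big( \arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus N, x_{1:k}) \Big) \oplus N$$

   ($\oplus$ denotes Sequence Concatenation)

•  Claim:

$$\hat{Y}^{k+1}(V) = \Big( \arg\max_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} P(y_{1:k} \oplus V, x_{1:k+1}) \Big) \oplus V$$

$$= \Big( \arg\max_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} \underbrace{P(y_{1:k}, x_{1:k})}_{\text{Pre-computed}} \, P(y_{k+1} = V \mid y_k) \, P(x_{k+1} \mid y_{k+1} = V) \Big) \oplus V$$

Recursive Definition!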

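Solve:

$$\hat{Y}^2(V) = \Big( \arg\max_{y_1 \in \{\hat{Y}^1(T)\}_T} P(y_1, x_1) \, P(y_2 = V \mid y_1) \, P(x_2 \mid y_2 = V) \Big) \oplus V$$

[Figure: trellis with columns Ŷ1(V), Ŷ1(D), Ŷ1(N) and Ŷ2(V), Ŷ2(D), Ŷ2(N); the candidates y1 = V, y1 = D, y1 = N each connect to Ŷ2(V).]

Store each Ŷ1(Z) & P(Ŷ1(Z), x1)

Ŷ1(Z) is just Z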

Ex: Ŷ2(V) = (N, V), i.e., the maximizing predecessor is y1 = N.

Store each Ŷ2(Z) & P(Ŷ2(Z), x1:2)

Solve:

$$\hat{Y}^3(V) = \Big( \arg\max_{y_{1:2} \in \{\hat{Y}^2(T)\}_T} P(y_{1:2}, x_{1:2}) \, P(y_3 = V \mid y_2) \, P(x_3 \mid y_3 = V) \Big) \oplus V$$

[Figure: trellis extended with the column Ŷ3(V), Ŷ3(D), Ŷ3(N); the candidates y2 = V, y2 = D, y2 = N each connect to Ŷ3(V).]

Claim: Only need to check solutions of Ŷ2(Z), Z = V, D, N

Suppose Ŷ3(V) = (V, V, V) while Ŷ2(V) = (N, V). Then we can prove that (N, V, V) has higher prob.

Proof depends on 1st order property
•  Prob. of (V, V, V) & (N, V, V) differ in 3 terms
•  P(y1|y0), P(x1|y1), P(y2|y1)
•  None of these depend on y3!

Final step (the P(End | yM) term is Optional):

$$\hat{Y}^M(V) = \Big( \arg\max_{y_{1:M-1} \in \{\hat{Y}^{M-1}(T)\}_T} P(y_{1:M-1}, x_{1:M-1}) \, P(y_M = V \mid y_{M-1}) \, P(x_M \mid y_M = V) \, P(\text{End} \mid y_M = V) \Big) \oplus V$$

[Figure: full trellis with columns Ŷ1(·), Ŷ2(·), Ŷ3(·), …, ŶM(·) over the tags V, D, N.]

Ex: Ŷ3(V) = (D, N, V)

Viterbi Algorithm

•  Solve:

$$\arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y P(x \mid y) P(y)$$

•  For k = 1..M
   – Iteratively solve for each Ŷk(Z)
      •  Z looping over every POS tag.

•  Predict best ŶM(Z)
•  Also known as Maximum A Posteriori (MAP) inference
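The recursion can be sketched in Python on the Simple POS HMM (two tags, two words, optional End term included). This is an illustrative sketch rather than the lecture's reference code: the function name `viterbi` and the dictionary layout are assumptions, and whole prefixes are stored instead of back pointers to keep the code short.

```python
TAGS = ["noun", "verb"]
# P(next state | current state) for the Simple POS HMM.
TRANSITION = {
    "start": {"noun": 0.8, "verb": 0.2},
    "noun":  {"noun": 0.1, "verb": 0.8, "end": 0.1},
    "verb":  {"noun": 0.2, "verb": 0.1, "end": 0.7},
}
# P(word | state).
EMISSION = {
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def viterbi(words):
    """Return (best tag sequence, its joint probability including P(End | y_M))."""
    # best[z] = (probability of the best prefix ending in tag z, that prefix)
    best = {z: (TRANSITION["start"][z] * EMISSION[z][words[0]], [z]) for z in TAGS}
    for word in words[1:]:
        new_best = {}
        for z in TAGS:
            # Only the single best prefix per previous tag needs to be extended.
            prev = max(TAGS, key=lambda t: best[t][0] * TRANSITION[t][z])
            prob = best[prev][0] * TRANSITION[prev][z] * EMISSION[z][word]
            new_best[z] = (prob, best[prev][1] + [z])
        best = new_best
    # Optional final step: multiply in P(End | y_M) before choosing the winner.
    last = max(TAGS, key=lambda z: best[z][0] * TRANSITION[z]["end"])
    return best[last][1], best[last][0] * TRANSITION[last]["end"]

path, prob = viterbi(["fish", "sleep"])
print(path, prob)  # path is ['noun', 'verb'], prob is about 0.1792
```

The work per token is proportional to (#tags)$^2$, so the total cost is linear in the sentence length rather than exponential.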

Numerical Example

x = (Fish Sleep)

[Figure: A Simple POS HMM, the state-transition diagram over start, noun, verb, end.]

         0    1    2    3
start    1
verb     0
noun     0
end      0

Token 1: fish

         0    1
start    1    0
verb     0    .2 * .5 = .1
noun     0    .8 * .8 = .64
end      0    0


Token 2: sleep

(if ‘fish’ is verb)

         0    1      2
start    1    0      0
verb     0    .1     .1 * .1 * .5 = .005
noun     0    .64    .1 * .2 * .2 = .004
end      0    0      -

(if ‘fish’ is a noun)

         0    1      2
start    1    0      0
verb     0    .1     .64 * .8 * .5 = .256
noun     0    .64    .64 * .1 * .2 = .0128
end      0    0      -

take maximum, set back pointers

         0    1      2
start    1    0      0
verb     0    .1     .256
noun     0    .64    .0128
end      0    0      -


Token 3: end

         0    1      2        3
start    1    0      0        0
verb     0    .1     .256     -
noun     0    .64    .0128    -
end      0    0      -        .256 * .7 = .1792 (from verb); .0128 * .1 = .00128 (from noun)

take maximum, set back pointers

         0    1      2        3
start    1    0      0        0
verb     0    .1     .256     -
noun     0    .64    .0128    -
end      0    0      -        .256 * .7 = .1792

Decode: fish = noun, sleep = verb



What might go wrong for long sequences?

Underflow! Small numbers get repeatedly multiplied together: exponentially small!

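Viterbi Algorithm (w/ Log Probabilities)

•  Solve:

$$\arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y \big[ \log P(x \mid y) + \log P(y) \big]$$

•  For k = 1..M
   – Iteratively solve for each log(Ŷk(Z)), i.e., the log-probability of the best length-k prefix ending in Z
      •  Z looping over every POS tag.

•  Predict best log(ŶM(Z))
   – Log(ŶM(Z)) accumulates additively, not multiplicatively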
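A small experiment shows why the log formulation matters. The per-step probability `p = 0.01` and the length `steps = 200` are hypothetical values chosen only to trigger underflow in double precision.

```python
import math

p = 0.01       # hypothetical per-step probability
steps = 200    # hypothetical sequence length

product = 1.0
log_sum = 0.0
for _ in range(steps):
    product *= p              # multiplicative accumulation
    log_sum += math.log(p)    # additive accumulation in log space

print(product)   # 0.0: the true value 1e-400 is below the smallest double
print(log_sum)   # roughly -921.03, comfortably representable
```

Since log is monotonic, the argmax over tag sequences is unchanged when products of probabilities are replaced by sums of log-probabilities.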

Recap: Independent Classification

•  Treat each word independently
   – Independent multiclass prediction per word

P(y|x)      x=“I”    x=“fish”    x=“often”
y=“Det”     0.0      0.0         0.0
y=“Noun”    1.0      0.75        0.0
y=“Verb”    0.0      0.25        0.0
y=“Adj”     0.0      0.0         0.4
y=“Adv”     0.0      0.0         0.6
y=“Prep”    0.0      0.0         0.0

Prediction: (N, N, Adv)

Correct: (N, V, Adv)

Mistake due to not modeling multiple words.

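Recap: Viterbi

•  Models pairwise transitions between states
   – Pairwise transitions between POS Tags
   – “1st order” model

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

x = “I fish often”
Independent: (N, N, Adv)

HMM Viterbi: (N, V, Adv)
*Assuming we defined P(x,y) properly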

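Training HMMs

Supervised Training

•  Given:

$$S = \{(x_i, y_i)\}_{i=1}^{N}$$

   where each $x_i$ is a Word Sequence (Sentence) and each $y_i$ is a POS Tag Sequence

•  Goal: Estimate P(x,y) using S

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

•  Maximum Likelihood!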

Aside: Matrix Formulation

•  Define Transition Matrix: A
   – Aab = P(yi+1 = a | yi = b) or –Log(P(yi+1 = a | yi = b))

P(ynext|y)       y=“Noun”    y=“Verb”
ynext=“Noun”     0.09        0.667
ynext=“Verb”     0.91        0.333

•  Observation Matrix: O
   – Owz = P(xi = w | yi = z) or –Log(P(xi = w | yi = z))

P(x|y)       y=“Noun”    y=“Verb”
x=“fish”     0.8         0.5
x=“sleep”    0.2         0.5

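Aside: Matrix Formulation

$$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End}, y_M} \prod_{i=1}^{M} A_{y_i, y_{i-1}} \prod_{i=1}^{M} O_{x_i, y_i}$$

Log prob. formulation (each entry of $\tilde{A}$ is defined as $-\log(A)$, and likewise for $\tilde{O}$):

$$-\log P(x, y) = \tilde{A}_{\text{End}, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i, y_i}$$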
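As a check on the log-probability formulation, the sketch below sums the −log entries for x = “fish sleep”, y = (noun, verb). It uses the Simple POS HMM values (the variant with an end state) rather than the end-free transition table on the previous slide. The dictionaries keyed by (row, column) pairs and the function name `neg_log_joint` are assumptions for illustration.

```python
import math

# A[(a, b)] = P(y_{i+1} = a | y_i = b) for the Simple POS HMM with an end state.
A = {("noun", "start"): 0.8, ("verb", "start"): 0.2,
     ("noun", "noun"): 0.1, ("verb", "noun"): 0.8, ("end", "noun"): 0.1,
     ("noun", "verb"): 0.2, ("verb", "verb"): 0.1, ("end", "verb"): 0.7}
# O[(w, z)] = P(x_i = w | y_i = z).
O = {("fish", "noun"): 0.8, ("sleep", "noun"): 0.2,
     ("fish", "verb"): 0.5, ("sleep", "verb"): 0.5}

def neg_log_joint(words, tags):
    """-log P(x, y) as a sum of -log A and -log O entries."""
    path = ["start"] + list(tags)
    total = -math.log(A[("end", tags[-1])])             # End term
    for i, word in enumerate(words):
        total += -math.log(A[(path[i + 1], path[i])])   # transition y_{i-1} -> y_i
        total += -math.log(O[(word, tags[i])])          # emission of x_i from y_i
    return total

nll = neg_log_joint(["fish", "sleep"], ["noun", "verb"])
print(nll, math.exp(-nll))  # exp(-nll) is about 0.1792
```

Exponentiating the negated sum recovers the product form: 0.7 · 0.8 · 0.8 · 0.8 · 0.5 ≈ 0.1792.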

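Maximum Likelihood

$$\arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

•  Estimate each component separately:

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} \mathbf{1}\big[(y_j^{i+1} = a) \wedge (y_j^{i} = b)\big]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} \mathbf{1}\big[y_j^{i} = b\big]} \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} \mathbf{1}\big[(x_j^{i} = w) \wedge (y_j^{i} = z)\big]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} \mathbf{1}\big[y_j^{i} = z\big]}$$

•  (Derived via minimizing neg. log likelihood)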
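The counting estimates can be sketched as follows. The toy training set `S` is hypothetical and not taken from the lecture, and the function name `estimate_hmm` is illustrative. Padding each tag sequence with Start and End is an assumption that corresponds to the sum running over i = 0..M_j.

```python
from collections import Counter

def estimate_hmm(dataset):
    """Maximum-likelihood A and O from (words, tags) pairs, by counting."""
    transition_counts, emission_counts = Counter(), Counter()
    prev_totals, tag_totals = Counter(), Counter()
    for words, tags in dataset:
        path = ["Start"] + list(tags) + ["End"]
        for prev, nxt in zip(path, path[1:]):
            transition_counts[(nxt, prev)] += 1   # numerator of A[nxt, prev]
            prev_totals[prev] += 1                # denominator: occurrences of prev
        for word, tag in zip(words, tags):
            emission_counts[(word, tag)] += 1     # numerator of O[word, tag]
            tag_totals[tag] += 1                  # denominator: occurrences of tag
    A = {(a, b): c / prev_totals[b] for (a, b), c in transition_counts.items()}
    O = {(w, z): c / tag_totals[z] for (w, z), c in emission_counts.items()}
    return A, O

# Hypothetical toy training set.
S = [
    (["fish", "sleep"], ["Noun", "Verb"]),
    (["fish", "fish"], ["Noun", "Verb"]),
    (["sleep"], ["Verb"]),
]
A, O = estimate_hmm(S)
print(A[("Noun", "Start")], O[("sleep", "Verb")])
```

Pairs that never occur in the data are simply absent from the dictionaries, which corresponds to an estimated probability of zero.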

Recap: Supervised Training

$$\arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

•  Maximum Likelihood Training
   – Counting statistics
   – Super easy!
   – Why?

•  What about unsupervised case?

Conditional Independence Assumptions

$$\arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$$

•  Everything decomposes to products of pairs
   – I.e., P(yi+1 = a | yi = b) doesn’t depend on anything else

•  Can just estimate frequencies:
   – How often yi+1 = a when yi = b over training set
   – Note that P(yi+1 = a | yi = b) is a common model across all locations of all sequences.

# Parameters:
Transitions A: (#Tags)$^2$
Observations O: #Words x #Tags

Avoids directly modeling word/word pairings

#Tags = 10s
#Words = 10000s

Unsupervised Training

•  What about if no y’s?
   – Just a training set of sentences

$$S = \{x_i\}_{i=1}^{N}$$

•  Still want to estimate P(x,y)
   – How?
   – Why?

$$\arg\max \prod_i P(x_i) = \arg\max \prod_i \sum_y P(x_i, y)$$

Why Unsupervised Training?

•  Supervised Data hard to acquire
   – Require annotating POS tags

•  Unsupervised Data plentiful
   – Just grab some text!

•  Might just work for POS Tagging!
   – Learn y’s that correspond to POS Tags

•  Can be used for other tasks
   – Detect outlier sentences (sentences with low prob.)
   – Sampling new sentences.

EM Algorithm (Baum-Welch)

•  If we had y’s → max likelihood.
•  If we had (A,O) → predict y’s

Chicken vs Egg!

1.  Initialize A and O arbitrarily

2.  Predict prob. of y’s for each training x (Expectation Step)

3.  Use y’s to estimate new (A,O) (Maximization Step)

4.  Repeat back to Step 2 until convergence

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Expectation Step

•  Given (A,O)
•  For training x = (x1,…,xM)
   – Predict P(yi) for each y = (y1,…,yM)
   – Encodes current model’s beliefs about y
   – “Marginal Distribution” of each yi

               x1     x2     …    xM
P(yi=Noun)     0.5    0.4    …    0.05
P(yi=Det)      0.4    0.6    …    0.25
P(yi=Verb)     0.1    0.0    …    0.7

Recall: Matrix Formulation

•  Define Transition Matrix: A
   – Aab = P(yi+1 = a | yi = b) or –Log(P(yi+1 = a | yi = b))

P(ynext|y)       y=“Noun”    y=“Verb”
ynext=“Noun”     0.09        0.667
ynext=“Verb”     0.91        0.333

•  Observation Matrix: O
   – Owz = P(xi = w | yi = z) or –Log(P(xi = w | yi = z))

P(x|y)       y=“Noun”    y=“Verb”
x=“fish”     0.8         0.5
x=“sleep”    0.2         0.5

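Maximization Step

•  Max. Likelihood over Marginal Distribution

Supervised:

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} \mathbf{1}\big[(y_j^{i+1} = a) \wedge (y_j^{i} = b)\big]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} \mathbf{1}\big[y_j^{i} = b\big]} \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} \mathbf{1}\big[(x_j^{i} = w) \wedge (y_j^{i} = z)\big]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} \mathbf{1}\big[y_j^{i} = z\big]}$$

Unsupervised (the indicator counts are replaced by Marginals):

$$A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^{i} = b, y_j^{i+1} = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^{i} = b)} \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} \mathbf{1}\big[x_j^{i} = w\big] P(y_j^{i} = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y_j^{i} = z)}$$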
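To illustrate the unsupervised update for O, the sketch below replaces hard counts with marginal probabilities. The marginals used here are made-up numbers for a single two-word sentence, and `update_emissions` is a hypothetical helper name. In practice the marginals would come from the Forward-Backward algorithm.

```python
def update_emissions(sentences, marginals, tags):
    """O[w, z] = sum_i 1[x_i = w] P(y_i = z) / sum_i P(y_i = z), pooled over sentences."""
    numer = {}
    denom = {z: 0.0 for z in tags}
    for words, gamma in zip(sentences, marginals):
        for word, dist in zip(words, gamma):      # dist[z] = P(y_i = z | x)
            for z in tags:
                numer[(word, z)] = numer.get((word, z), 0.0) + dist[z]
                denom[z] += dist[z]
    return {(w, z): v / denom[z] for (w, z), v in numer.items() if denom[z] > 0}

# Hypothetical marginals for one two-word sentence (illustrative numbers only).
tags = ["Noun", "Verb"]
sentences = [["fish", "sleep"]]
marginals = [[{"Noun": 0.9, "Verb": 0.1}, {"Noun": 0.2, "Verb": 0.8}]]
O_new = update_emissions(sentences, marginals, tags)
print(O_new[("fish", "Noun")], O_new[("sleep", "Noun")])
```

Each position contributes a fractional count to every tag, so for a fixed tag the updated emission probabilities still sum to one over the words.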

Computing Marginals (Forward-Backward Algorithm)

•  Solving the E-Step requires computing marginals

               x1     x2     …    xM
P(yi=Noun)     0.5    0.4    …    0.05
P(yi=Det)      0.4    0.6    …    0.25
P(yi=Verb)     0.1    0.0    …    0.7

•  Can solve using Dynamic Programming!
   – Similar to Viterbi

Notation

$$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$$

Probability of observing prefix x1:i and having the i-th state be yi = z

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$$

Probability of observing suffix xi+1:M given the i-th state being yi = z

Computing Marginals = Combining the Two Terms

$$P(y_i = z \mid x) = \frac{\alpha_z(i) \beta_z(i)}{\sum_{z'} \alpha_{z'}(i) \beta_{z'}(i)}$$

$$P(y_i = b, y_{i-1} = a \mid x) = \frac{\alpha_a(i-1) \, P(y_i = b \mid y_{i-1} = a) \, P(x_i \mid y_i = b) \, \beta_b(i)}{\sum_{a', b'} \alpha_{a'}(i-1) \, P(y_i = b' \mid y_{i-1} = a') \, P(x_i \mid y_i = b') \, \beta_{b'}(i)}$$

Forward (sub-)Algorithm

•  Solve for every:

$$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$$

•  Naively:

$$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) = \sum_{y_{1:i-1}} P(x_{1:i}, y_i = z, y_{1:i-1} \mid A, O)$$

   Exponential Time!

•  Can be computed recursively (like Viterbi)

$$\alpha_z(1) = P(y_1 = z \mid y_0) P(x_1 \mid y_1 = z) = O_{x_1, z} A_{z, \text{start}}$$

$$\alpha_z(i+1) = O_{x_{i+1}, z} \sum_{j=1}^{L} \alpha_j(i) A_{z, j}$$

Viterbi effectively replaces sum with max
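A short sketch of the forward recursion, using the Simple POS HMM values from the numerical example. The nested-dictionary layout of `A` and `O` and the 0-based list of alphas are implementation choices assumed here, not taken from the lecture.

```python
TAGS = ["noun", "verb"]
# A[z][j] = P(y_{i+1} = z | y_i = j); O[w][z] = P(x_i = w | y_i = z).
A = {
    "noun": {"start": 0.8, "noun": 0.1, "verb": 0.2},
    "verb": {"start": 0.2, "noun": 0.8, "verb": 0.1},
}
O = {
    "fish":  {"noun": 0.8, "verb": 0.5},
    "sleep": {"noun": 0.2, "verb": 0.5},
}

def forward(words):
    """Return a list whose entry i (0-based) maps z to alpha_z(i + 1)."""
    # Base case: alpha_z(1) = O[x_1, z] * A[z, start].
    alphas = [{z: O[words[0]][z] * A[z]["start"] for z in TAGS}]
    for word in words[1:]:
        prev = alphas[-1]
        # Recursion: alpha_z(i+1) = O[x_{i+1}, z] * sum_j alpha_j(i) * A[z, j].
        alphas.append({
            z: O[word][z] * sum(prev[j] * A[z][j] for j in TAGS)
            for z in TAGS
        })
    return alphas

alphas = forward(["fish", "sleep"])
print(alphas[-1])  # noun is about 0.0168, verb is about 0.261
```

Compared with the Viterbi table for the same sentence, the verb entry is 0.261 rather than 0.256 because both predecessor paths are summed instead of keeping only the larger one.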

Backward (sub-)Algorithm

•  Solve for every:

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$$

•  Naively:

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) = \sum_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} \mid y_i = z, A, O)$$

   Exponential Time!

•  Can be computed recursively (like Viterbi)

$$\beta_z(M) = 1$$

$$\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1) A_{j, z} O_{x_{i+1}, j}$$

Forward-Backward Algorithm

•  Runs Forward

$$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$$

•  Runs Backward

$$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$$

•  For each training x = (x1,…,xM)
   – Computes each P(yi) for y = (y1,…,yM)

$$P(y_i = z \mid x) = \frac{\alpha_z(i) \beta_z(i)}{\sum_{z'} \alpha_{z'}(i) \beta_{z'}(i)}$$
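Putting the two passes together, the following sketch computes the per-position marginals for a short sentence. It uses the lecture's transition table without an end state (0.09, 0.91, 0.667, 0.333) so that β_z(M) = 1 applies directly. The function names and the 0-based indexing are assumptions of this sketch.

```python
TAGS = ["noun", "verb"]
# A[a][b] = P(y_{i+1} = a | y_i = b), from the lecture's transition table.
A = {
    "noun": {"start": 0.8, "noun": 0.09, "verb": 0.667},
    "verb": {"start": 0.2, "noun": 0.91, "verb": 0.333},
}
# O[w][z] = P(x_i = w | y_i = z).
O = {
    "fish":  {"noun": 0.8, "verb": 0.5},
    "sleep": {"noun": 0.2, "verb": 0.5},
}

def forward(words):
    """alphas[i][z] = alpha_z(i + 1), with a 0-based list index i."""
    alphas = [{z: O[words[0]][z] * A[z]["start"] for z in TAGS}]
    for word in words[1:]:
        prev = alphas[-1]
        alphas.append({z: O[word][z] * sum(prev[j] * A[z][j] for j in TAGS)
                       for z in TAGS})
    return alphas

def backward(words):
    """betas[i][z] = beta_z(i + 1), with a 0-based list index i."""
    betas = [{z: 1.0 for z in TAGS}]              # beta_z(M) = 1
    for word in reversed(words[1:]):              # word plays the role of x_{i+1}
        nxt = betas[0]
        betas.insert(0, {z: sum(nxt[j] * A[j][z] * O[word][j] for j in TAGS)
                         for z in TAGS})
    return betas

def marginals(words):
    """P(y_i = z | x) for every position i, by normalizing alpha * beta."""
    result = []
    for alpha, beta in zip(forward(words), backward(words)):
        unnorm = {z: alpha[z] * beta[z] for z in TAGS}
        total = sum(unnorm.values())
        result.append({z: unnorm[z] / total for z in TAGS})
    return result

for position, dist in enumerate(marginals(["fish", "sleep"]), start=1):
    print(position, dist)
```

A useful sanity check is that the normalizer, the sum over z of α_z(i) β_z(i), is the same at every position, since it equals the probability of the whole sentence under this model.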

Recap: Unsupervised Training

•  Train using only word sequences:

$$S = \{x_i\}_{i=1}^{N}$$

•  y’s are “hidden states”
   – All pairwise transitions are through y’s
   – Hence hidden Markov Model

•  Train using EM algorithm
   – Converge to local optimum
Initialization

•  How to choose #hidden states?
   – By hand
   – Cross Validation
      •  P(x) on validation data
      •  Can compute P(x) via forward algorithm:

$$P(x) = \sum_{y} P(x, y) = \sum_{z} \alpha_z(M) \, P(\text{End} \mid y_M = z)$$

Recap: Sequence Prediction & HMMs

•  Models pairwise dependences in sequences

Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)

•  Compact: only model pairwise between y’s
•  Main Limitation: Lots of independence assumptions
   – Poor predictive accuracy

Next Week

•  Deep Generative Models
   – (Recent Applications Lecture)

•  More on Unsupervised Learning

•  Recitation TUESDAY (7pm):
   – Recap of Viterbi and Forward/Backward
