machine learning & data mining - yisong yue · machine learning & data mining cs/cns/ee 155...
TRANSCRIPT
Announcements
• Updated Kaggle Report Due Date:
  – 9pm on Monday Feb 13th
• Next Recitation: Tuesday Feb 14th
  – 7pm-8pm
  – Recap of concepts covered today
Sequence Prediction (POS Tagging)
• x = “Fish Sleep”
  y = (N, V)
• x = “The Dog Ate My Homework”
  y = (D, N, V, D, N)
• x = “The Fox Jumped Over The Fence”
  y = (D, N, V, P, D, N)
Challenges
• Multivariable Output
  – Make multiple predictions simultaneously
• Variable Length Input/Output
  – Sentence lengths not fixed
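Multivariate Outputs
• x = “Fish Sleep”
• y = (N, V)
• Multiclass prediction:
  – How many classes?

POS Tags: Det, Noun, Verb, Adj, Adv, Prep

Replicate Weights:
$$ w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix} $$

Score All Classes:
$$ f(x \mid w, b) = \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix} $$

Predict via Largest Score:
$$ \arg\max_k \begin{bmatrix} w_1^T x - b_1 \\ w_2^T x - b_2 \\ \vdots \\ w_K^T x - b_K \end{bmatrix} $$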
Multiclass Prediction
• x = “Fish Sleep”
• y = (N, V)
• Multiclass prediction:
  – Treat all possible length-M sequences as different classes
  – (D,D), (D,N), (D,V), (D,Adj), (D,Adv), (D,Pr), (N,D), (N,N), (N,V), (N,Adj), (N,Adv), …
• $L^M$ classes!
  – Length 2: $6^2 = 36$!

POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)

Exponential explosion in #Classes! (Not tractable for sequence prediction)
Why is Naïve Multiclass Intractable?
x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Assume pronouns are nouns for simplicity.
– (D,D,D), (D,D,N), (D,D,V), (D,D,Adj), (D,D,Adv), (D,D,Pr)
– (D,N,D), (D,N,N), (D,N,V), (D,N,Adj), (D,N,Adv), (D,N,Pr)
– (D,V,D), (D,V,N), (D,V,V), (D,V,Adj), (D,V,Adv), (D,V,Pr)
– …
– (N,D,D), (N,D,N), (N,D,V), (N,D,Adj), (N,D,Adv), (N,D,Pr)
– (N,N,D), (N,N,N), (N,N,V), (N,N,Adj), (N,N,Adv), (N,N,Pr)
– …
Treats every combination as a different class (learns a model for each combination).
Exponentially large representation! (Exponential time to consider every class, exponential storage.)
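To make the blow-up concrete, here is a minimal sketch that counts the classes the naïve approach would need. The helper name `num_sequence_classes` is made up for illustration; the tag set is the one from the slide.

```python
from itertools import product

TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]  # L = 6 tags from the slide

def num_sequence_classes(num_tags: int, length: int) -> int:
    """Number of distinct tag sequences of the given length: L^M."""
    return num_tags ** length

# Enumerating explicitly is only feasible for very short sequences.
pairs = list(product(TAGS, repeat=2))
print(len(pairs))  # 36, matching 6^2 on the slide

for m in (2, 3, 10, 20):
    print(m, num_sequence_classes(len(TAGS), m))
```

Already at length 20 the count is 6^20, which is why enumerating one class per tag sequence is not an option.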
Independent Classification
x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Assume pronouns are nouns for simplicity.
• Treat each word independently (assumption)
  – Independent multiclass prediction per word
  – Predict for x = “I” independently
  – Predict for x = “fish” independently
  – Predict for x = “often” independently
  – Concatenate predictions.
• #Classes = #POS Tags (6 in our example)
• Solvable using standard multiclass prediction.
Independent Classification
• Treat each word independently
  – Independent multiclass prediction per word

x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Assume pronouns are nouns for simplicity.

P(y|x)       x=“I”   x=“fish”   x=“often”
y=“Det”      0.0     0.0        0.0
y=“Noun”     1.0     0.75       0.0
y=“Verb”     0.0     0.25       0.0
y=“Adj”      0.0     0.0        0.4
y=“Adv”      0.0     0.0        0.6
y=“Prep”     0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Why the mistake?
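The per-word prediction above can be reproduced with a short sketch. The probabilities are the ones in the table; the dictionary layout and the function name `predict_independently` are assumptions made for illustration.

```python
TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]

# P(y | x) per word, in the order of TAGS (values from the slide's table).
P_TAG_GIVEN_WORD = {
    "I":     [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "fish":  [0.0, 0.75, 0.25, 0.0, 0.0, 0.0],
    "often": [0.0, 0.0, 0.0, 0.4, 0.6, 0.0],
}

def predict_independently(words):
    """Pick the most likely tag for each word, ignoring its neighbours."""
    prediction = []
    for word in words:
        probs = P_TAG_GIVEN_WORD[word]
        best = max(range(len(TAGS)), key=lambda k: probs[k])
        prediction.append(TAGS[best])
    return prediction

print(predict_independently(["I", "fish", "often"]))  # ['Noun', 'Noun', 'Adv']
```

“fish” is tagged Noun because 0.75 > 0.25 in isolation; nothing in this model can use the fact that it follows “I”.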
Context Between Words
x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Assume pronouns are nouns for simplicity.
• Independent predictions ignore word pairs
  – In isolation:
    • “Fish” is more likely to be a Noun
  – But conditioned on following a (pro)noun…
    • “Fish” is more likely to be a Verb!
  – “1st Order” dependence (model all pairs)
    • 2nd order considers all triplets
    • Arbitrary order = exponential size (naïve multiclass)
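1st Order Hidden Markov Model
• $x = (x_1, x_2, x_3, x_4, \ldots, x_M)$ (sequence of words)
• $y = (y_1, y_2, y_3, y_4, \ldots, y_M)$ (sequence of POS tags)

• $P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
• $P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
• $P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
• $P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state
  – Not always used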
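Graphical Model Representation
[Figure: chain Y0 → Y1 → Y2 → … → YM → YEnd, with each Yi emitting Xi. The YEnd node is optional.]
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
(The factor $P(\text{End} \mid y_M)$ is optional.)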
1st Order Hidden Markov Model
“Joint Distribution”:
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
(The factor $P(\text{End} \mid y_M)$ is optional.)
1st Order Hidden Markov Model
“Conditional distribution on x given y”:
$$ P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i) $$
Given a POS tag sequence y, we can compute each $P(x_i \mid y)$ independently! ($x_i$ is conditionally independent given $y_i$.)
1st Order Hidden Markov Model
• Models all state-observation pairs (all tag-word pairs)
  – Same complexity as independent multiclass
• Models all state-state pairs (all POS tag-tag pairs)
  – Additional complexity of (#POS Tags)^2
Relationship to Naïve Bayes
[Figure: the chain graphical model Y0 → Y1 → Y2 → … → YM → YEnd, with each Yi emitting Xi. The YEnd node is optional.]
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
Reduces to a sequence of disjoint Naïve Bayes models (if we ignore transition probabilities).
P(word | state/tag)
• Two-word language: “fish” and “sleep”
• Two-tag language: “Noun” and “Verb”

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

Given tag sequence y:
P(“fish sleep” | (N,V)) = 0.8 * 0.5
P(“fish fish” | (N,V)) = 0.8 * 0.5
P(“sleep fish” | (V,V)) = 0.5 * 0.5
P(“sleep sleep” | (N,N)) = 0.2 * 0.2

Slides borrowed from Ralph Grishman
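As a quick check of the products above, a minimal sketch that multiplies the per-word emission probabilities. The dictionary `EMISSION` and the function name are illustrative; the numbers are from the P(x|y) table.

```python
# P(x | y) from the two-word, two-tag example, keyed by (word, tag).
EMISSION = {
    ("fish", "Noun"): 0.8, ("sleep", "Noun"): 0.2,
    ("fish", "Verb"): 0.5, ("sleep", "Verb"): 0.5,
}

def prob_words_given_tags(words, tags):
    """P(x | y) = product over positions of P(x_i | y_i)."""
    prob = 1.0
    for word, tag in zip(words, tags):
        prob *= EMISSION[(word, tag)]
    return prob

print(prob_words_given_tags(["fish", "sleep"], ["Noun", "Verb"]))   # 0.8 * 0.5
print(prob_words_given_tags(["sleep", "fish"], ["Verb", "Verb"]))   # 0.5 * 0.5
print(prob_words_given_tags(["sleep", "sleep"], ["Noun", "Noun"]))  # 0.2 * 0.2
```

Each word contributes one factor that depends only on its own tag, which is the conditional independence assumption of the model.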
Sampling
• HMMs are “generative” models
  – Model the joint distribution P(x,y)
  – Can generate samples from this distribution
  – First consider the conditional distribution P(x|y)
  – What about sampling from P(x,y)?

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

Given tag sequence y = (N,V), sample each word independently:
  Sample P(x1|N)   (0.8 Fish, 0.2 Sleep)
  Sample P(x2|V)   (0.5 Fish, 0.5 Sleep)
Forward Sampling of P(y,x)
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$

A Simple POS HMM (slides borrowed from Ralph Grishman). Transition probabilities from the state diagram:

P(ynext|y)     y=Start   y=Noun   y=Verb
ynext=Noun     0.8       0.1      0.2
ynext=Verb     0.2       0.8      0.1
ynext=End      -         0.1      0.7

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

Initialize y0 = Start
Initialize i = 0
1. i = i + 1
2. Sample yi from P(yi|yi-1)
3. If yi == End: Quit
4. Sample xi from P(xi|yi)
5. Go to Step 1

Exploits conditional independence. Requires P(End|yi).
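The five steps above can be sketched directly in code. This is an illustrative sketch, not the lecture's implementation: the transition numbers are read off the state diagram as used in the numerical example later in the deck (start→noun 0.8, start→verb 0.2, noun→noun 0.1, noun→verb 0.8, noun→end 0.1, verb→noun 0.2, verb→verb 0.1, verb→end 0.7), and the helper names are made up.

```python
import random

# TRANSITION[current][next] = P(next | current); assumed reading of the diagram.
TRANSITION = {
    "Start": {"Noun": 0.8, "Verb": 0.2},
    "Noun":  {"Noun": 0.1, "Verb": 0.8, "End": 0.1},
    "Verb":  {"Noun": 0.2, "Verb": 0.1, "End": 0.7},
}
# EMISSION[tag][word] = P(word | tag), from the P(x|y) table.
EMISSION = {
    "Noun": {"fish": 0.8, "sleep": 0.2},
    "Verb": {"fish": 0.5, "sleep": 0.5},
}

def sample_from(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def forward_sample(rng):
    """Sample (words, tags) from P(x, y) by walking the chain until End."""
    words, tags = [], []
    state = "Start"
    while True:
        state = sample_from(TRANSITION[state], rng)   # Step 2: sample y_i
        if state == "End":                            # Step 3: stop at End
            return words, tags
        tags.append(state)
        words.append(sample_from(EMISSION[state], rng))  # Step 4: sample x_i

rng = random.Random(0)  # fixed seed only so that runs are repeatable
for _ in range(3):
    print(forward_sample(rng))
```

Each draw only looks at the previous state, so a sentence of any length is generated in time linear in its length.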
Forward Sampling of P(y,x|M)
$$ P(x, y \mid M) = \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$

A Simple POS HMM without an End state (slides borrowed from Ralph Grishman):

P(ynext|y)     y=Start   y=Noun   y=Verb
ynext=Noun     0.8       0.09     0.667
ynext=Verb     0.2       0.91     0.333

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

Initialize y0 = Start
Initialize i = 0
1. i = i + 1
2. If i > M: Quit
3. Sample yi from P(yi|yi-1)
4. Sample xi from P(xi|yi)
5. Go to Step 1

Exploits conditional independence. Assumes no P(End|yi).
1st Order Hidden Markov Model
$$ P(x_{k+1:M}, y_{k+1:M} \mid x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} \mid y_k) $$
“Memory-less Model”: only needs $y_k$ to model the rest of the sequence.
Most Common Prediction Problem
• Given an input sentence, predict the POS tag sequence:
$$ \arg\max_y P(y \mid x) $$
• Naïve approach:
  – Try all possible y’s
  – Choose the one with the highest probability
  – Exponential time: $L^M$ possible y’s
Bayes’s Rule
$$ \arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y P(x \mid y) P(y) $$
where
$$ P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i) $$
$$ P(y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) $$
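$$ \arg\max_y P(y, x) = \arg\max_y \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
$$ = \arg\max_{y_M} \arg\max_{y_{1:M-1}} \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
$$ = \arg\max_{y_M} \arg\max_{y_{1:M-1}} P(y_M \mid y_{M-1}) P(x_M \mid y_M) P(y_{1:M-1}, x_{1:M-1}) $$
using
$$ P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i) $$
$$ P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1}) $$
$$ P(y_{1:k}, x_{1:k}) = P(x_{1:k} \mid y_{1:k}) P(y_{1:k}) $$
Exploit the memory-less property: the choice of $y_M$ only depends on $y_{1:M-1}$ via $P(y_M \mid y_{M-1})$!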
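Dynamic Programming
• Input: $x = (x_1, x_2, x_3, \ldots, x_M)$
• Computed: best length-k prefix ending in each tag:
  – Examples:
$$ \hat{Y}^k(V) = \left( \arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus V, x_{1:k}) \right) \oplus V $$
$$ \hat{Y}^k(N) = \left( \arg\max_{y_{1:k-1}} P(y_{1:k-1} \oplus N, x_{1:k}) \right) \oplus N $$
  ($\oplus$ denotes sequence concatenation.)
• Claim:
$$ \hat{Y}^{k+1}(V) = \left( \arg\max_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} P(y_{1:k} \oplus V, x_{1:k+1}) \right) \oplus V $$
$$ = \left( \arg\max_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} P(y_{1:k}, x_{1:k}) P(y_{k+1} = V \mid y_k) P(x_{k+1} \mid y_{k+1} = V) \right) \oplus V $$
  – $P(y_{1:k}, x_{1:k})$ is pre-computed, so this is a recursive definition!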
Solve:
$$ \hat{Y}^2(V) = \left( \arg\max_{y_1 \in \{\hat{Y}^1(T)\}_T} P(y_1, x_1) P(y_2 = V \mid y_1) P(x_2 \mid y_2 = V) \right) \oplus V $$
[Figure: trellis with a column Ŷ1(V), Ŷ1(D), Ŷ1(N) and a column Ŷ2(V), Ŷ2(D), Ŷ2(N). Candidate edges y1 = V, y1 = D, y1 = N lead into Ŷ2(V); the edge y1 = N is selected.]
• Ŷ1(Z) is just Z
• Store each Ŷ1(Z) & P(Ŷ1(Z), x1)
• Ex: Ŷ2(V) = (N, V)
Solve:
$$ \hat{Y}^3(V) = \left( \arg\max_{y_{1:2} \in \{\hat{Y}^2(T)\}_T} P(y_{1:2}, x_{1:2}) P(y_3 = V \mid y_2) P(x_3 \mid y_3 = V) \right) \oplus V $$
[Figure: trellis extended with a column Ŷ3(V), Ŷ3(D), Ŷ3(N). Candidate edges y2 = V, y2 = D, y2 = N lead into Ŷ3(V).]
• Store each Ŷ1(Z) & P(Ŷ1(Z), x1)
• Store each Ŷ2(Z) & P(Ŷ2(Z), x1:2)
• Ex: Ŷ2(V) = (N, V)

Claim: Only need to check solutions of Ŷ2(Z), Z = V, D, N
• Suppose Ŷ3(V) = (V,V,V)…
  – …then prove that (N,V,V) has higher probability, since Ŷ2(V) = (N,V) is the best length-2 prefix ending in V.
• Proof depends on the 1st order property
  – Prob. of (V,V,V) & (N,V,V) differ in 3 terms
  – P(y1|y0), P(x1|y1), P(y2|y1)
  – None of these depend on y3!
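$$ \hat{Y}^M(V) = \left( \arg\max_{y_{1:M-1} \in \{\hat{Y}^{M-1}(T)\}_T} P(y_{1:M-1}, x_{1:M-1}) P(y_M = V \mid y_{M-1}) P(x_M \mid y_M = V) P(\text{End} \mid y_M = V) \right) \oplus V $$
(The factor $P(\text{End} \mid y_M = V)$ is optional.)
[Figure: trellis with columns Ŷ1, Ŷ2, Ŷ3, …, ŶM, each with entries for V, D, N.]
• Store each Ŷ1(Z) & P(Ŷ1(Z), x1)
• Store each Ŷ2(Z) & P(Ŷ2(Z), x1:2)
  – Ex: Ŷ2(V) = (N, V)
• Store each Ŷ3(Z) & P(Ŷ3(Z), x1:3)
  – Ex: Ŷ3(V) = (D, N, V)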
Viterbi Algorithm
• Solve:
$$ \arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y P(x \mid y) P(y) $$
• For k = 1..M
  – Iteratively solve for each Ŷk(Z)
    • Z looping over every POS tag.
• Predict best ŶM(Z)
• Also known as Maximum A Posteriori (MAP) inference
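Numerical Example
x = (Fish Sleep)

A Simple POS HMM (slides borrowed from Ralph Grishman). Transition probabilities from the state diagram:

P(ynext|y)     y=Start   y=Noun   y=Verb
ynext=Noun     0.8       0.1      0.2
ynext=Verb     0.2       0.8      0.1
ynext=End      -         0.1      0.7

Initial trellis:

          0     1     2     3
start     1
verb      0
noun      0
end       0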
Token 1: fish

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

          0     1                 2     3
start     1     0
verb      0     .2 * .5 = .1
noun      0     .8 * .8 = .64
end       0     0
Token 2: sleep
(if ‘fish’ is a verb)   verb: .1 * .1 * .5 = .005     noun: .1 * .2 * .2 = .004
(if ‘fish’ is a noun)   verb: .64 * .8 * .5 = .256    noun: .64 * .1 * .2 = .0128

Take maximum, set back pointers:

          0     1      2        3
start     1     0      0
verb      0     .1     .256
noun      0     .64    .0128
end       0     0      -
Token 3: end
end: max(.256 * .7, .0128 * .1) = max(.1792, .00128) = .1792

Take maximum, set back pointers:

          0     1      2        3
start     1     0      0        0
verb      0     .1     .256     -
noun      0     .64    .0128    -
end       0     0      -        .256 * .7

Decode: fish = noun, sleep = verb
What might go wrong for long sequences?
Underflow! Small numbers get repeatedly multiplied together: exponentially small!

Viterbi Algorithm (w/ Log Probabilities)
• Solve:
$$ \arg\max_y P(y \mid x) = \arg\max_y \frac{P(y, x)}{P(x)} = \arg\max_y P(y, x) = \arg\max_y \left[ \log P(x \mid y) + \log P(y) \right] $$
• For k = 1..M
  – Iteratively solve for each Ŷk(Z), storing log P(Ŷk(Z), x1:k)
    • Z looping over every POS tag.
• Predict best ŶM(Z)
  – log P(ŶM(Z), x) accumulates additively, not multiplicatively
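A small sketch of why the log formulation matters. The per-step probability 0.01 and the length 400 are made-up values chosen only to trigger underflow in double precision.

```python
import math

probs = [0.01] * 400  # hypothetical per-step probabilities along a long sequence

product = 1.0
for p in probs:
    product *= p            # true value is 1e-800, below the smallest double

log_sum = 0.0
for p in probs:
    log_sum += math.log(p)  # accumulates additively and stays representable

print(product)  # 0.0: the product has underflowed
print(log_sum)  # a finite negative number, about 400 * log(0.01)
```

Since log is monotonic, taking the argmax over summed log probabilities selects the same sequence as the argmax over the product.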
Recap: Independent Classification
• Treat each word independently
  – Independent multiclass prediction per word

x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Assume pronouns are nouns for simplicity.

P(y|x)       x=“I”   x=“fish”   x=“often”
y=“Det”      0.0     0.0        0.0
y=“Noun”     1.0     0.75       0.0
y=“Verb”     0.0     0.25       0.0
y=“Adj”      0.0     0.0        0.4
y=“Adv”      0.0     0.0        0.6
y=“Prep”     0.0     0.0        0.0

Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Mistake due to not modeling multiple words.
Recap: Viterbi
• Models pairwise transitions between states
  – Pairwise transitions between POS tags
  – “1st order” model
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
x = “I fish often”
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)*
*Assuming we defined P(x,y) properly
Supervised Training
• Given:
$$ S = \{(x_i, y_i)\}_{i=1}^{N} $$
  – $x_i$: word sequence (sentence)
  – $y_i$: POS tag sequence
• Goal: Estimate P(x,y) using S
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
• Maximum Likelihood!
Aside: Matrix Formulation
• Define Transition Matrix: A
  – $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $-\log P(y_{i+1} = a \mid y_i = b)$
• Observation Matrix: O
  – $O_{wz} = P(x_i = w \mid y_i = z)$, or $-\log P(x_i = w \mid y_i = z)$

P(x|y)       y=“Noun”   y=“Verb”
x=“fish”     0.8        0.5
x=“sleep”    0.2        0.5

P(ynext|y)      y=“Noun”   y=“Verb”
ynext=“Noun”    0.09       0.667
ynext=“Verb”    0.91       0.333
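Aside: Matrix Formulation
$$ P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End}, y_M} \prod_{i=1}^{M} A_{y_i, y_{i-1}} \prod_{i=1}^{M} O_{x_i, y_i} $$
Log prob. formulation (each entry of $\tilde{A}$ and $\tilde{O}$ is defined as the negative log of the corresponding entry of A and O):
$$ -\log P(x, y) = \tilde{A}_{\text{End}, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i, y_i} $$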
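Maximum Likelihood
$$ \arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
• Estimate each component separately:
$$ A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[ (y_j^{i+1} = a) \wedge (y_j^{i} = b) \right]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[ y_j^{i} = b \right]} $$
$$ O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[ (x_j^{i} = w) \wedge (y_j^{i} = z) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[ y_j^{i} = z \right]} $$
• (Derived via minimizing neg. log likelihood)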
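The counting estimates can be sketched as follows. The three-sentence training set is invented purely for illustration, and `estimate_hmm` is a hypothetical helper name; keys follow the slide's index order, `A[(a, b)]` for P(next = a | current = b) and `O[(w, z)]` for P(word = w | tag = z).

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum likelihood estimates of A and O by counting."""
    trans_counts, from_counts = Counter(), Counter()
    emit_counts, tag_counts = Counter(), Counter()
    for words, tags in tagged_sentences:
        path = ["Start"] + list(tags) + ["End"]
        for prev, nxt in zip(path, path[1:]):
            trans_counts[(nxt, prev)] += 1   # numerator of A_ab
            from_counts[prev] += 1           # denominator of A_ab
        for word, tag in zip(words, tags):
            emit_counts[(word, tag)] += 1    # numerator of O_wz
            tag_counts[tag] += 1             # denominator of O_wz
    A = {key: c / from_counts[key[1]] for key, c in trans_counts.items()}
    O = {key: c / tag_counts[key[1]] for key, c in emit_counts.items()}
    return A, O

# Toy training set (made up for this sketch).
toy = [
    (["fish", "sleep"], ["Noun", "Verb"]),
    (["fish", "fish"], ["Noun", "Verb"]),
    (["sleep"], ["Verb"]),
]
A, O = estimate_hmm(toy)
print(A[("Noun", "Start")])   # 2 of the 3 sentences start with a Noun
print(O[("sleep", "Verb")])   # 2 of the 3 Verb positions emit "sleep"
```

Pairs never seen in training are simply absent here (probability zero); the slides do not discuss smoothing, so none is applied.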
Recap: Supervised Training
$$ \arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
• Maximum Likelihood Training
  – Counting statistics
  – Super easy!
  – Why?
• What about the unsupervised case?
Conditional Independence Assumptions
$$ \arg\max_{A,O} \prod_{(x,y) \in S} P(x, y) = \arg\max_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) $$
• Everything decomposes to products of pairs
  – I.e., $P(y_{i+1} = a \mid y_i = b)$ doesn’t depend on anything else
• Can just estimate frequencies:
  – How often $y_{i+1} = a$ when $y_i = b$ over the training set
  – Note that $P(y_{i+1} = a \mid y_i = b)$ is a common model across all locations of all sequences.
• # Parameters:
  – Transitions A: (#Tags)^2
  – Observations O: #Words × #Tags
  – #Tags = 10s, #Words = 10000s
• Avoids directly modeling word/word pairings
Unsupervised Training
• What about if no y’s?
  – Just a training set of sentences (word sequences):
$$ S = \{x_i\}_{i=1}^{N} $$
• Still want to estimate P(x,y)
  – How?
  – Why?
$$ \arg\max \prod_i P(x_i) = \arg\max \prod_i \sum_y P(x_i, y) $$
Why Unsupervised Training?
• Supervised data is hard to acquire
  – Requires annotating POS tags
• Unsupervised data is plentiful
  – Just grab some text!
• Might just work for POS Tagging!
  – Learn y’s that correspond to POS Tags
• Can be used for other tasks
  – Detect outlier sentences (sentences with low prob.)
  – Sampling new sentences.
EM Algorithm (Baum-Welch)
• If we had y’s ⇒ max likelihood.
• If we had (A,O) ⇒ predict y’s
Chicken vs Egg!

1. Initialize A and O arbitrarily
2. Predict prob. of y’s for each training x (Expectation Step)
3. Use y’s to estimate new (A,O) (Maximization Step)
4. Repeat from Step 2 until convergence

http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Expectation Step
• Given (A,O)
• For training x = (x1,…,xM)
  – Predict P(yi) for each y = (y1,…,yM)
  – Encodes current model’s beliefs about y
  – “Marginal Distribution” of each yi

               x1     x2     …     xM
P(yi=Noun)     0.5    0.4    …     0.05
P(yi=Det)      0.4    0.6    …     0.25
P(yi=Verb)     0.1    0.0    …     0.7
Recall: Matrix Formulation
• Define Transition Matrix: A
  – $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $-\log P(y_{i+1} = a \mid y_i = b)$
• Observation Matrix: O
  – $O_{wz} = P(x_i = w \mid y_i = z)$, or $-\log P(x_i = w \mid y_i = z)$
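Maximization Step
• Max. likelihood over marginal distribution

Supervised:
$$ A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[ (y_j^{i+1} = a) \wedge (y_j^{i} = b) \right]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[ y_j^{i} = b \right]} \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[ (x_j^{i} = w) \wedge (y_j^{i} = z) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[ y_j^{i} = z \right]} $$

Unsupervised (the indicator counts are replaced by marginals):
$$ A_{ab} = \frac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^{i} = b, y_j^{i+1} = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y_j^{i} = b)} \qquad O_{wz} = \frac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[ x_j^{i} = w \right] P(y_j^{i} = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y_j^{i} = z)} $$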
Computing Marginals (Forward-Backward Algorithm)
• Solving the E-Step requires computing marginals

               x1     x2     …     xM
P(yi=Noun)     0.5    0.4    …     0.05
P(yi=Det)      0.4    0.6    …     0.25
P(yi=Verb)     0.1    0.0    …     0.7

• Can solve using Dynamic Programming!
  – Similar to Viterbi
Notation
$$ \alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) $$
Probability of observing prefix $x_{1:i}$ and having the i-th state be $y_i = z$
$$ \beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) $$
Probability of observing suffix $x_{i+1:M}$ given the i-th state being $y_i = z$

Computing Marginals = Combining the Two Terms
$$ P(y_i = z \mid x) = \frac{\alpha_z(i) \beta_z(i)}{\sum_{z'} \alpha_{z'}(i) \beta_{z'}(i)} $$
$$ P(y_i = b, y_{i-1} = a \mid x) = \frac{\alpha_a(i-1) P(y_i = b \mid y_{i-1} = a) P(x_i \mid y_i = b) \beta_b(i)}{\sum_{a', b'} \alpha_{a'}(i-1) P(y_i = b' \mid y_{i-1} = a') P(x_i \mid y_i = b') \beta_{b'}(i)} $$
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Forward (sub-)Algorithm
• Solve for every:
$$ \alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) $$
• Naively:
$$ \alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) = \sum_{y_{1:i-1}} P(x_{1:i}, y_i = z, y_{1:i-1} \mid A, O) $$
  – Exponential time!
• Can be computed recursively (like Viterbi):
$$ \alpha_z(1) = P(y_1 = z \mid y_0) P(x_1 \mid y_1 = z) = O_{x_1, z} A_{z, \text{start}} $$
$$ \alpha_z(i+1) = O_{x_{i+1}, z} \sum_{j=1}^{L} \alpha_j(i) A_{z, j} $$
  – Viterbi effectively replaces sum with max
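A minimal sketch of the forward recursion, checked against the naive exponential-time sum. It assumes the two-tag model without an End state (the 0.09 / 0.91 / 0.667 / 0.333 transition table, with start probabilities 0.8 and 0.2 from the diagram); the dictionary layout and function names are illustrative.

```python
from itertools import product

TAGS = ["Noun", "Verb"]
# A[z][j] = P(y_{i+1} = z | y_i = j), matching the A_{z,j} index order.
A = {"Noun": {"Start": 0.8, "Noun": 0.09, "Verb": 0.667},
     "Verb": {"Start": 0.2, "Noun": 0.91, "Verb": 0.333}}
# O[w][z] = P(x_i = w | y_i = z).
O = {"fish": {"Noun": 0.8, "Verb": 0.5},
     "sleep": {"Noun": 0.2, "Verb": 0.5}}

def forward(words):
    """alpha[i][z] = P(x_{1:i+1}, y_{i+1} = z), using 0-based list positions."""
    alpha = [{z: O[words[0]][z] * A[z]["Start"] for z in TAGS}]
    for i in range(1, len(words)):
        prev = alpha[-1]
        alpha.append({z: O[words[i]][z] * sum(prev[j] * A[z][j] for j in TAGS)
                      for z in TAGS})
    return alpha

def brute_force_likelihood(words):
    """Naive P(x): sum P(x, y) over every tag sequence (exponential time)."""
    total = 0.0
    for tags in product(TAGS, repeat=len(words)):
        p, prev = 1.0, "Start"
        for word, tag in zip(words, tags):
            p *= A[tag][prev] * O[word][tag]
            prev = tag
        total += p
    return total

sentence = ["fish", "sleep", "fish"]
print(sum(forward(sentence)[-1].values()))  # P(x) via the recursion
print(brute_force_likelihood(sentence))     # same value via enumeration
```

Replacing `sum` with `max` inside `forward` gives the probability of the best prefix instead, which is the Viterbi recursion.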
Backward (sub-)Algorithm
• Solve for every:
$$ \beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) $$
• Naively:
$$ \beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) = \sum_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} \mid y_i = z, A, O) $$
  – Exponential time!
• Can be computed recursively (like Viterbi):
$$ \beta_z(M) = 1 $$
$$ \beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1) A_{j, z} O_{x_{i+1}, j} $$
Forward-Backward Algorithm
• Runs Forward:
$$ \alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) $$
• Runs Backward:
$$ \beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) $$
• For each training x = (x1,…,xM)
  – Computes each P(yi) for y = (y1,…,yM):
$$ P(y_i = z \mid x) = \frac{\alpha_z(i) \beta_z(i)}{\sum_{z'} \alpha_{z'}(i) \beta_{z'}(i)} $$
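Putting the two passes together, here is a hedged sketch of computing the per-position marginals. It again assumes the two-tag model without an End state (so β at the last position is 1); all names are illustrative.

```python
TAGS = ["Noun", "Verb"]
# A[z][j] = P(y_{i+1} = z | y_i = j); O[w][z] = P(x_i = w | y_i = z).
A = {"Noun": {"Start": 0.8, "Noun": 0.09, "Verb": 0.667},
     "Verb": {"Start": 0.2, "Noun": 0.91, "Verb": 0.333}}
O = {"fish": {"Noun": 0.8, "Verb": 0.5},
     "sleep": {"Noun": 0.2, "Verb": 0.5}}

def forward(words):
    """alpha values, one dictionary per position."""
    alpha = [{z: O[words[0]][z] * A[z]["Start"] for z in TAGS}]
    for i in range(1, len(words)):
        prev = alpha[-1]
        alpha.append({z: O[words[i]][z] * sum(prev[j] * A[z][j] for j in TAGS)
                      for z in TAGS})
    return alpha

def backward(words):
    """beta values, one dictionary per position; beta at the end is 1."""
    beta = [{z: 1.0 for z in TAGS}]
    for i in range(len(words) - 2, -1, -1):
        nxt = beta[0]  # beta at position i + 1
        beta.insert(0, {z: sum(nxt[j] * A[j][z] * O[words[i + 1]][j] for j in TAGS)
                        for z in TAGS})
    return beta

def marginals(words):
    """P(y_i = z | x) for every position i, by normalising alpha * beta."""
    result = []
    for a, b in zip(forward(words), backward(words)):
        total = sum(a[z] * b[z] for z in TAGS)  # equals P(x) at every position
        result.append({z: a[z] * b[z] / total for z in TAGS})
    return result

for position, dist in enumerate(marginals(["fish", "sleep", "fish"]), start=1):
    print(position, dist)
```

These marginals are what the E-step feeds into the unsupervised estimates of O; the pairwise marginals needed for A combine α at position i−1 with β at position i in the same way.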
Recap: Unsupervised Training
• Train using only word sequences (sentences):
$$ S = \{x_i\}_{i=1}^{N} $$
• y’s are “hidden states”
  – All pairwise transitions are through y’s
  – Hence hidden Markov Model
• Train using EM algorithm
  – Converge to local optimum
Initialization
• How to choose # hidden states?
  – By hand
  – Cross Validation
    • P(x) on validation data
    • Can compute P(x) via forward algorithm:
$$ P(x) = \sum_y P(x, y) = \sum_z \alpha_z(M) P(\text{End} \mid y_M = z) $$
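A short sketch of this likelihood computation on the HMM with an End state from the numerical example. Parameter values are the ones used in that example's trellis; the names `START`, `TRANS`, `END`, `EMIT`, and `sentence_likelihood` are assumptions for illustration.

```python
TAGS = ["Noun", "Verb"]
START = {"Noun": 0.8, "Verb": 0.2}              # P(y_1 | Start)
TRANS = {"Noun": {"Noun": 0.1, "Verb": 0.8},    # TRANS[prev][next]
         "Verb": {"Noun": 0.2, "Verb": 0.1}}
END = {"Noun": 0.1, "Verb": 0.7}                # P(End | y_M)
EMIT = {"Noun": {"fish": 0.8, "sleep": 0.2},    # EMIT[tag][word]
        "Verb": {"fish": 0.5, "sleep": 0.5}}

def sentence_likelihood(words):
    """P(x) = sum_z alpha_z(M) * P(End | y_M = z), via the forward recursion."""
    alpha = {z: START[z] * EMIT[z][words[0]] for z in TAGS}
    for word in words[1:]:
        alpha = {z: EMIT[z][word] * sum(alpha[j] * TRANS[j][z] for j in TAGS)
                 for z in TAGS}
    return sum(alpha[z] * END[z] for z in TAGS)

print(sentence_likelihood(["fish", "sleep"]))  # approximately 0.18438
```

The value is the sum of the four tag-sequence probabilities (0.1792 + 0.0035 + 0.00128 + 0.0004). On validation data, a higher P(x) suggests a better-fitting number of hidden states.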
Recap: Sequence Prediction & HMMs
• Models pairwise dependences in sequences
  – x = “I fish often”    POS Tags: Det, Noun, Verb, Adj, Adv, Prep
  – Independent: (N, N, Adv)
  – HMM Viterbi: (N, V, Adv)
• Compact: only models pairwise dependences between y’s
• Main Limitation: Lots of independence assumptions
  – Poor predictive accuracy