Learning Weighted Finite-State Transducers
Background
• This lecture is based on a paper by Jason Eisner at ACL 2002, "Parameter Estimation for Probabilistic Finite-State Transducers."
– This is perhaps the most under-appreciated paper in the past ten years of NLP.
• Full disclosure: Jason Eisner was my Ph.D. advisor.
– He's one of the smartest people I have ever met.
Finite-State Automaton
Notation             Definition
Q                    finite set of states
Σ                    finite vocabulary
q0 ∈ Q               start state
F ⊆ Q                set of final states
δ: Q ⨉ Σ* → 2^Q      transition function; possible next states given current state and input symbol(s)
Finite-State Automaton (Maybe Better?)
Notation             Definition
Q                    finite set of states
Σ                    finite vocabulary
q0 ∈ Q               start state
F ⊆ Q                set of final states
A ⊆ Q ⨉ Σ* ⨉ Q       set of transitions (source state, symbol sequence, target state)
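A minimal sketch of this arc-set representation in Python (the FSA class, its field names, and the accepts method are illustrative, not from the lecture):

    # A finite-state automaton in the arc-set style above:
    # arcs are (source state, symbol sequence, target state) triples.
    class FSA:
        def __init__(self, arcs, start, finals):
            self.arcs = arcs        # set of (q, w, r), w a string over Σ
            self.start = start      # q0
            self.finals = finals    # F

        def accepts(self, s):
            """Nondeterministic search: does some path from q0 to F spell s?"""
            agenda = [(self.start, s)]
            seen = set()
            while agenda:
                q, rest = agenda.pop()
                if (q, rest) in seen:
                    continue
                seen.add((q, rest))
                if rest == "" and q in self.finals:
                    return True
                for (src, w, tgt) in self.arcs:
                    if src == q and rest.startswith(w):
                        agenda.append((tgt, rest[len(w):]))
            return False

    # Example: the language a+b
    m = FSA(arcs={(0, "a", 0), (0, "a", 1), (1, "b", 1)}, start=0, finals={1})
    assert m.accepts("aab") and not m.accepts("ba")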
Finite-State Automaton
• Automaton that recognizes a regular language
– Key transformations: remove ε-transitions, determinize, minimize
• Implementation of a regular expression
• Regular languages are closed under numerous operations
– Concatenation, union, intersection, Kleene *, difference, reverse, complement, …
• Correspond to regular grammars (type 3 in the Chomsky hierarchy)
• Pumping lemma: necessary condition for a language to be regular
Full 850-Word Dictionary
                         states    final states    arcs
Union                    5303      850             5302
Remove ε-transitions     4454      850             4453
Determinize              2609      848             2608
Minimize                 744       42              1535
Generalizations
• A finite-state recognizer is a function from Σ* → {0, 1}
– Meaning: fsa(s) = 1 ⇔ s is in the language
• Other rational relations…
– Finite-state transducer: Σ* → Δ*
– Weighted FSA: Σ* → ℝ
– Weighted FST: Σ* → Δ* × ℝ
• WFSAs and WFSTs can be considered probabilistic (but don't have to be)
Relations on Strings
• A relation is a set of (input, output) pairs.
– More general than functions, because you can represent ambiguity and optionality!
– For standard FSAs, think of input = output.
• Rational relations are a special kind of relation with a wide range of closure properties.
• Rational relations can be understood as a declarative programming paradigm:
– source code is a regular expression
– object code is a 2-tape automaton called a finite-state transducer (FST)
– optimization is accomplished by determinization and minimization
– supports nondeterminism, parallel processing over infinite sets of input strings, and reverse computation from outputs to inputs
Finite-State Automata and Transducers
Notation                  FSA    FST    Definition
Q                         ✓      ✓      finite set of states
Σ                         ✓      ✓      finite (input) vocabulary
Δ                                ✓      finite output vocabulary
q0 ∈ Q                    ✓      ✓      start state
F ⊆ Q                     ✓      ✓      set of final states
δ: Q ⨉ Σ* → 2^Q           ✓             transition function; possible next states given current state and input symbol(s)
δ: Q ⨉ Σ* ⨉ Δ* → 2^Q             ✓      … and output symbol(s)
Weighted Relations
• Assign scores to (input, output) pairs.
– Sometimes interpreted as p(input, output)
– Sometimes interpreted as p(output | input)
– Sometimes neither
• This idea unifies many NLP approaches:
– sequence labeling
– chunking
– normalization
– segmentation
– alignment
– speech recognition (Pereira and Riley, 1997)
– machine translation (Knight and Al-Onaizan, 1998)
Weighted FSTs
FST                       Weights         Definition
Q                                         finite set of states
Σ                                         finite input vocabulary
Δ                                         finite output vocabulary
q0 ∈ Q                                    start state
F ⊆ Q                     stop weights    set of final states
δ: Q ⨉ Σ* ⨉ Δ* → 2^Q      arc weights     transition function; possible next states given current state and input symbol(s) and output symbol(s)
WFSTs
[Venn diagram: WFSTs generalize both FSTs, whose weights are in {0, 1}, and WFSAs, which represent sets (languages), a special kind of relation where output = input; FSAs sit in the intersection of the two restrictions.]
Weights and Scores
• Weights are assigned to transitions and to ending a path in each state.
• The score of a path is the product of its transition weights and the stop weight (sketched below).
• "Zero" means the same thing as "impossible."
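A minimal sketch of this scoring scheme on a toy machine (the arcs list, stop_weight dict, and path_score function are illustrative, not from the lecture):

    # Toy weighted FST: arcs are (source, input, output, target, weight);
    # stop_weight[q] is the weight of ending a path in state q.
    arcs = [
        (0, "a", "x", 0, 0.3),
        (0, "a", "x", 1, 0.2),
        (1, "b", "z", 1, 0.4),
    ]
    stop_weight = {0: 0.0, 1: 0.1}   # weight 0 = "impossible" to stop in state 0

    def path_score(path):
        """Score of a path = product of its arc weights times the stop weight."""
        score = 1.0
        for (_, _, _, _, w) in path:
            score *= w
        final_state = path[-1][3]
        return score * stop_weight[final_state]

    # One path for input "ab", output "xz": 0 -a:x-> 1 -b:z-> 1, then stop.
    p = [arcs[1], arcs[2]]
    print(path_score(p))   # 0.2 * 0.4 * 0.1 = 0.008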
Eisner's Running Example
[Figure: the example WFST.] Two paths for (aabb, xz):
– path score = 0.0002646
– path score = 0.0002646
• score of (aabb, xz) = 0.0005296
Weighted Relations and Probabilities
• Let f(x, y) be the function corresponding to a WFST's assignment of a score to the (input, output) pair (x, y).
– If f sums to one over all (x, y), then it is a joint distribution.
– If f sums to one over all y, for each x, then it is a conditional distribution.
Parameterizing the WFST
• Option 1: every transition or stop gets a parameter.
– Option 1A: make sure competing choices (transitions from q and stopping in q) sum to 1.
[Figure: the example machine, which has 13 free parameters under this scheme.]
WFST Composition
• Let f and g define the weighted relations for two WFSTs such that f's output alphabet and g's input alphabet are the same. Then:
f ∘ g (x, z) = Σ_y f(x, y) ⨉ g(y, z)   (see the sketch below)
– Either f or g or both can be a set (instead of a relation).
– Either f or g can be unweighted (scores are 0 or 1).
– If both are unweighted sets (FSAs), then this is intersection.
• If f is a joint distribution p(x, y) and g is a conditional distribution p(z | y), we now have a probabilistic model over three string random variables.
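A minimal sketch of the composition formula on small, finite weighted relations represented as Python dicts (compose and the toy relations are illustrative; real WFST composition operates on the machines themselves, not on enumerated pairs):

    from collections import defaultdict

    def compose(f, g):
        """f, g: dicts mapping (input, output) string pairs to weights.
        Returns f∘g with (f∘g)(x, z) = sum over y of f(x, y) * g(y, z)."""
        h = defaultdict(float)
        for (x, y), w1 in f.items():
            for (y2, z), w2 in g.items():
                if y == y2:
                    h[(x, z)] += w1 * w2
        return dict(h)

    f = {("aabb", "xz"): 0.5, ("aabb", "xx"): 0.5}   # e.g., p(y | x)
    g = {("xz", "XZ"): 1.0, ("xx", "XX"): 1.0}
    print(compose(f, g))   # {('aabb', 'XZ'): 0.5, ('aabb', 'XX'): 0.5}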
WFST Composition
[Figure: the composed machine.]
– The (4,6) self-loop is really a → p → x.
– The (5,6) self-loop is really b → {p, q} → x.
– The (4,7) self-loop is really a → p → ε.
– The (5,7) self-loops are really b → p → ε and b → q → z.
Aside 1
• Eisner suggests another way to write down weighted regular relations, as probabilistic regular expressions.
– Build up from atomic expressions a:b, with a in Σ* and b in Δ*.
– Concatenation, probabilistic union, probabilistic closure.
• Almost no work on this in NLP or ML, as far as I've seen.
Noisy Channel Model
[Diagram: a Source WFSA over the idealized output of the predictor is composed (∘) with a Channel WFST to yield the observable input to the predictor.]
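In equations (a standard statement of the noisy channel setup, added for completeness rather than read off the slide): if the Source WFSA scores p(y) over idealized outputs and the Channel WFST scores p(x | y), then decoding the observed input x is

    ŷ = argmax_y p(y | x) = argmax_y p(y) ⨉ p(x | y),

a best-path problem in the composition of the two machines.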
Historical Note
• Unweighted FSTs were developed largely for designing and implementing models of the morphology of natural languages.
– Huge amount of work at Xerox.
– Also used in information extraction.
• Very useful for hand-constructing morphological rules individually, then assembling them by concatenation, union, composition, etc.
Parameterizations
1. Every arc gets one probability
2. Every "original" arc gets one probability
3. Log-linear distribution with shared features all over the WFST
– This is really the most general, since features could be identities of arcs or of "original" arcs!
Exercise
• How to represent an HMM as a WFST? (One answer is sketched after this list.)
– An MEMM?
– A chain-structured CRF?
• How to represent stochastic edit distance as a WFST?
– Elegant way to design a wide range of edit operations: composition of WFSTs
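A minimal sketch of the HMM case, under the usual construction (one WFST state per tag; an arc reading word w and outputting tag t2 gets weight p(t2 | t1) ⨉ p(w | t2); stop weights come from p(STOP | tag)). The parameter dictionaries below are hypothetical placeholders:

    # Hypothetical HMM parameters.
    trans = {("DT", "NN"): 0.6, ("NN", "NN"): 0.2}    # p(tag' | tag)
    emit  = {("NN", "dog"): 0.1, ("DT", "the"): 0.7}  # p(word | tag)
    stop  = {"NN": 0.3}                               # p(STOP | tag)

    # WFST arcs in the (source, input, output, target, weight) format above:
    # read a word, output its tag, move between tag states.
    arcs = [
        (t1, w, t2, t2, p_t * p_w)
        for (t1, t2), p_t in trans.items()
        for (t2e, w), p_w in emit.items()
        if t2e == t2
    ]
    stop_weight = stop
    # e.g., arcs contains ("DT", "dog", "NN", "NN", 0.06): move DT -> NN
    # while reading "dog" and outputting the tag NN.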
Back to Learning
• We want a general method for learning the parameters from data, even when all layers are not known.
– EM for HMMs is well known (Baum, 1972)
– EM for stochastic edit distances is well known (Ristad and Yianilos, 1996)
• If our WFST was constructed by composing simpler machines, we might want to keep the original parameterization.
– I.e., learn weights for the individual machines jointly.
Very General Formulation of Learning
• Flexibility in parameterization, including
– one probability per arc
– one probability per "original" arc
– log-linear distribution over arcs from a state, with parameter tying throughout the WFST
• Learn from samples of (input, output) pairs where xi ⊆ Σ* and yi ⊆ Δ* (paths not observed).
– supervised: each xi and each yi is a single string
– unsupervised: yi = Δ*
– lots of possibilities in between
Maximum Likelihood Estimation
• You can view this as a generalization of Baum-Welch training or EM for stochastic edit distances.
• Each example's total score is a path sum over the scores of paths that
– are allowed by the WFST
– match the input set xi
– match the output set yi
• Recall that f might be a joint or conditional probability distribution and θ might be log-linear or multinomial parameters.

    max_θ ∏_i f_θ(x_i, y_i) = max_θ ∏_i Σ_{a ∈ Paths(x_i, y_i)} ∏_{j=1}^{|a|} weight_θ(a_j)
Expectation Maximization
• E step (one example): find the distribution over paths given xi and yi; each example's likelihood is the ratio of two path sums:

    ( Σ_{a ∈ Paths(x_i, y_i)} ∏_{j=1}^{|a|} weight_θ(a_j) ) / ( Σ_{a ∈ Paths} ∏_{j=1}^{|a|} weight_θ(a_j) )

• M step: update θ to make those paths more likely (exact form depends on the parameterization):

    max_θ Σ_a E_{θ^(t−1)}[freq(a)] log weight_θ(a)
M Step
• The general objective: max_θ Σ_a E_{θ^(t−1)}[freq(a)] log weight_θ(a)
• Parameterization 1 (one probability per arc): weight_θ(a) = θ_a, so the objective becomes

    max_θ Σ_a E_{θ^(t−1)}[freq(a)] log θ_a

(see the sketch after this list)
• Parameterization 2 (one probability per "original" arc): "unwind" p(a) into a product of original-arc probabilities.
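For parameterization 1 this maximization has a closed form: normalize the expected counts among competing choices. A minimal sketch, assuming (as in Option 1A earlier) that arcs leaving the same state compete; m_step and its arguments are illustrative, and stopping in a state would compete too in a full implementation:

    from collections import defaultdict

    def m_step(expected_freq, source_state):
        """expected_freq: dict arc -> E[freq(a)] from the E step.
        source_state:  dict arc -> the state the arc leaves (its competitor group).
        Returns theta_a = E[freq(a)] / (total expected count for a's group)."""
        totals = defaultdict(float)
        for a, c in expected_freq.items():
            totals[source_state[a]] += c
        return {a: c / totals[source_state[a]] for a, c in expected_freq.items()}

    # Two arcs competing out of state 0:
    print(m_step({"arc1": 3.0, "arc2": 1.0}, {"arc1": 0, "arc2": 0}))
    # {'arc1': 0.75, 'arc2': 0.25}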
"Unwinding" Example
• The (4,6) self-loop is really a → p → x: 0's self-loop with a:x is really 4's self-loop with a:p and 6's self-loop with p:x.
M Step
• Parameterization 1 (one probability per arc):

    max_θ Σ_a E_{θ^(t−1)}[freq(a)] log θ_a

• Parameterization 2 (one probability per "original" arc): "unwind" p(a) into a product of original-arc probabilities.
• Parameterization 3 (log-linear and most general; see the gradient sketch below):

    max_θ Σ_a E_{θ^(t−1)}[freq(a)] ( θᵀg(a) − log Σ_{a′ ∈ Competitors(a)} exp(θᵀg(a′)) )
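The log-linear case has no closed form; a standard approach (not spelled out on the slide) is gradient ascent on the objective above. A minimal sketch of its gradient, with illustrative dense feature vectors g(a):

    import math

    def gradient(expected_freq, g, competitors, theta):
        """d/dtheta of sum_a E[freq(a)] * (theta.g(a) - log Z_a), where
        Z_a = sum over a' in Competitors(a) of exp(theta.g(a')).
        expected_freq: arc -> E[freq(a)];  g: arc -> feature vector (list);
        competitors: arc -> list of competing arcs (including a itself)."""
        dim = len(theta)
        grad = [0.0] * dim
        dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
        for a, c in expected_freq.items():
            comp = competitors[a]
            weights = [math.exp(dot(theta, g(b))) for b in comp]
            Z = sum(weights)
            for k in range(dim):
                # softmax-expected value of feature k over the competitor set
                mean_k = sum(w * g(b)[k] for w, b in zip(weights, comp)) / Z
                grad[k] += c * (g(a)[k] - mean_k)
        return grad

    g = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}.__getitem__
    print(gradient({"a1": 2.0}, g, {"a1": ["a1", "a2"]}, [0.0, 0.0]))
    # [1.0, -1.0]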
E Step
• The likelihood value for one example is the ratio of two path sums:

    ( Σ_{a ∈ Paths(x_i, y_i)} ∏_{j=1}^{|a|} weight_θ(a_j) ) / ( Σ_{a ∈ Paths} ∏_{j=1}^{|a|} weight_θ(a_j) )

where the denominator's Paths is Paths(Σ*, Δ*) for a generative model and Paths(x_i, Δ*) for a conditional model.
– The denominator path sum is the same for all examples in the generative case.
• But the E step's real job is to calculate the sufficient statistics that the M step needs!
Best Path
• General idea: take x and build a graph.
• The score of a path factors into the edges.
• Decoding is finding the best path (see the sketch below).
(Flashback to October 4, when I talked about decoding…)
The Viterbi algorithm is an instance of finding a best path!
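A minimal sketch of best-path decoding in the max-times style over a weighted acyclic graph (the graph and best_path function are illustrative placeholders; Viterbi over an HMM lattice is the special case the slide alludes to):

    def best_path(arcs, start, final, stop_weight):
        """Max-product best path in a DAG given as (src, tgt, weight) arcs,
        listed in topological order of src."""
        best = {start: (1.0, [start])}          # state -> (score, path so far)
        for (u, v, w) in arcs:
            if u in best:
                score = best[u][0] * w
                if v not in best or score > best[v][0]:
                    best[v] = (score, best[u][1] + [v])
        score, path = best[final]
        return score * stop_weight.get(final, 1.0), path

    arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 0.5), (2, 3, 0.9)]
    print(best_path(arcs, start=0, final=3, stop_weight={3: 1.0}))
    # (0.36, [0, 2, 3]) -- beats the path [0, 1, 3], which scores 0.30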
Sum
[Figure: the best-path graph again, with max replaced by sum.]
• Replacing max with sum (in log space, with log Σ_y exp) turns best-path scoring into sum scoring, and the best-path algorithm into the forward algorithm.
Expected Feature Counts
• The E step's real job, in the most general case, is to calculate the expected feature counts in the examples, under the current model.
• We've seen this before!
– Forward-backward; you can do it that way.
– Eisner suggests a different way, where the usual "plus-times" semiring is extended and the expectations are obtained in a single pass.
Semirings
• The tuple (K, ⊕, ⊗, 0, 1) is a semiring if (see the sketch after this list):
– K is a set of values
– ⊕ is a commutative, binary operation K × K → K with identity element 0
– ⊗ is a binary operation K × K → K with identity element 1
• for composition, ⊗ must be commutative
– ⊗ distributes over ⊕: a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
– 0 annihilates K: a ⊗ 0 = 0
– to handle infinite sets of cyclic paths, we need a unary closure operator * such that a* = 1 ⊕ a ⊕ (a ⊗ a) ⊕ (a ⊗ a ⊗ a) ⊕ …
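A minimal sketch of a pluggable semiring, where one path-sum routine yields the forward algorithm, Viterbi, or Boolean reachability depending on the (⊕, ⊗, 0, 1) supplied (the Semiring tuple and path_sum are illustrative; cyclic machines would also need the closure operator, omitted here):

    from collections import namedtuple

    Semiring = namedtuple("Semiring", "plus times zero one")

    REAL    = Semiring(lambda a, b: a + b,  lambda a, b: a * b,   0.0,   1.0)   # forward
    VITERBI = Semiring(max,                 lambda a, b: a * b,   0.0,   1.0)   # best path
    BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # reachability

    def path_sum(arcs, start, final, sr):
        """⊕ over all paths of the ⊗ of arc weights, for a DAG whose
        (src, tgt, weight) arcs are listed in topological order of src."""
        total = {start: sr.one}
        for (u, v, w) in arcs:
            if u in total:
                total[v] = sr.plus(total.get(v, sr.zero), sr.times(total[u], w))
        return total.get(final, sr.zero)

    arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 0.5), (2, 3, 0.9)]
    print(path_sum(arcs, 0, 3, REAL))      # 0.66: sum over both paths
    print(path_sum(arcs, 0, 3, VITERBI))   # 0.36: best path score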
Semirings You Know

interpretation of weights    want to compute        weights     "plus"       "times"
probabilities                p(s)                   [0, 1]      +            ×
probabilities                best path prob.        [0, 1]      max          ×
log-probabilities            log p(s)               (−∞, 0]     log+         +
log-probabilities            best path log-prob.    (−∞, 0]     max          +
costs                        min-cost path          [0, +∞)     min          +
Boolean                      s in language?         {0, 1}      ∨            ∧
strings                      s itself               Σ*          set-union    concat.
The Expectation Semiring
• Instead of a weight storing just a "forward" score or probability, also store a value in V.
– For us, V is vectors of (expected) feature counts.
– Assign to each arc a value corresponding to its local feature vector.
• Define the operations:

    (p, v) ⊗ (p′, v′) = (p p′, p v′ + p′ v)
    (p, v) ⊕ (p′, v′) = (p + p′, v + v′)
    (p, v)* = (p*, p* v p*)

• Result: the final value contains the path sum and the feature expectations.
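A minimal sketch of the ⊕ and ⊗ operations on (probability, count-vector) pairs, reusing the illustrative Semiring and path_sum from the earlier sketch; a single forward pass then accumulates expected feature counts alongside the path sum:

    def exp_plus(a, b):
        (p, v), (p2, v2) = a, b
        return (p + p2, [x + y for x, y in zip(v, v2)])

    def exp_times(a, b):
        (p, v), (p2, v2) = a, b
        return (p * p2, [p * y + p2 * x for x, y in zip(v, v2)])

    # Toy machine with two features (hence length-2 vectors in 0 and 1).
    EXPECT = Semiring(exp_plus, exp_times, (0.0, [0.0, 0.0]), (1.0, [0.0, 0.0]))

    # Arc weights are (prob, prob * local feature vector); here feature k
    # fires once on path k, so its expectation is that path's posterior.
    arcs = [(0, 1, (0.6, [0.6, 0.0])), (0, 2, (0.4, [0.0, 0.4])),
            (1, 3, (0.5, [0.0, 0.0])), (2, 3, (0.9, [0.0, 0.0]))]
    p, v = path_sum(arcs, 0, 3, EXPECT)
    print(p)                    # 0.66: total probability (the path sum)
    print([x / p for x in v])   # expected feature counts: [0.4545..., 0.5454...]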
What's Really Happening?
• We are manipulating weighted relations, not WFSTs.
• The expectation semiring's values are scores and gradients of scores with respect to θ.
– Forward-backward (e.g., for CRFs) is doing the same thing, only using the chain rule for derivatives to define a second pass (the backward pass).
– The expectation semiring lets you avoid the backward pass and the per-arc products of forward and backward probabilities.
– But it's probably slower in practice.
Other Goodies in the Paper and Later Work
• Analysis as the "algebraic path" problem, with links to a range of speedups (e.g., for acyclic graphs).
• A Viterbi variant of the expectation semiring.
• The probabilistic regular expressions idea.
– Potential for rapid incorporation of expert intuitions into data-driven systems?
• Li and Eisner (2009) go to second-order expectation semirings!
• Dreyer and Eisner (2009) use WFSTs to define factors in graphical models!
Closing Notes on Learning
• Eisner's approach is for MLE (and MAP), but his algorithms are actually inference methods for WFSTs.
– By changing to maximization and incorporating costs, you can do perceptron, structured SVM, and other error-driven learning.
• We didn't talk at all about learning the structure of finite-state models!
– There's a rich formal literature on this, and not too many papers that attempt it for real problems.
– I gave one classic citation on the wiki (Stolcke and Omohundro, 1993).
Toolkits
• FSM libraries (AT&T)
– Free binaries
– Implement pretty much everything you need to build weighted and unweighted FS recognizers and transducers… except training!
• Xerox FS toolkit
– Web demo; software can be purchased
– No weights
• RWTH FSA toolkit
– Newer, open source
– Not sure what's implemented
• OpenFST (Google)
– New incarnation of the FSM libraries
– Free and open source!
Notes on Algorithms

                             limitations            poly-time
ε-removal (Mohri, 2002)                             ✓
determinize (Mohri, 1997)    not all transducers
minimize (Eisner, 2003)      not all semirings      ✓