TRANSCRIPT
LECTURE 5: FINITE-STATE METHODS AND STATISTICAL NLP
Mark Granroth-Wilding
LAST LECTURE
⢠Finite-state automata (FSAs)
⢠Finite-state transducers (FSTs) to produce output
⢠Intro to morphology: word-internal structure
⢠Now: application of FSTs to morphological analysis (andgeneration)
1
FSTs FOR MORPHOLOGY
⢠Can encode morphological rules as FSTs
⢠Example: Finnish vowel harmony
⢠A reminder:⢠Back vowels: a, o, u⢠Front vowels: a, o, y⢠Middle vowels: e, i⢠Word contain back+middle or front+middle⢠Never: back+front⢠Affixes come in back and front forms: e.g. na/na
2
FSA FOR FINNISH VOWEL HARMONY
Accepts only valid combinations of front/back/middle
[State diagram: states 1, 2, 3. From state 1, back vowels (a, o, u) lead to state 2 and front vowels (ä, ö, y) to state 3; other symbols loop in state 1. State 2 loops on C, a, o, u, e, i; state 3 loops on C, ä, ö, y, e, i. A front vowel in state 2 or a back vowel in state 3 has no transition, so mixed words are rejected. (A code sketch of this acceptor follows the slide.)]
3
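The acceptor above can be written down directly as a small table-driven program. Below is a minimal Python sketch (not part of the original slides); the state numbering follows the diagram, and everything beyond what the diagram states is illustrative.

```python
# Minimal sketch of the harmony FSA as a table-driven acceptor.
# State 1 = no harmony class seen yet, 2 = back harmony, 3 = front harmony.
# Consonants and the neutral vowels (e, i) leave the state unchanged,
# as in the diagram.

BACK = set("aou")       # back vowels
FRONT = set("äöy")      # front vowels

def harmonic(word: str) -> bool:
    """Accept words whose vowels are back+middle or front+middle only."""
    state = 1
    for ch in word.lower():
        if ch in BACK:
            if state == 3:          # front harmony already established
                return False
            state = 2
        elif ch in FRONT:
            if state == 2:          # back harmony already established
                return False
            state = 3
        # consonants, e and i preserve the current state
    return True

assert harmonic("punainen")     # back + neutral vowels
assert harmonic("päässä")       # front + neutral vowels
assert not harmonic("talossä")  # mixes back and front: rejected
```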
FST FOR FINNISH VOWEL HARMONY
⢠Can be used in NLG system
⢠Affixes added by other process
⢠Generic form: harmony left till later (now)
Aâa/a Oâo/o Uâu/y
punaise+ssA â punaisessapaa+ssA â paassa
4
FST FOR FINNISH VOWEL HARMONY
[Transducer diagram: same three states as the FSA, now with input:output pairs on each arc. In the back-harmony state the archiphonemes are realised as A:a, O:o, U:u; in the front-harmony state as A:ä, O:ö, U:y; consonants, middle vowels and harmony-matching vowels are copied unchanged (C:C, a:a, e:e, ...), and "other" symbols map to themselves.
Example trace: p u n a i s e s s A → p u n a i s e s s a]
5
FST FOR FINNISH VOWEL HARMONY
[Same transducer diagram. Example trace: p ä ä s s A → p ä ä s s ä. (A code sketch of the transducer follows below.)]
5
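A minimal sketch of the same idea as a one-pass transducer: the archiphonemes A, O, U are realised according to the harmony class of the preceding vowels. Defaulting neutral-only stems to front forms is an assumption of this sketch (it matches standard Finnish, e.g. tie+ssA → tiessä), not something stated on the slides.

```python
# Minimal sketch of the harmony FST: copy most symbols unchanged and
# rewrite the archiphonemes A, O, U according to the harmony state.

def realise(form: str) -> str:
    back_map = {"A": "a", "O": "o", "U": "u"}
    front_map = {"A": "ä", "O": "ö", "U": "y"}
    harmony = front_map            # assumed default for neutral-only stems
    out = []
    for ch in form:
        if ch in "aou":
            harmony = back_map
        elif ch in "äöy":
            harmony = front_map
        out.append(harmony.get(ch, ch))   # archiphonemes rewritten, rest copied
    return "".join(out)

print(realise("punaisessA"))  # punaisessa
print(realise("päässA"))      # päässä
```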
OTHER USES OF FSTs
⢠Morphological generation
⢠Text preprocessing, tokenization, NER, . . .
⢠Simple dialogue systems:flow of possible questions, responses, . . .
⢠Dialogue models:tracking dialogue state, user knowledge
More on dialogue in lecture 11
6
PRACTICAL FSTs
⢠Vowel harmony: very simple example, not even complete
⢠Real morphology much more complex
⢠Huge, complicated FSTs
⢠Divide into smaller components: compose
⢠Limited ambiguity: few possible analyses per word
⢠Real morphological analysis forFinnish, English, Swedish, Turkish, Italian, German, . . . :
HFST: https://github.com/hfstWeb demo
7
PRACTICAL FSTs
https://github.com/hfst
Downloadable FST for Finnish morphology analysis
~1k nodes → full analyser contains 2.5M nodes!
8
THEORETICAL LIMITATIONS OF FSTs
⢠No memory: just current state
⢠Canât match brackets
this sentence ( is ( well ) bracketed ) .
this sentence ( is ( not ) well ) bracketed ) .
?⢠No non-linear structure
⢠Recursion/embedding: important for language
9
THEORETICAL LIMITATIONS OF FSTs
⢠Canât generate separate analyses for:
(un+lock)+able vs un+(lock+able)
⢠Embedding creates long-range
More in next lecture!
dependencies
Alice, with her big boots on, walks quicklyAthletes, after much training, walk quickly
10
EXPRESSIVITY OF LANGUAGE PROCESSING SYSTEMS
⢠Formal grammars categorized by expressive power(expressivity)
FSAs < CFGs < Turing machine
E.g. programminglanguages
⢠Greater expressivity â more languagesâ higher processing requirements
⢠FSA stores only current stateCFG additionally requires stack memory
this sentence ( is ( not ) well ) bracketed ) .
?
11
THE CHOMSKY HIERARCHY
⢠Noam Chomsky tried to define expressivity of NL⢠capture all dependencies, constraints, semantic connections⢠no more complexity than needed
⢠Aim: characterize constraints of human processing system
⢠Define in terms of hierarchy of formal grammars
⢠Increasing expressive powers
12
THE CHOMSKY HIERARCHY
Type 3   Regular                  Regular expressions
Type 2   Context-free             Context-free grammars
Type 1   Context-sensitive        Context-sensitive grammars
Type 0   Recursively enumerable   Turing machines
(plus other, fine-grained sub-categories)
13
THE CHOMSKY HIERARCHY
Morphology:
• mostly regular
• typically use FSTs
Syntax:
• open question
• ≥ CF, < CS
• English (many others) mostly CF
• small exceptions
[Hierarchy diagram: Regular ⊂ Context-free ⊂ Mildly CS? ⊂ Context-sensitive ⊂ Recursively enumerable]
14
BEYOND REGULAR EXPRESSIVITY
Group discussion
⢠Real human language doesnât use nested brackets (often)
⢠Whatâs wrong with FSTs?
⢠Come up with example of a linguistic phonomenon that canâtbe captured by FSAs/FSTs
⢠Any (human) language⢠Think about memory:
What does it need to keep track of from earlier insentence/dialogue/. . . ?
⢠What gets lost if you try to do analysis with FST?
⢠Why does this matter?Canât we just use more expressive grammars?
15
LANGUAGE MODELLING
⢠Language model (LM): probability distribution oversentences of a language
⢠Probability of sentence appearing in languageâs text
p(w0, . . . ,wN)
But he may make trouble withyour father
Highly plausible
Each night Arkilu departed,leaving a furry man-creature onguard
Less probable
(But both real, taken from same text)
16
LANGUAGE MODELLING
Uses of LMs:
⢠Speech recognition: plausibility judgement of possible outputs
⢠Word prediction for input device / communication aid
⢠Spelling correction
⢠Much more. . .
⢠One example of statistics in NLP
⢠Estimate probabilities from texts seen before
⢠Use linguistic corpora
17
MARKOV LM
Simple statistical LM: Markov chain
Markov assumption: probability of word depends only on previous word
• Far from true: long-range dependencies
• But, how well does it work in practice?
Simply model: p(w_i | w_{i-1})
Probability of sentence:
p(w_0, ..., w_{n-1}) = ∏_{i=0}^{n-1} p(w_i | w_{i-1})
18
MARKOV LM
AKA bigram model
⢠bigram: pair of consecutive words (wi wi+1)
⢠model only looks at bigram statistics
Estimate probabilities from corpus counts
p(wi |wiâ1) ' C (wiâ1 wi )
C (wiâ1)
Can represent structure of model as plate diagram:
w0
p(w0|START)
w1
p(w1|w0)
w2
p(w2|w1)
. . .
19
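A minimal Python sketch of the bigram estimate above. The toy corpus and the "<s>" start pseudo-token are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of bigram LM estimation following the formula
# p(w_i | w_{i-1}) ≈ C(w_{i-1} w_i) / C(w_{i-1}).

from collections import Counter

corpus = [
    ["the", "dog", "ate", "the", "bone"],
    ["the", "dog", "ran"],
]

unigrams = Counter()   # C(w_{i-1}): counts of words in the history position
bigrams = Counter()    # C(w_{i-1} w_i)
for sent in corpus:
    tokens = ["<s>"] + sent
    for prev, curr in zip(tokens, tokens[1:]):
        unigrams[prev] += 1
        bigrams[(prev, curr)] += 1

def p(curr, prev):
    """Maximum-likelihood estimate of p(curr | prev)."""
    return bigrams[(prev, curr)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sent):
    """Probability of a sentence as a product of bigram probabilities."""
    prob = 1.0
    tokens = ["<s>"] + sent
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= p(curr, prev)
    return prob

print(p("dog", "the"))                       # 2/3
print(sentence_prob(["the", "dog", "ran"]))  # 1.0 * 2/3 * 1/2 = 1/3
```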
N-GRAM MODELS
⢠Bigram model: predictions have limited context â Markovassumption
⢠Condition on longer context: n-gram⢠p(wi |wiâ2,wiâ1) â trigram⢠p(wi |wiâ3,wiâ2,wiâ1) â 4-gram
⢠Estimate as before:
p(wi |wiâ3,wiâ2,wiâ1) ' C (wiâ3 wiâ2 wiâ1 wi )
C (wiâ3 wiâ2 wiâ1)
⢠Better predictions if enough examples of wiâ3 wiâ2 wiâ1 intraining data
20
REMINDER: ZIPF'S LAW
[Plot: token frequency against frequency rank for the 100 most frequent tokens. Most frequent: ","; next: "the"; the curve drops steeply into a "long tail" of rare tokens.]
21
ZIPF'S LAW
[Plot: the same frequency-vs-rank distribution over thousands of ranks; almost all of it is the long tail.]
• A few n-grams are very common
• Many n-grams are very rare (long tail)
For n-gram models:
• Always some unseen n-grams
• Diminishing returns of more data
22
DATA SPARSITY
⢠How to assign probability to unseen words?
p(phogen|tiâ2, tiâ1)
⢠What predictions to make in unseen context?
p(ti |phogen, cridget)
⢠General problem: models must be robust to things not seen intraining data
1. Do we have any information to inform the decision?
2. If no, what then?
23
DATA SPARSITY
Robust to things not seen in training data
1. Do we have any information to inform the decision?
2. If no, what then?
Some solutions for n-gram models:
1. Backoff: unseen/rare n-gram, try using (n−1)-gram
2. Smoothing: reserve some probability for things never seen before
24
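Minimal sketches of both fixes for the bigram case, reusing the `bigrams` and `unigrams` count tables from the earlier bigram sketch. The unnormalised fall-back to a unigram estimate and add-one (Laplace) smoothing are illustrative choices, not the only schemes.

```python
# Minimal sketches of backoff and smoothing for a bigram model.

def p_backoff(curr, prev, bigrams, unigrams, total_tokens):
    """Use the bigram estimate if the bigram was seen, else back off to the unigram."""
    if bigrams[(prev, curr)] > 0:
        return bigrams[(prev, curr)] / unigrams[prev]
    return unigrams[curr] / total_tokens    # crude, unnormalised backoff

def p_add_one(curr, prev, bigrams, unigrams, vocab_size):
    """Add-one smoothing: every bigram gets a pseudo-count of 1."""
    return (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
```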
LANGUAGE MODELLING APPLICATION
One example: speech recognition
Speech input, two candidate transcriptions scored by the LM:
"Finally a small settlement loomed ahead."   LM score: -221.6
"Final ya smalls set all mens loom da head."   LM score: -4840.1
25
LANGUAGE MODELLING ACCURACY?
⢠Remember: LM is prob dist over sentences
p(W )
⢠Typically modelled as dist over words given earlier words
p(wi |w0, . . . ,wiâ1)
⢠Two LMs produce different predictionsp(wi |w0, . . . ,wiâ1, LM0) p(wi |w0, . . . ,wiâ1, LM1)
⢠Could compare using accuracy
argmaxw (p(w |w0, . . . ,wiâ1))?= wi
⢠Large vocabulary: top prediction rarely correct
26
EVALUATING A LANGUAGE MODEL
⢠Observed word should get high probability
⢠but others are not wrong
⢠Compare probabilities of observed words
⢠Perplexity: measure mean probability of observed words
⢠How surprised model is on average by each word
2â1mlog2(p(w0,...,wmâ1))
⢠For LM whose context is previous words:
2â1m
âmi=0 log2(p(wi |w0,...,wi ))
⢠Lower is better
27
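A minimal sketch of the perplexity computation above. `probs` is assumed to hold the model's conditional probabilities p(w_i | w_0, ..., w_{i-1}) for each word of a test sequence.

```python
# Minimal sketch of perplexity: 2 to the power of the negative mean
# log2 probability per observed word.

import math

def perplexity(probs):
    m = len(probs)
    log_prob = sum(math.log2(p) for p in probs)   # log2 p(w_0, ..., w_{m-1})
    return 2 ** (-log_prob / m)

print(perplexity([0.25] * 4))   # 4.0: like guessing uniformly among 4 words
print(perplexity([0.5] * 4))    # 2.0: less surprised on average, lower PPL
```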
EXAMPLE LM PERPLEXITIES
Evaluation on Penn Treebank texts (890k toks):
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    54.44
LSTM (2014)                              82.7
Smoothed 5-gram (1996)                   141.2

Evaluation on WikiText-2 (Wikipedia text, 2M toks):
Model                                    Perplexity
AWD-LSTM, mixture of softmaxes (2018)    61.45
Variational LSTM (2016)                  87.0
28
PERPLEXITY
⢠What does 54.44 PPL mean?
⢠Is it good?
⢠Depends on:⢠test corpus: how hard is it to predict?⢠training corpus: how big / representative?
⢠Compare models on same training + test corpus
29
POS TAGGING
Reminder
⢠Distinguish syntactic function of words in broad classes
For the present
Noun
, we are. . . vs. The
Adjective
present situation. . .
⢠NLP subtask: part-of-speech (POS) tagging
⢠Shallow syntactic analysis: no explicit structure
⢠Includes some disambiguation important to meaning
⢠Useful practical first analysis step
30
POS TAGS
Some parts of speech:
POS          Example
Noun         The dog ate the bone
Verb         The dog ate the bone
Adjective    The big dog ate the tasty bone
Adverb       The dog ate the bone quickly
Pronoun      He ate my bone
Determiner   The dog ate that bone
Typically make slightly more fine-grained distinctions.
Some words have one possible POS; most have several.
31
EXERCISE: POS-TAGGING AMBIGUITY
In small groups
POS                        Example
Noun                       The dog ate the bone
Verb                       The dog ate the bone
Adjective                  The big dog ate the bone
Adverb                     He ate the bone quickly
Pronoun                    He ate my bone
Determiner                 The dog ate that bone
Preposition                He chewed with his teeth
Coordinating conjunction   He chewed and he growled

Example
Return now to your quarters and I will send you word of the outcome
• POS tag this sentence, using the POSs above
• What other POSs could each word take (in other contexts)?
32
AMBIGUITY OF INTERPRETATION
What is the mean temperature in Kumpula?

SELECT day_mean FROM daily_forecast
WHERE station = 'Helsinki Kumpula'
AND date = '2019-05-21';

SELECT day_mean FROM weekly_forecast
WHERE station = 'Helsinki Kumpula'
AND week = 'w22';

SELECT MEAN(day_temp)
FROM weather_history
WHERE station = 'Helsinki Kumpula'
AND year = '2019';

...?

• Many forms of ambiguity
• Every level/step of analysis
• The big challenge of NLP
33
SOME AMBIGUITIES
Easily misinterpreted headlines¹:
• Ban on Nude Dancing on Governor's Desk
• Juvenile Court to Try Shooting Defendant
  (2 interpretations: same POS tag, different meanings; different POS tags)
• Kids Make Nutritious Snacks
Several types of ambiguity → what?
Can POS-tagging ambiguity explain these?
¹ Thanks to Dan Klein & Roger Levy
34
AMBIGUITY
Juvenile Court to Try Shooting Defendant
⢠Two interpretations by human reader
⢠NLP system should output both
⢠Most syntactic parsers will produce dozens of structures!
⢠Majority have no intelligible interpretation
35
LOCAL AMBIGUITY
Ambiguity often resolved by immediate context
Turning to Doggo, Myles extended his left palm.
⢠Most ambiguities resolved in this way for humans
⢠Simple context sometimes enough
⢠Sometimes require sophisticated reasoning/knowledge:
Time flies like an arrow.Fruit flies like a banana.
36
GARDEN-PATH SENTENCES
⢠Start encourages one interpretation
⢠Continuation forces re-analysis
Thanks to @lanegreen
Thanks to @AlexBledsoe
37
CORPORA
Corpus (pl. corpora)
The body of written or spoken material upon which a linguisticanalysis is based
⢠Why do we need corpora?⢠Test linguistic hypotheses⢠Evaluate tools: annotated/labelled corpus⢠Train statistical models Statistical NLP:
today⢠Most often: collection of text⢠Also: speech, video, numeric data, . . . , combinations
⢠Vary in:source, language(s), domain, size, quality, annotations
38
STATISTICAL MODELS
Why use statistical models for NLP?
⢠Older systems were rule-based
⢠Used long-studied linguistic knowledge, but:⢠Lots of rules⢠Complex interactions⢠Narrow domain⢠Hard to handle varied (âincorrectâ) and changing language
⢠Statistics from data can help
39
STATISTICAL MODELS
Help model ambiguity / uncertainty, as in POS tagging
⢠Multiple interpretations
⢠Weights/confidences derived from data
⢠Combinatorial effectsâ influence of context
⢠Express uncertainty in output
40
STATISTICAL MODELS
⢠Statistics over previous analyses can help estimate confidences
⢠Often use probabilistic models
⢠Local ambiguity: probabilities/confidences
⢠Multiple hypotheses about meaning/structure
⢠Update hypotheses as larger units are combined
Statistical models ' machine learning (ML)
41
STATISTICAL MODELS
⢠Can try to learn everything from data
⢠Practical and theoretical difficulties
⢠Some success in recent work
Advanced statisticalNLP: lecture 13
⢠Mostly: focussed statistical modelling of sub-tasks
⢠Supervised & unsupervised learning
⢠Collect statistics (learn) from corpora⢠annotated (supervised) / raw data (unsupervised)
42
POS-TAGGING AMBIGUITY
Example
Return now to your quarters and I will send you word of the outcome.
noun, verb, adjective, adverb, pronoun, determiner, ...
• You POS-tagged this sentence
• Several words could have taken multiple tags
• Typically, many possible tags per word: some rare but allowed
43
POS-TAGGING STATISTICS
⢠Collect statistics about tag sequences from annotated corpus
⢠tag-word ocurrences⢠typical tag sequences
Potential benefits from statistical model:
⢠Improbable tag-word combinations outweighed by probabletag sequences
⢠And vice versa
⢠Broad domain models: learn from real data
⢠Robust:⢠never seen word, fall back on default tags⢠use tag that looks plausible in context
44
HIDDEN MARKOV MODEL
⢠Probabilistic model, widely used for POS tagging
⢠Maybe seen before: refresher here
⢠Model probability of (word,tag) sequence using⢠tag-word statistics⢠tag sequence statistics
⢠Makes some simplifying assumptions
45
HIDDEN MARKOV MODEL (HMM)
[Diagram: hidden tag sequence Prn → Vb → Det → Nn with transition probabilities p(t_0 | Start), p(t_1 | t_0), p(t_2 | t_1), p(t_3 | t_2), ...; each tag emits a word of "this is a sentence" with p(this | t_0), p(is | t_1), p(a | t_2), p(sentence | t_3). States: POS tags]
• Plate diagram: conditional prob dists
• Markov assumption on tag sequence: p(t_i | t_{i-1})
• Conditional independence of words: p(w_i | t_i)
  • Independence assumptions pretty bad
• But works quite well in practice
46
HIDDEN MARKOV MODEL (HMM)
[Diagram: the same HMM with unknown tags t_0, t_1, t_2, t_3; transitions p(t_0 | Start), p(t_1 | t_0), p(t_2 | t_1), p(t_3 | t_2), ...; emissions p(this | t_0), p(is | t_1), p(a | t_2), p(sentence | t_3)]
• Know p(t_i | t_{i-1}) and p(w_i | t_i)
• Input sentence, POS tags unknown
• Choose t_0, t_1, ... that maximizes sentence probability
• Efficient inference with Viterbi algorithm (dynamic programming; sketched after this slide)
47
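A minimal sketch of Viterbi decoding for this kind of HMM. The parameter format (`trans[prev][tag]` = p(tag | prev), `emit[tag][word]` = p(word | tag), "<s>" as start state) and the toy numbers are illustrative assumptions; missing entries are treated as probability 0.

```python
# Minimal sketch of the Viterbi algorithm: dynamic programming over
# the best tag path ending in each tag at each position.

def viterbi(words, tags, trans, emit):
    """Return the tag sequence maximising p(tags, words) under the HMM."""
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (trans["<s>"].get(t, 0) * emit[t].get(words[0], 0), [t])
            for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((prob * trans[prev].get(t, 0), path)
                 for prev, (prob, path) in best.items()),
                key=lambda x: x[0])
            new_best[t] = (score * emit[t].get(w, 0), path + [t])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Toy example (invented numbers, for illustration only)
tags = ["Det", "Nn"]
trans = {"<s>": {"Det": 0.8, "Nn": 0.2},
         "Det": {"Det": 0.1, "Nn": 0.9},
         "Nn": {"Det": 0.3, "Nn": 0.7}}
emit = {"Det": {"the": 0.6}, "Nn": {"dog": 0.4}}
print(viterbi(["the", "dog"], tags, trans, emit))   # ['Det', 'Nn']
```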
HMM PARAMETER ESTIMATION
⢠Gold standard POS-tagged corpus: supervised training
⢠Estimate prob dists using maximum likelihood
⢠Frequentist estimate
p(tx |ty ) ' C (ty tx)
C (ty )
p(wz |tx) ' C (wz â tx)
C (tx)
48
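A minimal sketch of these maximum-likelihood estimates computed from a tiny made-up tagged corpus (lists of (word, tag) pairs); the tag names and "<s>" start handling are illustrative.

```python
# Minimal sketch of HMM parameter estimation from a tagged corpus.

from collections import Counter

tagged = [
    [("the", "Det"), ("dog", "Nn"), ("ate", "Vb"), ("the", "Det"), ("bone", "Nn")],
    [("he", "Prn"), ("ate", "Vb"), ("quickly", "Adv")],
]

tag_counts = Counter()    # C(t)
tag_bigrams = Counter()   # C(t_y t_x)
emissions = Counter()     # C(w_z, t_x)
for sent in tagged:
    prev = "<s>"
    for word, tag in sent:
        tag_counts[tag] += 1
        tag_bigrams[(prev, tag)] += 1
        emissions[(word, tag)] += 1
        prev = tag

START_COUNT = len(tagged)   # number of sentence starts

def p_trans(tag, prev):
    """p(t_x | t_y) ≈ C(t_y t_x) / C(t_y)."""
    denom = START_COUNT if prev == "<s>" else tag_counts[prev]
    return tag_bigrams[(prev, tag)] / denom

def p_emit(word, tag):
    """p(w_z | t_x) ≈ C(w_z, t_x) / C(t_x)."""
    return emissions[(word, tag)] / tag_counts[tag]

print(p_trans("Nn", "Det"))   # 1.0: in this toy corpus Det is always followed by Nn
print(p_emit("ate", "Vb"))    # 1.0: Vb always emits "ate" here
```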
MORE CONTEXT: N-GRAMS
⢠Markov assumption: bigram model of tag sequences
⢠Short context
⢠Can extend context as with n-gram LM
⢠Condition on longer tag history: p(ti |tiâ1, tiâ2)
⢠Apply same methods as before to deal with sparse data⢠Backoff to shorter contexts⢠Smoothing of low counts
49
N-GRAM TAGGING MODEL
⢠Still an HMM: many more states
⢠Markov assumption over n-grams
⢠Each state represents (n-1)-gram
⢠Parameter estimation and inference as before
t0
p(t0|Start)
t0 t1
p(t1|t0, Start)
t1 t2
p(t2|t0, t1)
t2 t3
p(t3|t1, t2). . .
this
p(this|t0)
is
p(is|t1)
a
p(a|t2)
sentence
p(sentence|t3)
50
STATISTICAL NLP
To be continued...
Next lecture:
⢠Some difficulties with n-gram HMMs: data sparsity
⢠Unsupervised/semi-supervised learning
⢠Other statistical models
Later (lecture 13):
⢠More sophisticated ML
⢠RNNs, CNNs, . . .
51
THIS WEEK'S ASSIGNMENT
• Estimating HMM probabilities
• Semi-supervised learning: domain transfer
• Language modelling
• Simple generation
• Start now!
• Help session: Thursday, 9-11
• Due next Monday
52
SUMMARY
⢠Limitations of finite-state methods for NLP
⢠The Chomsky Hierarchy and expressivity
⢠Language modelling
⢠POS tagging: theory and applications
⢠Statistical models in NLP
⢠HMM for statistical POS tagging
53
READING MATERIAL
⢠Regular expressions: J&M3 2.1
⢠Morphology: J&M3 2.4.4 (J&M2 p79-80)
⢠Language modelling: J&M3 ch 3
⢠POS tagging: J&M3 8.1, 8.3 (J&M2 p157-8, 167-9)
⢠HMMs for POS tagging: J&M3 8.4
⢠More detail on HMMs: J&M3 appendix A
Ambiguity game, with excellent descriptions of linguistic concepts:http://madlyambiguous.osu.edu:1035/
54