Part-of-Speech Tagging
Chapter 8(8.1-8.4.6)
9/19/2019 Speech and Language Processing - Jurafsky and Martin 2
Outline
Parts of speech (POS)
Tagsets
POS Tagging
Rule-based tagging
Probabilistic (HMM) tagging
Garden Path Sentences
The old dog the footsteps of the young
Parts of Speech
Traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Lots of debate within linguistics about the number, nature, and universality of these
We'll completely ignore this debate.
Parts of Speech
Traditional parts of speech ~ 8 of them
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD   tag
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N
Why is POS Tagging Useful?
First step of many practical tasks, e.g.:
Speech synthesis (aka text-to-speech): how to pronounce "lead"? OBject vs. obJECT, CONtent vs. conTENT
Parsing: need to know if a word is an N or V before you can parse
Information extraction: finding names, relations, etc.
Language modeling: backoff
Why is POS Tagging Difficult?

Words often have more than one POS: back
The back door = adjective
On my back = noun
Win the voters back = adverb
Promised to back the bill = verb
The POS tagging problem is to determine the POS tag for a particular instance of a word.
POS Tagging
Input: Plays well with others
Ambiguity: Plays: NNS/VBZ; well: UH/JJ/NN/RB; with: IN; others: NNS
Output: Plays/VBZ well/RB with/IN others/NNS
(Penn Treebank POS tags)
POS tagging performance
How many tags are correct? (Tag accuracy)
About 97% currently
But the baseline is already 90%
Baseline is the performance of the stupidest possible method:
Tag every word with its most frequent tag
Tag unknown words as nouns
Partly easy because many words are unambiguous: you get points for them (the, a, etc.) and for punctuation marks!
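As a concrete illustration, here is a minimal sketch of that baseline (the function names and toy corpus are invented for this example):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Learn each word's most frequent tag from a tagged corpus.
    tagged_sents: list of [(word, tag), ...] sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_freq_tag):
    # Unknown words default to NN (noun), as the slide suggests.
    return [(w, most_freq_tag.get(w, "NN")) for w in words]

corpus = [[("the", "DT"), ("back", "NN")],
          [("the", "DT"), ("back", "JJ"), ("door", "NN")],
          [("the", "DT"), ("back", "JJ"), ("seat", "NN")]]
model = train_baseline(corpus)
print(tag_baseline(["the", "back", "wall"], model))
# "back" gets its most frequent tag (JJ); unseen "wall" defaults to NN
```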
Deciding on the correct part of speech can be difficult even for people
Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
How difficult is POS tagging?
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech
But they tend to be very common words, e.g., that:
I know that he is honest = IN
Yes, that play was nice = DT
You can't go that far = RB
40% of the word tokens are ambiguous
Review
Backoff/Interpolation
Parts of Speech: what?
Part-of-Speech Tagging: what? why? easy or hard? evaluation
Open vs. Closed Classes
Closed class: a small fixed membership
Determiners: a, an, the
Prepositions: of, in, by, ...
Auxiliaries: may, can, will, had, been, ...
Pronouns: I, you, she, mine, his, them, ...
Usually function words (short common words which play a role in grammar)
Open class: new ones can be created all the time
English has 4: Nouns, Verbs, Adjectives, Adverbs
Many languages have these 4, but not all!
Open Class Words
Nouns
Proper nouns (Pittsburgh, Pat Gallagher): English capitalizes these
Common nouns (the rest): count nouns and mass nouns
Count nouns have plurals and get counted: goat/goats, one goat, two goats
Mass nouns don't get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Verbs: in English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, ...
particles: up, down, on, off, ...
determiners: a, an, the, ...
pronouns: she, who, I, ...
conjunctions: and, but, or, ...
auxiliary verbs: can, may, should, ...
numerals: one, two, three, third, ...
Prepositions from CELEX
POS Tagging: Choosing a Tagset
There are many parts of speech and many potential distinctions we could draw
To do POS tagging, we need to choose a standard set of tags to work with
Could pick very coarse tagsets: N, V, Adj, Adv
A more commonly used set is finer grained: the "Penn Treebank tagset", with 45 tags
Even more fine-grained tagsets exist
Penn TreeBank POS Tagset
Using the Penn Tagset
The/? grand/? jury/? commented/? on/? a/? number/? of/? other/? topics/? ./?
Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Recall POS Tagging Difficulty
Words often have more than one POS: back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
How Hard is POS Tagging? Measuring Ambiguity
Tagging Whole Sentences with POS is Hard too
Ambiguous POS contexts, e.g., Time flies like an arrow.
Possible POS assignments:
Time/[V,N] flies/[V,N] like/[V,Prep] an/Det arrow/N
Time/N flies/V like/Prep an/Det arrow/N
Time/V flies/N like/Prep an/Det arrow/N
Time/N flies/N like/V an/Det arrow/N
...
How Do We Disambiguate POS?
Many words have only one POS tag (e.g. is, Mary, smallest)
Others have a single most likely tag (e.g. dog is less often used as a V)
Tags also tend to co-occur regularly with other tags (e.g. Det, N)
In addition to conditional probabilities of words P(wn|wn-1), we can look at POS likelihoods P(tn|tn-1) to disambiguate sentences and to assess sentence likelihoods
More and Better Features: Feature-based Tagger

Can do surprisingly well just looking at a word by itself:
Word: the → DT
Lowercased word: Importantly → importantly → RB
Prefixes: unfathomable → un- → JJ
Suffixes: Importantly → -ly → RB
Capitalization: Meridian → CAP → NNP
Word shapes: 35-year → d-x → JJ
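A minimal sketch of such word-internal feature extraction (the feature names are illustrative, not from any particular tagger; the shape here is uncollapsed, e.g. "dd-xxxx" rather than the slide's collapsed "d-x"):

```python
def word_features(word):
    """Features computable from the word alone, as listed above."""
    return {
        "word": word,
        "lower": word.lower(),
        "prefix2": word[:2],           # e.g. "un" for "unfathomable"
        "suffix2": word[-2:],          # e.g. "ly" for "importantly"
        "is_capitalized": word[0].isupper(),   # e.g. True for "Meridian"
        # Map digits to 'd', letters to 'x', keep everything else.
        "shape": "".join("d" if c.isdigit() else "x" if c.isalpha() else c
                         for c in word),
    }

print(word_features("35-year")["shape"])  # dd-xxxx
```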
Overview: POS Tagging Accuracies
Rough accuracies (overall / unknown words):
Most freq tag: ~90% / ~50%
Trigram HMM: ~95% / ~55%
Maxent P(t|w): 93.7% / 82.6%
Upper bound: ~98% (human)
Most errors on unknown words
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word
Start With a Dictionary
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
Assign Every Possible Tag
She/PRP promised/{VBN,VBD} to/TO back/{NN,RB,JJ,VB} the/DT bill/{NN,VB}
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP":
She/PRP promised/VBD to/TO back/{NN,RB,JJ,VB} the/DT bill/{NN,VB}
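The elimination step can be sketched as follows (the lattice representation and function name are hypothetical, written just for this one rule):

```python
def eliminate_vbn_after_start_prp(lattice):
    """lattice: list of (word, set_of_candidate_tags) pairs.
    Implements the single rule above: drop VBN when VBD is also a
    candidate and the word follows a sentence-initial PRP."""
    prev_tags = {"<start>"}
    for i, (word, tags) in enumerate(lattice):
        if i == 1 and prev_tags == {"PRP"} and {"VBN", "VBD"} <= tags:
            tags.discard("VBN")
        prev_tags = set(tags)
    return lattice

lattice = [("She", {"PRP"}), ("promised", {"VBN", "VBD"}),
           ("to", {"TO"}), ("back", {"VB", "JJ", "RB", "NN"}),
           ("the", {"DT"}), ("bill", {"NN", "VB"})]
print(eliminate_vbn_after_start_prp(lattice)[1])
# ('promised', {'VBD'})
```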
POS tag sequences
Some tag sequences are more likely to occur than others (POS N-gram view):
https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem
POS Tagging as Sequence Classification
We are given a sentence (an "observation" or "sequence of observations"): Secretariat is expected to race tomorrow
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view:
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1...wn
How do you predict the tags?
Two types of information are useful:
Relations between words and tags
Relations between tags and tags: DT NN, DT JJ NN, ...
Getting to HMMs (Hidden Markov Models)
We want, out of all sequences of n tags t1...tn, the single tag sequence such that P(t1...tn | w1...wn) is highest:

t̂1...tn = argmax over t1...tn of P(t1...tn | w1...wn)

Hat ^ means "our estimate of the best one"
argmax_x f(x) means "the x such that f(x) is maximized"
Getting to HMMs
This equation is guaranteed to give us the best tag sequence
But how do we make it operational? How do we compute this value?
Intuition of Bayesian classification: use Bayes rule to transform this equation into a set of other probabilities that are easier to compute
Using Bayes Rule

P(t1...tn | w1...wn) = P(w1...wn | t1...tn) P(t1...tn) / P(w1...wn)

Since P(w1...wn) is the same for every candidate tag sequence, we can drop it:

t̂1...tn = argmax of P(w1...wn | t1...tn) P(t1...tn)
Statistical POS tagging
What is the most likely sequence of tags for the given sequence of words w
P(DT JJ NN | a smart dog) ∝ P(a|DT) P(smart|JJ) P(dog|NN) × P(DT|<s>) P(JJ|DT) P(NN|JJ)
Likelihood and Prior

Likelihood: P(w1...wn | t1...tn) ≈ ∏ P(wi | ti)   (each word depends only on its own tag)
Prior: P(t1...tn) ≈ ∏ P(ti | ti-1)   (tag bigram assumption)

So: t̂1...tn = argmax of ∏ P(wi|ti) P(ti|ti-1)
Two Kinds of Probabilities
Tag transition probabilities P(ti|ti-1)
Determiners likely to precede adjectives and nouns: That/DT flight/NN, The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low
Compute P(NN|DT) by counting in a labeled corpus:
P(NN|DT) = Count(DT, NN) / Count(DT)
Two Kinds of Probabilities
Word likelihood (emission) probabilities P(wi|ti)
VBZ (3sg pres verb) likely to be "is"
Compute P(is|VBZ) by counting in a labeled corpus:
P(is|VBZ) = Count(VBZ, is) / Count(VBZ)
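Both kinds of probabilities are simple relative-frequency counts over a tagged corpus; a minimal sketch with a toy two-sentence corpus (all names are illustrative):

```python
from collections import Counter

def hmm_counts(tagged_sents):
    """Maximum-likelihood transition and emission probabilities:
    P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    P(w_i | t_i)     = C(t_i, w_i) / C(t_i)"""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        prev = "<s>"
        tag_count[prev] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    p_trans = lambda t, prev: trans[(prev, t)] / tag_count[prev]
    p_emit = lambda w, t: emit[(t, w)] / tag_count[t]
    return p_trans, p_emit

corpus = [[("that", "DT"), ("flight", "NN")],
          [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]]
p_trans, p_emit = hmm_counts(corpus)
print(p_trans("NN", "DT"))   # C(DT,NN)/C(DT) = 1/2 = 0.5
print(p_emit("that", "DT"))  # C(DT,that)/C(DT) = 1/2 = 0.5
```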
Put them together
Two independence assumptions:
Approximate P(t) by a bigram (or N-gram) model
Assume each word depends only on its POS tag
Table representation
Prediction in generative model
Inference: what is the most likely sequence of tags for the given sequence of words w?
Equivalently: what are the latent states that most likely generated the sequence of words w?
Example: The Verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
Example:
P(NN|TO) = .00047     P(VB|TO) = .83
P(race|NN) = .00057   P(race|VB) = .00012
P(NR|VB) = .0027      P(NR|NN) = .0012

P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
So we (correctly) choose the verb reading
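The arithmetic above can be checked directly (probabilities copied from the example):

```python
# Transition probabilities P(tag | previous tag) and emission
# probabilities P(word | tag), as given in the example above.
P = {("VB", "TO"): .83, ("NN", "TO"): .00047,
     ("NR", "VB"): .0027, ("NR", "NN"): .0012,
     ("race", "VB"): .00012, ("race", "NN"): .00057}

verb = P[("VB", "TO")] * P[("NR", "VB")] * P[("race", "VB")]
noun = P[("NN", "TO")] * P[("NR", "NN")] * P[("race", "NN")]
print(f"verb: {verb:.2e}  noun: {noun:.2e}")
# The verb reading wins by roughly three orders of magnitude.
```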
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
Definitions

A weighted finite-state automaton adds probabilities to the arcs
The probabilities leaving any state must sum to one
A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
Markov chains can't represent inherently ambiguous problems
Useful for assigning probabilities to unambiguous sequences
Markov Chain for Weather
Weather continued
Markov Chain for Words
Markov Chain: "First-order observable Markov Model"
A set of states Q = q1, q2, ..., qN; the state at time t is qt
Transition probabilities: a set of probabilities A = a01, a02, ..., an1, ..., ann
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
The current state depends only on the previous state (Markov assumption):
P(qi | q1 ... qi-1) = P(qi | qi-1)
Markov Chain for Weather

What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy, i.e., state sequence is 3-3-3-3
P(3,3,3,3) = π3 × a33 × a33 × a33 = 0.2 × (0.6)^3 = 0.0432
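A Markov chain sequence probability is just the initial probability times the product of transitions; a sketch using only the two numbers given above (state 3 = rainy):

```python
def chain_prob(states, pi, a):
    """pi[s]: initial probability of state s;
    a[(s, s')]: probability of transitioning from s to s'."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[(prev, cur)]
    return p

pi = {3: 0.2}          # P(start in rainy) from the slide
a = {(3, 3): 0.6}      # P(rainy | rainy) from the slide
print(chain_prob([3, 3, 3, 3], pi, a))  # ≈ 0.0432
```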
Review
Tagsets: what? example(s)
Baseline(s) for tagging evaluation
Two types of probabilities for POS tagging; their assumptions
Markov Chain vs Hidden Markov Model
HMM for Ice Cream
You are a climatologist in the year 2799 studying global warming
You can't find any records of the weather in Pittsburgh for the summer of 2018
But you find a diary which lists how many ice creams someone ate every day that summer
Our job: figure out how hot it was
Hidden Markov Model

For Markov chains, the output symbols are the same as the states
See hot weather: we're in state hot
But in part-of-speech tagging (and other things):
The output symbols are words
But the hidden states are part-of-speech tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
This means we don't know which state we are in
Hidden Markov Models

States Q = q1, q2, ..., qN
Observations O = o1, o2, ..., oT; each observation is a symbol from a vocabulary V = {v1, v2, ..., vV}
Transition probabilities: transition probability matrix A = {aij}, where aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N
Observation likelihoods: output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | qt = i)
Special initial probability vector π, where πi = P(q1 = i), 1 ≤ i ≤ N
Task
Given ice cream observation sequence: 1,2,3,2,2,2,3...
Produce weather sequence: H,C,H,H,H,C...
Weather/Ice Cream HMM
Hidden states: {Hot, Cold}
Transition probabilities (A matrix) between H and C
Observations: {1, 2, 3} = # of ice creams eaten per day
HMM for Ice Cream
Back to POS Tagging: Transition Probabilities
Observation Likelihoods
What can HMMs Do?
Likelihood: given an HMM λ and an observation sequence O, determine the likelihood P(O | λ) (language modeling)
Decoding: given an observation sequence O and an HMM λ, discover the best hidden state sequence Q. Given a sequence of ice creams, what was the most likely weather on those days? (tagging)
Learning: given an observation sequence O and the set of states in the HMM, learn the HMM parameters
Decoding

Ok, now we have a complete model that can give us what we need
Recall that we need to get the most probable tag sequence given the words
We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea.
In practice: the Viterbi algorithm (dynamic programming)
Viterbi Algorithm
Intuition: since state transitions out of a state depend only on the current state (and not on previous states), we can record for each state the optimal path to it
We record:
The cheapest cost to each state at each step
A backtrace from that state to its best predecessor
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states
Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs
Dynamic programming key: we need only store the MAX prob path to each cell (not all paths)
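The sweep-and-backtrace procedure above can be sketched compactly in log space, using a toy ice-cream HMM (the probability values below are illustrative, in the style of the Eisner example from the Jurafsky & Martin textbook, not taken from these slides):

```python
import math

def viterbi(obs, states, pi, a, b):
    """Viterbi decoding in log space.
    pi[s]: initial prob, a[s][s']: transition, b[s][o]: emission."""
    # One column per observation, one row per state.
    V = [{s: math.log(pi[s]) + math.log(b[s][obs[0]]) for s in states}]
    back = [{}]
    for o in obs[1:]:
        col, bp = {}, {}
        for s in states:
            # Keep only the max-probability path into each cell.
            best = max(states, key=lambda p: V[-1][p] + math.log(a[p][s]))
            col[s] = V[-1][best] + math.log(a[best][s]) + math.log(b[s][o])
            bp[s] = best
        V.append(col)
        back.append(bp)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path))

pi = {"H": 0.8, "C": 0.2}
a = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], ["H", "C"], pi, a, b))  # ['H', 'C', 'H']
```

Each column stores only the best log-probability path into each state plus a backpointer, so the work is O(number of observations × number of states squared) instead of exponential in the sentence length.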
Viterbi Example
Another Viterbi Example
Analyzing “Fish sleep” Done in class
Evaluation
So once you have your POS tagger running, how do you evaluate it?
Overall error rate with respect to a gold-standard test set
Error rates on particular tags
Error rates on particular words
Tag confusions...
Need a baseline: just the most frequent tag is 90% accurate!
Error Analysis

Look at a confusion matrix
See what errors are causing problems:
Noun (NN) vs. proper noun (NNP) vs. adjective (JJ)
Preterite (VBD) vs. participle (VBN) vs. adjective (JJ)
Evaluation
The result is compared with a manually coded "gold standard"
Typically accuracy reaches 96-97%
This may be compared with the result for a baseline tagger (one that uses no context)
Important: 100% is impossible even for human annotators
More Complex Issues
Tag indeterminacy: when 'truth' isn't clear (Caribbean cooking, child seat)
Tagging multipart words: wouldn't --> would/MD n't/RB
How to handle unknown words:
Assume all tags equally likely
Assume same tag distribution as all other singletons in corpus
Use morphology, word length, ...
Other Tagging Tasks
Noun Phrase (NP) Chunking
[the student] said [the exam] is hard
Three tags:
B = beginning of NP
I = continuing in NP
O = other word
Tagging result: The/B student/I said/O the/B exam/I is/O hard/O
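Recovering the NP spans from those B/I/O tags can be sketched as follows (the function name is hypothetical):

```python
def extract_np_chunks(tagged):
    """Recover NP spans from a B/I/O-tagged word sequence."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag == "B":                 # a new NP starts
            if current:
                chunks.append(current)
            current = [word]
        elif tag == "I" and current:   # continue the open NP
            current.append(word)
        else:                          # O: close any open NP
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tagged = [("The", "B"), ("student", "I"), ("said", "O"),
          ("the", "B"), ("exam", "I"), ("is", "O"), ("hard", "O")]
print(extract_np_chunks(tagged))  # ['The student', 'the exam']
```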
Summary
Parts of speech
Tagsets
Part-of-speech tagging
Rule-based, HMM tagging
Other methods later in course