TRANSCRIPT
Albert Gatt
Corpora and Statistical Methods
Lecture 5
In this lecture
We begin to consider the problem of lexical acquisition
beyond collocations
syntax-semantics interface:
verb subcategorisation frames
prepositional phrase attachment ambiguity
verb subcat preferences
semantic similarity ("thesaurus relations")
We also introduce some measures for evaluation
The problem of evaluation: How are the results of
automatic acquisition to be assessed?
Basic rationale
For a given classification problem, we have:
a "gold standard" against which to compare our system's results
Compared to the target gold standard, each of the system's decisions is one of:
false positives (fp)
false negatives (fn)
true positives (tp)
true negatives (tn)
Performance typically measured in terms of precision and
recall.
Precision
Definition:
proportion of the items the system classifies as positive that are correct,
i.e. the proportion of true positives out of all the system's positive classifications:

precision (P) = tp / (tp + fp)
Recall
Definition:
proportion of the actual target ("gold standard") items that our
system classifies correctly:

recall (R) = tp / (tp + fn)

(The denominator is the total number of items that should be classified as positive, including those the system doesn't get.)
Combining precision and recall
Typically use the F-measure as a global estimate of
performance against gold standard
We need some factor α to weight precision and recall; α = 0.5 gives them equal weighting:

F = 1 / ( α(1/P) + (1 − α)(1/R) )

With α = 0.5, this reduces to the harmonic mean: F1 = 2PR / (P + R)
Fallout
We can also measure fallout: the proportion of actual negatives that the system mistakenly classifies as positive:

fallout = fp / (fp + tn)

(The denominator is the total number of gold-standard negatives, i.e. false positives plus true negatives.)
Why precision and recall? We could also use simpler measures:
accuracy: % of things we got right
error: % of things we got wrong
Problems:
tn is usually very large, whereas tp, fn, fp are smaller. Precision and recall are more sensitive to these small figures.
Accuracy is only sensitive to the number of errors. F-measure distinguishes true positives from false positives.
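A minimal sketch of these measures in Python (the α-weighted F follows the definition above; the counts in the example are invented for illustration):

```python
def precision(tp, fp):
    # Proportion of the system's positive classifications that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of gold-standard positives that the system finds
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    # F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives the harmonic mean (F1)
    return 1.0 / (alpha / p + (1 - alpha) / r)

def fallout(fp, tn):
    # Proportion of gold-standard negatives wrongly classified as positive
    return fp / (fp + tn)

# Invented example: tp = 60, fp = 20, fn = 40, tn = 880
p = precision(60, 20)   # 0.75
r = recall(60, 40)      # 0.6
f1 = f_measure(p, r)    # harmonic mean of p and r
```

Note how tn (880) dwarfs the other counts: accuracy would be dominated by it, while precision, recall and F never touch tn at all.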
Evaluation with humans in the loop
Precision and recall rely on a "gold standard", i.e. a pre-annotated corpus.
Another form of evaluation is against human subjects:
Correlational: correlate the output with human judgements
Task-based: humans use the output in a task
e.g. how easily can humans read generated text?
this depends on whether there is a well-defined task
Lexical acquisition: overview
Lexical acquisition
Involves discovering properties of words or classes of words.
Examples:
verbs like eat take an object NP denoting some kind of food
nouns like house, theatre and shack denote kinds of edifices and are
intuitively "related", so should behave similarly in syntax
modifiers like with the icing are likely candidates for attachment to cake
but not to eat
What is a Lexicon? Early generative grammar:
lexicon = words + exceptional behaviour
The idea was:
we have general principles governing syntax, morphology etc
the lexicon is rather "boring": it is only a repository of what isn't
covered by the general principles
What is a lexicon? Contemporary theories:
grammar knowledge is knowledge of the lexicon (HPSG, Tree
Adjoining Grammar, Categorial Grammar)
lexicon as interface between all the components of the language
faculty (Jackendoff 2002)
Semantic Bootstrapping: Pinker 1989 suggests that lexical
acquisition is a prerequisite to syntax acquisition
Applications (sample)
PP attachment ambiguities:
the children ate the cake with a spoon
the children ate the cake with the icing
The choice seems to depend on different lexical preferences: cake–icing vs. eat–spoon
Verb subcategorisation preferences:
I (gave/sent) the book to Melanie
I (gave/sent) Melanie the book
Lexicography:
semantic classes, e.g. HUMAN/ROLE like {professor, lecturer, reader}, which should exhibit the same syntactic behaviour
Application 1: Verb Subcategorisation
Problem definition
Verbs have subcategorisation frames:
verbs with similar semantic arguments (AGENT, PATIENT etc) can
be grouped together
different semantic arguments can be expressed differently in syntax
e.g. send, give etc allow the dative alternation:
send X to Y / send Y X
give X to Y / give Y X
should be distinguished from donate etc, which don’t (cf. I donated
money to the charity vs. *I donated the charity money)
Uses for parsing
Example:
she told the lady where she had grown up
she found the place where she had grown up
Is the where-clause a clausal argument, or an adverbial
adjunct?
depends on the verb: tell has a [V NP S] subcat frame, find
doesn’t.
Existing resources: Verbnet
Verbnet: online verb lexicon for English
groups verbs into semantic classes
gives subcat information and thematic roles
http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
Verbnet is based on Levin’s (1993) classification of English
verbs.
Verbnet example: class admit-65
Members: admit, allow, include, permit, welcome
<FRAME>…<SYNTAX>
<NP value="Agent"/><VERB/><NP value="Theme"/>
</SYNTAX>…</FRAME>
e.g. she admitted us
<FRAME>…<SYNTAX>
<NP value="Agent"/><VERB/><NP value="Theme"/><NP value="Location"/>
</SYNTAX>…</FRAME>
e.g. she allowed us here
Verbnet and other resources
Other resources: Framenet
http://framenet.icsi.berkeley.edu/
verbs annotated with detailed semantic and syntactic info
lexical database + annotated corpus examples
Though very large, such resources are not exhaustive.
Automatic acquisition would help to expand them.
Brent's (1993) algorithm
Aim: discover the subcat frames of verbs from a corpus.
Ingredients:
Cues: a set of patterns of words and syntactic categories which indicate the presence of a frame; essentially a regular expression
Hypothesis testing: test the null hypothesis (H0) that a given frame is not appropriate for a verb; reject H0 if the cue co-occurs with the verb with sufficiently high likelihood
Example cue for the [NP NP] frame (e.g. the woman entered the room):
(OBJ | SUBJ_OBJ | CAP) (PUNC | CC) indicates the [NP NP] frame
OBJ = object personal pronoun (him etc)
SUBJ_OBJ = subject or object pers. pro (you)
CAP = word in uppercase
PUNC = punctuation mark
CC = subordinating conjunction (if, because etc)
Example match: greet Steve-CAP ,-PUNC
Rationale behind cues
If the cue applies to a verb very frequently, we conclude that the
corresponding frame applies to it.
It is very unlikely for a phrase to match the cue for [NP NP] in the
absence of a transitive verb.
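As an illustration, a cue of this kind can be implemented as a regular expression over tagged tokens. The word-TAG string representation below is an assumption made for this sketch, not Brent's actual data format:

```python
import re

# Cue for the [NP NP] frame: immediately after the verb, a token tagged
# OBJ, SUBJ_OBJ or CAP, followed by a token tagged PUNC or CC.
CUE_NP_NP = re.compile(r"\S+-(OBJ|SUBJ_OBJ|CAP)\s+\S+-(PUNC|CC)")

def cue_matches(tagged_tokens_after_verb):
    """Return True if the material right after the verb matches the cue."""
    # re.match anchors at the start, so the cue must begin immediately
    return CUE_NP_NP.match(" ".join(tagged_tokens_after_verb)) is not None

print(cue_matches(["Steve-CAP", ",-PUNC"]))   # True  (cf. "greet Steve ,")
print(cue_matches(["the-DET", "room-N"]))     # False
```

The cue deliberately trades recall for precision: it misses many genuine [NP NP] frames, but rarely fires when no transitive verb is present.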
Hypothesis testing
Let c be a cue for frame f
Let v be a verb occurring n times in the corpus
Suppose v occurs m ≤ n times with cue c
Note: the cue may be wrong, i.e. a false positive!
Hypothesis testing – step 1
Assume a binomial distribution, based on the indicator random variable v(f):
v(f) = 1 if the combination v+c is a true indicator of the presence of frame f
v(f) = 0 if the v+c combination occurs, but we don't really have frame f
ε = the probability of error (false positive), i.e. the probability that v(f) = 0 given v+c
Hypothesis testing – step 2
Calculate the probability of error:
the likelihood that v does not permit frame f, given that v occurs with cue c m times or more
basically an "n choose k" problem: what are the chances that v doesn't permit f, given m occurrences of v+c?
we need an estimate of the error rate ε of cue c, i.e. the probability that c is a false indicator of frame f
Probability of error in rejecting H0

p_E = P( v(f) = 0 | C(v,c) ≥ m ) = Σ_{r=m}^{n} C(n,r) ε^r (1 − ε)^(n−r)

where:
C(v,c) = the frequency of v with cue c
ε = the error rate of the cue (false positives): the chance of finding c when f is not the case
v(f) = 0: v does not permit frame f
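The error probability is a binomial tail sum, which is straightforward to compute directly. A minimal sketch (the function name is ours):

```python
from math import comb

def p_error(n, m, eps):
    """P(v(f) = 0 | C(v,c) >= m): the chance of seeing the cue m or more
    times in n occurrences of v even though v does not permit frame f,
    when the cue's false-positive rate is eps."""
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r)
               for r in range(m, n + 1))

# Reject H0 ("v does not permit f") when p_error falls below a
# threshold such as 0.02.
```

The more often the cue co-occurs with the verb (larger m for fixed n and ε), the smaller p_E becomes, so even a noisy cue can license rejection of H0 given enough evidence.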
Explanation
If ε is the probability that cue c falsely indicates frame f, then, having
observed v+c m times or more out of n occurrences, we risk an incorrect
rejection of H0 with probability p_E.
Accepting or rejecting H0
Brent (1993) proposed a threshold value: if the probability of error is less than the threshold, then we reject H0
e.g. set the threshold at 0.02
The system has good precision, but low recall:
many low-frequency verbs are not assigned frames due to lack of evidence
Improvements
Manning (1993): applies POS tagging before running Brent's cue detection.
NB: this combines two error-prone systems (cues + tagger)!
Example: cue c has ε = 0.25 and occurs 11/80 times with v; then p_E = 0.011 < 0.02, so H0 is still rejected.
I.e. given appropriate hypothesis testing, an unreliable cue can be useful if it occurs often enough.
Application 2: PP Attachment ambiguity
PP Attachment
Pervasive problem for NL Parsing:
PP follows an object NP
Problem is whether PP attaches to VP or NP
Heuristics for improvement:
lexical co-occurrence likelihoods (cake + (with) icing vs. eat + (with)
spoon)
local operations: preference for attaching PP as low as possible in the
tree (i.e. to the NP)
Approach 1
Moscow sent 5000 soldiers into Afghanistan
Compute co-occurrence counts between:
verb & preposition (send + into)
noun & preposition (soldier + into)
Compare the two hypotheses using the log-likelihood ratio:

λ = log( P(p|v) / P(p|n) )
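A minimal sketch of Approach 1, with invented co-occurrence counts (the counts below are illustrative, not corpus figures):

```python
from math import log2

def attach_preference(count_v_p, count_v, count_n_p, count_n):
    """Log-likelihood ratio comparing VP vs. NP attachment for preposition p."""
    p_given_v = count_v_p / count_v     # P(p | v), e.g. P(into | send)
    p_given_n = count_n_p / count_n     # P(p | n), e.g. P(into | soldiers)
    return log2(p_given_v / p_given_n)  # > 0 favours VP attachment

# e.g. send+into in 150 of 1000 occurrences of send,
# soldiers+into in 20 of 800 occurrences of soldiers:
lam = attach_preference(150, 1000, 20, 800)  # log2(0.15 / 0.025) > 0 -> VP
```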
Limitations of Approach 1
Lexical co-occurrence stats ignore syntactic preferences.
The preference seems to be to attach new material to the "last seen" syntactic node (cf. Lynn Frazier's minimal attachment principle).
This predicts a preference for PP attachment to the object NP, unless there is strong evidence to the contrary.
Why minimal attachment is important
Chrysler confirmed that it would end its venture with Maserati.
PP of interest: with Maserati
occurs frequently with end (e.g. the play ended with a song)
occurs frequently with venture too
So simple frequencies of lexical co-occurrence will not be able to decide (or risk the wrong decision)
Approach 2: Hindle & Rooth (1993)
Event space of interest:
potentially ambiguous sentences with PPs
Given a PP headed by p, a VP headed by v and an NP headed
by n, define two indicator random variables:
VA = 1 iff PP attaches to VP
NA = 1 iff PP attaches to NP
possible in principle for both to be 1:
he put the book [on WW2] [on the table]
VA = 1, NA = 1
Hindle & Rooth - II
Given the sequence [v… n… PP], we calculate the
probability that VA = 1 and NA = 1, given the verb and
noun:
P(VA=1 & NA=1 | v, n) = P(VA=1 | v) · P(NA=1 | n)
NB: we assume that attachment to the NP and attachment to the VP are independent
Hindle and Rooth – III
To determine whether on WW2 attaches to the NP (the book) or the VP (put):
P(attach(p)=n|v,n)
= P(VA=0 OR NA=1 | v) * P(NA=1 | n)
= 1 * P(NA = 1 | n)
= P(NA=1 | n)
same for P(attach(p)=v|v,n)
Some explanation
Why do we only need to consider NA for P(NA=1|n)?
any one PP can only attach to VP or NP, not both
(VA = 1 and NA = 1 is only true if a sentence has multiple PPs)
If VA = 1 and NA = 1 for any sentence, then:
first PP must attach to the NP
second PP must attach to VP
otherwise, we’d have crossing branches
However, to determine, for a specific PP within the sentence, whether VA=1, we need to exclude the possibility that NA=1.
this accounts for cases where there are 2 PPs, both attaching to the NP
Final step
Once we've computed, for a given PP, the probability that VA = 1 and the probability that NA = 1, we use the log-likelihood ratio to compare them:

λ(v, n, p) = log2 [ P(attach(p)=v | v, n) / P(attach(p)=n | v, n) ]
           = log2 [ P(VA=1 | v) · P(NA=0 | n) / P(NA=1 | n) ]

If the value is negative, we choose NP attachment; if positive, we choose VP attachment.
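The decision rule can be sketched as follows; the probability values in the example are illustrative placeholders, not corpus estimates:

```python
from math import log2

def hr_lambda(p_va1_given_v, p_na1_given_n):
    """lambda(v, n, p) = log2[ P(VA=1|v) * P(NA=0|n) / P(NA=1|n) ]"""
    p_attach_v = p_va1_given_v * (1 - p_na1_given_n)  # P(attach(p)=v | v, n)
    p_attach_n = p_na1_given_n                        # P(attach(p)=n | v, n)
    return log2(p_attach_v / p_attach_n)

# e.g. suppose P(VA=1 | send) = 0.7 and P(NA=1 | soldiers) = 0.1:
lam = hr_lambda(0.7, 0.1)   # positive -> choose VP attachment
```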
Estimating the initial probabilities
The Hindle & Rooth model needs prior estimates for:
P(VA=1 | v)
P(NA=1 | n)
This is plain old conditional probability, but where do the frequencies come from? To count them, we would need to disambiguate all ambiguous PPs, but that is exactly the initial problem! It is OK if we have a treebank, but often we don't.
Hindle and Rooth’s solution
Build an initial model by looking only at unambiguous cases.
The road to London is..
She sent him into the nursery…
Apply the initial model to ambiguous cases where the λ value
exceeds a threshold (e.g. λ > 0.2 for VP attachment, λ < −0.2 for NP attachment)
For each remaining ambiguous case, divide it between the
two counts for NA and VA:
i.e. add 0.5 to each count
Other attachment ambiguities
Noun compounds:
left-branching: [[statistical parsing] practitioner]
= "someone who does statistical parsing"
right-branching: [statistical [parsing algorithm]]
= "a parsing algorithm which is statistical"
We could apply a Hindle & Rooth-style solution, but the data sparseness problem is severe for these complex N-compounds.
Indeterminacy
we signed an agreement with X
VP-attachment: we signed the agreement in the presence of/in the company of/
together with X
NP-attachment: we signed an agreement between us and X
Probably, both are true, and one must be true for the other to be true.
So is this a real ambiguity? Indeterminacy?