Albert Gatt Corpora and Statistical Methods Lecture 5


Page 1: Corpora and Statistical Methods, Lecture 5 (staff.um.edu.mt/albert.gatt/teaching/dl/statLecture5a.pdf, 2019. 4. 29.)

Albert Gatt

Corpora and Statistical Methods

Lecture 5

Page 2:

In this lecture

We begin to consider the problem of lexical acquisition

beyond collocations

syntax-semantics interface:

verb subcategorisation frames

prepositional phrase attachment ambiguity

verb subcat preferences

semantic similarity ("thesaurus relations")

We also introduce some measures for evaluation

Page 3:

The problem of evaluation: How are the results of

automatic acquisition to be assessed?

Page 4:

Basic rationale

For a given classification problem, we have:

a "gold standard" against which to compare

our system’s results, compared to the target gold standard:

false positives (fp)

false negatives (fn)

true positives (tp)

true negatives (tn)

Performance typically measured in terms of precision and

recall.

Page 5:

Precision

Definition:

proportion of the system's positive classifications that are correct

i.e. the proportion of true positives out of all the items the system classifies as positive:

precision (P) = tp / (tp + fp)

Page 6:

Recall

Definition:

proportion of the actual target ("gold standard") items that our system classifies correctly:

recall (R) = tp / (tp + fn)

where tp + fn is the total number of items that should be classified as positive, including those the system doesn't get

Page 7:

Combining precision and recall

Typically we use the F-measure as a global estimate of performance against the gold standard.

We need some factor α to weight precision and recall; α = 0.5 gives them equal weighting:

F = 1 / ( α(1/P) + (1-α)(1/R) )

With α = 0.5 this reduces to the harmonic mean, F1 = 2PR / (P + R).

Page 8:

Fallout

We can also measure fallout: the proportion of actual negatives that are mistakenly classified as positive:

fallout = fp / (fp + tn)

where fp + tn is the total number of items that are negative in the gold standard (whether or not the system classifies them correctly)

Page 9:

Why precision and recall? We could also use simpler measures:

accuracy: % of things we got right

error: % of things we got wrong

Problems:

tn is usually very large, whereas tp, fn and fp are much smaller. Precision and recall are computed directly from these small figures, so they are sensitive to them; accuracy is swamped by tn.

Accuracy is only sensitive to the total number of errors; the F-measure distinguishes false positives from false negatives.
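The four measures can be collected into a short Python sketch (the counts tp, fp, fn, tn are assumed to come from comparing system output against the gold standard; the numbers in the example are invented):

```python
def precision(tp, fp):
    """Proportion of the system's positive classifications that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of gold-standard positives that the system finds."""
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic combination; alpha = 0.5 weights P and R equally."""
    return 1 / (alpha / p + (1 - alpha) / r)

def fallout(fp, tn):
    """Proportion of gold-standard negatives wrongly classified as positive."""
    return fp / (fp + tn)

# Invented example: tp=8, fp=2, fn=4, tn=86
p = precision(8, 2)    # 0.8
r = recall(8, 4)       # 8/12 ≈ 0.667
f1 = f_measure(p, r)   # harmonic mean 2PR/(P+R) ≈ 0.727
```

Note that f_measure with the default alpha is exactly the F1 of the slide: 1/(0.5/P + 0.5/R) = 2PR/(P+R).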

Page 10:

Evaluation with humans in the loop

Precision and recall rely on a "gold standard", i.e. a pre-annotated corpus.

Another form of evaluation is against human subjects: Correlational: correlation of output against human judgements;

Task-based: use of the output by humans in a task.

e.g. how easily can humans read generated text?

depends on whether there is a well-defined task

Page 11:

Lexical acquisition: overview

Page 12:

Lexical acquisition

Involves discovering properties of words or classes of words.

Examples:

verbs like eat take an object NP denoting some kind of food

nouns like house, theatre and shack denote kinds of edifices, are

intuitively "related", so should behave similarly in syntax

modifiers like with the icing are likely candidates for attachment to cake

but not to eat

Page 13:

What is a Lexicon?

Early generative grammar:

lexicon = words + exceptional behaviour

The idea was:

we have general principles governing syntax, morphology etc

the lexicon is rather "boring"; it’s only a repository of what isn’t

covered by the general principles

Page 14:

What is a lexicon?

Contemporary theories:

grammar knowledge is knowledge of the lexicon (HPSG, Tree

Adjoining Grammar, Categorial Grammar)

lexicon as interface between all the components of the language

faculty (Jackendoff 2002)

Semantic Bootstrapping: Pinker 1989 suggests that lexical

acquisition is a prerequisite to syntax acquisition

Page 15:

Applications (sample)

PP attachment ambiguities:

the children ate the cake with a spoon

the children ate the cake with the icing

seems to depend on different lexical preferences: cake + icing vs. eat + spoon

Verb subcategorisation preferences:

I (gave/sent) the book to Melanie

I (gave/sent) Melanie the book

Lexicography: semantic classes, e.g. HUMAN/ROLE like {professor, lecturer, reader}, which should exhibit the same syntactic behaviour.

Page 16:

Application 1: Verb Subcategorisation

Page 17:

Problem definition

Verbs have subcategorisation frames:

verbs with similar semantic arguments (AGENT, PATIENT etc) can

be grouped together

different semantic arguments can be expressed differently in syntax

e.g. send, give etc allow the dative alternation:

send X to Y / send Y X

give X to Y / give Y X

should be distinguished from donate etc, which don’t (cf. I donated

money to the charity vs. *I donated the charity money)

Page 18:

Uses for parsing

Example:

she told the lady where she had grown up

she found the place where she had grown up

Is the where-clause a clausal argument, or an adverbial

adjunct?

depends on the verb: tell has a [V NP S] subcat frame, find

doesn’t.

Page 19:

Existing resources: Verbnet

Verbnet: online verb lexicon for English

groups verbs into semantic classes

gives subcat information and thematic roles

http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

Verbnet is based on Levin’s (1993) classification of English

verbs.

Page 20:

Verbnet example: class admit-65

Members: admit, allow, include, permit, welcome

<FRAME>…<SYNTAX>

<NP value="Agent"/><VERB/><NP value="Theme"/>

</SYNTAX>…</FRAME>

e.g. she admitted us

<FRAME>…<SYNTAX>

<NP value="Agent"/><VERB/><NP value="Theme"/><NP value="Location"/>

</SYNTAX>…</FRAME>

e.g. she allowed us here

Page 21:

Verbnet and other resources

Other resources: Framenet

http://framenet.icsi.berkeley.edu/

verbs annotated with detailed semantic and syntactic info

lexical database + annotated corpus examples

Though very large, such resources are not exhaustive.

Automatic acquisition would help to expand them.

Page 22:

Brent’s (1993) algorithm

Aim: discover the subcat frames of verbs from a corpus.

Ingredients: Cues: a set of patterns of words & syntactic categories which indicate

the presence of a frame: essentially a regular expression

Hypothesis testing: test the null hypothesis (H0) that a given frame is not appropriate for a verb. Reject H0 if the cue co-occurs with the verb more often than its error rate can plausibly explain.

Page 23:

Example cue: the [NP NP] frame (e.g. the woman entered the room)

(OBJ|SUBJ_OBJ|CAP) (PUNC|CC) indicates the [NP NP] frame

OBJ = object personal pronoun (him etc)

SUBJ_OBJ = subject or object pers. pro (you)

CAP = word in uppercase

PUNC = punctuation mark

CC = subordinating conjunction (if, because etc)

Example match: greet Steve-CAP ,-PUNC
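The cue can be sketched as a simple check over tagged tokens. This is a minimal illustration, assuming the tokens after the verb have already been tagged with the slide's categories (OBJ, SUBJ_OBJ, CAP, PUNC, CC); the data layout is invented:

```python
# Tag sets from the slide's cue definition.
FIRST = {"OBJ", "SUBJ_OBJ", "CAP"}   # object pronoun / you / capitalised word
SECOND = {"PUNC", "CC"}              # punctuation mark / conjunction

def matches_np_np_cue(tagged_tokens):
    """True if the two tokens following the verb match the pattern
    (OBJ|SUBJ_OBJ|CAP) (PUNC|CC), taken as evidence for the [NP NP] frame."""
    if len(tagged_tokens) < 2:
        return False
    (_, t1), (_, t2) = tagged_tokens[0], tagged_tokens[1]
    return t1 in FIRST and t2 in SECOND

# The slide's example match: "greet Steve-CAP ,-PUNC"
print(matches_np_np_cue([("Steve", "CAP"), (",", "PUNC")]))  # True
```

Brent's real cues are regular expressions over raw text; this sketch abstracts that away and only shows the category logic.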

Page 24:

Rationale behind cues

If the cue applies to a verb very frequently, we conclude that

the corresponding frame applies to it.

Very unlikely for a phrase to match the cue [NP NP] in the

absence of a transitive verb.

Page 25:

Hypothesis Testing

Let c be a cue for frame F

Let v be a verb occurring n times in the corpus

Suppose v occurs m ≤ n times with cue c

Note: the cue may be wrong, i.e. a false positive!

Page 26:

Hypothesis testing – Step 1

Assume a binomial distribution, based on the indicator random variable v(f):

v(f) = 1 if the combination of v+c is a true indicator of the presence of frame f

v(f) = 0 if the v+c combination is there, but we don’t really have frame f

ε = the probability of error (false positive), i.e. the probability that v(f) = 0 given v+c

Page 27:

Hypothesis testing – step 2

Calculate the probability of error:

likelihood that v does not permit frame f given that v occurs with cue c m times or more

basically an "n choose k" problem: what are the chances that v doesn’t permit f given m occurrences of v+c?

need an estimate of the error rate of cue c, i.e. the probability that cue c is a false indicator of frame F

Page 28:

Probability of error in rejecting H0

pE = P( v(f) = 0 | C(v,c) ≥ m ) = Σ r=m..n C(n,r) ε^r (1-ε)^(n-r)

where:

n = frequency of v in the corpus (m of these occurrences are with cue c)

ε = error rate of the cue (false positives): the chances of finding c when f is not the case

pE = prob. that v does not permit frame f despite the observed cues

Page 29:

Explanation

If ε is the probability that cue c falsely indicates frame f, then, given that v+c occurs m times or more out of n occurrences of v, we risk an incorrect rejection of H0 with probability pE.

Page 30:

Accepting or rejecting H0

Brent (1993) proposed a threshold value. If the probability of

error is less than the threshold, then we reject H0

e.g. set threshold at 0.02

System has good precision, but low recall

many low-frequency verbs not assigned frames due to lack of evidence.
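Brent's test can be implemented directly as a binomial tail sum; math.comb is from the Python standard library, and the parameter values in the example are illustrative rather than taken from the slides:

```python
from math import comb

def p_error(n, m, eps):
    """Probability of seeing the cue m or more times in n occurrences of the
    verb purely through cue errors (error rate eps), i.e. the risk of
    wrongly rejecting H0."""
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

def takes_frame(n, m, eps, threshold=0.02):
    """Reject H0 (conclude the verb takes the frame) when the error
    probability falls below Brent's threshold."""
    return p_error(n, m, eps) < threshold

# Illustrative: a fairly reliable cue (eps = 0.05) seen 8 times in 30 occurrences
print(takes_frame(30, 8, 0.05))  # True: 8 hits are very unlikely to all be errors
```

With m = 0 the sum runs over the whole distribution and p_error is 1, as expected: with no evidence required, H0 can never be rejected.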

Page 31:

Improvements

Manning (1993): applies POS tagging before running Brent’s

cue detection.

NB: this combines two error-prone systems (cues + tagger)!

Example: cue c has ε = 0.25 and occurs with v 11 times out of 80; then pE = 0.011 < 0.02, so H0 is still rejected.

I.e. given appropriate hypothesis testing, an unreliable cue can be useful if it occurs enough times.

Page 32:

Application 2: PP Attachment ambiguity

Page 33:

PP Attachment

Pervasive problem for NL Parsing:

PP follows an object NP

Problem is whether PP attaches to VP or NP

Heuristics for improvement:

lexical co-occurrence likelihoods (cake + (with) icing vs. eat + (with)

spoon)

local operations: preference for attaching PP as low as possible in the

tree (i.e. to the NP)

Page 34:

Approach 1

Moscow sent 5000 soldiers into Afghanistan

Compute co-occurrence counts between:

verb & preposition (send + into)

noun & preposition (soldier + into)

Compare the two hypotheses using the log-likelihood ratio:

log( P(p|v) / P(p|n) )
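As a sketch, with maximum-likelihood estimates from co-occurrence counts standing in for the probabilities (the counts below are invented for illustration):

```python
from math import log2

def attachment_llr(count_v_p, count_v, count_n_p, count_n):
    """log2( P(p|v) / P(p|n) ), estimated from co-occurrence counts.
    Positive values favour verb attachment, negative values noun attachment."""
    p_given_v = count_v_p / count_v   # MLE of P(p|v)
    p_given_n = count_n_p / count_n   # MLE of P(p|n)
    return log2(p_given_v / p_given_n)

# Invented counts: 'into' follows 'send' 300/1000 times, 'soldier' 20/1000 times
llr = attachment_llr(300, 1000, 20, 1000)
print(llr)  # log2(15) ≈ 3.9: prefer attaching 'into Afghanistan' to the verb
```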

Page 35:

Limitations of Approach 1

Lexical co-occurrence stats ignore syntactic preferences.

The preference seems to be to attach new material to the "last seen" syntactic node (Lynn Frazier’s minimal attachment principle).

This predicts preference for PP attachment to object NP, unless there is strong evidence for the contrary.

Page 36:

Why minimal attachment is important

Chrysler confirmed that it would end its venture with Maserati.

PP of interest: with Maserati

occurs frequently with end (e.g. the play ended with a song)

occurs frequently with venture too

So simple frequencies of lexical co-occurrence will not be able to decide (or risk the wrong decision)

Page 37:

Approach 2: Hindle & Rooth (1993)

Event space of interest:

potentially ambiguous sentences with PPs

Given a PP headed by p, a VP headed by v and an NP headed

by n, define two indicator random variables:

VA = 1 iff PP attaches to VP

NA = 1 iff PP attaches to NP

possible in principle for both to be 1:

he put the book [on WW2] [on the table]

VA = 1, NA = 1

Page 38:

Hindle & Rooth - II

Given the sequence [v… n… PP], we calculate the

probability that VA = 1 and NA = 1, given the verb and

noun:

P(VA & NA|v & n) = P(VA|v)P(NA|n)

NB: We assume that attachment to NP or to VP are independent

Page 39:

Hindle and Rooth - III

To determine whether on WW2 attaches to NP (the book) or

VP (put):

P(attach(p)=n|v,n)

= P(VA=0 OR NA=1 | v) * P(NA=1 | n)

= 1 * P(NA = 1 | n)

= P(NA=1 | n)

same for P(attach(p)=v|v,n)

Page 40:

Some explanation

Why do we only need to consider NA for P(NA=1|n)?

any one PP can only attach to VP or NP, not both

(VA = 1 and NA = 1 is only true if a sentence has multiple PPs)

If VA = 1 and NA = 1 for any sentence, then:

first PP must attach to the NP

second PP must attach to VP

otherwise, we’d have crossing branches

However, to determine, for a specific PP within the sentence, whether VA=1, we need to exclude the possibility that NA=1.

this accounts for cases where there are two PPs, both attaching to the NP

Page 41:

Final step

Once we’ve computed, for a given PP, the probability that VA = 1 and the probability that NA = 1, we use a log-likelihood ratio to compare them. If the value is negative, we choose NP attachment; if positive, we choose VP attachment:

λ(v,n,p) = log2( P(attach(p)=v | v,n) / P(attach(p)=n | v,n) )

= log2( P(VA=1 | v) P(NA=0 | n) / P(NA=1 | n) )
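The λ score and decision rule fit in a few lines. This sketch assumes the standard Hindle & Rooth form λ = log2( P(VA=1|v)·P(NA=0|n) / P(NA=1|n) ); the probability estimates in the example are invented:

```python
from math import log2

def hindle_rooth_lambda(p_va1_v, p_na1_n):
    """lambda(v,n,p) = log2( P(VA=1|v) * P(NA=0|n) / P(NA=1|n) ).
    Positive -> VP attachment, negative -> NP attachment."""
    return log2(p_va1_v * (1 - p_na1_n) / p_na1_n)

def attach(p_va1_v, p_na1_n):
    return "VP" if hindle_rooth_lambda(p_va1_v, p_na1_n) > 0 else "NP"

# Invented estimates: 'end ... with' fairly likely (0.6),
# 'venture ... with' very likely (0.8)
print(attach(0.6, 0.8))  # NP: log2(0.6 * 0.2 / 0.8) = log2(0.15) < 0
```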

Page 42:

Estimating the initial probabilities

The Hindle & Rooth model needs prior estimates for: P(VA=1| v) P(NA=1 | n)

This is plain old conditional probability, but where do the frequencies come from? We need to disambiguate all ambiguous PPs to count them. But that’s exactly the initial problem! OK if we have a treebank, but often we don’t.

Page 43:

Hindle and Rooth’s solution

Build an initial model by looking only at unambiguous cases.

The road to London is..

She sent him into the nursery…

Apply the initial model to ambiguous cases where the λ value

exceeds a threshold (e.g. λ > 0.2 assigns VP attachment, λ < -0.2 assigns NP attachment)

For each remaining ambiguous case, divide it between the

two counts for NA and VA:

i.e. add 0.5 to each count
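The counting procedure can be sketched as a loop. The data layout here is an assumption for illustration: unambiguous cases come pre-labelled with their attachment site, and ambiguous cases carry the λ score the initial model assigns them:

```python
def bootstrap_counts(unambiguous, ambiguous, threshold=0.2):
    """Sketch of Hindle & Rooth's estimation strategy.
    unambiguous: list of 'VP' / 'NP' labels from unambiguous sentences.
    ambiguous:   list of lambda scores from the initial model.
    Returns the accumulated (VA, NA) counts."""
    va = na = 0.0
    for site in unambiguous:        # step 1: count unambiguous cases
        if site == "VP":
            va += 1
        else:
            na += 1
    for lam in ambiguous:           # steps 2-3: apply threshold, then split
        if lam > threshold:
            va += 1
        elif lam < -threshold:
            na += 1
        else:                       # undecided: half a count to each side
            va += 0.5
            na += 0.5
    return va, na

print(bootstrap_counts(["VP", "NP", "NP"], [0.5, -0.5, 0.0]))  # (2.5, 3.5)
```

The 0.5/0.5 split for undecided cases is exactly the "divide it between the two counts" step above.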

Page 44:

Other attachment ambiguities

Noun compounds:

left-branching: [[statistical parsing] practitioner]

= "someone who does statistical parsing"

right-branching: [statistical [parsing algorithm]]

= "a parsing algorithm which is statistical"

Could apply a Hindle & Rooth-style solution, but the data sparseness problem is severe for these complex N-compounds.

Page 45:

Indeterminacy

we signed an agreement with X

VP-attachment: we signed the agreement in the presence of/in the company of/

together with X

NP-attachment: we signed an agreement between us and X

Probably, both are true, and one must be true for the other to be true.

So is this a real ambiguity? Indeterminacy?