LIN3022 Natural Language Processing: Lecture 9


Albert Gatt
LIN3022 Natural Language Processing
Lecture 9

In this lecture
We continue with our discussion of parsing algorithms.
We introduce dynamic programming approaches.

We then look at probabilistic context-free grammars and statistical parsers.

Part 1: Dynamic programming approaches

Top-down vs bottom-up search
Top-down:
Never considers derivations that do not end up at root S.
Wastes a lot of time on trees that are inconsistent with the input.
Bottom-up:
Generates many subtrees that will never lead to an S.
Only considers trees that cover some part of the input.
NB: With both top-down and bottom-up approaches, we view parsing as a search problem.

Beyond top-down and bottom-up
One of the problems we identified with top-down and bottom-up search is that they are wasteful.

These algorithms proceed by searching through all possible alternatives at every stage of processing.

Wherever there is local ambiguity, these possible alternatives multiply.

There is lots of repeated work. Both S → NP VP and S → VP involve a VP. The VP rule is therefore applied twice!

Ideally, we want to break up the parsing problem into sub-problems and avoid doing all this extra work.

Extra effort in top-down parsing
Input: a flight from Indianapolis to Houston.

NP → Det Nominal rule (dead end)

NP → Det Nominal PP + Nominal → Noun PP (dead end)

NP → Det Nominal + Nominal → Nominal PP + Nominal → Nominal PP

Dynamic programming
In essence, dynamic programming involves solving a task by breaking it up into smaller sub-tasks.

In general, this is carried out by:
Breaking up a problem into sub-problems.
Creating a table which will contain solutions to each sub-problem.
Resolving each sub-problem and populating the table.
Reading off the complete solution from the table, by combining the solutions to the sub-problems.

Dynamic programming for parsing
Suppose we need to parse: Book that flight.

We can split the parsing problem into sub-problems as follows:
Store sub-trees for each constituent in the table.
This means we only parse each part of the input once.
In case of ambiguity, we can store multiple possible sub-trees for each piece of input.

Part 2: The CKY Algorithm and Chomsky Normal Form

CKY parsing
Classic, bottom-up dynamic programming algorithm (Cocke-Kasami-Younger).

Requires an input grammar based on Chomsky Normal Form (CNF).
A CNF grammar is a Context-Free Grammar in which:
Every rule LHS is a non-terminal.
Every rule RHS consists of either a single terminal or two non-terminals.
Examples:
A → B C
NP → Nominal PP
A → a
Noun → man
But not:
NP → the Nominal
S → VP

Chomsky Normal Form
Any CFG can be re-written in CNF, without any loss of expressiveness.

That is, for any CFG, there is a corresponding CNF grammar which accepts exactly the same set of strings as the original CFG.

Converting a CFG to CNF
To convert a CFG to CNF, we need to deal with three issues:

Rules that mix terminals and non-terminals on the RHS, e.g. NP → the Nominal

Rules with a single non-terminal on the RHS (called unit productions), e.g. NP → Nominal

Rules which have more than two items on the RHS, e.g. NP → Det Noun PP

Converting a CFG to CNF
Rules that mix terminals and non-terminals on the RHS, e.g. NP → the Nominal

Solution: introduce a dummy non-terminal to cover the original terminal, e.g. Det → the.
Re-write the original rule:
NP → Det Nominal
Det → the

Converting a CFG to CNF
Rules with a single non-terminal on the RHS (called unit productions), e.g. NP → Nominal

Solution:
Find all rules that have the form Nominal → ...
Nominal → Noun PP
Nominal → Det Noun
Re-write the original rule several times to eliminate the intermediate non-terminal:
NP → Noun PP
NP → Det Noun

Note that this makes our grammar flatter

Converting a CFG to CNF
Rules which have more than two items on the RHS, e.g. NP → Det Noun PP

Solution:
Introduce new non-terminals to spread the sequence on the RHS over more than one rule.
Nominal → Noun PP
NP → Det Nominal
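To make the three transformations concrete, here is a minimal Python sketch that applies them to a small grammar containing the problem rules discussed above. The rule format, the dummy non-terminal naming scheme, and the helper names (is_terminal, to_cnf) are my own illustrative choices, not part of the lecture; a full converter would also iterate unit-production elimination until no unit productions remain.

```python
def is_terminal(sym):
    # Convention for this sketch: terminals are all-lowercase strings.
    return sym.islower()

def to_cnf(rules):
    """Apply the three CNF transformations to a list of (LHS, [RHS symbols]) rules."""
    # 1. Rules mixing terminals and non-terminals: introduce a dummy non-terminal per terminal.
    step1 = []
    for lhs, rhs in rules:
        if len(rhs) > 1:
            new_rhs = []
            for sym in rhs:
                if is_terminal(sym):
                    dummy = sym.capitalize() + "'"       # e.g. the -> The'
                    step1.append((dummy, [sym]))         # The' -> the
                    new_rhs.append(dummy)
                else:
                    new_rhs.append(sym)
            step1.append((lhs, new_rhs))
        else:
            step1.append((lhs, rhs))

    # 2. Unit productions A -> B: copy B's expansions up to A (one pass only, for brevity).
    step2 = []
    for lhs, rhs in step1:
        if len(rhs) == 1 and not is_terminal(rhs[0]):
            step2.extend((lhs, rhs2) for lhs2, rhs2 in step1 if lhs2 == rhs[0])
        else:
            step2.append((lhs, rhs))

    # 3. RHSs longer than two: binarise by introducing new non-terminals.
    step3 = []
    for lhs, rhs in step2:
        rhs = list(rhs)
        while len(rhs) > 2:
            new_nt = "_".join(rhs[-2:])                  # e.g. Noun_PP
            step3.append((new_nt, rhs[-2:]))             # Noun_PP -> Noun PP
            rhs = rhs[:-2] + [new_nt]                    # NP -> Det Noun_PP
        step3.append((lhs, rhs))
    return step3

toy_grammar = [
    ("NP", ["the", "Nominal"]),       # mixes a terminal and a non-terminal
    ("NP", ["Nominal"]),              # unit production
    ("NP", ["Det", "Noun", "PP"]),    # more than two symbols on the RHS
    ("Nominal", ["Noun", "PP"]),
    ("Noun", ["man"]),
]

for lhs, rhs in to_cnf(toy_grammar):
    print(lhs, "->", " ".join(rhs))
```

Running this prints a flatter, binary grammar along the lines of the hand-worked examples above.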

The outcome
If we parse a sentence with a CNF grammar, we know that:
Every phrase-level non-terminal (above the part-of-speech level) will have exactly 2 daughters: NP → Det N

Every part-of-speech level non-terminal will have exactly 1 daughter, and that daughter is a terminal: N → lady

Part 3: Recognising strings with CKY

Recognising strings with CKY
Example input: The flight includes a meal.

The CKY algorithm proceeds by:
Splitting the input into words and indexing each position:
(0) the (1) flight (2) includes (3) a (4) meal (5)

Setting up a table. For a sentence of length n, we need (n+1) rows and (n+1) columns.

Traversing the input sentence left-to-right.

Using the table to store constituents and their span.

The table
[Table slides: the CKY chart is filled cell by cell for "the flight includes a meal". Det is entered in [0,1] for "the" (rule: Det → the); N in [1,2] for "flight" (rule: N → flight); NP in [0,2] for "the flight" (rule: NP → Det N).]

A CNF CFG for CYK
S → NP VP
NP → Det N
VP → V NP
V → includes
Det → the
Det → a
N → meal
N → flight

CYK algorithm: two components
Lexical step:
for j from 1 to length(string) do:
    let w be the word in position j
    find all rules ending in w of the form X → w
    put X in table[j-1, j]

Syntactic step:
for i = j-2 downto 0 do:
    for k = i+1 to j-1 do:
        for each rule of the form A → B C do:
            if B is in table[i,k] and C is in table[k,j] then add A to table[i,j]

CKY algorithm: two components
We actually interleave the lexical and syntactic steps:
for j from 1 to length(string) do:
    let w be the word in position j
    find all rules ending in w of the form X → w
    put X in table[j-1, j]

    for i = j-2 downto 0 do:
        for k = i+1 to j-1 do:
            for each rule of the form A → B C do:
                if B is in table[i,k] and C is in table[k,j] then add A to table[i,j]
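The pseudocode above translates fairly directly into Python. Below is a minimal sketch of the interleaved recogniser, assuming the small CNF grammar used in the worked example that follows; the dictionary and list representations of the grammar and the function name cky_recognise are my own, while the fencepost indices [i,j] match those used in the slides.

```python
# Lexical rules (X -> w) and syntactic rules (A -> B C) for the toy grammar.
lexical_rules = {
    "the": {"Det"}, "a": {"Det"},
    "flight": {"N"}, "meal": {"N"},
    "includes": {"V"},
}
syntactic_rules = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("VP", "V", "NP"),
]

def cky_recognise(words, start="S"):
    n = len(words)
    # table[i][j] holds the set of non-terminals spanning words i..j (fencepost positions).
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        # Lexical step: parts of speech for the word ending at position j.
        table[j - 1][j] |= lexical_rules.get(words[j - 1], set())
        # Syntactic step: combine smaller spans already in the table.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for A, B, C in syntactic_rules:
                    if B in table[i][k] and C in table[k][j]:
                        table[i][j].add(A)
    return start in table[0][n]

print(cky_recognise("the flight includes a meal".split()))   # True
print(cky_recognise("flight the includes meal a".split()))   # False
```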

CKY: lexical step (j = 1)
The flight includes a meal.
Lexical lookup matches Det → the. [Table: Det in [0,1].]

CKY: lexical step (j = 2)
The flight includes a meal.
Lexical lookup matches N → flight. [Table: N in [1,2].]

CKY: syntactic step (j = 2)
The flight includes a meal.
Syntactic lookup: look backwards and see if there is any rule that will cover what we've done so far. NP → Det N applies. [Table: NP in [0,2].]

CKY: lexical step (j = 3)
The flight includes a meal.
Lexical lookup matches V → includes. [Table: V in [2,3].]

CKY: syntactic step (j = 3)
The flight includes a meal.
Syntactic lookup: there are no rules in our grammar that will cover Det, NP, V.

CKY: lexical step (j = 4)
The flight includes a meal.
Lexical lookup matches Det → a. [Table: Det in [3,4].]

CKY: lexical step (j = 5)
The flight includes a meal.
Lexical lookup matches N → meal. [Table: N in [4,5].]

CKY: syntactic step (j = 5)
The flight includes a meal.
Syntactic lookup: we find that we have NP → Det N. [Table: NP in [3,5].]
Syntactic lookup: we find that we have VP → V NP. [Table: VP in [2,5].]
Syntactic lookup: we find that we have S → NP VP. [Table: S in [0,5].]

From recognition to parsing
The procedure so far will recognise a string as a legal sentence in English.

But we'd like to get a parse tree back!

Solution:
We can work our way back through the table and collect all the partial solutions into one parse tree.
Cells will need to be augmented with backpointers, i.e. with a pointer to the cells that the current cell covers.

From recognition to parsing
[Table: the completed chart, traversed from S in [0,5] back down to the words.]
NB: This algorithm always fills the top triangle of the table!

What about ambiguity?
The algorithm does not assume that there is only one parse tree for a sentence.
(Our simple grammar did not admit of any ambiguity, but this isn't realistic, of course.)

There is nothing to stop it returning several parse trees.

If there are multiple local solutions, then more than one non-terminal will be stored in a cell of the table.

Part 4: Probabilistic Context-Free Grammars

CFG definition (reminder)
A CFG is a 4-tuple (N, Σ, P, S):

N = a set of non-terminal symbols (e.g. NP, VP)

Σ = a set of terminals (e.g. words). N and Σ are disjoint (no element of N is also an element of Σ).

P = a set of productions of the form A → β, where A is a non-terminal (a member of N) and β is any string of terminals and non-terminals.

S = a designated start symbol (usually, sentence)

CFG example
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a

Probabilistic CFGs
A CFG where each production has an associated probability.
A PCFG is a 5-tuple (N, Σ, P, S, D):
D is a function assigning each rule in P a probability.
Usually, probabilities are obtained from a corpus; the most widely used corpus is the Penn Treebank.

Example tree

Building a tree: rules
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
[Tree diagram for "Mr Vinken is chairman of Elsevier": S dominates NP (NNP Mr, NNP Vinken) and VP (VBZ is, NP chairman, PP of Elsevier).]

Characteristics of PCFGs
In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β,
e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or ...).

This can be interpreted as a conditional probability: the probability of the expansion, given the LHS non-terminal: P(A → β) = P(A → β | A).

Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1. In this case, we say the PCFG is consistent.
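As noted above, rule probabilities are usually estimated from a treebank. The standard estimate (not spelled out on the slides) is the relative frequency of a rule among all rules with the same left-hand side; the rule counts below are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical rule counts extracted from a small treebank.
rule_counts = {
    ("S", ("NP", "VP")): 900,
    ("S", ("VP",)): 100,
    ("NP", ("Det", "N")): 600,
    ("NP", ("Pro",)): 400,
}

lhs_totals = defaultdict(int)
for (lhs, rhs), count in rule_counts.items():
    lhs_totals[lhs] += count

# P(A -> beta) = count(A -> beta) / count(A)
rule_probs = {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}
for (lhs, rhs), p in rule_probs.items():
    print(f"P({lhs} -> {' '.join(rhs)}) = {p:.2f}")

# Consistency check: for every LHS, the rule probabilities sum to 1.
for lhs in lhs_totals:
    assert abs(sum(p for (l, _), p in rule_probs.items() if l == lhs) - 1.0) < 1e-9
```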

Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most likely?
E.g. PP-attachment ambiguity can be resolved this way.

Speed: we've defined parsing as a search problem, a search through the space of possible applicable derivations. The search space can be pruned by focusing on the most likely sub-parses of a parse.

A parser can also be used as a model to determine the probability of a sentence, given a parse. A typical use is in speech recognition, where an input utterance can be heard as several possible sentences.

Using PCFG probabilities
A PCFG assigns a probability to every parse tree t of a string W,
e.g. every possible parse (derivation) of a sentence recognised by the grammar.

Notation:
G = a PCFG
s = a sentence
t = a particular tree under our grammar
t consists of several nodes n; each node is generated by applying some rule r

Probability of a tree vs. a sentence
We work out the probability of a parse tree t by multiplying the probability of every rule (node) that gives rise to t (i.e. the derivation of t).
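As a concrete illustration of this product, here is a small sketch that multiplies out the rule probabilities along one derivation. The grammar is the example PCFG used in Part 5 of this lecture; the list-of-rules representation of the derivation is my own.

```python
# Rule probabilities from the example PCFG introduced in Part 5.
rule_probs = {
    ("S", ("NP", "VP")): 0.80,
    ("NP", ("Det", "N")): 0.30,
    ("VP", ("V", "NP")): 0.20,
    ("V", ("includes",)): 0.05,
    ("Det", ("the",)): 0.4,
    ("Det", ("a",)): 0.4,
    ("N", ("meal",)): 0.01,
    ("N", ("flight",)): 0.02,
}

# Canonical derivation of "The flight includes a meal", as a list of rule applications.
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("flight",)),
    ("VP", ("V", "NP")), ("V", ("includes",)),
    ("NP", ("Det", "N")), ("Det", ("a",)), ("N", ("meal",)),
]

p_tree = 1.0
for rule in derivation:
    p_tree *= rule_probs[rule]   # P(t) is the product of the probabilities of its rules
print(p_tree)                    # roughly 2.3e-08 (the chart in Part 5 rounds one intermediate NP value)
```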

Note that:
A tree can have multiple derivations (different sequences of rule applications could give rise to the same tree).
But the probability of the tree remains the same (it's the same probabilities being multiplied).
We usually speak as if a tree has only one derivation, called the canonical derivation.

Picking the best parse in a PCFG
A sentence will usually have several parses.
We usually want them ranked, or only want the n best parses.
We need to focus on P(t|s,G): the probability of a parse, given our sentence and our grammar.

Definition of the best parse for s: the tree for which P(t|s,G) is highest.

Probability of a sentence
Given a probabilistic context-free grammar G, we can compute the probability of a sentence (as opposed to a tree).

Observe that:
As far as our grammar is concerned, a sentence is only a sentence if it can be recognised by the grammar (it is legal).
There can be multiple parse trees for a sentence, i.e. many trees whose yield is the sentence.
The probability of the sentence is the sum of the probabilities of the various trees that yield the sentence.
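A tiny illustration of this definition, with invented tree probabilities for a hypothetical two-ways-ambiguous sentence:

```python
# P(t) for each (hypothetical) tree whose yield is the sentence s.
tree_probs = [1.2e-8, 0.7e-8]
p_sentence = sum(tree_probs)   # P(s) is the sum over all trees that yield s
print(p_sentence)              # 1.9e-08
```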

Flaws I: structural independence
The probability of a rule r expanding node n depends only on n.
It is independent of other non-terminals.

Example:
P(NP → Pro) is independent of where the NP is in the sentence,
but we know that NP → Pro is much more likely in subject position.
Francis et al. (1999), using the Switchboard corpus: 91% of subjects are pronouns; only 34% of objects are pronouns.

Flaws II: lexical independence
Vanilla PCFGs ignore lexical material,
e.g. P(VP → V NP PP) is independent of the head of NP or PP, or of the lexical head V.
Examples:
Prepositional phrase attachment preferences depend on lexical items; cf.:
dump [sacks into a bin] vs. dump [sacks] [into a bin] (preferred parse)
Coordination ambiguity:
[dogs in houses] and [cats] vs. [dogs] [in houses and cats]

Lexicalised PCFGs
Attempt to weaken the lexical independence assumption.

Most common technique:
Mark each phrasal head (N, V, etc.) with the lexical material.
This is based on the idea that the most crucial lexical dependencies are between head and dependent.
E.g. Charniak 1997, Collins 1999.

Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.

P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walks)

NB: normally, we can't assume that all heads of a phrase of category C are equally probable.

[Tree diagram: S(walks) dominates NP(Matt) and VP(walks); NP(Matt) → NNP(Matt) → Matt; VP(walks) → VBD(walks) → walks.]

Practical problems for lexicalised PCFGs
Data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data.

Flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement.
E.g. "I got the easier problem of the two to solve": "of the two" and "to solve" are very likely because of the pre-head modifier "easier".

Structural context
The simple way: calculate P(t|s,G) based on the rules in the canonical derivation d of t.
This assumes that P(t) is independent of the derivation.
We could condition on more structural context,
but then P(t) could really depend on the derivation!

Part 5: Parsing with a PCFG

Using CKY to parse with a PCFG
The basic CKY algorithm remains unchanged.

However, rather than only keeping partial solutions in our table cells (i.e. the rules that match some input), we also keep their probabilities.

Probabilistic CKY: example PCFG
S → NP VP [.80]
NP → Det N [.30]
VP → V NP [.20]
V → includes [.05]
Det → the [.4]
Det → a [.4]
N → meal [.01]
N → flight [.02]

Probabilistic CYK: worked example
The flight includes a meal.
[Table slides: the chart is filled as before, but every entry now carries a probability.]
Lexical steps: Det (.4) in [0,1], N (.02) in [1,2], V (.05) in [2,3], Det (.4) in [3,4], N (.01) in [4,5].
Syntactic steps: NP (.0024) in [0,2]. Note: the probability of NP in [0,2] is P(Det → the) * P(N → flight) * P(NP → Det N).
Then NP (.001) in [3,5], VP (.00001) in [2,5], and finally S (.0000000192) in [0,5].

Probabilistic CYK: summary
Cells in the chart hold probabilities.

Bottom-up procedure computes probability of a parse incrementally.

To obtain parse trees, we traverse the table backwards as before.
Cells need to be augmented with backpointers.
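To tie the summary together, here is a runnable sketch of probabilistic CKY over the example grammar above. Unlike the chart in the worked example, which records every analysis, this version keeps only the highest-probability entry per non-terminal per cell (Viterbi-style), together with a backpointer so the best tree can be read off the finished table. The function and variable names are my own.

```python
# Lexical rules (X -> w, probability) and syntactic rules (A -> B C, probability).
lexical_rules = {
    "the": [("Det", 0.4)], "a": [("Det", 0.4)],
    "flight": [("N", 0.02)], "meal": [("N", 0.01)],
    "includes": [("V", 0.05)],
}
syntactic_rules = [
    ("S", "NP", "VP", 0.80),
    ("NP", "Det", "N", 0.30),
    ("VP", "V", "NP", 0.20),
]

def pcky_parse(words):
    n = len(words)
    # table[i][j]: {non-terminal: (best probability, backpointer)}
    table = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(1, n + 1):
        # Lexical step: parts of speech (with probabilities) for the word ending at j.
        for X, p in lexical_rules.get(words[j - 1], []):
            table[j - 1][j][X] = (p, words[j - 1])
        # Syntactic step: combine smaller spans, keeping the best probability per category.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for A, B, C, p_rule in syntactic_rules:
                    if B in table[i][k] and C in table[k][j]:
                        p = p_rule * table[i][k][B][0] * table[k][j][C][0]
                        if p > table[i][j].get(A, (0.0, None))[0]:
                            table[i][j][A] = (p, (B, i, k, C, k, j))
    return table

def build_tree(table, A, i, j):
    """Follow backpointers to recover the most probable tree as nested tuples."""
    p, back = table[i][j][A]
    if isinstance(back, str):                       # lexical entry: the word itself
        return (A, back)
    B, bi, bk, C, ck, cj = back
    return (A, build_tree(table, B, bi, bk), build_tree(table, C, ck, cj))

table = pcky_parse("the flight includes a meal".split())
print(table[0][5]["S"][0])                          # probability of the best S parse
print(build_tree(table, "S", 0, 5))                 # the corresponding parse tree
```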