
1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 22, 2004

2

Today

Cascaded Chunking
Example of Using Chunking: Word Associations
Evaluating Chunking
Going to the next level: Parsing

3

Cascaded Chunking

Goal: create chunks that include other chunks
Examples:

– PP consists of preposition + NP
– VP consists of a verb followed by PPs or NPs

How to make it work in NLTK:
The tutorial is a bit confusing; I attempt to clarify it here

4

Creating Cascaded Chunkers

Start with a sentence token
– A list of words with parts of speech assigned
– Create a fresh one or use one from a corpus

5

Creating Cascaded Chunkers

Create a set of chunk parsers
One for each chunk type
Each one takes as input some kind of list of tokens, and produces as output a NEW list of tokens

– You can decide what this new list is called
Examples: NP-CHUNK, PP-CHUNK, VP-CHUNK

– You can also decide what to name each occurrence of the chunk type, as it is assigned to a subset of tokens
Examples: NP, VP, PP

How to match higher-level tags?
It just seems to match their string description
So be certain that chunk names do not overlap with POS tags
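The cascade described above can be sketched without the chunker API itself (the helper below is a hypothetical illustration, not the course code): each pass matches a regular expression over the sequence of POS/chunk tags and wraps the matched span in a new chunk token, so later passes can refer to earlier chunk names such as NP.

```python
import re

def chunk(tokens, label, pattern):
    """One chunking pass over a list of (item, tag) pairs; an item is a
    word or an already-built chunk.  The longest tag sequence matching
    `pattern` (a regex over space-separated tags) is replaced, left to
    right, by a single (sub_tokens, label) chunk token."""
    out, i, n = [], 0, len(tokens)
    while i < n:
        for length in range(n - i, 0, -1):
            span = " ".join(t for _, t in tokens[i:i + length])
            if re.fullmatch(pattern, span):
                out.append((tokens[i:i + length], label))
                i += length
                break
        else:                       # no chunk starts at position i
            out.append(tokens[i])
            i += 1
    return out

sent = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
        ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Cascade: NPs first, then PPs built from the new NP tokens, then a VP.
stage1 = chunk(sent,   "NP", r"DT( JJ)* NN")
stage2 = chunk(stage1, "PP", r"IN NP")
stage3 = chunk(stage2, "VP", r"VBD( NP| PP)*")
print([tag for _, tag in stage3])  # ['NP', 'VP']
```

Because each pass rewrites the token list, the PP rule can mention NP and the VP rule can mention PP, which is exactly the cascading idea on this slide.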

[Slides 6-8: screenshots, not transcribed]

9

Let’s do some text analysis

Let’s try this on more complex sentences
First, read in part of a corpus
Then, count how often each word occurs with each POS
Determine some common verbs, choose one
Make a list of sentences containing that verb
Test out the chunker on them; examine further
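A rough sketch of those steps on a toy tagged corpus (the data and names here are illustrative; the class exercise read in part of a real corpus such as the Treebank sample):

```python
from collections import Counter

# Toy stand-in for a tagged corpus.
corpus = [
    [("she", "PRP"), ("drinks", "VBZ"), ("coffee", "NN")],
    [("they", "PRP"), ("drink", "VBP"), ("tea", "NN")],
    [("he", "PRP"), ("drinks", "VBZ"), ("water", "NN")],
    [("he", "PRP"), ("reads", "VBZ"), ("books", "NNS")],
]

# Count how often each word occurs with each POS.
word_pos = Counter(pair for sent in corpus for pair in sent)

# Find some common verbs (any VB* tag) and choose the most frequent.
verb_counts = Counter()
for (w, t), c in word_pos.items():
    if t.startswith("VB"):
        verb_counts[w] += c
target = verb_counts.most_common(1)[0][0]

# Make the list of sentences containing that verb, ready for the chunker.
hits = [sent for sent in corpus if any(w == target for w, _ in sent)]
print(target, len(hits))  # drinks 2
```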

[Slides 10-12: screenshots, not transcribed]

13

Why didn’t this parse work?

14

Why didn’t this parse work?

15

Why didn’t this parse work?

16

Why didn’t this parse work?

17

Corpus Analysis for Discovery of Word Associations

Classic paper by Church & Hanks showed how to use a corpus and a shallow parser to find interesting dependencies between words

– Word Association Norms, Mutual Information, and Lexicography, Computational Linguistics, 16(1), 1990

– http://www.research.att.com/~kwc/publications.html

Some cognitive evidence:
Word association norms: which word people say most often after hearing another word

– Given doctor: nurse, sick, health, medicine, hospital…

People respond more quickly to a word if they’ve seen an associated word

– E.g., if you show “bread” they’re faster at recognizing “butter” than “nurse” (vs a nonsense string)

18

Corpus Analysis for Discovery of Word Associations

Idea: use a corpus to estimate word associations
Association ratio: log( P(x,y) / (P(x)P(y)) )

The probability of seeing x followed by y vs. the probability of seeing x anywhere times the probability of seeing y anywhere
P(x) is how often x appears in the corpus
P(x,y) is how often y follows x within w words
Interesting associations with “doctor”:

– X: honorary Y: doctor
– X: doctors Y: dentists
– X: doctors Y: nurses
– X: doctors Y: treating
– X: examined Y: doctor
– X: doctors Y: treat
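The association ratio above can be computed directly from counts; the sketch below estimates it on a tiny made-up text (the text and window size are illustrative assumptions, not Church & Hanks’s data):

```python
import math
from collections import Counter

# Association ratio: log2( P(x,y) / (P(x)P(y)) ), where P(x,y) counts
# how often y follows x within a window of w words.
words = ("the doctor examined the patient and the nurse helped "
         "the doctor and the doctor treated the patient").split()
w = 2                      # window size
N = len(words)

unigrams = Counter(words)
pairs = Counter()
for i, x in enumerate(words):
    for y in words[i + 1:i + 1 + w]:
        pairs[(x, y)] += 1

def assoc_ratio(x, y):
    p_xy = pairs[(x, y)] / N
    return math.log2(p_xy / ((unigrams[x] / N) * (unigrams[y] / N)))

print(f"assoc(the, doctor) = {assoc_ratio('the', 'doctor'):.2f}")
```

A ratio well above zero means x and y co-occur more often than chance; near or below zero means no positive association.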

19

Corpus Analysis for Discovery of Word Associations

Now let’s make use of syntactic information.
Look at which words and syntactic forms follow a given verb, to see what kinds of arguments it takes
Compute triples of subject-verb-object

Example: nouns that appear as the object of the verb “drink”:

– martinis, cup_water, champagne, beverage, cup_coffee, cognac, beer, cup, coffee, toast, alcohol…

– What can we note about many of these words?

Example: verbs that have “telephone” in their object:

– sit_by, disconnect, answer, hang_up, tap, pick_up, return, be_by, spot, repeat, place, receive, install, be_on
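Given subject-verb-object triples (hand-written here for illustration; a real run would extract them with a chunker from a parsed corpus), indexing them both ways reproduces the two kinds of lists above:

```python
from collections import defaultdict

# Hypothetical (subject, verb, object) triples.
triples = [
    ("she", "drink", "coffee"),
    ("he", "drink", "beer"),
    ("they", "answer", "telephone"),
    ("she", "pick_up", "telephone"),
]

objects_of = defaultdict(set)     # verb -> nouns appearing as its object
verbs_taking = defaultdict(set)   # noun -> verbs taking it as an object
for subj, verb, obj in triples:
    objects_of[verb].add(obj)
    verbs_taking[obj].add(verb)

print(sorted(objects_of["drink"]))        # ['beer', 'coffee']
print(sorted(verbs_taking["telephone"]))  # ['answer', 'pick_up']
```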

20

Corpus Analysis for Discovery of Word Associations

The approach has become standard
Entire collections available

Dekang Lin’s Dependency Database
– Given a word, retrieve words that had a dependency relationship with the input word

Dependency-based Word Similarity
– Given a word, retrieve the words that are most similar to it, based on dependencies

http://www.cs.ualberta.ca/~lindek/demos.htm

21

Example Dependency Database: “sell”

22

Example Dependency-based Similarity: “sell”

23

Homework Assignment

Choose a verb of interest
Analyze the context in which the verb appears

Can use any corpus you like
– Can train a tagger and run it on some fresh text

Example: What kinds of arguments does it take?
Improve on my chunking rules to get better characterizations

24

Evaluating the Chunker

Why not just use accuracy?
Accuracy = #correct / total number

Definitions
Total: number of chunks in the gold standard
Guessed: set of chunks that were labeled
Correct: of the guessed, which were correct
Missed: how many correct chunks were not guessed
Precision: #correct / #guessed
Recall: #correct / #total
F-measure: 2 * (Prec * Recall) / (Prec + Recall)
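The definitions above, transcribed directly into code (note that with exact arithmetic the F-measure for the worked example that follows comes out to 8/11 ≈ 0.73):

```python
def prf(correct, guessed, total):
    """Precision, recall, and F-measure as defined on the slide."""
    precision = correct / guessed
    recall = correct / total
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# The worked example: 100 gold chunks, 120 guessed, 80 correct.
p, r, f = prf(correct=80, guessed=120, total=100)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.8 0.73
```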

25

Example

Assume the following numbers:
Total: 100
Guessed: 120
Correct: 80
Missed: 20
Precision: 80 / 120 ≈ 0.67
Recall: 80 / 100 = 0.80
F-measure: 2 * (.67 * .80) / (.67 + .80) ≈ 0.73

26

Evaluating in NLTK
We have some already chunked text from the Treebank

The code below uses the existing parse to compare against, and to generate Tokens of type word/tag to parse with our own chunker.

Have to add location information so the evaluation code can compare which words have been assigned which labels

27

How to get better accuracy?

Use a full syntactic parser
These days the probabilistic ones work surprisingly well
They are getting faster, too.
Prof. Dan Klein’s is very good and easy to run

– http://nlp.stanford.edu/downloads/lex-parser.shtml

[Slides 28-31: screenshots, not transcribed]

32

Next Week

Shallow Parsing Assignment
Due on Wed Sept 29

Next week:
Read paper on end-of-sentence disambiguation
Presley and Barbara lecturing on categorization
We will read the categorization tutorial the following week
