A Whirlwind Tour of Natural Language Processing Mark Sammons Cognitive Computation Group, UIUC


Page 1:

A Whirlwind Tour of Natural Language Processing

Mark Sammons Cognitive Computation Group, UIUC

Page 2:

Who Cares about NLP?

…Eddie Izzard,

that’s who…

(Those of a sensitive disposition toward explicit language should probably cover their ears…)

Page 3:

Remember Star Trek? HAL in 2001? The Heart of Gold in Hitch-hiker’s Guide…?

Grand Vision of Artificial Intelligence: computers that actively communicate.

A substantial effort has been devoted to achieving AI. But how do we decide whether a machine is smart?

IBM’s Deep Blue plays a mean game of chess… but is it intelligent?

Early idea of evaluation: the Turing Test
- If a human can’t tell that it’s a machine…

AI philosophy: is the *appearance* of intelligent behavior the same as intelligence?

General assumption: NLP is AI-complete (a play on the concept of NP-completeness) – i.e. we need Intelligence to properly solve NLP

Page 4:

More Realistically… where does NLP help?

Already here:
- Context-sensitive spelling and grammar checkers in text editors
- Machine Translation, e.g. in web browsers
- Automated phone trees (by some definition of “help”)
- Web search

Under development:
- Better Machine Translation
- Better search
- Voice command, e.g. in cars

Page 5:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems

Linguistics: building explanatory models

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 6:

Why is NLP so hard?

Meaning

Language

Ambiguity

Variability

Page 7:

Variability

Example: Relation Extraction: “Works for”

Jim Carpenter works for the U.S. Government.

The American government employed Jim Carpenter.

Jim Carpenter was fired by the US Government.

Jim Carpenter worked in a number of important positions. … As a press liaison for the IRS, he made contacts in the White House.

Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.

Former US Secretary of Defense Jim Carpenter spoke today…

Page 8:

Context Sensitive Paraphrasing [3]

He used a Phillips head to tighten the screw.

The bank owner tightened security after a spate of local crimes.

The Federal Reserve will aggressively tighten monetary policy.

…
Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce

Ambiguity

Page 9:

Domain Size

Ideal goal: must handle all well-formed strings of text
Problem: infinite domain

Sequential modifiers:

I saw Martin Sheen in a movie
I saw Martin Sheen in a movie in Paris
I saw Martin Sheen in a movie in Paris in the Spring
I saw Martin Sheen in a movie in Paris in the Spring with my friend
…

Unbounded relative clauses:

I saw Martin Sheen, who was with a friend I knew from high school, which was well known for its long, storied history of …, in a movie…

Page 10:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems

Linguistics: building explanatory models

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 11:

Speech Recognition

NOT “voice recognition”

How hard can it be?

First image: “Fix the wing”. Second image: the same utterance in a noisy airport maintenance environment.

Page 12:

Speech Recognition – yup, it’s hard…

“Yuhgudda unnuhstahn sheeguhnuhbeeyah, yunoewaah, dissappointed.”

“You’ve got to understand she’s going to be, ah, you know, ah, disappointed.”

Difficult to recognize words and word boundaries (multiple variations for a single word)
Even given word boundaries, utterances are ill-formed (compared to text)

Hesitations, repetitions, fragmentary sentences, self-interruptions, poor word choice, sound quality…

LBJ/Mansfield audio sample

Page 13:

Development and Evaluation for Speech Recognition

Switchboard (and other) corpora
- Large set of phone conversations
- Audio signals aligned with transcriptions of utterances (phone sequences)
- Dictionaries aligning words with phone sequence equivalents

Typically, machine learning approaches are applied:
- Signal processing techniques extract features from signals
- Statistical methods relate these features to particular phones – create a model
- Analyze new signals, use the model to identify plausible phone sequences
- Choose the most likely sequence given another statistical model
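The last two steps – scoring plausible phone sequences and choosing the most likely one – are classically implemented with dynamic programming over a hidden Markov model. A minimal Viterbi sketch follows; the phones, observation symbols, and every probability are invented for illustration, not taken from any real acoustic model:

```python
import math

# Toy HMM: hidden states are phones, observations are acoustic feature labels.
# All probabilities below are invented for illustration.
states = ["f", "ih", "k", "s"]
start_p = {"f": 0.5, "ih": 0.2, "k": 0.2, "s": 0.1}
trans_p = {
    "f":  {"f": 0.1, "ih": 0.6, "k": 0.2, "s": 0.1},
    "ih": {"f": 0.1, "ih": 0.2, "k": 0.4, "s": 0.3},
    "k":  {"f": 0.1, "ih": 0.2, "k": 0.2, "s": 0.5},
    "s":  {"f": 0.3, "ih": 0.3, "k": 0.2, "s": 0.2},
}
emit_p = {
    "f":  {"A": 0.7, "B": 0.2, "C": 0.1},
    "ih": {"A": 0.1, "B": 0.7, "C": 0.2},
    "k":  {"A": 0.2, "B": 0.1, "C": 0.7},
    "s":  {"A": 0.6, "B": 0.2, "C": 0.2},
}

def viterbi(obs):
    """Return the most probable phone sequence for an observation sequence."""
    # V[t][state] = (log prob of best path ending in state at time t, backpointer)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p][0] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev][0] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]), prev)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["A", "B", "C", "A"]))  # → ['f', 'ih', 'k', 's']
```

Real recognizers work with continuous acoustic features and vastly larger state spaces, but the decoding idea is the same.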

Page 14:

Speech Recognition System (Courtesy of ComputerWorld…)

Page 15:

The State of the Art in Speech-to-Text Translation

Current performance on known tasks:
- 98% word accuracy for dictation (very controlled circumstances)

State of the art for spontaneous speech:
- News broadcast: ~90%
- Switchboard (phone conversations): ~80%

A lot of work even to get to a clean text representation of signal

Notice that I haven’t even begun to address tasks like search using this input

(Note also that there are many other research directions in speech processing – e.g. speaker identification)

Page 16:

What about Text?

A lot of overlap

If you can solve NLP in text, and can accurately parse speech into text, the two problems are the same

Text domain has some nice characteristics

Paragraph, sentence, and word segmentation already present

Well-formed utterances (in many/most sub-domains)

Little regional variation

Most information is already in the form of text

Page 17:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems

Linguistics: building explanatory models

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 18:

Linguistics

Linguists: meaning through structure + lexical knowledge

“Colorless green ideas sleep furiously”

Discover the rules of language (a grammar)
- Prescriptive grammar: rules describe what you shouldn’t do.
- Generative grammar: a finite set of rules that can generate all possible strings in a language, and only those strings that are valid in that language [3]
- “Generate” here means “assign a structural description to”
- Attempts to move beyond simplistic linear models, where words are dependent only on previous words

Page 19:

Divide and Conquer: Morphology

Consider the sub-problem of recognizing well-formed variations of words

Popular method: Finite State Automata/Transducers

Automaton: recognizes patterns
Transducer: maps from an input pattern to an output pattern – e.g. indicates whether a noun is plural

Page 20:

Morphology Example: plurals [5]

[Figure: a finite-state transducer over states q0, q1, q2. A regular noun stem or irregular singular noun (output: N) takes q0 to q1; the suffix -s (output: +PL) takes q1 to q2; an irregular plural noun (output: N +PL) takes q0 directly to q2.]
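The mapping such a transducer computes can be approximated in a few lines. The sketch below uses a tiny invented lexicon and hard-codes the regular -s rule; it illustrates only the surface-to-lexical mapping, not a genuine finite-state implementation:

```python
# A minimal sketch of a morphological analyzer for English plurals.
# The lexicon and the output tag conventions are illustrative assumptions.
REGULAR = {"cat", "dog", "screw"}
IRREG_SG = {"goose", "mouse"}
IRREG_PL = {"geese", "mice"}

def analyze(word):
    """Map a surface form to a lexical form, e.g. 'cats' -> 'cat +N +PL'."""
    if word in IRREG_SG:
        return word + " +N +SG"
    if word in IRREG_PL:
        # Here the irregular plural maps to its own surface form;
        # a real transducer would output the singular stem instead.
        return word + " +N +PL"
    if word in REGULAR:
        return word + " +N +SG"
    if word.endswith("s") and word[:-1] in REGULAR:
        return word[:-1] + " +N +PL"
    return None  # not recognized by this toy grammar

print(analyze("cats"))   # → cat +N +PL
print(analyze("geese"))  # → geese +N +PL
print(analyze("dog"))    # → dog +N +SG
```

A production-quality analyzer would compose many such rules into a single transducer rather than chain if-statements.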

Page 21:

Basic Generative Grammar: Context-Free Grammar

Accomplishes the goal of a finite description of infinite domain, at least for syntactic structure

Generate parse trees, decompose into constituents, infer generative rules:

S => NP VP
VP => V VP
VP => VP PP
VP => V ADJP
NP => PRO
PRO => He
V => wants
PP => to
…

[4]
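“Generate” can be made concrete: pick a rule for each nonterminal and recurse. A sketch of random derivation from a toy CFG follows; the grammar and lexicon are invented for illustration, loosely echoing the rules above:

```python
import random

# A toy context-free grammar: nonterminal -> list of possible expansions.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["PRO"], ["DT", "NN"]],
    "VP":  [["V", "NP"], ["V"]],
    "PRO": [["He"], ["She"]],
    "DT":  [["the"], ["a"]],
    "NN":  [["man"], ["tree"]],
    "V":   [["wants"], ["climbed"]],
}

def generate(symbol="S"):
    """Expand a symbol by recursively choosing a random rule for each nonterminal."""
    if symbol not in GRAMMAR:          # terminal word
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(generate(sym))
    return words

print(" ".join(generate()))  # prints a random sentence licensed by the grammar
```

Every string this prints is "grammatical" with respect to GRAMMAR, which is exactly the sense in which a generative grammar defines a language.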

Page 22:

Context-Free Grammar

Drawbacks to CFGs:
- Real natural language may not be context-free
- Hard to model some phenomena, e.g. limits on nesting:

The cat ran away.

The cat the dog bit ran away.
The cat the dog the horse kicked bit ran away.

- Phenomena like agreement, morphology, and long-distance dependencies require a very complex set of rules
- What about unseen words/phrases/sentences?
- Given a sentence, there may be multiple ways to explain it:

I pointed to the man with the crutch.

Page 23:

That doesn’t deter Real Linguists…

A range of formalisms has been developed
- Different ways of tackling the composition of words, phrases, clauses
- Trade-off between the importance of sentence structure and individual words
- Strong emphasis on generality, particularly across languages

Typically much more involved than the simplistic CFG in the previous example

There is ongoing work to encode a hand-written grammar of English – the English Resource Grammar
- Uses Head-driven Phrase Structure Grammar (HPSG)
- Explains syntax via a Typed Feature Structure model

Page 24:

HPSG sample Feature Structure (for one word)

Page 25:

General Points

Much work on analyzing languages for structure

Wide range of theories; all have some descriptive power

All assume close relation between structure and meaning

We will see CFGs again later…

Page 26:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems: 4 research strands

Linguistics: building explanatory models

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 27:

Data-Driven Approaches

Consider a partially completed sentence…

We can capture some measure of this intuitive restriction on word choice using probabilities
- Bigrams, trigrams, n-grams
- Effect of adding complexity in terms of storage requirements? 50,000² = 2.5 billion

We can estimate these probabilities directly from a corpus (body of text): p(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

Applications: spelling checker, augmentative communication systems, speech processing…
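The estimation formula goes directly into code. A sketch with a tiny invented corpus (a real model would be trained on millions of words):

```python
from collections import Counter

# Estimate bigram probabilities p(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
# from a tiny illustrative corpus.
corpus = "the man saw the dog and the dog saw the man".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w, prev):
    """Maximum-likelihood estimate of p(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p("dog", "the"))  # C("the dog") / C("the") = 2/4 = 0.5
print(p("man", "the"))  # 2/4 = 0.5
```

Note that any bigram absent from the corpus gets probability zero – the "unseen sequences" drawback discussed below motivates smoothing techniques.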

Page 28:

N-gram model samples

The following sentences were generated using n-gram models trained on Shakespeare’s works (~885,000 words, ~29,000 types) [5]:

1-gram: Every enter now severally so, let
2-gram: What means sir. I confess she? Then all sorts, he is trim, captain.
3-gram: This shall forbid it should be branded, if renown made it empty.
4-gram: Enter Leonato’s brother Antonio, and the rest, but seek the weary beds of people sick.
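The generation experiment can be sketched in miniature: train a bigram model and sample successor words until an end-of-sentence marker. The stand-in corpus below is invented, not Shakespeare:

```python
import random
from collections import defaultdict

# Tiny stand-in training corpus with <s>/</s> sentence boundary markers.
corpus = ("<s> what means sir </s> <s> what shall we do </s> "
          "<s> he is trim sir </s>").split()

successors = defaultdict(list)
for prev, cur in zip(corpus, corpus[1:]):
    successors[prev].append(cur)   # duplicates preserve the empirical distribution

def sample_sentence(max_len=12):
    """Sample words from the bigram model until </s> (or max_len words)."""
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(successors[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(sample_sentence())  # prints a short sampled word sequence
```

With a large corpus like Shakespeare's works, exactly this procedure (with higher-order n-grams) produces the progressively more coherent samples above.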

Page 29:

N-Gram Modeling

What’s it good for?
Determine plausibility of a new sentence:

The man spoke briefly…

The dog spoke briefly…

The spoke briefly man…

The wheel spoke briefly…

Given N-gram models of two domains, identify most likely source:

ACENOR stocks caught fire today on word of a take-over….

Teen pop sensation Tilde Greengrass roared into Austin today…

Teen Angst Poetry and Band Names…
Drawbacks: how to handle unseen sequences?

Page 30:

Computational Linguistics

We just used very elementary statistics to make some potentially interesting discoveries about language

In fact, given the right resources, we can use statistics to build automated resources for linguistic analysis…
Part-of-speech tagging:

(DT the) (NN man) (VBD climbed) (IN up) (DT the ) (NN tree)

Phrase boundary detection & phrase labeling

(NP the man) (VP climbed) (PP up the tree)

Parsing….
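As a taste of how such analyzers are built, here is a most-frequent-tag baseline for part-of-speech tagging. The tiny hand-labeled training sample is invented; real taggers are trained on large annotated corpora and use richer features than the word identity alone:

```python
from collections import Counter, defaultdict

# Tiny invented training sample, tags in the Penn Treebank style
# used in the example above.
tagged = [("the", "DT"), ("man", "NN"), ("climbed", "VBD"),
          ("up", "IN"), ("the", "DT"), ("tree", "NN"),
          ("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("the", "DT")]

counts = defaultdict(Counter)
for word, tag_label in tagged:
    counts[word][tag_label] += 1

def tag(sentence, default="NN"):
    """Tag each word with its most frequent training tag (default for unseen words)."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in sentence]

print(tag("the dog climbed the tree".split()))
# → [('the', 'DT'), ('dog', 'NN'), ('climbed', 'VBD'), ('the', 'DT'), ('tree', 'NN')]
```

Even this crude baseline scores surprisingly well on real data because many words are unambiguous; the hard residue of ambiguous words is what statistical taggers target.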

Page 31:

Parsing Revisited

We saw earlier an outline of a Context-Free Grammar model of language:
S => NP VP

VP => VP PP
NP => NP PP
NP => DT NN

(NP I) (VP saw) (NP the man) (PP with the telescope)
(NP I) (VP saw) (NP the man) (PP with the book)

Two valid parses for each… are they equally valid?

Page 32:

Probabilistic CFGs

In the n-gram modeling example, we derived probabilities based on a corpus. Can we do the same for CFG rules?
- Not the same problem: for n-gram modeling, the words alone were sufficient
- Need a corpus with additional information – the parse trees
- Given such a corpus, we can use statistical analysis to derive the rules themselves, and the relative probabilities of rules.

This pattern – applying statistical methods to a labeled data set to extract a predictive model – is common in Machine Learning.
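The counting step can be sketched directly. Given rules read off a treebank (the rule instances below are invented), each rule's probability is its count normalized by the count of its left-hand side:

```python
from collections import Counter

# Invented rule instances as they might be read off a tiny treebank.
observed_rules = [
    ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("NP", "VP")),
    ("VP", ("V", "NP")), ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),
    ("NP", ("DT", "NN")), ("NP", ("NP", "PP")), ("NP", ("DT", "NN")),
]

rule_counts = Counter(observed_rules)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)

def rule_prob(lhs, rhs):
    """p(lhs -> rhs) = C(lhs -> rhs) / C(lhs)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("VP", ("V", "NP")))   # 2/3
print(rule_prob("NP", ("DT", "NN")))  # 2/3
```

A probabilistic parser then scores a whole parse tree as the product of its rule probabilities, which is how the two PP-attachment parses above can be ranked.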

Page 33:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems: 4 research strands

Linguistics: building explanatory models

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 34:

Machine Learning: Classification

[Figure: supervised classification. A learning algorithm consumes training examples D = {(x, y)} and produces a classifier h: X -> Y; given a new input x, the classifier outputs a label y.]

Page 35:

Machine Learning (supervised)

Given some labeled data, and assuming some set of models, find the model that best maps each example to its label.

Statistically: represent examples using some abstraction (a set of features), compute the relation between features and labels.
- Choice of model affects best possible performance.
- Complex model: may get better results (more expressive), but requires much more data to train (and labeled data is expensive)

Simple model: fewer parameters, so less expressive, but easier to learn

Some examples…
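As a concrete instance of a simple model, here is a perceptron, one of the classic learning algorithms used in NLP. The binary feature vectors and labels below are invented toy data:

```python
# A minimal supervised learner: a perceptron over binary feature vectors.
def train_perceptron(examples, epochs=10):
    """Learn weights w and bias b from (features, label) pairs, labels +1/-1."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # misclassified: update toward y
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy task: the label is +1 exactly when the first feature is on.
data = [([1, 0], 1), ([1, 1], 1), ([0, 1], -1), ([0, 0], -1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # → [1, 1, -1, -1]
```

In NLP the feature vector typically encodes things like surrounding words and their tags, and the same update rule scales to millions of sparse features.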

Page 36:

Outline

Why NLP is hard

NLP domains: Speech vs. Text

Attacking NLP problems: 4 research strands

Linguistics: building explanatory models

Logic: defining meaning and reasoning

Statistics: data-driven approaches

Machine Learning & NLP

NLP Problems and Solutions

Page 37:

NLP Problems and Solutions (focused)

Part-of-Speech tagging
Context Sensitive Spelling Correction
Named Entity Recognition
Relation detection
Comma Resolution
Verb and Noun Phrase Chunking
Prepositional Phrase Attachment
Coreference Resolution
Statistical Parsing
Semantic Role Labeling
Emotion and Subjectivity detection

Page 38:

Example: Named Entity Recognition

Entities are inherently ambiguous (e.g. JFK can be both a location and a person depending on the context)
- Can appear in various forms; can be nested
- Using lists is not sufficient: new entities are always being introduced

A lot of Machine Learning work – significant overfitting

Key difficulties – adaptation to:
- New domains/corpora
- Slightly new definitions of an entity
- New languages
- New types of entities

How to reduce the requirements on the resources needed to produce a semantic categorization for a new domain/new language/new type of entities?


Page 39:

Grand Challenges

Machine Translation

Message Understanding (Information Extraction)

Question Answering

Information Retrieval & Data Mining

Textual Entailment

Page 40:

Textual Entailment

Work at the level of meaning
Frame the task of understanding text as recognizing when two text fragments mean the same thing (one meaning ‘contains’ the other)

Dagan and Glickman, 2004 pose this problem as Recognizing Textual Entailment.

Now we can recast many problems in terms of TE:

The American government employed Jim Carpenter. Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.

Former US Secretary of Defence Jim Carpenter spoke today…

Jim Carpenter works for the U.S. Government.

?

Page 41:

PASCAL RTE Challenges (2004-present)

Move away from strict definition (Chierchia & McConnell-Ginet, 2001 [6]):

A text T entails a hypothesis H if H is true in every circumstance (possible world) in which T is true

‘Applied’ Definition (Dagan & Glickman, 2004 [7])

T entails H (T → H) if humans reading T will infer that H is most likely true

800 development, 800 test pairs for each challenge
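Under the applied definition, even a crude baseline can be written down: predict entailment when most hypothesis words occur in the text. This word-overlap sketch (the 0.75 threshold is an arbitrary assumption) is far weaker than real RTE systems, but it makes the task concrete:

```python
# A crude lexical-overlap baseline for Recognizing Textual Entailment:
# predict "entailed" when most hypothesis words appear in the text.
# Real RTE systems go far beyond word overlap.
def entails(text, hypothesis, threshold=0.75):
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(h & t) / len(h)
    return overlap >= threshold

print(entails("Google files for its long awaited IPO.",
              "Google files an IPO."))  # → True
```

Such baselines fail exactly where TE is interesting: pair 1 in the table below has high word overlap (“Washington”, “Normandy”) yet is not entailed.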

Page 42:

Some Examples (2nd RTE Challenge)

1. TEXT: Reagan attended a ceremony in Washington to commemorate the landings in Normandy.
   HYPOTHESIS: Washington is located in Normandy.  TASK: IE  ENTAILMENT: False

2. TEXT: Google files for its long awaited IPO.
   HYPOTHESIS: Google goes public.  TASK: IR  ENTAILMENT: True

3. TEXT: … a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993.  TASK: QA  ENTAILMENT: True

4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   HYPOTHESIS: The SPD is defeated by the opposition parties.  TASK: IE  ENTAILMENT: True

Page 43:

Incomplete List of Citations
1. Peter Bell and Simon King. Sparse Gaussian graphical models for speech recognition. In Proc. Interspeech 2007, Antwerp, Belgium, August 2007.
2. Connor & Roth, ECML 2007.
3. Chomsky, Noam (1957, 2002). Syntactic Structures. Mouton de Gruyter, 13.
4. Image courtesy of Bill Wilson, Univ. New South Wales, Australia. http://www.cse.unsw.edu.au/~billw/
5. Jurafsky and Martin. Speech and Language Processing, Prentice-Hall, 2000.
6. Chierchia & McConnell-Ginet. Meaning and Grammar: An Introduction to Semantics (rev. 2nd ed.), 2000.
7. Dagan & Glickman, 2004. Probabilistic textual entailment: Generic applied modeling of language variability. PASCAL workshop on Text Understanding and Mining, 2004.

Some slides came from Prof. Dan Roth, University of Illinois.
