600.465 connecting the dots - i (nlp in practice)

35
600.465 Connecting the dots - I (NLP in Practice) Delip Rao [email protected]

Upload: caraf

Post on 26-Feb-2016

31 views

Category:

Documents


1 download

DESCRIPTION

600.465 Connecting the dots - I (NLP in Practice). Delip Rao [email protected]. What is “Text”?. What is “Text”?. What is “Text”?. “Real” World. Tons of data on the web A lot of it is text In many languages In many genres. Language by itself is complex. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: 600.465 Connecting the  dots - I (NLP in Practice)

600.465 Connecting the dots - I(NLP in Practice)

Delip [email protected]

Page 3: 600.465 Connecting the  dots - I (NLP in Practice)

What is “Text”?

Page 4: 600.465 Connecting the  dots - I (NLP in Practice)

What is “Text”?

Page 5: 600.465 Connecting the  dots - I (NLP in Practice)

What is “Text”?

Page 6: 600.465 Connecting the  dots - I (NLP in Practice)

“Real” World

• Tons of data on the web• A lot of it is text• In many languages• In many genres

Language by itself is complex. The Web further complicates language.

Page 7: 600.465 Connecting the  dots - I (NLP in Practice)

But we have 600.465

Adapted from : Jason Eisner

We can study anything about language ...

1. Formalize some insights 2. Study the formalism mathematically 3. Develop & implement algorithms 4. Test on real data

feature functions!f(wi = off, wi+1 = the)

f(wi = obama, yi = NP)

Forward Backward, Gradient Descent, LBFGS,

Simulated Annealing, Contrastive Estimation, …

Page 8: 600.465 Connecting the  dots - I (NLP in Practice)

NLP for fun and profit

• Making NLP more accessible– Provide APIs for common NLP tasksvar text = document.get(…);var entities = agent.markNE(text);

• Big $$$$• Backend to intelligent processing of text

Page 9: 600.465 Connecting the  dots - I (NLP in Practice)

Desideratum: Multilinguality

• Except for feature extraction, systems should be language agnostic

Page 10: 600.465 Connecting the  dots - I (NLP in Practice)

In this lecture

• Understand how to solve and ace in NLP tasks• Learn general methodology or approaches• End-to-End development using an example

task• Overview of (un)common NLP tasks

Page 11: 600.465 Connecting the  dots - I (NLP in Practice)

Case study: Named Entity Recognition

Page 12: 600.465 Connecting the  dots - I (NLP in Practice)

Case study: Named Entity Recognition

• Demo: http://viewer.opencalais.com

• How do we build something like this?• How do we find out well we are doing?• How can we improve?

Page 13: 600.465 Connecting the  dots - I (NLP in Practice)

Case study: Named Entity Recognition

• Define the problem– Say, PERSON, LOCATION, ORGANIZATION

The UN secretary general met president Obama at Hague.

The UN secretary general met president Obama at Hague.

ORG PER LOC

Page 14: 600.465 Connecting the  dots - I (NLP in Practice)

Case study: Named Entity Recognition

• Collect data to learn from– Sentences with words marked as PER, ORG, LOC,

NONE• How do we get this data?

Page 15: 600.465 Connecting the  dots - I (NLP in Practice)

Pay the experts

Page 16: 600.465 Connecting the  dots - I (NLP in Practice)

Wisdom of the crowds

Page 17: 600.465 Connecting the  dots - I (NLP in Practice)

Getting the data: Annotation

• Time consuming• Costs $$$• Need for quality control– Inter-annotator aggreement– Kappa score (Kippendorf, 1980)

• Smarter ways to annotate– Get fewer annotations: Active Learning– Rationales (Zaidan, Eisner & Piatko, 2007)

Page 18: 600.465 Connecting the  dots - I (NLP in Practice)

Only France and Great Britain backed Fischler ‘s proposal .

Only O

France B-LOC

and O

Great B-LOC

Britain I-LOC

backed O

Fischler B-PER

‘s O

proposal O

. O

Only France and Great Britain backed Fischler ‘s proposal .

Input x

Labels y

y = g(x)

Page 19: 600.465 Connecting the  dots - I (NLP in Practice)

1. Formalize some insights 2. Study the formalism mathematically 3. Develop & implement algorithms 4. Test on real data

Our recipe …

Page 20: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing features

• Not as trivial as you think• Original text itself might be in

an ugly HTML• Cleaneval!

• Need to segment sentences• Tokenize the sentences

OnlyFrance

and

Great

Britain

backed

Fischler

‘s

proposal

.

Preprocessing

Page 21: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly IS_CAPITALIZEDFrance IS_CAPITALIZED

and

Great IS_CAPITALIZED

Britain IS_CAPITALIZED

backed

Fischler IS_CAPITALIZED

‘s

proposal

.

Page 22: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly IS_CAPITALIZED IS_SENT_STARTFrance IS_CAPITALIZED

and

Great IS_CAPITALIZED

Britain IS_CAPITALIZED

backed

Fischler IS_CAPITALIZED

‘s

proposal

.

Page 23: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly IS_CAPITALIZED IS_SENT_STARTFrance IS_CAPITALIZED

and

Great IS_CAPITALIZED

Britain IS_CAPITALIZED

backed

Fischler IS_CAPITALIZED

‘s

proposal

.

Page 24: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly IS_CAPITALIZED IS_SENT_STARTFrance IS_CAPITALIZED IN_LEXICON_LOC

and

Great IS_CAPITALIZED

Britain IS_CAPITALIZED IN_LEXICON_LOC

backed

Fischler IS_CAPITALIZED

‘s

proposal

.

Page 25: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly POS=RB IS_CAPITALIZED IS_SENT_STARTFrance POS=NNP IS_CAPITALIZED IN_LEXICON_LOC

and POS=CC

Great POS=NNP IS_CAPITALIZED

Britain POS=NNP IS_CAPITALIZED IN_LEXICON_LOC

backed POS=VBD

Fischler POS=NNP IS_CAPITALIZED

‘s POS=XX

proposal POS=NN

. POS=.

These are extracted during

preprocessing!

Page 26: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_

France POS=NNP IS_CAPITALIZED … PREV_WORD=only

and POS=CC … PREV_WORD=france

Great POS=NNP IS_CAPITALIZED … PREV_WORD=and

Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great

backed POS=VBD … PREV_WORD=britain

Fischler POS=NNP IS_CAPITALIZED … PREV_WORD=backed

‘s POS=XX … PREV_WORD=fischler

proposal POS=NN … PREV_WORD=‘s

. POS=. … PREV_WORD=proposal

Page 27: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing featuresOnly POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_ …France POS=NNP IS_CAPITALIZED … PREV_WORD=only …and POS=CC … PREV_WORD=france …Great POS=NNP IS_CAPITALIZED … PREV_WORD=and …Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great …backed

POS=VBD … PREV_WORD=britain …Fischler

POS=NNP IS_CAPITALIZED … PREV_WORD=backed …‘s POS=XX … PREV_WORD=fischler …proposal

POS=NN … PREV_WORD=‘s …. POS=. … PREV_WORD=proposal …

Page 28: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Designing features

• Can you think of other features?HAS_DIGITSIS_HYPHENATEDIS_ALLCAPSFREQ_WORDRARE_WORDUSEFUL_UNIGRAM_PERUSEFUL_BIGRAM_PERUSEFUL_UNIGRAM_LOCUSEFUL_BIGRAM_LOCUSEFUL_UNIGRAM_ORGUSEFUL_BIGRAM_ORGUSEFUL_SUFFIX_PERUSEFUL_SUFFIX_LOCUSEFUL_SUFFIX_ORG

WORDPREV_WORDNEXT_WORDPREV_BIGRAMNEXT_BIGRAMPOSPREV_POSNEXT_POSPREV_POS_BIGRAMNEXT_POS_BIGRAMIN_LEXICON_PERIN_LEXICON_LOCIN_LEXICON_ORGIS_CAPITALIZED

Page 29: 600.465 Connecting the  dots - I (NLP in Practice)

Case: Named Entity Recognition

• Evaluation Metrics– Token accuracy: What percent of the tokens got

labeled correctly– Problem with accuracy– Precision-Recall-F

Model F-ScoreHMM 74.6

president OBarack B-PERObama O

Page 30: 600.465 Connecting the  dots - I (NLP in Practice)

NER: How can we improve?

• Engineer better features• Design better models• Conditional Random Fields

Model F-ScoreHMM 74.6

TBL 81.2

Maxent 85.6

x1

Y1

x2

Y2

x3

Y3

x4

Y4

Model F-ScoreHMM 74.6

TBL 81.2

Maxent 85.6

CRF 91.7

… …

Page 31: 600.465 Connecting the  dots - I (NLP in Practice)

NER: How else can we improve?

• Unlabeled data!

example from Jerry Zhu

Page 32: 600.465 Connecting the  dots - I (NLP in Practice)

NER : Challenges

• Domain transfer WSJ NYT WSJ Blogs ?? WSJ Twitter ??!?• Tough nut: Organizations• Non textual data?

Entity Extraction is a Boring Solved Problem – or is it?(Vilain, Su and Lubar, 2007)

Page 33: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Related application

• Extracting real estate information from Criagslist Ads

Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.

Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.

Page 34: 600.465 Connecting the  dots - I (NLP in Practice)

NER: Related Application

• BioNLP: Annotation of chemical entities

Corbet, Batchelor & Teufel, 2007

Page 35: 600.465 Connecting the  dots - I (NLP in Practice)

Shared Tasks: NLP in practice

• Shared Task– Everybody works on a (mostly) common dataset– Evaluation measures are defined– Participants get ranked on the evaluation measures– Advance the state of the art– Set benchmarks

• Tasks involve common hard problems or new interesting problems