Auckland 2012 Kilgarriff: NLP and Corpus Processing

The contribution of NLP: corpus processing


What is NLP?

• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
    • official European Union term, ca. 1996-99
  – Language Technology


NLP and linguistics

• LING: supply ideas, interpret results
• NLP: test theories, expose gaps
  – plus turn into technology


Example: regular morphology

LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine the theory

Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
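
As a rough illustration of the loop on this slide, the sketch below programs a few regular English verb rules in Python and applies them to a tiny lexicon of stems. The rule set, the function names and the four-stem lexicon are assumptions made for this example, not rules taken from the talk. The rules deliberately leave out consonant doubling, so the output for "stop" shows the kind of error that sends the linguist back to refine the theory.

```python
VOWELS = "aeiou"

def third_person(stem):
    """Regular 3rd person singular: help -> helps, try -> tries."""
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    return stem + "s"

def past(stem):
    """Regular past tense: help -> helped, arrive -> arrived."""
    if stem.endswith("e"):
        return stem + "d"
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"
    return stem + "ed"

def ing_form(stem):
    """Regular -ing form: help -> helping, arrive -> arriving."""
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"
    return stem + "ing"

def inflect(stem):
    """Apply every rule to one stem."""
    return [stem, third_person(stem), past(stem), ing_form(stem)]

# Hypothetical lexicon of stems, for illustration only.
lexicon = ["help", "arrive", "try", "stop"]

for stem in lexicon:
    print(stem, "->", inflect(stem))
```

Checking the output by eye, "stop" comes out as "stoped" and "stoping": the rules have exposed a gap (consonant doubling) that the theory has to account for.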


Applications

• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
  – speech processing as well
• information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions


How can NLP make better dictionaries?

By pre-processing a corpus:

• tokenization

• sentence splitting

• lemmatization

• POS-tagging

• parsing

Each step builds on predecessors


Tokenization

“identifying the words”

from: He didn't arrive.

to:
He
did
n't
arrive
.


Automatic tokenization

• Western writing systems
  – easy! space is separator
• Chinese, Japanese
  – do not use word-separator
  – hard
    • like POS-tagging (below)


Why isn't space=separator enough (even for English)?

– what is a space
  • Line breaks, paragraph breaks, tabs
– Punctuation
  • No space between it and word
– brackets, quotation marks
– Hyphenation
  • co-op?
  • well-managed?
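
The points above can be sketched with a single regular expression. This is a minimal Python illustration, not the tokenizer used in the talk: the pattern and the function name `tokenize` are assumptions, and keeping hyphenated words such as "co-op" together is only one possible answer to the question the slide raises.

```python
import re

# Alternatives are tried left to right:
#   1. a word directly followed by the clitic n't ("did" in "didn't")
#   2. the clitic n't itself (straight or curly apostrophe)
#   3. a word, with any internal hyphens kept ("well-managed")
#   4. any single punctuation character (brackets, quotes, full stop)
TOKEN_RE = re.compile(r"\w+(?=n['’]t)|n['’]t|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of tokens; all whitespace (spaces, tabs,
    line breaks) is treated alike and discarded."""
    return TOKEN_RE.findall(text)

print(tokenize("He didn't arrive."))
print(tokenize("a well-managed\tco-op\n(really)"))
```

The first call yields the five tokens He, did, n't, arrive and the full stop. In the second, the tab and line break behave like spaces and the brackets become separate tokens.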


Sentence splitting

“identifying the sentences”

from: He didn't arrive.

to: <s> He did n't arrive . </s>
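
A minimal Python sketch of this step, assuming the text has already been tokenized and that a sentence ends at a ".", "!" or "?" token. The function names are hypothetical, and the rule is too simple for real text: a full stop after an abbreviation such as "Dr" would wrongly end the sentence.

```python
SENTENCE_FINAL = {".", "!", "?"}

def split_sentences(tokens):
    """Group a flat token list into sentences (lists of tokens)."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

def mark_up(tokens):
    """Wrap each sentence in <s> ... </s>, as on the slide."""
    return " ".join(
        "<s> " + " ".join(sentence) + " </s>"
        for sentence in split_sentences(tokens)
    )

print(mark_up(["He", "did", "n't", "arrive", "."]))
```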


Lemmatization

Mapping from text-word to lemma: help (verb)

text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)


Lemmatization

Mapping from text-word to lemma: help (verb), help (noun), helping (noun)

text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)

**help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.


Lemmatization

Dictionary entries are for lemmas

Match between text-word and dictionary-word = lemmatization


Lemmatization

• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors


Automatic lemmatization

• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is a verb lemma, add to list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
  – Often available from dictionary companies
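
The rule-plus-lemma-list approach can be sketched in a few lines of Python. The suffix rules, the tiny lemma list and the function name `lemmatize` are all assumptions made for this illustration; a real system would draw on a full lemma list and a detailed grammar.

```python
# Hypothetical lemma list, keyed by part of speech.
LEMMAS = {
    "v": {"help", "arrive"},
    "n": {"help", "helping"},
}

# (suffix to delete, string to add back, POS of the candidate lemma)
RULES = [
    ("", "", "v"), ("", "", "n"),        # the word is already a lemma
    ("s", "", "v"), ("s", "", "n"),      # helps -> help, helpings -> helping
    ("ed", "", "v"), ("ed", "e", "v"),   # helped -> help, arrived -> arrive
    ("ing", "", "v"), ("ing", "e", "v"), # helping -> help, arriving -> arrive
]

def lemmatize(word):
    """Return every (lemma, POS) pair licensed by a rule and the lemma list."""
    word = word.lower()
    candidates = []
    for suffix, replacement, pos in RULES:
        if not word.endswith(suffix):
            continue
        remainder = word[: len(word) - len(suffix)] + replacement
        # Keep the candidate only if the remainder is a known lemma.
        if remainder in LEMMAS[pos] and (remainder, pos) not in candidates:
            candidates.append((remainder, pos))
    return candidates

for w in ["help", "helps", "helping", "helped", "helpings"]:
    print(w, "->", lemmatize(w))
```

The function returns a list of possibilities, not a single answer: "helping" maps to both helping (n) and help (v), and choosing between them is left to a later step such as POS-tagging.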


Part-of-speech (POS) tagging

“identifying parts of speech”

from: He didn't arrive.

to:
<s>
He       PNP    personal pronoun
did      VVD    past tense verb
n't      XNOT   not
arrive   VV     base form of verb
.        C      punctuation
</s>


Tagsets

• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
    • NN2 plural noun
    • VVG -ing form of lexical verb
• Based on linguistics of the language.


POS-tagging: why?

• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of


POS-tagging: how?

• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – Constraint-based: set of rules of the form
    • if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
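
The constraint-based rule quoted above can be sketched in Python as follows. The lexicon of possible tags, the coarse tag names and the function name are assumptions for this illustration. A real constraint-based tagger applies many such rules; this sketch applies only the one from the slide.

```python
# Hypothetical lexicon: every tag each word could carry.
LEXICON = {
    "the": {"DET"},
    "they": {"PRON"},
    "buckle": {"NOUN", "VERB"},
    "broke": {"VERB"},
}

def tag_with_constraint(tokens):
    """Look up all possible tags, then delete VERB after "the"."""
    readings = [set(LEXICON.get(t.lower(), {"UNKNOWN"})) for t in tokens]
    for i in range(1, len(tokens)):
        previous_is_the = tokens[i - 1].lower() == "the"
        # Only delete VERB when other possibilities remain.
        if previous_is_the and "VERB" in readings[i] and len(readings[i]) > 1:
            readings[i].discard("VERB")
    return list(zip(tokens, readings))

print(tag_with_constraint(["the", "buckle", "broke"]))
print(tag_with_constraint(["they", "buckle"]))
```

After "the", only NOUN survives for "buckle". After "they", this one rule does not apply, so "buckle" stays ambiguous until further rules or a statistical model decide.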


Parsing

• Find the structure:
  – Phrase structure (trees)
    • The cat sat on the mat
  – Dependency structure (links)
    • The cat sat on the mat
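
The two kinds of structure can be written down as plain Python data. This is only one possible analysis of the example sentence: the category labels, the choice of heads (here the preposition heads its noun) and the helper names are assumptions, and other annotation schemes differ.

```python
# Phrase structure: nested tuples of (label, child, child, ...).
tree = ("S",
        ("NP", "The", "cat"),
        ("VP", "sat",
         ("PP", "on",
          ("NP", "the", "mat"))))

def leaves(node):
    """Read the words back off the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the category label
        words.extend(leaves(child))
    return words

# Dependency structure: (dependent, head) links; the root has no head.
dependencies = [
    ("The", "cat"),
    ("cat", "sat"),
    ("sat", None),
    ("on", "sat"),
    ("the", "mat"),
    ("mat", "on"),
]

def root(links):
    """Return the words that have no head."""
    return [dependent for dependent, head in links if head is None]

print(leaves(tree))
print(root(dependencies))
```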


Automatic parsing

• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses "shallow parsing"
  – Patterns of POS-tags
  – Regular expressions
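
To give a feel for shallow parsing with patterns of POS-tags, the Python sketch below finds verb-object pairs with one regular expression over a tagged sentence. It is not Sketch Engine's grammar formalism: the word/TAG encoding, the coarse tag names, the pattern and the example sentence are assumptions made for this illustration.

```python
import re

# A verb, then optionally a determiner, any number of adjectives,
# and a noun. (?!\S) makes sure the NOUN tag is complete.
OBJECT_PATTERN = re.compile(
    r"(\S+)/VERB (?:\S+/DET )?(?:\S+/ADJ )*(\S+)/NOUN(?!\S)"
)

def find_objects(tagged_sentence):
    """Return (verb, object noun) pairs from a list of (word, tag)."""
    encoded = " ".join(f"{word}/{tag}" for word, tag in tagged_sentence)
    return OBJECT_PATTERN.findall(encoded)

# Hypothetical tagged sentence.
sentence = [
    ("She", "PRON"), ("fastened", "VERB"), ("the", "DET"),
    ("heavy", "ADJ"), ("buckle", "NOUN"), (".", "PUNCT"),
]

print(find_objects(sentence))
```

A pattern like this is fast enough to run over a large corpus, and collecting its matches for one noun answers queries such as "verbs that buckle is object of". The price is that it misses objects the pattern does not anticipate.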


Summary

• What is NLP?

• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing