Auckland 2012 Kilgarriff: NLP and Corpus Processing 1
The contribution of NLP: corpus processing
What is NLP?
• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
    • official European Union term, ca. 1996-99
  – Language Technology
NLP and linguistics
LING: supply ideas, interpret results
NLP: test theories, expose gaps, plus turn into technology
Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
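The loop above (program the rules, apply them to a lexicon of stems, check the output) can be sketched in a few lines. This is a minimal illustration only: the rules are a deliberately simplified version of English regular verb morphology, and the function names and the four-stem lexicon are hypothetical.

```python
VOWELS = "aeiou"

def ends_in_consonant_y(stem):
    """True for stems like 'try' (consonant + y), False for 'play'."""
    return len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS

def inflect_regular_verb(stem):
    """Apply simplified regular-verb rules: stem -> inflected forms."""
    # Third person singular
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        third = stem + "es"
    elif ends_in_consonant_y(stem):
        third = stem[:-1] + "ies"
    else:
        third = stem + "s"
    # Past tense and -ing form
    if stem.endswith("e"):
        past, ing = stem + "d", stem[:-1] + "ing"
    elif ends_in_consonant_y(stem):
        past, ing = stem[:-1] + "ied", stem + "ing"
    else:
        past, ing = stem + "ed", stem + "ing"
    return {"base": stem, "3sg": third, "past": past, "ing": ing}

# Apply the rules to a (toy) lexicon of stems and inspect the output.
lexicon = ["help", "arrive", "try", "stop"]
for stem in lexicon:
    print(inflect_regular_verb(stem))
```

The output for "stop" is "stoped" and "stoping": the rules have no consonant doubling. That is the kind of error that sends the rule set back to linguistics for refinement.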
Applications
• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
  – speech processing as well
• information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions
How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on predecessors
Tokenization
“identifying the words”
from: He didn't arrive.
to: He did n’t arrive .
Automatic tokenization
• Western writing systems
  – easy! space is separator
• Chinese, Japanese
  – do not use word-separator
  – hard
    • like POS-tagging (below)
Why isn't space=separator enough (even for English)?
– what is a space
  • Line breaks, paragraph breaks, tabs
– Punctuation
  • No space between it and word
– brackets, quotation marks
– Hyphenation
  • co-op?
  • well-managed?
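A minimal tokenizer sketch that handles the cases listed above. The choices are assumptions made for illustration: any whitespace (spaces, tabs, line breaks) counts as a separator, each punctuation mark becomes its own token, hyphenated words are kept whole, and the clitic n't is split off as in "did n’t". The pattern and function names are hypothetical.

```python
import re

TOKEN_RE = re.compile(
    r"\w+(?:-\w+)*(?=n['’]t\b)"  # word part before the clitic: "did" in "didn't"
    r"|n['’]t\b"                 # the clitic n't itself
    r"|\w+(?:-\w+)*"             # ordinary word, internal hyphens kept
    r"|[^\w\s]"                  # any single punctuation mark
)

def tokenize(text):
    """Return the list of tokens; whitespace of any kind is skipped."""
    return TOKEN_RE.findall(text)

print(tokenize("He didn't arrive."))
print(tokenize('a "well-managed"\tco-op\n(really)'))
```

Whether "co-op" and "well-managed" should each be one token or several is a design decision; this sketch simply keeps hyphenated strings together.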
Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to: <s> He did n’t arrive . </s>
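A minimal sketch of sentence splitting on already-tokenized text, in line with each step building on its predecessors. It assumes that ".", "!" and "?" always end a sentence, which is too crude for real corpora (abbreviations such as "Dr." would be split wrongly). The function names are hypothetical.

```python
def split_sentences(tokens, enders=(".", "!", "?")):
    """Group a flat token list into sentences, closing one at each ender."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in enders:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences

def mark_up(sentences):
    """Wrap each sentence in <s> ... </s> markers."""
    return " ".join("<s> " + " ".join(s) + " </s>" for s in sentences)

tokens = ["He", "did", "n’t", "arrive", ".", "She", "did", "."]
print(mark_up(split_sentences(tokens)))
```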
Lemmatization
Mapping from text-word to lemma: help (verb)
text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)
Lemmatization
Mapping from text-word to lemma: help (verb), help (noun), helping (noun)
text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)
** help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.
Lemmatization
Dictionary entries are for lemmas
Match between text-word and dictionary-word: lemmatization
Lemmatization
• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors
Automatic lemmatization
• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is a verb lemma, add it to the list of possible lemmas
• If a detailed grammar is available, use it
• A full lemma list is also required
  – often available from dictionary companies
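The rule-plus-lemma-list approach can be sketched as follows. The tiny lemma lists and the suffix rules are hypothetical stand-ins for a full lemma list and a detailed grammar; a stripped form only counts as a candidate if it appears in the lemma list for that part of speech.

```python
# Toy lemma lists (assumption: a real system would load a full list).
LEMMAS = {
    "v": {"help", "arrive", "sing"},
    "n": {"help", "helping", "king"},
}

# (suffix to delete, string to add back, part of speech of the result)
SUFFIX_RULES = [
    ("ing", "", "v"),   # helping  -> help
    ("ing", "e", "v"),  # arriving -> arrive
    ("ed", "", "v"),    # helped   -> help
    ("d", "", "v"),     # arrived  -> arrive
    ("s", "", "v"),     # helps    -> help (v)
    ("s", "", "n"),     # helpings -> helping (n)
]

def lemmatize(word):
    """Return all possible (lemma, pos) pairs for a text-word."""
    word = word.lower()
    candidates = {(word, pos) for pos, lemmas in LEMMAS.items() if word in lemmas}
    for suffix, replacement, pos in SUFFIX_RULES:
        if word.endswith(suffix):
            remainder = word[: -len(suffix)] + replacement
            if remainder in LEMMAS[pos]:
                candidates.add((remainder, pos))
    return sorted(candidates)

for text_word in ["help", "helps", "helping", "helped", "helpings", "king"]:
    print(text_word, lemmatize(text_word))
```

The lemma list is what stops "king" from being reduced to a verb "k": the rule fires, but the remainder is not a known verb lemma, so it is discarded.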
Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to:
<s>
He      PNP   pers pronoun
did     VVD   past tense verb
n’t     XNOT  not
arrive  VV    base form of verb
.       C     punctuation
</s>
Tagsets
• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples from the CLAWS English tagset
    • NN2 plural noun
    • VVG -ing form of lexical verb
• Based on linguistics of the language.
POS-tagging: why?
• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of
POS-tagging: how?
• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – Constraint-based: a set of rules of the form
    • if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
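A minimal sketch of the constraint-based method, implementing only the single example rule quoted above. The toy lexicon of possible tags and the coarse tag names (DET, NOUN, VERB, PRON) are hypothetical. The sketch also assumes a rule never deletes a word's last remaining reading.

```python
# Toy lexicon: every tag each word could have, before disambiguation.
POSSIBLE_TAGS = {
    "the": {"DET"},
    "they": {"PRON"},
    "buckle": {"NOUN", "VERB"},
    "belts": {"NOUN", "VERB"},
}

def tag(tokens):
    """Look up all possible tags, then delete readings ruled out by context."""
    readings = [set(POSSIBLE_TAGS.get(t.lower(), {"UNKNOWN"})) for t in tokens]
    for i in range(1, len(tokens)):
        # Rule: if previous word is "the" and VERB is one of the
        # possibilities, delete VERB (unless it is the only reading left).
        if (tokens[i - 1].lower() == "the"
                and "VERB" in readings[i]
                and len(readings[i]) > 1):
            readings[i].discard("VERB")
    return list(zip(tokens, readings))

print(tag(["the", "buckle"]))          # VERB reading of "buckle" is deleted
print(tag(["they", "buckle", "belts"]))  # ambiguity remains without more rules
```

With one rule, most ambiguity remains; a real constraint-based tagger needs a large rule set, which is one reason statistical methods are a common alternative.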
Parsing
• Find the structure:
  – Phrase structure (trees): The cat sat on the mat
  – Dependency structure (links): The cat sat on the mat
Automatic parsing
• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
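A minimal sketch of shallow parsing as a regular expression over POS-tags, here finding simple noun phrases. It illustrates the general idea only and is not Sketch Engine's actual grammar formalism; the tag names, the word/TAG text encoding and the pattern are all assumptions.

```python
import re

# Noun phrase = optional determiner, any adjectives, one or more nouns.
# Each token is encoded as "word/TAG " so one regex can scan the sentence.
NP_PATTERN = re.compile(r"(?<!\S)(?:\S+/DET )?(?:\S+/ADJ )*(?:\S+/NOUN )+")

def find_noun_phrases(tagged_tokens):
    """Return each matched noun phrase as a list of words."""
    text = "".join(f"{word}/{tag} " for word, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(text):
        words = [item.rsplit("/", 1)[0] for item in match.group().split()]
        phrases.append(words)
    return phrases

sentence = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(find_noun_phrases(sentence))
```

A single left-to-right regex pass like this is fast enough for large corpora, at the cost of finding only flat patterns and not full tree or dependency structure.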
Summary
• What is NLP?
• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing