Learning Bit by Bit Class 3 – Stemming and Tokenization

Post on 20-Dec-2015

Morphology

• The study of the way words are constructed from smaller components

• Stems – “talk”

• Affixes – “ing”

Morphology

• Orthographic Rules – General

• Morphological Rules – Specific

Parsing

• Analyzing a text in pieces

Parsing

• Morphological Parsing – decomposing a word into its constituent morphemes

• Foxes -> fox + es

Morphological Parsing

• Must recognize proper words: “spelling”

• Must not recognize improper words: “computering”

Morphological Parsing

• Should not require a list of all possible words

Morphological Parsing

• Web search

• Spell check, grammar check

• Machine translation

• Sentiment analysis

Computational Lexicon

• Stems

• Affixes

• Rules
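
As a rough illustration of how stems, affixes and rules can work together, here is a minimal Python sketch. The tiny lexicon, the `parse` function name, and the convention that each stem lists the affixes it allows are all assumptions made for this example, not part of the class material.

```python
# Toy computational lexicon: each stem lists the affixes it may take.
# The entries are illustrative assumptions, not a real lexicon.
LEXICON = {
    "talk": {"", "s", "ed", "ing"},
    "spell": {"", "s", "ed", "ing"},
    "fox": {"", "es"},
    "cat": {"", "s"},
    "computer": {"", "s"},
}


def parse(word):
    """Return (stem, affix) if the lexicon licenses the word, else None."""
    word = word.lower()
    # Try every split point, longest stem first.
    for i in range(len(word), 0, -1):
        stem, affix = word[:i], word[i:]
        if affix in LEXICON.get(stem, ()):
            return stem, affix
    return None


print(parse("Foxes"))        # ('fox', 'es')
print(parse("spelling"))     # ('spell', 'ing')
print(parse("computering"))  # None
```

Because the affixes are stored per stem, “spelling” is accepted and “computering” is rejected without listing every full word form.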

Finite State Transducer

• A finite-state automaton (FSA) that maps an input string to an output string

• Defines a relationship between two sets of strings

Finite State Transducer

c:c a:a t:t +N:ε +PL:s

Input: cat +N +PL

Output: cats
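
The arcs above can be sketched as a small lookup table in Python. This is only a sketch: the state numbering, the `transduce` function, and the choice to also accept the singular path (cat +N) are assumptions added for illustration.

```python
# Arcs of a toy transducer for "cat":
# state -> {input symbol: (output string, next state)}.
# The empty string stands for epsilon (no output).
ARCS = {
    0: {"c": ("c", 1)},
    1: {"a": ("a", 2)},
    2: {"t": ("t", 3)},
    3: {"+N": ("", 4)},
    4: {"+PL": ("s", 5)},
}
# Assumption: accept both the singular (cat +N) and the plural (cat +N +PL).
FINAL_STATES = {4, 5}


def transduce(symbols):
    """Map a sequence of lexical symbols to a surface string, or None if rejected."""
    state, output = 0, []
    for symbol in symbols:
        arc = ARCS.get(state, {}).get(symbol)
        if arc is None:
            return None  # no matching arc: the input is not accepted
        out, state = arc
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None


print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
print(transduce(["c", "a", "t", "+N"]))         # cat
print(transduce(["c", "a", "b"]))               # None
```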

Porter Stemmer

• Returns the stem of each word

• Input: cats, output: cat

• Input: positivity, output: posit

• Input: pitted, output: pit

Porter Stemmer

• ATIONAL : ATE (relational -> relate)

• ING : ε (motoring -> motor)

• SSES : SS (grasses -> grass)
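
To make the rewrite rules concrete, here is a minimal Python sketch that applies only the three rules listed above. It is not the Porter stemmer: the real algorithm runs several ordered steps and attaches conditions to its rules. The `RULES` table and `apply_rules` name are assumptions for this example.

```python
# (suffix, replacement) pairs from the slide; the empty string is epsilon.
RULES = [
    ("ational", "ate"),
    ("sses", "ss"),
    ("ing", ""),
]


def apply_rules(word):
    """Apply the first rule whose suffix matches; otherwise return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["relational", "motoring", "grasses", "cat"]:
    print(w, "->", apply_rules(w))
# relational -> relate
# motoring -> motor
# grasses -> grass
# cat -> cat
```

Without conditions this sketch would also turn “sing” into “s”; the Porter stemmer avoids that by requiring the remaining stem to contain a vowel before it strips ING.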

Porter Stemmer

• Errors:

– Organization -> organ

– Doing -> do

– Policy -> polici

Tokenization

• Breaking a text into words or sentences

Tokenization

• Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55.

Tokenization

• Simplest tokenizer is regex-based
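
A minimal sketch of a regex-based tokenizer in Python, run on the example sentence from the earlier slide. The pattern (runs of word characters, or any single non-space symbol) and the `tokenize` name are assumptions chosen for simplicity.

```python
import re

# A token is a run of word characters or one non-space, non-word symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55."
print(tokenize(sentence))
```

The output shows why this is only a starting point: “Mrs.” is split into two tokens, “Wilson’s” into three, and “$15.55” into four.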

Tokenization

• IndoEuropean Tokenizer

• General purpose alphabetic

• Token = letters + numbers

• Splits on whitespace, punctuation, special characters

Sentence Tokenization

• What is the challenge?

Sentence Tokenization

• Binary Classifier
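
One way to picture the binary decision is a function that answers yes or no for every period. The Python sketch below uses hand-written rules as a stand-in for a trained classifier. The abbreviation list, the function names, and the rules themselves are assumptions for illustration only.

```python
import re

# Hypothetical abbreviation list; a real system would learn such cues from data.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "inc", "corp", "co"}
CLOSERS = "\"'”’"  # closing quotes that may follow a sentence-final period


def is_sentence_end(before, after):
    """Binary decision for one period: True if it ends a sentence.

    `before` is the word preceding the period, `after` the text following it.
    """
    if before.lower() in ABBREVIATIONS:
        return False  # "Mrs." rarely ends a sentence
    if before.isdigit() and after[:1].isdigit():
        return False  # decimal point, as in 15.55
    following = after.lstrip(" " + CLOSERS)
    return following == "" or following[0].isupper()


def split_sentences(text):
    """Split text at every period the decision function labels as a boundary."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        words = re.findall(r"\w+$", text[start:match.start()])
        before = words[0] if words else ""
        if is_sentence_end(before, text[match.end():]):
            end = match.end()
            # Keep closing quotes attached to the sentence they end.
            while end < len(text) and text[end] in CLOSERS:
                end += 1
            sentences.append(text[start:end].strip())
            start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


text = "Mrs. Wilson’s reaction to the damage was “quite positive.” She asked for $15.55."
for s in split_sentences(text):
    print(s)
```

On the example sentence this yields two sentences, leaving the periods in “Mrs.” and “$15.55” alone.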

Stop List

• List of words to remove

• [the, a, an…]

Stop List

• EnglishStopTokenizerFactory:

“a, be, had, it, only, she, was, about, because, has, its, of, some, we, after, been, have, last, on, such, were, all, but, he, more, one, than, when, also, by, her, most, or, that, which, an, can, his, mr, other, the, who, any, co, if, mrs, out, their, will, and, corp, in, ms, over, there, with, are, could, inc, mz, s, they, would, as, for, into, no, so, this, up, at, from, is, not, says, to”
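
A minimal Python sketch of stop-word filtering. It does not use the EnglishStopTokenizerFactory API. The `stop_tokenize` function, the lowercasing step, and the shortened stop set (a handful of entries taken from the list above) are assumptions for illustration.

```python
import re

# A few entries from the stop list above; extend as needed.
STOP_WORDS = {"a", "an", "the", "to", "was", "for", "she", "s", "mrs"}


def stop_tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(stop_tokenize("She asked for the damage report."))
# ['asked', 'damage', 'report']
```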

Homework

• Program a stop list tokenizer (you can use my example as a starting point)

• Blog about what makes a good stop list, how major search engines use them, and how yours compares