TRANSCRIPT
Ling 570
Day #3
Stemming, Probabilistic Automata, Markov Chains/Model
MORPHOLOGY AND FSTS
FST as Translator
FR: ce bill met de le baume sur une blessure
EN: this bill puts balm on a sore wound
Last Class
FST Application Examples
• Case folding: "He said" → "he said"
• Tokenization: "He ran." → " He ran . "
• POS tagging: "They can fish" → PRO VERB NOUN
FST Application Examples
• Pronunciation: B AH T EH R → B AH DX EH R
• Morphological generation: Fox +s → Foxes
• Morphological analysis: cats → cat +s
Roadmap
• Motivation: Representing words
• A little (mostly English) morphology
• Stemming
The Lexicon
• Goal: Represent all the words in a language
• Approach?
  – Enumerate all words?
    • Doable for English
      – Typical for ASR (Automatic Speech Recognition)
      – English is morphologically relatively impoverished
    • Other languages? Wildly impractical
      » Turkish: 40,000 forms/verb
        uygarlaştıramadıklarımızdanmışsınızcasına
        “(behaving) as if you are among those whom we could not civilize”
Morphological Parsing
• Goal: Take a surface word form and generate a linguistic structure of component morphemes
• A morpheme is the minimal meaning-bearing unit in a language.
  – Stem: the morpheme that forms the central meaning unit in a word
  – Affix: prefix, suffix, infix, circumfix
    • Prefix: e.g., possible → impossible
    • Suffix: e.g., walk → walking
    • Infix: e.g., hingi → humingi (Tagalog)
    • Circumfix: e.g., sagen → gesagt (German)
Surface Variation & Morphology
• Searching (a la Bing) for documents about:
  – Televised sports
• Many possible surface forms:
  – Televised, television, televise, …
  – Sports, sport, sporting, …
• How can we match?
  – Convert surface forms to a common base form
    • Stemming or morphological analysis
Two Perspectives
• Stemming:
  – writing → write (or writ)
  – Beijing → Beije
• Morphological Analysis:
  – writing → write+V+prog
  – cats → cat+N+pl
  – writes → write+V+3rdpers+Sg
Stemming
• Simple type of morphological analysis
• Supports matching using base form
  – e.g., television, televised, televising → televise
• Most popular: Porter stemmer
• Task: Given surface form, produce base form
  – Typically removes suffixes
• Model:
  – Rule cascade
  – No lexicon!
Stemming
• Used in many NLP/IR applications
• For building equivalence classes:
  – Connect, Connected, Connecting, Connection, Connections → same class; suffixes irrelevant
• Porter Stemmer: simple and efficient
  – Website: http://www.tartarus.org/~martin/PorterStemmer
  – On patas: ~/dropbox/12-13/570/porter
Porter Stemmer
• Rule cascade:
  – Rule form: (condition) PATT1 → PATT2
    • E.g., stem contains vowel: ING → ε
    • ATIONAL → ATE
  – Rule partial order:
    • Step 1a: -s
    • Step 1b: -ed, -ing
    • Steps 2-4: derivational suffixes
    • Step 5: cleanup
• Pros: Simple, fast, buildable for a variety of languages
• Cons: Overaggressive and underaggressive
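The rule cascade can be sketched in a few lines. This is a toy illustration of the rule format above, not the full Porter algorithm (which has measure-based conditions and many more rules):

```python
import re

def has_vowel(stem):
    """The 'stem contains a vowel' condition used by rules like ING -> epsilon."""
    return re.search(r"[aeiou]", stem) is not None

def porter_like_stem(word):
    """A toy cascade in the spirit of Porter's Steps 1a, 1b, and 2."""
    w = word.lower()
    # Step 1a (plural -s): SSES -> SS, IES -> I, S -> epsilon
    if w.endswith("sses"):
        w = w[:-2]
    elif w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Step 1b: (stem contains vowel) ING -> epsilon, ED -> epsilon
    if w.endswith("ing") and has_vowel(w[:-3]):
        w = w[:-3]
    elif w.endswith("ed") and has_vowel(w[:-2]):
        w = w[:-2]
    # A Step-2-style derivational rule: ATIONAL -> ATE
    if w.endswith("ational"):
        w = w[:-7] + "ate"
    return w
```

Note how the vowel condition blocks "sing" → "s" while still allowing "walking" → "walk", and how the rules apply in cascade order.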
STEMMING & EVAL
Evaluating Performance
• Measures of stemming performance rely on metrics used in IR:
  – Precision: the proportion of selected items the system got right
    • precision = tp / (tp + fp)
    • # of correct answers / # of answers given
  – Recall: the proportion of the target items the system selected
    • recall = tp / (tp + fn)
    • # of correct answers / # of possible correct answers
  – Rule of thumb: as precision increases, recall drops, and vice versa
• These metrics are widely adopted in statistical NLP
Precision and Recall
• Take a given stemming task:
  – Suppose there are 100 words that could be stemmed
  – A stemmer gets 52 of these right (tp)
  – But it inadvertently stems 10 others (fp)
• Precision = 52 / (52 + 10) ≈ .84
• Recall = 52 / (52 + 48) = .52
• Note: it is easy to get a precision of 1.0. Why?
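The slide's arithmetic as a small helper; F-measure (used in the tokenizer comparison that follows) is the harmonic mean of precision and recall:

```python
def precision_recall_f1(tp, fp, fn):
    """IR-style metrics as defined on the slide; F1 is their harmonic mean."""
    precision = tp / (tp + fp)  # correct answers / answers given
    recall = tp / (tp + fn)     # correct answers / possible correct answers
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The slide's example: 52 correct stems (tp), 10 spurious stems (fp),
# and 48 of the 100 stemmable words missed (fn).
p, r, f = precision_recall_f1(tp=52, fp=10, fn=48)
```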
[Slide: token-by-token output of a baseline and four tokenizers on the sentence “After coming close to a partial settlement a year ago, shareholders who filed civil suits against Ivan F. Boesky & Co. L.P. …”; the tokenizers differ mainly on punctuation and clitics (F., Co., L.P., Drexel's, plaintiffs').]

Tokenizer     Precision   Recall      F-Measure
Tokenizer 1   0.827586    0.888889    0.858238
Tokenizer 2   0.961538    0.925926    0.943732
Tokenizer 3   0.928571    0.962963    0.945767
Tokenizer 4   1           1           1
WEIGHTED AUTOMATA & MARKOV CHAINS
PFA Definition
• A Probabilistic Finite-State Automaton is a 6-tuple:
  – A set of states Q
  – An alphabet Σ
  – A set of transitions δ ⊆ Q × Σ × Q
  – Initial state probabilities: I: Q → R+
  – Transition probabilities: P: δ → R+
  – Final state probabilities: F: Q → R+
PFA Recap
• Subject to constraints:
  – Σ_{q∈Q} I(q) = 1
  – For every q ∈ Q: F(q) + Σ_{(q,a,q′)∈δ} P(q,a,q′) = 1
• Computing sequence probabilities: multiply the initial-state probability, the transition probabilities along the path, and the final-state probability (summing over paths when more than one path accepts the string)
PFA Example
• Example:
  – I(q0) = 1; I(q1) = 0
  – F(q0) = 0; F(q1) = 0.2
  – P(q0,a,q1) = 1; P(q1,b,q1) = 0.8
  – P(ab^n) = I(q0) · P(q0,a,q1) · P(q1,b,q1)^n · F(q1) = 0.8^n · 0.2
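The example can be checked numerically with a direct transcription of the formula:

```python
def p_ab_n(n):
    """P(a b^n) under the example PFA: start in q0 with probability 1,
    read 'a' to reach q1, loop on 'b' with probability 0.8,
    and stop in q1 with probability 0.2."""
    I_q0 = 1.0
    P_q0_a_q1 = 1.0
    P_q1_b_q1 = 0.8
    F_q1 = 0.2
    return I_q0 * P_q0_a_q1 * (P_q1_b_q1 ** n) * F_q1
```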
Markov Chain
• A Markov Chain is a special case of a PFA in which the input sequence uniquely determines which states the automaton will go through.
• Markov Chains cannot represent inherently ambiguous problems
  – But they can assign probabilities to unambiguous sequences
Markov Chain for Words
Markov Chain for Pronunciation
• Observations: 0/1
Markov Chain for Walking through Groningen
Markov Chain: “First-order observable Markov Model”
• A set of states:
  – Q = q1, q2, …, qN; the state at time t is qt
• Transition probabilities:
  – A set of probabilities A = a01, a02, …, an1, …, ann
  – Each aij represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A
• Distinguished start and final states: q0, qF
• The current state depends only on the previous state
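The last bullet is the first-order Markov assumption; spelled out as an equation:

```latex
P(q_i \mid q_1 q_2 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
```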
Markov Models
• The parameters of an MM can be arranged in matrices
• The A-matrix for the set of transition probabilities:

      [ p11  p12  …  p1j ]
  A = [ p21  p22  …  p2j ]
      [ …                ]

• What's missing? Starting probabilities.
Markov Models
• Exercise:
  – Build the transition probability matrix over this set of data:
      The duck died.
      The car killed the duck.
      The duck died under her car.
      We duck under the car.
      We retrieve the poor duck.
  – Build the starting probability matrix
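One way to do the exercise mechanically: tally starting words and word bigrams, then normalize the counts. This is a maximum-likelihood sketch; it lowercases words, drops punctuation, and normalizes over observed bigrams only:

```python
from collections import Counter, defaultdict

# The exercise corpus from the slide (lowercased; final periods dropped).
sentences = [
    "the duck died",
    "the car killed the duck",
    "the duck died under her car",
    "we duck under the car",
    "we retrieve the poor duck",
]

start_counts = Counter()
trans_counts = defaultdict(Counter)
for s in sentences:
    words = s.split()
    start_counts[words[0]] += 1
    for prev, cur in zip(words, words[1:]):
        trans_counts[prev][cur] += 1

# Maximum-likelihood estimates: counts normalized per previous word.
start_probs = {w: c / len(sentences) for w, c in start_counts.items()}
trans_probs = {
    prev: {cur: c / sum(nexts.values()) for cur, c in nexts.items()}
    for prev, nexts in trans_counts.items()
}
```

For instance, "the" starts 3 of the 5 sentences, so its starting probability is 0.6, and 3 of the 6 bigrams beginning with "the" continue with "duck", giving P(duck | the) = 0.5.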
Markov Models
• Exercise:
  – Given your model, what's the probability of each of the following sentences?
      The duck died under her car.
      We duck under the car.
      The duck under the car.
      We retrieve killed the duck.
      We the poor duck died.
      We retrieve the poor duck under the car.
  – For a given start state (The, We), what's the most likely string (of the above)?
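Sentence probabilities can then be read off as a product of the start probability and the transition probabilities. The counts below are tallied by hand from the five exercise sentences and normalized over observed bigrams only, ignoring where sentences end; that is one of several reasonable conventions for this exercise:

```python
# A first-order Markov chain hand-estimated from the exercise corpus.
start_probs = {"the": 3/5, "we": 2/5}
trans_probs = {
    "the": {"duck": 3/6, "car": 2/6, "poor": 1/6},
    "duck": {"died": 2/3, "under": 1/3},
    "died": {"under": 1.0},
    "car": {"killed": 1.0},
    "killed": {"the": 1.0},
    "under": {"her": 1/2, "the": 1/2},
    "her": {"car": 1.0},
    "we": {"duck": 1/2, "retrieve": 1/2},
    "retrieve": {"the": 1.0},
    "poor": {"duck": 1.0},
}

def sentence_prob(sentence):
    """Product of the start probability and bigram transition probabilities;
    any unseen start word or bigram drives the probability to zero."""
    words = sentence.lower().rstrip(".").split()
    p = start_probs.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans_probs.get(prev, {}).get(cur, 0.0)
    return p
```

Under this model, ungrammatical strings such as "We retrieve killed the duck." contain an unseen bigram and therefore score zero.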
HMMs
• Next class