
LING 388 Language and Computers

Lecture 21

11/13/03

Sandiway FONG

Administrivia

Room change next Tuesday
  One-time deal :-)
  PAS 224
    Physics and Atmospheric Sciences Building

Morphology

Inflectional Morphology:
  Phi-features (person, number, gender)
    Examples: movie-s, blond-e, actr-ess
    Irregular examples: appendices, geese
  Case
    Examples: he/him, who/whom
  Comparatives and superlatives
    Examples: happi-er/happi-est
  Tense
    Examples: drive/drive-s/drove (-ed)/driv-en

Morphology

Derivational Morphology
  Nominalization
    Examples: formalizat-ion, inform-ant, inform-er, refus-al, loss-age
  Deadjectivals
    Examples: weak-en, happi-ness, simpl-ify, formal-ize, slow-ly, calm
  Deverbals
    Examples: see nominalizations, read-able, employ-ee
  Denominals
    Examples: form-al, bridge, ski, coward-ly, use-ful

Morphology and Semantics

Suffixation
  Examples:
    x employ y
      • employ-ee: picks out y
      • employ-er: picks out x
    x read y
      • read-able: picks out y
Prefixation
  Examples:
    un-do, re-do, un-re-do, en-code, de-frost, a-symmetric, mal-formed, ill-formed, pro-Chomsky

Stemming

Normalization procedure:
  Inflectional morphology:
    cities -> city, improves/improved -> improve
  Derivational morphology:
    transformation/transformational -> transform
Criterion: preserve meaning
Primary application: information retrieval (IR)
  Efficacy questioned: Harman (1991)

Stemming

IR-centric view:
  Applies to open-class lexical items only:
    Stop-words: the, below, being, does
  Not full morphology:
    prefixes generally excluded (not meaning preserving)
    Examples: a-symmetric, un-do, en-coding

Stemming: Methods

Use a dictionary (look-up)
  OK for English, not for languages with more productive morphology, e.g. Japanese
Write rules, e.g. Porter Algorithm (Porter, 1980)
  Example:
    Ends in doubled consonant (not “l”, “s” or “z”), remove last character
      • hopping -> hop, hissing -> hiss
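The doubled-consonant rule above can be sketched in Python roughly as follows. This is only an illustration of that one rule, not Porter's full procedure: the function names `undouble` and `strip_ing` are hypothetical, only the -ing suffix is handled, and the other conditions the full algorithm checks are skipped.

```python
VOWELS = "aeiou"

def undouble(stem):
    """Drop the last character if the stem ends in a doubled consonant
    other than l, s or z."""
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        last = stem[-1]
        if last not in VOWELS and last not in "lsz":
            return stem[:-1]
    return stem

def strip_ing(word):
    """Remove a final -ing, then apply the doubled-consonant rule.
    (Hypothetical helper: no check that the remaining stem is a sensible word.)"""
    if word.endswith("ing"):
        return undouble(word[:-3])
    return word

for w in ["hopping", "hissing"]:
    print(w, "->", strip_ing(w))
```

The exclusion of l, s and z is what keeps hiss (and, similarly, fall and fizz) intact while hopp is shortened to hop.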

Stemming: Methods

Dictionary approach not enough
  Example: (Porter, 1991)
    routed -> route/rout
      • At Waterloo, Napoleon’s forces were routed
      • The cars were routed off the highway
    Here, the (inflected) verb form is polysemous
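A toy look-up table makes the problem concrete. The sketch below assumes a hypothetical lexicon that maps each inflected form to the set of base forms it could come from. The entries and the `lookup` helper are made up for illustration.

```python
# Hypothetical toy lexicon: inflected form -> possible base forms.
LEXICON = {
    "cities": {"city"},
    "improved": {"improve"},
    "routed": {"route", "rout"},
}

def lookup(form):
    """Return every base form listed for an inflected form (sorted).
    Unknown forms are returned unchanged."""
    return sorted(LEXICON.get(form, {form}))

print(lookup("cities"))  # a single candidate
print(lookup("routed"))  # two candidates: the table alone cannot choose
```

Choosing between the two candidates for routed needs the sentence context, which a plain dictionary look-up does not have.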

Stemming: Errors

Understemming: failure to merge
  adhere/adhesion
Overstemming: incorrect merge
  probe/prob-able
    Claim: -able irregular suffix, root: probare (Lat.)
Mis-stemming: removing a non-suffix (Porter, 1991)
  rep-ly -> rep

Stemming: Interaction

Interacts with noun compounding:
  Example:
    operat-ing system-s
    negat-ive polar-ity item-s
For IR, compounds need to be identified first…

Stemming: Porter Algorithm

The Porter Stemmer (Porter, 1980)
  http://www.tartarus.org/~martin/PorterStemmer/
  For English
  Most widely used system
  Manually written rules
  5-stage approach to extracting roots
  Considers suffixes only
  May produce non-word roots

Stemming: Porter Algorithm

Rule format:
  (condition on stem) suffix1 -> suffix2
  In case of conflict, prefer longest suffix match
“Measure” of a word is m in:
  (C) (VC)^m (V)
  C = sequence of one or more consonants
  V = sequence of one or more vowels
  Examples:
    tree      C(VC)^0 V
    troubles  C(VC)^2
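The measure can be computed by collapsing a word into alternating C and V runs and counting the VC pairs. The sketch below assumes lowercase input and, following Porter (1980) rather than anything stated on the slide, treats y as a vowel when it follows a consonant. The function names are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant. Assumption from Porter (1980):
    y is a consonant only word-initially or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Collapse the word into alternating runs, e.g. 'troubles' -> 'CVCVC'."""
    runs = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs)

def measure(word):
    """m = number of VC pairs in the collapsed pattern."""
    return cv_pattern(word).count("VC")

for w in ["tree", "troubles"]:
    print(w, cv_pattern(w), measure(w))
```

For the two slide examples this gives the pattern CV with m = 0 for tree and CVCVC with m = 2 for troubles.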

Stemming: Porter Algorithm

Step 1a: remove plural suffixation
  SSES -> SS (caresses)
  IES -> I (ponies)
  SS -> SS (caress)
  S -> (cats)
Step 1b: remove verbal inflection
  (m>0) EED -> EE (agreed, feed)
  (*v*) ED -> (plastered, bled)
  (*v*) ING -> (motoring, sing)

Stemming: Porter Algorithm

Step 1b: (contd. for -ed and -ing rules)
  AT -> ATE (conflated)
  BL -> BLE (troubled)
  IZ -> IZE (sized)
  (*doubled c & ¬(*L v *S v *Z)) -> single c (hopping, hissing, falling, fizzing)
  (m=1 & *cvc) -> E (filing, failing, slowing)
Step 1c: Y and I
  (*v*) Y -> I (happy, sky)
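Steps 1b and 1c can be sketched together in Python. Two details below come from Porter (1980) rather than from the slide and should be read as assumptions: y counts as a vowel after a consonant, and the *cvc condition excludes stems whose final consonant is w, x or y (which is why slowing gives slow rather than slowe). All function names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # assumption: y is a vowel after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    runs = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs).count("VC")

def has_vowel(stem):  # the *v* condition
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_cvc(stem):  # the *cvc condition; final c must not be w, x or y
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def cleanup(stem):
    """Repairs applied after -ed or -ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1) and stem[-1] not in "lsz"):
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(word):
    if word.endswith("eed"):  # longest match: never falls through to -ed
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if has_vowel(stem) else word
    return word

def step1c(word):
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

for w in ["agreed", "feed", "bled", "sized", "hopping", "filing", "slowing"]:
    print(w, "->", step1b(w))
```

The words feed, bled and sing are left alone because the stem condition fails, not because the suffix is absent.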

Stemming: Porter Algorithm

Step 2: Peel one suffix off for multiple suffixes
  (m>0) ATIONAL -> ATE (relational)
  (m>0) TIONAL -> TION (conditional, rational)
  (m>0) ENCI -> ENCE (valenci)
  (m>0) ANCI -> ANCE (hesitanci)
  (m>0) IZER -> IZE (digitizer)
  (m>0) ABLI -> ABLE (conformabli) - able (step 4)
  …
  (m>0) IZATION -> IZE (vietnamization)
  (m>0) ATION -> ATE (predication)
  …
  (m>0) IVITI -> IVE (sensitiviti)

Stemming: Porter Algorithm

Step 3
  (m>0) ICATE -> IC (triplicate)
  (m>0) ATIVE -> (formative)
  (m>0) ALIZE -> AL (formalize)
  (m>0) ICITI -> IC (electriciti)
  (m>0) ICAL -> IC (electrical, chemical)
  (m>0) FUL -> (hopeful)
  (m>0) NESS -> (goodness)
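Steps 2 and 3 share one shape: find the longest matching suffix, test m > 0 on what is left, and substitute. The Python sketch below encodes only the rules listed on the slides (the elided ones are not filled in). It assumes lowercase input and Porter's (1980) treatment of y. `apply_rules` is a hypothetical helper name.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # assumption: y is a vowel after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    runs = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs).count("VC")

# Only the rules shown on the slides, as (suffix, replacement) pairs.
STEP2 = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
         ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
         ("ization", "ize"), ("ation", "ate"), ("iviti", "ive")]
STEP3 = [("icate", "ic"), ("ative", ""), ("alize", "al"), ("iciti", "ic"),
         ("ical", "ic"), ("ful", ""), ("ness", "")]

def apply_rules(word, rules):
    """Apply the rule with the longest matching suffix, if its stem has m > 0.
    If that rule's condition fails, shorter suffixes are not tried."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["relational", "conditional", "rational", "vietnamization"]:
    print(w, "->", apply_rules(w, STEP2))
for w in ["triplicate", "formalize", "hopeful"]:
    print(w, "->", apply_rules(w, STEP3))
```

Because the longest match wins, rational is tested against ATIONAL (stem r, m = 0) and left unchanged, rather than being shortened by the TIONAL rule.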

Stemming: Porter Algorithm

Step 4: Delete last suffix
  (m>1) AL -> (revival) - revive, see step 5
  (m>1) ANCE -> (allowance, dance)
  (m>1) ENCE -> (inference, fence)
  (m>1) ER -> (airliner, employer)
  (m>1) IC -> (gyroscopic, electric)
  (m>1) ABLE -> (adjustable, mov(e)able)
  (m>1) IBLE -> (defensible, bible)
  (m>1) ANT -> (irritant, ant)
  (m>1) EMENT -> (replacement)
  (m>1) MENT -> (adjustment)
  …
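The stricter m > 1 condition is what separates the example pairs above: in allowance the stem is long enough for -ance to be deleted, in dance it is not. A Python sketch under the same assumptions as before (lowercase input, Porter's treatment of y, only the suffixes shown on the slide). `step4` is an illustrative name.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # assumption: y is a vowel after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    runs = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs).count("VC")

# Only the step 4 suffixes shown on the slide; the full list is longer.
STEP4_SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ible",
                  "ant", "ement", "ment"]

def step4(word):
    """Delete the longest matching suffix if the remaining stem has m > 1."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if measure(stem) > 1 else word
    return word

pairs = [("allowance", "dance"), ("inference", "fence"),
         ("irritant", "ant"), ("replacement", "agreement")]
for long_word, short_word in pairs:
    print(long_word, "->", step4(long_word), "|", short_word, "->", step4(short_word))
```

In this sketch agreement is left untouched: the longest match is EMENT, its stem agre has m = 1, and the shorter MENT rule is never tried.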

Stemming: Porter Algorithm

Step 5a: remove e
  (m>1) E -> (probate, rate)
  (m=1 & ¬*cvc) E -> (cease)
Step 5b: ll reduction
  (m>1 & *LL) -> L (controller, roll)
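A Python sketch of step 5. Following Porter (1980), the second 5a rule is taken here to apply when m = 1 (cease has a stem with m = 1), and *cvc again excludes a final w, x or y. Both points, like the treatment of y, are assumptions carried over from the original paper, and the function names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # assumption: y is a vowel after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    runs = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)
    return "".join(runs).count("VC")

def ends_cvc(stem):  # final consonant must not be w, x or y
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step5a(word):
    """Remove a final e when m > 1, or when m = 1 and the stem is not cvc."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word

def step5b(word):
    """Reduce a final ll to l when m > 1."""
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word

for w in ["probate", "rate", "cease"]:
    print(w, "->", step5a(w))
for w in ["controll", "roll"]:
    print(w, "->", step5b(w))
```

The input controll is what remains of controller once step 4 has removed -er.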

Stemming: Porter Algorithm

Misses (understemming)
  Unaffected:
    agreement  (VC)^1 VCC - step 4 (m>1)
    adhesion
  Irregular morphology:
    drove, geese
Overstemming
  relat-ivity - step 2
Mis-stemming
  wand-er  C(VC)^1 VC

Stemming: Porter Algorithm

Possible Term Project
  The Porter Stemmer is a rule-based system
  We know how to write rules
  Implement the Porter Stemmer in SWI-Prolog