LING 388: Language and Computers, Lecture 21, 11/13/03, Sandiway Fong
Administrivia
Room change next Tuesday
  One-time deal :-)
  PAS 224
  Physics and Atmospheric Sciences Building
Morphology
Inflectional Morphology:
  Phi-features (person, number, gender)
    Examples: movie-s, blond-e, actr-ess
    Irregular examples: appendices, geese
  Case
    Examples: he/him, who/whom
  Comparatives and superlatives
    Examples: happi-er/happi-est
  Tense
    Examples: drive/drive-s/dr-ove (-ed)/driv-en
Morphology
Derivational Morphology
  Nominalization
    Examples: formalizat-ion, inform-ant, inform-er, refus-al, loss-age
  Deadjectivals
    Examples: weak-en, happi-ness, simpl-ify, formal-ize, slow-ly, calm
  Deverbals
    Examples: see nominalizations, read-able, employ-ee
  Denominals
    Examples: form-al, bridge, ski, coward-ly, use-ful
Morphology and Semantics
Suffixation
  Examples:
    x employ y
      • employ-ee: picks out y
      • employ-er: picks out x
    x read y
      • read-able: picks out y
Prefixation
  Examples:
    un-do, re-do, un-re-do, en-code, de-frost, a-symmetric, mal-formed, ill-formed, pro-Chomsky
Stemming
Normalization procedure:
  Inflectional morphology:
    cities -> city, improves/improved -> improve
  Derivational morphology:
    transformation/transformational -> transform
Criterion: preserve meaning
Primary application: information retrieval (IR)
  Efficacy questioned: Harman (1991)
Stemming
IR-centric view:
  Applies to open-class lexical items only:
    Stop-words: the, below, being, does
  Not full morphology:
    Prefixes generally excluded (not meaning preserving)
    Examples: asymmetric, undo, encoding
Stemming: Methods
Use a dictionary (look-up)
  OK for English, not for languages with more productive morphology, e.g. Japanese
Write rules, e.g. Porter Algorithm (Porter, 1980)
  Example:
    Ends in doubled consonant (not “l”, “s” or “z”), remove last character
      • hopping -> hop, hissing -> hiss
Stemming: Methods
Dictionary approach not enough
  Example: (Porter, 1991)
    routed -> route/rout
      • At Waterloo, Napoleon’s forces were routed
      • The cars were routed off the highway
  Here, the (inflected) verb form is polysemous
Stemming: Errors
Understemming: failure to merge
  adhere/adhesion
Overstemming: incorrect merge
  probe/probable
  Claim: -able irregular suffix, root: probare (Lat.)
Mis-stemming: removing a non-suffix (Porter, 1991)
  reply -> rep
Stemming: Interaction
Interacts with noun compounding:
  Example:
    operating systems
    negative polarity items
For IR, compounds need to be identified first…
Stemming: Porter Algorithm
The Porter Stemmer (Porter, 1980)
  http://www.tartarus.org/~martin/PorterStemmer/
  For English
  Most widely used system
  Manually written rules
  5-stage approach to extracting roots
  Considers suffixes only
  May produce non-word roots
Stemming: Porter Algorithm
Rule format:
  (condition on stem) suffix1 -> suffix2
  In case of conflict, prefer longest suffix match
“Measure” of a word is m in:
  (C)(VC)^m(V)
  C = sequence of one or more consonants
  V = sequence of one or more vowels
  Examples:
    tree      C(VC)^0V
    troubles  C(VC)^2
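The measure can be computed by labelling each letter as consonant or vowel and counting vowel-to-consonant transitions. The sketch below is one way to do that, not the reference implementation. The names `is_consonant` and `measure` are illustrative, and the treatment of "y" (a vowel only when it follows a consonant) is taken from Porter (1980) rather than from the definition above.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant.

    Assumption from Porter (1980): 'y' counts as a vowel only when the
    preceding letter is a consonant.
    """
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m in (C)(VC)^m(V): the number of vowel->consonant transitions."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

print(measure("tree"))      # 0
print(measure("troubles"))  # 2
```

A trailing vowel sequence never adds to m, so "trouble" has measure 1 while "troubles" has measure 2.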
Stemming: Porter Algorithm
Step 1a: remove plural suffixation
  SSES -> SS (caresses)
  IES -> I (ponies)
  SS -> SS (caress)
  S -> (cats)
Step 1b: remove verbal inflection
  (m>0) EED -> EE (agreed, feed)
  (*v*) ED -> (plastered, bled)
  (*v*) ING -> (motoring, sing)
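A minimal sketch of the Step 1a and Step 1b rules listed above, assuming lowercase input. The function names are illustrative, and the handling of "y" in `is_consonant` follows Porter (1980). Suffixes are tested longest first, and once the longest matching suffix is found no shorter rule is tried, which is why "feed" is left alone.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # Assumption from Porter (1980): 'y' is a vowel only after a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # m in (C)(VC)^m(V): count vowel->consonant transitions.
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

def has_vowel(stem):
    # The *v* condition: the stem contains at least one vowel.
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S ->
    return word

def step_1b_first_pass(word):
    if word.endswith("eed"):
        # (m>0) EED -> EE; if the condition fails, stop here.
        return word[:-1] if measure(word[:-3]) > 0 else word
    if word.endswith("ed"):
        return word[:-2] if has_vowel(word[:-2]) else word   # (*v*) ED ->
    if word.endswith("ing"):
        return word[:-3] if has_vowel(word[:-3]) else word   # (*v*) ING ->
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["agreed", "feed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b_first_pass(w))
```

The second word in each pair of examples (feed, bled, sing) fails the condition on the stem and comes back unchanged.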
Stemming: Porter Algorithm
Step 1b: (contd. for -ed and -ing rules)
  AT -> ATE (conflated)
  BL -> BLE (troubled)
  IZ -> IZE (sized)
  (*doubled c & ¬(*L v *S v *Z)) -> single c (hopping, hissing, falling, fizzing)
  (m=1 & *cvc) -> E (filing, failing, slowing)
Step 1c: Y and I
  (*v*) Y -> I (happy, sky)
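The clean-up rules apply to what is left after -ed or -ing has been removed, so the sketch below takes the bare stem as input (e.g. "hopp" from "hopping"). The function names are illustrative. Two details are assumptions taken from Porter (1980) rather than from the rules above: "y" is a vowel only after a consonant, and the *cvc condition excludes a final w, x or y, which is what keeps "slow" from becoming "slowe".

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # Assumption from Porter (1980): 'y' is a vowel only after a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    # *cvc, with the final consonant not w, x or y (Porter, 1980).
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def step_1b_cleanup(stem):
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"                       # AT -> ATE, BL -> BLE, IZ -> IZE
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]                        # doubled c -> single c
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"                       # (m=1 & *cvc) -> E
    return stem

def step_1c(word):
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"                  # (*v*) Y -> I
    return word

for s in ["conflat", "troubl", "siz", "hopp", "hiss", "fall", "fizz",
          "fil", "fail", "slow"]:
    print(s, "->", step_1b_cleanup(s))
print(step_1c("happy"), step_1c("sky"))
```

Only "conflat", "troubl", "siz", "hopp" and "fil" are changed by the clean-up; the other stems fail every condition.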
Stemming: Porter Algorithm
Step 2: Peel one suffix off for multiple suffixes
  (m>0) ATIONAL -> ATE (relational)
  (m>0) TIONAL -> TION (conditional, rational)
  (m>0) ENCI -> ENCE (valenci)
  (m>0) ANCI -> ANCE (hesitanci)
  (m>0) IZER -> IZE (digitizer)
  (m>0) ABLI -> ABLE (conformabli), -able: see step 4
  …
  (m>0) IZATION -> IZE (vietnamization)
  (m>0) ATION -> ATE (predication)
  …
  (m>0) IVITI -> IVE (sensitiviti)
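Step 2 lends itself to a table-driven sketch: a list of (suffix, replacement) pairs, tried longest first, with the m>0 condition checked on the stem that remains. The table below holds only the rules listed above; the elided ones are left out. The names `STEP_2_RULES` and `step_2` are illustrative, and the "y" handling in `is_consonant` follows Porter (1980).

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # Assumption from Porter (1980): 'y' is a vowel only after a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

# Only the rules shown above; this is not the complete Step 2 table.
STEP_2_RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("enci", "ence"),
    ("anci", "ance"),
    ("izer", "ize"),
    ("abli", "able"),
    ("ization", "ize"),
    ("ation", "ate"),
    ("iviti", "ive"),
]

def step_2(word):
    # Prefer the longest matching suffix; if its condition fails, stop.
    for suffix, replacement in sorted(STEP_2_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ["relational", "conditional", "rational", "valenci", "digitizer",
          "vietnamization", "predication", "sensitiviti"]:
    print(w, "->", step_2(w))
```

"rational" matches ATIONAL, but the remaining stem "r" has m=0, so the word is returned unchanged and the shorter TIONAL rule is never tried.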
Stemming: Porter Algorithm
Step 3
  (m>0) ICATE -> IC (triplicate)
  (m>0) ATIVE -> (formative)
  (m>0) ALIZE -> AL (formalize)
  (m>0) ICITI -> IC (electriciti)
  (m>0) ICAL -> IC (electrical, chemical)
  (m>0) FUL -> (hopeful)
  (m>0) NESS -> (goodness)
Stemming: Porter Algorithm
Step 4: Delete last suffix
  (m>1) AL -> (revival) - revive, see step 5
  (m>1) ANCE -> (allowance, dance)
  (m>1) ENCE -> (inference, fence)
  (m>1) ER -> (airliner, employer)
  (m>1) IC -> (gyroscopic, electric)
  (m>1) ABLE -> (adjustable, mov(e)able)
  (m>1) IBLE -> (defensible, bible)
  (m>1) ANT -> (irritant, ant)
  (m>1) EMENT -> (replacement)
  (m>1) MENT -> (adjustment)
  …
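The stricter m>1 condition is what protects short words such as "dance", "fence", "bible" and "ant". The sketch below covers only the suffixes listed above and omits the elided rules. The names `STEP_4_SUFFIXES` and `step_4` are illustrative, and the "y" handling in `is_consonant` follows Porter (1980).

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # Assumption from Porter (1980): 'y' is a vowel only after a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

# Only the suffixes shown above; this is not the complete Step 4 list.
STEP_4_SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ible",
                   "ant", "ement", "ment"]

def step_4(word):
    # Longest matching suffix wins; delete it only if the stem has m > 1.
    for suffix in sorted(STEP_4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if measure(stem) > 1 else word
    return word

for w in ["revival", "allowance", "dance", "inference", "fence",
          "airliner", "defensible", "bible", "irritant", "ant",
          "replacement", "adjustment", "agreement"]:
    print(w, "->", step_4(w))
```

"agreement" shows the longest-match preference at work: EMENT matches first, the stem "agre" has m=1, and the word is left as it is without falling back to MENT.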
Stemming: Porter Algorithm
Step 5a: remove e
  (m>1) E -> (probate, rate)
  (m=1 & ¬*cvc) E -> (cease)
Step 5b: ll reduction
  (m>1 & *LL) -> L (controller, roll)
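A sketch of the Step 5 tidy-up, with illustrative function names. Following Porter (1980), the second Step 5a rule is taken to apply when m=1 and the stem does not end in cvc, since stems with m>1 are already covered by the first rule. The exclusion of a final w, x or y from *cvc and the handling of "y" are likewise taken from Porter (1980). Step 5b is shown on "controll", which is what earlier steps leave of "controller".

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # Assumption from Porter (1980): 'y' is a vowel only after a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if not prev and cur)

def ends_cvc(stem):
    # *cvc, with the final consonant not w, x or y (Porter, 1980).
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def step_5a(word):
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # (m>1) E ->   or   (m=1 & not *cvc) E ->
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word

def step_5b(word):
    # (m>1 & *LL) -> L
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word

for w in ["probate", "rate", "cease"]:
    print(w, "->", step_5a(w))
for w in ["controll", "roll"]:
    print(w, "->", step_5b(w))
```

"rate" keeps its e because the stem "rat" has m=1 and ends in cvc; "roll" keeps both l's because its measure is only 1.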
Stemming: Porter Algorithm
Misses (understemming)
  Unaffected:
    agreement  (VC)^1VCC - step 4 (m>1)
    adhesion
  Irregular morphology:
    drove, geese
Overstemming
  relativity - step 2
Mis-stemming
  wander  C(VC)^1VC