Special Topics in Computer Science
NLP in a Nutshell
CS492B Spring Semester 2009
Jong C. Park
Computer Science Department
Korea Advanced Institute of Science and Technology
WORDS, PARTS OF SPEECH, AND MORPHOLOGY
Jong C. Park, CS Dept., KAIST CS492B: Spring 2009 2
Words
Parts of Speech
Classes whose words share common grammatical properties.
The word parsing comes from the Latin phrase partes orationis, 'parts of speech'.
POS tagging is the automatic annotation of words with grammatical categories (or POS tags).
Parts of speech are also called lexical categories.
Words
Parts of Speech
Two classes: the closed class and the open class.
Closed class words are relatively stable over time and have a functional role.
Open class words form the bulk of a vocabulary, and appear or disappear with the evolution of the language.
Words
Table 5.1. Closed class categories.
Part of speech | English | French | German
Determiners | the, several, my | le, plusieurs, mon | der, mehrere, mein
Pronouns | he, she, it | il, elle, lui | er, sie, ihm
Prepositions | to, of | vers, de | nach, von
Conjunctions | and, or | et, ou | und, oder
Auxiliaries and modals | be, have, will, would | être, avoir, pouvoir | sein, haben, können
Table 5.2. Open class categories.
Part of speech | English | French | German
Nouns | name, Frank | nom, François | Name, Franz
Adjectives | big, good | grand, bon | groß, gut
Verbs | to swim | nager | schwimmen
Adverbs | rather, very, only | plutôt, très, uniquement | fast, nur, sehr, endlich
Words
Features
Basic categories can be further refined (or subcategorized).
Nouns can be split into singular and plural nouns.
Nouns in French can also be split into masculine and feminine nouns.
Nouns in German can be split into masculine, feminine, and neuter nouns.
Additional properties that can further specify main categories are called the features.
Examples include the number, gender, person, case, and tense features.
Words
Two Significant Parts of Speech: The Noun and the Verb
Nouns are divided into proper and common nouns.
Table 5.3. Features of common nouns.
Features\Values | English | French | German
Number | singular, plural: waiter/waiters, book/books | singular, plural: serveur/serveurs, livre/livres | singular, plural: Buch/Bücher
Gender | – | masculine, feminine: serveur/table | masculine, feminine, neuter: Ober/Gabel/Tuch
Case | – | – | nominative, accusative, genitive, dative: Junge/Jungen/Jungen/Jungen
Words
Two Significant Parts of Speech: The Noun and the Verb
Verbs can be classified into three main types: auxiliaries, modals, and main verbs.
Table 5.4. Auxiliary verbs.
English | to be: am, are, is, was, were; to have: has, have, had; to do: does, did, done
French | être: suis, es, est, sommes, sont, étais, était; avoir: ai, as, a, avons, ont, avais, avait, avions
German | sein: bin, bist, ist, war, waren; haben: habe, hast, hat, haben, habt; werden: werde, wirst, wird, wurde
Words
Two Significant Parts of Speech: The Noun and the Verb
Verbs can be classified into three main types: auxiliaries, modals, and main verbs.
Table 5.5. Modal verbs.
English | can, could, must, may, might, shall, should
French (semiauxiliaries) | pouvoir: peux, peut, pouvons, pourrai, pourrais; devoir: dois, doit, devons, devrai, devrais; vouloir: veux, veut, voulons, voudrai, voudrais
German | können: kann, können, konnte; dürfen: darf, dürfen, dürfte; mögen: mag, mögen, möchte; müssen: muß, müssen, mußte; sollen: soll, sollen, sollte
Words
Two Significant Parts of Speech: The Noun and the Verb
Main verbs are categorized according to their complement’s function:
Copular or link verb: verbs linking a subject to an (adjective) complement
Intransitive: verbs taking no object
Transitive: verbs taking an object
Ditransitive: verbs taking two objects
Words
Two Significant Parts of Speech: The Noun and the Verb
Main verbs
Table 5.6. Verb types.
Verb type | English | French | German
Copulas | Man is mortal; She seems intelligent | L’homme est mortel; Elle paraît intelligente | Der Mensch ist sterblich; Sie scheint intelligent
Intransitive verbs | Frank sleeps; Charlotte runs | François dort; Charlotte court | Franz schläft; Charlotte rennt
Transitive verbs | You take the book; Susan reads the paper | Tu prends le livre; Suzanne lit l’article | Du nimmst das Buch; Susan liest den Artikel
Ditransitive verbs | I give my neighbors the notes | Je donne les notes à mon voisin | Ich gebe die Notizen meinem Nachbarn
Words
Two Significant Parts of Speech: The Noun and the Verb
Main verbs
Table 5.7. Features common to verbs and nouns.
Features\Values | English | French | German
Person | 1, 2, and 3: I am, you are, she is | 1, 2, and 3: je suis, tu es, elle est | 1, 2, and 3: ich bin, du bist, sie ist
Number | singular, plural: I am/we are, she eats/they eat | singular, plural: je suis/nous sommes, elle mange/elles mangent | singular, plural: ich bin/wir sind, sie ißt/sie essen
Gender | – | masculine, feminine: il est mangé/elle est mangée | –
Words
Two Significant Parts of Speech: The Noun and the Verb
Verbs also have specific features: the tense, the mood, and the voice.
Tense locates the verb, and the sentence, in time.
Mood enables the speaker to present or to conceive the action in various ways.
Voice characterizes the sequence of syntactic groups.
Words
Table 5.8. Tenses constructed using inflection.
Features\Values | English | French | German
Base | I like to sing | J’aime chanter | Ich singe gern
Present | I sing every day | Je chante tous les jours | Ich singe alltags
Preterit (Simple past) | I sang in my youth | Je chantai dans ma jeunesse | Ich sang in meiner Jugend
Imperfect | – | Je chantais dans ma jeunesse | –
Future | – | Je chanterai plus tard | –
Present participle | I am singing | En chantant tous les jours | Singend
Past participle | I have sung before | J’ai chanté | Ich habe gesungen
Words
Table 5.9. Some tenses constructed using auxiliaries. Values do not correspond across languages.
Tense | English | French | German
Present progressive | I am singing | – | –
Future | I shall (will) sing | – | Ich werde singen
Present perfect | I have sung | J’ai chanté | Ich habe gesungen
Pluperfect | I had sung | J’avais chanté | Ich hatte gesungen
Passé antérieur | – | J’eus chanté | –
Future perfect | I will have sung | J’aurai chanté | Ich werde gesungen haben
Futur antérieur | I would have sung | J’aurais chanté | Ich würde gesungen haben
Past progressive | I was singing | – | –
Future progressive | I will be singing | – | –
Present perfect progressive | I have been singing | – | –
Future perfect progressive | I will have been singing | – | –
Past perfect progressive | I had been singing | – | –
Words
Table 5.10. Moods (Present only).
Mood | English | French | German
Indicative | I am singing | Je chante | Ich singe
Imperative | sing | chante | singe
Conditional | I should (would) sing | Je chanterais | Ich würde singen
Subjunctive | Rare; it appears in expressions such as: God save the queen | Il faut que je chante | Ich singe
Lexicons
A lexicon is a list of words.
Lexical entries are also called the lexemes.
Table 5.11. Word ambiguity.
Ambiguity | English | French | German
Part of speech | can: modal; can: noun | le: article; le: pronoun | der: article; der: pronoun
Semantic | great: big; great: notable | grand: big; grand: notable | groß: big; groß: notable
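The part-of-speech ambiguity of Table 5.11 can be made concrete with a toy lexicon. The sketch below is only an illustration in Python (the slides themselves use Prolog): the function names `build_lexicon` and `is_ambiguous` are hypothetical, and the entries are taken from Tables 5.1 and 5.11.

```python
from collections import defaultdict

# Toy (word, category) pairs taken from Tables 5.1 and 5.11.
ENTRIES = [
    ("can", "modal"), ("can", "noun"),
    ("le", "article"), ("le", "pronoun"),
    ("der", "article"), ("der", "pronoun"),
    ("the", "determiner"),
]

def build_lexicon(entries):
    """Group (word, category) pairs into a word -> list of categories map."""
    lexicon = defaultdict(list)
    for word, category in entries:
        if category not in lexicon[word]:
            lexicon[word].append(category)
    return dict(lexicon)

def is_ambiguous(lexicon, word):
    """A word is ambiguous when the lexicon lists more than one category."""
    return len(lexicon.get(word, [])) > 1

LEXICON = build_lexicon(ENTRIES)
```

Looking up `can` returns both readings, which is why a tagger needs context to choose between them.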
Table 5.12. The first lines of the Oxford Advanced Learner’s Dictionary.
Word | Pronunciation | Syntactic tag | Syllable count or verb pattern (for verbs)
a | @ | S-* | 1
a | EI | Ki$ | 1
a fortiori | eI ,fOtI’OraI | Pu$ | 5
a posteriori | eI ,p0sterI’OraI | OA$,Pu$ | 6
a priori | eI ,praI’OraI | OA$,Pu$ | 4
a’s | Eiz | Kj$ | 1
ab initio | &b I’nISI@U | Pu$ | 5
abaci | ’&b@saI | Kj$ | 3
aback | @’b&k | Pu% | 2
abacus | ’&b@k@s | K7% | 3
abacuses | ’&b@k@sIz | Kj% | 4
abaft | @’bAft | Pu$,T-$ | 2
abandon | @’b&nd@n | H0%,L@% | 3 6A,14
abandoned | @’b&nd@nd | Hc%,Hd%,OA% | 3 6A,14
abandoning | @’b&nd@nIN | Hb% | 4 6A,14
abandonment | @’b&nd@nm@nt | L@% | 4
abandons | @’b&nd@nz | Ha% | 3 6A,14
abase | @’beIs | H2% | 2 6B
abased | @’beIst | Hc%,Hd% | 2 6B
abasement | @’beIsm@nt | L@% | 3
Table 5.13. An excerpt from BDLex. Digits encode accents on letters. The syntactical tags of the verbs correspond to their conjugation type taken from the Bescherelle reference.
Entry | Part of speech | Lemma | Syntactic tag
a2 | Prep | a2 | Prep_00_00;
abaisser | Verbe | abaisser | Verbe_01_060_**;
abandon | Nom | abandon | Nom_Mn_01;
abandonner | Verbe | abandonner | Verbe_01_060_**;
abattre | Verbe | abattre | Verbe_01_550_**;
abbe1 | Nom | abbe1 | Nom_gn_90;
abdiquer | Verbe | abdiquer | Verbe_01_060_**;
abeille | Nom | abeille | Nom_Fn_81;
abi3mer | Verbe | abi3mer | Verbe_01_060_**;
abolition | Nom | abolition | Nom_Fn_81;
abondance | Nom | abondance | Nom_Fn_81;
abondant | Adj | abondant | Adj_gn_01;
abonnement | Nom | abonnement | Nom_Mn_01;
abord | Nom | abord | Nom_Mn_01;
aborder | Verbe | aborder | Verbe_01_060_**;
aboutir | Verbe | aboutir | Verbe_00_190_**;
aboyer | Verbe | aboyer | Verbe_01_170_**;
abre1ger | Verbe | abre1ger | Verbe_01_140_**;
abre1viation | Nom | abre1viation | Nom_Fn_81;
abri | Nom | abri | Nom_Mn_01;
abriter | Verbe | abriter | Verbe_01_060_**;
Lexicons
Encoding a Dictionary
Letter trees or tries are useful data structures to store large lexicons and to search words quickly.
Tries are used to store the words as trees of characters and to share branches as far as the letters of two words are identical.
Lexicons
Fig. 5.1. A letter tree encoding the words tab, table, tablet, and tables.
Lexicons
Prolog representation of a Trie
As embedded lists
[
  [b, [i, [n, bin] ] ],
  [d, [a, [r, [k, dark] ],
          [w, [n, dawn] ] ] ],
  [t, [a, [b, tab,
              [l, [e, table,
                      [s, tables],
                      [t, tablet] ] ] ] ] ]
]
Lexicons
Building a Trie in Prolog
ch5-make_trie.pl
make_trielist(tab, noun, TL).
Finding a Word in a Trie
is_word_in_trie(+WordChars, +Trie, -Lex)
make_trielist(tag, noun, T), is_word_in_trie([t,a,g], [T], L).
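The building and lookup steps can be sketched outside Prolog as well. The following Python sketch is not the content of ch5-make_trie.pl; it assumes a nested-dictionary trie, and the names `insert_word`, `build_trie`, `lookup`, and the end-of-word key `END` are hypothetical. The word list is the one from the embedded-list example.

```python
END = "$"  # hypothetical key that stores the lexical entry at a word end

def insert_word(trie, word, entry):
    """Walk or create one node per letter, then attach the entry."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[END] = entry

def build_trie(entries):
    """Build a trie from (word, entry) pairs; common prefixes share nodes."""
    trie = {}
    for word, entry in entries:
        insert_word(trie, word, entry)
    return trie

def lookup(trie, word):
    """Return the entry stored for word, or None if word is not in the trie."""
    node = trie
    for letter in word:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(END)

WORDS = ["bin", "dark", "dawn", "tab", "table", "tables", "tablet"]
TRIE = build_trie((word, word) for word in WORDS)
```

Here `tab`, `table`, `tables`, and `tablet` share the branch t-a-b, so a lookup costs one step per letter regardless of the size of the lexicon.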
Morphology
Types of Morpheme
Lexical morphemes correspond to the word stems.
Grammatical morphemes include grammatical words and the affixes.
Morphology
Table 5.14. Morpheme decomposition. We replaced the stems with the corresponding lemmas.
Language | Word | Morpheme decomposition
English | disentangling | dis+en+tangle+ing
English | rewritten | re+write+en
French | désembrouillé | dé+em+brouiller+é
French | recrite | re+écrire+te
German | entwirrend | ent+wirren+end
German | wiedergeschrieben | wieder+ge+schreiben+en
Morphology
Types of morphology
Concatenative morphology
Fig. 5.2. Concatenative morphology where prefixes and suffixes are concatenated to the stem.
Templatic morphology interweaves the grammatical morphemes with the stem (cf. the Semitic languages).
Morphology
Fig. 5.3. Embedding of the stem into the grammatical morphemes in the German verb sangst (second-person preterit of singen). After Simone (1998, p. 144).
Fig. 5.4. Embedding of the stem into the grammatical morphemes in the German verb gesungen (past participle of singen). After Simone (1998, p. 144).
Morphology
Morphs
Grammatical morphemes represent syntactic or semantic functions whose realizations in words are called morphs (cf. allomorphs).
Table 5.15. Plural morphs.
Language | Plural of nouns | Morpheme decomposition
English | hedgehogs | hedgehog+s
English | churches | church+es
English | sheep | sheep+Ø
French | hérissons | hérisson+s
French | chevaux | cheval+ux
German | Gründe | Grund+(¨)e
German | Hände | Hand+(¨)e
German | Igel | Igel+Ø
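The English rows of Table 5.15 show one plural morpheme realized by three morphs: s, es, and the empty morph. A minimal Python sketch of that choice follows; the exception set `ZERO_PLURALS`, the ending list, and the function names are assumptions made for illustration and cover only a small part of English.

```python
ZERO_PLURALS = {"sheep"}                      # hypothetical exception list
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")

def plural_morph(noun):
    """Pick the morph that realizes the plural morpheme for this noun."""
    if noun in ZERO_PLURALS:
        return ""        # zero morph: sheep+Ø
    if noun.endswith(SIBILANT_ENDINGS):
        return "es"      # church+es
    return "s"           # hedgehog+s

def pluralize(noun):
    """Concatenate the stem and the selected plural morph."""
    return noun + plural_morph(noun)
```

Irregular plurals beyond the zero morph (for example vowel changes) are deliberately left out of this sketch.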
Morphology
Inflection and Derivation
Inflection corresponds to the application of a grammatical feature to a word, such as putting a noun into the plural or a verb into the past participle.
Table 5.16. Verb inflection with past participle.
Form | English | French | German
Base form | work, sing | travailler, chanter, paraître | arbeiten, singen
Past participle (regular) | worked | travaillé, chanté | gearbeitet
Past participle (exception) | sung | paru | gesungen
Morphology
Derivation is linked to lexical semantics and involves another set of affixes.
Table 5.17. Derivational affixes.
Affixes | English | French | German
Prefixes | foresee, unpleasant | prévoir, déplaisant | vorhersehen, unangenehm
Suffixes | manageable, rigorous | gérable, rigoureux | vorsichtig, streitbar
Table 5.18. Derivation related to part of speech.
Language | Adjectives | Adverbs | Nouns | Adjectives | Verbs | Nouns
English | recent, frank | recently, frankly | air, base | aerial, basic | compute | computation
French | récent, franc | récemment, franchement | lune, air | lunaire, aérien | calculer | calcul
German | glücklich, möglich | glücklicherweise, möglicherweise | Luft, Grund | luftig, gründlich | rechnen | Rechnung
Morphology
Table 5.19. Word derivation
Language | Word | Contrary | Possibility
English | pleasant | unpleasant | *pleasable
English | do | undo | doable
French | plaisant | déplaisant | *plaisable
French | faire | défaire | faisable
German | angenehm | unangenehm | *angenehmbar
German | tun | *untun | tunlichst
Morphology
Morphological Processing
Includes parsing and generation.
Parsing consists in splitting an inflected, derived, or compounded word into morphemes.
Lemmatization refers to transforming a word into its canonical dictionary form: retrieving → retrieve, recherchant → rechercher, suchend → suchen.
Stemming consists of removing the suffix from the rest of the word: retrieving → retriev, recherchant → recherch, suchend → such.
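The difference between the two operations shows up clearly in code. In this Python sketch, lemmatization is a dictionary lookup and stemming is blind suffix removal; the table `LEMMAS`, the list `SUFFIXES`, and both function names are hypothetical and cover only the three example words above.

```python
# Hypothetical lemma table: inflected form -> dictionary form.
LEMMAS = {
    "retrieving": "retrieve",
    "recherchant": "rechercher",
    "suchend": "suchen",
}

# Hypothetical suffix list: English -ing, French -ant, German -end.
SUFFIXES = ("ing", "ant", "end")

def lemmatize(word):
    """Return the canonical dictionary form, or the word itself if unknown."""
    return LEMMAS.get(word, word)

def stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word
```

The stemmer needs no lexicon but returns truncated strings such as `retriev`, whereas the lemmatizer returns real words but depends on lexical knowledge.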
Morphology
Table 5.20. Morphological generation and parsing.
Generation →
English | French | German
dog+s, dogs | chien+s, chiens | Hund+e, Hunde
work+ing, working | travailler+ant, travaillant | arbeiten+end, arbeitend
un+do, undo | dé+faire, défaire | –
← Parsing
Table 5.21. Open class word morphology, where * denotes zero or more elements and ? denotes an optional element.
English and French | prefix* stem suffix* inflection?
German | inflection? prefix* stem* suffix* inflection?
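The English and French pattern of Table 5.21 is a regular expression over morphemes, so it can be checked with an ordinary regex engine. The Python sketch below assumes tiny, hypothetical morpheme inventories and works on decompositions written with + as in Table 5.14; `is_well_formed` is an illustrative name.

```python
import re

# Hypothetical toy inventories; a real system would read them from a lexicon.
PREFIXES = {"dis", "en", "un", "re"}
STEMS = {"tangle", "do", "work", "write", "manage"}
SUFFIXES = {"able", "ment"}
INFLECTIONS = {"s", "ed", "ing", "en"}

def alternation(morphemes):
    """Turn a set of morphemes into a regex alternation such as 'dis|un|re'."""
    return "|".join(re.escape(m) for m in sorted(morphemes, key=len, reverse=True))

# prefix* stem suffix* inflection?  with '+' between morphemes
WORD_PATTERN = re.compile(
    r"(?:(?:{p})\+)*(?:{st})(?:\+(?:{su}))*(?:\+(?:{i}))?".format(
        p=alternation(PREFIXES),
        st=alternation(STEMS),
        su=alternation(SUFFIXES),
        i=alternation(INFLECTIONS),
    )
)

def is_well_formed(decomposition):
    """True if the '+'-separated decomposition follows the Table 5.21 pattern."""
    return WORD_PATTERN.fullmatch(decomposition) is not None
```

The German pattern would need an extra optional inflection slot at the front and a repeated stem to allow compounding.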
Morphology
Table 5.22. Lemmatization ambiguities.
Language | Word | Words in context | Lemmatization
English | Run | 1. A run in the forest | 1. run: noun singular
English | Run | 2. Sportsmen run everyday | 2. run: verb present third person plural
French | Marche | 1. Une marche dans la forêt | 1. marche: noun singular feminine
French | Marche | 2. Il marche dans la cour | 2. marcher: verb present third person singular
German | Lauf | 1. Der Lauf der Zeit | 1. der Lauf: noun, sing, masc
German | Lauf | 2. Lauf schnell! | 2. laufen: verb, imperative, singular
Morphology
Language Differences
Table 5.23. Some language statistics from a Xerox promotional flyer.
Language | Number of stems | Number of inflected forms | Lexicon size (kb)
English | 55,000 | 240,000 | 200–300
French | 50,000 | 5,700,000 | 200–300
German | 50,000 | 350,000 or infinite (compounding) | 450
Japanese | 130,000 | 200 suffixes, 20,000,000 word forms | 500
Spanish | 40,000 | 3,000,000 | 200–300
Morphological Parsing
Two-Level Model of Morphology
Enables us to link the surface form of a word to its lexical or underlying form.
Table 5.24. Surface and lexical forms.
Generation: lexical to surface form →
Language | Lexical form | Surface form
English | dis+en+tangle+ed | disentangled
English | happy+er | happier
English | move+ed | moved
French | dés+em+brouiller+é | désembrouillé
French | dé+chanter+erons | déchanterons
German | ent+wirren+end | entwirrend
German | wieder+ge+schreiben+en | wiedergeschrieben
Parsing: ← surface to lexical form
Morphological Parsing
Two-Level Model of Morphology
In the two-level model, the mapping between the surface and lexical forms is synchronous.
Both strings need to be aligned with a letter-for-letter correspondence.
Table 5.25. Correspondence between lexical and surface forms.
Language | Form | Example 1 | Example 2 | Example 3
English | lexical | dis+en+tangle+ed | happy+er | move+ed
English | surface | dis0en0tangl00ed | happi0er | mov00ed
French | lexical | dé+chanter+erons | cheval+ux | cheviller+é
French | surface | dé0chant000erons | cheva00ux | chevill000é
German | lexical | singen+st | Grund+¨e | Igel+Ø
German | surface | singe00st | Gründ00e | Igel00
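Because the two forms have the same length, the alignment of Table 5.25 is simply a list of symbol pairs. The Python sketch below pairs the two strings position by position and recovers the written word by deleting the null symbol 0; the names `align`, `surface_word`, and `NULL` are illustrative assumptions.

```python
NULL = "0"  # null symbol used on the surface side in Table 5.25

def align(lexical, surface):
    """Pair the two tapes letter for letter; they must have equal length."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must have the same length")
    return list(zip(lexical, surface))

def surface_word(surface):
    """Remove the null symbols to obtain the word as it is written."""
    return surface.replace(NULL, "")

# Only the pairs where the two tapes differ carry morphological information.
changes = [pair for pair in align("happy+er", "happi0er") if pair[0] != pair[1]]
```

For happy+er the only non-identity pairs are y:i and +:0, which is exactly what a two-level rule for this word has to license.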
Morphological Parsing
Interpreting the Morphs
Considering inflection only, it is easier to interpret the morphological information using grammatical features rather than morphs.
disentangled: disentangle+Verb+PastBoth+123SP
happier: happy+Adj+Comp
Gründe: Grund+Noun+Masc+Pl+NomAccGen
Given these new lexical forms, the parser has to align the feature symbols with letters or null symbols. The principles do not change, however.
Morphological Parsing
Fig. 5.5. Alignments with features.
Morphological Parsing
Finite-State Transducers
The two-level model is commonly implemented using finite-state transducers (FST).
Transducers are automata that accept, translate, or generate pairs of strings.
Fig. 5.6. A transducer.
Morphological Parsing
Conjugating a French Verb
Table 5.26. Future tense of French verb chanter.
Number\Person | First | Second | Third
singular | chanterai | chanteras | chantera
plural | chanterons | chanterez | chanteront
Table 5.27. Aligned lexical and surface forms.
Number\Pers. | Form | First | Second | Third
singular | lexical | chanter+erai | chanter+eras | chanter+era
singular | surface | chant000erai | chant000eras | chant000era
plural | lexical | chanter+erons | chanter+erez | chanter+eront
plural | surface | chant000erons | chant000erez | chant000eront
Morphological Parsing
Conjugating a French Verb
Fig. 5.7. A finite‐state transducer describing the future tense of chanter.
Morphological Parsing
Conjugating a French Verb
Fig. 5.8. A finite‐state transducer describing the future tense of French verbs of the first group.
Morphological Parsing
Prolog Implementation
transduce(+Start, ?Final, ?Lexical, ?Surface)
?- transduce(1, Final, Lexical, [r, ê, v, e, r, a]).
?- transduce(1, Final, [r, ê, v, e, r, +, e, r, e, z], Surface).
?- transduce(1, 11, [r, ê, v, e, r | L], Surface).
load_files(['ch5-transduce.pl'], [encoding(unicode_le)]).
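The two directions of the queries above (surface to lexical, lexical to surface) can also be sketched in Python. This is not a port of ch5-transduce.pl: the arc table, the state numbers, the wildcard symbol `ANY`, and the function names `generate` and `parse` are all assumptions, chosen so that the transducer covers the future tense of first-group verbs as in Table 5.27.

```python
NULL = "0"   # null symbol on the surface tape
ANY = "@"    # hypothetical wildcard arc: any letter, copied unchanged
START = 1
FINAL = {7, 11}

# state -> list of (lexical symbol, surface symbol, next state).
# State numbers are illustrative and need not match Fig. 5.8.
ARCS = {
    1: [(ANY, ANY, 1), ("e", NULL, 2)],
    2: [("r", NULL, 3)],
    3: [("+", NULL, 4)],
    4: [("e", "e", 5)],
    5: [("r", "r", 6)],
    6: [("a", "a", 7), ("o", "o", 8), ("e", "e", 9)],
    7: [("i", "i", 11), ("s", "s", 11)],
    8: [("n", "n", 10)],
    9: [("z", "z", 11)],
    10: [("s", "s", 11), ("t", "t", 11)],
}

def generate(lexical, state=START, out=""):
    """Yield surface words (nulls removed) for a lexical form."""
    if not lexical:
        if state in FINAL:
            yield out
        return
    head, rest = lexical[0], lexical[1:]
    for lex, surf, nxt in ARCS.get(state, []):
        if lex == ANY and head.isalpha():
            yield from generate(rest, nxt, out + head)
        elif lex == head:
            yield from generate(rest, nxt, out + ("" if surf == NULL else surf))

def parse(surface, state=START, out=""):
    """Yield lexical forms for a surface word written without nulls."""
    if not surface and state in FINAL:
        yield out
    for lex, surf, nxt in ARCS.get(state, []):
        if surf == NULL:
            # A null on the surface tape consumes no input letter.
            yield from parse(surface, nxt, out + lex)
        elif surface and lex == ANY and surface[0].isalpha():
            yield from parse(surface[1:], nxt, out + surface[0])
        elif surface and surf == surface[0]:
            yield from parse(surface[1:], nxt, out + lex)
```

As in the Prolog version, the same arc table serves both directions; the search is nondeterministic because the wildcard arc and the e:0 arc can both apply to the letter e.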
Morphological Parsing
Ambiguity
Example
(je) chante ‘I sing’
(il) chante ‘he sings’
Table 5.28. Present tense of French verb chanter.
Number\Person | First | Second | Third
singular | chante | chantes | chante
plural | chantons | chantez | chantent
Morphological Parsing
Ambiguity
Fig. 5.9. A finite-state transducer encoding the present tense of verbs of the first group.
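The ambiguity of chante can be reproduced with a much simpler device than the transducer of Fig. 5.9. The Python sketch below replaces the transducer with a plain suffix table built from Table 5.28; `PRESENT_ENDINGS` and `parse_present` are hypothetical names, and the function assumes that its input really is a first-group verb form.

```python
# Present-tense endings of first-group verbs, from Table 5.28.
PRESENT_ENDINGS = [
    ("e", "first", "singular"),
    ("es", "second", "singular"),
    ("e", "third", "singular"),
    ("ons", "first", "plural"),
    ("ez", "second", "plural"),
    ("ent", "third", "plural"),
]

def parse_present(surface):
    """Return every (lemma, person, number) analysis compatible with the ending."""
    analyses = []
    for ending, person, number in PRESENT_ENDINGS:
        if surface.endswith(ending) and len(surface) > len(ending):
            lemma = surface[: -len(ending)] + "er"
            analyses.append((lemma, person, number))
    return analyses
```

Parsing chante returns two analyses, first and third person singular, so the parser alone cannot decide; the choice has to come from the context (je or il).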
Remaining Issues
Morphological Rules
Two-Level Rules
Rules and Finite-State Transducers
Rule Composition: An Example with French Irregular Verbs