a morphological tagger for korean: statistical tagging...

24
A Morphological Tagger for Korean: Statistical Tagging Combined with Corpus-based Morphological Rule Application CHUNG-HYE HAN Department of Linguistics, Simon Fraser University, 8888 University Dr., Burnaby, BC V5A1S6, Canada E-mail: [email protected] MARTHA PALMER Department of Computer and Information Sciences, University of Pennsylvania, 3330 Walnut St., Philadelphia, PA 19104-6389, USA E-mail: [email protected] Abstract. This paper describes a novel approach to morphological tagging for Korean, an agglutinative language with a very productive inflectional system. The tagger takes raw text as input and returns a lemmatized and morphologically disambiguated output for each word: the lemma is labeled with a POS tag and the inflections are labeled with inflectional tags. Unlike the standard approach to tagging for morphologically complex languages, in our proposed approach the tagging phase precedes the analysis phase. It comprises a trigram-based tagging component followed by a morphological rule application component, obtaining 95% precision and recall on unseen test data. Key words: agglutinative morphology, Korean, morphological rules, morphological tagger, statistical tagging, Treebank 1. Introduction Korean is an agglutinative language with a very productive inflectional sys- tem. Inflections include different types of case markers, postpositions, pre- fixes, and suffixes on nouns, as in (1a); tense, honorific and sentence type markers on verbs and adjectives, as in (2a,b); among others. Furthermore, these inflections can combine with each other to form complex compound inflections. For example, postpositions, which correspond to English prepo- sitions, can be followed by case markers such as nominative or accusative markers, as in (1b), or auxiliary postpositions such as (to ‘also’) and (man ‘only’), as in (1c), which then can be followed by a topic marker, as in (1d). Honorific and tense markers on verbs can be followed by a nominalizer and then a case marker or a postposition and a topic marker, as in (2c,d). Based on the possible inflectional sequences, we estimate that the number of possible morphological variants of a word can, in principle, be in the tens of thousands. (1) a. - hakkyo-ka c 2004 Kluwer Academic Publishers. Printed in the Netherlands. mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.1

Upload: dinhhuong

Post on 08-Mar-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

A MorphologicalTaggerfor Korean:StatisticalTaggingCombinedwith Corpus-basedMorphologicalRuleApplication

CHUNG-HYEHANDepartment of Linguistics, Simon Fraser University, 8888 University Dr., Burnaby, BCV5A1S6, CanadaE-mail: [email protected]

MARTHA PALMERDepartment of Computer and Information Sciences, University of Pennsylvania, 3330 WalnutSt., Philadelphia, PA 19104-6389, USAE-mail: [email protected]

Abstract. This paperdescribesa novel approachto morphologicaltaggingfor Korean,anagglutinative languagewith averyproductive inflectionalsystem.Thetaggertakesraw text asinputandreturnsa lemmatizedandmorphologicallydisambiguatedoutputfor eachword: thelemmais labeledwith a POStagandtheinflectionsarelabeledwith inflectionaltags.Unlikethe standardapproachto taggingfor morphologicallycomplex languages,in our proposedapproachthetaggingphaseprecedestheanalysisphase.It comprisesa trigram-basedtaggingcomponentfollowedby amorphologicalruleapplicationcomponent,obtaining95%precisionandrecallonunseentestdata.

Key words: agglutinative morphology, Korean,morphologicalrules,morphologicaltagger,statisticaltagging,Treebank

1. Introduction

Koreanis anagglutinative languagewith a very productive inflectionalsys-tem. Inflectionsincludedifferent typesof casemarkers,postpositions,pre-fixes,andsuffixes on nouns,as in (1a); tense,honorific andsentencetypemarkers on verbsand adjectives, as in (2a,b);amongothers.Furthermore,theseinflectionscancombinewith eachother to form complex compoundinflections.For example,postpositions,which correspondto Englishprepo-sitions,canbe followed by casemarkers suchas nominative or accusativemarkers,asin (1b), or auxiliary postpositionssuchas

��(to ‘also’) and

����(man ‘only’), asin (1c),which thencanbefollowedby a topic marker, asin(1d).

�Honorificandtensemarkersonverbscanbefollowedby anominalizer

and then a casemarker or a postpositionand a topic marker, as in (2c,d).Basedon thepossibleinflectionalsequences,we estimatethatthenumberofpossiblemorphologicalvariantsof a word can,in principle,bein thetensofthousands.

(1) a.� ��� - �hakkyo-ka

c�

2004Kluwer Academic Publishers. Printed in the Netherlands.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.1

Page 2: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

2

SCHOOL-nom

b.� ��� - � ��� � - �hakkyo-eyse-kaSCHOOL-FROM-nom

c.� ��� - � ��� � - ����hakkyo-eyse-manSCHOOL-FROM-ONLY

d.� ��� - � ��� � - ���� - �� �hakkyo-eyse-man-unSCHOOL-FROM-ONLY-topic

(2) a. � - � �� -� �

ka-ess-taGO-past-decl

b. � - ��� - � �� -� �

ka-si-ess-taGO-honorific-past-decl

c. � - � - �ka-ki-kaGO-nominalizer-nom

d. � - ��� - � �� - � - � � - �� �ka-si-ess-ki-ey-nunGO-honorific-past-nominalizer-TO-topic

This morphologicalcomplexity implies that for any NLP applicationonKoreanto be successful,a componentthat doesmorphologicalanalysisisnecessary. Without it, any applicationthat makes use of a computationallexicon would call for large and unnecessaryspacerequirements,becauseit would requirea lexicon that lists all possiblemorphologicalvariantsof aword; moreover, the developmentof applicationssuchas a part-of-speech(POS)taggeranda parserbasedon statisticalmodelswould not be feasibledueto the sparsedataproblemcausedby themultiplicity of morphologicalvariantsof the samestemin the training data.Further, to bestexploit theagglutinative natureof morphologicalstructures,a taggerfor Koreanneedsto bemorpheme-based,ratherthanword-based.Thatis, anoutputof a taggershouldconsistof asegmentationof eachwordinto its constituentmorphemesandanassignmentof a tagto eachof them.

In thispaper, wedescribeanimplementedmorphologicaltagger(thatalsodoesmorphologicalanalysis)for Koreanthat takesraw text asan input andreturnsa lemmatizedandmorphologicallydisambiguatedoutputwherefor

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.2

Page 3: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

3

eachword, the base-formis labeledwith a POStag andthe inflectionsarelabeledwith inflectional tags.Throughoutthis paper, we will usethe term“word” to meancharactersequencesseparatedby whitespacesin theoriginalraw text, and the term “lemma” to meanthe stemor base-formof a wordstrippedoff of all inflectionalandderivationalmorphemes.

In the standardapproachto tagging for morphologicallycomplex lan-guageslikeKorean,amorphologicalanalysisphaseprecedesataggingphase.That is, all possiblewaysof morphologicallysegmentinga given word aregeneratedwhich are then disambiguatedat the taggingphase(Lim et al.,1995; Hong et al., 1996; Lim et al., 1997; Chanet al., 1998; Yoon et al.,1999;Leeetal.,2000).� Themainmotivationbehindapplyingmorphologicalanalysisprior to taggingis thesparsedataproblemwhich could potentiallyarisedueto the fact that baseforms of wordscanappearwith a wide arrayof morphologicalaffixes.However, this presupposesknowledgeof all pos-sible morphologicalrules, requiringquite a lot of time andeffort for theirconstructionand implementation.Furthermore,while the generationof allpossiblemorphologicalanalysesis not anespeciallyoneroustask,resolvingtheresultingambiguityin thetaggingphaseis quitechallenging.In contrast,in our proposedapproach,the taggingphaseprecedestheanalysisphase.Itwill be shown that even with the sparsedataproblem,by applyingtaggingbeforeanalysis,wecanachieveaccuracy resultsthatarecomparableto state-of-the-art systems.This is shown to be possiblebecausetagging is donein two stages:(i) trigram-basedtagging,and (ii) updatingthe tagsfor un-known wordsusingmorphologicalinformationautomaticallyextractedfromthetrainingdata.For us,therich inflectionsarenotseenasaproblemcausingsparsedata,but ratherthey areexploitedwhenguessingthetagsfor unknownwords.As we will seein Section2.4, our approachallows for morphologi-cal analysisto be donedeterministicallyby usingthe informationobtainedfrom the taggingphase.This makesour approachefficient sincethereis notime-consumingmorphologicaldisambiguationstep.

Our approachemploys techniquesfrom POS tagging to resolve ambi-guity in morphologicaltaggingand analysis.POStaggingwas introducedby Church (1988), and Ratnaparkhi(1996) currently has one of the bestperformancesfor a POStaggerthat hasbeentrainedon the PennEnglishTreebank(Marcusetal., 1993). SinceEnglishis notasmorphologicallycom-plex asKorean,Ratnaparkhimodelsvery limited suffix informationto dealwith unseenwords.Taggersfor morphologicallyrich languageslike Czech,an inflectional language,andHungarian,BasqueandTurkish,agglutinativelanguages,have alsobeendeveloped.Hajic andHladka (1998)useanexpo-nentialprobabilisticmodelfor tagdisambiguationafteremploying morpho-logical analysisfor Czech.Hajic et al. (2001)develop a hybrid systemforCzech,wherea rule-basedsystemperformspartialdisambiguation,followedby the applicationof a trigram HMM tagger. Tufis et al. (2000) propose

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.3

Page 4: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

4

a two-steptaggingprocessfor Hungarianin which initial taggingis doneusing a reducedtag set with an off-the-shelfHMM tagger, followed by arecovery of the full morphosyntacticdescription.Hakkani-Tur et al. (2002)proposeto breakup the full morphologicaltag sequenceinto smallerunitscalled inflectional groups,treatingthe membersof eachgroup as subtags,andthento determinethecorrectsequenceof tagsvia statisticaltechniques.For Basque,Ezeizaetal. (1998)suggesttheapplication,afterinitial morpho-logical analysis,of morphologicalrulesfollowed by an HMM-basedtaggerfor disambiguation.In comparison,ourmodelis basedonagenerative statis-tical modelanda rule-basedapproach,andappliesa trigram-basedtaggingcomponentfollowedby amorphologicalruleapplicationcomponent.

We employed a corpus-basedapproachfor the implementationof ourmorphologicaltagger, andautomaticallyextractedthetrainingdataandmor-phologicalrulesfrom thePennKoreanTreebankthatcontains54,366wordsand5,078sentences(Hanetal., 2002). TheTreebankis composedof 33filesin total, out of which 30 files were usedfor training data,and threefiles(files 05, 20, and30, comprising9% of theentirecorpus)weresetasidefortestdata.The Treebankhasphrase-structureannotation,with head/phrase-level tagsaswell asfunction tags.Eachword is morphologicallyanalyzed,wherethelemmaandtheinflectionsareidentified.Thelemmais taggedwitha POStag (e.g.,NNC, NPN, VV, VX), and the inflectionsare taggedwithinflectionaltags(e.g.,PCA,EAU, EFN).� An exampleannotationis givenin(4) for thesentencein (3).

(3) � � � �� �� ���� �� � � ��!�#" � ?kwenhan-ulAUTHORITY-acc

nwu-kaWHO-nom

kackoHAVE

iss-ciBE-int

‘Who hastheauthority?’

(4) (S (NP-OBJ-1 $%'&)(+*, /NNC+ -. / /PCA)(S (NP-SBJ 01324 /NPN+5�6 /PCA)

(VP (VP (NP-OBJ*T*-1)5�687:9 /VV+ ;< /EAU)=?>@ /VX+ 7:9 /EFN))?/SFN)

As wedid notneedto spendany timeon manualconstructionof morpholog-ical rules,theimplementationof theentiresystemwasdonequiteefficiently,allowing usto developaquickandeffective techniquefor rapidlyprototypingaworkingsystem.

The rest of the paperis organizedas follows. In Section2, we discussthe generalapproach,the training data,and the workings of eachcompo-nentunderlyingthesystem.Section3 presentstheresultsof theperformanceevaluationdoneon thetagger.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.4

Page 5: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

5

2. Description of the Morphological Tagger

The taggerfollows several sequentialstepsfor the labelingof the raw textwith POSandinflectionaltags.After tokenization(mostly appliedto punc-tuation),all morphologicalcontractionsin the input string aredecomposed(spellingrecovery). Theknown wordsarethentagged(morphtagging)fol-lowedby aphasethatupdatesthetagsfor any unknown words(updatemorphtagging).Finally, the lemmaand its tagsareseparatedfrom the inflectionswith their tags,creatingthe final output (lemma/inflectionidentification).Thesestepsaresummarizedin Figure2.

Tokenization

Morph Tagging

Update Morph Tagging

IdentificationLemma/Inflection

INPUT

OUTPUT

Spelling Recovery (with reduced POS tagging)

Figure 1. Overview of theTagger

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.5

Page 6: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

6

2.1. TOKENIZATION

Whentheraw text is fed throughour tagger, thefirst taskof thetaggeris toseparateout thepunctuationsymbols.This is asimpleprocessin whichwhitespaceis insertedbetweenthewordandtheassociatedpunctuationsymbol.

2.2. SPELLING RECOVERY

In many cases,in thecreationof an inflectedword,a syllableor a characteris deletedfrom thestemasin (5), and/oramorphologicalcontractionoccursbetweenthestemandtheinflectionasin (6).

(5) a. �� � + �nwukwu+ka

A �� �nwuka

WHO+nom

b.� �B + �C � D�tep+umyen

A � �E�� � D�tewumyen

HOT+IF

(6) a. � + � �� +� �

ka+ass+ta

A �� � �kassta

GO+past+decl

b.��F� �G + ��� + H �mwues+i+nka

A ��I� �� �mwuenka

WHAT+co+int

In otherwords,therecanbe several allomorphsfor a given morpheme,andthusan importantpart of the taskfor a morphologicaltagger/analyzeris toidentify thebase-formof a morpheme,by recoveringthematerialsthathavebeendeleted,anddecomposingthematerialsthathave beencontracted.Wecall thisprocessSPELLING RECOVERY.

To obtaintrainingmaterialfor thespellingrecovery component,theTree-bank was converted to a data format containinga list of triples for eachsentence:a word with morphologicaldeletion and contraction,the corre-spondingstring with the deletionand contractionundone,and a POStagfor the word. All the inflectional tagsare strippedaway, and all the POStagsaremappedontooneof thefive tags:NN (noun),AN (nounmodifier),VV (verb andadjective), AD (adverb andconjunction),or NC (nounwitha copula inflection). This meansthat the tag for a word annotatedwith anounPOStag anda caseinflectional tag is reducedto NN, andthe tag fora word annotatedwith a verbPOStaganda tenseandsentencetype inflec-tional tagsis reducedto VV. Importantly, spelling recovery is conditioned

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.6

Page 7: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

7

on thesefive POStags.Note that the Treebankmakes finer distinctionsintagging,andhas34 tagsin total. For instance,the Treebankhasfive nountags:NPR for propernouns,NNC for commonnouns,NNX for dependentnouns,NPN for personalpronouns,NNU for numeralsandNFW for wordswritten in foreign characters.For the purposesof spellingrecovery though,all the words belongingto the five noun categoriesbehave the sameway.Thesamegoesfor verb/adjective categories.AlthoughtheTreebankmakesathree-waydistinctionfor verb/adjective tags,all typesof verbsandadjectivesbehavethesamewaywith respectto spellingrecovery. Wethussimplifiedtherulesinvolved in thespellingrecovery processby collapsing34 tagsto fivetagsin orderto conditiontheprocess.

Wethenautomaticallyextracted605spellingrecovery templatesfrom thelist of triples, by comparingthe word and the correspondingstring. If thetwo formsdiffer, they arelisted from thepoint wherethe two formsstarttodiffer, with the correspondingPOStag. For a small set of function words,entirewordsandthecorrespondingstringsarelisted.Someexamplespellingrecoverytemplatesaregivenin TableI. Thefirst columncontains“subwords”(which arethe partsof wordsthat areto be targetedfor spellingrecovery),thesecondcolumnPOStags,andthethird columnsubstrings(which definesthestringsthesubwordswill beconvertedinto).J

TableI. Examplesof SpellingRecovery Templates

Subword POStag SubstringKLNMPO nwuka NNKLNMLQMPO nwukwukaMR MTSU kuken NN MR MTSV�WX U kukesunMTY[Z O\M] kelako NC MTSV^W _ Z O[M] kesilakoMP`a M] kacko VV MPOcb _ M] kacikoMd e Z O kolla VV M] ZR W Y koluefPghji] chyessso VVf _cW Shki] chiesssofL Wl U chwuwun VVfl m WX U chwupunn ohjiX m K _cp O haysssupnita VVn O W ShkiX m K _qp O haesssupnita

The POSfor eachword is neededbeforewe can attemptthe spellingrecovery stepbecausethis is constrainedby POStags.For instance,

�r s ��ttonghay is ambiguousbetweena verb ‘move’ anda noun ‘eastsea’,and itshouldhave differentspellingrecovery outputsdependingon its POStag,asillustratedin (7).

(7) a.�r s ��t

/NNtonghay

A �r s ��ttonghay

‘eastsea’

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.7

Page 8: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

8

b.�r s ��t

/VVtonghay

A �r s � � � �tonghae

‘move’

We usea statisticalmodelto infer themostlikely sequenceof POStagsfor the input word sequence.The training datais the word andits POStagfrom thelist of triplesshown in TableI. This trainingdatais usedto createaprobabilitymodelto infer themostlikely tagsequencet ujvxwyuz\{ | | | wyu} givenaninputwordsequence~ z { | | |�{ ~ } . Thisprobabilitymodelis usedasaPOStaggerwhichfindst u by finding thetagsequencewith thehighestprobabilityasin (8).

(8) t u�v��\���[���������q�'�'�'� � ����v�����w z { | | |�{ w }^� ~ z { | | |�{ ~ }��This provides the most likely tag from the five possiblePOStagsfor eachword in a given input sentence,asin (9) (for glosssee(3)). Becausethetagsetat thisstagecontainsonly fivetags,thesparsedataproblemis minimized.

(9) a. � � � �� �� � �� �� � � ��!�#" � ?

b. � � � �� �� � /NN �� � /NN � � /VV ��!�#" � /VV ?

In ourimplementation,weusedastandardtrigrammodelcombinedwith Katzbackoff smoothingfor unseenevents.Unknown wordsaretaggedasNN. Thedetailsof the modelwe useareexplainedin AppendixB. The codefor theimplementationof this trigram-basedtaggerwasadaptedandmodifiedfromthetrigram-basedSuperTaggerby Srinivas(1997).�

In thespellingrecovery phase,the input sentenceis first run throughthePOStagger, followedby theapplicationof thespellingrecovery templatestoeachword.At thispoint,eachtemplateis takenasaninstruction,asfollows:

If the input word containsthesubword andis taggedwith the POStag,thenreplacethepart from theword matchingthesubword with thesub-string.

Somespellingrecoverytemplatesthatareunambiguouslyverb/adjective tem-platesaremadeto apply to input wordstaggedasNN aswell asVV. Thistakescareof unknown verbsandadjectivesthatarewrongly taggedasNN.

An exampleinput and output of spelling recovery is given in (10) (forglosssee(3)). The parts to which spelling recovery hasbeenappliedareunderlined.

(10) a. Input:� � � �� �� � �� � � � ��!�#" � ?

b. POStagging:� � � �� �� � /NN �� � /NN � � /VV ��!�#" � /VV ?

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.8

Page 9: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

9

c. Output:� � � �� �� � �� � � � " � � ��!�#" � ?

The spelling recovery phasewas evaluatedon the test sentenceswhichweresetasidefrom theTreebank.Theresultsaregivenin TableII.

TableII. Evaluationon SpellingRecovery

Process Accuracy (%)

POStaggingfor spellingrecovery 92.03

Spellingrecovery 96.33

Spellingrecoveryassuming100%POStagging 98.62

2.3. MORPHOLOGICAL TAGGING

Morphologicaltaggingis donein two stages:first, input wordsare taggedwith a trigram-basedmorphologicaltagger, and then the tag for unknownwords is updatedusing what we call “inflectional templates”.Eachstepisdescribedin moredetailbelow.

2.3.1. Initial morphological taggingA morphologicaltaggerbasedona trigram-basedmodelwastrainedon91%of the 54k KoreanTreebank.� The statisticalmodel for this morphologicaltaggeris the sameasthe oneusedfor the POStaggerthat wasusedin thespellingrecovery phase.That is, giventheinput word sequence~ z { | | |�{ ~ } ,the model is usedto find the tag sequencet u v�w uz { | | | w u} with the highestprobability. Theonly differenceis that themorphologicaltaggerusesa dif-ferentsetof tags.SeeAppendixB for thedetailsof themodel.Thetagsetforthe morphologicaltaggerconsistsof possiblecomplex tagsextractedfromtheTreebank,altogether165complex tags.Thesecomplex tagsarebasedon14 POStags,ten inflectionaltags,andfive punctuationtags.� Examplesaregiven in Table III. For example,NNC+CO+EFN+PAD is the complex tagfor a commonnoun,inflectedwith a copula,a sentencetype marker andapostposition.The taggerassignsan NNC (commonnoun) tag to unknownwords,NNC beingthe mostfrequenttag for unknown words.Dealingwithunknown wordsusinginflectionalinformationis discussedin Section2.3.2.

Thetaggingphasewasevaluatedon thetestsentences,obtaining81.96%accuracy. An exampleinput andoutputof themorphologicaltaggeris givenin (11).

(11) a. Input:" � ��� � � � � � � �s �� � ¡� � � � � �� �� B ��� � � .

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.9

Page 10: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

10

Table III. Examples ofComplex Tags

NNC

NNC+PCA

NNC+CO+EFN+PAD

NPR+PAD+PAU

NNU+PAU

VJ+EPF+EAN

VV+EPF+EFN

VV+EPF+EFN+PAU

VX+EPF+EFN+PAU

Cey-kaI-nom

kwanchukOBSERVATION

sahang-ulITEM-acc

pokoha-ess-supnita.REPORT-past-decl

‘I reportedtheobservationitems.’

b. Output:" � � /NPN+PCA � �¢ � � /NNC � � � �s �� � /NNC+PCA¡� � � � � �� �� B ��� � � /VV+EPF+EFN./SFN

2.3.2. Updating the morphological tagsThe resultsobtainedfrom the trigram-basedmorphologicaltaggerare notveryaccurate,themainreasonbeingthatboththenumberof unknown wordsandthenumberof tagsin thetagsetarequite large.This vindicatesthedis-cussionin Section1 thatarguesfor theimportanceof morphologicalanalysis.Without morphologicalanalysis,the tagging resultswill be quite low. Tocompensatefor this, we addeda phasethat updatesthe morphologicaltagsfor unknown words.Theupdateprocessexploits thefact thatthePOStagofa givenword is in generalpredictablefrom theinflectionson theword.Thatis, a large numberof inflectionscan only occur on verbs/adjectives, and alargenumberof inflectionscanonly occuronnouns.

For this purpose,inflectionaltemplates,which arepairsof inflection se-quencesandthe correspondingcomplex tag,wereextractedfrom the Tree-bank.Thetotal numberof extractedinflectionaltemplatesis 594.This com-prisesonly 1% of the numberof word tokensin the training data,showingthat thesizeof inflectionaltemplatesis relatively quitesmall in comparisonto thesizeof thetrainingdata.A few exampletemplatesarelistedin TableIV.Further, a list of commonnounstemswasextractedfrom theTreebank.

Theprocedurefor updatingthemorphologicaltagsis shown in Figure2.An exampleof the taggingupdateprocessis given in (12). The main

verb of the sentence��! � �� �� B ��� � � icesssupnita ‘forgot’, which is inflected

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.10

Page 11: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

11

TableIV. Examplesof InflectionalTemplatesW ShkiX m K _qp O esssupnita VV+EPF+EFNW Sh�WR K _¤£ O essunikka VV+EPF+ECSW Sh�WRN¥T¦ essumye VV+EPF+ECSW Sh�WRN¥R Z] essumulo VV+EPF+ECSW Sh�WX § WT¨ essumey VV+EPF+ENM+PAD£ O8b _ KX U kkacinun NNC+PAD+PAUpP© Z] ¥ `U tayloman NNC+PAD+PAUZ]QªL^«PY W¬ lopwuteuy NNC+PAD+PCA

1. For eachinput word that is taggedasanNNC, first checkif theword isincludedin thecommonnounlist:

a) If theword is includedin thecommonnounlist, it is a known word,henceleave thetagasit is.

b) If theword is not includedin thecommonnounlist, it is anunknownword,hencechecktheinflectionaltemplatesto updatethetag.

i) Match the inflection sequencewith the word, startingfrom theright side. If there are matchinginflection sequences,choosethecomplex tag pairedwith the longestmatchinginflection se-quence,andreplacetheNNC tagwith thatcomplex tag.Thatis,adopta right-to-left longestmatchrequirement.

ii) If thereis no matchinginflection sequence,leave the complextagasNNC.

2. For thosewordsthat arenot taggedsimply asNNC, leave their tagsasthey are.

Figure 2. Procedurefor UpdatingtheMorphologicalTags

with a past-tensemarker anda declarative sentencetypemarker, is wronglytaggedasNNC in the trigram-basedtaggingphase.This tag is updatedasVV+EPF+EFN,VV for verb,EPFfor tense,andEFNfor sentencetype.

(12) a. Input:

" � � /NPN+PCA��! � �� �� B ��� � � /NNC� �­ � � /NNC � � � �s �� � /NNC+PCA

Cey-ka kwanchuk sahang-ul ic-ess-supnita.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.11

Page 12: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

12

I-nomOBSERVATION ITEM-accFORGET-past-decl‘I forgot theobservation items.’

b. Output:" � � /NPN+PCA � �¢ � � /NNC � � � �s �� � /NNC+PCA��! � �� �� B ��� � � /VV+EPF+EFN./SFN

After the tagsfor unknown wordswereupdated,the morphologicaltag-ging accuracy on the testsetincreasedto 93.86%,a 12-pointincreasefromthe trigram-basedmorphologicaltagging.If the shortestmatchis employedto thetaggingupdateprocess,theaccuracy reducesto 90.29%.

2.4. LEMMA /INFLECTION IDENTIFICATION

Lemma/inflectionidentificationproducesthefinal outputof our morpholog-ical tagger. Given a pair of a word andits complex tag, the complex tag isdecomposedand eachconstituenttag is pairedup with the correspondingmorphemein theword.Thatis, thelemmaandits POStagarepairedup,andtheneachinflectionis pairedupwith thecorrespondinginflectionaltag.Thisis illustratedin (13), where ® , ¯ , ° and ± standfor substringsof a word, ²standsfor aPOStag,and ³ , ´ and µ standfor inflectionaltags.

(13)

®�¯q°q± /²·¶¢³·¶¢´k¶µ¸®�¹º²�¶­¯ ¹�³»¶°¼¹�´½¶±+¹�µThe complex tag in conjunctionwith the inflection andstemdictionary

(extractedfromtheTreebank)look-upareusedtodeterminethelemma/inflectionidentificationprocess.Theinflectiondictionarycontains230entriesin total,andthestemdictionarycontains3,880entriesin total.Exampleentriesfromthe inflection dictionaryandstemdictionaryaregiven in TableV. The listin the stem dictionary is small becausethe size of the Treebankitself issmall.It is possibleto includeamuchlargerstemdictionary, usingresourcesfrom outsidetheTreebank,to enhancetheoverall performanceof thetagger.We however did not do so for presentpurposesin order to be faithful tothe corpus-basedapproachwe have adopted;the datausedfor the imple-mentationandsubsequentlythe evaluationof the taggeris restrictedto theTreebank.An exampleinputandoutputof thelemma/inflectionidentificationis givenin (14) (for glosssee(11)).

(14) a. Input:" � � /NPN+PCA � �¢ � � /NNC � � � �s �� � /NNC+PCA¡� � � � � �� �� B ��� � � /VV+EPF+EFN./SFN

b. Output:" � /NPN+ � /PCA � � � � /NNC � � � �s /NNC+ �� � /PCA¡� � � � /VV+ � �� /EPF+�� B ��� � � /EFN ./SFN

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.12

Page 13: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

13

TableV. ExampleEntriesfrom InflectionandStemDictionaries

InflectionDictionary StemDictionary¾ i�_cp O (psita) EFN MP`U M g¿ (kankyek) NN¾ i�_cW] (psio) EFN MP`U p `U n O (kantanha) VJMPO (ka) PCA MP`U Z À¿ n _ (kanlyakhi) ADVMTÁh (keyss) EPF MP`U i SU (kansen) NNCp O (ta) EFN MP`U i Sm n O (kansepha) VVª] p O (pota) PAD MP`U i Sm (kansep) NNCW `h (ass) EPF MP`U bL n O (kancwuha) VVWX U (un) PAU MP`e (kal) VV

The proceduresfor lemma/inflectionidentificationare shown in Figure 3.Givenapairof a wordandits complex tag( ®�¯8°c± , ²�¶¢³�¶Â´Ã¶­µ ): Ä1. Separatethesubstringfrom the input word that matchesthe longestin-

flection in the inflection dictionaryassociatedwith the last inflectionaltag in the input complex tag,andlabel it with the inflectional tag (e.g.,±�¹�µ );If thereis nomatchinginflectionin theinflectiondictionary, separatethesubstringthat matchesthe longestinflection in the inflection dictionaryassociatedwith the second-to-lastinflectional tag, andlabel it with theinflectionaltag(e.g., ±+¹�´ );

2. Repeatthe above procedureuntil the POS tag (i.e., the first tag) isreached.Theremainingsubstringis labeledwith thePOStag(e.g., ®�¹º² ).

Figure 3. Proceduresfor Lemma/InflectionIdentification

In this sectionwe showedthatthecombinationof corpus-basedrulesandthesimplealgorithmgiven above appliedto morphologicallytaggeddataissufficient for the usually complex task of lemma/inflectionidentification.ÅHoping to improve the performance,we have alsoimplementeda variationof this techniqueusingcorpus-particularknowledgesuchasspecialcasesforPOStagslike NNC, VV, andVJ basedon our intuitions. The performancehowever did not improve much.Theperformanceresultsaregiven in TableVI.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.13

Page 14: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

14

TableVI. Evaluationof theMorphologicalTagger

Method Precision(%) Recall(%)

Treebank-trained 95.43 95.04

Treebank-trainedÆ 95.78 95.39

3. Evaluation and Discussion

3.1. EVALUATION OF THE MORPHOLOGICAL TAGGER

The performanceof the morphologicaltaggertrainedon the Treebankhasbeenevaluatedon 9% of the Treebank.The testsetconsistsof 3,717wordtokensand425sentences.Both precisionandrecallwerecomputedby com-paringthe morpheme/tagpairs in the testfile andthe “gold file”. The goldfile containshand-correctedannotationsthat areusedto evaluatethe accu-racy of the taggeroutput. The precisioncorrespondsto the percentageofmorpheme/tagpairs in the gold file that matchthe morpheme/tagpairs inthetestfile. Therecallcorrespondsto thepercentageof morpheme/tagpairsin the test file that matchthe morpheme/tagpairs in the gold file. Our ap-proachyielded95.43%precisionand95.04%recallfor theTreebank-trainedtagger. Further, usingthecorpus-specificlemma/inflectionidentificationpro-cedurein theTreebank-trainedÇ tagger, theperformanceincreasedonly veryslightly, yielding precisionandrecall scoresof 95.78%and95.39%respec-tively.

Theperformanceobtainedis comparableto theresultsreportedfor state-of-the-artsystems.TableVII shows accuracy scoresreportedfor instancebysystemssuchasthosedescribedin Chanetal. (1998),Yoonetal. (1999),andLim etal. (1997),whentestedonthetestsetdrawn from thesamecorpuseachwastrainedon. Ideally, for a fair comparison,it wouldbedesirableto retrainothersystemson thePennKoreanTreebankandcomparetheirperformanceswith theproposedtaggeron thesametestset,or retraintheproposedtaggeron thesamedomainastheothertaggersandcomparetheresultson thesametest set.Unfortunately, both optionsare not realistic given that neitherthetrainingdatanor thesourcecodefor theothertaggersarepublicly available.

3.2. ERROR ANALYSIS

Thesourcesfor errorsin oursystemcanbedividedinto six majortypes:tag-ging, spellingrecovery andambiguityresolutionerrors,andmissingentriesfrom the commonnounlist, the inflectional templatelist andthe inflectiondictionary. The mostfrequenttype of error wasa taggingerror, comprising

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.14

Page 15: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

15

Table VII. Comparison with otherMorphologicalTaggers

Reference Accuracy (%)

Chanet al. (1998) 97.0

Yoonet al. (1999) 94.7

Lim et al. (1997) 94.8

63% of total errors.In exampleswith taggingerrors,the lemmaor the in-flectionis mistagged.Thesecondmostfrequenttypeof errorwascausedbyincorrectspellingrecovery, comprising21%of totalerrors.If all morpholog-ical contractionsin the input string arenot correctlydecomposed,the inputwill be misanalyzedandmistagged.Ambiguity resolutionerrorsaremostlycausedby our longestmatchstrategy. Givena pair of a word anda complextag( ®�¯q° , ³È¶�´ ), if theinflectiondictionarycontainsanentrywhereq° is asso-ciatedwith ´ andanotherentrywhere° is associatedwith ´ , thenthepossibleanalysesfor ®+¯q° are ®�¯ ¹�³È¶É°¼¹�´ and ®�¹�³È¶^¯q°¼¹�´ . Thelemma/inflectionidenti-ficationprocedurewill alwaysselecttheanalysiswherethelongestsubstringis identifiedasaninflection,selectingthesecondanalysis.Thus,anerrorwillresultif thecorrectanalysisis thefirst one.This typeof errorcovered16%of total errors.Further, in a smallnumberof cases,missingentriesfrom thecommonnounlist, the inflectionaltemplatelist andtheinflectiondictionaryresultedin errors.Thenumberandthepercentageof thetypesof errorsfromthetestdatais summarizedin TableVIII, alongwith examples.

4. Conclusion

In this paper, we describean implementedmorphologicaltagger(that alsodoesmorphologicalanalysis)for Korean,a languagewith a rich inflectionalsystem.Thetrainingdatafor thetrigram-basedtaggingcomponent,andmor-phologicalrules,inflection andstemdictionariesfor morphologicaltaggingandanalysisareautomaticallyextractedfrom anannotatedcorpora,thePennKoreanTreebank.This corpus-basedapproachenabledus to implementtheentiresystemin a short periodof time (roughly eight weeks),with result-ing performancein the rangeof 95% precision/recallon the test set fromthe samedomainas the training data.The treebank-trainedmorphologicaltaggerperformedaswell asstate-of-the-arttaggerswhosetraining dataarefrom differentdomains,eventhough,in contrastto othertaggers,in our sys-temtheanalysisphasefollows thetaggingphase.This waspossiblebecauseof the two-stagemorphologicaltaggingprocesswe employed. In particu-

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.15

Page 16: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

16

TableVIII. Summaryof Error Analysis

Typeof error Number (%) Gold Test

Tagging 174 63 bPSÊ b�Ë¿ n O /VJ+ b _ /ECS bPSÊ b�Ë¿ n O /VV+ b _ /ECS

cengcikha-ci cengikha-ci

HONEST-connective ‘be honest’

Spelling 57 21 Ì _qWL /VV+ Í /EAN Ì _cWl e /NNC

recovery piwu-l piwul

EMINATE-adnominal

‘which will eminate’

Ambiguity 16 6 ªL pP© /NNC+ Z] /PAD ªL /NNC+ pP© Z] /PAD

pwutay-lo pwu-taylo

SQUAD-TO ‘to thesquad’

MissingNNC 15 5n `¿ p] /NNC

n `¿ /NNC+ p] /PAU

hakto hak-to

CADET ‘cadet’

Missing 11 4nl U Z gU /NNC+ W _ /CO+ Ml U /EFN

nl U Z gU W _ Ml U /NNC

template hwunlyun-i-kwun hwunlyunikwun

DRILL-cop-decl‘be a drill’

Missing 4 1 i] pP© MX m /NNC+ WT¨�£ Ocb _ /PAD i] pP© MX m WT¨ /NNC+ £ O8b _ /PAD

inflection sotaykup-eykkaci sotaykupe-kkaci

PLATOON LEVEL-TOWARDS

‘towardstheplatoonlevel’

lar, in the secondtagging stage,the morphologicaltags for the unknownwordsareupdated,exploiting the inflectionsby matchingtheseagainstin-formationstoredin inflectionaltemplates.Our taggerhasbeensuccessfullyincorporatedinto a Koreanstatisticalparser(SarkarandHan,2002), and aKorean–Englishmachinetranslationsystemaspart of the sourcelanguageanalysiscomponent(Palmeretal., 2002). Our approachis applicableto anylanguage,as long asa corpusannotatedwith morphologicalinformation isavailable. Although the training dataand the morphologicalrules will becorpus-specific,andhencelanguage-specific,the generalapproachand theproceduresthatwehaveproposedfor spellingrecovery, updatingmorpholog-ical tags,andlemma/inflectionidentificationaswell astrigram-basedtaggingarelanguage-neutral,andcanbeappliedto any agglutinative language.In thefuture,we would like to testtheportability of our approachby first retrain-ing our taggeron a larger corpus,andsecondby applyingour approachtomorphologicaltaggerdevelopmentfor otherlanguageswith rich morphology.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.16

Page 17: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

17

Appendix

A. A List of Tags used in the Penn Korean Treebank

A.1. POS TAGS

NPR: propernoun:� �� Î � (hankwuk ‘Korea’), Ï� �¢Ð�!�ÒÑ r � (kullinthon ‘Clinton’)

NNC : commonnoun:� ��¢� (hakkyo ‘school’), ÏP�ÓÔÕ Ñ � (kemphyute ‘com-

puter’)

NNX : dependentnoun: �G (kes ‘thing’),�� s (tung ‘etc’), � D� (nyen ‘year’)

NPN: personalpronoun,demonstratives: C (ku ‘he’), ��� �G (ikes ‘this’)

NNU : ordinal, cardinal,numeral: 1,� � � � (hana ‘one’),   �GÒÖ t (chesccay

‘first’)

NFW : wordswritten in foreigncharacters: Clinton , computer

VV : verb: � (ka ‘go’),� �� (mek ‘eat’)

VJ : adjective : � ×�ØC (yeppu ‘pretty’),� � ÐC (talu ‘dif ferent’)

VX : auxiliarypredicate: ��!� (iss ‘presentprogressive’),� � (ha ‘must’)

ADV : constituentandclausaladverb:� t �� (maywu ‘very’), "� �Ù s � � (coy-

onghi ‘quietly’),���� ��!� (manil ‘if ’)

ADC : conjunctive adverb : CFÐ�� � (kuliko ‘and’), C#Ð �E� � (kulena ‘but,however’)

DAN : configurative, demonstrative : � t (say ‘new’),� �� (hen ‘old’), C (ku

‘that’)

IJ : exclamation: � � (a ‘ah’)

LST : list marker : a, (b) , 1, 2.3.1 , � (ka), � � . (na.)

A.2. INFLECTIONAL TAGS

PCA : casemarker : � / ��� (ka/i nominative), �� � / Ð� � (ul/lul accusative), �Ú(uy possessive), ��Û (ya vocative)

PAD : adverbialpostposition: � ��� � (yese ‘from’), Ð� (lo ‘to’)

PCJ: conjunctive postposition: �Ü / Ü (wa/kwa ‘and’),� �Ý� (hako ‘and’)

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.17

Page 18: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

18

PAU : auxiliarypostposition:���� (man ‘only’),

��(to ‘also’), �� � (nun topic)

CO : copula: ��� (i ‘be’)

EFN: sentencetypemarker : �� � � � / H � � (nunta/nta declarative), � �ÝÐ � / Ð �(ela/la imperative), ��� / �� � � / �� � " � (ni/nunka/nunci interrogative), " �(ca propositive)

ECS: coordinate,subordinate,adverbial, complementizer: � (ko ‘and’),�CIÐ� (mulo ‘because’), � (key attachesto adjectivesto deriveadverbs),� �E� (tako ‘that’)

EAU : auxiliary, on verbsor adjectivesthat immediatelyprecedeauxiliarypredicates:� � (a), � (key), " � (ci), � (ko)

EAN : adnominal,onmainverbsor adjectivesin relativeclausesor comple-mentclausesof acomplex NP: �� � / H (nun/n)

ENM : nominal,on nominalizedverb: � (ki), Þß à (um)

EPF: pre-finalending: � �� (ess past), ��� (si honorific)

XSF : suffix : ��!Ó (nim),�� � (tul), " �� (cek)

XPF : prefix : " � (cey), �� (kak),� t

(may)

XSV : verbalizationsuffix :� � (ha),

�á(toy), ���ÝÏ � (siki)

XSJ: adjectivizationsuffix : �CFÐ �B (sulep),� �B (tap),

� � (ha)

A.3. PUNCTUATION TAGS

SCM : comma: ,

SFN: sentenceendingmarkers: . ? !

SLQ: left quotationmark: ‘ ( ‘‘ âSRQ: right quotationmark: ’ ) ’’ ãSSY: othersymbols: ... ; : -

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.18

Page 19: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

19

B. Description of the Statistical Model for Tagging

In this appendix,we provide detailsof thestatisticalmodelwe usedin twodifferentstepsin our overall morphologicaltaggingalgorithm.Theonly dif-ferencein the two usesof the modelwasthe changein the tag setused.InSection2.2, we useda tag set of five tagsin order to perform the task ofspelling recovery, while in Section2.3.1,we useda tag set of 165 tagsinorderto performthefull taskof taggingeachmorpheme.

Let the input sentence(word sequence)be w vä~ z { ~ � { | | |�{ ~ } . Let themostlikely tagsequencebet u�våwyuz\{ wyu �c{ | | |�{ wæu} . Givenaninputwordsequencew wewantto find themostlikely tagsequence,whichwedoby constructinga probabilitymodelwhich providesa methodfor comparingall possibletagsequencesgiventheinputwordsequence,����w z { w � { | | |�{ w }ç� ~ z { ~ � { | | |�{ ~ }�� .Thebest(or mostlikely) tagsequenceis (1).

t u v#�\���[���������c�'�'�'� � ������w z { | | |�{ w } � ~ z { | | |�{ ~ } � (1)

In orderto estimatethis from our trainingdata,we useBayesrule to con-vert our conditionalprobability into a generative model(2). We arecompar-ing tagsequencesfor thesameinputwordsequence,andsowecanignorethedenominator����~ z { | | |�{ ~ }�� (3) sinceit doesnotcontributeto thedistinctionbetweendifferenttagsequences.

�\���[���������q�'�'�'� � ��������w z { | | |�{ w }^� ~ z { | | |�{ ~ }��vé�\���[����� � � �'�'�'� � � � ����~ z { | | |¼{ ~ }ê� w z { | | |�{ w }��Èë ����w z { | | |¼{ w }������~ z { | | |�{ ~ } � (2)

vé�\���[������� � �'�'�'� � � � ����~ z { | | |¼{ ~ }ê� w z { | | |�{ w }��Èë ����w z { | | |¼{ w }�� (3)

Thetwo termsin this equationcannotbereasonablyestimateddueto sparsedataproblems.We apply theMarkov assumptionto simplify eachterm.Forthefirst termthat involves lexical probabilities,theassumptionmadeis thattheprobabilityof generatingaword is only dependenton its POStag(4).

����~ z { | | |�{ ~ }ç� w z { | | |¼{ w }��vì����~ z � w z �Èë ����~ � � w � �íë | | | ë ����~ } � w } �v }îï�ð z ����~ ï � w ï � (4)

Thesecondterm is themodelthatproducesPOStagsequences.We usetheMarkov assumptionto generatetagsequencesusinga trigrammodel,whereeachtagis generatedbasedon theprevioustwo tags(5).

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.19

Page 20: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

20

����w z { | | |�{ w }��vé����w z �Èë ����w � � w z �Èë ����w � � w z { w � �Èë | | | ë ����w }ê� w }èñ � { w }èñ � � (5)vé����w z �Èë ����w � � w z �Èë }îï�ð � ����w ï � w ï ñ � { w ï ñ � �Puttingthesetwo termsbackinto theoriginalequationwe get:

�\���[���������q�'�'�'� � ��������w z { | | |�{ w }^� ~ z { | | |�{ ~ }��vì�\���[���������q�'�'�'� � ��������~ z { | | |¼{ ~ }ê� w z { | | |�{ w }��Èë ����w z { | | |¼{ w }��vì�\���[���������q�'�'�'� � ��� ò }îï�ð z ����~ ï � w ï �ôóõëò ����w z �íë ����w � � w z �íë }îïöð � ����w ï � w ï ñ � { w ï ñ � �ôó (6)

vì�\���[����� ���q�'�'�'� � � � }îï�ð z ����~ ï � w ï �Èë ����w ï � w ï ñ � { w ï ñ � � (7)

In order to simplify our equationfrom (6) to (7), we addspecialtokensoftype<bos> to thebeginningof eachsentencein orderto conditionthefirstword on thesetokens.We alsoaddtokensof type <eos> to the endof thesentence.

Now thatwehavethemodelin (7),all weneedto doto find themostlikelytagsequenceis to train thefollowing two probabilitymodels,����~ ï � w ï � and����w ï � w ï ñ � { w ï ñ � � .Thisis doneusingourtrainingdatawhichconsistsof wordandtagsequences.To find themostlikely tagsequence,we usethesamealgorithmusedto findthebesttagsequencein HiddenMarkov Models:theViterbi algorithm.

Themodelhowever cannothandleunseendata,suchasunseenwords ~ ï ;unseenword–tagcombinations��~ ï { w ï � ; andunseentrigrams ��w ï ñ � { w ï ñ � { w ï � .For unseenwords,werenamethosewordswith singletoncountsin our train-ing dataasaspecial<unseen> token.Dueto thismodel,asstatedin Section2.3.1,the trigram taggerfor morphologicaltaggingassignstheNNC (com-monnoun)tag to unknown wordsby default, NNC beingthemostfrequenttag for unknown words.The updatetaggingstep(seeSection2.3.2) doesfurther disambiguationfor theunknown words.We alsousesuffix andpre-fix featuresfrom the wordsanduseit in the original modelaswasdoneinWeischedelet al. (1993).For unseentri-tag sequencesin the trigram model

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.20

Page 21: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

21

of tag sequences,we useKatz backoff smoothing(Katz,1987) which usesbigramsof tagsequenceswhenthetrigramsequencewasunobserved in ourtrainingdata.Thediscountfactorfor thebigramsequenceis computedusingthe Good–Turing estimate(Good,1953) which is the standardpracticeinKatzbackoff smoothing.

Acknowledgements

WethankAnoopSarkarandAravind Joshifor valuablediscussionsonmanyoccasions.We are also grateful to the anonymous reviewer whosecriticalcommentshelpedto reshapeandimprove thispaper. All errorsareours.Thiswork wassupportedby IRCS,andcontractDAAD 17-99-C-0008awardedbytheArmy ResearchLabto CoGenTex, Inc.,with theUniversityof Pennsylva-nia asa subcontractor. In addition,thefirst authorwassupportedby SSHRC#410-2003-0544.

Notes÷Hangulcharactersareromanizedfollowing theYaleRomanizationConventionthrough-

outthepaper. In thisandsubsequentexamples,thefollowingabbreviationsareused:acc(usativecasemarker),co(pulainflection),decl(arative), int(errogative),nom(inative casemarker).ø

Furtherrelevantworks,all in Korean,whichthepresentauthorswereunfortunatelyunableto locate,andfor which incompletereferencesareprovided, includea presentationby E. C.Lee andJ. H. Lee at the 4th Conferenceof Hanguland KoreanInformationProcessing,ajournalpaperby J. H. Choi andS. J. Lee in the1993Journal of Korea Information ScienceSociety, andMastersthesesby S.Y. Kim, from KAIST in 1987,andby H. S.Lim from KoreaUniversityin 1994.ù

NNC is a tagfor commonnouns,NPNfor pronouns,VV for verbs,andVX for auxiliaryverbs.PCA is a tagfor casemarkers,EAU for inflectionson a verb followedby anauxiliaryverb,andEFN for sentence-typeinflections.The tagsusedin thePennKoreanTreebankarelistedin AppendixA. SeeHanandHan(2001)for furtherexplanationsof thesetags.ú

Thespellingrecovery templatesaresimilar to themorphographemicrewrite rulesfoundin traditionalmorphologicalanalyzers.Thedifferencehowever is thatwhile traditionalmor-phographemicrewrite rules are conditionedon morphologicalnotions suchas morphemeboundaries,hencepresupposinga segmentationof a word into its componentmorphemes,ourspellingrecovery templatesareconditionedon thePOStagassignedto theentireword.û

Srinivas’s (1997) SuperTaggeris publicly available from the XTAG Projectwebpage:http://www.cis.upenn.edu/˜xtag .ü

While theKoreanTreebankis smallerthantheusualdatasetusedin EnglishPOStagging(the 1m-word PennEnglishTreebank(Marcuset al., 1993)) the sizeof the Treebankis bigenoughto provide statisticallysignificantresults.Hence,we did not find theneedto conductý -fold crossvalidationexperiments.þ

Thetagsetusedin our morphologicaltaggeris a subsetof theKoreanTreebanktagset.In thetagsetfor thetagger, EAU andECSarecollapsedto ECS,andaffixal tagssuchasXPF,

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.21

Page 22: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

22

XSF, XSV andXSJarenot included.In particular, in thetrainingdata,thePOStagfor wordswith XSV tagwereconvertedto VV, andXSJtagwereconvertedto VJ.ÿ

The longestmatchrequirementemployed in lemma/inflectionidentification is similarto the oneemployed in a finite-statetransducerwith DirectedReplacement,asdescribedinKarttunen(1996).�

A reviewer asksif we assumethat all morphologicalprocessesare via overt suffixes.Althoughtheremaybeempiricalandtheoreticalmotivationsfor positingnull morphemesinKorean,we did not model this in our system.The main reasonfor not doing so is that theTreebankwe useddid not have any annotationsfor null morphemes.

References

Chan,Jeongwon, GeunbaeLee, and Jong-HyeokLee: 1998, ‘GeneralizedUnknown Mor-phemeGuessingfor Hybrid POS Tagging of Korean’. In: Proceedings of the SixthWorkshop on Very Large Corpora, Montreal,Canada,pp.85–93.

Church,Kenneth:1988,‘A StochasticPartsProgramandNounPhraseParserfor UnrestrictedText’. Computer Speech and Language 5, 19–54.

Ezeiza,N., I. Alegria, J.M. Arriola, R. Urizar, and I. Aduriz: 1998,‘Combining StochasticandRule-basedMethodsfor Disambiguationin Agglutinative Languages’.In: COLING-ACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17thInternational Conference on Computational Linguistics, Montreal,Canada,pp.379–384.

Good,I. J.: 1953,‘The PopulationFrequenciesof SpeciesandtheEstimationof PopulationParameters’.Biometrika 40, 237–264.

Hajic, Jan and BarboraHladka: 1998, ‘Tagging Inflective Languages:Predictionof Mor-phological Categories for a Rich, StructuredTagset’. In: COLING-ACL ’98: 36thAnnual Meeting of the Association for Computational Linguistics and 17th InternationalConference on Computational Linguistics, Montreal,Canada,pp.483–490.

Hajic, Jan,Pavel Krbec, Pavel Kveton, Karel Oliva, and Vladimır Petkevic: 2001, ‘SerialCombinationof RulesandStatistics:A CaseStudyin CzechTagging’. In: Associationfor Computational Linguistics 39th Annual Meeting and 10th Conference of the EuropeanChapter, Toulouse,France,pp.260–267.

Hakkani-Tur, Dilek Z., Kemal Oflazer, and GokhanTur: 2002, ‘Statistical MorphologicalDisambiguationfor Agglutinative Languages’.Computers and the Humanities 36, 381–410.

Han,Chung-hyeandNa-RaeHan:2001,‘Partof SpeechTaggingGuidelinesfor PennKoreanTreebank’.IRCSReport01-09,IRCS,Universityof Pennsylvania.

Han,Chung-hye,Na-Rare[sic] Han,Eon-SukKo, andMarthaPalmer:2002,‘DevelopmentandEvaluationof aKoreanTreebankandits Applicationto NLP’. In: LREC 2002: ThirdInternational Conference on Language Resources and Evaluation, Las Palmasde GranCanaria,Spain,pp.1635–1642.

Hong, Y., M. W. Koo, andG. Yang:1996, ‘A KoreanMorphologicalAnalyzer for SpeechTranslationSystem’. In: ICSLP 96: The Fourth International Conference on SpokenLanguage Processing, Philadelphia,PA, pp.676–679.

Karttunen,Lauri: 1996,‘DirectedReplacement’.In: 34th Annual Meeting of the Associationfor Computational Linguistics, SantaCruz,California,pp.108–115.

Katz,Slava M.: 1987,‘Estimationof Probabilitiesfrom SparseDatafor theLanguageModelComponentof a SpeechRecognizer’.IEEE Transaction on Acoustics, Speech and SignalProcessing 35, 400–401.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.22

Page 23: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

23

Lee, Sang-Zoo,Jun-ichi Tsujii, and Hae-ChangRim: 2000, ‘Lexicalized Hidden MarkovModels for Part-of-speechTagging’. In: Proceedings of the 18th International Confer-ence on Computational Linguistics, COLING 2000 in Europe, Saarbrucken,Germany, pp.481–487.

Lim, Hewui Seok,Jin-Dong Kim, and Hae-ChangRim: 1997, ‘A KoreanPart-of-speechTaggerusingTransformation-basedError-drivenLearning’. In: Proceedings of the 1997International Conference on Computer Processing of Oriental Languages, Hong Kong,pp.456–459.

Lim, Heui-Suk,Sang-ZooLee,andHae-ChangRim: 1995,‘An efficient Koreanmorpholog-ical analyzerusing exclusive information’. In: International Conference on ComputerProcessing of Oriental Languages, ICCPOL ’95, Honolulu,HI.

Marcus,Mitch, BeatriceSantorini,andM. Marcinkiewicz: 1993,‘Building aLargeAnnotatedCorpusof English’. Computational Linguistics 19, 313–330.

Palmer, Martha,Chung-hyeHan, Anoop Sarkar, and Ann Bies: 2002, ‘Integrating KoreanAnalysisComponentsin a Modular Korean/EnglishMachineTranslationSystem’. Ms.Universityof PennsylvaniaandSimonFraserUniversity.

Ratnaparkhi,Adwait: 1996,‘A Maximum Entropy Model for Part-of-speechTagging’. In:Proceedings of the Conference on Empirical Methods in Natural Language Processing,Philadelphia,Pa,pp.133–142.

Sarkar, AnoopandChunghyeHan:2002,‘StatisticalMorphologicalTaggingandParsingofKoreanwith anLTAG Grammar’. In: Proceedings of the 6th International Workshop onTree Adjoining Grammars and Related Formalisms TAG+6, Venice,Italy, pp.48–56.

Srinivas,B.: 1997, ‘Complexity of Lexical Descriptionsand its Relevanceto Partial Pars-ing’. Ph.D. thesis,Departmentof Computerand Information Sciences,University ofPennsylvania.

Tufis, Dan,PeterDienes,CsabaOravecz,andTamasVaradi:2000,‘PrincipledHiddenTagsetDesignfor IteratedTaggingof Hungarian’.In: LREC 2000: 2nd International Conferenceon Language Resources and Evaluation, Athens,Greece,pp.1421–1426.

Weischedel,Ralph,RichardSchwartz, Jeff Palmucci,Marie Meteer, and LanceRamshaw:1993, ‘Coping with Ambiguity and Unknown Words through Probabilistic Models’.Computational Linguistics 19, 359–382.

Yoon,Juntae,C. Lee,S.Kim, andM. Song(�� ���� ��� , � ��������� , ����������� , !�"�#%$&(')* ): 1999,‘ +-,�/.-012(3 45 �768:9� � ')*�; � Morany: < =>@?A �CB �EDFHGI�JKMLI �� NPO%QR 12TSVUW�XY[Z-\^]_a`-b � �dc eEf-g�� �dhikj-l ; �nm%opO%QR 3 45 �768:9� � ')* ’ [MorphologicalAnalyzerof YonseiUniversityMorany: Morphological

AnalysisbasedonLargeLexical DatabaseExtractedfrom Corpus].In Proceedings of the11th Conference on Hangul and Korean Language Information Processing ( q 0 11�rsO%QRutv wx yz|{} ~ Z-\����/�F���7� ���%����A � 12 �r��� �u�� ��� y� ), pp.92-98.

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.23

Page 24: A Morphological Tagger for Korean: Statistical Tagging ...chunghye/papers/mtjournal-editor-edit.pdf · A Morphological Tagger for Korean: Statistical Tagging Combinedwith Corpus-basedMorphologicalRule

mtjournal-editor-edit.tex; 13/07/2004; 9:25; p.24