
Word Segmentation: Part 2 (Chinese, Japanese, Korean)

Project Leaders: Sun Maosong, Isahara Hitoshi, Choi Key-Sun

Presentation

• General Introduction to CJK Word Segmentation (Choi Key-Sun)

• Chinese GB Standard on Word Segmentation (Song Min)

• Japanese Specification (Isahara Hitoshi)

• Korean Specification (Choi Key-Sun, Hwang Dosam)

• Chinese Specification (Song Min)

Preparatory Works

• One CJK Tri-lingual Parallel Corpus
  – Mainichi Shinbun in Japanese as the source
  – Translated into Chinese and Korean
  – Word-segmented and POS-tagged

• Word list
  – 1,000-word lists for CJK
    • Independently collected
    • Possibly sorted by frequency in each language’s corpus

Tri-Parallel Corpus of CJK

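• # S-ID:950101004-002 KNP:96/10/27 MOD:96/11/26

• J# ロシア南部チェチェン共和国の首都グロズヌイに進攻したロシア軍は三十一日、首都中心部を装甲車などで攻撃、大統領官邸など数カ所が炎上した。

• C# 进攻俄罗斯南部车臣共和国首都格罗兹尼的俄罗斯军队,31日出动装甲车等进攻首都中心地带,总统官邸等数处着火。

• K# 러시아남부체첸공화국수도그로즈니를공격하는러시아군은 31일장갑차등을동원해수도중심지대를진격하여대통령관저등많은곳에서불이났다.

(English gloss of the Japanese source: Russian forces that advanced on Grozny, capital of the Chechen Republic in southern Russia, attacked the center of the capital with armored vehicles on the 31st, and several sites including the presidential residence caught fire.)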

Language and Morphology

• English
  – Lemmatizer + POS tagger
  – A word has at most several word forms

• Chinese
  – Segmentation system + POS tagger
  – A word has one word form

• Korean (Japanese)
  – A morphological analyzer is more essential than in other languages
    • Complex segmentation
    • Frequent agglutination
    • Word formation
  – The Chinese characters in common use

Computational Morphology

• Morpheme
  – A minimal meaningful element
  – "computational" = "comput+ation+al"

• Morphological Analysis
  – Segmentation
    • To divide into morphemes or words
    • To include lemmatization (= to find its stem)
  – Categorization
    • To assign part-of-speech categories (POS tags)
    • To assign semantic features

Morphological Analysis: e.g.,

• These probabilities are learned from the raw corpus in an unsupervised manner.
  – Original text

• These probability+Plural be+Plural+Present learn+PP from the raw corpus in a+Vowel un_supervise+PP manner .
  – Lemmatized text

• These/pron probability/noun+plural are/be_verb+plural+present learn/verb_ed/pp from/prep the/article raw/adjective corpus/noun in/prep an/article un/prefix_supervis/verb_ed/pp manner/noun ./period
  – Part-of-speech categorization (POS tagging/annotation)

Word Formation

• Types of Languages
  – Inflectional
    • e.g., Latin
    • canta-bo
  – Analytic - derivation
    • e.g., English
    • I will sing
  – Agglutinative - concatenation
    • e.g., Korean, Japanese, Hindi, Turkish
    • na-neun norayha-keyssda
    • 나는 노래하-겠다

Complexity Comparison

• Complexity?
  – In English, we only have to consider each word form.
  – In Chinese, the task is segmentation rather than morphological analysis.
  – In Korean, MA must process segmentation and agglutination simultaneously, so segmentation and the analysis of functional words are much more complex. (Japanese is similar to Korean.)

Language   Spacing       Order of verb forms per verb   Complexity of segmentation
English    Word form     5                              Easy
Korean     Word phrase   More than 5,000                Very hard
Chinese    No            1                              Very hard

Table of Contents

• Introduction: Morphology and Characteristics

• English, Korean, Chinese, Japanese
  – Complexity of Morphology
  – Types of Ambiguities

• General Representation
  – Components of Morphological Analysis
  – Implementation Scheme
  – Unknown Words

General Morphological Analyzer: 2 phases

• Candidate generation
  – Generates possible sequences of morphemes
    • Segmentation
      – Lemmatization: recovery of phonetic changes
    • Processing for unknown words

• Candidate selection
  – “POS Tagging”
  – Methodologies
    • Statistical methods: e.g., given two or three consecutive words and their POS tags, we can predict the current word's tag (Noun-Verb-Noun).
    • Rule-based methods
    • Hybrid methods

General Scheme of Implementation

[Diagram. Recoverable labels: Morphological Analyzer; POS Tagger; Statistics; Linguistic Rules]

• Basic Components: Code System, Dictionary, Phonological Rules, Connection Rules, Algorithms and Data Structure, Parser

• Additional Components: Unknown words, Symbols/Numbers, Exceptional word phrase, Post-processing, Foreign words

Implementation (2/3)

• Morphological connection rules
  – idioms
  – patterns
  – weighted rules

• Dictionaries (Lexicons)
  – general lexicon
    • level of segmentation depends on the lexicon
  – domain-specific dictionary
    • e.g., economics, law, patent (science & technology), ...
  – user-defined dictionary

Implementation (3/3): Diagram of word segmentation system for Chinese

[Diagram, from (Keh-Jiann Chen, 2000). Recoverable box labels, in no particular order:]

• Steps: Choose Lexicon or Lexicons; Access Lexicon; Generate Substrings; Word Matching; Find word candidates; Unknown Words Identification; Disambiguation; POS-Tagging

• Resources: General Lexicon; Domain Specific / User Defined Lexicon; Morphological Rules, Idioms, Patterns; Unknown Word Models; Heuristic Rules

• Other label: level of segmentation

Corpora

• A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language
  – David Crystal, A Dictionary of Linguistics and Phonetics

• A collection of naturally occurring language text, chosen to characterize a state or variety of a language.
  – John Sinclair, Corpus, Concordance, Collocation

• Text Corpora
  – Written Language
  – Spoken Language

• Corpora
  – English: Brown, LOB, BNC and Penn Treebank
  – Korean: KAIST Corpus and Sejong Corpus
  – Japanese: Nihon Keizai Shimbun, EDR Corpus
  – Chinese: Beijing University, LIVAC (HK City U.), Sinica Corpus

• Statistics can be obtained from a large corpus.

Basic Components of Morphological Analyzer (1/2)

• Dictionary
  – Lemma, part of speech, semantic feature (optional)
  – Structure

• Connection Rules
  – Connection Table (POS bigram)
  – Word Formation Graph (n-gram)

• Phonological Rules
  – Declarative Rules
    • 2-level model
  – Procedural Rules

Basic Components of Morphological Analyzer (2/2)

• Algorithms and Data Structures
  – Top-Down and Bottom-Up
    • Chart Parsing, Recursive Parsing
  – Data Structure
    • Chart, Table, Tree, Lattice, Graphs

• Disambiguation (Tagging phase)
  – Dictionary Searching
    • All words in dictionary + rules for unknown words
    • Stem dictionary + rule = Derived dictionary
  – Stochastic Model
  – Rule-based Model
  – Hybrid Model, Integrated Model
  – Simple Heuristics
    • preferring the longest morpheme

Word Segmentation in Chinese - A Heuristic -

• Maximal matching rule
  – The most plausible segmentation is the three-word sequence with the maximal length.
  – This heuristic rule achieves accuracy as high as 99.69% with an applicability of 93.21%, i.e., 93.21% of the ambiguities were resolved by this rule. [Chen & Liu 1992]
    • 完成鑑定
    • 完成鑑定報
    • 完成鑑定報告 (finish judgment report)
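
The maximal matching rule above can be sketched in a few lines. This is a minimal illustration, not the system of [Chen & Liu 1992]: the function name `three_word_max_match`, the toy lexicons, and the single-character fallback for out-of-lexicon text are all assumptions made for the example. At each position it enumerates chunks of up to three lexicon words, keeps the chunk covering the most characters, commits only the first word of that chunk, and repeats.

```python
def three_word_max_match(text, lexicon):
    """Segment text with a three-word maximal matching heuristic (toy sketch)."""

    def words_at(i):
        # Lexicon words starting at position i; fall back to one character.
        found = [text[i:j] for j in range(i + 1, len(text) + 1) if text[i:j] in lexicon]
        return found or [text[i]]

    def best_chunk(i):
        # Enumerate chunks of up to three words and keep the longest one.
        best = []

        def extend(pos, chunk):
            nonlocal best
            if len(chunk) == 3 or pos == len(text):
                if sum(map(len, chunk)) > sum(map(len, best)):
                    best = chunk
                return
            for w in words_at(pos):
                extend(pos + len(w), chunk + [w])

        extend(i, [])
        return best

    result, i = [], 0
    while i < len(text):
        first = best_chunk(i)[0]  # commit only the first word of the best chunk
        result.append(first)
        i += len(first)
    return result


# Hypothetical lexicons, chosen only to exercise the heuristic.
print(three_word_max_match("完成鑑定報告", {"完成", "鑑定", "報告"}))
print(three_word_max_match("abcdef", {"ab", "abc", "cdef"}))
```

In the second call a greedy longest-word-first matcher would start with "abc" and then be left with unmatched single characters, whereas the three-word lookahead prefers "ab" because "ab" + "cdef" covers more of the input.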

Additional Components of Analyzer

• Unknown Word Analysis
  – Suffix, ending and preposition information

• Symbol / Number Expression
  – Templates
    • Ex) “2005年 6月 20日” (2005/6/20), “3千 4百圓” ($3,400), “+82-42-869-5565” (telephone number)

• Foreign Languages
  – TV, [컴퓨터 kom-pyu-teo] (computer)

• Exceptional Word Phrase
  – Word Phrase Dictionary

• Post-Processing
  – Spacing Problem
  – Processing Non-Standard Language

Characteristics of different types of unknown words

• Possibly infinite number of elements but with closed-form representations, such as numeric-type compounds:
  – 2005年 = 2005-year

• Open-ended types without closed-form representations, such as
  – proper names: "Microsoft"
  – derived words: "computer-ize"
  – compounds: "computer desk"
  – abbreviations (acronyms): "LG" (Lucky-Goldstar)

from ACL [Chen, 2000]

Proper Names with Transliteration

• Personal
  – Bill Clinton (bil klintən, 克林頓)
  – Jiang Zemin (jiang z'əmin, 江澤民)

• Geographical
  – Korea (koria, 韓國)
  – Cambodia (kambodia, カンボジア)

• Brands/Organization
  – Panasonic (panasonik, 파나소닉, 樂聲)
  – Samsung (samsəŋ, 삼성, 三星, サムソン)

Numeric Expression Representation - Regular Expression -

• To represent the type of unknown words with a possibly infinite number of elements but with a closed-form representation,
  – such as numbers, dates, times, determinant-measure compounds, etc.

• e.g.,
  – Number → Digit Number | Digit
  – Digit → 0 | 1 | 2 | ... | 9
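
As a rough illustration, the grammar above is equivalent to "one or more digits" and can be written as a regular expression. The sketch below builds two templates on top of it, modelled on the date and telephone-number examples from the Additional Components slide. The names `NUMBER`, `TEMPLATES` and `classify_numeric`, and the exact template patterns, are assumptions for this example only.

```python
import re

# Number -> Digit Number | Digit  is equivalent to "one or more digits".
NUMBER = r"[0-9]+"

# Hypothetical templates built on NUMBER; more specific patterns come first.
TEMPLATES = {
    "date": re.compile(rf"{NUMBER}年\s*{NUMBER}月\s*{NUMBER}日"),
    "phone": re.compile(rf"\+{NUMBER}(?:-{NUMBER})+"),
    "number": re.compile(NUMBER),
}


def classify_numeric(token):
    """Return the name of the first template matching the whole token, else None."""
    for name, pattern in TEMPLATES.items():
        if pattern.fullmatch(token):
            return name
    return None


print(classify_numeric("2005年 6月 20日"))   # date
print(classify_numeric("+82-42-869-5565"))  # phone
print(classify_numeric("2005"))             # number
```

Because the set of matching strings is infinite but the pattern is finite, such expressions never need to be listed in the dictionary.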

Methodologies

• Tabular Parsing
  – Segmentation by dictionary look-up

• Syllable information-based model
  – Segmentation by syllable information
    • Multi-phase filtering
    • Brute-force
    • Etc.
  – Head-tail: simple model for segmentation

POS Tagging: Statistical approaches

• Statistical Models
  – HMM-based
    • HMM on morpheme sequences (bi-gram, tri-gram)
    • HMM using word phrase structure (HMM using intra-/inter-word-phrase information)
  – Weighted Network
  – Maximum Entropy Model
    • More information can be integrated into the model

• Pros
  – Easy to train; guarantees reasonable performance

• Cons
  – Difficult to tune or modify
  – Requires more space

POS Tagging: other approaches

• Rule-based approach
  – Transformation Rules (Eric Brill’s style)
  – Pros and cons
    • Difficult to acquire rules and keep them consistent
    • Easy to lexicalize

• Hybrid Approach
  – Statistical approach + rule-based approach
  – Pros and cons
    • Guarantees good performance
    • Difficult to integrate

Morphological Analysis of English

• Affix Analysis: un+happy, be+er (X)

• Additional Processing
  – Abbreviation
    • Ex) I’d → I would
  – Symbols, Numerals and Units
  – Processing Idiomatic Expressions
  – Unknown Word Processing

Part of speech Tagging (English)

• Stochastic Part-of-Speech Tagging
  – Hidden Markov Model and Viterbi Algorithm
    • Using Markov assumption – efficient
  – A Categorization Problem: Machine Learning
    • Decision Tree, Neural Network, Markov Random Field, ...

[Lattice diagram for "Flies like a flower." Candidate tags per word:]
  – flies: flies/N, flies/V
  – like: like/V, like/N, like/P
  – a: a/ART, a/N
  – flower: flower/V, flower/N
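
To make the lattice concrete, here is a minimal Viterbi sketch over the same candidate tags. It assumes a bigram HMM; every probability in `EMIT` and `TRANS` is an invented toy value (the slides give none), and the function name `viterbi` and the start symbol `<s>` are likewise assumptions for illustration.

```python
# Toy emission probabilities P(word | tag); the values are invented.
EMIT = {
    "flies": {"N": 0.4, "V": 0.6},
    "like": {"V": 0.5, "N": 0.1, "P": 0.4},
    "a": {"ART": 0.9, "N": 0.1},
    "flower": {"N": 0.8, "V": 0.2},
}
# Toy transition probabilities P(tag | previous tag); missing pairs count as 0.
TRANS = {
    ("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
    ("N", "V"): 0.5, ("N", "N"): 0.2, ("N", "P"): 0.3,
    ("V", "P"): 0.3, ("V", "ART"): 0.4, ("V", "N"): 0.2, ("V", "V"): 0.1,
    ("P", "ART"): 0.6, ("P", "N"): 0.4,
    ("ART", "N"): 0.9, ("ART", "V"): 0.1,
}


def viterbi(words, emit, trans, start="<s>"):
    """Return the most probable tag sequence under a bigram HMM."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for word in words:
        step = {}
        for tag, p_emit in emit[word].items():
            # Markov assumption: only the previous tag matters, so it is
            # enough to keep one best path per tag at each position.
            prob, path = max(
                ((p * trans.get((prev, tag), 0.0), pth) for prev, (p, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob * p_emit, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi("flies like a flower".split(), EMIT, TRANS))
```

With these toy numbers the best path is N V ART N. The work per word is proportional to the square of the number of candidate tags, rather than growing exponentially with sentence length.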

Part of speech Tagging (English)

• Rule-based Part-of-Speech Tagging
  – Transformation-Based [Brill95] [Brill92]
    • Using contextual information
    • if Noun X Noun, then X may be Verb

• Hybrid Model
  – Integrating a stochastic tagger and a rule-based system
  – stochastic: tri-gram
    • P(Noun Verb Noun) = 0.2
    • P(Noun Noun Noun) = 0.01

• Integrated Model
  – Maximum Entropy Model
  – “Classifier Combination for Improved Lexical Disambiguation” [Brill 99]
    • Various models have complementary behaviors.

[Diagram of transformation-based learning. Box labels: Unannotated Text, Initial State, Annotated Text, Truth, Learner, Rules]

Using word phrase Structure in Tagging

• “An HMM POS tagger for Korean based on Wordphrase” (J. Shin 1994): simple model

• “Two-ply HMM” (J. Kim 1997)
  – HMM variation using POS tags of the head (H) and tail (T) of each word phrase
  – Each word phrase has a conventional tagging HMM over morpheme / POS tags.

[Diagram: a sequence of word phrases delimited by $, each with a head H and a tail T. Recoverable node labels: N (Noun), NS (Noun + subjective postposition p), NO (Noun + objective postposition p), NM (Noun modifier), A (Adverb), PC (Verb V + connecting ending e), PF (Verb V + final ending e)]

HMM

• Hidden Markov Model
  – What is the sequence of nodes?
    • transition probability
    • usually 1-, 2- or 3-gram
  – What is the tag (or label) of the node? ("hidden")

Applications and Word Segmentation

• Machine Translation

• Spell Checking and Correction
  – Spell correction
  – Auto-spacing and spacing correction

• Information Retrieval
  – Extracting index terms: nouns and noun phrases
  – Question Answering

• Natural Language Interface

• Text-To-Speech

• Concordance

Application: TTS and IR - A Japanese Example -

• Text-to-speech synthesis
  – Word segmentation of orthographic text
    • 試験|の|最中|に|映画|へ|行った
    • (I went to a movie during the examination)
  – Homograph disambiguation (grapheme to phoneme)
    • 最中 → saichu (during), monaka (Japanese sweets)
    • 行った → itta (went), okonatta (did)

• Information retrieval (indexing)
  – Word segmentation of orthographic text
    • 試験|の|最中|に|映画|へ|行った
    • (I went to a movie during the examination)
  – Part-of-speech tagging, keyword extraction, stemming
    • 試験 (examination), 映画 (movie)

from the slide (Nagata, ACL2000)

Evaluation of Morphological Analysis (Word Segmentation)

• Correctness
  – Recall
  – Precision
    • In units of morphemes and word phrases

• Processing Speed

• Robustness
  – Processing erroneous input
    • Spacing
    • Spelling errors

• Effectiveness of Result
  – Evaluation of tag set

Recall and Precision

• Recall = the fraction of the correct (reference) units that the system found

• Precision = the fraction of the system-generated units that are correct
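
One common way to compute these two figures for word segmentation is to compare character spans, so that a word counts as correct only if both of its boundaries agree with the reference. The sketch below assumes that convention; the function names and the deliberately wrong system output are made up for illustration, and the reference segmentation is the Japanese example from the TTS/IR slide.

```python
def spans(words):
    """Map a segmentation to the set of (start, end) character spans of its words."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def precision_recall(system, gold):
    """Precision and recall of a system segmentation against a reference one."""
    sys_spans, gold_spans = spans(system), spans(gold)
    correct = len(sys_spans & gold_spans)
    return correct / len(sys_spans), correct / len(gold_spans)


gold = ["試験", "の", "最中", "に", "映画", "へ", "行った"]
system = ["試験", "の", "最", "中", "に", "映画", "へ", "行った"]  # hypothetical error: 最中 split
precision, recall = precision_recall(system, gold)
print(precision, recall)  # 6 of 8 system words and 6 of 7 reference words are correct
```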

Question

• A common tagset for word segmentation for CJK

Resources and Tools

• Corpora and Tagsets for morphological analysis
  – Corpus
    • Contest for standardization of POS tagset.

• Visualization Tool
  – To provide a visible process
  – Understandable to users and easy to debug for developers

Korean Specification for Word Segmentation

Key-Sun CHOI, Dosam HWANG

Morphology: how to describe the Korean language

• Surface and spacing unit: (Korean)
  – na+neun norayha+keyssda
  – 나는 노래하겠다.

• Grammar:
  – I+subjective sing+will (future & intention)
  – I+subjective postposition sing+future&intention ending

• Meaning: (English)
  – I will sing

Grammatical morphemes (functional words)

• Postposition
  – to express a case feature (or a functional role)
    • subjective case, objective case, etc.
    • e.g., I sing a song. (I = subjective, a song = objective)
  – to represent semantic roles of constituents
  – typically Noun + postposition

• Ending
  – to represent features like tense, aspect, mood and voice
  – to derive relative clauses
  – typically Verb or Adjective + endings for its auxiliary verbs or sentence connectives
  – E.g.: I “a song” sing+will (I will sing a song.)
    • 나는 노래를 부른다
    • 私は歌を歌う。
    • 我。。。

Code System for Implementation (Korean)

• Standard (KSC5601, KSC5700, KSX001)
  – Syllable-based
  – 2-byte code
  – including common Chinese characters

• Combinative Code (one Korean character = CV(C))
  – Grapheme-based
  – 2-byte code: 1 bit MSB + 5 bits initial consonant + 5 bits vowel + 5 bits final consonant
  – Other internal code systems for MA
    • N-byte code
    • 3-byte code
    • 2-3 variable byte
    • Symbol code (Romanization style)

[Diagram of language units: word phrase, morpheme, syllable, grapheme]
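
The 2-byte combinative layout (1 + 5 + 5 + 5 bits) can be illustrated with plain bit operations. This is only a sketch of the layout: the grapheme indices passed in are hypothetical small integers, not the values of any actual Korean code table, and setting the MSB to 1 is an assumption of this example.

```python
def pack_syllable(initial, vowel, final):
    """Pack three 5-bit grapheme indices into a 16-bit code with the MSB set."""
    for index in (initial, vowel, final):
        if not 0 <= index < 32:
            raise ValueError("each grapheme index must fit in 5 bits")
    # Layout: [MSB][5 bits initial consonant][5 bits vowel][5 bits final consonant]
    return (1 << 15) | (initial << 10) | (vowel << 5) | final


def unpack_syllable(code):
    """Recover the (initial, vowel, final) indices from a 16-bit code."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


code = pack_syllable(2, 3, 1)   # hypothetical indices
print(hex(code))                # 0x8861
print(unpack_syllable(code))    # (2, 3, 1)
```

A grapheme-based code of this kind lets a morphological analyzer inspect or change a single consonant or vowel without a syllable lookup table, which is useful when phonological rules add or drop a final consonant.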

Example of Grammatical Morpheme 1/2

• Grammatical morphemes in a predicative word phrase (word phrase = a segmentation unit)
  – "(someone) went but"
  – Surface form: 가셨지만 (ga syeoss ji man)
  – Analysis (predicate + grammatical morphemes = predicative word phrase):

    Morpheme   Romanization   Function                  Category
    가         ga             "go"                      Verb (predicate)
    시         si             honorific                 AUX
    었         eoss           past tense                Particle
    지         ji             declarative mood          Ending
    만         man            concessive / limitation   Postposition

Example of Grammatical Morpheme 2/2

• Grammatical morphemes in a substantive word phrase
  – "(It should be prohibited) from home, even (others cannot do)"
  – Surface form: 집에서부터만은 (jib eseo buteo man eun)
  – Analysis (noun + grammatical morphemes = substantive word phrase):

    Morpheme   Romanization   Function
    집         jib            "house" (noun)
    에서       eseo           place
    부터       buteo          origin
    만         man            concessive
    은         eun            topical marker

  – The grammatical morphemes are adverbial postpositions and postpositions attached to the noun.

Language units

• Grapheme
  – Alphabet, a minimal unit
    • consonant: (e.g.,) b, c, d, f, ...
    • vowel: (e.g.,) a, e, i, o, u, ...

• Syllable
  – 2 or 3 graphemes form a syllable
  – e.g., ga, sun, wa, ... (가, 선, 와)

• Word phrase (or word)
  – Segmentation unit (or spacing unit)
  – It consists of one or more morphemes

[Diagram of language units: word phrase, morpheme, syllable, grapheme]

Types of Ambiguities- Homonymy -

• English examples:
  – lead (metal) / lead (verb; past tense is "led")
  – swallow (bird) / swallow (through mouth)
  – Bill (name) / bill (for payment)

• What is the part-of-speech of the word?

Types of Ambiguities in Word Phrase 1/2

• Types of ambiguity in word phrase analysis
  – Ambiguity in segmentation + POS tag
    • 가면 (gamyeon)
      – 가면/N: "mask"
      – 가/V + 면/e: "go" + assumption ending ("if")
  – Ambiguity in POS tag
    • 감을 (gameul)
      – 감/N + 을/p: "persimmon" or "go (noun form)" + objective
      – 감/V + 을/e
    • goto (English simulation)
      – go (N or V)
  – Ambiguity in lemmatization
    • 도와 (dowa)
      – 도/N + 와/p: "way" + "and"
      – 돕/V + 아/e (dob + a): "help"

Tag legend: N = Noun, V = Verb, p = postposition, e = Ending


• Mixed type
  – 가시는 (ga-si-neun)
    • ga/V + si/f + neun/e: "go" + honorific + ending
    • gal/V + si/f + neun/e
    • gasi/V + neun/e: "disappear" + ending
    • gasi/N + neun/p: "thorn" + subjective

Tag legend: N = Noun, V = Verb, e = Ending, f = Prefinal ending, p = Postposition



Metrics of Ambiguity

• Considering ambiguities
  – Average ambiguities per word phrase: 3.5
  – For 1-syllable morphemes: 2.8 ambiguities on average
  – For 2-syllable morphemes: 1.13 ambiguities on average
  * Randomly selected 15 morphemes from the 10,000 most frequent morphemes

• Considering phonetic transformation rules
  – About 20 rules are needed.
    • More than 50 sub-rules (considering contextual information)

• Complexity of n-syllable word phrase segmentation
  – At least 2^(n-1) candidate segmentations
    • 동서남북 [東西南北 / dong-seo-nam-buk]: east-west-south-north
    • e.g., n = 4 (Korean example) gives 2^3 = 8 segmentations:

      동서남북
      동 | 서남북
      동서 | 남북
      동서남 | 북
      동 | 서 | 남북
      동 | 서남 | 북
      동서 | 남 | 북
      동 | 서 | 남 | 북
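
The count follows from the fact that each of the n-1 gaps between syllables is independently either a boundary or not. A short sketch that enumerates the candidates (the function name `all_segmentations` is an assumption for this example, and the input is assumed to be non-empty):

```python
from itertools import product


def all_segmentations(syllables):
    """Yield every way to cut a non-empty syllable string into contiguous pieces."""
    n = len(syllables)
    # Each of the n-1 gaps between syllables is either a boundary or not.
    for cuts in product([False, True], repeat=n - 1):
        pieces, start = [], 0
        for gap, is_cut in enumerate(cuts, start=1):
            if is_cut:
                pieces.append(syllables[start:gap])
                start = gap
        pieces.append(syllables[start:])
        yield pieces


candidates = list(all_segmentations("동서남북"))
print(len(candidates))  # 8, i.e. 2 ** (4 - 1)
for pieces in candidates:
    print(" | ".join(pieces))
```

This is only the lower bound from segmentation alone. POS ambiguity and phonological changes multiply the candidates further, which is why a dictionary and connection rules are used to prune them.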

Typical Ambiguities of Word Phrase Analysis

• Categorial Disambiguation
  – e.g., gam+eun (Korean example)
    • Noun + Particle
    • Verb stem + Ending

• Segmentation Disambiguation
  – e.g., gamgineun
    • Noun + Particle = gamgi+neun
    • Verb stem + Ending + Particle = gam+gi+neun

• Stem Identification
  – e.g., naneun
    • Noun + Particle = na+neun
    • Verb stem + Ending = nal+neun (with verbal ending change)
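
The first two ambiguity types can be reproduced with a toy candidate generator that combines dictionary look-up with a connection table (POS bigram). Everything in this sketch is an assumption made for illustration: the romanized lexicon entries, the tag set (N, V, p, e), the boundary symbols `^` and `$`, and the allowed bigrams. Stem identification (nal+neun) would additionally need phonological rules, which are not modelled here.

```python
# Toy lexicon (romanized): surface form -> possible POS tags. Hypothetical entries.
LEXICON = {
    "gamgi": {"N"},
    "gam": {"N", "V"},
    "gi": {"e"},
    "neun": {"p", "e"},
    "eun": {"p", "e"},
}
# Connection table: POS bigrams allowed inside one word phrase (assumed).
# "^" marks the start of the word phrase and "$" its end.
CONNECT = {
    ("^", "N"), ("^", "V"),
    ("N", "p"), ("V", "e"), ("e", "p"),
    ("N", "$"), ("p", "$"), ("e", "$"),
}


def analyses(phrase, prev_tag="^"):
    """Return all morpheme/tag sequences licensed by LEXICON and CONNECT."""
    if not phrase:
        # The analysis is complete only if the last tag may end a word phrase.
        return [[]] if (prev_tag, "$") in CONNECT else []
    results = []
    for end in range(1, len(phrase) + 1):
        morph = phrase[:end]
        for tag in sorted(LEXICON.get(morph, ())):
            if (prev_tag, tag) in CONNECT:
                for rest in analyses(phrase[end:], tag):
                    results.append([(morph, tag)] + rest)
    return results


for word_phrase in ("gameun", "gamgineun"):
    for candidate in analyses(word_phrase):
        print(word_phrase, "->", " + ".join(f"{m}/{t}" for m, t in candidate))
```

For "gameun" the sketch yields gam/N + eun/p and gam/V + eun/e (categorial ambiguity), and for "gamgineun" it yields gam/V + gi/e + neun/p and gamgi/N + neun/p (segmentation ambiguity). Choosing among such candidates is the job of the tagging phase.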
