project leaders: sun maosong isahara hitoshi choi...
Post on 25-Dec-2019
-
Word segmentation: Part 2
Chinese, Japanese, Korean
Project Leaders: Sun Maosong, Isahara Hitoshi, Choi Key-Sun
-
Presentation
• General Introduction to CJK Word Segmentation (Choi Key-Sun)
• Chinese GB Standard on Word Segmentation (Song Min)
• Japanese Specification (Isahara Hitoshi)
• Korean Specification (Choi Key-Sun, Hwang Dosam)
• Chinese Specification (Song Min)
-
Preparatory Works
• One CJK Tri-lingual Parallel Corpus
  – Mainichi Shinbun in Japanese source
  – Translated into Chinese and Korean
  – Word segmented and POS-tagged
• Word list
  – 1,000-word lists of CJK
    • Independently collected
    • Possibly sorted by frequency in each language’s corpus
-
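Tri-Parallel Corpus of CJK
• # S-ID:950101004-002 KNP:96/10/27 MOD:96/11/26
• J# ロシア南部チェチェン共和国の首都グロズヌイに進攻したロシア軍は三十一日、首都中心部を装甲車などで攻撃、大統領官邸など数カ所が炎上した。
• C# 进攻俄罗斯南部车臣共和国首都格罗兹尼的俄罗斯军队,31日出动装甲车等进攻首都中心地带,总统官邸等数处着火。
• K# 러시아남부체첸공화국수도그로즈니를공격하는러시아군은 31일장갑차등을동원해수도중심지대를진격하여대통령관저등많은곳에서불이났다.
• (English gloss: Russian forces that advanced on Grozny, capital of the Chechen Republic in southern Russia, attacked the city center on the 31st with armored vehicles and other equipment; several sites, including the presidential residence, caught fire.)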
-
Language and Morphology
• English
  – Lemmatizer + POS tagger
  – A word has at most several word forms
• Chinese
  – Segmentation System + POS tagger
  – A word has one word form
• Korean (Japanese)
  – A morphological analyzer is more essential than in other languages
    • Complex Segmentation
    • Frequent Agglutination
    • Word Formation
  – The Chinese characters in common use
-
Computational Morphology
• Morpheme
  – A minimal meaningful element
  – "computational" = "comput+ation+al"
• Morphological Analysis
  – Segmentation
    • To divide into morphemes or words
    • To include lemmatization (= to find its stem)
  – Categorization
    • To assign Part-of-speech category (POS tags)
    • To assign Semantic features
-
Morphological Analysis: e.g.,
• Original text
  – These probabilities are learned from the raw corpus in an unsupervised manner.
• Lemmatized text
  – These probability+Plural be+Plural+Present learn+PP from the raw corpus in a+Vowel un_supervise+PP manner .
• Part-of-speech categorization (POS tagging/Annotation)
  – These/pron probability/noun+plural are/be_verb+plural+present learn/verb_ed/pp from/prep the/article raw/adjective corpus/noun in/prep an/article un/prefix_supervis/verb_ed/pp manner/noun ./period
-
Word Formation
• Types of Languages
  – Inflectional
    • e.g., Latin
    • canta-bo
  – Analytic - derivation
    • e.g., English
    • I will sing
  – Agglutinative - concatenation
    • e.g., Korean, Japanese, Hindi, Turkish
    • na-neun norayha-keyssda
    • 나는 노래하-겠다
-
Complexity Comparison
• Complexity?
  – In English, we only have to consider each word form.
  – In Chinese, the task is segmentation rather than morphological analysis.
  – In Korean, MA must process segmentation and agglutination simultaneously, so segmenting and analyzing functional words is much more complex. (Japanese is similar to Korean.)

Language | Spacing     | Order of verb forms per verb | Complexity of segmentation
English  | Word form   | 5                            | Easy
Korean   | Word phrase | More than 5000               | Very hard
Chinese  | No          | 1                            | Very hard
-
Table of Contents
• Introduction: Morphology and Characteristics
• English, Korean, Chinese, Japanese
  – Complexity of Morphology
  – Types of Ambiguities
• General Representation
  – Components of Morphological Analysis
  – Implementation Scheme
  – Unknown Words
-
General Morphological Analyzer: 2 phases
• Candidate generation
  – Generates possible sequences of morphemes
    • Segmentation
      – Lemmatization - Recovery of phonetic changes
    • Processing for unknown words
• Candidate selection
  – “POS Tagging”
  – Methodologies
    • Statistical methods: e.g., if we see two or three consecutive words and their POS tags, we can predict what the current word's tag is. (Noun-Verb-Noun)
    • Rule-based methods
    • Hybrid methods
-
General Scheme of Implementation
[Diagram: components of the Morphological Analyzer and POS Tagger]
• POS Tagger
  – Statistics
  – Linguistic Rules
• Additional Components
  – Unknown words
  – Symbols/Numbers
  – Exceptional word phrase
  – Post-processing
  – Foreign words
• Basic Components
  – Code System
  – Dictionary
  – Phonological Rules
  – Connection Rules
  – Algorithms and Data Structure
  – Parser
-
Implementation (2/3)
• Morphological connection rules
  – idioms
  – patterns
  – weighted rules
• Dictionaries (Lexicons)
  – general lexicon
    • level of segmentation depends on the lexicon
  – domain-specific dictionary
    • e.g., economics, law, patent (science & technology), ...
  – user-defined dictionary
-
Implementation (3/3)
Diagram of word segmentation system for Chinese
[Diagram, from (Keh-Jiann Chen, 2000). Processing boxes: Generate Substrings; Word Matching; Find word candidates; Unknown Words Identification; Disambiguation; POS-Tagging. Resource boxes: Morphological Rules, Idioms, Patterns; Unknown Word Models; Heuristic Rules; Access Lexicon; Choose Lexicon or Lexicons (level of segmentation); General Lexicon; Domain Specific / User Defined Lexicon.]
-
Corpora
• A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language
  – David Crystal, A Dictionary of Linguistics and Phonetics
• A collection of naturally occurring language text, chosen to characterize a state or variety of a language.
  – John Sinclair, Corpus, Concordance, Collocation
• Text Corpora
  – Written Language
  – Spoken Language
• Corpora
  – English: Brown, LOB, BNC and Penn TreeBank
  – Korean: KAIST Corpus and Sejong Corpus
  – Japanese: Nihon Keizai Shimbun, EDR Corpus
  – Chinese: Beijing University, LIVAC (HK City U.), Sinica corpus
• Statistics can be obtained from a large corpus.
-
Basic Components of Morphological Analyzer (1/2)
• Dictionary
  – Lemma, Part-of-speech, Semantic Feature (optional)
  – Structure
• Connection Rules
  – Connection Table (POS bigram)
  – Word Formation Graph (n-gram)
• Phonological Rules
  – Declarative Rules
    • 2-level model
  – Procedural Rules
-
Basic Components of Morphological Analyzer (2/2)
• Algorithms and Data Structures
  – Top-Down and Bottom-Up
    • Chart Parsing, Recursive Parsing
  – Data Structure
    • Chart, Table, Tree, Lattice, Graphs
• Disambiguation (Tagging phase)
  – Dictionary Searching
    • All words in dictionary + rules for unknown words
    • Stem dictionary + rule = Derived dictionary
  – Stochastic Model
  – Rule-based Model
  – Hybrid Model, Integrated Model
  – Simple Heuristics
    • preferring the longest morpheme
-
Word Segmentation in Chinese - A Heuristic -
• Maximal matching rule
  – The most plausible segmentation is the three-word sequence with the maximal length.
  – This heuristic rule achieves as high as 99.69% accuracy and a high applicability of 93.21%, i.e., 93.21% of the ambiguities were resolved by this rule. [Chen & Liu 1992]
  – Candidate three-word sequences for one input:
    • 完 成 鑑定
    • 完成 鑑定 報
    • 完成 鑑定 報告 (finish judgment report)
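The maximal matching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of [Chen & Liu 1992]: the tiny lexicon, the `max_len` limit, and the function names `three_word_chunks` and `segment` are all hypothetical, and the tie-breaking rules of the original paper are omitted.

```python
def three_word_chunks(text, start, lexicon, max_len=4):
    """Enumerate every chunk of up to three lexicon words beginning at `start`."""
    def words_from(pos):
        # All lexicon words that begin at position `pos`.
        for end in range(pos + 1, min(len(text), pos + max_len) + 1):
            if text[pos:end] in lexicon:
                yield text[pos:end]

    chunks = []
    for w1 in words_from(start):
        p1 = start + len(w1)
        seconds = list(words_from(p1))
        if not seconds:
            chunks.append((w1,))
        for w2 in seconds:
            p2 = p1 + len(w2)
            thirds = list(words_from(p2))
            if not thirds:
                chunks.append((w1, w2))
            for w3 in thirds:
                chunks.append((w1, w2, w3))
    return chunks


def segment(text, lexicon):
    """Greedy segmentation: keep the first word of the longest three-word chunk."""
    result, pos = [], 0
    while pos < len(text):
        chunks = three_word_chunks(text, pos, lexicon)
        if chunks:
            best = max(chunks, key=lambda c: sum(len(w) for w in c))
            word = best[0]
        else:
            word = text[pos]  # no lexicon match: emit a single character
        result.append(word)
        pos += len(word)
    return result


# Hypothetical toy lexicon covering the slide's example.
LEXICON = {"完", "成", "完成", "鑑定", "報", "報告"}
print(segment("完成鑑定報告", LEXICON))
```

For the slide's input, the three candidate chunks at the first position are exactly the three sequences listed above, and the longest one (完成 鑑定 報告) wins.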
-
Additional Components of Analyzer
• Unknown Word Analysis
  – Suffix, endings and preposition information
• Symbol / Number expression
  – Templates
    • Ex) “2005年 6月 20日” (2005/6/20), “3千 4百圓” ($3,400), “+82-42-869-5565” (telephone number)
• Foreign Languages
  – TV, [컴퓨터 kom-pyu-teo] (computer)
• Exceptional Word Phrase
  – Word Phrase Dictionary
• Post-Processing
  – Spacing Problem
  – Processing Non-Standard language
-
Characteristics of different types of unknown words
• Possibly infinite number of elements but with closed form representations, such as numeric type compounds:
  – 2005年 = 2005-year
• Open-ended types without closed form representations, such as
  – proper names: "Microsoft"
  – derived words: "computer-ize"
  – compounds: "computer desk"
  – abbreviation (acronym): "LG" (Lucky-Goldstar)
from ACL [Chen, 2000]
-
Proper Names with Transliteration
• Personal
  – Bill Clinton (bil klintən, 克林頓)
  – Jiang Zemin (jiang z'əmin, 江澤民)
• Geographical
  – Korea (koria, 韓國)
  – Cambodia (kambodia, カンボジア)
• Brands/Organization
  – Panasonic (panasonik, 파나소닉, 樂聲)
  – Samsung (samsəŋ, 삼성, 三星, サムソン)
-
Numeric Expression Representation - Regular Expression -
• To represent the type of unknown words with possibly infinite number of elements but with a closed form representation,
  – such as numbers, dates, times, determinant-measure compounds, etc.
• e.g.,
  – Number → Digit Number | Digit
  – Digit → 0 | 1 | 2 | ... | 9
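The grammar above is right-recursive and describes the same strings as the regular expression `[0-9]+`. The Python sketch below shows both views, plus two hypothetical templates for the date and telephone examples quoted earlier in the deck; the pattern names and the exact template shapes are assumptions made for illustration.

```python
import re

DIGITS = "0123456789"


def is_number(s):
    """Direct reading of: Number -> Digit Number | Digit."""
    if not s or s[0] not in DIGITS:
        return False
    return len(s) == 1 or is_number(s[1:])


# The same language as a regular expression.
NUMBER = r"[0-9]+"

# Hypothetical templates built on top of NUMBER.
DATE = re.compile(rf"{NUMBER}年\s*{NUMBER}月\s*{NUMBER}日")
PHONE = re.compile(r"\+[0-9]+(?:-[0-9]+)+")


def find_numeric_tokens(text):
    """Return (label, matched string) pairs for every template match."""
    tokens = []
    for label, pattern in (("date", DATE), ("phone", PHONE)):
        for match in pattern.finditer(text):
            tokens.append((label, match.group(0)))
    return tokens


print(find_numeric_tokens("2005年 6月 20日 +82-42-869-5565"))
```

Because such expressions have a closed form, they can be recognized by templates like these instead of being listed in the dictionary.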
-
Methodologies
• Tabular Parsing
  – Segmentation by looking up the dictionary
• Syllable information-based model
  – Segmentation by syllable information
• Multi-phase filtering
• Brute-force
• Etc.
  – Head-tail: simple model for segmentation
-
POS Tagging: Statistical approaches
• Statistical Models
  – HMM-based
    • HMM on morpheme sequences (bi-gram, tri-gram)
    • HMM using word phrase structure (HMM using intra-/inter-word-phrase information)
  – Weighted Network
  – Maximum Entropy Model
    • More information can be integrated into the model
• Pros
  – Easy to train; gives reasonable performance
• Cons
  – Difficult to tune or modify
  – Requires more space
-
POS Tagging: other approaches
• Rule-based approach
  – Transformation Rules (Eric Brill’s Style)
  – Pros and cons
    • Difficult to get rules and maintain consistency of rules
    • Easy to lexicalize
• Hybrid Approach
  – Statistical approach + Rule Based Approach
  – Pros and cons
    • Guarantees a good performance
    • Difficult to integrate
-
Morphological Analysis of English
• Affix Analysis: un+happy, be+er (X, an invalid analysis)
• Additional Processing
  – Abbreviation
    • Ex) I’d → I would
  – Symbols, Numerals and Units
  – Processing Idiomatic Expression
  – Unknown Word Processing
-
Part of speech Tagging (English)
• Stochastic Part-of-Speech Tagging
  – Hidden Markov Model and Viterbi Algorithm
    • Using Markov assumption – efficient
  – A Categorization Problem: Machine Learning
    • Decision Tree, Neural Network, Markov Random Field, ...
[Figure: tag lattice for "Flies like a flower." Candidate tags: flies/N, flies/V; like/V, like/N, like/P; a/ART, a/N; flower/V, flower/N]
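A bigram HMM tagger with Viterbi decoding over the "Flies like a flower" lattice can be sketched as follows. All probabilities in `START`, `TRANS` and `EMIT` are made-up illustrative numbers, not values from the slides or from any corpus, and the tag inventory is reduced to the four tags that appear in the lattice.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a bigram HMM."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[i][t]: log-probability of the best path that ends in tag t at word i.
    scores = [{t: logp(start_p.get(t, 0)) + logp(emit_p[t].get(words[0], 0))
               for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        scores.append({})
        back.append({})
        for t in tags:
            # Markov assumption: only the previous tag matters.
            prev = max(tags, key=lambda p: scores[i - 1][p] + logp(trans_p[p].get(t, 0)))
            scores[i][t] = (scores[i - 1][prev]
                            + logp(trans_p[prev].get(t, 0))
                            + logp(emit_p[t].get(words[i], 0)))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model: every number below is an assumption for illustration only.
TAGS = ["N", "V", "P", "ART"]
START = {"N": 0.6, "V": 0.2, "P": 0.1, "ART": 0.1}
TRANS = {
    "N": {"V": 0.5, "N": 0.2, "P": 0.2, "ART": 0.1},
    "V": {"ART": 0.4, "N": 0.3, "P": 0.2, "V": 0.1},
    "P": {"ART": 0.5, "N": 0.4, "V": 0.05, "P": 0.05},
    "ART": {"N": 0.9, "V": 0.05, "P": 0.025, "ART": 0.025},
}
EMIT = {
    "N": {"flies": 0.4, "like": 0.05, "a": 0.05, "flower": 0.5},
    "V": {"flies": 0.3, "like": 0.6, "flower": 0.1},
    "P": {"like": 1.0},
    "ART": {"a": 1.0},
}

print(viterbi("flies like a flower".split(), TAGS, START, TRANS, EMIT))
```

With these toy numbers the decoder selects N V ART N. The point of the Markov assumption is that the search stays linear in sentence length instead of enumerating every path through the lattice.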
-
Part of speech Tagging (English)
• Rule based Part-of-Speech Tagging
  – Transformation-Based [Brill95] [Brill92]
    • Using contextual information
    • if Noun X Noun, then X may be Verb
• Hybrid Model
  – Integrating a stochastic tagger and rule-based system
  – stochastic: tri-gram
    • P(Noun Verb Noun) = 0.2
    • P(Noun Noun Noun) = 0.01
• Integrated Model
  – Maximum Entropy Model
  – “Classifier Combination for Improved Lexical Disambiguation” [Brill 99]
    • Various models have complementary behaviors.
[Figure: transformation-based learning. Boxes: Unannotated Text, Initial State, Annotated Text, Truth, Learner, Rules]
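The contextual rule "if Noun X Noun, then X may be Verb" can be written as a single transformation applied to an initial tagging. The sketch below is a simplified illustration of the transformation-based idea, not Brill's tagger: the function name `apply_rule` and its parameters are hypothetical, and rule learning (the Learner box in the figure) is left out.

```python
def apply_rule(tags, prev_tag, next_tag, from_tag, to_tag):
    """Rewrite from_tag as to_tag wherever it sits between prev_tag and next_tag.

    Contexts are read from the input tagging, so one rewrite does not
    influence another within the same pass.
    """
    out = list(tags)
    for i in range(1, len(tags) - 1):
        if (tags[i] == from_tag
                and tags[i - 1] == prev_tag
                and tags[i + 1] == next_tag):
            out[i] = to_tag
    return out


# Initial state: every word tagged Noun (an assumed naive initial tagger).
initial = ["Noun", "Noun", "Noun"]
print(apply_rule(initial, "Noun", "Noun", "Noun", "Verb"))
```

In a full system, many such rules are learned by comparing the current tagging with the truth and are then applied in a fixed order.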
-
Using word phrase Structure in Tagging
• “An HMM POS tagger for Korean based on Wordphrase” (J. Shin 1994): simple model
• “Two-ply HMM” (J. Kim 1997)
  – HMM variation using POS tags of head and tail of word phrase
[Figure: a chain of word phrases, each with a head (H) and a tail (T), separated by boundary markers ($). Each word phrase has a conventional tagging HMM over morpheme / POS tags.]
[Figure legend: word phrase types NS (Noun N + p, subjective), NO (Noun N + p, objective), NM (Noun N + p, noun modifier), A (Adverb a), PC (Verb V + e, connecting), PF (Verb V + e, final)]
-
HMM
• Hidden Markov Model
  – What is the sequence of nodes?
    • transition probability
    • usually 1-, 2- or 3-gram
  – What is the tag (or label) of the node? ("hidden")
-
Applications and Word Segmentation
• Machine Translation
• Spell checking and correction
  – Spell correction
  – Auto-spacing and spacing correction
• Information Retrieval
  – Extracting index terms: nouns and noun phrases
  – Question Answering
• Natural Language Interface
• Text-To-Speech
• Concordance
-
Application: TTS and IR - A Japanese Example -
• Text to speech synthesis
  – Word segmentation of orthographic text
    • 試験|の|最中|に|映画|へ|行った
    • (I went to a movie during the examination)
  – Homograph disambiguation (grapheme to phoneme)
    • 最中 → saichu (during), monaka (Japanese sweets)
    • 行った → itta (went), okonatta (did)
• Information retrieval (indexing)
  – Word segmentation of orthographic text
    • 試験|の|最中|に|映画|へ|行った
    • (I went to a movie during the examination)
  – Part of speech tagging, keyword extraction, stemming
    • 試験 (examination), 映画 (movie)
from the slide (Nagata, ACL2000)
-
Evaluation of Morphological Analysis (Word Segmentation)
• Correctness
  – Recall
  – Precision
    • In units of morphemes and word phrases
• Processing Speed
• Robustness
  – Processing erroneous input
    • Spacing
    • Spelling errors
• Effectiveness of Result
  – Evaluation of Tag set
-
Recall and Precision
• Recall = the fraction of the correct (reference) items that the system found
• Precision = the fraction of the system-generated items that are correct
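For word segmentation, both measures are commonly computed over word spans: a system word counts as correct only if a reference word covers exactly the same character range. The Python sketch below follows that convention; the helper names `spans` and `precision_recall` are hypothetical, and the example segmentations reuse the Chinese example from earlier in the deck.

```python
def spans(words):
    """Map a segmentation to the set of (start, end) character spans of its words."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(system, reference):
    """Precision and recall of a system segmentation against a reference."""
    sys_spans, ref_spans = spans(system), spans(reference)
    correct = len(sys_spans & ref_spans)
    precision = correct / len(sys_spans) if sys_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    return precision, recall


reference = ["完成", "鑑定", "報告"]
system = ["完", "成", "鑑定", "報告"]  # over-segments the first word
print(precision_recall(system, reference))
```

Here two of the four system words match the reference, so precision is 2/4 and recall is 2/3.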
-
Question
• A common tagset for word segmentation for CJK
-
Resources and Tools
• Corpora and Tagsets for morphological analysis
  – Corpus
    • Contest for standardization of POS tagset.
• Visualization Tool
  – To provide a visible process
  – understandable to user and easy to debug for developer
-
Korean Specification for Word Segmentation
Key-Sun CHOI, Dosam HWANG
-
Morphology: how to describe Korean language
• Surface and spacing unit: (Korean)
  – na+neun norayha+keyssda
  – 나는 노래하겠다.
• Grammar:
  – I+subjective sing+will (future & intention)
  – I+subjective postposition sing+future&intention ending
• Meaning: (English)
  – I will sing
-
Grammatical morphemes (functional words)
• Postposition
  – to express a case feature (or a functional role)
    • subjective case, objective case, etc.
    • e.g., I sing a song. (I = subjective, a song = objective)
  – to represent semantic roles of constituents
  – typically Noun + postposition
• Ending
  – to represent features like tenses, aspects, moods and voices
  – to derive relative clauses
  – typically Verb or Adjective + endings for its Auxiliary Verbs or sentence connectives
  – E.g.: I “a song” sing+will (I will sing a song.)
    • 나는 노래를 부른다
    • 私は 歌を 歌う。
    • 我。。。
-
Code System for Implementation (Korean)
• Standard (KSC5601, KSC5700, KSX001)
  – Syllable-based
  – 2-byte code
  – including common Chinese Characters
• Combinative Code (one Korean character = CV(C))
  – Grapheme-based
  – 2-byte code
  – Other internal code systems for MA
    • N-byte code
    • 3-byte code
    • 2-3 variable byte
    • Symbol code (Romanization style)
[Figure: 2-byte (1 byte + 1 byte) combinative code layout: 1-bit MSB, 5 bits initial consonant, 5 bits vowel, 5 bits final consonant]
[Figure: language units: word phrase, morpheme, syllable, grapheme]
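The 1 + 5 + 5 + 5 bit layout of the combinative code can be illustrated with plain bit operations. The sketch below is an assumption-laden illustration: it decomposes a precomposed Unicode Hangul syllable with the standard Unicode arithmetic and then packs the three resulting indices into 16 bits. The field values are Unicode jamo indices, not the code values defined by the Korean standards, and the names `decompose`, `pack` and `unpack` are hypothetical.

```python
HANGUL_BASE = 0xAC00          # first precomposed Hangul syllable in Unicode ("가")
N_INITIALS, N_VOWELS, N_FINALS = 19, 21, 28


def decompose(syllable):
    """Split a Hangul syllable into (initial, vowel, final) indices."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < N_INITIALS * N_VOWELS * N_FINALS:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(index, N_VOWELS * N_FINALS)
    vowel, final = divmod(rest, N_FINALS)
    return initial, vowel, final


def pack(initial, vowel, final):
    """Pack three 5-bit fields under a set MSB into one 16-bit value."""
    return 0x8000 | (initial << 10) | (vowel << 5) | final


def unpack(code):
    """Recover the three 5-bit fields from a packed 16-bit value."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


fields = decompose("선")   # seon: initial s, vowel eo, final n
code = pack(*fields)
print(fields, hex(code), unpack(code))
```

A grapheme-based code like this lets a morphological analyzer inspect or replace a single consonant or vowel with shifts and masks, which is convenient when phonological rules change only part of a syllable.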
-
Example of Grammatical Morpheme 1/2
• Grammatical morpheme in a predicative word phrase (word phrase = a segmentation unit)
  – "(someone) went but"
  – Surface form: ga syeoss ji man (가 셨 지 만)

Morpheme | ga   | si        | eoss       | ji               | man
Hangul   | 가   | 시        | 었         | 지               | 만
Meaning  | “go” | honorific | past tense | declarative mood | concessive/limitation

  – Category labels on the slide: Verb, AUX, Particle, Ending, Postposition
  – Predicate (가) + grammatical morphemes (시, 었, 지, 만) = predicative word phrase
-
Example of Grammatical Morpheme 2/2
• Grammatical morpheme in a substantive word phrase
  – "(It should be prohibited) from home, even (others cannot do)"
  – Surface form: jib eseo buteo man eun (집 에서 부터 만 은)

Morpheme | jib     | eseo               | buteo              | man          | eun
Hangul   | 집      | 에서               | 부터               | 만           | 은
Meaning  | “house” | place              | origin             | concessive   | topical marker
Category | Noun    | adverbial postpos. | adverbial postpos. | postposition | postposition

  – Noun (집) + grammatical morphemes (에서, 부터, 만, 은) = substantive word phrase
-
Language units
• Grapheme
  – Alphabet, a minimal unit
    • consonant: (e.g.,) b, c, d, f, ...
    • vowel: (e.g.,) a, e, i, o, u, ...
• Syllable
  – 2 or 3 graphemes form a syllable
  – e.g., ga, sun, wa, ... (가, 선, 와)
• Word phrase (or word)
  – Segmentation unit (or spacing unit)
  – it consists of one or more morphemes
[Figure: language units: word phrase, morpheme, syllable, grapheme]
-
Types of Ambiguities- Homonymy -
• English examples:
  – lead (metal) / lead (verb; past tense is "led")
  – swallow (bird) / swallow (through mouth)
  – Bill (name) / bill (for payment)
• What is the part-of-speech of the word?
-
Types of Ambiguities in Word Phrase 1/2
• Types of ambiguity in word phrase analysis
  – Ambiguity in Segmentation + POS tag
    • 가면 (gamyeon)
      – 가면/N (“mask”)
      – 가/V + 면/e (“go” + assumption “if”)
  – Ambiguity in POS tag
    • 감을 (gameul)
      – 감/N + 을/p
      – 감/V + 을/e
      – glosses on the slide: “persimmon”, “go (noun form)”, “objective”
    • goto (English simulation)
      – go (N or V)
  – Ambiguity in lemmatization
    • 도와 (dowa)
      – 도/N + 와/p (“way” + “and”)
      – 돕/V + 아/e (“help with”)
Legend: N = Noun, V = Verb, p = postposition, e = Ending
-
Types of Ambiguities in Word Phrase 2/2
• Mixed type
  – 가시는 (ga-si-neun)
    • ga/V + -si-/f + -neun/e (“go” + honorific + ending)
    • gal/V + -si-/f + -neun/e
    • gasi/V + -neun/e (“disappear” + ending)
    • gasi/N + -neun/p (“thorn” + subjective)
Legend: N = Noun, V = Verb, e = Ending, f = Prefinal ending, p = Postposition
-
Types of Ambiguities in Word Phrase 1/2
• Types of ambiguity in word phrase analysis (romanized)
  – Ambiguity in Segmentation + POS tag
    • gamyeon
      – ga-myeon/N
      – ga/V + -myeon/e
  – Ambiguity in POS tag
    • gameul
      – gam/N + -eul/p
      – gam/V + -eul/e
    • goto (English simulation)
      – go (N or V)
  – Ambiguity in lemmatization
    • dowa
      – do/N + -wa/p
      – dob/V + -a/e
Legend: N = Noun, V = Verb, p = postposition, e = Ending
-
Types of Ambiguities in Word Phrase 2/2
• Mixed type (romanized)
  – ga-si-neun
    • ga/V + -si-/f + -neun/e
    • gal/V + -si-/f + -neun/e
    • gasi/V + -neun/e
    • gasi/N + -neun/p
Legend: N = Noun, V = Verb, e = Ending, f = Prefinal ending, p = Postposition
-
Metrics of Ambiguity
• Considering ambiguities
  – Average ambiguities per word phrase: 3.5
  – For 1-syllable morphemes: 2.8 ambiguities on average
  – For 2-syllable morphemes: 1.13 ambiguities on average
  * Based on 15 morphemes randomly selected from the 10,000 most frequent morphemes
• Considering phonetic transformation rules
  – About 20 rules are needed.
    • More than 50 sub-rules (considering contextual information)
• Complexity of n-syllable word phrase segmentations
  – At least 2^(n-1), since each of the n-1 syllable boundaries may or may not be a morpheme boundary
    • 동서남북 [東西南北/dong-seo-nam-buk]
      – east-west-south-north
    • e.g., n=4 (Korean example) gives 2^3 = 8 segmentations:
      동서남북, 동|서남북, 동서|남북, 동서남|북, 동|서|남북, 동|서남|북, 동서|남|북, 동|서|남|북
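The 2^(n-1) count can be checked by brute-force enumeration: each of the n-1 gaps between syllables is either a boundary or not. The Python sketch below does exactly that; the function name `all_segmentations` is hypothetical, and the count ignores the additional POS ambiguity of each piece, which is why the slide says "at least".

```python
from itertools import product


def all_segmentations(syllables):
    """Enumerate every way to split a syllable string at its internal gaps."""
    if not syllables:
        return []
    results = []
    # One True/False decision per gap between adjacent syllables.
    for cuts in product([False, True], repeat=len(syllables) - 1):
        words, current = [], syllables[0]
        for syllable, cut in zip(syllables[1:], cuts):
            if cut:
                words.append(current)
                current = syllable
            else:
                current += syllable
        words.append(current)
        results.append(words)
    return results


for words in all_segmentations("동서남북"):
    print("|".join(words))
```

The exponential growth is the reason practical analyzers prune candidates with a dictionary and connection rules rather than enumerating every split.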
-
Typical Ambiguities of Word Phrase Analysis
• Categorial Disambiguation
  – e.g., gam+eun (Korean example)
    • Noun + Particle
    • Verb stem + Ending
• Segmentation Disambiguation
  – e.g., gamgineun
    • Noun + Particle = gamgi+neun
    • Verb stem + Ending + Particle = gam+gi+neun
• Stem Identification
  – e.g., naneun
    • Noun + Particle = na+neun
    • Verb stem + Ending = nal+neun (with verbal ending change)
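Candidate generation for the segmentation ambiguity of gamgineun can be sketched with a toy dictionary and a POS-bigram connection table. Everything in `LEXICON`, `CONNECT` and `FINAL` is a hypothetical miniature built only for this example; the sketch works on romanized strings and does not handle stem changes such as nal+neun → naneun, which would need phonological rules.

```python
# Toy dictionary: surface form -> possible POS tags (N noun, V verb stem,
# e ending, p particle). Entries are illustrative assumptions.
LEXICON = {"gamgi": ["N"], "gam": ["N", "V"], "gi": ["e"], "neun": ["p"]}

# Connection table (POS bigram): which tag may follow which. "<s>" marks
# the start of the word phrase.
CONNECT = {("<s>", "N"), ("<s>", "V"), ("N", "p"), ("V", "e"), ("e", "p")}

# Tags allowed at the end of a word phrase.
FINAL = {"N", "p", "e"}


def analyses(phrase, prev="<s>"):
    """Return all morpheme/POS sequences licensed by LEXICON and CONNECT."""
    if not phrase:
        return [[]] if prev in FINAL else []
    results = []
    for end in range(1, len(phrase) + 1):
        form = phrase[:end]
        for pos in LEXICON.get(form, []):
            if (prev, pos) in CONNECT:
                for rest in analyses(phrase[end:], pos):
                    results.append([(form, pos)] + rest)
    return results


for analysis in analyses("gamgineun"):
    print(" + ".join(f"{form}/{pos}" for form, pos in analysis))
```

The connection table rejects gam/N + gi/e, so only the two analyses listed on the slide survive; choosing between them is the job of the tagging phase.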