Intelligent Database Systems Lab
Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Author: Tao-Hsing Chang, Chia-Hoang Lee
National Yunlin University of Science and Technology (國立雲林科技大學)
Automatic Chinese unknown word extraction using small-corpus-based method
Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, IEEE, 2003
Outline
Motivation
Objective
Introduction
Extracting possible unknown words
  SPLR
Modification
  Prefixed/suffixed, Compound word selection
Experiment
Conclusion
Opinion
N.Y.U.S.T.I.M.
Motivation
Any Chinese character can either represent a word by itself or be part of other words.
There are no blanks between Chinese words to mark the word boundaries.
Statistics-based and rule-based methods both have some drawbacks.
Examples: “拍打皮卡丘” (pat Pikachu), “觀光協會” (tourism association), “神奇寶貝” (Pokémon)
Objective
Extract Chinese unknown words
  Efficiency
  Accuracy
  Words that occur rarely
  Small set of documents for training
1-1. Introduction
Unknown words: words that do not exist in the dictionary or vocabulary
Identifying the boundaries: “拍打皮卡丘” (pat Pikachu), “資料探勘非常有意思” (data mining is very interesting)
Semantic ambiguity: “觀光協會” (tourism association), “神奇寶貝” (Pokémon)
1-2. Introduction
Restricting the scope to particular types of unknown words
  Prefixes/suffixes
  Identifying proper names
  Hybrid methods that estimate the probability
Identifying general unknown words is difficult
  “熱鬧非凡” (extraordinarily lively), “回味無窮” (leaves a lasting aftertaste), “神奇寶貝” (Pokémon)
  “發生什麼” (what happened), “老師問問題” (the teacher asks a question)
1-3. Introduction
Statistics-based methods
  Small documents cause low accuracy
Develop a method that combines
  the efficiency of statistics-based methods
  accurate identification even when the document set is small
2. Previous Works
A proper name that is a compound word cannot be identified as a whole
  “中國國際商業銀行” (International Commercial Bank of China) is split into “中國” (China), “國際” (international), “商業” (commercial), “銀行” (bank)
Statistics-based methods rely on occurrence frequency
PLU-based likelihood ratio (PLR)
  Efficient and fast
  Words that occur rarely cannot be extracted
3-1. Extracting Possible Unknown Words
Preprocessing
  Retrieve possible character sequences
  The maximum length of character sequences is limited
  Eliminate stop words from character sequences
The frequently occurring character sequences are then regarded as possible unknown words.
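The preprocessing steps above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the treatment of stop words as single stop characters, and the `max_len` and `min_freq` values are all assumptions that do not come from the slides.

```python
from collections import Counter


def split_segments(text, stop_chars):
    """Cut the text at stop characters and at non-Chinese characters."""
    segments, current = [], []
    for ch in text:
        is_chinese = "\u4e00" <= ch <= "\u9fff"
        if ch in stop_chars or not is_chinese:
            # A stop character or punctuation mark ends the current segment.
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments


def candidate_sequences(text, stop_chars, max_len=4, min_freq=2):
    """Count sequences of 2..max_len characters and keep the frequent ones.

    max_len and min_freq are hypothetical settings chosen for illustration.
    """
    counts = Counter()
    for seg in split_segments(text, stop_chars):
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return {seq: c for seq, c in counts.items() if c >= min_freq}
```

Calling `candidate_sequences("我去福利社，他去福利社。", stop_chars={"我", "他"})` returns the sequences that occur at least twice, including both “福利社” and “去福利社”; the later slides deal with deciding which of such overlapping candidates to keep.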
3-2. Extracting Possible Unknown Words
If the occurrences of a sequence merely follow those of one of its subsequences, the sequence should not be an unknown word.
“去福利社” (go to the school store) occurs following “福利社” (school store), so “去福利社” is not a possible unknown word.
3-3. Extracting Possible Unknown Words
Defined: (definition not preserved in the transcript)
3-4. Extracting Possible Unknown Words
“去福利社” (go to the school store): 200 times
“福利社” (school store): 1000 times
SPLR(tp) = (expression and value not preserved in the transcript)
Error-tolerance coefficients
4. Modification
1. One-character prefix (前綴) or suffix (字尾)
   “導師室” (homeroom teachers' office) contains “導師” (homeroom teacher), which results in a low SPLR for “導師室”
2. Familiar sequences
   “從教室裡衝出來” (rush out of the classroom) is not an unknown word, but it would be identified as one by the simple SPLR method
4-1-1. Prefixed/Suffixed Word Revising
Some words that contain prefixes or suffixes have already been collected by available dictionaries.
For example, the unknown words:
  “總領隊” (chief team leader) includes a prefix: “ocw + mcw”
  “導師室” (homeroom teachers' office) includes a suffix: “mcw + ocw”
(ocw: one-character word; mcw: multi-character word)
4-1-2. Prefixed/Suffixed Word Revising
One-character prefixes/suffixes can be extracted in advance from available dictionaries.
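One way to picture this extraction step is the sketch below. It treats a dictionary word as "ocw + mcw" when removing its first character leaves another dictionary word, and as "mcw + ocw" when removing its last character does. The function name, the `min_count` cut-off, and the toy lexicon in the usage note are assumptions for illustration; the slides do not say how the authors actually collected the affixes.

```python
from collections import Counter


def affix_candidates(lexicon, min_count=2):
    """Collect one-character prefixes and suffixes from a word list.

    min_count is a hypothetical cut-off: an affix is kept only if it
    attaches to at least that many dictionary words.
    """
    words = set(lexicon)
    prefixes, suffixes = Counter(), Counter()
    for w in words:
        if len(w) < 3:
            continue  # too short to be one character plus a multi-character word
        if w[1:] in words:   # pattern "ocw + mcw": first character is a prefix
            prefixes[w[0]] += 1
        if w[:-1] in words:  # pattern "mcw + ocw": last character is a suffix
            suffixes[w[-1]] += 1
    kept_prefixes = {ch for ch, n in prefixes.items() if n >= min_count}
    kept_suffixes = {ch for ch, n in suffixes.items() if n >= min_count}
    return kept_prefixes, kept_suffixes
```

With a toy lexicon such as `["導師", "導師室", "辦公", "辦公室", "領隊", "總領隊", "經理", "總經理"]`, the function returns “總” as a prefix and “室” as a suffix.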
4-2-1. Compound Word Selection
Familiar sequences in the document:
  A familiar sequence includes one or more common words, while a compound word consists of particular words
  “從教室裡衝出來” (rush out of the classroom) consists of the common words “教室” (classroom) and “出來” (come out)
  “文具用品” (stationery supplies): 100 times; “文具” (stationery): 100 times; “用品” (supplies): 100 times
4-2-2. Compound Word Selection
ts is a word contained in tp that is not a one-character word
(symbol not preserved in the transcript) is the threshold
A sequence that consists of common words should not be a possible unknown word
4-2-3. Compound Word Selection
Familiar sequences and compound words can be differentiated efficiently
  “神奇寶貝” (Pokémon): 200 times; “神奇” (magical): 230 times; “寶貝” (treasure): 250 times; ratio 200/230
  “發生什麼” (what happened): 200 times; “發生” (happen): 2000 times; “什麼” (what): 4000 times; ratio 200/2000
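The two ratios on this slide can be reproduced with a small sketch. The selection formula itself did not survive in the transcript, so the code below rests on assumptions: it compares the frequency of the sequence tp with the frequency of each multi-character word ts it contains, uses the largest ratio (which matches the 200/230 and 200/2000 annotations), and applies a hypothetical threshold of 0.5. The function names are also made up for illustration.

```python
def component_ratios(tp, freq, words):
    """Return f(tp) / f(ts) for every multi-character word ts contained in tp."""
    ratios = {}
    for ts in words:
        # One-character words are skipped, as the slides require.
        if len(ts) > 1 and ts != tp and ts in tp and freq.get(ts, 0) > 0:
            ratios[ts] = freq[tp] / freq[ts]
    return ratios


def is_compound_candidate(tp, freq, words, threshold=0.5):
    """Keep tp when its components rarely occur outside of tp.

    Assumption: the largest ratio is compared with the threshold;
    0.5 is a hypothetical value, not one taken from the slides.
    """
    ratios = component_ratios(tp, freq, words)
    return bool(ratios) and max(ratios.values()) >= threshold


freq = {"神奇寶貝": 200, "神奇": 230, "寶貝": 250,
        "發生什麼": 200, "發生": 2000, "什麼": 4000}
words = ["神奇", "寶貝", "發生", "什麼"]
print(is_compound_candidate("神奇寶貝", freq, words))  # ratio 200/230
print(is_compound_candidate("發生什麼", freq, words))  # ratio 200/2000
```

Under these assumptions “神奇寶貝” is kept as a compound word, while “發生什麼” is rejected as a familiar sequence built from common words.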
5. Experiments
Data set: 1,285 student essays
Theme: “Recess at School”
Characters: 470,665
5-1. Experiments - SPLR
5-2. Experiments - Familiar
5-3. Experiments - Prefixed/Suffixed
Prefixed or suffixed patterns in the CKIP lexicon
(Chinese Knowledge and Information Processing group, Institute of Information Science, Academia Sinica; 中央研究院資訊科學研究所 - 中文知識庫小組)
6. Conclusion
Efficiency
Accuracy
Words that occur rarely
Small training corpus
Opinion
Information Retrieval
  Unknown words
  Compound words
Semantic web