
Intelligent Database Systems Lab

Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Tao-Hsing Chang, Chia-Hoang Lee

National Yunlin University of Science and Technology (國立雲林科技大學)

Automatic Chinese unknown word extraction using small-corpus-based method

Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on, IEEE


Outline

Motivation
Objective
Introduction
Extracting possible unknown words
  SPLR
Modification
  Prefixed/suffixed, Compound word selection
Experiment
Conclusion
Opinion

N.Y.U.S.T.I.M.


Motivation

Any Chinese character can either stand alone as a word or be part of another word.

There are no blanks between Chinese words to mark the word boundaries.

Both statistics-based and rule-based methods have drawbacks, e.g. with “拍打皮卡丘” (patting Pikachu), “觀光協會” (tourism association), “神奇寶貝” (Pokémon).


Objective

Extract Chinese unknown words with:
efficiency
accuracy
words that occur rarely
a small set of documents for training



1-1. Introduction

Unknown words are words that don't exist in the dictionary or vocabulary.

Identifying the boundaries: “拍打皮卡丘” (patting Pikachu), “資料探勘非常有意思” (data mining is very interesting)

Semantic ambiguity: “觀光協會” (tourism association), “神奇寶貝” (Pokémon)



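1-2. Introduction

Restricting the scope to particular types of unknown words:
‘Prefixes/suffixes’ to identify proper names
A hybrid method to estimate the probability

General unknown words are difficult to identify: “熱鬧非凡” (extraordinarily lively), “回味無窮” (a lingering aftertaste), “神奇寶貝” (Pokémon), “發生什麼” (what happened), “老師問問題” (the teacher asks a question)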



1-3. Introduction

Statistics-based methods: small documents cause low accuracy.

Develop a method that keeps the efficiency of statistics-based methods and still identifies unknown words accurately when the document set is small.



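2. Previous Works

A proper name that is a compound word can't be identified: “中國國際商業銀行” (International Commercial Bank of China) is split into “中國” (China), “國際” (international), “商業” (commercial), “銀行” (bank).

Statistics-based methods rely on occurrence frequency.

PLU-based likelihood ratio (PLR): efficient and fast, but words that occur rarely can't be extracted.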



3-1. Extracting Possible Unknown Words

Preprocessing:
Retrieve possible character sequences.
Limit the maximum length of character sequences.
Eliminate stop words from character sequences.

The frequently occurring character sequences are then regarded as possible unknown words.
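
The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `candidate_sequences`, the parameters `max_len` and `min_freq`, and the tiny stop-character list are all assumptions, and "eliminate stop words" is interpreted here as discarding any sequence that contains a stop character or punctuation mark.

```python
from collections import Counter

# Hypothetical stop characters and punctuation; a real list would be much longer.
STOP_CHARS = frozenset("的了是在，。！？、")

def candidate_sequences(text, max_len=4, min_freq=2, stop_chars=STOP_CHARS):
    """Count character sequences of length 2..max_len and keep the frequent ones."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            seq = text[i:i + n]
            # Skip sequences that contain a stop character or punctuation.
            if any(ch in stop_chars for ch in seq):
                continue
            counts[seq] += 1
    # Frequently occurring sequences are regarded as possible unknown words.
    return {seq: c for seq, c in counts.items() if c >= min_freq}

sample_text = "我去福利社買文具用品，他也去福利社買文具用品"
candidates = candidate_sequences(sample_text)
```

On the toy sentence, both “福利社” and “去福利社” survive as candidates, which is the over-generation that the later filtering steps are meant to address.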




If a sequence mostly occurs as an extension of one of its subsequences, the sequence should not be an unknown word. “去福利社” (go to the school store) occurs as an extension of “福利社” (school store), so “去福利社” isn't a possible unknown word.




Defined:




“去福利社” occurs 200 times; “福利社” occurs 1000 times.

SPLR(tp) = 200 / 1000 = 0.2


Error-tolerance coefficients
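
The slide's formal definition of SPLR is not preserved in this text, so the following Python sketch rests on an assumption: that SPLR compares the frequency of a sequence tp with the frequency of its most frequent multi-character subsequence, and that a low ratio marks tp as an unlikely unknown word. The function name `splr`, the `min_sub_len` parameter, and the 0.5 threshold are hypothetical, and the error-tolerance coefficients are not modeled.

```python
def splr(seq, freq, min_sub_len=2):
    """Assumed form of SPLR: freq(seq) / freq(most frequent proper subsequence)."""
    subs = [
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + min_sub_len, len(seq) + 1)
        if seq[i:j] != seq
    ]
    sub_freqs = [freq[s] for s in subs if s in freq]
    if not sub_freqs:
        return 1.0  # no counted subsequence competes with seq
    return freq[seq] / max(sub_freqs)

# Frequencies taken from the slide's example.
freq = {"去福利社": 200, "福利社": 1000}
SPLR_THRESHOLD = 0.5  # hypothetical cut-off, not from the source

ratio_long = splr("去福利社", freq)
ratio_short = splr("福利社", freq)
is_candidate = ratio_long >= SPLR_THRESHOLD
```

Under these assumptions “去福利社” scores 200/1000 and is rejected, while “福利社” has no counted subsequence and is kept.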


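4. Modification

1. One-character prefix (前綴) or suffix (字尾)

The frequent word “導師” (homeroom teacher) results in a low SPLR for “導師室” (homeroom teachers' office).

2. Familiar sequences

“從教室裡衝出來” (rush out of the classroom) isn't an unknown word but would be identified as one by the simple SPLR method.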



4-1-1. Prefixed/Suffixed Word Revising

Some words that contain prefixes or suffixes have already been collected in available dictionaries. For example, for an unknown word:

“總領隊” (chief team leader) includes a prefix: “ocw + mcw” (總 + 領隊)
“導師室” (homeroom teachers' office) includes a suffix: “mcw + ocw” (導師 + 室)



4-1-2. Prefixed/Suffixed Word Revising

The one-character prefixes/suffixes can be extracted in advance from available dictionaries.
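
One way to read this step is sketched below in Python. It is an illustration under assumptions rather than the paper's procedure: a character is treated as a prefix (or suffix) when at least `min_count` dictionary words consist of that character followed by (or preceded by) another multi-character dictionary word. The names `extract_affixes`, `is_affixed`, `min_count`, and the toy dictionary are all hypothetical.

```python
from collections import Counter

def extract_affixes(dictionary, min_count=2):
    """Collect one-character prefixes/suffixes from 'ocw + mcw' and 'mcw + ocw' words."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for word in dictionary:
        if len(word) < 3:
            continue  # need one character plus a multi-character word
        if word[1:] in dictionary:
            prefix_counts[word[0]] += 1
        if word[:-1] in dictionary:
            suffix_counts[word[-1]] += 1
    prefixes = {c for c, n in prefix_counts.items() if n >= min_count}
    suffixes = {c for c, n in suffix_counts.items() if n >= min_count}
    return prefixes, suffixes

def is_affixed(seq, dictionary, prefixes, suffixes):
    """True if seq looks like a known prefix/suffix attached to a dictionary word."""
    if len(seq) < 3:
        return False
    has_prefix = seq[0] in prefixes and seq[1:] in dictionary
    has_suffix = seq[-1] in suffixes and seq[:-1] in dictionary
    return has_prefix or has_suffix

# Toy dictionary for illustration only.
toy_dictionary = {"經理", "總經理", "編輯", "總編輯", "辦公", "辦公室",
                  "會議", "會議室", "導師", "領隊"}
prefixes, suffixes = extract_affixes(toy_dictionary)
```

With this toy dictionary, “總領隊” and “導師室” are both flagged as affixed, so a revised procedure could keep them as possible unknown words even when their SPLR is low.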



4-2-1. Compound Word Selection

A familiar sequence in the document includes one or more common words, while a compound word consists of particular words.

“從教室裡衝出來” (rush out of the classroom) consists of the common words “教室” (classroom) and “出來” (come out).

“文具用品” (stationery supplies) occurs 100 times; “文具” (stationery) 100 times; “用品” (supplies) 100 times.



4-2-2. Compound Word Selection

ts is a word that is included in tp and is not a one-character word. A threshold is used.

A sequence that consists of common words should not be a possible unknown word.



4-2-3. Compound Word Selection

Familiar sequences and compound words can be differentiated efficiently.

“神奇寶貝” (Pokémon) occurs 200 times; “神奇” 230 times; “寶貝” 250 times.

“發生什麼” (what happened) occurs 200 times; “發生” 2000 times; “什麼” 4000 times.


“神奇寶貝”: 200/230 ≈ 0.87

“發生什麼”: 200/2000 = 0.1
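
The two ratios above can be reproduced with a short Python sketch. The exact measure on the slide is not preserved, so this assumes the ratio divides the frequency of tp by the frequency of its least frequent multi-character component word ts, which matches 200/230 and 200/2000. The name `compound_ratio` and the 0.5 threshold are hypothetical.

```python
def compound_ratio(seq, components, freq):
    """Assumed measure: freq(seq) / freq(least frequent multi-character component)."""
    parts = [w for w in components if len(w) > 1]  # ignore one-character words
    return freq[seq] / min(freq[w] for w in parts)

# Frequencies taken from the slide's two examples.
freq = {
    "神奇寶貝": 200, "神奇": 230, "寶貝": 250,
    "發生什麼": 200, "發生": 2000, "什麼": 4000,
}
COMPOUND_THRESHOLD = 0.5  # hypothetical cut-off, not from the source

r_compound = compound_ratio("神奇寶貝", ["神奇", "寶貝"], freq)
r_familiar = compound_ratio("發生什麼", ["發生", "什麼"], freq)
keep_compound = r_compound >= COMPOUND_THRESHOLD
keep_familiar = r_familiar >= COMPOUND_THRESHOLD
```

A high ratio means the component words rarely appear outside the sequence, which suggests a compound word. A low ratio means the components are common words in their own right, which suggests a familiar sequence.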


5. Experiments

Data set: 1,285 student essays
Theme: “Recess at School”
Characters: 470,665



5-1. Experiments: SPLR


5-2. Experiments: Familiar sequences


5-3. Experiments: Prefixed/suffixed

Prefixed or suffixed patterns in the CKIP lexicon (中央研究院資訊科學研究所 - 中文知識庫小組: Institute of Information Science, Academia Sinica, Chinese Knowledge Base Group)



6. Conclusion

efficiency
accuracy
words that occur rarely
a small training corpus



Opinion Information Retrieval

Unknown word, compound word

Semantic web

