Intelligent Database Systems Lab. Advisor: Dr. Hsu. Graduate: Chien-Shing Chen. Authors: Tao-Hsing Chang, Chia-Hoang Lee. 國立雲林科技大學 National Yunlin University of Science and Technology. Automatic Chinese unknown word extraction using small-corpus-based method. Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on, IEEE.



Page 1:

Intelligent Database Systems Lab

Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Tao-Hsing Chang, Chia-Hoang Lee

國立雲林科技大學 National Yunlin University of Science and Technology

Automatic Chinese unknown word extraction using small-corpus-based method

Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on, IEEE

Page 2:

Outline

Motivation
Objective
Introduction
Extracting possible unknown words
  SPLR
Modification
  Prefixed/suffixed, compound word selection
Experiment
Conclusion
Opinion

Page 3:

Motivation

Any Chinese character can either represent a word or be part of other words.

There are no blanks between Chinese words to mark the word boundaries.

Statistics-based and rule-based methods have some drawbacks, e.g. “拍打皮卡丘” (slap Pikachu), “觀光協會” (tourism association), “神奇寶貝” (Pokémon).

Page 4:

Objective

Extract Chinese unknown words:
  efficiently
  accurately
  including words that occur rarely
  from a small set of documents for training

Page 5:

1-1. Introduction

Unknown words: words that do not exist in a dictionary or vocabulary
Identifying the boundaries: “拍打皮卡丘” (slap Pikachu), “資料探勘非常有意思” (data mining is very interesting)
Semantic ambiguity: “觀光協會” (tourism association), “神奇寶貝” (Pokémon)

Page 6:

1-2. Introduction

Restricting the scope to particular types of unknown words:
  Prefixes/suffixes to identify proper names
  Hybrid methods to estimate the probability

Identifying general unknown words is difficult:
  “熱鬧非凡” (bustling with activity), “回味無窮” (endlessly memorable), “神奇寶貝” (Pokémon)
  “發生什麼” (what happened), “老師問問題” (the teacher asks a question)

Page 7:

1-3. Introduction

Statistics-based methods
  Small documents cause low accuracy

Develop a method that
  keeps the efficiency advantage of statistics-based methods
  identifies unknown words accurately even when the document set is small

Page 8:

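2. Previous Works

Proper names that are compound words cannot be identified:
  “中國國際商業銀行” (International Commercial Bank of China) is split into “中國” (China), “國際” (international), “商業” (commercial), “銀行” (bank)

Statistics-based methods rely on occurrence frequency

PLU-based likelihood ratio (PLR)
  Not only efficient but also fast
  Words that occur rarely cannot be extracted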

Page 9:

3-1. Extracting Possible Unknown Words

Preprocessing
  Retrieve possible character sequences
  The maximum length of character sequences is limited
  Eliminate stop words from character sequences

The frequently occurring character sequences are then regarded as possible unknown words.
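The preprocessing steps above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the treatment of stop words as single characters, and the `max_len` and `min_count` values are all assumptions introduced here.

```python
from collections import Counter


def split_runs(text, stop_chars):
    """Split text into runs that contain no stop character (assumed single-character stop words)."""
    runs, current = [], []
    for ch in text:
        if ch in stop_chars:
            if current:
                runs.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


def candidate_sequences(text, stop_chars, max_len=4, min_count=2):
    """Count character sequences of length 2..max_len and keep the frequent ones."""
    counts = Counter()
    for run in split_runs(text, stop_chars):
        for n in range(2, max_len + 1):          # sequence length is bounded by max_len
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    # Frequent sequences are treated as possible unknown words.
    return {seq: c for seq, c in counts.items() if c >= min_count}


text = "我去福利社,他去福利社"
candidates = candidate_sequences(text, stop_chars={"我", "他", ","})
```

Here both “福利社” and “去福利社” survive as candidates, which is why a further filtering step on sequences and their subsequences is needed.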

Page 10:

3-2. Extracting Possible Unknown Words

If a sequence occurs only as a consequence of its subsequence occurring, the sequence should not be an unknown word.
“去福利社” (go to the school store) occurs because “福利社” (school store) occurs, so “去福利社” is not a possible unknown word.

Page 11:

3-3. Extracting Possible Unknown Words

Defined: [formula missing]

Page 12:

3-4. Extracting Possible Unknown Words

“去福利社” (go to the school store): 200 times
“福利社” (school store): 1000 times

SPLR(tp) = [values missing]

Error-tolerance coefficients
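The SPLR formula itself is not preserved in this transcript, so the following is not the authors' definition. It is a minimal sketch of the intuition only, assuming a plain frequency ratio between a sequence tp and one of its subsequences ts, and ignoring the error-tolerance coefficients. The function name `sequence_ratio` is hypothetical.

```python
def sequence_ratio(freq, tp, ts):
    """Assumed stand-in for SPLR: count of tp divided by count of its subsequence ts."""
    if ts not in tp:
        raise ValueError("ts must be a subsequence of tp")
    return freq[tp] / freq[ts]


# Counts taken from the slide.
freq = {"去福利社": 200, "福利社": 1000}
ratio = sequence_ratio(freq, "去福利社", "福利社")
```

Under this assumed ratio, “去福利社” accounts for only a small share of the occurrences of “福利社”, which matches the slide's conclusion that it is not a possible unknown word.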

Page 13:

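4. Modification

1. One-character prefix (前綴) or suffix (字尾)
  “導師室” (homeroom teachers' office) contains “導師” (homeroom teacher), which results in a low SPLR for “導師室”

2. Familiar sequences
  “從教室裡衝出來” (rush out of the classroom) is not an unknown word, but it would be identified as one by the simple SPLR method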

Page 14:

4-1-1. Prefixed/Suffixed Word Revising

Some words that contain prefixes or suffixes have already been collected in available dictionaries. Examples of unknown words:

“總領隊” (chief team leader) includes a prefix: “ocw + mcw”
“導師室” (homeroom teachers' office) includes a suffix: “mcw + ocw”

Page 15:

4-1-2. Prefixed/Suffixed Word Revising

The one-character prefixes/suffixes can be extracted in advance from available dictionaries.
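The “ocw + mcw” and “mcw + ocw” patterns can be illustrated with a small check. This sketch assumes that ocw and mcw abbreviate one-character and multi-character words, which the slides do not state explicitly, and the tiny `lexicon`, `prefixes` and `suffixes` sets are made-up stand-ins for what would be extracted from a real dictionary.

```python
def affix_pattern(candidate, lexicon, prefixes, suffixes):
    """Return "ocw + mcw", "mcw + ocw", or None for a candidate sequence."""
    if len(candidate) < 3:
        return None  # need one affix character plus a multi-character word
    if candidate[0] in prefixes and candidate[1:] in lexicon:
        return "ocw + mcw"   # one-character prefix followed by a known word
    if candidate[-1] in suffixes and candidate[:-1] in lexicon:
        return "mcw + ocw"   # known word followed by a one-character suffix
    return None


# Hypothetical data for illustration only.
lexicon = {"領隊", "導師"}
prefixes = {"總"}
suffixes = {"室"}

prefix_result = affix_pattern("總領隊", lexicon, prefixes, suffixes)
suffix_result = affix_pattern("導師室", lexicon, prefixes, suffixes)
other_result = affix_pattern("去福利社", lexicon, prefixes, suffixes)
```

A candidate that matches one of the two patterns could then be kept even when its ratio against the contained word is low.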

Page 17:

4-2-1. Compound Word Selection

A familiar sequence in the document includes one or more common words, while a compound word consists of particular words.

“從教室裡衝出來” (rush out of the classroom) consists of the common words “教室” (classroom) and “出來” (come out)

“文具用品” (stationery supplies) 100 times, “文具” (stationery) 100 times, “用品” (supplies) 100 times

Page 18:

4-2-2. Compound Word Selection

ts is a word contained in tp that is not a one-character word.
[symbol missing] is the threshold.
A sequence that consists of common words should not be a possible unknown word.

Page 19:

4-2-3. Compound Word Selection

Familiar sequences and compound words can be differentiated efficiently.

“神奇寶貝” (Pokémon) 200 times, “神奇” (magical) 230 times, “寶貝” (treasure) 250 times: 200/230 ≈ 0.87

“發生什麼” (what happened) 200 times, “發生” (happen) 2000 times, “什麼” (what) 4000 times: 200/2000 = 0.1
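The two ratios on this slide can be reproduced with a short sketch. The selection formula and threshold are not preserved in the transcript, so this assumes the ratio is taken against the least frequent multi-character component, which agrees with both 200/230 and 200/2000. The threshold value 0.5 and the function names are hypothetical.

```python
def component_ratio(freq, tp, components):
    """Count of tp divided by the count of its least frequent multi-character component."""
    words = [ts for ts in components if len(ts) > 1]  # one-character words are skipped
    return freq[tp] / min(freq[ts] for ts in words)


def is_compound(freq, tp, components, threshold=0.5):
    """Treat tp as a compound word when its components rarely occur outside it."""
    return component_ratio(freq, tp, components) >= threshold


# Counts taken from the slide.
freq = {
    "神奇寶貝": 200, "神奇": 230, "寶貝": 250,
    "發生什麼": 200, "發生": 2000, "什麼": 4000,
}

pokemon_ratio = component_ratio(freq, "神奇寶貝", ["神奇", "寶貝"])
familiar_ratio = component_ratio(freq, "發生什麼", ["發生", "什麼"])
pokemon_is_compound = is_compound(freq, "神奇寶貝", ["神奇", "寶貝"])
familiar_is_compound = is_compound(freq, "發生什麼", ["發生", "什麼"])
```

With these counts, “神奇寶貝” is kept as a compound word while “發生什麼” is rejected as a familiar sequence built from common words.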

Page 20:

5. Experiments

Data set: 1,285 student essays
Theme: “Recess at School”
Characters: 470,665

Page 21:

5-1. Experiments: SPLR

Page 22:

5-2. Experiments: Familiar

Page 23:

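5-3. Experiments: prefixed/suffixed

Prefixed or suffixed patterns in the CKIP lexicon (中央研究院資訊科學研究所 中文知識庫小組; Chinese Knowledge and Information Processing group, Institute of Information Science, Academia Sinica)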

Page 24:

6. Conclusion

Efficiency
Accuracy
Words that occur rarely
Small training corpus

Page 25:

Opinion

Information retrieval
  Unknown words
  Compound words

Semantic web
