Post on 11-Jan-2016

Tsinghua University 1

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Wei Qiao, Maosong Sun and Wolfgang Menzel

State Key Lab of Intelligent Tech. & Sys., Tsinghua University
Department of Informatics, Hamburg University

Part Ⅰ

Background

Introduction

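Chinese word segmentation

Combination ambiguity
火把 (torch) vs. 火 (fire) 把 (make)

Overlapping ambiguity
a. 先解决其主要问题,再解决其次要问题 (first solve its main problems, then its secondary ones): 其 次要 (the subordinate)
b. 首先要关注整体,其次要注意细节 (first attend to the whole, then to the details): 其次 要 (secondly we should)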

Related Terms

Overlapping ambiguity string (OAS)
Described by: length; order; intersection length; structure

Maximal overlapping ambiguity string (MOAS)

True / pseudo ambiguity MOAS
e.g. 其次要 (TM): 其次 要 & 其 次要
e.g. 部长篇小说 (PM): 部 (measure word) 长篇小说

[Figure: example OAS labeled with order 1 and order 2, character positions 0 1 2 3, structure 0-2, 1-3, and the value 3]
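
The terms above can be made concrete with a small sketch. The slides do not say how MOASs are extracted, so the procedure below is only one simplified reading of the definition: it collects every pair of dictionary words that cross each other in a sentence and merges the regions they cover into maximal strings. The function name find_moas, the max_word_len parameter, and the toy lexicon are assumptions made for illustration.

```python
def find_moas(sentence, lexicon, max_word_len=4):
    """Return the maximal overlapping ambiguity strings found in `sentence`.

    Simplified sketch: `lexicon` is a set of words, and `max_word_len`
    is a hypothetical cap on the length of a dictionary word.
    """
    # All multi-character dictionary words in the sentence, as [start, end) spans.
    spans = [
        (i, j)
        for i in range(len(sentence))
        for j in range(i + 2, min(len(sentence), i + max_word_len) + 1)
        if sentence[i:j] in lexicon
    ]
    # Two words cross when a1 < a2 < b1 < b2; together they cover [a1, b2).
    regions = sorted(
        (a1, b2)
        for (a1, b1) in spans
        for (a2, b2) in spans
        if a1 < a2 < b1 < b2
    )
    # Merge regions that share at least one character to obtain maximal strings.
    merged = []
    for start, end in regions:
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [sentence[s:e] for s, e in merged]


# Toy lexicon for illustration only (not the Peking University word list).
toy_lexicon = {"其次", "次要", "解决", "问题"}
print(find_moas("再解决其次要问题", toy_lexicon))  # ['其次要']
```

Nested words (one word fully inside another) are deliberately ignored here, since they give rise to combination ambiguity rather than overlapping ambiguity.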

Previous Work

[Sun et al., 1999]
100 million characters
A core set of MOASs was identified

[Li et al., 2003]
650 million characters
A similar method was used to improve the performance of a segmenter

Motivation

Two basic issues remain unsolved in the previous work:

The corpora include only news data, so the results need further validation.
The core of pseudo OA strings still has to be determined, for both general-purpose and domain-specific text.

Part Ⅱ

Statistical Properties of MOAS

From General Corpus
From Domain-specific Corpus

From General Corpus

Data set
CBC: 929,963,468 characters
Rich in content (from the 1920s onward), covering many categories such as novels, essays, and news

Chinese word list
Peking University, with 74,191 entries

A total of 733,066 distinct MOAS types are found automatically in CBC

Detailed Distribution

Perspective 1: Length

Perspective 2: Order

Perspective 3: Intersection Length

Perspective 4: Structure distribution

Top N Frequent MOAS (core candidates)

N = 3,500: ~50.78%
N = 7,000: ~60.43%
N = 40,000: ~80.39%
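
Figures like these can be computed from a frequency table of MOAS types. The sketch below assumes the percentages are token coverage, i.e. the share of all MOAS occurrences accounted for by the N most frequent types; the function name and the toy counts are made up for illustration and are not taken from CBC.

```python
from collections import Counter


def top_n_token_coverage(moas_counts, n):
    """Share of all MOAS tokens covered by the n most frequent MOAS types."""
    total = sum(moas_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in moas_counts.most_common(n))
    return covered / total


# Invented counts, for illustration only.
toy_counts = Counter({"其次要": 6, "出国门": 3, "部长篇小说": 1})
print(top_n_token_coverage(toy_counts, 1))  # 0.6
print(top_n_token_coverage(toy_counts, 2))  # 0.9
```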

Stability vs. Corpus Size

[Figure: number of MOAS vs. corpus size]

[Figure: number of top N MOAS vs. corpus size (top 7,000)]

Pseudo MOAS Detection

Relaxed definition of "pseudo"
E.g. "出国门": 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) only with a small probability

5,507 PMs and 1,439 TMs judged by hand

Token coverage of PM and TM over CBC
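
In the slides the PM/TM decision is made by hand. As a rough illustration of what the relaxed definition means, the sketch below marks a MOAS as pseudo when a single segmentation accounts for nearly all of its observed occurrences. The 0.95 threshold, the function name, and the counts are all hypothetical; they are not the authors' criterion or data.

```python
def is_pseudo(segmentation_counts, threshold=0.95):
    """Relaxed test: True if one segmentation dominates the observed occurrences.

    `segmentation_counts` maps each segmentation of a MOAS to how often it
    was judged correct in context; `threshold` is an assumed cut-off.
    """
    total = sum(segmentation_counts.values())
    if total == 0:
        return False
    return max(segmentation_counts.values()) / total >= threshold


# Invented counts, for illustration only.
print(is_pseudo({"出 国门": 99, "出国 门": 1}))   # True
print(is_pseudo({"其次 要": 55, "其 次要": 45}))  # False
```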

From Domain-specific Corpora

Domain-specific corpora
Ency55: 90.02 million characters
Web55: 54.97 million characters

Common Parts

Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)

Frequent MOAS Coverage in Domain Specific Corpora (N=7,000)

Frequent MOAS Coverage in Domain Specific Corpora (N=40,000)

PM and TM distribution over Domain Corpora

42% of the overlapping ambiguities in Chinese text can be resolved with 100% accuracy.

Part Ⅲ

Disambiguation

Disambiguation Method

Current performance on OA

Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs
e.g. 公安局 长 是 主管 这一 事故 的
The police chief (公安 局长) is the person in charge of this accident.

Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs
e.g. 核电站的特殊性 质
The special properties (特殊 性质) of the nuclear power station

Performance of CRF-based [Lafferty 2001] CWS on OAs
e.g. 这一 现状 先 天地 决定 了 他们 的 使命
This situation congenitally (先天 地) determines their mission.

About 2% of OASs are segmented incorrectly, so resolving them is a net gain.

Individual-based method
Simple table lookup: record the PMs and their correct segmentations in a table

Advantages
Satisfactory token coverage of MOASs
Full correctness for the segmentation of pseudo MOASs
Low cost in time and space
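
The table lookup can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the table holds only the two pseudo MOASs named in the slides, and the names PSEUDO_MOAS_TABLE and resolve_pseudo_moas are hypothetical. A MOAS that is not in the table (for example a true ambiguity) returns None so that a general-purpose segmenter can handle it.

```python
# Illustrative table: pseudo MOAS -> its single correct segmentation.
# The two entries are the PM examples mentioned in the slides.
PSEUDO_MOAS_TABLE = {
    "部长篇小说": ("部", "长篇小说"),
    "出国门": ("出", "国门"),
}


def resolve_pseudo_moas(moas, table=PSEUDO_MOAS_TABLE):
    """Return the stored segmentation of a pseudo MOAS, or None if unknown."""
    segmentation = table.get(moas)
    return list(segmentation) if segmentation is not None else None


print(resolve_pseudo_moas("部长篇小说"))  # ['部', '长篇小说']
print(resolve_pseudo_moas("其次要"))      # None
```

Each lookup is a single dictionary access, which is consistent with the low time and space cost claimed for the method.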

Conclusion

An extension of [Sun et al., 1999]
Adjusts the existing results on large corpora
Further verifies the properties on domain-specific corpora
A disambiguation strategy is proposed
Over 42% of overlapping ambiguities can be resolved without any mistake
Will be more effective when facing running text

References

Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282-289.

Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)

Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN’2003, pages 1-7.

Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.

Sun M.S., C.N. Huang, and B.K.Y. T’sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)

Sun M.S., Z.P. Zuo and B.K.Y. T’sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)

Thank you

any comments ? ^.^
