
4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

From Synergy to Knowledge: Integrating multiple language resources

Part II: Creating Synergy and Multi-functionality of Language Resources

Chu-Ren Huang

Academia Sinica

http://cwn.ling.sinica.edu.tw/huang/huang.htm


Outline

o From Language Resources to Language Technology

o A word’s company

o Classical Paradigm of Language Resource Development

o A new paradigm: Integrating Multiple Language resources

o Introduction: CGW Corpus

o Chinese WordSketch: Integrating multiple resources

o Wen-Guo: Merging different resources to create new synergy


From Language Resources to Language Technology

Language Modeling and Knowledge Generation: How can linguistic models and/or generalizations be acquired from language resources?

Sharability: Can two or more resources be combined to create bigger and better resources?

Re-usability: Can a resource be used for a purpose other than the one it was designed for?


A word’s company: Corpus KeyWord In Context (KWIC) and the color pen

1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

The coloured pens method, from Kilgarriff et al. 2005

1 arity, which will be used to take a party of under-privileged children to D
2 from outside. You are invited to a party and after a couple of drinks you d
3 tion, we believe politicians of all parties will listen to our views. &equo
4 ould be reaching agreement with all parties concerned, as to which events,
5 lack people. I have certainly been party to one or two discussions amongst
6 . These should be discussed by both parties before entering into the relatio
7 presents They had hosted a cocktail party at Kensington palace, for example
8 akes. By midnight the end-of-course party is in full swing, but most cadet
9 e should be a right for the injured party to terminate the contract. A mana
10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11 s. Ahead I could see the rest of my party plodding towards the final slope t
12 cial ethic. The two main political parties - the Tories and the Liberals -
13 ritish successes in Perth The small party of British players competing in th
14 to help control. One member of the party went to summon the rescue team and
15 rket society fashion magazine. The party was held at his flat which was a l
16 security and secrecy than any Tory Party Conference : it seems that bootleg
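A KWIC listing like the one above can be produced with a few lines of code. The following is a minimal sketch, not the tool used for the slide: the function name `kwic`, the whitespace tokenization, and the window width are assumptions for illustration. The sample text reuses fragments of concordance lines 7 and 14.

```python
def kwic(tokens, keyword, width=5):
    """Return (left, node, right) context triples for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Fragments of concordance lines 7 and 14, split on whitespace.
text = ("They had hosted a cocktail party at Kensington palace , for example "
        "One member of the party went to summon the rescue team and")
lines = kwic(text.split(), "party", width=4)

for left, node, right in lines:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>30}  {node}  {right}")
```

Sorting the hits by their right or left context, and then marking each line with a sense colour by hand, is what the coloured pens method does manually.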


A Word’s Company Automatically Detected: WordSketch with BNC Data


Sketch Engine and Chinese WordSketch

Sketch Engine http://www.sketchengine.co.uk

Developed by a team led by Adam Kilgarriff

A new corpus viewing tool

Discovering grammatical information from a gigantic corpus

Chinese Wordsketch by Academia Sinica

http://www.ling.sinica.edu.tw/wordsketch (for Taiwan only)

Academia Sinica, Taiwan (Huang, Smith, Ma, Simon 黃居仁,史尚明,馬偉雲,石穆 )


Classical Paradigm of Language Resource Development

Data Collection and Preparation:

Design Criteria : by human

Data collection : executed or supervised by human

digitization : input and/or proofreading by human

Knowledge Enrichment: tagging and structural annotation

Knowledge source : by human

Representational standard and annotation : by human

The quality and speed of human labor become the bottleneck of language resource development


Current Challenges to Corpus and Language Resource Research

Corpus size is too small:

Disambiguation

Collocation

Grammatical functions and other dependencies

These usually require a corpus of 100 million words or more to yield significant distributional information.

Resources development is slow and tedious

Semantic Role Tagging

POS tagging post-processing


Estimating Corpus Scale for Automatic Extraction of Linguistic Knowledge

How many events do we need to establish a reliable description of a word from a corpus automatically?

Grammatical Information based on Word-word Collocation

V + N: 「開立」 (issue) + 「發票」 (invoice); A + N: 「不實」 (false) + 「發票」 (invoice)

Collocational information between any two given mid-frequency words (frequency rank 10,000 or above)

That occur within a 10-word window of the keyword (5 before and 5 after)

Requires a corpus size of 1 billion words or above
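The 10-word window described above (5 words before and 5 after the keyword) can be sketched as follows. This is an illustrative counter only: the function name `window_cooccurrences` and the toy segmented sentence are assumptions, not data from the corpus.

```python
from collections import Counter


def window_cooccurrences(tokens, keyword, before=5, after=5):
    """Count every token that falls inside the window around each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - before)
        hi = min(len(tokens), i + after + 1)
        for j in range(lo, hi):
            if j != i:  # skip the keyword itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical pre-segmented sentence: "the company issues an invoice to the client".
tokens = ["公司", "開立", "發票", "給", "客戶"]
counts = window_cooccurrences(tokens, "發票")
print(counts.most_common())
```

Because two mid-frequency words rarely fall into the same window, such counts stay near zero for most pairs until the corpus is very large, which is the motivation for the billion-word estimate.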


Classical Chinese Corpora: Million-Word Scale

Corpus Name          Online Year  Data                                              Duration / Content
Sinica 4.0 (Taiwan)  1996         5.2 M words; 7.9 M characters                     1990-1996; fully tagged
Sinica 5.0 (Taiwan)  2006         10 M words                                        1990-2004; fully tagged
Sinorama (Taiwan)    2003         3.2 M English words; 5.3 M Chinese characters     1976-2000 (1999-2000); aligned
CCL (Peking)         2003         85 M simplified characters                        1919-2003; partially tagged (1 million)

M = million


A new paradigm: Integrating Multiple Language Resources

From Synergy to Knowledge

Integrate multiple existing (language) resources to create a new resource

Allow resources to scale up beyond existing resources

Generate new knowledge which does not exist in any individual resource

General methodology (without too much additional manual work):

merging existing, similarly annotated resources, or

creating an overall conceptual framework for different knowledge/language resources to be integrated

Automatically


From Synergy to Knowledge

When A and B have synergy, we say in Chinese that A and B bring out the advantages of each other

Knowledge is what we know about the world, either descriptive or explanatory

Knowledge cannot be created from nothing; it comes from

Keen observation of facts

Sharp reasoning when we put two or more facts together

Different language resources can be put together to

Facilitate observation of facts, and

Create an environment where different linguistic facts can be more easily associated (for knowledge discovery)


Synergy: Integrating different types of language resources

Research based on Chinese Gigaword Corpus

Chinese Gigaword Corpus: Introduction

Implementation of fully automatic corpus tagging

Word Sketch Engine: Introduction

Chinese Word Sketch

Integrating corpus program with lexico-grammatical information


Introduction: CGW Corpus I

Chinese Gigaword Second Edition (2005)

Produced and released by Linguistic Data Consortium (LDC) in 2003 (first edition).

Newswire text data in Chinese.

Second edition contains additional data collected after the publication of the first edition.

Three distinct international sources :

Central News Agency of Taiwan

Xinhua News Agency of Beijing

Zaobao Newspaper of Singapore


Introduction: CGW Corpus II

                       CNA                    Xinhua                 Zaobao
First Edition          1991-2002              1990-2002              -
New in Second Edition  Oct. 2002 - Dec. 2004  Jan. 2003 - Dec. 2004  Oct. 2000 - Sep. 2003

Table 1. Coverage of Chinese GigaWord Corpus


Introduction: CGW Corpus III

Markup Structure

All text data are presented in SGML form, using a very simple, minimal markup structure.

<DOC id="CNA19910101.0003" type="story"><HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE><DATELINE>( 中央社台北一日電 )</DATELINE><TEXT><P>台北都會區捷運工程正處於積極趕工階段 ,…</P><P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P></TEXT></DOC>
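Because the markup is this minimal, a document can be pulled apart with a few regular expressions. The sketch below is illustrative only: `parse_docs` and the dictionary keys are names chosen here, and a regex approach is an assumption that holds for the flat structure shown above, not a general SGML parser.

```python
import re

sample = (
    '<DOC id="CNA19910101.0003" type="story">'
    "<HEADLINE>捷運局對工程噪音採多項防治措施</HEADLINE>"
    "<DATELINE>( 中央社台北一日電 )</DATELINE>"
    "<TEXT><P>台北都會區捷運工程正處於積極趕工階段 ,…</P>"
    "<P>淡水線工程進度百分之三十六點一九 , 落後百分之二點六七 ,…</P></TEXT></DOC>"
)

DOC_RE = re.compile(r'<DOC id="([^"]+)" type="([^"]+)">(.*?)</DOC>', re.S)


def parse_docs(sgml):
    """Split a flat SGML string into one dictionary per <DOC> element."""
    docs = []
    for doc_id, doc_type, body in DOC_RE.findall(sgml):
        headline = re.search(r"<HEADLINE>(.*?)</HEADLINE>", body, re.S)
        dateline = re.search(r"<DATELINE>(.*?)</DATELINE>", body, re.S)
        docs.append({
            "id": doc_id,
            "source": doc_id[:3],  # the id prefix names the source, e.g. CNA
            "type": doc_type,
            "headline": headline.group(1).strip() if headline else None,
            "dateline": dateline.group(1).strip() if dateline else None,
            "paragraphs": re.findall(r"<P>(.*?)</P>", body, re.S),
        })
    return docs


docs = parse_docs(sample)
print(docs[0]["id"], len(docs[0]["paragraphs"]))
```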


Introduction: CGW Corpus IV

Statistics

Edition         Resource  Characters  Words  Documents
First Edition   CNA              735    462      1,649
First Edition   Xinhua           382    252        817
First Edition   TOTAL          1,118    714      2,466
Second Edition  CNA              792    497      1,769
Second Edition  Xinhua           471    310        992
Second Edition  Zaobao            28     18         41
Second Edition  TOTAL          1,291    825      2,803

Table 2. Content of data from each source

Unit: Million


CGW after fully automatic tagging

        Word Type   Word Token
CNA     1,917,093  496,465,879
XIN     1,409,747  305,595,420
ZBN       273,111   18,328,571
Total   2,999,590  820,389,870

 


II. 1. Corpus Preparation: (Almost) Fully Automatic Segmentation and Tagging

Strategy (Ma and Chen 2005): HMM method for POS tagging of words existing in the basic lexicon, and a morpheme-analysis-based method (Tseng and Chen 2002) to predict POS’s for new words.

Integrating Language Resources

Sinica lexicon with 80,000 word entries.

A 50,000-word set collected from Sinica Corpus 3.0 (10 million words balanced corpus).

5,000 new words from Xinhua new-words dictionary.

Tagset : Adopting Sinica Tagset as a uniform tagging set.


Preparation: Implementation

Environment: 2 PCs (2.8GHz CPU)

Time Consumed: over 3 days

Output:

462 million words of CNA

252 million words of XIN

Ma and Huang 2006 (LREC 2006)

See http://ckipsvr.iis.sinica.edu.tw/ for demo of the CKIP Segmentation program


Preparation: Tagging

Segmented and Tagged Article

<DOC id="CNA19910101.0003" type="story"><HEADLINE>捷運局 (Nc)  對 (P31)  工程 (Nac)  噪音 (Nad)  採 (VC2)  多 (Neqa)  項 (Nfa)  防治 (VC2)  措施(Nac)</HEADLINE><DATELINE>((PARENTHESISCATEGORY)  中央社 (Nca)  台北 (Nca)  一日 (Nd)  電 (VC2)   )(PARENTHESISCATEGORY)</DATELINE><TEXT><P>台北 (Nca)  都會區 (Ncb)  捷運 (Nad)  工程 (Nac)  正 (Dd) 處於 (VJ3)  積極 (VH11)  趕工 (VA4)  階段 (Nac)  , (COMMACATEGORY) …</P><P>淡水線 (Na)  工程 (Nac)  進度 (Nad) 百分之三十六點一九 (Neqa), (COMMACATEGORY)落後 (VJ1)  百分之二點六七 (Neqa)  , (COMMACATEGORY)…</P></TEXT></DOC>
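The tagged output above attaches each tag in parentheses after its word, so word/tag pairs can be recovered with a single regular expression. This is a sketch under the assumption that tags consist of ASCII letters and digits, as in the sample; `parse_tagged` is a name chosen for illustration.

```python
import re

# A word is a run of non-space characters; its tag follows in parentheses.
PAIR_RE = re.compile(r"(\S+?)\s*\(([A-Za-z0-9]+)\)")


def parse_tagged(segment):
    """Return a list of (word, tag) pairs from one tagged text segment."""
    return PAIR_RE.findall(segment)


headline = ("捷運局 (Nc)  對 (P31)  工程 (Nac)  噪音 (Nad)  採 (VC2)  "
            "多 (Neqa)  項 (Nfa)  防治 (VC2)  措施(Nac)")
pairs = parse_tagged(headline)
print(pairs)
```

The lazy `\S+?` also handles punctuation tokens such as `((PARENTHESISCATEGORY)`, where the word itself is a parenthesis.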


Summary of Fully Tagged CGW Corpus

Fully segmented and tagged with Sinica tagset by Academia Sinica

Being processed by PKU with their tagset

Potentially the most important source for processing and comparative studies of Mandarin Chinese

Will be available from LDC in 2007.


CWS and Integration of Corpus Search Engine with Lexico-grammatical Information

Overview

A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behavior.

The Word Sketch Engine takes as input a corpus of any language and corresponding grammar patterns, and generates word sketches for the words of that language.

We synergize rich lexicon-based grammatical information (ICG, Chen and Huang 1992) with stochastic information.


Word Sketch Engine (Kilgarriff et al.)

Register for trial usage at http://www.sketchengine.co.uk

A Versatile Corpus Viewing and Searching Tool

The Word Sketch Engine takes as input a corpus of any language and corresponding grammar patterns, and generates word sketches for the words of that language.

Based on pre-defined context-free rules to identify grammatical functions (relations)

Ranked by saliency: frequency-adjusted MI (based on Dekang Lin’s definition of pair-wise MI)
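The slide does not spell out the saliency formula, so the following is only a sketch under two stated assumptions: that MI for a triple (head, relation, collocate) is computed from triple counts as log2 of (count(h,r,c) x count(*,r,*)) / (count(h,r,*) x count(*,r,c)), and that the frequency adjustment multiplies MI by the natural log of the triple frequency. The function name and the toy triples are hypothetical.

```python
import math
from collections import Counter


def salience_scores(triples):
    """Score each (head, relation, collocate) triple by MI times log frequency."""
    triples = list(triples)
    full = Counter(triples)
    head_rel = Counter((h, r) for h, r, _ in triples)
    rel_coll = Counter((r, c) for _, r, c in triples)
    rel = Counter(r for _, r, _ in triples)
    scores = {}
    for (h, r, c), f in full.items():
        mi = math.log2(f * rel[r] / (head_rel[(h, r)] * rel_coll[(r, c)]))
        # Assumed adjustment: a triple seen once gets log(1) = 0 and drops out.
        scores[(h, r, c)] = mi * math.log(f)
    return scores


# Hypothetical toy data, not corpus counts.
toy = ([("eat", "obj", "rice")] * 4 + [("eat", "obj", "bread")]
       + [("cook", "obj", "rice")] + [("see", "obj", "film")] * 4)
scores = salience_scores(toy)
for triple, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(triple, round(score, 2))
```

The point of the adjustment is that plain MI over-rewards rare pairs; weighting by frequency pushes well-attested collocates to the top of the sketch.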


Design Criteria of Sketch Engine

Grammatical relations (GRs) are the information of most interest to both HLT and linguistic research

However, GRs can only be discovered from collocational data, which requires a very large corpus and high-quality annotation at the same time: a seemingly unsolvable dilemma

There is a solution when the corpus is big enough

Context-free patterns allow fairly reliable extraction of a substantial number of relations, if not all

When enough instances of relations are extracted, the saliency ranking correctly picks out the distributional tendencies and allows users to ignore idiosyncrasies and errors.


WordSketch’s Approach: From Lexical Types to Relation Types

BNC has 100,000,000 words

939,028 word types

70,000,000 tuples (relations) extracted

More than 70 relations per lemma

For CWS II and the CGW corpus (CNA data)

1,917,093 word types

59,183,238 tuples (e.g. <eat, obj, rice>)

More than 30 relations per lemma
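The two "relations per lemma" figures can be checked by dividing the tuple counts by the type counts given above. Note the assumption in this sketch: the slide reports word types, and the division treats them as a stand-in for lemmas.

```python
# Figures taken from the slide; variable names are illustrative.
bnc_tuples, bnc_types = 70_000_000, 939_028
cna_tuples, cna_types = 59_183_238, 1_917_093

bnc_ratio = bnc_tuples / bnc_types  # tuples per word type in the BNC
cna_ratio = cna_tuples / cna_types  # tuples per word type in CGW (CNA data)

print(f"BNC: {bnc_ratio:.1f} relations per type")
print(f"CNA: {cna_ratio:.1f} relations per type")
```

Both ratios are consistent with the rounded "more than 70" and "more than 30" stated on the slide.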


Chinese WordSketch: An Overview

Concordance

WordSketch

Sketch Difference

Thesaurus


CWS: SketchDiff

Comparing the behaviors of two words


CWS: Thesaurus of 快樂 (happy)


Application to Chinese Corpus: Comparing Thesaurus

We shall know a word by the company it keeps


Context-free patterns: Does Quality of Grammatical Knowledge Matter?

The implementation of CWS I simply adopted English-like CF grammatical patterns (since Chinese and English supposedly share very similar PS rules)

However, the result was not very satisfactory

Missing a lot of relations, such as objects which do not appear right next to a verb

Mis-classifying topicalized objects as subjects

Missing objects in non-canonical positions


Linguistic Knowledge Should Solve the above Problems

Comprehensive Lexical Knowledge of Verb Frames exists

Information-based Case Grammar (ICG)

Encoded on over 40,000 verbs in Sinica Lexicon

ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)

EXPERIENCER<GOAL[PP[ 對 ]]<VI

EXPERIENCER<VI<<GOAL[PP[ 於 ]]

THEME<GOAL[PP{ 對、以 }]<VI

THEME<VI<<GOAL[PP[ 於 ]]

THEME<VI<<SOURCE[PP{ 自、於 }]

THEME< SOURCE[PP{ 歸、為 }]<VI


Comparing Lexical Knowledge Between CWS I and CWS II

CWS I: 11 definitions, 11 patterns

One single pattern for verb-object relation

CWS II: 32 definitions, 80 patterns

20 patterns for verb-object relation

59,183,238 tuples (<eat, obj, rice>)

from 496,465,879 words

English has 39 definitions, 40 patterns


Synergy among tagging, statistics, and linguistic knowledge

Collocations are identified with Context free rules in Word Sketch Engine

Collocating Pattern for Object from CSE I

1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2: "Na" [tag!= "Na"]

Challenge: Long-distance relations

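全穀麵包,吃了很健康。

quan.gu mian.bao, chi le hen jian.kang

'Whole-grain bread is healthy to eat.'

有人嘗試要將這荷花分類,卻越分越累。

you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei

'Someone tried to classify these lotuses, but the more they sorted, the more tired they got.'

他 只 吃了 一 口 飯 …

Ta zhi chi le yi kou fan

'He ate only one mouthful of rice...'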
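The limits of the adjacency-based object pattern can be illustrated by emulating it over a sequence of POS tags. This is a sketch, not the Sketch Engine implementation: the pattern is re-expressed as a Python regular expression over space-separated tags, the tags are coarse two-letter simplifications assigned here by hand (hypothetical, including the split of 吃了 into 吃 and 了), and the final `[tag != "Na"]` is read as "some following token that is not tagged Na".

```python
import re

# 1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2:"Na" [tag != "Na"]
OBJ_PATTERN = re.compile(
    r"(?<!\S)(V[BCJ]) (?:Di )?(?:N[abc] )?(?:DE )?(?:N[abc] )?(Na) (?!Na )\S"
)


def find_objects(tagged):
    """Return (verb, object) word pairs matched by the adjacency pattern."""
    tags = "".join(tag + " " for _, tag in tagged)
    pairs = []
    for m in OBJ_PATTERN.finditer(tags):
        # Each tag ends in one space, so counting spaces gives the token index.
        verb_idx = tags[:m.start(1)].count(" ")
        noun_idx = tags[:m.start(2)].count(" ")
        pairs.append((tagged[verb_idx][0], tagged[noun_idx][0]))
    return pairs


adjacent = [("他", "Nh"), ("吃", "VC"), ("了", "Di"), ("飯", "Na"), ("。", "PERIOD")]
measured = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
            ("一", "Neu"), ("口", "Nf"), ("飯", "Na"), ("。", "PERIOD")]
topicalized = [("全穀麵包", "Na"), (",", "COMMA"), ("吃", "VC"), ("了", "Di"),
               ("很", "Dfa"), ("健康", "VH"), ("。", "PERIOD")]

print(find_objects(adjacent))
print(find_objects(measured))
print(find_objects(topicalized))
```

Under these assumptions the pattern finds the object only when it directly follows the verb; the numeral-classifier sequence and the topicalized object both escape it, which is the long-distance challenge named above.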


Integrating prior Knowledge in Processing

Knowledge Source

Information-based Case Grammar (ICG, Chen and Huang 1992)

Encoded on over 40,000 verbs in Sinica Lexicon

ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)

EXPERIENCER<GOAL[PP[ 對 ]]<VI

EXPERIENCER<VI<<GOAL[PP[ 於 ]]

THEME<GOAL[PP{ 對、以 }]<VI

THEME<VI<<GOAL[PP[ 於 ]]

THEME<VI<<SOURCE[PP{ 自、於 }]

THEME< SOURCE[PP{ 歸、為 }]<VI


Integrating prior Knowledge in Processing: Examples

村莊 (object) 明天將 被 夷為平地 (VB11)

cunzhuang mingtian jiang bei yiweipingdi

'The village will be razed to the ground tomorrow.'

begin time1 location time1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]

大量 的 遊客 破壞 (VC2) 公園 景觀 (object)

daliang de youke pohuai gongyuan jingguan

'Large numbers of tourists damage the park scenery.'

1:"VC.*" (particle|prep)? NP not_noun

(NP is defined as “…noun_modifier{0,2} 2:noun…”.)


Integrating prior Knowledge in Processing: Partial Result

Object Recall Comparison

Verb                    CSE I   CSE II
hong2 (red)                 0        0
pao3 (run)                  0    8,704
kan4 (look)            32,350   64,096
da3 (hit)              26,016   47,182
song4 (give)                0   76,378
shuo1 (say)                 0   20,350
xiang1xin4 (believe)        0   52,373
quan4 (persuade)            0    3,852


Integrating prior Knowledge in Processing: Partial Result II

Most salient objects for chi1 「吃」 (eat) in CSE II

Those among the top 20 salient objects from CSE I, but not CSE II

飯 fan4 rice 802 70.96 (4)
虧 kui disadvantage 329 59.24 (12)
苦頭 ku3tou2 suffering 194 58.71 (14)


Applications: Chinese WordSketch

Test version of Chinese Word Sketch is available

Permanent version of CWS will be available from Academia Sinica soon

http://wordsketch.ling.sinica.edu.tw


Application: Resolving Nominalization

[Equation garbled in extraction. The surviving fragments are the symbols log, P, nom, v, t and w with subscripts i and 1, arranged as rows of three probability terms each; the full formula cannot be reconstructed from this transcript.]

Chinese verbs are nominalized without overt markup

Resolving Categorical ambiguity with distributional information only

Two Approaches: HMM and Bayesian Classifier

HMM: N-grams

Classifier: left and right contexts, plus the verb's own sub-class, weighted 2.0, 3.0, 5.0


Nominalization Results (Ma and Huang 2006)

[Figure: chart of F-score (%), axis 0 to 100, by topic (文學 literature, 生活 life, 社會 society, 科學 science, 哲學 philosophy, 藝術 arts, 綜合 general) for HMM-1, HMM-2, Classifier-1, Classifier-2 and Classifier-3]

Best overall HMM performance: 69%

Best Overall Bayesian classifier performance: 74%

    


Mining Cross-Strait Lexical Difference

Strategy: Using a pair of known contrasting words as seeds and looking them up in SketchDifference

Clinton: 克林頓 (ke4) vs. 柯林頓 (ke1)

What is found: other translations unique to either the PRC or Taiwan

克林頓 (PRC) only, and/or patterns (vs. 柯林頓 only)

葉利欽 88 54.6 Yeltsin 葉爾勤 (3)
布什 65 49.7 Bush 布希 (4)
萊溫斯基 10 41.3 Lewinsky 呂茵斯基 / 呂女 (1)
戈爾 20 39.4 Gore 高爾 (2)
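The seed-pair strategy amounts to splitting the collocates of two words into those unique to one seed, those shared, and those unique to the other. The sketch below illustrates that split only; it is not the SketchDifference implementation. The PRC-side scores are the second figure in each row above, read here as salience (an assumption); the Taiwan-side scores and the shared collocate 總統 are hypothetical placeholders, since the slide gives no values for them.

```python
def sketch_diff(coll_a, coll_b):
    """Split two collocate->score maps into only-A, shared, and only-B parts."""
    only_a = {c: s for c, s in coll_a.items() if c not in coll_b}
    only_b = {c: s for c, s in coll_b.items() if c not in coll_a}
    shared = {c: (coll_a[c], coll_b[c]) for c in coll_a if c in coll_b}
    return only_a, shared, only_b


# and/or collocates of 克林頓 (PRC); scores from the slide, 總統 is hypothetical.
prc = {"葉利欽": 54.6, "布什": 49.7, "萊溫斯基": 41.3, "戈爾": 39.4, "總統": 30.0}
# and/or collocates of 柯林頓 (Taiwan); all scores here are placeholders.
taiwan = {"葉爾勤": 1.0, "布希": 1.0, "高爾": 1.0, "總統": 1.0}

only_prc, shared, only_tw = sketch_diff(prc, taiwan)
ranked_prc = sorted(only_prc, key=only_prc.get, reverse=True)
print(ranked_prc)
print(sorted(only_tw))
```

Each collocate that appears with only one of the two seeds is a candidate region-specific translation, to be paired by hand with its counterpart on the other side.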


Adventures in Wen-Land: 文國尋寶記

http://www.sinica.edu.tw/Wen/

Integrating the following

Corpora: Sinica Corpus, Textbook Corpus (3 different editions), Tang poems, Dream of the Red Chamber, On the Water Margin…

Lexicon: General, Classifier, Idiom ( 成語 )

Linked with a corpus/lexicon interface

Developed by: Huang, Fengju Lo, Hui-chun Hsiao, and team of teachers


The Substantive Issues: Language Resources Used in WenGuo

Textual Databases (of classical texts)

Text Corpora

Linguistic and Philological Knowledge from previous research

LKB Extracted and composed from the above


Adventures in Wen-Land (2001)


Adventures in Wen-Land

What: Is a virtual theme park for on-line Chinese language learning and teaching.

How: Is the end product of a National Digital Museum Project sponsored by the National Science Council, ROC (A Linguistic and Literary KnowledgeNet for Elementary School Children)

When: Was completed in spring 2001


Adventures in Wen-Land

Who: The team included

Chu-Ren Huang, a linguist
Feng-ju Lo, a literary scholar
Hui-chun Hsiao, a web-based art designer
Ching-Chun Hsieh, a computer scientist
Chi-chao Liao, Chiu-Jung Lu, Pei-chuan Wei...
Mei-ling Li, Hsiou-Hua Chiu, Shu-wen Huang, Cheng-chi Jiang, elementary school Chinese teachers


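An Adventure in Seven Parts: The Geography of Wen-Land

名稱 Scene: 學習目標 Content
學堂 colleges: Three versions of textbooks (finding rich knowledge and endless fun within the limited vocabulary of textbooks)
黑白宮 castle: Noun-classifier Dict. (nouns paired with classifiers bring out what is distinctive about Chinese)
接龍瀑布 falls: Chinese Idiom Dict. (speaking in polished phrases, stringing witty words into pearls)
倒影湖 lake: Games (the endless fun of language, the fresh challenge of games)
西園 music hall: Song Poetry (entering the time tunnel to savor the popular songs of the Tang and Song)
大觀園 garden: The Dream of the Red Chamber (exploring the sentiments of the Red Chamber's young men and women)
梁山 mountain: On the Water Margin (the heavenly and earthly stars of Liangshan Marsh, on heroes and brave men)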


The Adventure’s Seven Guides: The Denizens of Wen-Land

名稱 Scene: 導覽人物 Featured Character
學堂 colleges: 李小哲, a learned young scholar (a miniature version of Y.T. Lee)
黑白宮 castle: 林三本, a medieval estate owner
接龍瀑布 falls: 哪吒, the mythical flying child acrobat
倒影湖 lake: 平平與明明, twins
西園 music hall: 宋代少婦, a young Song Dynasty woman
大觀園 garden: 鴛鴦, a maid who knows the ins and outs
梁山 mountain: 神行太保, the Chinese Mercury (one of the 108 heroes)


Designing Adventures: Threads that hold the KnowledgeNet Together

A Thread without a guiding needle goes nowhere

穿針引線 A Lexical Needle Picks Up & Connects

-Only the Textual Materials that it is allowed to go through


Pulling Through and Pulling Together: Lexical thread and hyperlink

Lexical KnowledgeBase (LKB) guides us through all language resources that use the same word

-In WenGuo, we assume users will be using textbook vocabulary to guide them


Pulling Through and Pulling Together: Lexical thread and Textual Filter

LKB provides the chronological (such as when a word is first taught/learned) and distributional (such as frequency) features of each word.

-In WenGuo, by knowing a user’s level at school, we can gauge/pace learning


Integrated Resources as Learning Background in Wenguo


Using LKB to Pace Linguistic Knowledge

A learner identifies his/her school year (3rd grade, etc.) when logging in

-control vocabulary level of learning activity

-pace/monitor development of linguistic skills

A user can also specify which textbook version to view

-allows cross-track comparison of linguistic development

-allows supplementation at corresponding learning level
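The pacing idea above can be sketched as a lookup over LKB entries that record, per textbook version, the grade in which a word is first taught, together with its frequency. Everything in this sketch is hypothetical: the `LkbEntry` structure, the version labels "A" and "B", and the grade and frequency values are invented for illustration and do not come from WenGuo.

```python
from dataclasses import dataclass, field


@dataclass
class LkbEntry:
    word: str
    frequency: int                                     # distributional feature
    first_taught: dict = field(default_factory=dict)   # version -> school grade


def vocabulary_for(lkb, version, grade):
    """Words a learner of this grade has met in this textbook version."""
    known = [e for e in lkb
             if version in e.first_taught and e.first_taught[version] <= grade]
    return sorted(known, key=lambda e: e.frequency, reverse=True)


# Hypothetical entries and values.
lkb = [
    LkbEntry("雲海", 12, {"A": 4, "B": 5}),
    LkbEntry("個", 950, {"A": 1, "B": 1}),
    LkbEntry("張", 300, {"A": 2}),
]

grade3_a = [e.word for e in vocabulary_for(lkb, "A", 3)]
grade5_b = [e.word for e in vocabulary_for(lkb, "B", 5)]
print(grade3_a, grade5_b)
```

Comparing the result for two versions at the same grade is one way to realize the cross-track comparison mentioned above.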


Synergizing Archive-based LKB

LKBs based on classical or prototypical texts facilitate quick and accurate lexical comparison and allow immediate reference to the original text

-In WenGuo, users can easily find out the literary references and citations in several classics and go immediately from vocabulary to text


Integrated Lexical KnowledgeBase entry of 雲海 yun2hai3


Citation of Classifier 個 ge5 in Three Textbooks


Collocating Nouns of Classifier 張

From Huang et al. 1997, 國語日報量詞典 (Mandarin Daily News Classifier Dictionary)


Concluding Remarks

-Corpus is a sample of the ‘real words in action’

-Corpora and other language resources can be combined to create powerful language teaching and learning tools

-The integration must be linked by lexical terms

-Corpora must be tagged with POS

-In practice, different editions of textbooks can be treated as different corpora

-And be linked for comparison or borrowing

-Corpus facilitates creation of synergy for learning and teaching


Other Useful Resources

Sinica Corpus 中央研究院現代漢語平衡語料庫 (Academia Sinica Balanced Corpus of Modern Chinese), first tagged corpus of Chinese, online since 1996

http://www.sinica.edu.tw/SinicaCorpus

SouWenJieZi - A Linguistic KnowledgeNet. August 1999.

http://words.sinica.edu.tw/

SINICA BOW 2002

http://bow.sinica.edu.tw

Chinese Wordnet 2005, >16,000 synsets

http://cwn.ling.sinica.edu.tw/


Conclusion

The synergy of different language resources creates

Knowledge

生生不息 (endless renewal)


Concluding Remarks: Other NLP Research Activities at Academia Sinica

Chinese Wordnet: ongoing, >10,000 synsets

http://cwn.ling.sinica.edu.tw

Bilingual Wordnet linked to SUMO ontology

http://bow.sinica.edu.tw

Fully Sense-tagged corpus: combining cwn and Sinica corpus with machine learning algorithm

Directed by Sue-Jin Ker of Soochow Univ.

Subset to be available soon

Asian lexicon standard: NEDO project

Tokunaga, Calzolari, Shirai, Virach, Prevot…


Reference

LDC CGW Corpus: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14

Chinese Word Sketch trial URL: http://corpora.fi.muni.cz/chinese_all/ (account: chinese, password: chinese)

Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. To be presented at the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. 24-28 May, 2006.

CKIP (Chinese Knowledge Information Processing Group). (1995/1998). The Content and Illustration of Academia Sinica Corpus. (Technical Report no. 95-02/98-04). Taipei: Academia Sinica.

Huang Chu-Ren, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao and Kuang-Yu Chen. (2000). Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop, pp. 29-37.

Kilgarriff, Adam, Chu-Ren Huang, Pavel Rychly, Simon Smith, and David Tugwell. (2005). Chinese Word Sketches. ASIALEX 2005: Words in Asian Cultural Context.


Reference (continued)

Ma, Wei-Yun and Keh-Jiann Chen. (2005). Design of CKIP Chinese Word Segmentation System. Chinese and Oriental Languages Information Processing Society, Vol. 14, No. 3, pp. 235-249.

Tseng, H.H. and K.J. Chen. (2002). Design of Chinese Morphological Analyzer. Proceedings of SIGHAN Workshop on Chinese Language Processing, pp. 49-55.

Tsai Yu-Fang and Keh-Jiann Chen. (2003). Reliable and Cost-Effective Pos-Tagging. Proceedings of ROCLING XV, pp. 161-174.

Tsai Yu-Fang and Keh-Jiann Chen. (2003). Context-rule Model for POS Tagging. Proceedings of PACLIC 17, pp. 146-151.

Tsai Yu-Fang and Keh-Jiann Chen. (2004). Reliable and Cost-Effective Pos-Tagging. International Journal of Computational Linguistics & Chinese Language Processing, Vol. 9, No. 1, pp. 83-96.