1 chen yirong, lu qin, li wenjie, cui gaoying department of computing the hong kong polytechnic...

26
1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual Term Bank

Upload: branden-hodges

Post on 20-Jan-2016

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

1

Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying

Department of Computing The Hong Kong Polytechnic University

Chinese Core Ontology Constructionfrom a Bilingual Term Bank

Page 2: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

2

Outline

Introduction

Related Works

Algorithm Design– COCA

Performance Evaluation

Conclusion

Page 3: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

3

IntroductionWhat is a Core Ontology

A mid-level ontology Bridges the gap between an upper ontology

and a domain ontology

Upper Ontology

Core Ontology

Domain Ontology

ComputerProgram

Free Software Public Domain Software

Software

Page 4: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

4

Concepts and TerminologiesUpper Ontology

A general ontology to ensure reusability across different domains (e.g.: Computer Program in SUMO)

Domain OntologyAn ontology conceptualize a specific domain (e.g.: Free

Software in IT domain)More application dependent, more extents of concepts

Midlevel Ontology(Core Concept)Basic concepts of a domain

More application independent, more intents of concepts.core ontology (e.g.: Software)Frequently used, ability to form other concepts

Core TermsLexical units of core concepts

Page 5: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

5

Related WorksManually constructed ontologies

SUMOFamous upper level ontology works based on lexicon

CoreLex (Buitelaar, P., 1998)EuroWordnet (Rodríguez, 1998 )

Ontology harmonization: Core ontology“Towards a Core Ontology for Information Integration” (M.

Doerr, 2003)A most similar work

“Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification ” (Huang, 2007)Use Concept and Relation Classification to Enrich core

ontology

Page 6: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

6

Our Previous WorksChinese terminology extractionChinese core term extraction(Ji et al, 2007) Preliminary work on automatic construction of

core ontology construction using English-Chinese Term Bank (MRCOCA, Ontolex 2007, Chen, 2007)

Bilingual lexicon Extended stringsFrequency information in synset Weight from extended strings are integrated into final

weight by simple additionMapping to synset and SUMO can only achieve accuracy of

about 50%

Page 7: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

7

IssuesWhat kind of concept should be

included? How to identify core concepts

If through core terms, disambiguation

What and how to identify relations? Making use of available resources

Chinese NLP resource scaresEnglish NLP resources abundant

Page 8: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

8

Requirements of Core Ontology

The concepts must be widely accepted and commonly referenced

Corresponding core terms must be highly used and productive

The concepts/terms can be mapped to upper ontology. So the core ontology can inherit the attributes provided by upper ontology

Page 9: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

9

Core Ontology Construction Algorithm(COCA) for Chinese Extract Chinese core terms from a

bilingual term bank Mapped core term Tc to English terms Mapping English terms to WordNet

Mapping synset to a upper ontology concept in SUMO

)|(maxarg C

S

TSP

Page 10: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

10

COCA - Resources UsedITCTerm

a domain specific core term list (Chen, 2007 )CETBank

Chinese-English bilingual term bank 1,500 most productive core terms extracted can

serve as suffixes to form more than 50% of the terms in CETBank)

WordNetSUMOMappings between WordNet and SUMO

Page 11: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

11

The Framework of COCA

Chinese Core Term

Sense Disambiguation

Module

SUMO Concept

Statistical Translation

Module

Bilingual Term Bank

English Core Term Candidate

SUMO-WordNet Mapping Module

Core Concept Candidate

WordNet

Concept Selection Module

Core Concept

Mapping Data Additional Features

Page 12: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

12

COCA – Statistical Translation ModuleTranslation ambiguity:

Each Chinese core term TC ∈ ITCTerm has a set of translations T_SetE , TE ∈T_SetE

Objectiveto estimate the likelihood of every translation using

extended terms of TC

P(TE | TC) for all TE ∈ T_SetE.

) ExtT_Set __ )(

)()|(

CeE TT eE

ECE

Tlen

TlenTTW

i

TSetTT

EiC

ECCE

CEEi

TTW

TTWTTP

)(_

),(

),()|(

Page 13: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

13

COCA - Sense Disambiguation ModuleMapping a given TC to the Synset S through

its translation set T_SetE (TC)Mapping probability of a English term TE to

take a synset S using freq. info in WordNet

Mapping probability of TC to take a particular synset S via an English translation TE

)(

)1),((

1),()|(

ETsynsetx

E

EE

xTF

STFTSP

)|(*)|()|( ECEC TSPTTPTSP

Page 14: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

14

COCA - Concept Selection ModuleCombining three features

multi-path featurehypernyms featurepart-of-speech feature

Using Union Probability of Independent Events

0||)(*)()()(

0||0)(

}{}{EpUxppUxp

EpU

xEyxEyEx

Page 15: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

15

Feature 1 –Multi-Paths to SynsetMultiple paths is

the path between Chinese core terms and synset

via different English translations

)(TT_Set

CE

)|()|(ET

CC TSPTSSP电流

CurrentElectric Current

Current

Current, Electric Current

The feature merges the probability of multiple paths

Page 16: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

16

Feature 2 – Hyponyms in domainIncorporate info on all the extended strings

) )|()(

)(()|(

)( )(T Suffix_Etx

C

Shyponymh

tC thSP

tlen

TlenUTSHP

C

计算机 大型计算机Head-of

Mainframe Computer

Computer

Computer, Computing Machine

Calculator, ComputerHyponym-of

Calculator

Extended String uses the core term as headword and is the hyponym of the core term

Length Ratio

Union Probability of Independent Events

Page 17: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

17

Feature 3 – Part of SpeechProbability of the POS tag pos(S)

owned by a synset S

given a core term Tc

PoS Tag estimation: Heuristics on Adj, Verb, and noun based on position

标准

Standard

Standard, Criterion (n.)

Standard (adj.)

Relatively low

probability to be a noun

High probability

to be adjective

},,{

),(_

))(,(_ )|(S

adjectiveverbnounpo

C

CC

poTposfreq

SposTposfreqTOP

Page 18: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

18

Integrate FeaturesUsing Union Probability of

Independent Events

})|(),|(),|({ )()|(

CCC TSOPTSHPTSSPxC xUTSFP

Page 19: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

19

EvaluationAlgorithm Output

A pair of < Tc_i, Synseti > for each Chinese core term with the highest mapping weight

Evaluation Standard For each Tc_i, whether their mappings to Synset are the

best match with respect to this domain

Answer Preparation Answer is manually made by two experts in IT domain

respectively on the same set of data

Page 20: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

20

PerformanceThe evaluation conducted on the top N

frequent core termsThe algorithm COCA achieves 71% in

accuracy (N is 28 in this paper) Compared to the result of MRCOCA (Chen,

2007) which achieved only 50%Two examples of core term to syntset

mapping generated by the algorithm are given for “软件” and “网络” .

Page 21: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

21

No. Zh En SUMO Concept Synset

1 软 件(SC)

Software ComputerProgram + software,software_system

(computer science) written programs or procedures or rules and associated documentation pertaining to the operation of a computer system and that are stored in read/write memory

2 软件 Facility StationaryArtifact + facility,installation

something created to provide a particular service; "the assembly plant is an enormous facility"

3 软件 Facility SubjectiveAssessmentAttribute

+ proficiency, facility, technique

skillfulness in the command of fundamentals deriving from practice and familiarity; "practice greatly improves proficiency"

4 软件 Facility SubjectiveAssessmentAttribute

+ adeptness,adroitness,deftness,facility,quickness

skillful performance without difficulty; "his quick adeptness was a product of good design"

5 软件 facility Room + toilet, lavatory, lav, can, facility, john, privy, bathr

a room equipped with washing and toilet facilities

6 软件 facility SubjectiveAssessmentAttribute

+ facility,readiness

a natural effortlessness; "a happy readiness of conversation"--Jane Austen

7 网络 (S) net Artifact + network,net,mesh,meshwork,reticulation

an interconnected or intersecting configuration or system of components

8 网络 (C) network Collection + network,web

an intricately connected system of things or people; "a network of spies" or "a web of intrigue"

9 网络 network SocialInteraction + network

communicate with and within a group; "You have to network if you want to get a good job"

10 网络 net Pursuing + net,nett

catch with a net; "net a fish"

11 网络 net Making + web,net

construct or form a web, as if by weaving

12 网络 net SubjectiveAssessmentAttribute

+ final,last,net

conclusive in a process or progression; "the final answer"; "a last resort"; "the net result"

13 网络 net CurrencyMeasure + net,nett

remaining after all deductions; "net profit"

Page 22: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

22

ConclusionEvaluation of COCA repeated on an English-

Chinese bilingual Term bank with more than 130K entries show that the algorithm is “42%” improved in accuracy compared to

MRCOCA (Our Previous Works)The three features and the new algorithm

based on probability made the improvement

Page 23: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

23

Term bank can help to quickly construct domain core ontology by selecting the concept nodes and relations used in domain

Bilingual term bank can further introduce the second language realization of the core ontology effectively and automatically

Page 24: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

24

Future WorksEvaluation on three features

how effective they arehow much they contribute to the final

performance

Consideration of more features such as abbreviation, synset of head word of core term and etc.

Use of other resources

Page 25: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

25

Q&A

Page 26: 1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual

26

QA