a metric-based framework for automatic taxonomy induction

6/27/13

A Metric-based Framework for Automatic Taxonomy Induction

Hui Yang and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
ACL 2009, Singapore

ROADMAP

• Introduction

• Related Work

• Metric-Based Taxonomy Induction Framework

• The Features

• Experimental Results

• Conclusions

INTRODUCTION

• Semantic taxonomies, such as WordNet, play an important role in solving knowledge-rich problems

• Limitations of manually created taxonomies
  ◦ Rarely complete
  ◦ Difficult to extend with new terms from emerging or changing domains
  ◦ Time-consuming to create, which can make them infeasible for specialized domains and personalized tasks

INTRODUCTION

• Automatic Taxonomy Induction is a solution to
  ◦ Augment existing resources
  ◦ Quickly produce new taxonomies for specialized domains and personalized tasks

• Subtasks in Automatic Taxonomy Induction
  ◦ Term extraction
  ◦ Relation formation

• This paper focuses on Relation Formation

RELATED WORK

• Pattern-based Approaches
  ◦ Define lexical-syntactic patterns for relations, and use these patterns to discover instances
  ◦ Have been applied to extract is-a, part-of, sibling, synonym, causal, and other relations
  ◦ Strength: Highly accurate
  ◦ Weakness: Sparse coverage of patterns

• Clustering-based Approaches
  ◦ Hierarchically cluster terms based on the similarity of their meanings, usually represented by feature vectors
  ◦ Have only been applied to extract is-a and sibling relations
  ◦ Strength: Allow discovery of relations that do not explicitly appear in text; higher recall
  ◦ Weaknesses: Generally fail to produce coherent clusters for small corpora [Pantel and Pennacchiotti 2006]; hard to label non-leaf nodes

A UNIFIED SOLUTION

• Combine strengths of both approaches in a unified framework
  ◦ Flexibly incorporate heterogeneous features
  ◦ Use lexical-syntactic patterns as one type of feature in a clustering framework

Metric-based Taxonomy Induction

THE FRAMEWORK

• A novel framework, which
  ◦ Incrementally clusters terms
  ◦ Transforms taxonomy induction into a multi-criteria optimization
  ◦ Uses heterogeneous features

• Optimization based on two criteria
  ◦ Minimization of taxonomy structures ⇔ Minimum Evolution Assumption
  ◦ Modeling of term abstractness ⇔ Abstractness Assumption

LET’S BEGIN WITH SOME IMPORTANT DEFINITIONS

• A Taxonomy is a data model

[Figure labels: Concept Set, Relationship Set, Domain]

MORE DEFINITIONS

A Full Taxonomy:

[Figure: taxonomy rooted at Game Equipment, with nodes including ball and table]

AssignedTermSet = {game equipment, ball, table, basketball, volleyball, soccer, table-tennis table, snooker table}
UnassignedTermSet = {}

MORE DEFINITIONS

A Partial Taxonomy:

[Figure: taxonomy rooted at Game Equipment, with nodes including ball and table]

AssignedTermSet = {game equipment, ball, table, basketball, volleyball}
UnassignedTermSet = {soccer, table-tennis table, snooker table}
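The distinction between full and partial taxonomies can be sketched as a small data structure. This is only an illustration: the `Taxonomy` class, its method names, and the parent links chosen for the example are assumptions, since the slide's figure did not survive extraction.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Taxonomy:
    """Hypothetical container: a term set plus child -> parent links."""
    terms: Set[str]
    root: Optional[str] = None
    parent: Dict[str, str] = field(default_factory=dict)

    def assigned_terms(self) -> Set[str]:
        # A term is assigned once it appears in the tree (root, child, or parent).
        placed = set(self.parent) | set(self.parent.values())
        if self.root is not None:
            placed.add(self.root)
        return placed

    def unassigned_terms(self) -> Set[str]:
        return self.terms - self.assigned_terms()

    def is_full(self) -> bool:
        # A full taxonomy has no unassigned terms left.
        return not self.unassigned_terms()


all_terms = {
    "game equipment", "ball", "table", "basketball", "volleyball",
    "soccer", "table-tennis table", "snooker table",
}

# Parent links below are assumed for illustration only.
partial = Taxonomy(
    terms=all_terms,
    root="game equipment",
    parent={
        "ball": "game equipment",
        "table": "game equipment",
        "basketball": "ball",
        "volleyball": "ball",
    },
)
```

Here `partial.unassigned_terms()` yields the three terms listed in the slide's UnassignedTermSet, and `partial.is_full()` is false until they are attached.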

MORE DEFINITIONS

Ontology Metric

[Figure: example taxonomy including ball and table, with edge distances 1.5, 2, 1, and 1 and pairwise distances d( , ) = 2, d( , ) = 1, and d( , ) = 4.5; the node pairs were not preserved]
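One way to read the numbers on this slide is that the distance between two terms is the sum of the edge distances on the path connecting them. The sketch below assumes that reading. The assignment of the weights 1.5, 2, 1, and 1 to particular edges is hypothetical (the figure was lost), and `ontology_distance` is an illustrative name, not one from the paper.

```python
def path_to_root(parent, node):
    """Return the chain of nodes from `node` up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def ontology_distance(parent, weight, x, y):
    """Sum of edge weights on the tree path between x and y."""
    ancestors_y = set(path_to_root(parent, y))
    # First node on x's chain that is also above (or equal to) y.
    common = next(n for n in path_to_root(parent, x) if n in ancestors_y)
    total = 0.0
    for start in (x, y):
        node = start
        while node != common:
            total += weight[(node, parent[node])]
            node = parent[node]
    return total


# Child -> parent links and (child, parent) edge weights; both are assumed.
parent = {
    "ball": "game equipment",
    "table": "game equipment",
    "basketball": "ball",
    "snooker table": "table",
}
weight = {
    ("ball", "game equipment"): 2.0,
    ("table", "game equipment"): 1.5,
    ("basketball", "ball"): 1.0,
    ("snooker table", "table"): 1.0,
}

d_ball_root = ontology_distance(parent, weight, "ball", "game equipment")
d_snooker_table = ontology_distance(parent, weight, "snooker table", "table")
d_ball_snooker = ontology_distance(parent, weight, "ball", "snooker table")
```

With these assumed weights the three distances come out as 2, 1, and 4.5, matching the values that survive on the slide, although the original node pairs may have been different.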

ASSUMPTIONS

Minimum Evolution Assumption: The optimal ontology is the one that introduces the least change in information
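A minimal sketch of how this assumption could drive incremental construction: for a new term, try every existing node as its parent and keep the position that changes the total information the least. It assumes that the information in a taxonomy is the sum of pairwise tree distances and that a distance estimate is available for each candidate edge. The function names and the numbers in `edge_cost` are hypothetical.

```python
from itertools import combinations


def tree_distance(parent, weight, x, y):
    """Sum of edge weights on the tree path between x and y."""
    def chain(node):
        nodes = [node]
        while node in parent:
            node = parent[node]
            nodes.append(node)
        return nodes

    chain_y = set(chain(y))
    common = next(n for n in chain(x) if n in chain_y)
    total = 0.0
    for start in (x, y):
        node = start
        while node != common:
            total += weight[(node, parent[node])]
            node = parent[node]
    return total


def information(parent, weight, terms):
    """Assumed information function: sum of all pairwise distances."""
    return sum(tree_distance(parent, weight, a, b)
               for a, b in combinations(sorted(terms), 2))


def best_parent(parent, weight, terms, new_term, edge_cost):
    """Pick the attachment point with the smallest change in information."""
    base = information(parent, weight, terms)
    best = None
    for candidate in sorted(terms):
        trial_parent = {**parent, new_term: candidate}
        trial_weight = {**weight, (new_term, candidate): edge_cost[candidate]}
        change = abs(information(trial_parent, trial_weight, terms | {new_term}) - base)
        if best is None or change < best[1]:
            best = (candidate, change)
    return best


terms = {"game equipment", "ball", "table"}
parent = {"ball": "game equipment", "table": "game equipment"}
weight = {("ball", "game equipment"): 1.0, ("table", "game equipment"): 1.0}

# Hypothetical estimated distances from "basketball" to each candidate parent.
edge_cost = {"ball": 1.0, "table": 3.0, "game equipment": 2.0}

choice = best_parent(parent, weight, terms, "basketball", edge_cost)
```

In this toy setting `choice` is `("ball", 6.0)`: attaching basketball under ball adds 6 to the total, against 8 under game equipment and 12 under table.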

ILLUSTRATION

Minimum Evolution Assumption

[Figure sequence: a taxonomy grown step by step, first ball, then ball and table, then ball and table together with Game Equipment]

ASSUMPTIONS

Abstractness Assumption: Each abstraction level has its own information function

[Figure: Abstractness Assumption illustrated on the taxonomy with Game Equipment, ball, and table]

MULTIPLE CRITERION OPTIMIZATION

[Equation not preserved; its annotated parts are the Minimum Evolution objective function, the Abstractness objective function, and the scalarization variable]
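Only the three annotations of this slide's equation remain. As a hedged reconstruction, a weighted-sum scalarization consistent with those annotations could look like the following. All symbols are assumptions introduced here: u for the Minimum Evolution objective, v for the Abstractness objective, λ for the scalarization variable, C^n for the terms in the taxonomy after n insertions, L_i^n for the terms at abstraction level i, and d for the ontology metric.

```latex
\begin{align*}
  &\min_{T^{n+1}} \; \lambda\, u + (1 - \lambda)\, v, \qquad 0 \le \lambda \le 1 \\
  u &= \Bigl|\, \sum_{\substack{c_x, c_y \in C^{n+1} \\ x < y}} d(c_x, c_y)
        \;-\; \sum_{\substack{c_x, c_y \in C^{n} \\ x < y}} d(c_x, c_y) \Bigr| \\
  v &= \sum_{i} \Bigl|\, \sum_{\substack{c_x, c_y \in L_i^{n+1} \\ x < y}} d(c_x, c_y)
        \;-\; \sum_{\substack{c_x, c_y \in L_i^{n} \\ x < y}} d(c_x, c_y) \Bigr|
\end{align*}
```

Under this reading, λ = 1 reduces the problem to pure minimum evolution, and λ = 0 considers only the per-level (abstractness) change.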

ESTIMATING ONTOLOGY METRIC

• Assume the ontology metric is a linear interpolation of some underlying feature functions

• Use ridge regression to estimate and predict the ontology metric
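The estimation step can be illustrated with closed-form ridge regression, which solves (XᵀX + αI)w = Xᵀy for the feature weights w. The sketch below is dependency-free and uses toy numbers; the feature values, target distances, and the penalty `alpha` are made up for illustration and do not come from the paper.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_fit(X, y, alpha):
    """Weights minimizing squared error plus alpha * ||w||^2."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    return solve(A, b)


def predict_distance(weights, features):
    """Linear interpolation of feature scores for one term pair."""
    return sum(w * f for w, f in zip(weights, features))


# Each row: feature scores for one term pair (toy values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
# Known distances for those pairs in a training taxonomy (toy values).
y = [2.0, 0.5, 2.5, 4.5]

w_plain = ridge_fit(X, y, alpha=0.0)   # ordinary least squares
w_ridge = ridge_fit(X, y, alpha=1.0)   # weights shrunk by the penalty
estimate = predict_distance(w_ridge, [1.0, 2.0])
```

With `alpha=0.0` the fit recovers the weights (2, 0.5) that generated the toy targets; a positive `alpha` shrinks the weight vector, which is the usual motivation for ridge regression when features are correlated or training pairs are few.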

THE FEATURES

• Our framework allows a wide range of features to be used

• Input to the feature functions: two terms

• Output: a numeric score measuring the semantic distance between the two terms

• The following types of feature functions can be used, among others:
  ◦ Contextual Features
  ◦ Term Co-occurrence
  ◦ Lexical-Syntactic Patterns
  ◦ Syntactic Dependency Features
  ◦ Word Length Difference
  ◦ Definition Overlap, etc.
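The feature-function interface (two terms in, one number out) can be sketched with two of the simpler feature types. These implementations are assumptions made for illustration: the paper may define word length difference and definition overlap differently, and the definitions in `toy_definitions` are invented.

```python
def word_length_difference(term_a, term_b):
    """Absolute difference in character length between two terms."""
    return abs(len(term_a) - len(term_b))


def definition_overlap(definition_a, definition_b):
    """Number of distinct words shared by two definitions."""
    words_a = set(definition_a.lower().split())
    words_b = set(definition_b.lower().split())
    return len(words_a & words_b)


def feature_vector(term_a, term_b, definitions):
    """Stack the feature scores for one term pair into a list."""
    return [
        float(word_length_difference(term_a, term_b)),
        float(definition_overlap(definitions[term_a], definitions[term_b])),
    ]


# Invented definitions, used only to exercise the functions.
toy_definitions = {
    "ball": "a round object used in games",
    "table": "a flat surface used in games",
}

features = feature_vector("ball", "table", toy_definitions)
```

Each score becomes one input to the learned metric, so a raw similarity count such as definition overlap does not need to be converted into a distance by hand; the sign of its learned weight can account for that.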

EXPERIMENTAL RESULTS

• Task: Reconstruct taxonomies from WordNet and ODP
  ◦ Not the entire WordNet or ODP, but fragments of each

• Ground Truth: 50 hypernym taxonomies from WordNet; 50 hypernym taxonomies from ODP; 50 meronym taxonomies from WordNet.

• Auxiliary Datasets: 1000 Google documents per term or per term pair; 100 Wikipedia documents per term.

• Evaluation Metrics: F1-measure (averaged by Leave-One-Out Cross Validation).
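For reference, F1 over relations can be computed by comparing the set of predicted (child, parent) pairs with the gold set. The sketch below assumes exact-match scoring of pairs, which the slide does not spell out, and the example relations are invented.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over relation pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(scores):
    """Average F1 across held-out taxonomies."""
    return sum(scores) / len(scores)


# Invented gold and predicted relations for one held-out taxonomy.
gold = {
    ("ball", "game equipment"),
    ("table", "game equipment"),
    ("basketball", "ball"),
    ("volleyball", "ball"),
}
predicted = {
    ("ball", "game equipment"),
    ("basketball", "ball"),
    ("volleyball", "table"),
}

score = f1_score(predicted, gold)
```

Here precision is 2/3 and recall is 2/4, giving an F1 of 4/7. Under leave-one-out cross validation, each taxonomy would be held out in turn, scored this way, and the per-taxonomy scores averaged with something like `mean_f1`.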

DATASETS

PERFORMANCE OF TAXONOMY INDUCTION

• Compare our system (ME) with other state-of-the-art systems
  ◦ HE: 6 is-a patterns [Hearst 1992]
  ◦ GI: 3 part-of patterns [Girju et al. 2003]
  ◦ PR: a probabilistic framework [Snow et al. 2006]
  ◦ ME: our metric-based framework

PERFORMANCE OF TAXONOMY INDUCTION

• Our system (ME) consistently gives the best F1 for all three tasks.

• Systems using heterogeneous features (ME and PR) achieve a significant absolute F1 gain (>30%)

FEATURES VS. RELATIONS

• This is the first study of the impact of using different features on taxonomy induction for different relations

• Co-occurrence and lexico-syntactic patterns are good for is-a, part-of, and sibling relations

• Contextual and syntactic dependency features are good only for the sibling relation

FEATURES VS. ABSTRACTNESS

• This is the first study of the impact of using different features on taxonomy induction for terms at different abstraction levels

• Contextual, co-occurrence, lexical-syntactic patterns, and syntactic dependency features work well for concrete terms

• Only co-occurrence works well for abstract terms

CONCLUSIONS

• This paper presents a novel metric-based taxonomy induction framework, which
  ◦ Combines strengths of pattern-based and clustering-based approaches
  ◦ Achieves better F1 than 3 state-of-the-art systems

• The first study on the impact of using different features on taxonomy induction for different types of relations and for terms at different abstraction levels

CONCLUSIONS

• This work is a general framework, which
  ◦ Allows a wider range of features
  ◦ Allows different metric functions at different abstraction levels

• This work has the potential to learn more complex taxonomies than previous approaches

THANK YOU AND QUESTIONS
