Comparing taxonomies for organising collections of documents Samuel Fernando, Mark Hall, Eneko Agirre, Aitor Soroa, Paul Clough, Mark Stevenson COLING 2012, 14th December 2012, Mumbai, India


Comparing taxonomies for organising collections of documents

Samuel Fernando, Mark Hall, Eneko Agirre, Aitor Soroa, Paul Clough, Mark Stevenson

COLING 2012, 14th December 2012, Mumbai, India


Introduction

● Large collections of diverse data are available online. PATHS project aims to support user exploration in digital library collections.

● Search box is useful but taxonomies are better suited for exploration and browsing.

● We apply taxonomies to organise data from a large digital library collection.

● Process is automatic – either map items to an existing taxonomy, or induce a taxonomy from the data.


Evaluation data

● We use items from Europeana, a large online collection of cultural heritage.

● Use English subset, approx. 550,000 items.

● Item typically contains a picture, a title, description and subject keywords.

● Very diverse data comprising artifacts, places, people. Topics include fashion, archaeology, architecture and many other subjects.

● Data from many providers, some of which use taxonomies, some don’t – need unified approach.


Example item


Title: Design Council Slide Collection
Subject: colour, exhibitions, industrial design
Description: Display on the theme of colour matching at the Design Centre, London, 1960


Manually created taxonomies

● We use four existing manually created taxonomies:

– LCSH (Library of Congress)

– WordNet domains

– Wikipedia Taxonomy

– DBpedia ontology

● The taxonomies already exist and are of good quality, but the problem is to map Europeana items into the correct place in the taxonomy.


LCSH

● A controlled vocabulary maintained by the US Library of Congress for bibliographic records.

● Used by libraries to organise collections and also by curators of cultural heritage.

● Subject keywords are used to map Europeana items into the appropriate LCSH category nodes.

Example path: industrial design → design → creation (literary, artistic, etc.) → intellect (+30 more higher-level headings)
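The keyword-based mapping can be sketched as below. This is a minimal illustration, not the project's implementation: `BROADER` is a hypothetical stand-in for the LCSH broader-term links, filled only with the example headings from this slide (read here as running from narrower to broader, which is an assumption), and exact lower-cased string matching is also an assumption.

```python
# Hypothetical fragment of a heading hierarchy: heading -> broader headings.
# Only the example headings from the slide are included.
BROADER = {
    "industrial design": ["design"],
    "design": ["creation (literary, artistic, etc.)"],
    "creation (literary, artistic, etc.)": ["intellect"],
    "intellect": [],
}


def map_item(subject_keywords):
    """Return the headings that an item's subject keywords match exactly."""
    normalised = [k.strip().lower() for k in subject_keywords]
    return [k for k in normalised if k in BROADER]


def ancestors(heading):
    """Collect every broader heading reachable from the given heading."""
    seen = []
    stack = list(BROADER.get(heading, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(BROADER.get(current, []))
    return seen


# Subject keywords of the example item shown earlier in the deck.
item_subjects = ["colour", "exhibitions", "industrial design"]
matched = map_item(item_subjects)
path = ancestors("industrial design")
```

Keywords with no matching heading ("colour", "exhibitions" in this toy hierarchy) are simply left unmapped here; how the real system treats them is not stated on the slide.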


WordNet domains

● WordNet domains (Bernardo Magnini, LREC 2000) applies a small set of 164 domain labels to each of the WordNet synsets.

● Again use subject keywords to map Europeana items: first to Yago2 (for proper nouns), then to synset and finally to WordNet domain label.

Example paths: tourism → social; color → factotum; art → humanities (+5 more)


Wikipedia Taxonomy

● Wikipedia category hierarchy preserving only is-a relations; all others are discarded.

● Use Wikipedia Miner over each Europeana item to identify Wikipedia articles in the subject keywords. Then map item to all categories that contain these articles.

Example paths: design → visual_arts → criticism; image_processing → digital_signal_processing → signal_processing; museology → museums → educational_organizations → organizations (+35 more)


DBpedia ontology

● A formalised shallow ontology manually created based on Wikipedia (with inference capability).

● Again use Wikipedia Miner to find Wikipedia articles in the subject keywords of each item, and map the item to the categories to which these articles belong.

Example paths: musical_work → work; work; album → musicalwork → work


Automatic data-derived taxonomies

● We use two approaches to derive taxonomies automatically from the Europeana data.

– LDA (Latent Dirichlet Allocation) topic modelling

– WikiFreq (Wikipedia Frequency hierarchy)

● Taxonomies fit data: no unnecessary nodes to prune.

● Mapping from items to concept nodes is implicit during derivation.


LDA topic modelling

● Latent Dirichlet Allocation (LDA) maps each item to one or more topics.

● Each item is a distribution over topics; each topic is a distribution over words.

● Item-topic and topic-word distributions are learned using collapsed Gibbs sampling.

● Has been used for improving results from IR.

● Previous work has developed hierarchical LDA, but this is infeasible over our large data set.


Hierarchical LDA topics

● Run LDA over corpus to determine item-topic probabilities.

● Identify set of items for each topic. Each item is assigned to its highest probability topic. Each topic is labelled with its highest probability word.

● If a topic has fewer than 60 items then stop. Otherwise go back to the first step, using the set of items identified for that topic as the corpus.
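The recursion in these steps can be sketched as follows. This is only an illustration of the control flow: `toy_assign` is a hypothetical stand-in that groups items by shared words so the example runs without a topic-modelling library, and it is not LDA. A real run would replace it with LDA inference that assigns each item to its highest probability topic. The `max_depth` limit and the single-group check are safeguards added for this sketch, and the toy data uses a threshold of 3 rather than the 60 given on the slide.

```python
from collections import Counter, defaultdict


def toy_assign(items):
    """Stand-in for LDA (NOT LDA itself): group each item under its most
    widespread word, ignoring words that occur in every item."""
    df = Counter(w for words in items for w in set(words))
    groups = defaultdict(list)
    for words in items:
        candidates = [w for w in set(words) if df[w] < len(items)]
        # Highest document frequency wins; ties are broken alphabetically.
        label = min(candidates, key=lambda w: (-df[w], w)) if candidates else "misc"
        groups[label].append(words)
    return dict(groups)


def build_hierarchy(items, assign=toy_assign, min_items=60, max_depth=10):
    """Recursively split a corpus into labelled topics.

    A topic with fewer than min_items items becomes a leaf; otherwise its
    items form the corpus for the next round.
    """
    tree = {}
    for label, members in assign(items).items():
        small = len(members) < min_items
        # Safeguards (assumptions of this sketch): stop when no split occurs
        # or when the depth limit is reached.
        stuck = len(members) == len(items) or max_depth <= 1
        if small or stuck:
            tree[label] = members
        else:
            tree[label] = build_hierarchy(members, assign, min_items, max_depth - 1)
    return tree


corpus = [
    ["design", "chair"], ["design", "chair"], ["design", "lamp"],
    ["church", "view"], ["church", "tower"],
]
tree = build_hierarchy(corpus, min_items=3)
```

In this toy run the "design" topic holds three items, so it is split again into "chair" and "lamp", while "church" falls below the threshold and stays a leaf.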


Hierarchical LDA topics (example)


[Figure: example branch of the hierarchical LDA taxonomy, with topic labels including bangle, brooch, design and collection]


Wikipedia link frequencies

● Novel approach.

● Run Wikipedia Miner to find links in all Europeana items – use title, subject and description.

● Find frequency counts for each link.

● For each item take the set of links found.

● Create taxonomy branch (if not already present) with links in order of frequency (most frequent first).

● Map item to least frequent link.
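The steps above can be sketched as a short routine. This is an illustrative reading of the slide, not the authors' code: the links are assumed to have been extracted already (the Wikipedia Miner step is not reproduced), each link is counted once per item, ties in frequency are broken alphabetically, and the nested-dictionary layout is hypothetical.

```python
from collections import Counter


def build_wikifreq(item_links):
    """Build a frequency-ordered hierarchy.

    item_links: dict of item id -> set of link names (assumed pre-extracted).
    Returns the root node and the link frequency counts.
    """
    # Frequency of each link across the collection (once per item: assumption).
    freq = Counter(link for links in item_links.values() for link in links)
    root = {"children": {}, "items": []}
    for item_id, links in item_links.items():
        # Most frequent link first; alphabetical tie-break is an assumption.
        ordered = sorted(links, key=lambda link: (-freq[link], link))
        node = root
        for link in ordered:
            # Create the branch node only if it is not already present.
            node = node["children"].setdefault(link, {"children": {}, "items": []})
        if ordered:
            # The item ends up at its least frequent link.
            node["items"].append(item_id)
    return root, freq


items = {
    "item1": {"Industrial design", "Design Council"},
    "item2": {"Industrial design", "Interior design"},
    "item3": {"Industrial design", "Design Council"},
}
taxonomy, freq = build_wikifreq(items)
design_council = taxonomy["children"]["Industrial design"]["children"]["Design Council"]
```

Because the most frequent links sit nearest the root, widely used concepts become the top-level nodes and rarer ones become their descendants.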


Wikipedia link frequencies (cont.)

● Large number of concept nodes: limit to 24 children for each node.

● Require at least 2 links for each item, to filter out items with little metadata.

● Filter out concepts with fewer than 20 items.

Example path: industrial design → design council


Statistics


Type       Taxonomy     Items    Nodes    Avg. parents  Avg. depth  Top nodes
Manual     LCSH         99259    285238   1.8           1.97        28901
Manual     DBpedia      178312   273      4.2           2           30
Manual     WikiTax      275359   121359   11.7          1.13        10417
Manual     WN domains   308687   170      7.1           7.1         6
Automatic  LDA topics   545896   22494    1             7.3         9
Automatic  Wiki Freq    66558    502      1             3.39        24


Evaluation - cohesion

● Intruder detection was originally proposed in (Chang et al., 2009). A cohesive unit is defined as one in which the items are similar to each other while differing from items in other clusters.

● Present 5 items to each annotator: 4 from one concept node, and an intruder item chosen at random from elsewhere in the taxonomy. The more cohesive the unit, the more obvious the intruder will be.

● Crowd-sourcing: 111 annotators, 30 units from each taxonomy. 1255 answers – average 7 annotators for each unit.
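Building one evaluation unit could look like the sketch below. It assumes a hypothetical layout in which the taxonomy is a dictionary from concept node to item ids, and it excludes from the intruder pool any item that is also mapped to the target node. That exclusion is an assumption made here, since the slide only says the intruder comes from elsewhere in the taxonomy.

```python
import random


def make_unit(nodes, target, rng, unit_size=4):
    """Build one intruder-detection unit.

    nodes: dict of concept node -> list of item ids (hypothetical layout).
    Returns the shuffled list of five items and the intruder's id.
    """
    # Four items drawn from the concept node under evaluation.
    members = rng.sample(nodes[target], unit_size)
    # Candidate intruders: items mapped to other nodes only.
    outside = [
        item
        for node, ids in nodes.items()
        if node != target
        for item in ids
        if item not in nodes[target]
    ]
    intruder = rng.choice(outside)
    unit = members + [intruder]
    # Shuffle so the intruder's position gives nothing away.
    rng.shuffle(unit)
    return unit, intruder


nodes = {
    "industrial design": ["a1", "a2", "a3", "a4", "a5"],
    "church": ["b1", "b2"],
    "coin": ["c1"],
}
unit, intruder = make_unit(nodes, "industrial design", random.Random(0))
```

Passing a seeded `random.Random` keeps the selection reproducible, which helps when the same units must be shown to several annotators.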


Example of a cohesive unit


Evaluation - cohesion results


Type       Taxonomy       Cohesive units  Percentage
Manual     LCSH           19              63.3
Manual     DBpedia        17              56.7
Manual     Wiki Taxonomy  18              60.0
Manual     WN domains     15              50.0
Automatic  LDA topics     17              56.7
Automatic  Wiki Freq      29              96.7

Number of cohesive units (out of a possible 30)
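The percentage column follows directly from the counts, since each taxonomy was tested on 30 units. A quick check, using only the figures from the table (the variable names are illustrative):

```python
TOTAL_UNITS = 30  # units evaluated per taxonomy, as stated on the slide

cohesive_units = {
    "LCSH": 19,
    "DBpedia": 17,
    "Wiki Taxonomy": 18,
    "WN domains": 15,
    "LDA topics": 17,
    "Wiki Freq": 29,
}

# Percentage of cohesive units, rounded to one decimal place as in the table.
percentages = {
    name: round(100 * count / TOTAL_UNITS, 1)
    for name, count in cohesive_units.items()
}

# Taxonomy with the highest share of cohesive units.
best = max(percentages, key=percentages.get)
```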


Evaluation - relation classification

● Previous work has typically used a simple boolean question: “is it true that ChildNode is-a ParentNode?”

● We ask two questions for each child-parent pair A and B:

– Are the concepts A and B related?

– If they are, is A more specific than B, less specific than B, or neither?

● Crowd-sourcing: 173 annotators, 40 pairs from each taxonomy, each pair evaluated on average 16 times.


Evaluation - example pairs


Taxonomy       Child (A)            Parent (B)
LCSH           Work                 Human Behaviour
LCSH           Braid                Weaving
DBpedia        Mountain Range       Place
DBpedia        Fern                 Plant
Wiki Taxonomy  Mammals of Africa    Wildlife of Africa
Wiki Taxonomy  Schools in Wiltshire Schools in England
WN domains     vehicles             transport
WN domains     mechanics            engineering
LDA topics     earthenware          dish
LDA topics     view                 church
Wiki Freq      Corrosion            Coin
Wiki Freq      Interior Design      Industrial Design


Are A and B related?


Taxonomy       Yes (%)  No (%)  Don't know (%)
LCSH           74.2     8.8     17.0
DBpedia        86.6     11.2    2.2
Wiki Taxonomy  96.1     1.7     2.3
WN domains     77.1     14.5    8.4
LDA topics     30.3     50.3    19.3
Wiki Freq      47.6     16.5    35.8


Which is more specific?


Taxonomy       A<B (%)  A>B (%)  Neither (%)  Don't know (%)
LCSH           65.4     8.7      23.4         2.5
DBpedia        76.2     4.9      18.1         0.7
Wiki Taxonomy  78.3     4.7      16.0         0.9
WN domains     63.6     6.3      28.0         2.0
LDA topics     21.4     14.8     62.1         1.6
Wiki Freq      30.9     22.6     43.6         2.9


Conclusions

● Wikipedia Taxonomy is conceptually well organised, even better than LCSH, which has been widely used for organising library collections.

● WikiFreq gives very high cohesion for items, although the conceptual relations are not well defined.

● Future work continues with different intrinsic and user evaluations. We also aim to combine Wikipedia Taxonomy and WikiFreq to get the best of both.


The End


Supported by the PATHS project http://paths-project.eu Funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082. This research was also partially funded by the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2 project).

Questions?