
Thesaurus-Based Index Term Extraction

Olena Medelyan

Digital Library Laboratory

Index Terms vs. Keyphrases

Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

• Describe the topics in a document

• Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)

• Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)

• Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results

• Manual assignment: expensive, time-consuming

Extraction vs. Assignment

• Extraction: select significant n-grams or NPs according to their characteristics
- Restriction to syntax
- Bad-quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required

• Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Needs large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy

KEA++

• Combines extraction with controlled vocabulary

• Considers semantic relations

• Controlled vocabulary = thesaurus

• Experiment:
– Agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)

How Does It Work?

1. Extract n-grams, transform them to pseudo-phrases, and map them to the pseudo-phrases of the thesaurus's descriptors (e.g. bird predation → predat bird)

2. Each document = set of candidate phrases

3. Training (documents + manually assigned phrases)
a. Compute the features
b. Compute the model

4. Testing (new documents, no phrases)
a. Compute the features
b. Compute probabilities according to the model

5. Classification model: Naïve Bayes
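Step 1 can be illustrated with a minimal sketch. It assumes that a pseudo-phrase is built by lowercasing, dropping stopwords, stemming, and putting the stems into a fixed order. The stopword list, the toy suffix stripper, and the function names are hypothetical stand-ins, not the KEA++ code. The sketch sorts stems alphabetically, so it yields `bird predat` where the slide shows `predat bird`; the word order is only a convention, what matters is that variants of a phrase collapse to the same key.

```python
import re

# Hypothetical stopword list and suffix list, for illustration only.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}
SUFFIXES = ("ions", "ion", "ing", "es", "s")


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the stems alphabetically."""
    words = re.findall(r"[a-z]+", phrase.lower())
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def ngrams(text, max_len=3):
    """Yield every word n-gram of the text with 1 to max_len words."""
    words = re.findall(r"[a-z]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i : i + n])


def candidates(text, descriptors, max_len=3):
    """Return the thesaurus descriptors whose pseudo-phrase occurs in the text."""
    index = {pseudo_phrase(d): d for d in descriptors}
    found = set()
    for gram in ngrams(text, max_len):
        key = pseudo_phrase(gram)
        if key and key in index:
            found.add(index[key])
    return found
```

With this normalization, "bird predation" and "predation of birds" map to the same pseudo-phrase, so either wording in a document leads to the same descriptor.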

Features

• TF×IDF – phrases that are specific to a given document are significant

• First Position – phrases that appear at the beginning (or the end) of the document are significant

• Phrase Length – phrases with a certain number of words are significant (especially 2)

• Node Degree – phrases that are related to the most other phrases in the document are significant
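The four features can be sketched as small functions. The formulas are assumptions chosen to match the one-line descriptions above: relative frequency times inverse document frequency, relative position of the first occurrence, word count, and the number of thesaurus links to other candidates of the same document. Function names, parameters, and the example relations are hypothetical.

```python
import math


def tf_idf(phrase_count, doc_length, docs_with_phrase, total_docs):
    """Relative frequency in the document times inverse document frequency."""
    tf = phrase_count / doc_length
    idf = -math.log2(docs_with_phrase / total_docs)
    return tf * idf


def first_position(words_before_first_occurrence, doc_length):
    """Relative position of the first occurrence: near 0 = start, near 1 = end."""
    return words_before_first_occurrence / doc_length


def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())


def node_degree(phrase, doc_candidates, related):
    """Count the thesaurus neighbours of `phrase` that are also candidates
    in the same document. `related` maps a term to its related terms."""
    neighbours = related.get(phrase, set())
    return len(neighbours & (set(doc_candidates) - {phrase}))


# Hypothetical thesaurus relations, for illustration only.
related = {"predatory birds": {"birds", "predators", "noxious birds"}}
doc_candidates = ["predatory birds", "birds", "aquaculture", "predators"]
degree = node_degree("predatory birds", doc_candidates, related)
```

In this example `degree` is 2, because two of the three related terms are themselves candidates in the document.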

Example

[Figure: graph of Agrovoc terms for the example document. Legend: Indexers (1 to 6), Agrovoc relation, KEA++.]

Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, north america, techniques, fishery production, predation, predators, birds, ropes, fishing operations

Evaluation I

• Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure

• Problem:
– Semantic similarity is not considered
– Comparison only to one indexer, although indexing is subjective
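The standard evaluation can be sketched as follows, assuming that both the extracted phrases and the indexer's phrases are treated as sets and that only exact string matches count. The function name and the example phrases are hypothetical.

```python
def precision_recall_f(extracted, gold):
    """Exact-match precision, recall, and F-measure between two phrase sets."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical phrase sets, for illustration only.
extracted = {"aquaculture", "damage", "birds", "ropes"}
gold = {"aquaculture", "damage", "bird control"}
p, r, f = precision_recall_f(extracted, gold)
```

Here "birds" earns no credit against "bird control", which is the first problem named above: exact matching ignores semantic similarity.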

Evaluation II

• Inter-indexer consistency, e.g. Rolling's measure:

$$\text{Rolling's IIC} = \frac{2C}{A + B}$$

C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set

Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

KEA++ vs. other indexers, on average: -11%
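Rolling's measure follows directly from the definitions of A, B, and C. The sketch below assumes that each indexer's phrases form a set; the averaging helper mirrors how one set could be compared against several indexers. The function names and the example sets are hypothetical.

```python
def rolling_consistency(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    if not first and not second:
        return 0.0
    common = len(first & second)
    return 2 * common / (len(first) + len(second))


def average_consistency(phrases, indexer_sets):
    """Average consistency of one phrase set against several indexers."""
    scores = [rolling_consistency(phrases, other) for other in indexer_sets]
    return sum(scores) / len(scores)


# Hypothetical phrase sets, for illustration only.
indexer_a = {"aquaculture", "damage", "fencing", "bird control"}
indexer_b = {"aquaculture", "damage", "birds"}
score = rolling_consistency(indexer_a, indexer_b)
```

With C = 2, A = 4, and B = 3, the score is 4/7, or about 57%.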

Results

Document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"

            Indexer            KEA++
Exact       aquaculture        aquaculture
            damage             damage
            fencing            fencing
            scares             scares
            noise*             noise*
Similar     bird control       birds
            predatory birds    predators
            fish culture       fishing operations
                               fishery production
No match    noxious birds, control methods, ropes

*Selected by only one indexer

Problems & Future Work

• Trivial problems (e.g. stemming errors)

• Document chunking
– Which parts of the document are important, and which are distracting?

• Topic coverage
– Exploring the thesaurus's structure
– Lexical chains

• Term occurrence
– Including other NLP resources (e.g. WordNet)

• Multilinguality, other domains