
Thesaurus-Based Index Term Extraction

Olena Medelyan

Digital Library Laboratory

• Describe the topics in a document

• Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)

• Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)

• Purposes:
– Organize library holdings
– Provide thematic access to documents
– Represent documents as a brief summary
– Aid navigation in search results

• Manual assignment: expensive, time-consuming

Index Terms vs. Keyphrases

Example document: “Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities”

Extraction vs. Assignment

• Extraction: select significant n-grams or noun phrases (NPs) according to their characteristics

- Restricted to syntax
- Poor-quality phrases
- No consistency

+ Easy and fast implementation
+ Not much training required

• Assignment: classify documents according to their content words into classes (labels = keyphrases)

- Needs large corpora
- Long computational time
- Not practical

+ Uses word co-occurrence
+ High accuracy

KEA++

• Combines extraction with controlled vocabulary

• Considers semantic relations

• Controlled vocabulary = thesaurus

• Experiment:
– Agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)

How does it Work?

1. Extract n-grams, transform them to pseudo-phrases, and map them to the pseudo-phrases of the thesaurus descriptors (e.g. bird predation → predat bird)

2. Each document = set of candidate phrases

3. Training (documents + manually assigned phrases)
   a. Compute the features
   b. Compute the model

4. Testing (new documents, no phrases)
   a. Compute the features
   b. Compute probabilities according to the model

5. Classification model: Naïve Bayes
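
Steps 1 and 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix-stripping `stem` function, the stopword list, and the four-entry thesaurus are hypothetical stand-ins, not the stemmer or the Agrovoc data used by KEA++. The sketch orders stems alphabetically, so it produces "bird predat" where the slide shows "predat bird"; any fixed ordering serves the same purpose as long as document n-grams and thesaurus descriptors are normalized in the same way.

```python
import re

# Hypothetical stopword list and suffix rules, for illustration only.
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to"}
SUFFIXES = ("ion", "ory", "ing", "s")


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and put the stems in a fixed order."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return " ".join(sorted(stem(w) for w in words if w not in STOPWORDS))


def extract_candidates(text, thesaurus, max_len=3):
    """Count how often each thesaurus descriptor is matched by an n-gram."""
    index = {pseudo_phrase(d): d for d in thesaurus}
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip n-grams that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            descriptor = index.get(pseudo_phrase(" ".join(gram)))
            if descriptor is not None:
                counts[descriptor] = counts.get(descriptor, 0) + 1
    return counts


title = ("Overview of Techniques for Reducing Bird Predation "
         "at Aquaculture Facilities")
thesaurus = ["predatory birds", "aquaculture", "techniques", "fencing"]
print(extract_candidates(title, thesaurus))
```

In this toy run the n-gram "bird predation" is mapped to the descriptor "predatory birds", because both normalize to the same pseudo-phrase.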

Features

• TF×IDF – phrases that are specific to a given document are significant

• First Position – phrases that occur at the beginning (or the end) of the document are significant

• Phrase Length – phrases with a certain number of words are significant (typically 2)

• Node Degree – phrases that are related to the most other phrases in the document are significant
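
A minimal sketch of how the four features might be computed for one candidate phrase. The exact formulas are assumptions, not taken from the slides: TF×IDF uses a smoothed base-2 logarithm, first position is the relative offset of the first occurrence, and node degree counts thesaurus links to other candidates of the same document. The `related` mapping is a hypothetical stand-in for thesaurus relations, not actual Agrovoc data.

```python
import math


def tf_idf(count, doc_len, doc_freq, num_docs):
    """Term frequency times inverse document frequency (smoothed, assumed form)."""
    tf = count / doc_len
    idf = -math.log2((doc_freq + 1) / (num_docs + 1))
    return tf * idf


def first_position(first_index, doc_len):
    """Relative position of the first occurrence (0.0 = start of document)."""
    return first_index / doc_len


def phrase_length(term):
    """Number of words in the phrase."""
    return len(term.split())


def node_degree(term, candidates, related):
    """Number of thesaurus links from `term` to other candidates in the document."""
    return sum(1 for other in related.get(term, ())
               if other in candidates and other != term)


# Hypothetical relations and candidate set, for illustration only.
related = {"predatory birds": {"birds", "noxious birds", "pest control"}}
candidates = {"predatory birds", "birds", "noxious birds", "aquaculture"}

features = {
    "tfidf": tf_idf(count=4, doc_len=100, doc_freq=1, num_docs=7),
    "first_position": first_position(first_index=25, doc_len=100),
    "length": phrase_length("predatory birds"),
    "node_degree": node_degree("predatory birds", candidates, related),
}
print(features)
```

Feature vectors of this kind would then be passed to the classifier during training and testing.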

Example

[Figure: graph of Agrovoc terms for the example document, marking which terms were selected by indexers 1–6 and by KEA++, and which terms are connected by an Agrovoc relation. Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, north america, techniques, fishery production, predation, predators, birds, ropes, fishing operations.]

Evaluation I

• Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure

• Problems:
– Semantic similarity is not considered
– Comparison to only one indexer, although indexing is subjective
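
The standard evaluation can be sketched as below, assuming extracted and manually assigned phrases are compared as plain sets of strings. The function name `evaluate` and the two term lists are illustrative only.

```python
def evaluate(extracted, gold):
    """Precision, recall and F-measure based on exact string matches."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical term sets, for illustration only.
system = ["aquaculture", "damage", "birds", "predators"]
indexer = ["aquaculture", "damage", "predatory birds", "bird control", "fencing"]

precision, recall, f_measure = evaluate(system, indexer)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

In this toy example "birds" and "predators" count as plain misses even though the indexer chose the closely related "predatory birds", which is the first problem named above.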

Evaluation II

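• Inter-indexer consistency (IIC), e.g. Rolling’s measure:

$\mathrm{IIC} = \frac{2C}{A + B}$

– C: number of phrases in common
– A: number of phrases in the first set
– B: number of phrases in the second set

Consistency (%) of each indexer with the other indexers, with KEA, and with KEA++:

Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27

• KEA++ vs. average inter-indexer consistency: 27 vs. 38 (-11%)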
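
Rolling's measure is straightforward to compute once each indexer's phrases are treated as a set. The sketch below is illustrative: the function names and the three term sets are assumptions, not the data behind the reported figures.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)
    if total == 0:
        return 0.0
    return 2 * len(first & second) / total


def average_iic(target, others):
    """Mean consistency of one term set with each of the other term sets."""
    return sum(rolling_iic(target, other) for other in others) / len(others)


# Hypothetical term sets, for illustration only.
indexer_a = {"aquaculture", "damage", "fencing", "scares"}
indexer_b = {"aquaculture", "damage", "birds", "predators", "noise", "ropes"}
indexer_c = {"aquaculture", "fencing", "scares", "noise"}

print(rolling_iic(indexer_a, indexer_b))
print(average_iic(indexer_a, [indexer_b, indexer_c]))
```

Averaging the pairwise scores for one indexer (or for the system) against all other indexers is one way to obtain a single per-row figure.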

“Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities”


Results

Indexer vs. KEA++

• Exact (same term in both): aquaculture, damage, fencing, scares, noise*

• Similar (indexer term → KEA++ term): bird control → birds; predatory birds → predators; fish culture → fishing operations, fishery production

• No match: noxious birds, control methods, ropes

*Selected by only one indexer

Problems & Future Work

• Trivial problems (e.g. stemming errors)

• Document chunking
– Which parts of the document are important, and which are distracting?

• Topic coverage
– Exploring the thesaurus structure
– Lexical chains

• Term occurrence
– Including other NLP resources (e.g. WordNet)

• Multilinguality, other domains

