Thesaurus-Based Index Term Extraction
Olena Medelyan
Digital Library Laboratory
Index Terms vs. Keyphrases
• Both describe the topics in a document
• Index terms: controlled vocabulary (e.g. predatory birds, damage, aquaculture)
• Keyphrases: freely chosen (e.g. techniques, bird predation, aquaculture)
• Purposes:
– Organize a library's holdings
– Provide thematic access to documents
– Represent documents as brief summaries
– Aid navigation in search results
• Manual assignment: expensive, time-consuming
Example document: "Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Extraction vs. Assignment
• Extraction: select significant n-grams or NPs according to their characteristics
- Restriction to syntax
- Bad quality phrases
- No consistency
+ Easy and fast implementation
+ Not much training required
• Assignment: classify documents according to their content words into classes (labels = keyphrases)
- Needs large corpora
- Long computational time
- Not practical
+ Word co-occurrence
+ High accuracy
KEA++
• Combines extraction with controlled vocabulary
• Considers semantic relations
• Controlled vocabulary = thesaurus
• Experiment:
– Agricultural documents (www.fao.org/docrep)
– Agrovoc thesaurus (www.fao.org/agrovoc)
How does it Work?
1. Extract n-grams, transform them to pseudo-phrases, map them to pseudo-phrases of the thesaurus' descriptors (e.g. bird predation → predat bird)
2. Each document = set of candidate phrases
3. Training (documents + manually assigned phrases)
   a. Compute the features
   b. Compute the model
4. Testing (new documents, no phrases)
   a. Compute the features
   b. Compute probabilities according to the model
5. Classification model: Naïve Bayes
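Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the KEA++ implementation: the stopword list, the crude `toy_stem` suffix stripper, the alphabetical sorting of stems, and the rule of skipping n-grams that start or end with a stopword are all assumptions made for the example.

```python
import re

# Assumed toy stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}

def toy_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ion", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def pseudo_phrase(phrase):
    """Lowercase, drop stopwords, stem, and sort the remaining words."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return " ".join(sorted(toy_stem(w) for w in words if w not in STOPWORDS))

def candidate_phrases(text, descriptors, max_n=3):
    """Count thesaurus descriptors whose pseudo-phrase matches an n-gram."""
    index = {pseudo_phrase(d): d for d in descriptors}
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip n-grams that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            descriptor = index.get(pseudo_phrase(" ".join(gram)))
            if descriptor is not None:
                counts[descriptor] = counts.get(descriptor, 0) + 1
    return counts

text = "Techniques for reducing predation of birds at aquaculture facilities"
print(candidate_phrases(text, ["bird predation", "aquaculture", "fencing"]))
```

Because both the n-gram "predation of birds" and the descriptor "bird predation" reduce to the same pseudo-phrase, the document yields the controlled term even though its exact wording never appears in the text.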
Features
• TF×IDF – phrases that are specific to a given document are significant
• First Position – phrases that occur at the beginning (or the end) of the document are significant
• Phrase Length – phrases with a certain number of words are significant (especially 2)
• Node Degree – phrases that are related to the most other phrases in the document are significant
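The four features could be computed along these lines. This is a sketch under assumptions: the exact formulas (TF×IDF with a base-2 logarithm, first position normalized by document length) and the `related` dictionary of thesaurus links are illustrative choices, not taken from the slides.

```python
import math

def tf_idf(count_in_doc, doc_length, docs_with_phrase, total_docs):
    """Term frequency times inverse document frequency (assumed log base 2)."""
    tf = count_in_doc / doc_length
    idf = math.log2(total_docs / docs_with_phrase)
    return tf * idf

def first_position(first_index, doc_length):
    """Relative position of the first occurrence: 0.0 is the very start."""
    return first_index / doc_length

def phrase_length(phrase):
    """Number of words in the phrase."""
    return len(phrase.split())

def node_degree(phrase, candidates, related):
    """Number of other candidates in the document linked to this phrase.

    `related` is a hypothetical dict mapping a descriptor to the set of
    descriptors the thesaurus relates it to.
    """
    return sum(
        1 for other in related.get(phrase, ())
        if other in candidates and other != phrase
    )

# Hypothetical thesaurus links, for illustration only.
related = {"aquaculture": {"fish culture", "fisheries", "fish ponds"}}
candidates = {"aquaculture", "fish culture", "fisheries", "damage"}
print(node_degree("aquaculture", candidates, related))
```

A candidate with a high node degree sits in a cluster of related thesaurus terms that all occur in the document, which is the sense in which the semantic relations are taken into account.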
Example
[Figure: terms assigned to the example document by six indexers (1–6) and by KEA++, with Agrovoc relations drawn between the terms. Terms shown: fisheries, fish culture, aquaculture, fish ponds, aquaculture techniques, bird control, predatory birds, noxious birds, scares, pest control, control methods, monitoring, methods, equipment, protective structures, electrical installation, fencing, damage, noise, North America, techniques, fishery production, predation, predators, birds, ropes, fishing operations]
Evaluation I
• Standard evaluation:
– Number of exact matches in the test set
– Precision, Recall, F-measure
• Problems:
– Semantic similarity is not considered
– Comparison to only one indexer, although indexing is subjective
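For reference, the standard measures over exact matches can be written as below. This is a generic sketch; the function name and the two example sets are hypothetical and not results from the experiment.

```python
def precision_recall_f(extracted, manual):
    """Exact-match precision, recall and F-measure between two phrase sets."""
    extracted, manual = set(extracted), set(manual)
    matches = len(extracted & manual)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(manual) if manual else 0.0
    if matches == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical phrase sets, for illustration only.
extracted = {"aquaculture", "damage", "birds", "ropes"}
manual = {"aquaculture", "damage", "predatory birds"}
print(precision_recall_f(extracted, manual))
```

Here "birds" and "predatory birds" count as a complete miss, which illustrates the first problem listed above: exact matching gives no credit for semantically similar terms.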
Evaluation II
• Inter-indexer consistency, e.g. Rolling's measure:
Rolling's IIC = 2C / (A + B)
C – number of phrases in common
A – number of phrases in the first set
B – number of phrases in the second set
• Consistency of each indexer (in %):
Indexer   vs. other indexers   vs. KEA   vs. KEA++
1         42                   7         29
2         39                   8         28
3         37                   9         26
4         37                   6         31
5         37                   6         25
6         36                   4         20
avg       38                   7         27
• KEA++ vs. consistency among the indexers (27 vs. 38): -11%
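Rolling's measure is straightforward to compute from two phrase sets. The sketch below follows the formula 2C / (A + B); the function name, the handling of two empty sets, and the example sets are assumptions made for illustration.

```python
def rolling_iic(first, second):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    first, second = set(first), set(second)
    total = len(first) + len(second)          # A + B
    if total == 0:
        return 0.0                            # assumed convention for empty sets
    common = len(first & second)              # C
    return 2 * common / total

# Hypothetical phrase sets, for illustration only.
indexer = {"aquaculture", "damage", "fencing"}
system = {"aquaculture", "damage", "ropes", "birds"}
print(round(100 * rolling_iic(indexer, system)))  # consistency in percent
```

The measure is symmetric, so it can compare two human indexers with each other just as well as an indexer with an automatic system.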
Results
"Overview of Techniques for Reducing Bird Predation at Aquaculture Facilities"
Indexer / KEA++
• Exact: aquaculture / aquaculture; damage / damage; fencing / fencing; scares / scares; noise* / noise*
• Similar: bird control / birds; predatory birds / predators; fish culture / fishing operations; fishery production
• No match: noxious birds, control methods, ropes
*Selected by only one indexer
Problems & Future Work
• Trivial problems (e.g. stemming errors)
• Document chunking
– What are the important and the disturbing parts of the document?
• Topic coverage
– Exploring the thesaurus' structure
– Lexical chains
• Term occurrence
– Including other NLP resources (e.g. WordNet)
• Multi-linguality, other domains