Benjamin Chu Min Xian
Arun Anand Sadanandan
Fadzly Zahari
Dickson Lukose
Multilingual Semantic Annotation Engine for Agricultural Documents
04.09.2012
International Symposium on Agricultural
Ontology Service (AOS2012)
Outline
Introduction
Related Work
System Description: Text Annotation Engine
Challenges
Conclusion
Introduction
Related Work
• Semantic annotation techniques are
typically categorized as pattern-based
or machine-learning-based
• Most annotation tools can handle only
a single language
• They are not easily customized to work
for different domains
Text Annotation Engine (T-ANNE1)
• Semantic tagging system
– Semantic web of tags
• Knowledge base approach
• Scalable system
– Handles large sets of documents
– Web services
• Distributed approach
– Document Splitter
• Multilingual tagging
– Language identifier
1. Chu, M.X., Bahls, D., Lukose, D.: A System and Method for Concept and Named Entity Recognition (2012). (Patent Pending)
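The bullet points above can be sketched as a toy pipeline: split the document, identify its language, and tag terms found in a knowledge base. This is a minimal illustration, not the patented T-ANNE implementation; the tiny knowledge base, the placeholder concept ids (C_MAIZE, C_RICE), and the vocabulary-overlap language heuristic are all invented for the example (a real identifier would use character n-gram models).

```python
# Illustrative knowledge base: language -> {term -> concept id}.
# Concept ids are placeholders, not real AGROVOC identifiers.
KNOWLEDGE_BASE = {
    "en": {"maize": "C_MAIZE", "rice": "C_RICE"},
    "fr": {"maïs": "C_MAIZE", "riz": "C_RICE"},
}

def identify_language(text):
    """Naive language identifier: pick the language whose vocabulary
    matches the most tokens in the text."""
    tokens = text.lower().split()
    return max(KNOWLEDGE_BASE,
               key=lambda lang: sum(t in KNOWLEDGE_BASE[lang] for t in tokens))

def split_document(text, chunk_size=50):
    """Document splitter: break text into fixed-size token chunks so
    chunks can be annotated in parallel (the distributed approach)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def annotate(chunk, lang):
    """Tag every token that matches a concept label in the knowledge base."""
    vocab = KNOWLEDGE_BASE[lang]
    return [(tok, vocab[tok.lower()]) for tok in chunk.split()
            if tok.lower() in vocab]

doc = "Maize and rice are staple crops"
lang = identify_language(doc)
tags = [t for chunk in split_document(doc) for t in annotate(chunk, lang)]
print(lang, tags)
```

In a deployed system each chunk from the splitter would go to a separate web-service worker, which is what makes the approach scale to large document sets.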
Text Annotation Engine (T-ANNE)
[Figure: Multilingual Semantic Annotation System overview. Documents and the AGROVOC knowledge base feed the Semantic Annotation Engine (T-ANNE), which produces semantic annotation tags.]
Text Annotation Engine (T-ANNE)
[Figure: Example of a Japanese document annotated against the AGROVOC knowledge base, producing tags.]
Text Annotation Engine (T-ANNE)
• Knowledge-based approach
• The number of languages and domains it can
handle is only limited by the knowledge base
it uses
• Easily customized
• Utilizes AGROVOC as the knowledge base
for recognizing and annotating agriculture-related
documents
Text Annotation Engine (T-ANNE)
• Multilingual capability
– Automatically determines the language of the text
• AGROVOC – a multilingual thesaurus with more than
40,000 concepts in up to 22 languages
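One way to picture the multilingual lookup is a reverse index from surface labels in any language to a language-independent concept key. The three-entry table below is invented for illustration; the real AGROVOC provides such multilingual labels at far larger scale.

```python
# Toy multilingual label table: concept key -> {language -> label}.
LABELS = {
    "maize": {"en": "maize", "fr": "maïs", "ja": "トウモロコシ"},
    "rice":  {"en": "rice",  "fr": "riz",  "ja": "イネ"},
}

# Reverse index: a surface form in any language resolves to the
# same language-independent concept key.
REVERSE = {label: (concept, lang)
           for concept, by_lang in LABELS.items()
           for lang, label in by_lang.items()}

def resolve(term):
    """Return (concept, language) for a label in any supported language,
    or None if the term is unknown."""
    return REVERSE.get(term.lower())

print(resolve("maïs"))   # resolves to the same concept as English "maize"
```

Because annotation runs against concept keys rather than strings, documents in different languages end up carrying the same semantic tags.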
Challenges
1. Ambiguity
2. Morphological Variations
3. Detail / Granularity Level
Challenges
1. Ambiguity
“They performed Kashmir, written by Page and Plant. Page played unusual chords on
his Gibson.”
Kashmir: a song or the Himalayan region?
Page: guitarist “Jimmy Page” or the Google founder “Larry Page”?
Gibson: guitar brand or actor “Mel Gibson”?
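One common strategy for this kind of ambiguity (sketched here with an invented sense inventory, and not T-ANNE's patented method) is to score each candidate sense by how many of its associated context words appear near the mention:

```python
# Invented sense inventory: mention -> {sense -> context words that
# typically co-occur with that sense}.
SENSES = {
    "Page": {
        "Jimmy Page (guitarist)": {"guitar", "chords", "plant", "band"},
        "Larry Page (Google founder)": {"google", "search", "company"},
    },
}

def disambiguate(mention, context):
    """Pick the sense whose context-word set overlaps the context most."""
    words = set(context.lower().split())
    candidates = SENSES[mention]
    return max(candidates, key=lambda sense: len(candidates[sense] & words))

ctx = "Page played unusual chords on his Gibson guitar"
print(disambiguate("Page", ctx))
```

Real systems refine this with sense frequencies and knowledge-base relations, but the intuition is the same: surrounding words vote for a sense.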
Challenges
2. Morphological Variations
Variation of entities representing the same concept using:
Plurals
Acronyms / Abbreviations
Different Spellings
Compound Words
Language
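The variation types listed above are usually handled by normalising surface forms to one canonical entry before knowledge-base lookup. The sketch below uses deliberately simplified rules and invented lookup tables; production systems rely on proper lemmatisers and curated synonym lists.

```python
# Invented, illustrative tables for acronym expansion and
# spelling-variant unification.
ACRONYMS = {"fao": "food and agriculture organization"}
SPELLING = {"fertiliser": "fertilizer"}   # British -> American variant

def normalize(term):
    t = term.lower().strip()
    t = ACRONYMS.get(t, t)                 # expand acronyms / abbreviations
    if t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]                         # crude plural stripping
    t = SPELLING.get(t, t)                 # unify spelling variants
    return t

print(normalize("Fertilisers"))
print(normalize("FAO"))
```

After normalisation, "Fertilisers", "fertiliser", and "fertilizer" all hit the same knowledge-base entry.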
Challenges
3. Detail / Granularity Level
Some annotation systems issue more generic tags, while
others issue more specific ones.
For example, a general tag such as ‘Cereals’ in contrast to a specific
tag such as ‘Waxy maize’.
Whether the system should return coarse-grained or fine-grained
annotation tags depends on the actual needs of the application,
so it is important to choose the right granularity (detail)
level.
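Choosing a granularity level amounts to walking up a concept hierarchy from a specific tag toward more general ones. The two-branch hierarchy below is invented for illustration; AGROVOC supplies real broader/narrower relations between its concepts.

```python
# Toy broader-term chain: specific concept -> its broader concept.
BROADER = {
    "Waxy maize": "Maize",
    "Maize": "Cereals",
    "Cereals": "Plant products",
}

def generalize(tag, levels):
    """Walk `levels` steps up the broader-term chain to get a
    coarser-grained tag; stop at the top of the hierarchy."""
    for _ in range(levels):
        if tag not in BROADER:
            break
        tag = BROADER[tag]
    return tag

print(generalize("Waxy maize", 0))  # fine-grained
print(generalize("Waxy maize", 2))  # coarse-grained
```

An application can then expose `levels` as a single knob: 0 for fine-grained tags, higher values for coarser ones.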
Conclusions
The annotation engine uses a knowledge-based approach
to perform concept and named entity recognition.
The application domains and the number of languages it can
handle depend on the knowledge base used for the
recognition.
Future work: address the remaining challenges (entity resolution,
disambiguation).