an evaluation procedure for word net based lexical chaining: methods and issues
Post on 05-Jan-2016
http://www.hytex.info/
HyTex: Hypertextualisierung auf textgrammatischer Grundlage (hypertextualization on a text-grammatical basis)
Text Technological Modelling of Information
An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues
Irene Cramer & Marc Finthammer
Faculty of Cultural Studies, Technische Universität Dortmund, Germany
irene.cramer@uni-dortmund.de
Outline
• Project Context and Motivation
• Lexical Chaining – Evaluation Steps
1. Preprocessing and Coverage
2. Sense Disambiguation
3. Semantic Relatedness/Similarity
4. Application
• Open Issues and Future Work
Project Context and Motivation
• Research project HyTex funded by DFG (German Research Foundation) – part of the research unit "Text Technological Modelling of Information"
• Research objective in HyTex: text-grammatical foundations for the (semi-) automated text-to-hypertext conversion
• One focus of our research: topic-based linking strategies using lexical and topic chains/topic views
Project Context and Motivation
• Lexical chains
– based on the concept of lexical cohesion,
– regarded as partial text representation,
– valuable resource for many NLP applications, such as text summarization, dialog systems etc.
– (to our knowledge) 2 lexical chainers for German (Mehler, 2006 and Cramer/Finthammer); in addition: research on semantic similarity based on GermaNet by Gurevych et al.
Project Context and Motivation
• Topic chains/views
– based on a selection of central words, so-called topic words,
– intended to support the user's orientation and navigation.
Steps:
1. Lexical chains are used to select topic words (1-3 topic words per passage),
2. topic words are used to construct the topic view (~"thematic index"),
3. topic words are re-connected via semantically meaningful edges (as in lexical chaining) to construct topic chains.
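The first two steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not GLexi's implementation: the function names, the toy chains, and the ranking criterion (how often a word occurs in the chains of a passage) are all hypothetical. The third step, re-connecting topic words into topic chains, is left out.

```python
from collections import Counter

def select_topic_words(chains, max_words=3):
    """Pick up to max_words topic words for one passage.

    chains: list of lexical chains, each a list of word strings.
    Assumption: a word is ranked by the number of chain positions it fills;
    ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(word for chain in chains for word in chain)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]

def build_topic_view(chains_by_passage, max_words=3):
    """Map each passage id to its topic words (the 'thematic index')."""
    return {passage: select_topic_words(chains, max_words)
            for passage, chains in chains_by_passage.items()}

# Hypothetical chains for two passages
chains_by_passage = {
    "1.1": [["Text", "Hypertext", "Text"], ["Kette", "Text"]],
    "1.2": [["Link", "Hypertext", "Link"], ["Knoten"]],
}
topic_view = build_topic_view(chains_by_passage)
for passage, words in topic_view.items():
    print(passage, words)
```

The limit of three words per passage mirrors the "1-3 topic words per passage" stated above; everything else is a placeholder for whatever features the real system uses.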
Project Context and Motivation
[Figure: Topic View. The running text of each chapter (Chapter 1.1, Chapter 1.2) contains highlighted topic words; the topic view lists the chapters (Chapter 1.1, Chapter 1.2, Chapter 1.3 …, Chapter 2 …, Chapter 3.1 …) together with their topic words (topic word 1, topic word 2, topic word 3).]
Lexical Chaining – Evaluation Steps
• To evaluate our chainer, called GLexi, test data is required;
• experiments to develop such a gold standard for German show:
– manual annotation of lexical chains is very demanding,
– the rich interaction between various principles for achieving a cohesive text structure distracts annotators;
• results of these experiments are partially reported in Stührenberg et al., 2007.
• Our conclusion: evaluation of a lexical chainer might best be performed in several steps.
– Evaluation of coverage
– Evaluation of disambiguation quality
– Evaluation of semantic relatedness measures
– Evaluation of chains with respect to a specific application
Lexical Chaining – Evaluation Steps
• Remainder of talk:
– very short presentation of GLexi's architecture and
– exemplary demonstration of the applicability of the evaluation procedure
• Resources used:
– GermaNet (version 5.0)
– HyTex corpus (specialized text), approx. 29,000 (noun) tokens
– set of word pairs + results of human judgement experiment
– German word frequency list (thanks to Sabine Schulte im Walde)
Lexical Chaining - GLexi
Basic modules:
– preprocessing of corpora: tokenization, POS tagging, chunking, chaining candidate selection
– core algorithm: lexical semantic look-up, scoring of relation, sense disambiguation
– output creation: rating of chain strength, application-specific representation
Lexical Chaining - GLexi
Preprocessing
Lexical Chaining - GLexi
Core algorithm:
lexical semantic look-up
Outline
• Project Context and Motivation
• Lexical Chaining – Evaluation Steps
1. Preprocessing and Coverage
2. Sense Disambiguation
3. Semantic Relatedness/Similarity
4. Application
• Open Issues and Future Work
Step 1: Coverage
approx. 29,000 (noun) tokens in our corpus split into:
– 56 % in GermaNet
– 44 % not in GermaNet, of these:
  – 15 % inflected
  – 12 % compounds
  – 17 % remaining, uncovered nouns
• Coverage without preprocessing: approx. 56 %
• Approx. 44 % not included in chaining
→ preprocessing necessary to improve coverage!
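As a worked example of the breakdown above, the following sketch adds up the shares to show what coverage could be reached if every inflected form and every compound were resolved correctly. The variable names are illustrative, and the token counts are only estimates derived from the approximate corpus size of 29,000 noun tokens.

```python
# Shares of all noun tokens in the corpus, in percent (figures from the slide)
TOTAL_NOUN_TOKENS = 29_000  # approximate corpus size
share = {"in_germanet": 56, "inflected": 15, "compounds": 12, "uncovered": 17}

# The four groups partition the corpus
assert sum(share.values()) == 100

# Upper bounds: they assume every inflected form and compound is resolved
coverage_raw = share["in_germanet"]
coverage_after_lemmatization = coverage_raw + share["inflected"]
coverage_after_compounds = coverage_after_lemmatization + share["compounds"]

# Estimated number of tokens that stay uncovered after both steps
uncovered_tokens = TOTAL_NOUN_TOKENS * share["uncovered"] // 100

print(coverage_raw, coverage_after_lemmatization, coverage_after_compounds)
print(uncovered_tokens)
```

These sums are ceilings rather than measured results, because a lemma or compound part found in GermaNet is not necessarily the right one.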
Step 1: Coverage
• Coverage without preprocessing: approx. 56 %
• Lemmatization: increases coverage to approx. 71 %
  – theoretical value, open issue: e.g. Medien → Medium (Engl. media → psychic or data carrier)
• Compound analysis: increases coverage to approx. 83 %
  – theoretical value, open issue: e.g. Agrarproduktion (Engl. agricultural production) → Agrar (Engl. agricultural) + Produkt (Engl. artifact) + Ion (Engl. ion [chem.])
• Simple Named Entity Recognition in preprocessing phase
• Open issues: abbreviations, foreign words, nominalized verbs
Remaining, uncovered nouns split into:
– 15 % Named Entities
– 30 % foreign words
– 25 % abbreviations
– 20 % nominalized verbs
Future work: include TermNet (domain-specific language) as a resource; for more information, see the talk by Lüngen et al. (tomorrow, session 6, 10:40 h).
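The compound example on this slide (Agrarproduktion analysed as Agrar + Produkt + Ion) can be reproduced with a naive splitter. The sketch below is not GLexi's compound analysis: the function and the four-word toy lexicon are hypothetical, and German linking elements are ignored. It only shows why dictionary-based splitting yields more than one analysis.

```python
def segmentations(word, lexicon):
    """Return every way to split word into parts that are all in lexicon.

    Exhaustive recursive search; adequate for single words, and it ignores
    linking elements such as -s- or -n-.
    """
    word = word.lower()
    if not word:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results

# Hypothetical toy lexicon standing in for a wordnet look-up
toy_lexicon = {"agrar", "produkt", "produktion", "ion"}

splits = segmentations("Agrarproduktion", toy_lexicon)
for parts in splits:
    print(" + ".join(parts))
```

With this lexicon both the intended reading (agrar + produktion) and the misleading one (agrar + produkt + ion) come out, so a splitter needs some preference, for instance for the analysis with the fewest parts, which is itself only a heuristic.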
Step 2: Chaining-based WSD
• Approx. 45 % of tokens in our corpus have more than 1 synset
• Basis for chaining-based WSD: manual annotation
• Compare manually annotated data and disambiguation decision of semantic measure

word A    word B       sense A    sense B    Wu-Palmer    rank
Text      Hypertext    1          1          0.9231       1
Text      Hypertext    2          1          0.8333       2

manually annotated word senses:
Text      Hypertext    1          1

The best Wu-Palmer value is assigned rank 1 and is compared with the manual annotation. Here the correct word senses (word A, sense A = 1 and word B, sense B = 1) are on rank 1: correct disambiguation.
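The comparison described on this slide can be sketched as follows: rank the candidate sense pairs by their relatedness value and check whether the pair on rank 1 matches the manual annotation. The two scores are the ones from the slide. The taxonomy depths in the Wu-Palmer calls are hypothetical and were chosen only because they reproduce those two values; the function names are illustrative as well.

```python
def wu_palmer(depth_a, depth_b, depth_lcs):
    """Wu-Palmer similarity from taxonomy depths.

    depth_lcs is the depth of the lowest common subsumer of both synsets.
    """
    return 2.0 * depth_lcs / (depth_a + depth_b)

def rank_sense_pairs(scores):
    """Sort (sense_a, sense_b) pairs by descending score; rank 1 comes first."""
    return sorted(scores, key=scores.get, reverse=True)

# Scores for the word pair Text / Hypertext, as shown on the slide
scores = {(1, 1): 0.9231, (2, 1): 0.8333}
gold = (1, 1)  # manually annotated senses

ranking = rank_sense_pairs(scores)
is_correct = ranking[0] == gold
print(ranking, is_correct)

# Hypothetical depths that happen to reproduce the slide's values
print(round(wu_palmer(6, 7, 6), 4), round(wu_palmer(6, 6, 5), 4))
```

Counting how often `is_correct` holds over all annotated word pairs gives the share of correct disambiguations for one measure.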
Step 2: Chaining-based WSD
• Performance of chaining-based WSD: mediocre!
• Best semantic measures (Resnik, Wu-Palmer, and Lin):
– approx. 50-60 % correct disambiguation compared to manual annotation
– majority voting increased performance to approx. 63-65 %
• Future work:
– include WSD in preprocessing?
– machine learning based new measure?
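Majority voting over several measures could look like the sketch below. The slides do not say how votes were combined or how ties were resolved, so both the voting scheme and the tie-breaking rule (the first decision seen wins) are assumptions, and the example decisions are invented.

```python
from collections import Counter

def majority_vote(decisions):
    """Return the sense pair chosen by most measures.

    decisions: dict mapping a measure name to its rank-1 sense pair.
    Assumption: on a tie, the decision encountered first wins
    (Counter.most_common keeps first-seen order for equal counts).
    """
    counts = Counter(decisions.values())
    return counts.most_common(1)[0][0]

# Hypothetical rank-1 decisions of three measures for one word pair
decisions = {"resnik": (1, 1), "wu_palmer": (1, 1), "lin": (2, 1)}
voted = majority_vote(decisions)
print(voted)
```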
Step 3: Semantic Relatedness
• Implemented 11 semantic relatedness measures (SRM) (GermaNet: 8 measures, Google co-occurrence counts: 3 measures)
– focus of this talk: GermaNet measures
• Evaluation of SRM performance used results of a human judgement experiment:
– list of 100 word pairs, subjects' judgement of "semantic distance" (35 subjects) on a 5-level scale
– compare human judgement and SRM values
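One common way to compare human judgements with SRM values is a correlation coefficient. The sketch below computes Pearson's r by hand. Which coefficient was actually used is not stated on the slides, so Pearson is an assumption here, and the five data points are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: mean human rating per word pair on a 0-4 scale,
# and the value one relatedness measure assigns to the same pairs
human = [0, 1, 2, 3, 4]
srm = [0.10, 0.80, 0.20, 0.90, 0.30]

r = pearson(human, srm)
print(f"r = {r:.2f}")
```

A rank-based coefficient such as Spearman's rho may be the better fit for a 5-level scale, since it does not assume a linear relation between the two value ranges.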
Step 3: Semantic Relatedness
almost 2/3 extreme values (not related / strongly related)
Step 3: Semantic Relatedness
Human judgement experiment: example of the results
[Figure: judgement results for example word pairs, glossed in English as printer / fin and fluid / water]
Step 3: Semantic Relatedness
[Figure: relatedness (0.00 to 1.00) of word pairs ordered by relatedness value; series: Human Judgement, Resnik]
all SRM values scatter
correlation between human judgement and SRM values is low
Step 3: Semantic Relatedness
Open issues – semantic relatedness:
– continuous SRM values necessary / helpful?
  instead: classes (e.g. 3 classes: not related, related, strongly related)
  machine learning (ML) experiment using parameters of SRM
– interactions between SRM quality and disambiguation quality?
– combination of GermaNet and Google co-occurrence based measures (and further resources) useful?
  integration in ML experiment?
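The idea of replacing continuous SRM values by three classes can be illustrated with fixed thresholds. The cut-off values below are arbitrary placeholders; the slides suggest learning the mapping in an ML experiment rather than setting it by hand.

```python
def to_class(value, low=0.33, high=0.66):
    """Map a relatedness value in [0, 1] to one of three classes.

    low and high are hypothetical thresholds, not values from the talk.
    """
    if value < low:
        return "not related"
    if value < high:
        return "related"
    return "strongly related"

# Hypothetical SRM values for three word pairs
labels = [to_class(v) for v in (0.05, 0.50, 0.92)]
print(labels)
```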
Step 4: Application-oriented Evaluation
Example: newspaper article about child poverty in Germany
Topic words according to lexical chaining results
Kind, Engl. child
Geld, Engl. money
Deutschland, Engl. Germany
Step 4: Application-oriented Evaluation
• Features used in calculation of topic words and views:
– chain / meta-chain info: link density, link strength
– in addition to chains: frequency (relative to passage and document), mark-up
• application-oriented evaluation: gold standard of topic words, topic views, and topic chains necessary
• manual annotation of topic words and topic views: work in progress; current annotation agreement > 75 % (before reconciliation)
• initial results show: link density and frequency are the most relevant features
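How the two most relevant features might be combined into a topic-word score is sketched below. The slides define neither feature precisely, so the definitions (link density as the share of chain links that involve a word, frequency relative to the passage), the equal weights, and the toy passage are all assumptions made for illustration.

```python
def link_density(word, chain_links):
    """Share of chain links in the passage that involve word (assumed definition)."""
    if not chain_links:
        return 0.0
    touching = sum(1 for a, b in chain_links if word in (a, b))
    return touching / len(chain_links)

def relative_frequency(word, tokens):
    """Frequency of word relative to all noun tokens of the passage."""
    return tokens.count(word) / len(tokens) if tokens else 0.0

def topic_score(word, chain_links, tokens, w_density=0.5, w_freq=0.5):
    """Weighted sum of both features; the weights are hypothetical."""
    return (w_density * link_density(word, chain_links)
            + w_freq * relative_frequency(word, tokens))

# Hypothetical passage: noun tokens and the chain links found between them
tokens = ["Kind", "Armut", "Geld", "Kind", "Deutschland", "Kind"]
links = [("Kind", "Armut"), ("Kind", "Geld"), ("Geld", "Armut")]

ranked = sorted(set(tokens), key=lambda w: (-topic_score(w, links, tokens), w))
print(ranked)
```

Link strength and mark-up, the other two features named above, could enter the sum in the same way once they are defined.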
Outlook and Future Work
To sum up:
– Application: use lexical chaining for construction of topic views
– Lexical chaining for German corpora: several challenges (coverage, disambiguation, SRM)
– room for improvement: disambiguation and SRM; possible solutions:
  • WSD as preprocessing step
  • alternative SRM (potentially ML based)
  • additional resources
– initial results using lexical chains for construction of topic views: chaining useful!
Thank you!
Comments, ideas, questions are very welcome.
Literature (back-up slide)
• Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources at NAACL-2001, Pittsburgh, PA, June 2001.
• M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.
• Graeme Hirst and David St-Onge. 1998. Lexical chains as representation of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An electronic lexical database, chapter 13, pages 305–332. The MIT Press, Cambridge, MA.
• Alexander Mehler. 2005. Lexical chaining as a source of text chaining. In Proceedings of the 1st Computational Systemic Functional Grammar Conference, Sydney.
• Gregory H. Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496.