Comparative Analysis of Automatic Term and Collocation Extraction
INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
FF & FER
Comparative Analysis of Automatic Term and Collocation Extraction
Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing
Overview
I. Introduction
  – Reasons for extraction
II. Research
  – Resources & tools
  – Extracted lists
III. Evaluation
  – Precision, recall, F-measure
IV. Conclusion
I. Introduction
• Monolingual and multilingual resources
  – Helpful
  – Integrated
  – Require human intervention
• EU pre-accession activities
  – Speed up + consistency
• Used in further research and practice
• List:
  – Terms (Member State, European Union)
  – Collocations (adopt a/the resolution, decided as follows)
  – Multi-word units (depend on, well-being)
• Term extraction process:
  – Term extraction (term acquisition): identification
  – Term recognition: verification
II. Research
• Resources
  – 10 documents: legislation, Croatian-English
• Tools
  – TermeX tool (FER): list A
  – SDL MultiTerm Extract + NooJ (FF): list B
• Reference list
  – Used for evaluation
Reference list
• 470 terms and collocations
• Exclude unigrams
• Balance between lexical coverage, adequacy, practicality
  – terms (NPs: 346/470)
  – collocations (VPs)
Reference list
• Contains:
  – Terms (acquiring company, applicant country)
  – Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
  – Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
  – Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)
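
The "embedded terms" in the last bullet are reference terms that also occur inside longer reference terms. A minimal sketch of how such nesting could be detected is shown below; the function name `embedded_terms` and the whole-word matching rule are assumptions made for illustration, not the procedure the authors used to build the list.

```python
def embedded_terms(terms):
    """Map each term to the longer terms that contain it as a whole-word sequence."""
    words = {term: term.split() for term in terms}
    nested = {}
    for short, short_words in words.items():
        n = len(short_words)
        containers = []
        for long, long_words in words.items():
            if len(long_words) <= n:
                continue  # only strictly longer terms can contain `short`
            # Slide a window of n words over the longer term.
            if any(long_words[i:i + n] == short_words
                   for i in range(len(long_words) - n + 1)):
                containers.append(long)
        if containers:
            nested[short] = containers
    return nested


# Example terms taken from the slide above.
terms = [
    "crime prevention",
    "crime prevention bodies",
    "national crime prevention measures",
]
print(embedded_terms(terms))
```

With the three example terms from the slide, only "crime prevention" is reported as embedded, inside both longer terms.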
List B
• Language-independent, statistically based SDL MultiTerm Extract tool
  – Frequency threshold set to 4
  – Filtered by the list of stop-words -> 369 candidates
• Language-dependent NooJ tool
  – 36 local grammars -> 512 candidates
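
The statistical step described for List B (a frequency threshold of 4 plus a stop-word filter) can be illustrated with a small sketch. This is not the SDL tool's actual algorithm, which the slides do not describe. The function name `candidate_ngrams`, the `max_n` parameter, the tokenizer, and the rule of rejecting n-grams that start or end with a stop word are all assumptions.

```python
import re
from collections import Counter


def candidate_ngrams(text, stop_words, min_freq=4, max_n=4):
    """Return multi-word n-grams (2..max_n words) seen at least min_freq times.

    Assumption: an n-gram is rejected if its first or last word is a stop word.
    Note that this simple version also counts n-grams across sentence boundaries.
    """
    # Unicode-aware tokens: letter sequences, optionally hyphenated.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Toy input (hypothetical, not from the evaluated corpus).
text = "The European Union and the Member State. " * 4 + "A single mention."
print(candidate_ngrams(text, stop_words={"the", "and", "a"}, max_n=2))
```

Unigrams are skipped on purpose, in line with the reference list, which excludes them.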
List A
• TermeX
  – Lexical association measures (AMs)
  – 14 AMs (PMI, Dice, Chi-square, …)
  – Lemmatization
  – POS filtering
  – Frequency threshold set to ?
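
For orientation, two of the association measures named above are commonly defined for a word pair as follows. The notation is an assumption introduced here: $f(\cdot)$ is a corpus frequency and $P(\cdot)$ the corresponding relative frequency. The slides do not state which exact variants TermeX implements.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)},
\qquad
\mathrm{Dice}(w_1, w_2) = \frac{2\, f(w_1 w_2)}{f(w_1) + f(w_2)}
```

Both measures score a pair highly when the two words occur together more often than their individual frequencies would suggest.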
List A
• Extracted terms ranked by AM value
  – 1816 candidates
• AMs used:
  – 2-grams: PMI
  – 3-grams, 4-grams: heuristic extensions
• Noun phrases only
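
Ranking 2-gram candidates by PMI, as described above, might look like the following sketch. It is a generic illustration, not TermeX code: the function name `rank_bigrams_by_pmi` and the `min_freq` default are assumptions (the slides leave the actual threshold open). Lemmatization, POS filtering, and the heuristic extensions for 3-grams and 4-grams are omitted because the slides do not specify them.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=2):
    """Return [((w1, w2), pmi), ...] sorted from highest to lowest PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # PMI is unreliable for rare pairs
        p_pair = freq / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy token sequence (hypothetical).
tokens = ("member state of the european union "
          "and the member state of the union").split()
for pair, score in rank_bigrams_by_pmi(tokens):
    print(" ".join(pair), round(score, 3))
```

The frequency cut-off matters because PMI tends to overrate pairs that occur only once or twice.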
Results
• Evaluation
  – F1-measure (precision, recall)
  – True positives calculated by taking into account inflection (suffix stripping)

| | List A | List B |
|---|---|---|
| No. of terms | 1816 | 508 |
| Valid terms | 202 | 234 |
| Precision (%) | 11.56 | 47.37 |
| Recall (%) | 42.98 | 49.79 |
| F1 (%) | 18.22 | 48.55 |
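
The relationship between the rows of the table can be checked with a short sketch. It assumes that recall is the number of valid terms divided by the 470 entries of the reference list, and that F1 is the harmonic mean of precision and recall. The precision values are taken as reported rather than recomputed from the counts, since the slides do not spell out how inflected variants were counted. The names `f1_score`, `REFERENCE_SIZE`, and `reported` are introduced here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


REFERENCE_SIZE = 470  # terms and collocations in the reference list

# Values copied from the results table (percentages).
reported = {
    "List A": {"valid": 202, "precision": 11.56, "recall": 42.98},
    "List B": {"valid": 234, "precision": 47.37, "recall": 49.79},
}

for name, row in reported.items():
    recall = 100 * row["valid"] / REFERENCE_SIZE
    f1 = f1_score(row["precision"], row["recall"])
    print(f"{name}: recall = {recall:.2f} %, F1 = {f1:.2f} %")
```

Under these assumptions the computed recall and F1 values agree with the table for both lists.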
Results
• List A unsatisfactory
  – Low recall: verb phrases and terms consisting of more than 4 words are not covered
  – Low precision: ranked list, can be improved with cut-off (true positives are better ranked)
• List B modest
  – can be improved with lemmatization, definition of upper/lower cases, more detailed local grammar
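
The cut-off idea for List A can be made concrete with precision at rank k: if true positives cluster near the top of the ranked list, keeping only the top k candidates raises precision. The sketch below uses made-up toy data. The function name `precision_at_k` and the exact-match comparison are assumptions, whereas the evaluation in the slides matched inflected forms via suffix stripping.

```python
def precision_at_k(ranked_candidates, reference, k):
    """Fraction of the top-k ranked candidates that appear in the reference set."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    hits = sum(1 for candidate in top if candidate in reference)
    return hits / len(top)


# Hypothetical ranked output and reference list, for illustration only.
ranked = ["member state", "european union", "of the",
          "crime prevention", "in the", "and the"]
reference = {"member state", "european union", "crime prevention"}

for k in (2, 4, 6):
    print(k, precision_at_k(ranked, reference, k))
```

A lower cut-off trades recall for precision, so k would need to be tuned against the reference list.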
Conclusion
• Comparison of two hybrid approaches to term extraction
• Human-created lists differ from extracted lists
  – human knowledge, experience and intuition
• Space for improvement
  – automatic extraction combined with human intervention
Thank you!