![Page 1: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/1.jpg)
Extracting bilingual terminologies from comparable corpora
By: Ahmet Aker, Monica Paramita, Robert Gaizauskas
CS671: Natural Language Processing
Prof. Amitabha Mukerjee
Presented By: Ankit Modi (10104)
![Page 2: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/2.jpg)
Introduction
» Bilingual terminologies are important for various applications of human language technologies
» Earlier studies may be distinguished by whether they work on parallel or comparable corpora
» Focusing on comparable corpora is crucial because parallel corpora are not readily available for every language pair
![Page 3: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/3.jpg)
Task
To extract bilingual terminologies from comparable corpora
Comparable corpora: A collection of source-target language document pairs that are not direct translations but are topically related.
![Page 5: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/5.jpg)
Method
» Pair each term extracted from the source-language documents (S) with each term extracted from the target-language documents (T)
Term: Contiguous sequence of words (No particular syntactic restriction)
» Treat term alignment as a binary classification task
» Extract features for each S-T potential term pair
» Decide whether each pair is a term equivalent or not (SVM binary classifier with a linear kernel)
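The pairing step above can be sketched as a plain Cartesian product of the two term lists. This is a minimal illustration, not the authors' code: the function name `candidate_pairs` and the toy English and German terms are assumptions made for the example.

```python
from itertools import product


def candidate_pairs(source_terms, target_terms):
    """Return every (source, target) combination as a candidate term pair."""
    return list(product(source_terms, target_terms))


# Toy term lists (illustrative only, not taken from the paper's data).
source_terms = ["air pollution", "water quality"]
target_terms = ["Luftverschmutzung", "Wasserqualität", "Bodenschutz"]

pairs = candidate_pairs(source_terms, target_terms)
print(len(pairs))  # 2 source terms * 3 target terms = 6 candidates
```

Each candidate pair would then be turned into a feature vector and passed to the binary classifier, which keeps only the pairs it labels as term equivalents.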
![Page 8: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/8.jpg)
Feature Extraction
» Dictionary Based Features
1. isFirstWordTranslated (binary feature)
2. isLastWordTranslated
3. percentageOfTranslatedWords
4. percentageOfNotTranslatedWords
![Page 9: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/9.jpg)
Feature Extraction
» Dictionary Based Features (continued)
5. longestTranslatedUnitInPercentage
6. longestNotTranslatedUnitInPercentage
7. averagePercentageOfTranslatedWords
» The first 6 features are computed in both directions (S -> T and T -> S). In total, we have 6 × 2 + 1 = 13 Dictionary Based Features
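A rough sketch of how the six directional features could be computed for one direction (S -> T). The feature definitions are inferred from the feature names on the slides, so they are assumptions rather than the paper's exact formulas; the toy dictionary and the helper names `translated_flags` and `longest_run` are hypothetical. Terms are assumed to be non-empty.

```python
def translated_flags(src_words, tgt_words, dictionary):
    """For each source word, True if one of its dictionary translations is in the target term."""
    tgt = set(tgt_words)
    return [bool(dictionary.get(w, set()) & tgt) for w in src_words]


def longest_run(flags, value):
    """Length of the longest contiguous run of `value` in `flags`."""
    best = run = 0
    for f in flags:
        run = run + 1 if f == value else 0
        best = max(best, run)
    return best


def dictionary_features(src_term, tgt_term, dictionary):
    """Dictionary-based features for one direction (assumed definitions)."""
    src_words = src_term.lower().split()
    tgt_words = tgt_term.lower().split()
    flags = translated_flags(src_words, tgt_words, dictionary)
    n = len(flags)
    return {
        "isFirstWordTranslated": int(flags[0]),
        "isLastWordTranslated": int(flags[-1]),
        "percentageOfTranslatedWords": sum(flags) / n,
        "percentageOfNotTranslatedWords": (n - sum(flags)) / n,
        "longestTranslatedUnitInPercentage": longest_run(flags, True) / n,
        "longestNotTranslatedUnitInPercentage": longest_run(flags, False) / n,
    }


# Toy English -> German dictionary (illustrative only).
toy_dictionary = {"renewable": {"erneuerbare"}, "energy": {"energie"}}

feats = dictionary_features("renewable energy policy", "erneuerbare Energie", toy_dictionary)
print(feats)
```

Running the same function with a T -> S dictionary and the arguments swapped would give the other six directional values.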
![Page 10: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/10.jpg)
Feature Extraction
» Cognate Based Features
1. Longest Common Subsequence Ratio (LCSR). Ex: LCSR(‘dollar’, ‘dolari’) = 5/6
2. Longest Common Substring Ratio (LCSTR). Ex: LCSTR(‘dollar’, ‘dolari’) = 3/6
3. Dice Similarity: Dice = 2 * LCST / (len(X) + len(Y)), where LCST is the length of the longest common substring of X and Y
![Page 11: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/11.jpg)
Feature Extraction
» Cognate Based Features (continued)
4. Needleman-Wunsch Distance (NWD): NWD = LCST / min[len(X), len(Y)]
5. Levenshtein Distance, normalized (LDn): LDn = 1 - (LD / max[len(X), len(Y)])
» We have 5 Cognate Based Features
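The five cognate measures can be sketched with standard dynamic programming. This is an illustration under stated assumptions: LCSR and LCSTR are normalized by the longer string (which matches the 5/6 and 3/6 examples), NWD is read as LCST divided by the length of the shorter string, and both strings are assumed non-empty. The function names are chosen for this example.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcst_length(x, y):
    """Length of the longest common substring (characters must be adjacent)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def levenshtein(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def cognate_features(x, y):
    """The five cognate-based scores for a pair of non-empty strings."""
    longest, shortest = max(len(x), len(y)), min(len(x), len(y))
    lcst = lcst_length(x, y)
    return {
        "LCSR": lcs_length(x, y) / longest,
        "LCSTR": lcst / longest,
        "Dice": 2 * lcst / (len(x) + len(y)),
        "NWD": lcst / shortest,
        "LDn": 1 - levenshtein(x, y) / longest,
    }


scores = cognate_features("dollar", "dolari")
print(scores["LCSR"], scores["LCSTR"])  # 5/6 and 3/6, as in the slide examples
```

For this pair the Levenshtein distance is 2 (delete one "l", append "i"), so LDn = 1 - 2/6.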
![Page 12: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/12.jpg)
Feature Extraction
» Cognate Based Features with Term Matching
Needed for language pairs whose writing systems differ, since plain cognate features assume both terms use a common character set
A source term is mapped into the target writing system, or vice versa
The same 5 cognate features as before are then calculated in both directions
» This gives 5 × 2 = 10 such features
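A toy illustration of why the mapping step matters: without it, a Latin-script term and a Cyrillic-script term share no characters, so every cognate score is zero. The character table `CYRILLIC_TO_LATIN` is a hypothetical hand-written map for this one word, not the mapping procedure used in the paper, and `map_term` and `lcsr` are names made up for the sketch.

```python
# Hypothetical character map, written with escapes so the code points are unambiguous.
CYRILLIC_TO_LATIN = {
    "\u0434": "d",  # Cyrillic de
    "\u043e": "o",  # Cyrillic o
    "\u043b": "l",  # Cyrillic el
    "\u0430": "a",  # Cyrillic a
    "\u0440": "r",  # Cyrillic er
}


def map_term(term, char_map):
    """Rewrite a term character by character; unmapped characters are kept as-is."""
    return "".join(char_map.get(ch, ch) for ch in term)


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(x, y):
    """Longest common subsequence ratio, normalized by the longer string."""
    return lcs_length(x, y) / max(len(x), len(y))


cyrillic_term = "\u0434\u043e\u043b\u0430\u0440"  # a Cyrillic spelling of "dolar"
mapped_term = map_term(cyrillic_term, CYRILLIC_TO_LATIN)

print(lcsr("dollar", cyrillic_term))  # 0.0: no shared characters before mapping
print(lcsr("dollar", mapped_term))    # 5/6 after mapping to "dolar"
```

The same idea applies in the other direction, mapping the Latin-script term into the other writing system before comparing.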
![Page 13: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/13.jpg)
Feature Extraction
» Combined Features (a word counts as covered if it is translated or transliterated)
1. isFirstWordCovered
2. isLastWordCovered
3. percentageOfCoverage
4. percentageOfNonCoverage
5. difBetweenCoverageAndNonCoverage
» Each is calculated in both directions, giving 5 × 2 = 10 Combined Features
![Page 14: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/14.jpg)
Feature Extraction
» We have 38 features in total
Dictionary based features: 13
Cognate based features: 5
Cognate based features with term matching: 10
Combined features: 10
![Page 15: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/15.jpg)
Evaluation 1
» Positive and negative example term pairs are created to test the classifier
» Precision, recall and F-score are calculated
» Precision ranges from 67 to 100 percent
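As a reminder of how these three scores are derived from labelled examples, here is a small sketch. The gold and predicted labels are made-up toy values, not results from the paper, and the function name is chosen for this example.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall and F1 for binary labels (1 = term equivalent, 0 = not)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: three positive and three negative examples.
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)  # 2 true positives, 1 false positive, 1 false negative
```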
![Page 16: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/16.jpg)
Evaluation 2
» Manual Evaluation
» Human assessors are asked to assign each term pair to one of four categories:
Equivalence, Inclusion, Overlap and Unrelated
» Over 80 percent of the term pairs were assessed as Equivalence, the first category
![Page 17: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/17.jpg)
Dataset
» Training data taken from the EUROVOC thesaurus
» English-German term-tagged comparable corpora for manual evaluation
![Page 18: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/18.jpg)
Thank You
![Page 19: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/19.jpg)
Manual Evaluation
» Equivalence: The terms are exact translations/transliterations of each other
» Inclusion: An exact translation/transliteration of one term is contained within the other
» Overlap: The terms share at least one translated/transliterated word
» Unrelated: No word in either term is a translation/transliteration of a word in the other
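The four categories can be made concrete with a word-level approximation. In the study the judgments were made by human assessors; the rule below is only an illustrative reading of the definitions, and the `is_equivalent` predicate and the toy word pairs are hypothetical.

```python
def categorize(src_words, tgt_words, is_equivalent):
    """Assign a term pair to one of the four categories (word-level approximation)."""
    src_hit = [any(is_equivalent(s, t) for t in tgt_words) for s in src_words]
    tgt_hit = [any(is_equivalent(s, t) for s in src_words) for t in tgt_words]
    if all(src_hit) and all(tgt_hit):
        return "Equivalence"  # every word on both sides has a counterpart
    if all(src_hit) or all(tgt_hit):
        return "Inclusion"    # one term is fully contained in the other
    if any(src_hit):
        return "Overlap"      # at least one shared word
    return "Unrelated"


# Toy set of word-level translations (illustrative only).
KNOWN_PAIRS = {("renewable", "erneuerbare"), ("energy", "energie"), ("policy", "politik")}


def toy_equivalent(src_word, tgt_word):
    return (src_word, tgt_word) in KNOWN_PAIRS


target = ["erneuerbare", "energie"]
print(categorize(["renewable", "energy"], target, toy_equivalent))  # Equivalence
print(categorize(["energy"], target, toy_equivalent))               # Inclusion
print(categorize(["energy", "policy"], target, toy_equivalent))     # Overlap
print(categorize(["water"], target, toy_equivalent))                # Unrelated
```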
![Page 20: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/20.jpg)
Error
» The error percentage was generally low
» Reason for errors:
Words with very similar spellings but completely different meanings
![Page 21: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/21.jpg)
SVM Binary Classifier
» Pair each term extracted from S with each term extracted from T
» Treat term alignment as a binary classification task
» Linear kernel
» Parameter for the trade-off between training error and margin: c = 10
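For reference, the trade-off parameter c plays the role of C in the standard soft-margin SVM objective, shown here in its usual textbook form rather than as a formulation taken from the paper. The symbols w, b, ξ_i, x_i (the feature vector of the i-th term pair) and y_i ∈ {-1, +1} (term equivalent or not) are notation assumed for this sketch.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0
$$

A larger C penalizes training errors more heavily at the cost of a narrower margin; the slides set c = 10.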
![Page 22: Extracting bilingual terminologies from comparable corpora](https://reader035.vdocuments.site/reader035/viewer/2022062520/568162a9550346895dd32938/html5/thumbnails/22.jpg)
Future Work
» Investigate the usefulness of the extracted term pairs in application scenarios such as machine translation