Bilingual terminology extraction
Vít Baisa
6th Sketch Engine Workshop
Herstmonceux, August 10, 2015
Vít Baisa (Lexical Computing Ltd.) August 10, 2015 1 / 12
Terminology extraction: recap
- combination of rules & statistics
- languages: Czech, Dutch, English, French, German, Chinese Simplified, Chinese Traditional, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish
- you can help us to add your language
- currently in progress: Turkish, Hungarian
Terminology extraction: what is a term?
- unithood: how is it grammatically defined? (e.g. noun phrases)
- termhood: does it belong to a domain?
Unithood
- Sketch Grammar formalism
- CQL (corpus query language) rules

# English, computer mouse
*COLLOC "%(2.lc)_%(1.lc)"
2:[tag=="NN" | tag=="JJ" | tag=="VVG"] 1:[tag=="NN"]

# German, kleines Haus
*COLLOC "%(2.adj_stem)%(1.gender_ending)_%(1.lemma_cap)"
2:[kind="ADJA"] 1:[kind="N"] & 1.case = 2.case

# Czech, Ústav národního zdraví
*COLLOC "%(1.gender_lemma)_%(2.lc)_%(3.lc)"
1:noun 2:adj_genitive 3:noun_genitive & agree(2,3)
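
To make the English rule more concrete, here is a minimal Python sketch of the same idea: scan a POS-tagged sentence for a modifier tagged NN, JJ or VVG followed directly by a noun, and emit the pair in lowercase joined by an underscore. It is an illustration only, not how Sketch Engine evaluates its grammars. The names `extract_candidates`, `MODIFIER_TAGS` and `HEAD_TAG` are hypothetical, and it assumes that `lc` in the rule stands for the lowercased word form.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag)

# Tags taken from the English rule on the slide.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}  # allowed tags for token 2 (the modifier)
HEAD_TAG = "NN"                      # required tag for token 1 (the head noun)


def extract_candidates(tokens: List[Token]) -> List[str]:
    """Return 'modifier_head' strings for every adjacent pair matching the rule."""
    candidates = []
    # Walk over all adjacent token pairs in the sentence.
    for (mod_word, mod_tag), (head_word, head_tag) in zip(tokens, tokens[1:]):
        if mod_tag in MODIFIER_TAGS and head_tag == HEAD_TAG:
            # Assumption: mirrors the "%(2.lc)_%(1.lc)" template with lowercased forms.
            candidates.append(f"{mod_word.lower()}_{head_word.lower()}")
    return candidates


# Hypothetical tagged sentence, invented for illustration.
sentence = [("The", "DT"), ("Computer", "NN"), ("mouse", "NN"),
            ("is", "VBZ"), ("broken", "JJ")]
print(extract_candidates(sentence))  # ['computer_mouse']
```

The German and Czech rules additionally need agreement checks and morphological generation, which this sketch does not attempt.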
Termhood
- simple math, parameter N

$$\frac{f_{\mathrm{focus}} + N}{f_{\mathrm{ref}} + N}$$

- f is the relative (per million) frequency of a term
- the formula is also used for keyword extraction
- N influences whether rare or frequent words are preferred
- a reference corpus in the same language is needed
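
The effect of N can be checked with a few lines of Python. The sketch below computes the score as the ratio of the focus-corpus frequency plus N to the reference-corpus frequency plus N, for two made-up candidates. The function names and all frequencies are hypothetical and chosen only to show the ranking flip.

```python
def simple_maths(f_focus: float, f_ref: float, n: float = 1.0) -> float:
    """Score = (f_focus + N) / (f_ref + N), with frequencies per million tokens."""
    return (f_focus + n) / (f_ref + n)


# Hypothetical candidates: (per-million frequency in focus corpus, in reference corpus).
candidates = {
    "rare_term": (5.0, 0.0),
    "frequent_term": (500.0, 100.0),
}


def rank(n: float) -> list:
    """Return candidate names sorted by descending score for a given N."""
    return sorted(candidates,
                  key=lambda term: simple_maths(*candidates[term], n),
                  reverse=True)


print(rank(1))    # ['rare_term', 'frequent_term']
print(rank(100))  # ['frequent_term', 'rare_term']
```

With N = 1 the rare candidate scores 6.0 against roughly 4.96 for the frequent one. With N = 100 the scores become 1.05 and 3.0, so the frequent candidate comes first.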
Fine-tuning: options
- stoplists (blacklists)
- simple math parameter
- minimum frequency
- minimum score
- minimum character length
- only alphanumerical strings
- ...
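
The options listed above act as filters on the candidate list. A minimal sketch of how such filters could be combined is shown here. The `Candidate` class, the `keep` function and its parameter names are hypothetical and do not reflect the actual Sketch Engine settings. Treating spaces as allowed in the alphanumeric check is also an assumption, made so that multi-word terms can pass.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    term: str
    freq: int      # raw frequency in the focus corpus
    score: float   # termhood score


def keep(c: Candidate, stoplist: Set[str], min_freq: int = 1,
         min_score: float = 0.0, min_length: int = 1,
         alnum_only: bool = False) -> bool:
    """Return True if the candidate passes every enabled filter."""
    if c.term.lower() in stoplist:
        return False
    if c.freq < min_freq or c.score < min_score:
        return False
    if len(c.term) < min_length:
        return False
    # Assumption: spaces between words are ignored for the alphanumeric check.
    if alnum_only and not c.term.replace(" ", "").isalnum():
        return False
    return True


# Hypothetical candidates, invented for illustration.
candidates: List[Candidate] = [
    Candidate("computer mouse", 12, 8.4),
    Candidate("next page", 40, 3.1),   # dropped by the stoplist
    Candidate("x1", 3, 9.0),           # dropped: too short
    Candidate("e-mail@", 15, 5.0),     # dropped: not alphanumeric
    Candidate("term bank", 2, 7.0),    # dropped: too infrequent
]

kept = [c.term for c in candidates
        if keep(c, stoplist={"next page"}, min_freq=3, min_score=2.0,
                min_length=3, alnum_only=True)]
print(kept)  # ['computer mouse']
```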
Vít Baisa (Lexical Computing Ltd.) August 10, 2015 7 / 12
![Page 10: Bilingual terminology extraction - Sketch Engine · PDF fileBilingual terminology extraction VítBaisa ... Terminology extraction: recap combinationofrules&statistics languages: Czech,Dutch,English,French,German,Chinese](https://reader036.vdocuments.site/reader036/viewer/2022062401/5aa3de3e7f8b9a80378ecc9c/html5/thumbnails/10.jpg)
Bilingual (multilingual) terminology extraction
- recent development
- parallel corpora needed

Two-step procedure:
1. extraction of terms in source and target languages
2. counting co-occurrences of the terms
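
The second step can be sketched in Python as follows. The input is a list of aligned segments where step 1 has already found the terms on each side. The code counts how often each source term and target term appear in the same aligned pair. The slides only mention counting co-occurrences, so the Dice coefficient used here to normalise the counts is an assumption, not necessarily the measure used in Sketch Engine. The data and the names `dice` and `best_translation` are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical output of step 1: terms found in each aligned segment pair
# (English source side, Czech target side).
aligned = [
    ({"computer mouse"}, {"počítačová myš"}),
    ({"computer mouse", "hard disk"}, {"počítačová myš", "pevný disk"}),
    ({"hard disk"}, {"pevný disk"}),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src_terms, tgt_terms in aligned:
    src_freq.update(src_terms)                       # segments containing each source term
    tgt_freq.update(tgt_terms)                       # segments containing each target term
    pair_freq.update(product(src_terms, tgt_terms))  # co-occurrences within one segment pair


def dice(src: str, tgt: str) -> float:
    """Dice coefficient of a source/target term pair (an assumed association measure)."""
    return 2 * pair_freq[(src, tgt)] / (src_freq[src] + tgt_freq[tgt])


def best_translation(src: str) -> str:
    """Return the target term with the highest Dice score for the given source term."""
    return max(tgt_freq, key=lambda tgt: dice(src, tgt))


print(best_translation("computer mouse"))  # počítačová myš
print(best_translation("hard disk"))       # pevný disk
```

Raw co-occurrence counts alone would favour target terms that are frequent everywhere, which is why some normalisation is used in this sketch.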
Bilingual terminology extraction
- we need to evaluate the extraction properly
- data can be saved as TBX
- granularity affects quality
The (not so distant) future for BTE
- parallel vs. comparable corpora
- definition finding
- term hypernym and hyponym finding
- term thesaurus
- the ultimate goal: one-click terminology :)
- terminology consistency checking
- multi- instead of bilingual extraction
The last slide
- API available
- IntelliWebSearch configurations
- plugins for SDL, Kilgray products planned
- one-off terminology extractions
- promising results so far