Bilingual terminology extraction

Vít Baisa

[email protected]

6th Sketch Engine Workshop
Herstmonceux, August 10, 2015

Vít Baisa (Lexical Computing Ltd.) August 10, 2015 1 / 12

Terminology extraction: recap

combination of rules & statistics
languages: Czech, Dutch, English, French, German, Chinese Simplified, Chinese Traditional, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish
you can help us to add your language
currently in progress: Turkish, Hungarian

Terminology extraction: what is a term?

unithood: how is it grammatically defined? (e.g. noun phrases)

termhood: does it belong to a domain?

Unithood

Sketch Grammar formalism
CQL (corpus query language) rules

# English, computer mouse
*COLLOC "%(2.lc)_%(1.lc)"
2:[tag=="NN" | tag=="JJ" | tag=="VVG"] 1:[tag=="NN"]

# German, kleines Haus
*COLLOC "%(2.adj_stem)%(1.gender_ending)_%(1.lemma_cap)"
2:[kind="ADJA"] 1:[kind="N"] & 1.case = 2.case

# Czech, Ústav národního zdraví
*COLLOC "%(1.gender_lemma)_%(2.lc)_%(3.lc)"
1:noun 2:adj_genitive 3:noun_genitive & agree(2,3)
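
As a rough illustration of what the English rule above expresses, the following Python sketch scans POS-tagged tokens for a noun, adjective or gerund followed by a noun and joins the lowercased words with an underscore, mirroring the "%(2.lc)_%(1.lc)" template. It is not the Sketch Grammar engine: the function name, the token representation and the tag sets are assumptions made for this example only.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag); a hypothetical representation

# Tags taken from the English rule: label 2 is the modifier, label 1 the head.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}
HEAD_TAGS = {"NN"}


def candidate_terms(tokens: List[Token]) -> List[str]:
    """Return two-word candidates where a modifier directly precedes a noun."""
    candidates = []
    # Walk over adjacent token pairs: (modifier, head).
    for (word2, tag2), (word1, tag1) in zip(tokens, tokens[1:]):
        if tag2 in MODIFIER_TAGS and tag1 in HEAD_TAGS:
            # Lowercase both words and join them, as "%(2.lc)_%(1.lc)" does.
            candidates.append(f"{word2.lower()}_{word1.lower()}")
    return candidates


tagged = [("The", "DT"), ("computer", "NN"), ("mouse", "NN"),
          ("is", "VBZ"), ("small", "JJ")]
print(candidate_terms(tagged))  # ['computer_mouse']
```

The German and Czech rules additionally need agreement checks (case, gender), which this sketch does not attempt.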

Termhood

simple math parameter N

$$\frac{f_{\mathrm{focus}} + N}{f_{\mathrm{ref}} + N}$$

f is the relative (per million) frequency of a term
the formula is also used for keyword extraction
N influences whether rare or frequent words are preferred
a reference corpus in the same language is needed
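
The score can be sketched in a few lines of Python. The function names and the sample frequencies below are made up for illustration; only the ratio itself comes from the slide. The example shows how N shifts the preference between a rare and a frequent term.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency: occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def simple_math_score(f_focus: float, f_ref: float, n: float = 1.0) -> float:
    """Ratio of smoothed per-million frequencies: (f_focus + N) / (f_ref + N)."""
    return (f_focus + n) / (f_ref + n)


# Hypothetical per-million frequencies (focus corpus, reference corpus).
rare = (5.0, 0.0)            # rare term, absent from the reference corpus
frequent = (1000.0, 200.0)   # frequent term, also common in the reference

for n in (1, 100):
    rare_score = simple_math_score(*rare, n=n)
    frequent_score = simple_math_score(*frequent, n=n)
    print(n, round(rare_score, 2), round(frequent_score, 2))
# With N = 1 the rare term scores higher; with N = 100 the frequent one does.
```

Adding N to the denominator also avoids division by zero when a term never occurs in the reference corpus.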

Fine-tuning: options

stoplists (blacklists)
simple math parameter
minimum frequency
minimum score
minimum character length
only alphanumerical strings
...
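
A minimal sketch of how such options could be combined into one filter over scored candidates. The function names, parameter names, default thresholds and sample data are all assumptions for illustration, not Sketch Engine's actual settings or API.

```python
from typing import Dict, List, Set, Tuple


def is_alphanumerical(term: str) -> bool:
    """True if every space-separated word consists only of letters and digits."""
    words = term.split()
    return bool(words) and all(word.isalnum() for word in words)


def filter_candidates(
    candidates: Dict[str, Tuple[int, float]],  # term -> (frequency, score)
    stoplist: Set[str],
    min_freq: int = 2,
    min_score: float = 2.0,
    min_length: int = 3,
) -> List[str]:
    """Apply the fine-tuning options and return surviving terms, best score first."""
    kept = []
    for term, (freq, score) in candidates.items():
        if term in stoplist:
            continue  # stoplist (blacklist)
        if freq < min_freq or score < min_score:
            continue  # minimum frequency / minimum score
        if len(term) < min_length:
            continue  # minimum character length
        if not is_alphanumerical(term):
            continue  # only alphanumerical strings
        kept.append((score, term))
    # Sort by descending score; ties are broken alphabetically.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [term for _, term in kept]


sample = {
    "parallel corpus": (12, 8.4),
    "the": (900, 2.5),
    "tbx": (1, 9.0),
    "term base": (5, 3.0),
    "c++": (10, 5.0),
}
print(filter_candidates(sample, stoplist={"the"}))
# ['parallel corpus', 'term base']
```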

Bilingual (multilingual) terminology extraction

recent development
parallel corpora needed

Two-step procedure
1. extraction of terms in source and target languages
2. counting co-occurrences of the terms
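
The second step can be sketched as follows, assuming the terms of step 1 are already available as lowercase strings and the parallel corpus is a list of aligned segment pairs. Matching by plain substring and ranking by raw count are simplifications chosen for this example; the slides do not say how the actual system matches or scores term pairs, and all names below are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_counts(
    aligned_segments: Iterable[Tuple[str, str]],
    source_terms: Set[str],
    target_terms: Set[str],
) -> Counter:
    """Count how often each (source term, target term) pair shares an aligned segment."""
    pair_counts: Counter = Counter()
    for source_segment, target_segment in aligned_segments:
        # Naive matching: a term "occurs" if it is a substring of the segment.
        found_source = {t for t in source_terms if t in source_segment.lower()}
        found_target = {t for t in target_terms if t in target_segment.lower()}
        pair_counts.update(product(found_source, found_target))
    return pair_counts


def best_translations(pair_counts: Counter) -> Dict[str, str]:
    """For each source term, pick the target term it co-occurs with most often."""
    best: Dict[str, Tuple[str, int]] = {}
    for (source, target), count in pair_counts.items():
        if source not in best or count > best[source][1]:
            best[source] = (target, count)
    return {source: target for source, (target, _) in best.items()}


segments = [
    ("The parallel corpus is large.", "Das Parallelkorpus ist umfangreich."),
    ("A parallel corpus needs alignment.", "Ein Parallelkorpus braucht Alignierung."),
    ("Alignment is hard.", "Alignierung ist schwierig."),
]
counts = cooccurrence_counts(
    segments,
    source_terms={"parallel corpus", "alignment"},
    target_terms={"parallelkorpus", "alignierung"},
)
print(best_translations(counts))
```

Raw counts favour frequent terms, so an association measure normalised by each term's own frequency would likely be a better ranking in practice.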

Bilingual terminology extraction

we need to evaluate the extraction properly
data can be saved as TBX
granularity affects quality

The (not so distant) future for BTE

parallel vs. comparable corpora
definition finding
finding hypernyms and hyponyms of terms
term thesaurus
the ultimate goal: one-click terminology :)
terminology consistency checking
multilingual instead of bilingual extraction

The last slide

API available
IntelliWebSearch configurations
plugins for SDL, Kilgray products planned
one-off terminology extractions
promising results so far
