Checking Terminology Consistency with Statistical Methods
LRC XIII
2nd October 2008
Alfredo Maldonado Guerra, Microsoft European Development Centre
Masaki Itagaki, Microsoft Corporation
About this presentation
- Introduction
- Internal Consistency Check
  - Step 1: Mine Source Terms
  - Step 2: Identify translations of Source Terms (Alignment)
  - Step 3: Consistency Check
- Current Challenges
- Tips
- Future Improvements
Introduction
- Terminology Consistency: a key element of localised language quality
- Terminology Consistency: difficult to maintain
  - Difficult to keep source and target in sync during the dev/loc process
  - Translation done by several people (often working remotely)
  - Terminology changes (e.g. between product versions)
- Manual Language Quality Assurance (QA) can help, however:
  - QA costs time and money
  - QA usually concentrates on a sample of the text
  - Reviewer must be familiar with reference material
  - It’s hard for humans to keep track of terminology
Introduction
Can we use technology to control consistency?
Yes, but…
- Existing tools require term lists or term bases
- Not all software companies have term bases set up
- Companies that do have term bases won’t have every single term captured: building a term base is always a work in progress
Introduction
- Our approach doesn’t require a term base
- By using Term Mining technology we identify terms in the source strings
- We then check the translation consistency of the terminology mined
Internal Consistency Check
[Diagram: steps 1, 2 and 3, with the callout “Inconsistency!”]
Step 1: Source Term Mining
Step 2: Translation Alignment
Problem statement:
Given a mined source term S, identify the corresponding target term T in the translation column.
Example:
- Mined term: “input field” (S)
- Corresponding target term: “champ d’entrée” (T)
Step 2: Translation Alignment
- We need to consider all possible term combinations
- We call each combination an NGram
- NGrams: where N = 2, 3, 4, maybe 5.
- For languages like German we even consider N = 1
- How do we decide which NGram is the correct translation for the term?
- Bayesian statistics can help!
Example target string:
Réattribue leurs valeurs initiales à tous les champs d'entrée.

NGrams with N = 2:
Réattribue leurs
leurs valeurs
valeurs initiales
initiales à
à tous
…

NGrams with N = 3:
Réattribue leurs valeurs
leurs valeurs initiales
…
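The NGram generation described on this slide can be sketched as follows. This is only an illustration: the function name `generate_ngrams`, the whitespace tokenisation and the punctuation stripping are assumptions, not the presenters' implementation.

```python
def generate_ngrams(segment, min_n=2, max_n=5):
    """Return every contiguous word NGram of the segment for N in [min_n, max_n]."""
    # Naive whitespace tokenisation (an assumption); trailing punctuation is stripped.
    tokens = [tok.strip(".,;:!?") for tok in segment.split()]
    tokens = [tok for tok in tokens if tok]
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


segment = "Réattribue leurs valeurs initiales à tous les champs d'entrée."
bigrams = generate_ngrams(segment, min_n=2, max_n=2)
trigrams = generate_ngrams(segment, min_n=3, max_n=3)
print(bigrams)
```

Passing `min_n=1` would also produce single words, which the slide suggests for languages like German.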
Step 2: Translation Alignment
Problem statement:
Given a source term S, obtain the NGram T that maximises the conditional probability function
$$\hat{T} = \arg\max_{T} P(T \mid S) \qquad [1]$$
But how do we calculate this?!
Step 2: Translation Alignment
$$\hat{T} = \arg\max_{T} P(T \mid S) \qquad [1]$$
Well, the multiplication rule of conditional probability tells us that
$$P(S \cap T) = P(T \mid S)\,P(S)$$
So [1] becomes:
$$\hat{T} = \arg\max_{T} \frac{P(S \cap T)}{P(S)} \qquad [2]$$
And we also know that:
$$P(S \cap T) = \frac{|\text{STSeg}|}{|\text{NGrams}|}$$
- |NGrams| is the number of NGrams of the same N as T. For example, if T is a 2-word term (a bigram), |NGrams| will be the number of NGrams made up of 2 words.
- |STSeg| is the number of segments (strings) that contain both S in the source column and T in the target column.
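As a rough illustration of the two counts defined above, the sketch below computes |STSeg| and |NGrams| over a tiny bilingual list. The helper names and the sample strings are made up for the example, and counting every NGram occurrence (rather than distinct NGrams) is an assumption, since the slide does not say which is meant.

```python
def ngrams_of(segment, n):
    """All contiguous n-word NGrams of a segment, lower-cased."""
    tokens = segment.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_st_seg(pairs, source_term, target_ngram):
    """|STSeg|: segments with S in the source column and T in the target column."""
    source_term = source_term.lower()
    target_ngram = target_ngram.lower()
    n = len(target_ngram.split())
    return sum(
        1
        for source, target in pairs
        if source_term in source.lower() and target_ngram in ngrams_of(target, n)
    )


def count_ngrams(pairs, n):
    """|NGrams|: number of target NGrams made up of n words (all occurrences)."""
    return sum(len(ngrams_of(target, n)) for _, target in pairs)


# Hypothetical (source, target) segments, invented for this sketch.
pairs = [
    ("Clear the input field", "Effacer le champ d'entrée"),
    ("The input field is empty", "Le champ d'entrée est vide"),
    ("Save the file", "Enregistrer le fichier"),
]

st_seg = count_st_seg(pairs, "input field", "champ d'entrée")
n_bigrams = count_ngrams(pairs, 2)
print(st_seg, n_bigrams)
```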
Step 2: Translation Alignment
In our Best Target Term Selection Routine we will be comparing probabilities of different target terms (T_k’s):
$$P(T_k \mid S) = \frac{P(S \cap T_k)}{P(S)}$$
Since P(S) remains constant during these comparisons, we can eliminate it.
We call the resulting equation I(T_k):
$$I(T_k) = P(S \cap T_k) \qquad [3]$$
The candidate T_k with the highest I is our Best Target Term Candidate.
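The selection step can be sketched as picking the candidate with the highest score. The sketch assumes that I(T_k) is estimated as |STSeg| / |NGrams|, which is one reading of the definitions given earlier. The function name and all the counts below are hypothetical.

```python
def best_target_term(candidates):
    """Pick the candidate T_k with the highest score I(T_k).

    `candidates` maps each T_k to a pair (|STSeg| for T_k, |NGrams| for T_k's N).
    The score is assumed to be the ratio of the two counts.
    """
    scores = {
        term: st_seg / ngrams
        for term, (st_seg, ngrams) in candidates.items()
        if ngrams > 0  # skip candidates with no NGrams of that length
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up counts for the source term "input field".
candidates = {
    "champ d'entrée": (12, 400),
    "le champ": (7, 400),
    "valeurs initiales": (3, 400),
    "champ": (15, 900),
}

best, scores = best_target_term(candidates)
print(best, scores[best])
```

Because P(S) is the same for every candidate, dropping it does not change which candidate wins; only the ranking matters here.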
Step 2: Translation Alignment
Normalisation
Depending on context, any particular term can be translated in a slightly different way.
For example: “file name” could be translated in Spanish as:
- nombre de archivo
- nombre del archivo
- nombres de archivo
- nombres de archivos
- nombres de los archivos
Our algorithm has to be clever enough to realise that “nombres de archivo” is just a form of “nombre de archivo”.
Step 2: Translation Alignment
Normalisation
- So, during NGram generation, we need to generate regular expressions for our terms
- Since Asian languages do not inflect, regular expressions are simpler for these languages
- For European languages we use more complex regular expressions

Source Term | Target Term (Italian) | Regular Expression | Matches (admitted translations)
Error code | codice errore | \bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b | codice d'errore, codice di errore, codice errore, codici di errore

Source Term | Target Term (Japanese) | Regular Expression | Matches (admitted translations)
Error code | エラー コード | \bエラー\s?コード\b | エラー コード
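To see how such a pattern admits inflected forms, the Italian regular expression from the table can be tried with Python's `re` module. The wrapper function and the case-insensitive flag are assumptions added for the sketch; only the pattern and the four admitted translations come from the slide.

```python
import re

# Pattern for "codice errore", copied from the Italian row of the table.
ERROR_CODE_IT = re.compile(
    r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b",
    re.IGNORECASE,  # assumption: casing should not matter
)


def uses_admitted_translation(target, pattern=ERROR_CODE_IT):
    """True if the target string contains a form admitted by the pattern."""
    return pattern.search(target) is not None


admitted = [
    "codice d'errore",
    "codice di errore",
    "codice errore",
    "codici di errore",
]
results = {form: uses_admitted_translation(form) for form in admitted}
print(results)
print(uses_admitted_translation("numero di errore"))
```

The optional middle group `(\s\w{1,4}'?){0,2}` is what lets short function words such as "di" or "d'" appear between the two stems.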
Step 3: Consistency Check
- Detect the strings that do not use any of our admitted translations
- Report these strings along with our findings to the user
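A minimal sketch of this check, assuming the admitted translations are expressed as a regular expression like the Italian "error code" pattern shown earlier. The function name and the sample strings are invented for illustration.

```python
import re


def find_inconsistencies(pairs, source_term, admitted_pattern):
    """Return the (source, target) pairs whose source contains the term
    but whose target matches none of the admitted translations."""
    source_re = re.compile(r"\b" + re.escape(source_term) + r"\b", re.IGNORECASE)
    target_re = re.compile(admitted_pattern, re.IGNORECASE)
    return [
        (source, target)
        for source, target in pairs
        if source_re.search(source) and not target_re.search(target)
    ]


# Hypothetical English/Italian strings.
pairs = [
    ("Error code: 5", "Codice errore: 5"),
    ("Unknown error code", "Codice di errore sconosciuto"),
    ("Show the error code", "Mostra il numero dell'errore"),
    ("Save the file", "Salva il file"),
]

pattern = r"\bcod\w{0,3}(\s\w{1,4}'?){0,2}\s?err\w{0,3}\b"
flagged = find_inconsistencies(pairs, "error code", pattern)
for source, target in flagged:
    print(f"Possible inconsistency: {source!r} -> {target!r}")
```

Strings whose source does not contain the term are ignored, so only genuine candidates reach the reviewer.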
Current Challenges
- False Positives
  - Due to “heavy” rephrasing
- Unreliable for short, generic monograms

Source Term | Admitted translations (Italian)
data | d, d3d, da, dac, dai, dal, dall, data, dati, dato, dc, ddc, dei, del, dell, deny, der, deve, dfs, dhcp, di, dir, disk, dll, dma, dns, dopo, dos, dove, dpc, dsis, dtr, due, dvd, dwm
Current Challenges
- Verbs can potentially cause problems
  - Due to high inflection:
    - amar => amo, amas, ama, amamos, amáis/aman, aman
    - venir => vengo, vienes, viene, venimos, venís/vienen, vienen
  - Difficult to differentiate from other parts of speech
- Not all languages supported:
  - Arabic
  - Complex Script languages

Source term | Admitted translations | Target Language
download | descarga, descargar, descargó, descargue | Spanish
install | install, installa, installare, installata, installati, installato, installer | Italian
Current Challenges
Best Candidate Selection logic is very good, but it’s not perfect. About 70% of term selections are correct.
[Screenshot: examples of correct and incorrect selections, with the correct term highlighted]
Tips
- Make sure your data is clean to a certain degree.
- Remove any HTML/XML tags from your strings
- Filter out any unlocalised strings and non-localisable strings.
- For Asian languages, run a word breaker tool on your target strings (this is required for proper NGram handling)
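The first three tips could be sketched as a small pre-processing pass like the one below. The tag pattern is deliberately simple, and treating a target that is identical to its source as unlocalised or non-localisable is a heuristic assumed for the example; all names and sample strings are hypothetical.

```python
import re

# Simple pattern for HTML/XML tags; not a full parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Replace tags with spaces, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip()


def clean_pairs(pairs):
    """Drop tags, empty strings and strings that look untranslated."""
    cleaned = []
    for source, target in pairs:
        source, target = strip_tags(source), strip_tags(target)
        if not source or not target:
            continue
        if source == target:
            # Assumed heuristic: identical text means unlocalised or non-localisable.
            continue
        cleaned.append((source, target))
    return cleaned


pairs = [
    ("<b>Save</b> the file", "<b>Enregistrer</b> le fichier"),
    ("%1!s!", "%1!s!"),
    ("Open", "Open"),
    ("<br/>", "<br/>"),
]
cleaned = clean_pairs(pairs)
print(cleaned)
```

The identical-text heuristic will also drop strings that are legitimately the same in both languages, so it may need a whitelist in practice.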
Tips
- If you already have source term lists you’re interested in, you can use them to bypass the term mining process
- If your source terms are well selected, you’ll achieve very good results. A well-selected source term has a precise technical meaning.

Source term | Good/Bad | Reason
failure | bad | Too generic
data | bad | Too generic, forms part of many other terms: data type, data structure, etc.
worker process | good | Has a precise meaning
user account control | good | Has a precise meaning
Tips
- The more data you have, the more accurate your results will be
- Try combining software data with help / user education data to increase term repetitions
Future Improvements
- More work with Adj + Noun
- Work with verbs
- Add support for Complex Script languages and languages that inflect on different parts of the word
- Further refine Best Translation Candidate Selection logic
Questions?
Thank You!