Machine translation combination - Tilde research seminar
Post on 19-Jan-2017
Machine translation combination
Matīss Rikters
Supervisor: Dr. Dat., Prof. Inguna Skadiņa
Tilde research seminar
Riga, 4 March 2016
First presentation
http://tbs/dev/ai/ResearchSeminar/MSMT.ppt
http://www.slideshare.net/matissrikters/msmt-47423148
Topics covered:
MT approaches
Types of hybrid MT
Literature review on multi-system hybrid MT
Planned experiments, other completed and planned work
Contents
Hybrid machine translation
Multi-system hybrid MT
Simple machine translation combination
Whole-translation combination
Translation chunk combination
Linguistically motivated machine translation combination
Other work
Future plans
Hybrid machine translation
Statistical rule generation
Rules for the RBMT system are generated from training corpora
Multi-pass processing
The data is processed sequentially, first with RBMT, then with SMT
Multi-system hybrid MT
Several MT systems are run in parallel
Multi-system hybrid MT
Related research:
SMT + RBMT (Ahsan and Kolachina, 2010)
Confusion networks (Barrault, 2010)
+ neural network model (Freitag et al., 2015)
SMT + EBMT + TM + NE (Santanu et al., 2014)
Recursive sentence decomposition (Mellebeek et al., 2006)
Machine translation combination
Whole-translation combination
Translate the full sentence with several MT systems
Select the best translation
Translation chunk combination
Split the sentence into chunks
The top-level subtrees of the sentence's syntax tree are used as chunks
Translate each chunk with several MT systems
Select the best chunks and combine them
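The two combination strategies above can be sketched in a few lines. This is a minimal illustration, not the code used in the experiments: the names `select_best`, `combine_whole`, `combine_chunks` and `toy_score` are hypothetical, and the scorer is a stand-in for a language model score where lower means better.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: maps a translation to a cost, lower is better.
# In the setting described here this would be a language model perplexity.
Scorer = Callable[[str], float]


def select_best(candidates: Dict[str, str], score: Scorer) -> str:
    """Return the name of the MT system whose output has the lowest score."""
    return min(candidates, key=lambda system: score(candidates[system]))


def combine_whole(candidates: Dict[str, str], score: Scorer) -> str:
    """Whole-translation combination: keep one full-sentence translation."""
    return candidates[select_best(candidates, score)]


def combine_chunks(chunk_candidates: List[Dict[str, str]], score: Scorer) -> str:
    """Chunk combination: pick the best translation per chunk, then join."""
    return " ".join(combine_whole(c, score) for c in chunk_candidates)


def toy_score(text: str) -> float:
    """Assumption for illustration only: shorter output counts as better."""
    return float(len(text))
```

Usage: `combine_whole({"Google": "...", "Bing": "..."}, score)` returns one full sentence, while `combine_chunks` takes one such dictionary per source chunk, in sentence order.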
Whole-translation combination
Selecting the best translation:
KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history $w_f^{n-1}$:
$$p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})$$
where the probability $p(w_n \mid w_f^{n-1})$ and the backoff penalties $b(w_i^{n-1})$ are given by an already-estimated language model. Perplexity is then calculated using this probability:
$$2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 q(x_i)}$$
where, given an unknown probability distribution p and a proposed probability model q, the model is evaluated by determining how well it predicts a separate test sample x1, x2, ..., xN drawn from p.
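The backoff lookup and the perplexity computation described above can be illustrated as follows. This is a sketch under stated assumptions, not KenLM's implementation: the model is represented by two plain dictionaries (`log_probs` and `backoffs`, both hypothetical names) keyed by word tuples and holding base-10 logarithms, and a context missing from `backoffs` is assumed to carry no penalty.

```python
from typing import Dict, Sequence, Tuple

NGram = Tuple[str, ...]


def backoff_log10_prob(ngram: NGram,
                       log_probs: Dict[NGram, float],
                       backoffs: Dict[NGram, float]) -> float:
    """log10 p(w_n | w_1..w_{n-1}) using the longest matching history.

    Each time a history has to be shortened, the backoff penalty of the
    dropped context is added (a product of penalties in probability space).
    """
    penalty = 0.0
    for start in range(len(ngram)):
        entry = ngram[start:]
        if entry in log_probs:
            return penalty + log_probs[entry]
        # The entry was not observed: back off from the context ngram[start:-1].
        penalty += backoffs.get(ngram[start:-1], 0.0)
    raise KeyError(f"word not in model: {ngram[-1]!r}")


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity of a sample from its per-word log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one probability")
    return 10 ** (-sum(log10_probs) / len(log10_probs))
```

Perplexity does not depend on the logarithm base as long as the same base is used for the logarithm and the exponent, which is why base 10 can be used here.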
Whole-translation combination
Selecting the best translation:
A 5-gram language model was trained with
KenLM
the JRC-Acquis corpus v. 2.2 (Steinberger, 2006): 1.4 million Latvian legal-domain sentences
Sentences were scored against the language model with the KenLM query program
Test data
1581 randomly selected sentences from the JRC-Acquis corpus
The ACCURAT balanced evaluation corpus of 512 general-domain sentences (Skadiņš et al., 2010) was also tried, but the results were worse
Whole-translation combination
Results, May 2015. The columns Google through Equal give the share of selected translations.
System                          BLEU    Google     Bing       LetsMT     Equal
Google Translate                16.92   100 %      -          -          -
Bing Translator                 17.16   -          100 %      -          -
LetsMT                          28.27   -          -          100 %      -
Hybrid Google + Bing            17.28   50.09 %    45.03 %    -          4.88 %
Hybrid Google + LetsMT          22.89   46.17 %    -          48.39 %    5.44 %
Hybrid LetsMT + Bing            22.83   -          45.35 %    49.84 %    4.81 %
Hybrid Google + Bing + LetsMT   21.08   28.93 %    34.31 %    33.98 %    2.78 %
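Selection shares like those in the table could be tallied as in the sketch below. It rests on an assumption the slides do not confirm: that "Equal" counts sentences where the best score is tied between systems. The function name `selection_shares` and the input layout (one dictionary of system scores per sentence, lower is better) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def selection_shares(scored: List[Dict[str, float]]) -> Dict[str, float]:
    """Percentage of sentences for which each system's output was selected.

    scored holds one dict per sentence mapping system name to its score,
    lower being better. Ties for the best score are counted as "Equal"
    (an assumption made for this illustration).
    """
    if not scored:
        return {}
    counts: Counter = Counter()
    for scores in scored:
        best = min(scores.values())
        winners = [name for name, value in scores.items() if value == best]
        counts["Equal" if len(winners) > 1 else winners[0]] += 1
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```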
Translation chunk combination
Syntactic analysis:
Berkeley Parser (Petrov et al., 2006)
The sentence is split into chunks taken from the top-level subtrees of the syntax tree
Selecting the best chunk:
A 5-gram language model trained with KenLM on the JRC-Acquis corpus: 1.4 million Latvian legal-domain sentences
Sentences were scored against the language model with the KenLM query program
Test data
1581 randomly selected sentences from the JRC-Acquis corpus
The ACCURAT balanced evaluation corpus of 512 general-domain sentences was also tried, but the results were worse
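Taking the top-level subtrees of a syntax tree as chunks can be sketched as follows. The example assumes the parser output is a Penn Treebank style bracketed string; the helper names (`parse_tree`, `leaves`, `top_level_chunks`) are hypothetical and the code does not call the Berkeley Parser itself.

```python
from typing import List, Tuple, Union

Node = Union[str, Tuple[str, list]]  # a word, or (label, children)


def parse_tree(text: str) -> Node:
    """Parse a bracketed tree such as "(S (NP (DT the)) (VP (VBD sat)))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        pos += 1  # skip "("
        label = ""
        if tokens[pos] != "(":  # some parsers emit an unlabelled root
            label = tokens[pos]
            pos += 1
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def leaves(node: Node) -> List[str]:
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words: List[str] = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def top_level_chunks(tree: Node) -> List[str]:
    """Return one chunk per top-level subtree of the sentence node."""
    _, children = tree
    # Unwrap single-child wrappers such as (ROOT (S ...)).
    while len(children) == 1 and not isinstance(children[0], str):
        _, children = children[0]
    return [" ".join(leaves(child)) for child in children]
```

For the tree `(ROOT (S (NP (DT the) (NN cat)) (VP (VBD sat)) (. .)))` this yields the chunks `the cat`, `sat` and `.`, which shows why top-level splitting can produce very short chunks.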
Translation chunk combination
September 2015
Linguistically motivated machine translation combination
Smarter splitting of sentences into chunks
The sentence tree is traversed bottom-up, from right to left
A word is added to the current chunk if
The chunk does not yet have too many words (sentence word count / 4)
The word is only one character long or contains no alphabetic characters
The current chunk begins with a genitive phrase ("of ")
Otherwise a new chunk is started
If this produces too many chunks, the process is repeated, allowing more than (sentence word count / 4) words per chunk
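The chunking rules above can be approximated on a flat word list. This is a simplified sketch, not the ChunkMT implementation: it walks the words right to left instead of traversing a syntax tree, treats the three conditions as alternatives, and introduces two hypothetical parameters, `divisor` (the 4 in "word count / 4") and `max_chunks` (what counts as too many chunks, which the slides do not specify).

```python
from typing import List


def is_non_alphabetic(word: str) -> bool:
    """True if the word contains no alphabetic characters, e.g. "," or "42"."""
    return not any(ch.isalpha() for ch in word)


def chunk_words(words: List[str], divisor: int = 4, max_chunks: int = 4) -> List[List[str]]:
    """Group words into chunks, scanning from right to left."""
    if not words:
        return []
    while True:
        limit = max(1, len(words) // divisor)
        chunks: List[List[str]] = []
        current: List[str] = []
        for word in reversed(words):
            keep = (
                len(current) < limit                # chunk is not too long yet
                or len(word) == 1                   # one-character word
                or is_non_alphabetic(word)          # punctuation, digits
                or current[0].lower() == "of"       # chunk starts with "of"
            )
            if keep:
                current.insert(0, word)
            else:
                chunks.insert(0, current)
                current = [word]
        chunks.insert(0, current)
        if len(chunks) <= max_chunks or divisor <= 1:
            return chunks
        divisor -= 1  # too many chunks: repeat with a larger word limit
```

For `"the cat sat on the mat".split()` the first pass allows one word per chunk and yields six chunks, so the pass is repeated with a limit of two words, giving `the cat`, `sat on`, `the mat`.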
Changes in the MT API systems
The Hugo.lv API is used for now in place of the LetsMT (Tildes Birojs system) API
Yandex API added
Linguistically motivated machine translation combination
Selecting the best translation:
6-gram and 12-gram language models were trained with
KenLM
the JRC-Acquis corpus v. 2.2: 1.4 million Latvian legal-domain sentences
the DGT-Translation Memory corpus (Steinberger, 2011): 3.1 million Latvian legal-domain sentences
Sentences were scored against the language model with the KenLM query program
Test data
1581 randomly selected sentences from the JRC-Acquis corpus
The ACCURAT balanced evaluation corpus: 512 general-domain sentences
Linguistically motivated machine translation combination
Sentence chunks with SyMHyT:
Recently | there | has been an increased interest in the automated discovery of equivalent expressions in different languages | .
Sentence chunks with ChunkMT:
Recently there has been an increased interest | in the automated discovery of equivalent expressions | in different languages .
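Results, January 2016. The columns Equal through Yandex give the share of selected translations, except in the first row, which gives the BLEU score of each individual system.
System                            BLEU    Equal     Bing      Google    Hugo      Yandex
BLEU (individual systems)         -       -         17.43     17.73     17.14     16.04
MSMT - Google + Bing              17.70   7.25%     43.85%    48.90%    -         -
MSMT - Google + Bing + LetsMT     17.63   3.55%     33.71%    30.76%    31.98%    -
SyMHyT - Google + Bing            17.95   4.11%     19.46%    76.43%    -         -
SyMHyT - Google + Bing + LetsMT   17.30   3.88%     15.23%    19.48%    61.41%    -
ChunkMT - Google + Bing           18.29   22.75%    39.10%    38.15%    -         -
ChunkMT - all four                19.21   7.36%     30.01%    19.47%    32.25%    10.91%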
Publications
Matīss Rikters. "Multi-system machine translation using online APIs for English-Latvian." ACL-IJCNLP 2015
Matīss Rikters and Inguna Skadiņa. "Syntax-based multi-system machine translation." LREC 2016
Work in progress
Matīss Rikters and Inguna Skadiņa. "Combining machine translated sentence chunks from multiple MT systems." Submitted to CICLING 2016
Matīss Rikters. "K-translate - interactive multi-system machine translation." Submitted to Baltic DB & IS 2016
Matīss Rikters and Pēteris Ņikiforovs. "iEMS - an interactive experiment management system for the Moses SMT toolkit." Planned for submission to EAMT 2016
Matīss Rikters. "Recent research in Multi-System Machine Translation." Planned for submission to the Baltic Journal of Modern Computing
Work in progress
K-translate - interactive multi-system machine translation
Roughly the same ChunkMT, wrapped in a visual interface
Draws the syntax tree with the chunks highlighted in colour
Shows which MT system each chunk was selected from
Shows the confidence score of each selection
Can translate with online APIs or combine translations entered by the user
Resources will be provided for translation between English, French, Latvian and German
Runs in a web browser
Code available at
http://ej.uz/MSMT
http://ej.uz/SyMHyT
http://ej.uz/chunker
Future plans
Further improvements to splitting sentences into chunks
Introduce dedicated handling of multi-word expressions in the hybrid MT solution and give them more attention
Other types of language models
POS tag + lemma
Recurrent Neural Network Language Model (Mikolov et al., 2010)
Continuous Space Language Model (Schwenk et al., 2006)
Character-Aware Neural Language Model (Kim et al., 2015)
Selecting the best candidate with MT quality estimation
QuEst++ (Specia et al., 2015)
SHEF-NN (Shah et al., 2015)
Further ideas
References
Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA - The Ninth Conference of the Association for Machine Translation in the Americas. Denver, Colorado (2010).
Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.
Santanu, Pal, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing, 2014.
Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." (2006).
Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058 (2006).
Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
Steinberger, Ralf, et al. "DGT-TM: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226 (2013).
Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 2192, 125-132. 2010.
Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH. Vol. 2. 2010.
Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL on Main conference poster sessions. Association for Computational Linguistics, 2006.
Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).
Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations. 2015.
Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.
Thank you!
Questions?
[Figure: whole-translation combination workflow. Sentence tokenization; translation with online MT APIs (Google Translate, Bing Translator, LetsMT); selection of the best translation; translation output]
[Figure: chunk combination workflow. Sentence tokenization; syntactic analysis; splitting sentences into chunks; translation with online MT APIs (Google Translate, Bing Translator, LetsMT); selection of the best chunks; sentence recomposition; translation output]
[Figure: chunking flowchart. Sentence syntax tree; tree data structure; chunk list; tree data structure with marked chunks; traverse the tree/subtree; chunk of the current tree/subtree; if fvs < tvs / 4 and fvs > 1, add to the chunk list; if fvs = 1, genitive phrase or non-alphabetic, merge with the last chunk in the list. fvs: chunk word count; tvs: sentence word count]
[Figure: K-translate page flow. Start page; Translate with online systems; Input translations to combine; Input translated chunks; Settings; Translation results; Input source sentence]