introduction to...

21
Introduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal and Applied Linguistics Charles University in Prague CLARA Course on Treebank Annotation, December 2010, Prague

Upload: others

Post on 02-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Introduction to TectoMT

1/21

Zdeněk Žabokrtský, Martin Popel

Institute of Formal and Applied LinguisticsCharles University in Prague

CLARA Course on Treebank Annotation, December 2010, Prague

Page 2: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Outline

n PART 1n What is TectoMT?n TectoMT’s architecturen Overview of TectoMT’s tools and applications

2/21

n Overview of TectoMT’s tools and applications

n PART 2 - demo

Page 3: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

What is TectoMT?

n multi-purpose NLP software frameworkn created at UFAL since 2005

n main linguistic featuresn layered language representationn linguistic data structures adopted from the Prague Dependency

3/21

n linguistic data structures adopted from the Prague Dependency Treebank

n main technical featuresn highly modular, open-sourcen numerous NLP tools already integrated (both existing and new)n all tools communicating via a uniform OO infrastructuren Linux + Perln reuse of PDT technology (tree editor TrEd, XML…)

Page 4: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Why “TectoMT” ?

n Tecto..n refers the (Praguian) tectogrammarn deep-syntactic dependency-oriented sentence representationn developed by Petr Sgall and his colleagues since 1960sn large scale application in the Prague Dependency Treebank

4/21

n .....MTn the main application of TectoMT is Machine Translation

n however, not only “tecto” and not only “MT” !!!

n re-branding planned for 2011: TectoMT → Treex

Page 5: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

What is not TectoMT?

n TectoMT (as a whole) is not an end-user applicationn it is rather an experimental lab for NLP researchers

5/21

n however, releasing of single-purpose stand-alone applications is possible

Page 6: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Motivation for creating TectoMT

n First, technical reasons:n Want to make use of more than two NLP tools in your

experiment? Be ready for endless data conversions, need for other people's source code tweaking, incompatibility of source code and model versions…

6/21

code and model versions…

n ⇒ unified software infrastructure might help us in many aspects.

n Second, our long-term MT plan:n We believe that tectogrammar (deep syntax) as implemented in

Prague Dependency Treebank might help to (1) reduce data sparseness, and (2) find and employ structural similarities revealed by tectogrammar even between typologically different languages.

Page 7: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Main Design Decisions

n Linuxn Perl as the core language

n set of well-defined, linguistically relevant layers of language representation

7/21

n neutral w.r.t. chosen methodology ("rules vs. statistics")

n emphasis on modularityn each task implemented by a sequence of blocksn each block corresponds to a well-defined NLP subtaskn reusability and substitutability of blocks

n support for distributed processing

Page 8: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Data Flow Diagramin a typical application in TectoMT

INPUTDATAFILES

OUTPUTDATAFILES

MEMORY REPRESENTATION

OF SENTENCESTRUCTURES

input formatconverter

output formatconverter

8/21

FILES FILESSTRUCTURES

block 1 block 2 block n

non-Perltool X

non-Perltool Y

block 3 …

scenario:

Page 9: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Hierarchy of data-structure units

n documentn the smallest independently storable unit (~ xml file) n represents a text as a sequence of bundles, each representing one

sentence (or sentence tuples in the case of parallel documents) n bundle

n set of tree representationsof a given sentence

9/21

of a given sentence

n treen representation of a sentence on a given layer of linguistic

descriptionn noden attribute

n document's, node's, or bundle's name-value pairs

Page 10: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Tree types adopted from PDT

n tectogrammatical layern deep-syntactic dependency tree

n analytical layern surface-syntactic dependency tree

10/21

n 1 word (or punct.) ~ 1 node

n morphological layern sequence of tokens with their lemmas

and morphological tags

Page 11: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Trees in a bundle

n in each bundle, there can be at most one tree for each "layer"

n set of possible layers = {S,T} x {English,Czech,...} x {M,A,T,P, N}

n S - source, T-target (analysis vs. synthesis, MT perspective)

11/21

n M - morphological analysisn P - phrase-structure treen A - analytical treen T - tectogrammatical treen N - instances of named entities

n Example: SEnglishA - tectogrammatical analysis of an English sentence on the source-language side

Page 12: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Hierarchy of processing units

n blockn the smallest individually executable unitn with well-defined input and outputn block parametrization possible (e.g. model size choice)

n scenario

12/21

n scenarion sequence of blocks, applied one after another on given

documents

n applicationn typically 3 steps:n 1. conversion from the input formatn 2. applying the scenario on the datan 3. conversion into the output format source

languagetargetlanguage

MT triangle:interlingua

tectogram.

surf.synt.

morpho.

raw text.

Page 13: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Blocks

n technically, Perl classes derived from ������������neither method ������������ (if sentences are processed

independently) or method ����������� �� must be defined

n several hundreds blocks in TectoMT now, for various purposes:

nblocks for analysis/transfer/synthesis, e.g.

13/21

nblocks for analysis/transfer/synthesis, e.g.SEnglishW_to_SEnglishM::Lemmatize_mtree

SEnglishP_to_SEnglishA::Mark_heads

TCzechT_to_TCzechA::Vocalize_prepositions

nblocks for alignment, evaluation, feature extraction, etc.

n some of them only implement simple rules, some of them call complex probabilistic tools

nEnglish-Czech tecto-based translation currently composes of roughly 140 blocks

Page 14: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Tools available as TectoMT blocks

nto integrate a stand-alone NLP tool into TectoMT means to provide it with the standardized block interface

nalready integrated tools:ntaggers

nHajič's tagger, Raab&Spoustová Morče tagger, Rathnaparkhi MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's

14/21

MXPOST tagger, Brants's TnT tager, Schmid's Tree tagger, Coburn's Lingua::EN::Tagger

nparsersnCollins' phrase structure parser, McDonalds dependency parser,

Malt parser, ZŽ's dependency parsernnamed-entity recognizer

nStanford Named Entity Recognizer, Kravalová's SVM-based NE recognizer

n miscel.nKlimeš's semantic role labeller, ZŽ's C5-based afun labeller, Ptáček's

C5-based Czech preposition vocalizer, ...

Page 15: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Other TectoMT componentsn "core" - Perl libraries forming the core of TectoMT infrastructure,

esp. for memory representation of (and interface to) to the data

structures

nnumerous file-format converters (e.g. from PDT, Penn treebank,

Czeng corpus, WMT shared task data etc. to our xml format)

n

15/21

nTectoMT-customized Pajas' tree editor TrEd

n tools for parallelized processing (Bojar)

ndata, esp. trained models for the individual tools, morphological

dictionaries, probabilistic translation dictionaries...

n tools for testing (regular daily tests), documentation...

Page 16: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Languages in TectoMT

n full-fledged sentence PDT-style analysis/transfer/synthesis for English and Czechn using state-of-the-art tools

16/21

n prototype implementations of PDT-style analyses for a number of other languagesnmostly created by students

n Polish, French, German, Tamil, Spanish, Esperanto…

Page 17: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

English-Czech translation in TectoMT

deep syntax:tectogramatical layer

t-layer

ANALYSIS TRANSFER SYNTHESIS

17/21

source language (English) target language (Czech)

morphological layer

shallow syntax:analytical layer

a-layer

m-layer

w-layer

Page 18: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

tectogramatical layer

ANALYSIS TRANSFER SYNTHESIS

t-layer

fill formems grammatemes usequeryfill morphological categories

English-Czech translation in TectoMT

rule based statistical & blocks

18/21

source language (English) target language (Czech)

morphological layer

analytical layer a-layer

m-layer

w-layertokenizationlemmatizationtagger (Morce)

parser (McDonald's MST)analytical functions

mark edges to contractbuild t-tree

fill formems grammatemes useHMTM

querydictionary

fill morphological categories

impose agreement

add functional words

generatewordforms

concatenate

segmentation

Page 19: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Real Translation ScenarioSEnglishW_to_SEnglishM::TokenizationNormalize_formsFix_tokenizationTagMorceFix_mtagsLemmatize_mtreeSEnglishM_to_SEnglishN::Stanford_named_entitiesDistinguish_personal_namesSEnglishM_to_SEnglishA::McD_parserFill_is_member_from_deprelFix_tags_after_parse

Mark_clause_headsMark_passivesAssign_functorsMark_infinMark_relclause_headsMark_relclause_corefMark_dsp_rootMark_parenthesesRecompute_deepordAssign_nodetypeAssign_grammatemesDetect_formemeRehang_shared_attrDetect_voice

Cut_variantsRehang_to_eff_parentsTranslate_LF_tree_ViterbiRehang_to_orig_parentsFix_transfer_choicesTranslate_L_female_surnamesAdd_noun_genderAdd_relpron_below_rcChange_Cor_to_PersPronAdd_PersPron_below_vfinAdd_verb_aspectFix_date_timeFix_grammatemes_after_transferFix_negation

Impose_pron_z_agrImpose_rel_pron_agrImpose_subjpred_agrImpose_attr_agrImpose_compl_agrDrop_subj_pers_pronsAdd_prepositionsAdd_subconjsAdd_reflex_particlesAdd_auxverb_compound_passiveAdd_auxverb_modalAdd_auxverb_compound_futureAdd_auxverb_conditionalAdd_auxverb_compound_past

19/21

Fix_tags_after_parseMcD_parser REPARSE=1 Fill_is_member_from_deprelFix_McD_topologyFix_nominal_groupsFix_is_memberFix_atreeFix_multiword_prep_and_conjFix_dicendi_verbsFill_afun_AuxCP_CoordFill_afunSEnglishA_to_SEnglishT::Mark_edges_to_collapseMark_edges_to_collapse_negBuild_ttreeFill_is_memberMove_aux_from_coord- _to_membersFix_tlemmasAssign_coap_functorsFix_either_orFix_is_member

Detect_voiceFix_imperativesFill_is_name_of_personFill_gender_of_personAdd_cor_actFind_text_corefSEnglishT_to_TCzechT::Clone_ttreeTranslate_LF_phrasesTranslate_LF_joint_staticDelete_superfluous_tnodesTranslate_F_try_rulesTranslate_F_add_variantsTranslate_F_rerankTranslate_L_try_rulesTranslate_L_add_variantsTranslate_LF_numerals_by_rulesTranslate_L_filter_aspectTransform_passive_constructionsPrune_personal_name_variantsRemove_unpassivizable_variantsTranslate_LF_compounds

Fix_negationMove_adjectives_before_nounsMove_genitives_to_postpositMove_relclause_to_postpositMove_dicendi_closer_to_dspMove_PersPron_next_to_verbMove_enough_before_adjFix_moneyRecompute_deepordFind_gram_coref_for_refl_pronNeut_PersPron_gender_from_antecOverride_pp_with_phrase_translationValency_related_rulesFill_clause_numberTurn_text_coref_to_gram_corefTCzechT_to_TCzechA::Clone_atreeDistinguish_homonymous_mlemmasReverse_number_noun_dependencyInit_morphcatFix_possessive_adjectivesMark_subject

Add_auxverb_compound_pastAdd_clausal_expletive_pronounsResolve_verbsProject_clause_numberAdd_parenthesesAdd_sent_final_punctAdd_subord_clause_punctAdd_coord_punctAdd_apposition_punctChoose_mlemma_for_PersPronGenerate_wordformsMove_clitics_to_wackernagelRecompute_orderingDelete_superfluous_preposDelete_empty_nounsVocalize_prepositionsCapitalize_sent_startCapitalize_named_entitiesTCzechA_to_TCzechW::Concatenate_tokensAscii_quotesRemove_repeated_tokens

Page 20: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Parallel analysis

n data needed for training the transfer phase models

n Czech-English parallel corpus CzEngn 8 mil. pairs of sentences with

automatic PDT-style analyses and

20/21�����

���

�� ��

automatic PDT-style analyses and alignment

Page 21: Introduction to TectoMTufal.mff.cuni.cz/clara/treecourse/pages/slides/Zabokrtsky-tectomt_intro.pdfIntroduction to TectoMT 1/21 Zdeněk Žabokrtský, Martin Popel Institute of Formal

Summary of Part I

n TectoMT (àTreex)n environment for NLP experiments

nmultipurpose, multilingual

n PDT-style linguistic structures

21/21

n PDT-style linguistic structures

n Linux+Perl, open-source

nmodular architecture (several hundreds of modules)

n capable of processing massive data

nwill be released at CPAN