NSF PARTNERSHIP FOR RESEARCH AND EDUCATION:
MEANING REPRESENTATION FOR STATISTICAL LANGUAGE PROCESSING
TectoMT
- TectoMT = highly modular software framework for integrating NLP components into larger NLP systems
- aimed at (but not limited to) MT using tectogrammatics
- other goals:
  - to create a system for testing the true usefulness of various NLP tools within a real-life application
  - to exploit the abstraction power of tectogrammatics
  - to supply data and technology for other projects
Design decisions
- implementation in Perl under Linux
  - non-Perl tools integrated via Perl wrappers
- focus on modularity
  - simple uniform API for all included modules
- maximum reuse of the PDT (Prague Dependency Treebank) annotation scheme
  - linguistic layers, tree editor TrEd, XML data formats, tools for distributed processing ...
- no requirements on methodology
  - rule-based and statistical components can be freely combined
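The idea of a simple uniform API with wrapped external tools can be illustrated with a minimal sketch. It is written in Python for brevity, although TectoMT itself is implemented in Perl, and every name in it (`Block`, `process`, `ToolWrapper`, `run_scenario`, `toy_tagger`) is a hypothetical stand-in rather than the framework's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Block(ABC):
    """Hypothetical uniform interface: every module edits a shared document."""

    @abstractmethod
    def process(self, document: Dict) -> None:
        ...


class WhitespaceTokenizer(Block):
    """A rule-based component: splits the raw text on whitespace."""

    def process(self, document: Dict) -> None:
        document["tokens"] = document["text"].split()


class ToolWrapper(Block):
    """Adapts an arbitrary external tool to the same interface."""

    def __init__(self, tool: Callable[[List[str]], List[str]], output_key: str):
        self.tool = tool
        self.output_key = output_key

    def process(self, document: Dict) -> None:
        document[self.output_key] = self.tool(document["tokens"])


def run_scenario(blocks: List[Block], document: Dict) -> Dict:
    """Apply the blocks in order; each one sees the output of the previous ones."""
    for block in blocks:
        block.process(document)
    return document


def toy_tagger(tokens: List[str]) -> List[str]:
    # Stand-in for an external tagger (assumption: a real tool would be called here).
    return ["NUM" if token.isdigit() else "WORD" for token in tokens]


doc = run_scenario(
    [WhitespaceTokenizer(), ToolWrapper(toy_tagger, "tags")],
    {"text": "around 80 modules"},
)
print(doc["tags"])
```

Because the pipeline only relies on `process`, a rule-based block and a wrapped statistical tool can be swapped or reordered without changing the surrounding code.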
MT pyramid (in terms of PDT)
- Key question in MT: what is the optimal level of abstraction?
- Our hypothesis: somewhere around tectogrammatics
  - high generalization over different language characteristics, but still computationally (and mentally!) tractable
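[Figure: MT triangle. Corners: source language, target language, interlingua. Levels of abstraction, from top to bottom: interlingua, tectogrammatical, surface-syntactic, morphological, raw text. Other labels: level of abstraction, "transfer distance", "?".]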
MT pyramid in vivo
- English-Czech translation in TectoMT (a sequence of around 80 modules is used):
  - English: She has never laughed in her new boss's office.
  - Czech: Nikdy se nesmála v úřadu svého nového šéfa.
Alignment of t-trees in TectoMT
- necessary for training t-layer transfer from parallel corpora
- current solution: a set of human-designed features weighted by a perceptron
- sample sentence pair:
  - English: It is extremely important that Iraq held elections to a constitutional assembly.
  - Czech: Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
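How hand-designed features weighted by a perceptron might score node alignments can be sketched as follows. This is an illustration under assumptions, not the TectoMT aligner: the three features, the toy dictionary, the node depths, and the function names are all invented for the example, and the update is the plain (unaveraged) perceptron rule.

```python
from collections import defaultdict

# Hypothetical toy lexicon; a real system would use a large translation dictionary.
DICTIONARY = {("important", "důležitý"), ("election", "volba"), ("Iraq", "Irák")}


def features(src, tgt):
    """Hand-designed features of one candidate (source node, target node) pair."""
    return {
        "in_dictionary": 1.0 if (src["lemma"], tgt["lemma"]) in DICTIONARY else 0.0,
        "same_depth": 1.0 if src["depth"] == tgt["depth"] else 0.0,
        "bias": 1.0,
    }


def score(weights, feats):
    """Weighted sum of feature values."""
    return sum(weights[name] * value for name, value in feats.items())


def best_target(weights, src, targets):
    """Index of the highest-scoring target node for a source node."""
    return max(range(len(targets)),
               key=lambda i: score(weights, features(src, targets[i])))


def train(examples, epochs=5):
    """examples: list of (source node, candidate target nodes, gold index)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for src, targets, gold in examples:
            predicted = best_target(weights, src, targets)
            if predicted != gold:
                # Perceptron update: reward the gold pair, penalize the wrong guess.
                for name, value in features(src, targets[gold]).items():
                    weights[name] += value
                for name, value in features(src, targets[predicted]).items():
                    weights[name] -= value
    return weights


# Toy nodes loosely based on the sample sentence pair; the depths are made up.
czech_nodes = [
    {"lemma": "volba", "depth": 2},
    {"lemma": "důležitý", "depth": 1},
    {"lemma": "Irák", "depth": 2},
]
examples = [
    ({"lemma": "important", "depth": 1}, czech_nodes, 1),
    ({"lemma": "election", "depth": 2}, czech_nodes, 0),
    ({"lemma": "Iraq", "depth": 2}, czech_nodes, 2),
]

weights = train(examples)
for src, targets, _ in examples:
    print(src["lemma"], "->", targets[best_target(weights, src, targets)]["lemma"])
```

Each source node is aligned independently here; a real aligner would probably also need to handle unaligned nodes and competing links.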
TectoMT: Current State (1)
- around 5 developers
- around 200 modules, especially for Czech and English sentence analysis and synthesis
  - many of them are Perl wrappers around previously existing NLP tools: Collins's parser, McDonald's parser, Hajič's morphological analyzer and tagger, Brants's TnT tagger ...
- intensive use of existing linguistic data resources: PTB, BNC, PDT, PEDT, CNC ...
TectoMT: Current State (2)
- applications implemented in TectoMT so far (April 2008):
  - English-Czech MT (TectoMT participates in the WMT08 Shared Task)
  - preprocessing of t-trees for manual annotation of the Prague Czech-English Dependency Treebank
  - interactive Czech analysis in the tree editor TrEd
  - English sentence generator in a dialog system
  - building of a large parallel treebank from the parallel corpus CzEng
TectoMT: Future plans
- the most critical bottleneck: insufficient use of a target-language model in tectogrammatical transfer (a log-linear model trained by a perceptron will be used soon)
- many modules in the current system are based on simple heuristics; corpus-based alternatives should be sought for most of them
- tools for other languages will be added
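The first plan above, bringing a target-language model into transfer through a log-linear model, can be sketched as a weighted sum of log-scores over candidate translations. Everything below is illustrative: the feature names, the probabilities, and the hand-set weights are made up, whereas in the planned system the weights would be trained by a perceptron.

```python
import math


def loglinear_score(feature_values, weights):
    """Log-linear model: weighted sum of feature values (here log-probabilities)."""
    return sum(weights[name] * value for name, value in feature_values.items())


def choose(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Two candidate Czech lemmas for English "office"; all probabilities are invented.
candidates = [
    {"lemma": "úřad",
     "features": {"log_p_translation": math.log(0.3),
                  "log_p_target_lm": math.log(0.5)}},
    {"lemma": "kancelář",
     "features": {"log_p_translation": math.log(0.6),
                  "log_p_target_lm": math.log(0.1)}},
]

# Without the target-language model, only the translation score counts.
translation_only = {"log_p_translation": 1.0, "log_p_target_lm": 0.0}
# With it, fit to the target-side context can override the translation score.
with_target_lm = {"log_p_translation": 1.0, "log_p_target_lm": 1.0}

print(choose(candidates, translation_only)["lemma"])
print(choose(candidates, with_target_lm)["lemma"])
```

With these invented numbers the two weight settings pick different lemmas, which shows how a target-side score can change a transfer decision.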