

Upload: jewel-spencer

Post on 19-Jan-2018



NSF PARTNERSHIP FOR RESEARCH AND EDUCATION:

MEANING REPRESENTATION FOR STATISTICAL LANGUAGE PROCESSING

TectoMT

TectoMT = highly modular software framework for integrating NLP components into larger NLP systems

aimed at (but not limited to) machine translation (MT) using tectogrammatics

other goals:
  to create a system for testing the true usefulness of various NLP tools within a real-life application
  to exploit the abstraction power of tectogrammatics
  to supply data and technology for other projects


Design decisions

implementation in Perl under Linux
  non-Perl tools integrated via Perl wrappers

focus on modularity
  simple uniform API for all included modules

maximum reuse of the PDT annotation scheme
  linguistic layers, tree editor TrEd, XML data formats, tools for distributed processing ...

no requirements on methodology
  rule-based and statistical components can be freely combined
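The "simple uniform API" and the free mixing of rule-based and statistical components can be pictured with a minimal sketch. It is written in Python for brevity, whereas TectoMT itself is implemented in Perl. The names Block, process_document and run_scenario, the dictionary used as a document, and the dummy tagger are all assumptions made for illustration; they are not TectoMT's actual interface.

```python
from abc import ABC, abstractmethod


class Block(ABC):
    """Hypothetical uniform interface: every module edits a shared document."""

    @abstractmethod
    def process_document(self, document: dict) -> None:
        ...


class RuleBasedTokenizer(Block):
    """Rule-based component: splits the raw sentence on whitespace."""

    def process_document(self, document: dict) -> None:
        document["tokens"] = document["text"].split()


class ExternalTaggerWrapper(Block):
    """Wrapper around an external tool, represented here by a plain callable."""

    def __init__(self, tagger):
        self.tagger = tagger  # stand-in for a call to a statistical tagger

    def process_document(self, document: dict) -> None:
        document["tags"] = self.tagger(document["tokens"])


def run_scenario(blocks, document):
    """Apply the blocks in order; each one sees the output of the previous ones."""
    for block in blocks:
        block.process_document(document)
    return document


def dummy_tagger(tokens):
    # Placeholder for a real tagger: assigns the tag "X" to every token.
    return ["X" for _ in tokens]


doc = run_scenario(
    [RuleBasedTokenizer(), ExternalTaggerWrapper(dummy_tagger)],
    {"text": "She has never laughed"},
)
print(doc["tokens"], doc["tags"])
```

Because every component exposes the same method, the driver loop does not need to know whether a given step is a hand-written rule or a wrapper around an external statistical tool.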


MT pyramid (in terms of PDT)

Key question in MT: optimal level of abstraction?

Our hypothesis: somewhere around tectogrammatics
  high generalization over different language characteristics, but still computationally (and mentally!) tractable

[Figure: MT triangle between source language and target language. Vertical axis: level of abstraction (from top: interlingua, tectogrammatics, surface syntax, morphology, raw text). Horizontal extent: "transfer distance".]


MT pyramid in vivo

English-Czech translation in TectoMT (a sequence of around 80 modules is used):

She has never laughed in her new boss's office.
Nikdy se nesmála v úřadu svého nového šéfa.


Alignment of t-trees in TectoMT

necessary for training t-layer transfer from parallel corpora

current solution: set of human-designed features weighted by perceptron

Sample sentence pair:
It is extremely important that Iraq held elections to a constitutional assembly.
Je nesmírně důležité, že v Iráku proběhly volby do ústavního shromáždění.
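How a perceptron can weight hand-designed features for node alignment may be sketched as follows. This is a generic structured-perceptron illustration in Python, not the TectoMT aligner (which is written in Perl). The three features, the node attributes (lemma, children, pos), the two-entry dictionary and the toy nodes are all assumptions invented for the example.

```python
def features(src, tgt, dictionary):
    """Hypothetical hand-designed features for one candidate node pair."""
    return {
        "in_dictionary": 1.0 if (src["lemma"], tgt["lemma"]) in dictionary else 0.0,
        "same_child_count": 1.0 if src["children"] == tgt["children"] else 0.0,
        "position_diff": -abs(src["pos"] - tgt["pos"]),
    }


def score(weights, feats):
    """Weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def best_target(weights, src, targets, dictionary):
    """Pick the target node with the highest score for the given source node."""
    return max(targets, key=lambda tgt: score(weights, features(src, tgt, dictionary)))


def train_perceptron(examples, dictionary, epochs=5):
    """examples: list of (source node, candidate target nodes, gold target node)."""
    weights = {}
    for _ in range(epochs):
        for src, targets, gold in examples:
            predicted = best_target(weights, src, targets, dictionary)
            if predicted is not gold:
                # Perceptron update: add the gold features, subtract the predicted ones.
                for name, value in features(src, gold, dictionary).items():
                    weights[name] = weights.get(name, 0.0) + value
                for name, value in features(src, predicted, dictionary).items():
                    weights[name] = weights.get(name, 0.0) - value
    return weights


# Toy data (invented): two Czech nodes and two English nodes with gold links.
dictionary = {("important", "důležitý"), ("election", "volba")}
cz_nodes = [
    {"lemma": "důležitý", "children": 2, "pos": 1},
    {"lemma": "volba", "children": 1, "pos": 4},
]
en_important = {"lemma": "important", "children": 1, "pos": 1}
en_election = {"lemma": "election", "children": 1, "pos": 3}
examples = [
    (en_important, cz_nodes, cz_nodes[0]),
    (en_election, cz_nodes, cz_nodes[1]),
]

weights = train_perceptron(examples, dictionary)
print(best_target(weights, en_election, cz_nodes, dictionary)["lemma"])
```

The update changes the weights only when the current best-scoring candidate differs from the gold link, so features that separate correct from incorrect pairs accumulate weight over the training passes.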


TectoMT: Current State (1)

around 5 developers

around 200 modules, especially for Czech and English sentence analysis and synthesis

many of them are Perl wrappers to previously existing NLP tools: Collins's parser, McDonald's parser, Hajič's morphology analysis and tagger, Brants's TnT tagger ...

intensive usage of existing linguistic data resources: PTB, BNC, PDT, PEDT, CNC ...


TectoMT: Current State (2)

applications implemented in TectoMT so far (April 2008):
  English-Czech MT (TectoMT participates in the WMT08 Shared Task)
  preprocessing of t-trees for manual annotation of the Prague Czech-English Dependency Treebank
  interactive Czech analysis in the tree editor TrEd
  English sentence generator in a dialog system
  building of a large parallel treebank from the parallel corpus CzEng


TectoMT: Future plans

the most critical bottleneck: insufficient use of a target language model in tectogrammatical transfer (a log-linear model trained by perceptron will be used soon)

many modules in the current system are based on simple heuristics; corpus-based alternatives should be sought for most of them

tools for other languages will be added
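The role a target language model can play inside a log-linear transfer model, as mentioned in the first plan above, can be illustrated with a small sketch. It is a generic Python example, not the planned TectoMT component. The feature names log_tm and log_lm, the candidate labels A and B, and all probabilities are made up. The weights are set by hand here, whereas the slide envisages training them with a perceptron.

```python
import math


def loglinear_score(feature_values, weights):
    """Weighted sum of feature values (log of an unnormalized log-linear score)."""
    return sum(weights[name] * value for name, value in feature_values.items())


def pick_translation(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Two hypothetical target-side candidates for one source node.
# log_tm: log translation-model probability; log_lm: log target language model probability.
candidates = [
    {"name": "A", "features": {"log_tm": math.log(0.4), "log_lm": math.log(0.30)}},
    {"name": "B", "features": {"log_tm": math.log(0.5), "log_lm": math.log(0.05)}},
]

tm_only = {"log_tm": 1.0, "log_lm": 0.0}    # target language model ignored
tm_and_lm = {"log_tm": 1.0, "log_lm": 1.0}  # both models contribute equally

print(pick_translation(candidates, tm_only)["name"])
print(pick_translation(candidates, tm_and_lm)["name"])
```

With the language model weight at zero, the candidate with the higher translation probability (B) wins; once the language model contributes, the candidate that fits the target language better (A) is chosen.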