Morphosyntactic correspondence: a progress report on bitext parsing

Alexander Fraser, Renjing Wang, Hinrich Schütze

Institute for NLP, University of Stuttgart

INFuture2009: Digital Resources and Knowledge Sharing, Nov 4th 2009, Zagreb

Outline

The Institute for Natural Language Processing at the University of Stuttgart

Bitext parsing

Using morphosyntactic correspondence

IfNLP Stuttgart

The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart

Large department

Dogil (Phonetics and Speech)

Kuhn/Rohrer (LFG syntax and semantics)

Cahill (LFG generation)

Heid (Terminology extraction, morphology)

Padó (Semantics, lexical semantics)

Schütze (Statistical NLP and Information Retrieval)

More on next slide

IfNLP – Statistical NLP Group

Hinrich Schütze (director since 2004)

Bernd Möbius – Speech recognition and synthesis

Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar)

Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics

Michael Walsh – Speech, exemplar-theoretic syntax

Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval

General department areas of research

New statistical NLP models and methods

Semi-supervised and active learning

Cognitive/linguistic representation models

Applied to: NLP, retrieval, MT, speech, e-learning, …

IfNLP - Partnerships

Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing

Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing)

International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN

Industrial: various projects with publishers (many focusing on terminology)

What is bitext parsing?

Bitext: a text and its translation

Sentences and their translations are aligned

Sometimes called a parallel corpus

Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse)

Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext

We will use the complementarity of the syntax of the two languages to obtain improved parses

Motivation for bitext parsing

Many advances in syntactic parsing come from better modeling

But the overall bottleneck is the size of the treebank

Our research asks a different question: where can we (cheaply) obtain additional information which helps to supplement the treebank?

A new information source for resolving ambiguity is a translation

The human translator understands the sentence and disambiguates for us!

Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment

Clause attachment ambiguity

Parse 1: high attachment (wrong)

Parse 2: low attachment (correct)

Not ambiguous in German

Number agreement disambiguates

FRAU (woman) and HATTE (had) agree

Unambiguous low attachment
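The role of number agreement can be illustrated with a toy filter. This is only a sketch: the function name, the "sg"/"pl" labels, and the competing head KINDER (children) are hypothetical and were introduced here for illustration; the slide itself names only FRAU and HATTE.

```python
def agreeing_heads(verb_number, candidate_heads):
    """Keep the candidate attachment heads whose number matches the verb.

    candidate_heads is a list of (word, number) pairs; the "sg"/"pl"
    labels are an assumed encoding, not the output of a real analyzer.
    """
    return [word for word, number in candidate_heads if number == verb_number]


# HATTE (had) is singular, so only the singular noun can be its subject.
# KINDER is a made-up plural competitor used to show the filtering.
heads = [("KINDER", "pl"), ("FRAU", "sg")]
print(agreeing_heads("sg", heads))
```

When exactly one head survives the filter, the attachment is unambiguous on the German side, and that is the signal the English reranker can exploit.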

Parse reranking of bitext

Goal: improve English parsing accuracy

Parse English sentence, obtain list of 100 best parse candidates

Parse German sentence, obtain single best parse

Determine the correspondence of German to English words using a word alignment

Calculate syntactic divergence of each English parse candidate and the projection of the German parse

Choose probable English parse candidate with low syntactic divergence
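The five steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described in the talk: constituents are represented as inclusive (start, end) word spans, syntactic divergence is approximated by counting crossing brackets between English constituents and projected German constituents, and all function names and the `weight` parameter are hypothetical.

```python
def project_span(span, alignment):
    """Project a German span onto English word positions.

    alignment is a list of (german_index, english_index) pairs.
    Returns None if no word in the span is aligned.
    """
    targets = [e for g, e in alignment if span[0] <= g <= span[1]]
    if not targets:
        return None
    return (min(targets), max(targets))


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    return (a[0] < b[0] <= a[1] < b[1]) or (b[0] < a[0] <= b[1] < a[1])


def divergence(english_spans, german_spans, alignment):
    """Count crossing brackets between English and projected German spans."""
    projected = [project_span(s, alignment) for s in german_spans]
    projected = [p for p in projected if p is not None]
    return sum(1 for p in projected for e in english_spans if crosses(p, e))


def rerank(candidates, german_spans, alignment, weight=1.0):
    """Pick the candidate with high parser score and low divergence.

    candidates is a list of (log_prob, english_spans) pairs.
    """
    return max(
        candidates,
        key=lambda c: c[0] - weight * divergence(c[1], german_spans, alignment),
    )
```

With `weight=0.0` this reduces to the baseline parser's own 1-best choice, so the weight controls how much the German parse is trusted.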

Measuring syntactic divergence

$$P(e \mid g) = \frac{\exp \sum_m \lambda_m h_m(g, e, a)}{\sum_{e'} \exp \sum_m \lambda_m h_m(g, e', a)}$$

Define features to capture different (overlapping) aspects of syntactic divergence

Features are functions of:

Candidate English parse e

German parse g

Word alignment a

Combine in log-linear model

Discriminatively train λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
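As a small sketch of the log-linear combination, the function below turns per-candidate feature values h_m(g, e, a) and weights λ_m into a normalized distribution over the candidate English parses. The function name and the two-feature example are assumptions for illustration; the actual feature set is not reproduced here.

```python
import math


def loglinear_posteriors(feature_vectors, weights):
    """Return P(e | g) for each candidate parse e.

    feature_vectors[i][m] holds h_m(g, e_i, a) for candidate e_i;
    weights[m] holds lambda_m.
    """
    scores = [
        sum(lam * h for lam, h in zip(weights, feats))
        for feats in feature_vectors
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    normalizer = sum(exps)
    return [x / normalizer for x in exps]


# Hypothetical features: [baseline parser log-probability, bitext feature].
candidates = [[-1.0, 0.0], [-1.5, 1.0]]
print(loglinear_posteriors(candidates, [1.0, 1.0]))
```

For reranking only the argmax matters, so the normalizer can be skipped in practice; it is shown here to match the formula.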

Rich bitext projection features

Defined 36 features by looking at common English parsing errors

No monolingual features, except baseline parser probability

General features

Is there a probable label correspondence between German and the hypothesized English parse?

How expected is the size of each constituent in the hypothesized English parse given the German parse?

Specific features

Are coordinations realized identically?

Is the NP structure the same?

Mix of probabilistic and heuristic features

Training

Use BitPar syntactic forest parser

English BitPar trained on Penn Treebank

German BitPar trained on Tiger Treebank

Probabilistic feature functions built using large parallel text (Europarl)

Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German

Minimum error rate training using F score
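To make the training objective concrete, here is a sketch of a labeled-bracket F score and a weight search that maximizes it. A coarse grid search over a single weight stands in for minimum error rate training, which is more involved; the function names, the data layout, and the use of averaged per-sentence F1 (rather than F1 over pooled bracket counts) are simplifying assumptions.

```python
from collections import Counter


def bracket_f1(gold, predicted):
    """F1 over labeled brackets given as (label, start, end) tuples."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def tune_weight(sentences, grid):
    """Pick the divergence weight from grid with the best average F1.

    sentences is a list of (gold_brackets, candidates) pairs, where each
    candidate is a (log_prob, divergence, brackets) triple.
    """
    def average_f1(weight):
        total = 0.0
        for gold, candidates in sentences:
            best = max(candidates, key=lambda c: c[0] - weight * c[1])
            total += bracket_f1(gold, best[2])
        return total / len(sentences)

    return max(grid, key=average_f1)
```

The search optimizes the evaluation metric directly instead of likelihood, which is the key idea behind training the lambda vector with minimum error rate training.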

Reranking English parses

Difficult task

German is difficult to parse

Our knowledge source, the German parser, is out-of-domain (poor performance)

Baseline English parser we are trying to improve is in-domain (good performance)

Test set has long sentences

Result: 0.70% F1 improvement on test data (statistically significant)

New results

Reranking German parses

We needed German gold standard parses (and English translations)

Sebastian Padó has made a small parallel treebank for Europarl available

No engineering on German yet

We are using the same syntactic divergence features which were designed to improve English parsing

There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”)

But easier task because the parser we are trying to improve is weaker (German is hard to parse, Europarl is out of domain)

2.3% F1 improvement currently; we think this can be further improved

Summary: bitext parsing

I showed you an approach for bitext parsing

Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse

I then showed our first results for reranking German parses using a single English parse

The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking

Machine translation involves morphosyntactic correspondence

And this is where we are interested in looking at Croatian

Morphosyntactic processing

I am co-PI of a new IfNLP project funded by the DFG (German Science Foundation)

Project: morphosyntactic modeling for statistical machine translation (SMT)

SMT research, up until recently, has been dominated by translation into English

English expresses a lot of information through word order, very little through inflection

Approaches to translating morphologically rich languages to English are preprocessing based

Present: linguistic preprocessing

Linguistic preprocessing for SMT (statistical machine translation)

From: freer syntax, morphologically rich language

To: rigid syntax, morphologically poor language

Existing examples: German to English, Czech to English

Present: linguistic preprocessing

How this works

Produce morphosyntactic analysis of German (or Czech)

Reorder words in the German/Czech sentence to be in English order

Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc.)

For Czech: insert pseudo-words (e.g., indicate PRO-drop pronouns)

Use statistics on this “simplified” German or Czech to map directly to English using SMT
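The reordering and inflection-reduction steps can be illustrated with a toy function. This is a deliberately crude sketch, not the Stuttgart system: it assumes tokens already carry a (form, lemma, POS) analysis, uses a single hypothetical rule that moves a clause-final verb to directly after the first noun, and replaces determiners and adjectives with their lemmas to drop case and agreement marking. The tag names and the example clause are made up for illustration.

```python
def simplify(tokens, reduce_pos=("DET", "ADJ")):
    """Reorder and de-inflect an analyzed German clause.

    tokens is a list of (form, lemma, pos) triples from some
    morphosyntactic analyzer (assumed, not a real tool's output format).
    """
    out = list(tokens)
    # Toy reordering rule: move a clause-final verb after the first noun.
    if out and out[-1][2] == "V":
        verb = out.pop()
        first_noun = next(
            (i for i, tok in enumerate(out) if tok[2] == "N"), None
        )
        if first_noun is None:
            out.append(verb)
        else:
            out.insert(first_noun + 1, verb)
    # Reduce inflection: use the lemma for the selected parts of speech.
    return [lemma if pos in reduce_pos else form for form, lemma, pos in out]


# Subordinate-clause order "die Frau das Buch hatte" (the woman the book had).
clause = [
    ("die", "der", "DET"),
    ("Frau", "Frau", "N"),
    ("das", "der", "DET"),
    ("Buch", "Buch", "N"),
    ("hatte", "haben", "V"),
]
print(simplify(clause))
```

The output keeps German vocabulary but follows English word order with reduced inflection, which is the kind of "simplified" source text an SMT system is then trained on.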

Present: linguistic preprocessing

How well does this work?

German to English SMT with linguistic preprocessing (Stuttgart system)

Results from 2008 ACL workshop on machine translation (extensive human evaluation)

Only system limited to organizer’s data competitive with:

The best system of 5 rule-based MT systems

Saarbrücken hybrid rule-based/SMT system

Google Translate, which does not use linguistic preprocessing but does use vastly more data

Future: modeling

What about translating from English to German or to Slavic languages?

Problem: morphological generation is more difficult

It is easy to reduce multiple inflections to one (for instance, stemming)

Harder to learn to generate the right inflection

Future: modeling

Current work on morphological generation

Work at Charles University in Prague on Czech

Tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax)

Best English to German SMT systems also use little or no morphological knowledge

And they are much worse than rule-based English to German systems

Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing; it requires morphosyntactic modeling

Morphosyntactic correspondence

In fact, all multilingual problems involve morphosyntactic correspondence:

If we have a source parse tree, and source text, and we would like a target text, this is machine translation

If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing

If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment

The same thinking can be used for cross-lingual information retrieval

Very relevant when one of the languages is morphologically rich

Conclusion

I introduced the IfNLP Stuttgart

I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing

I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation

Biggest challenge is translating into freer word order, morphologically rich languages (e.g., German and particularly Slavic languages)

We are interested in the challenge of building systems to translate to Croatian

To do this: we need partners who are working on Croatian analysis!

We also request that you think about multilingual applications when producing Croatian NLP resources

The type of approach I showed for bitext parsing is useful for other multilingual applications

Thank you!

Statistical Approach

Using statistical models

Create many alternatives, called hypotheses

Give a score to each hypothesis

Find the hypothesis with the best score through search

Disadvantages

Difficulties handling structurally rich models (math and computation)

Need data to train the model parameters

Difficult to understand decision process made by system

Advantages

Avoid hard decisions

Speed can be traded with quality, no all-or-nothing

Works better in the presence of unexpected input

Learns automatically as more data becomes available

Modified from Vogel

Morphosyntactic knowledge

We use: morphological analyzers & treebanks, which are combined in parsing models learned from treebanks

English models have little morphological analysis (suffix analysis to determine POS for unknown words)

German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer)

Given inflected form, SMOR returns possible fine-grained POS tags

E.g., for nouns/adjectives: POS, case, gender, number, definiteness

BitPar puts possible analyses in the chart, and disambiguates

Slavic languages require even more morphological knowledge than German

Transferring syntactic knowledge

Need knowledge source!

English syntactic parser

About 90% bracketing accuracy

Mapping

Requires bitext

Work discussed here uses German/English Europarl (European Parliament Proceedings)

Resource for Croatian: Acquis Communautaire

Automatically generated word alignment

Additional details in the paper

Formalization of bitext parsing as a parse reranking task

Definitions of bitext feature functions

Analysis of feature functions through feature selection

Comparison of MERT (minimum error rate training) with SVM-Rank
