The LC-STAR project (IST-2001-32216)

Objectives:

Track I (duration 2 years)

Specification and creation of large word lists and lexica suited for flexible vocabulary speech recognition and high quality speech synthesis covering a wide range of domains.

Track II (duration 3 years)

Investigation of speech centered translation technologies focusing on requirements concerning language resources (LR)

Specification and creation of corpora and lexica needed for speech centered translation

Building a demonstrator for speech-to-speech translation

Demonstration of language transfer in Catalan, Spanish and US-English

www.lcstar.com

Creation of Lexica for Speech Centered Translation

Ute Ziegenhain (Siemens AG), Asuncion Moreno (UPC), Nuria Castell (UPC)

1

4 Industrial Partners:



1

2 Partners from Universities:

1 External Partner:

Two approaches:

- Bi-lingual word by word translation lexica with enriched morphological information

- Advantages: reduction of WER

- Disadvantage: for more highly inflected languages, lexicon size increases by a factor of at least 7; effort varies greatly between languages -> provided only for Catalan and Spanish, for statistical experiments

- 'Phrasal' lexica consisting of bi-lingual short phrases typically found in a tourist domain environment

- Advantages: reduction of OOV, better alignment and lexicon model

- Disadvantage: selection of adequate corpora



1

US-English corpora from Verbmobil (112,541 tokens): orthographic transcriptions of telephone conversations in US-English for an appointment scheduling domain

US-English corpora from TALP corpus (408,452 tokens): US-English sentences translated from orthographic transcriptions of telephone conversations in Spanish and Catalan for a tourist domain

Web corpus (2,640,562 tokens): text corpora downloaded from tourist web pages in US-English

Phrasal corpus: 1,500 expressions in US-English selected from tourist phrase books.

Source Corpora

2

Procedure:

1. Create text corpora in a reference language (US-English) in a given domain

2. Select 10,000 of the most frequent content words (i.e. nouns, verbs, adjectives, etc.) to create a representative word list of the domain

3. For each word in the word list, provide the syntactic context in which the words are embedded

4. Cut the sentence down to a segment that contains the word. Segments are usually shortened to noun phrases (for nouns and adjectives) or to subject plus verb plus short complement (for verbs)

5. Manually correct the phrases (e.g. typing and orthographic errors, meaningless or offensive phrases, proper names etc.)

6. Add a set of typical phrasal expressions commonly used in the semantic domain. The set is manually chosen from several tourist text books

-> Result: 'phrasal' reference lexicon consisting of 10,000 short phrases
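Steps 2 and 3 of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function-word list, the toy corpus and the helper names `select_content_words` and `contexts` are hypothetical, and a real run would rely on PoS tagging rather than a stop list to identify content words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption).
# A real run would use PoS tags to keep nouns, verbs, adjectives, etc.
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "in", "is", "at", "and", "i", "for"}


def select_content_words(sentences, n):
    """Return the n most frequent words that are not function words (step 2)."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower()):
            if token not in FUNCTION_WORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(n)]


def contexts(sentences, word):
    """Return the sentences in which the word occurs (step 3)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Toy corpus, invented for illustration only.
corpus = [
    "I need a hotel room for two nights",
    "Is the hotel near the station",
    "The room is at the back of the hotel",
]
word_list = select_content_words(corpus, 2)
room_contexts = contexts(corpus, "room")
```

Cutting each context down to a short segment (step 4) and the manual correction (step 5) are not covered by this sketch.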

Creation of Reference Corpus

3

Format:

Textual format will be used with XML-based mark-up in accordance with a common and language-specific Document Type Definition (DTD)

Advantages of using XML are:

- Widely known technique

- Many tools supporting it are available

- Supports Unicode (useful for languages with non-Latin writing systems)

- Allows easy and concise representation of one-to-many relations (multiple translations, multiple PoS, etc.)

- Easily definable and flexible syntax

- Easy well-formedness tests are possible using publicly available tools
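As an illustration of the last point, a well-formedness test can be written with Python's standard library alone. This is a minimal sketch: the element names in the sample strings are hypothetical and not taken from the LC-STAR DTD, and the check covers well-formedness only, since `xml.etree.ElementTree` does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element names, for illustration only.
good = "<entry><orthography>María</orthography><pos>PRN</pos></entry>"
bad = "<entry><orthography>María</entry>"  # <orthography> is never closed

good_ok = is_well_formed(good)
bad_ok = is_well_formed(bad)
```

The first sample also shows that non-ASCII characters are handled without extra work, which is the Unicode advantage listed above.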

Format

4

Set of 10,000 segments:

- Source language segment: orthography of the source phrase

- Target language segment: target language translation + orthography, one PoS (NOM, VER, ADJ, PRO…) and lemma

- Additional information possible (e.g. tags for foreign words, etc.)

Example:

Content

Language         Content
source EN-US     hi Mary , how are you ?
target ES-ES     hola María , ¿ cómo está ?

Entry_com (target segment):

Orthography   PoS   Lemma
hola          INT   hola
María         PRN   María
,             PUN   ,
¿             PUN   ¿
cómo          CON   como
está          VER   estar
?             PUN   ?
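The example above could be represented in XML roughly as follows. The sketch assumes hypothetical element and attribute names (`entry`, `source`, `target`, `component`, `language`); the actual names are defined by the project DTD, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_entry(source_lang, source_text, target_lang, components):
    """Build one phrasal entry; all element names are illustrative assumptions."""
    entry = ET.Element("entry")
    source = ET.SubElement(entry, "source", {"language": source_lang})
    source.text = source_text
    target = ET.SubElement(entry, "target", {"language": target_lang})
    for orthography, pos, lemma in components:
        component = ET.SubElement(target, "component")
        ET.SubElement(component, "orthography").text = orthography
        ET.SubElement(component, "pos").text = pos
        ET.SubElement(component, "lemma").text = lemma
    return entry


# Orthography, PoS and lemma for each token of the target segment.
components = [
    ("hola", "INT", "hola"),
    ("María", "PRN", "María"),
    (",", "PUN", ","),
    ("¿", "PUN", "¿"),
    ("cómo", "CON", "como"),
    ("está", "VER", "estar"),
    ("?", "PUN", "?"),
]
entry = build_entry("EN-US", "hi Mary , how are you ?", "ES-ES", components)

# The target phrase can be recovered by joining the orthographies.
target_text = " ".join(c.findtext("orthography") for c in entry.find("target"))
xml_text = ET.tostring(entry, encoding="unicode")
```

Storing one component per token keeps the one-to-many relation between a phrase and its annotated words explicit.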

5

Partners and Languages

Partner                                                            Language
IBM                                                                Italian
Nokia                                                              Finnish
Natural Speech Communication (NSC)                                 Hebrew
Rheinisch-Westfälische Technische Hochschule Aachen (RWTH)         German
Siemens                                                            Russian
University of Maribor (UMB)                                        Slovenian
Universitat Politècnica de Catalunya (UPC)                         Spanish
UPC                                                                Catalan
UPC                                                                US-English

6

1. Translate as literally as possible, staying close to the source text while preserving syntactic correctness, semantic meaning and naturalness

2. Idiomatic expressions will be translated and marked as such

3. Ambiguities: select most plausible translation with respect to semantic domain; otherwise provide more than one translation

4. Proper nouns are marked, and are translated only when a translated form is used in the target language (e.g. AIDS -> SIDA)

5. Punctuation marks are separated from words and should be kept.

6. Digits should be kept unless a transcription is required in the target language.

7. Abbreviations should be expanded or kept abbreviated depending on the use in the target language.

8. Foreign words can optionally be labeled with a tag

9. Partial words (e.g. due to false starts): if the reference phrase does not provide enough context to disambiguate, generate the partial target word followed by the + mark.
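Rule 5 (punctuation marks are separated from words and kept) can be illustrated with a small tokenizer. This is a sketch, not the project's tool: the regular expression and the name `separate_punctuation` are assumptions, and the pattern would also split decimal numbers such as 3.5, which a production tokenizer would need to handle.

```python
import re

# A word (optionally with internal apostrophes) or a single punctuation mark.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def separate_punctuation(text):
    """Return text with every punctuation mark split off as its own token."""
    return " ".join(TOKEN.findall(text))


spanish = separate_punctuation("hola María, ¿cómo está?")
english = separate_punctuation("hi Mary, how are you?")
```

Because `\w` matches Unicode letters in Python 3, accented characters stay inside their words while marks such as ¿ become separate tokens.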

Translation Methodology

7

Approach:

- Phrases occurring in all three languages are added to the training corpus

- Training corpus consists of selected dialogues from the Verbmobil and TALP tourism corpora

Preliminary Results:

- Reduced OOV rate (13% relative for Spanish and 23% for Catalan)

- Overall better translation of certain phrases from the tourist domain

- No significant change in translation error rates yet
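For readers unfamiliar with the metric, the sketch below shows how an OOV rate and its relative reduction are typically computed. The tokens and vocabularies are toy data invented for illustration; they are not project data and do not reproduce the figures above.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are not in the vocabulary."""
    unknown = sum(1 for token in test_tokens if token not in vocabulary)
    return unknown / len(test_tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of a rate: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline


# Toy data (assumption): ten test tokens and two vocabularies.
test_tokens = ["una", "habitación", "doble", "por", "favor",
               "con", "vistas", "al", "mar", "gracias"]
baseline_vocab = {"una", "doble", "por", "favor", "con", "al", "mar", "gracias"}
# Adding phrasal lexicon entries to the training data extends the vocabulary.
extended_vocab = baseline_vocab | {"habitación"}

baseline_oov = oov_rate(test_tokens, baseline_vocab)
extended_oov = oov_rate(test_tokens, extended_vocab)
reduction = relative_reduction(baseline_oov, extended_oov)
```

A relative reduction compares the two rates with each other, so it can be large even when the absolute change in OOV rate is small.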

References:

Asuncion Moreno et al. (2004): Language Independent Specification of LR for Translation. D5.5 of the LC-STAR project, IST-2001-32216, to be published.

Nicola Ueffing (2004): Results on Different Structured LR for Speech-to-Speech Translation. D4.5 of the LC-STAR project, IST-2001-32216, to be published.

Maja Popović, Hermann Ney (2004): Towards the Use of Word Stems & Suffixes for Statistical Machine Translation. LREC 2004, Lisbon.

First Experiments and Preliminary Results

Contact: Ute Ziegenhain, e-mail: [email protected]
