The LC-STAR project (IST-2001-32216)
Objectives:
Track I (duration 2 years)
Specification and creation of large word lists and lexica suited for flexible vocabulary
speech recognition and high quality speech synthesis covering a wide range of domains.
Track II (duration 3 years)
Investigation of speech centered translation technologies focusing on requirements concerning language resources (LR)
Specification and creation of corpora and lexica needed for speech centered translation
Building a demonstrator for speech-to-speech translation
Demonstration of language transfer in Catalan, Spanish and US-English
www.lcstar.com
Creation of Lexica for Speech Centered Translation
Ute Ziegenhain (Siemens AG), Asuncion Moreno (UPC), Nuria Castell (UPC)
1
4 Industrial Partners:
2 Partners from Universities:
1 External Partner:
Two approaches:
- Bilingual word-by-word translation lexica with enriched morphological information
  - Advantage: reduction of WER
  - Disadvantage: for more highly inflected languages, lexicon size increases by a factor of at least 7; effort varies greatly between languages -> only provided for Catalan and Spanish for statistical experiments
- 'Phrasal' lexica consisting of bilingual short phrases typically found in a tourist domain environment
  - Advantages: reduction of OOV, better alignment and lexicon model
  - Disadvantage: selection of adequate corpora
US-English corpus from Verbmobil (112,541 tokens): orthographic transcriptions of telephone conversations in US-English for an appointment scheduling domain
US-English corpus from the TALP corpus (408,452 tokens): US-English sentences translated from orthographic transcriptions of telephone conversations in Spanish and Catalan for a tourist domain
Web corpus (2,640,562 tokens): text corpora downloaded from tourist web pages in US-English
Phrasal corpus: 1,500 expressions in US-English selected from tourist phrase books
Source Corpora
2
Procedure:
1. Create text corpora in a reference language (US-English) in a given domain
2. Select 10,000 of the most frequent content words (i.e. nouns, verbs, adjectives, etc.) to create a representative word list of the domain
3. For each word in the word list, provide the syntactic context in which the word is embedded
4. Cut the sentence down to a segment that contains the word. Segments are usually shortened to nominal phrases (for nouns and adjectives) or to subject plus verb plus short complement (for verbs)
5. Manually correct the phrases (e.g. typing and orthographic errors, meaningless or offensive phrases, proper names, etc.)
6. Add a set of typical phrasal expressions commonly used in the semantic domain. The set is manually chosen from several tourist phrase books
-> Result: 'phrasal' reference lexicon consisting of 10,000 short phrases
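Steps 2 to 4 of the procedure can be illustrated with a minimal Python sketch. It is not the project's tooling: the `FUNCTION_WORDS` stop list is a hypothetical stand-in for a real content-word filter (the poster does not say how content words were identified), and the fixed token window in `segment_around` is only a crude approximation of shortening to nominal or verbal phrases.

```python
import re
from collections import Counter

# Hypothetical stop list; a real setup would more likely rely on a PoS tagger.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "at"}


def select_word_list(sentences, size=10_000):
    """Return the `size` most frequent non-function words, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z']+", sentence.lower()):
            if token not in FUNCTION_WORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(size)]


def segment_around(sentence, word, window=2):
    """Cut a sentence down to a short segment containing `word`.

    Returns None if the word does not occur in the sentence.
    """
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    if word not in lowered:
        return None
    i = lowered.index(word)
    return " ".join(tokens[max(0, i - window): i + window + 1])
```

The resulting segments would still need the manual correction described in step 5.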
Creation of Reference Corpus
3
Format: Textual format will be used with XML-based mark-up in accordance with a common and language-specific Document Type Definition (DTD)
Advantages of using XML are:
- Widely known technique
- Many tools supporting it are available
- Supports Unicode (useful for languages with non-Latin writing systems)
- Allows easy and concise representation of one-to-many relations (multiple translations, multiple PoS, etc.)
- Easily definable and flexible syntax
- Easy well-formedness tests are possible using publicly available tools
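As an illustration of the last point, a well-formedness test can be written in a few lines with Python's standard library. This is a sketch only: the `entry` and `pos` element names in the usage below are made up for the example, and the check covers well-formedness, not validity against the project DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, `is_well_formed("<entry><pos>VER</pos></entry>")` is true, while a fragment with a missing closing tag such as `"<entry><pos>VER</entry>"` is rejected.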
Format
4
Set of 10,000 segments:
- Source language segment: orthography of the source phrase
- Target language segment: target language translation plus, for each entry, orthography, one PoS (NOM, VER, ADJ, PRO…) and lemma
- Additional information possible (e.g. tags for foreign words, etc.)
Example:
         language  content
source   EN-US     hi Mary , how are you ?
target   ES-ES     hola María , ¿ cómo está ?
Entries of the target segment:
orthography  pos  lemma
hola         INT  hola
María        PRN  María
,            PUN  ,
¿            PUN  ¿
cómo         CON  como
está         VER  estar
?            PUN  ?
Content
5
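The example segment above could be serialized to XML roughly as follows. This is a hedged sketch: the element and attribute names (`segment`, `source`, `target`, `content`, `entry`, `orthography`, `pos`, `lemma`, `language`) are assumptions chosen to mirror the column labels of the example, not the names defined in the LC-STAR DTD.

```python
import xml.etree.ElementTree as ET


def build_segment(src_lang, src_text, tgt_lang, tgt_text, entries):
    """Build one lexicon segment; `entries` holds (orthography, pos, lemma) triples."""
    segment = ET.Element("segment")
    source = ET.SubElement(segment, "source", language=src_lang)
    source.text = src_text
    target = ET.SubElement(segment, "target", language=tgt_lang)
    ET.SubElement(target, "content").text = tgt_text
    for orthography, pos, lemma in entries:
        entry = ET.SubElement(target, "entry")
        ET.SubElement(entry, "orthography").text = orthography
        ET.SubElement(entry, "pos").text = pos
        ET.SubElement(entry, "lemma").text = lemma
    return segment


# Entries copied from the example above.
entries = [
    ("hola", "INT", "hola"),
    ("María", "PRN", "María"),
    (",", "PUN", ","),
    ("¿", "PUN", "¿"),
    ("cómo", "CON", "como"),
    ("está", "VER", "estar"),
    ("?", "PUN", "?"),
]
segment = build_segment(
    "EN-US", "hi Mary , how are you ?",
    "ES-ES", "hola María , ¿ cómo está ?",
    entries,
)
xml_text = ET.tostring(segment, encoding="unicode")
```

Several `target` elements under one `segment` would be one way to express the one-to-many relations (multiple translations) mentioned under Format.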
Partners and Languages
Partner                                                          Language
IBM                                                              Italian
Nokia                                                            Finnish
Natural Speech Communication (NSC)                               Hebrew
Rheinisch-Westfälische Technische Hochschule Aachen (RWTH)       German
Siemens                                                          Russian
University of Maribor (UMB)                                      Slovenian
Universitat Politècnica de Catalunya (UPC)                       Spanish
UPC                                                              Catalan
UPC                                                              US-English
6
1. Translate as literally as possible with respect to the source text, while preserving syntactic correctness, semantic meaning and naturalness.
2. Idiomatic expressions are translated and marked as such.
3. Ambiguities: select the most plausible translation with respect to the semantic domain; otherwise provide more than one translation.
4. Proper nouns are marked, and translated only when a translated form is used in the target language (e.g. AIDS -> SIDA).
5. Punctuation marks are separated from words and should be kept.
6. Digits should be kept unless a transcription is required in the target language.
7. Abbreviations should be expanded or kept abbreviated depending on usage in the target language.
8. Foreign words can optionally be labeled with a tag.
9. Partial words (e.g. due to false starts): if the reference phrase does not provide enough context to disambiguate, generate the partial target word followed by the + mark.
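Rule 5 (punctuation separated from words and kept) can be sketched with a small regular-expression tokenizer. This is only an illustration under simple assumptions: the pattern treats any run of letters or digits (optionally with internal apostrophes) as a word and every other non-space character as a separate punctuation token, and it does not handle the + mark for partial words or language-specific conventions.

```python
import re

# A word (letters/digits, optionally with internal apostrophes) or one punctuation mark.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(phrase):
    """Split a phrase into word and punctuation tokens, keeping every mark."""
    return TOKEN.findall(phrase)


tokens = tokenize("hola María, ¿cómo está?")
spaced = " ".join(tokens)
```

Applied to `hola María, ¿cómo está?`, the tokens joined with single spaces give the space-separated form used in the lexicon example, with digits left untouched as required by rule 6.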
Translation Methodology
7
Approach:
- Phrases occurring in all three languages are added to the training corpus
- Training corpus consists of selected dialogues from Verbmobil and TALP tourism
corpus
Preliminary Results:
- Reduced OOV rate (13% relative for Spanish and 23% for Catalan)
- Overall better translation of certain phrases from the tourist domain
- No significant change in translation error rates yet
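For readers unfamiliar with the metric, the sketch below shows how an OOV (out-of-vocabulary) rate and a relative reduction are typically computed. It assumes the usual token-level definition, which the poster does not spell out, and the numbers in the usage note are invented for illustration; they are not the project's measurements.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that do not appear in the training vocabulary."""
    if not test_tokens:
        return 0.0
    unknown = sum(1 for token in test_tokens if token not in vocabulary)
    return unknown / len(test_tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of a rate, e.g. 0.13 for a 13% relative reduction."""
    return (baseline - improved) / baseline
```

With made-up rates, a drop from 10.0% to 8.7% OOV corresponds to `relative_reduction(0.10, 0.087)`, i.e. about 13% relative.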
References:
Asuncion Moreno et al. (2004): Language Independent Specification of LR for Translation. D5.5 of the LC-STAR project, IST-2001-32216, to be published.
Nicola Ueffing (2004): Results on Different Structured LR for Speech-to-Speech Translation. D4.5 of the LC-STAR project, IST-2001-32216, to be published.
Maja Popović, Hermann Ney (2004): Towards the Use of Word Stems & Suffixes for Statistical Machine Translation. LREC 2004, Lisbon.
First Experiments and Preliminary Results
Contact: Ute Ziegenhain, e-mail: [email protected]