A Fully Annotated Corpus of Russian Speech

Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova

Department of Phonetics, Saint-Petersburg State University

[email protected]

CORPRES

• fully annotated COrpus of Russian Professionally REad Speech

• developed at the Department of Phonetics, Saint-Petersburg State University

• developed for:

– unit-selection TTS

• possible linguistic use:

– research in Russian phonetics and in inter- and intra-speaker speech variability

D. Kocharov, Fully Annotated Corpus of Russian Speech

Corpus Description

• 8 speakers (4 women and 4 men)

• 60 hours of read speech (7.5 hours from each speaker).

• Texts of different styles:

– fiction narrative texts,

– a play with emotionally expressive dialogues

– informational texts on IT, politics, economy

• 6 levels of annotation.

Annotation

• Level 1: pitch period boundaries

• Level 2: phonetic events

• Level 3: real phonetic transcription

• Level 4: ideal phonetic transcription

• Level 5: orthographic transcription

• Level 6: prosodic transcription

Annotated Speech Sample

Annotation file format: 0,1,0,2,h0,8,хотелось0,32,h0,64,12286,16,-968,16,

Labeling Periods of Fundamental Frequency and Phonetic Events

• F0 periods were detected automatically.

• Automatic F0 detection and F0 period labeling reached an accuracy of up to 98%.

• The results of the automatic procedure were checked and corrected manually.

• Phonetic events were detected manually:

– epenthetic vowels,

– voice onsets,

– voiced plosures,

– stationary parts of voiceless consonants,

– glottalization.

Phonetic Transcription

• A version of SAMPA for Russian was used for transcription.

• 18 symbols were used to mark positional allophones of the 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/.

• Each symbol indicates the vowel's position relative to stress:

– 0 – stressed (accented) vowel,

– 1 – unstressed vowel in a pretonic syllable,

– 4 – unstressed vowel in a post-tonic syllable.

• The set of consonant symbols included 41 symbols:

– 36 Russian consonant phonemes

– 5 voiced allophones of voiceless consonants
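The symbol counts above can be checked with a minimal sketch. It assumes that each vowel label is formed by appending the stress digit to the phoneme letter (for example `a0`); that label format is an illustrative assumption and not necessarily the corpus's actual notation.

```python
from itertools import product

# The six vowel phonemes and the three stress-position digits listed on the slide.
VOWEL_PHONEMES = ["a", "o", "i", "u", "e", "y"]
STRESS_POSITIONS = {
    "0": "stressed",
    "1": "unstressed, pretonic syllable",
    "4": "unstressed, post-tonic syllable",
}

# Hypothetical label format: phoneme letter followed by the stress digit.
vowel_symbols = [p + d for p, d in product(VOWEL_PHONEMES, STRESS_POSITIONS)]

# Consonant inventory size as stated on the slide: 36 phonemes + 5 voiced allophones.
N_CONSONANT_SYMBOLS = 36 + 5


def describe(symbol):
    """Split a hypothetical label such as 'a1' into (phoneme, stress position)."""
    phoneme, digit = symbol[:-1], symbol[-1]
    return phoneme, STRESS_POSITIONS[digit]


print(len(vowel_symbols), N_CONSONANT_SYMBOLS)
print(describe("u4"))
```

Six phonemes in three stress positions give the 18 vowel symbols mentioned above, and 36 + 5 gives the 41 consonant symbols.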

Real Phonetic Transcription

• The speech signal was manually:

– segmented

– transcribed

– peer-revised.

• Huge work!

• Time efficiency: ~1 sound per minute.

=> 1 minute of speech takes 1-2 hours to annotate.

Ideal Phonetic Transcription

• Transcription:

– generated from the texts.

• Labels:

– placed automatically to coincide with the label positions produced manually on the real transcription level.

• Automatic labeling:

– not perfect due to the mismatch of ideal and real phonetic transcriptions.

=> results of the automatic procedure were further manually corrected.
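The mismatch between the two transcription levels can be illustrated with a rough sketch. It aligns an ideal and a real phone sequence using Python's standard `difflib.SequenceMatcher`, which is a stand-in chosen for illustration and not the procedure used for the corpus. The phone labels and the function name `compare_transcriptions` are hypothetical.

```python
from difflib import SequenceMatcher


def compare_transcriptions(ideal, real):
    """Count how ideal phones relate to the real (manually transcribed) phones."""
    counts = {"correct": 0, "mispronounced": 0, "elided": 0, "inserted": 0}
    matcher = SequenceMatcher(a=ideal, b=real, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_ideal, n_real = i2 - i1, j2 - j1
        if tag == "equal":
            # Ideal and real phones agree.
            counts["correct"] += n_ideal
        elif tag == "replace":
            # Pair phones one-to-one; any surplus is an elision or an insertion.
            paired = min(n_ideal, n_real)
            counts["mispronounced"] += paired
            counts["elided"] += n_ideal - paired
            counts["inserted"] += n_real - paired
        elif tag == "delete":
            # Present in the ideal transcription, absent in the real one.
            counts["elided"] += n_ideal
        elif tag == "insert":
            # Present only in the real transcription (e.g. an epenthetic vowel).
            counts["inserted"] += n_real
    return counts


# Hypothetical phone sequences: one vowel is dropped and one is replaced.
ideal = ["p", "a1", "k", "a0", "t"]
real = ["p", "k", "o0", "t"]
print(compare_transcriptions(ideal, real))
```

Where the two sequences disagree, an automatic aligner has no unique place to put the ideal labels, which is one way to see why the automatically placed labels needed manual correction.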

Orthographic and Prosodic Transcription

• Orthographic transcription (Level 5) contains:

– word boundaries and word labels;

– special symbols marking prosodically prominent words.

• Prosodic transcription (Level 6) contains:

– boundaries of tone units and pauses, and their labels.

• Prosodic information was marked by expert phoneticians in a text file containing the orthographic transcription, on the basis of perceptual and acoustic analysis of the speech data.

• Labels were later automatically transferred from the text file to the annotation files and aligned with the phonetic transcription levels.

Corpus Data Description

              Fully Annotated Data    Partly Annotated Data    Total Amount

Phonemes      1 048 867               –                        –

Words         211 437                 317 021                  528 458

Tone Units    64 055                  86 546                   150 601

Hours         24                      36                       60

• 40% of the corpus: manually segmented and fully annotated on all six levels.

• 60% of the corpus: partly annotated:

– pitch period and phonetic event labels;

– no manual phonetic transcription;

– orthographic, prosodic, and ideal phonetic transcriptions are done, but not aligned with the speech signal.
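As a quick consistency check, the figures reported for this slide can be recomputed with a few lines of Python. The dictionary names are illustrative; the numbers are the ones given in the table.

```python
# Figures from the corpus data table (fully vs. partly annotated parts).
fully = {"words": 211_437, "tone_units": 64_055, "hours": 24}
partly = {"words": 317_021, "tone_units": 86_546, "hours": 36}

# Totals per row, to compare with the "Total Amount" column.
total = {key: fully[key] + partly[key] for key in fully}

# Share of the corpus (by duration) that is fully annotated.
share_fully = fully["hours"] / total["hours"]

# Average number of labeled phonemes per hour in the fully annotated part.
phonemes_per_hour = 1_048_867 / fully["hours"]

print(total)
print(f"fully annotated share: {share_fully:.0%}")
print(f"phonemes per hour: {phonemes_per_hour:.0f}")
```

The sums reproduce the totals in the table, and 24 of 60 hours corresponds to the stated 40%.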

Mismatch Between Ideal and Real Transcriptions

           Total        Correctly pronounced    Mispronounced    Elided

Count      1 118 833    947 508                 101 292          70 033

Percent    100          84.7                    9.05             6.25
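The percentages follow directly from the counts. The sketch below recomputes them; the category labels are illustrative, and the recomputed values agree with the reported ones to within rounding.

```python
# Counts of ideal-transcription phonemes by outcome, as reported on the slide.
counts = {
    "correctly pronounced": 947_508,
    "mispronounced": 101_292,
    "elided": 70_033,
}

total = sum(counts.values())

# Share of each outcome as a percentage of all ideal phonemes.
shares = {label: 100 * n / total for label, n in counts.items()}

print(total)
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```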

Conclusions

• CORPRES is the only available corpus for Russian TTS.

• Its precise annotation makes it an especially valuable resource for both linguistic research and speech application development.

• The corpus is large enough for:

– high-quality TTS

– phonetic research on intra- and inter-speaker pronunciation variability

• Our experience shows:

– manual transcription is very expensive, but worth doing.

Thank you!