1

A Fully Annotated Corpus of Russian Speech

Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova

Department of Phonetics, Saint-Petersburg State University

[email protected]

2

CORPRES

• fully annotated COrpus of Russian Professionally REad Speech

• developed at the Department of Phonetics, Saint-Petersburg State University

• developed for:

– unit-selection TTS

• possible linguistic use:

– research in Russian phonetics and in inter- and intra-speaker speech variability

D. Kocharov, Fully Annotated Corpus of Russian Speech

3

Corpus Description

• 8 speakers (4 women and 4 men)

• 60 hours of read speech (7.5 hours from each speaker).

• Texts of different styles:

– fiction narrative texts,

– a play with emotionally expressive dialogues

– informational texts on IT, politics, economy

• 6 levels of annotation.


4

Annotation

• Level 1: pitch period boundaries

• Level 2: phonetic events

• Level 3: real phonetic transcription

• Level 4: ideal phonetic transcription

• Level 5: orthographic transcription

• Level 6: prosodic transcription


5

Annotated Speech Sample


Annotation file format: 0,1,0,2,h0,8,хотелось0,32,h0,64,12286,16,-968,16,

…

6

Labeling Periods of Fundamental Frequency and Phonetic Events

• F0 periods were detected automatically.

• The efficiency of automatic F0 detection and F0 period labeling was up to 98%.

• The results of the automatic procedure were checked and corrected manually.

• Phonetic events were detected manually:

– epenthetic vowels,

– voice onsets,

– voiced plosures,

– stationary parts of voiceless consonants,

– glottalization.


7

Phonetic Transcription

• A version of SAMPA for Russian was used for transcription.

• 18 symbols were used to mark positional allophones of the 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/.

• Each vowel symbol indicates the vowel's position relative to stress:

– 0 – stressed (accented) vowel,

– 1 – unstressed vowel in a pretonic syllable,

– 4 – unstressed vowel in a post-tonic syllable.

• The set of consonant symbols included 41 symbols:

– 36 Russian consonant phonemes

– 5 voiced allophones of voiceless consonants
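The symbol counts above can be checked with a short sketch. It assumes that a vowel symbol is written as the vowel letter followed by the stress digit (for example `a0`); the slides give only the digits and the counts, so that spelling, and the names `VOWEL_SYMBOLS` and `describe`, are illustrative rather than the corpus's documented notation.

```python
from itertools import product

# The six Russian vowel phonemes listed on the slide.
VOWELS = ["a", "o", "i", "u", "e", "y"]

# Stress-position codes listed on the slide.
STRESS_CODES = {
    "0": "stressed vowel",
    "1": "unstressed vowel in a pretonic syllable",
    "4": "unstressed vowel in a post-tonic syllable",
}

# ASSUMPTION: a symbol is spelled as vowel letter + stress digit, e.g. "a0".
VOWEL_SYMBOLS = [vowel + code for vowel, code in product(VOWELS, STRESS_CODES)]

# 36 consonant phonemes plus 5 voiced allophones of voiceless consonants.
N_CONSONANT_SYMBOLS = 36 + 5


def describe(symbol: str) -> str:
    """Split a vowel symbol into its phoneme and stress position."""
    vowel, code = symbol[:-1], symbol[-1]
    if vowel not in VOWELS or code not in STRESS_CODES:
        raise ValueError(f"not a vowel symbol: {symbol!r}")
    return f"/{vowel}/: {STRESS_CODES[code]}"


if __name__ == "__main__":
    print(len(VOWEL_SYMBOLS), "vowel symbols,", N_CONSONANT_SYMBOLS, "consonant symbols")
    print(describe("a0"))
```

Six phonemes times three stress positions gives the 18 vowel symbols, and 36 + 5 gives the 41 consonant symbols.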


8

Real Phonetic Transcription

• The speech signal was manually:

– segmented

– transcribed

– peer-revised.

• Huge work!

• Time efficiency: ~1 sound per minute.

=> 1 minute of speech per 1-2 hours.


9

Ideal Phonetic Transcription

• Transcription:

– is generated from texts.

• Labels:

– placed automatically to coincide with the label positions produced manually on the real transcription level.

• Automatic labeling:

– not perfect due to the mismatch of ideal and real phonetic transcriptions.

=> results of the automatic procedure were further manually corrected.
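To illustrate why the ideal and real transcriptions do not line up one-to-one, the sketch below compares two phoneme sequences and counts matches, substitutions, and elisions. It is not the procedure used for the corpus, which the slides do not describe. It uses Python's standard `difflib.SequenceMatcher`, which finds longest matching blocks rather than a minimal edit-distance alignment, and the function name, category names, and example sequences are made up for illustration.

```python
from difflib import SequenceMatcher


def compare_transcriptions(ideal, real):
    """Count how the phonemes of the ideal transcription were realized.

    ASSUMPTION: both transcriptions are lists of symbol strings.
    """
    counts = {"correct": 0, "mispronounced": 0, "elided": 0, "inserted": 0}
    matcher = SequenceMatcher(a=ideal, b=real, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_ideal, n_real = i2 - i1, j2 - j1
        if tag == "equal":
            counts["correct"] += n_ideal
        elif tag == "delete":
            # Present in the ideal transcription, absent in the real one.
            counts["elided"] += n_ideal
        elif tag == "insert":
            # Present only in the real transcription.
            counts["inserted"] += n_real
        else:  # "replace": pair symbols one-to-one, the surplus is elided or inserted
            paired = min(n_ideal, n_real)
            counts["mispronounced"] += paired
            counts["elided"] += n_ideal - paired
            counts["inserted"] += n_real - paired
    return counts


if __name__ == "__main__":
    ideal = ["k", "a1", "g", "d", "a0"]  # hypothetical ideal transcription
    real = ["k", "a4", "d", "a0"]        # hypothetical realization with one sound dropped
    print(compare_transcriptions(ideal, real))
```

In the example, three symbols match, one vowel is realized differently, and one consonant is missing from the real transcription, so a label placed for the missing consonant has no manual counterpart to coincide with.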


10

Orthographic and Prosodic Transcription

• Orthographic transcription (Level 5) contains:

– word boundaries and word labels,

– special symbols that mark prosodically prominent words.

• Prosodic transcription (Level 6) contains:

– boundaries and labels of tone units and pauses.

• Prosodic information was marked by expert phoneticians on the basis of perceptual and acoustic analysis of the speech data in a text file containing orthographic transcription.

• Labels were later automatically transferred from the text file to the annotation files to coincide with the phonetic transcription levels.


11

Corpus Data Description


             Fully Annotated Data   Partly Annotated Data   Total Amount

Phonemes     1 048 867              –                       –

Words        211 437                317 021                 528 458

Tone Units   64 055                 86 546                  150 601

Hours        24                     36                      60

• 40% of the corpus: manually segmented and fully annotated on all six levels.

• 60% of the corpus: partly annotated:

– pitch period and phonetic event labels are present,

– there is no manual phonetic transcription,

– orthographic, prosodic, and ideal phonetic transcriptions are done, but not aligned with the speech signal.

12

Mismatch Between Ideal and Real Transcriptions

          Total       Correctly pronounced   Mispronounced   Elided

Count     1 118 833   947 508                101 292         70 033

Percent   100         84.7                   9.05            6.25
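The percentages in the table follow directly from the counts. The sketch below recomputes them; the counts are taken from the table, while the dictionary keys and the function name `shares` are illustrative.

```python
# Counts from the table above.
TOTAL = 1_118_833
COUNTS = {
    "correct": 947_508,
    "mispronounced": 101_292,
    "elided": 70_033,
}


def shares(counts, total):
    """Return each category's share of the total, in percent."""
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # The three categories account for every phoneme in the total.
    assert sum(COUNTS.values()) == TOTAL
    for name, pct in shares(COUNTS, TOTAL).items():
        print(f"{name:14s} {pct:6.2f}%")
```

The three counts sum exactly to the total, and the computed shares agree with the percentages in the table to within rounding in the last digit.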


13

Conclusions

• It is the only available corpus for Russian TTS.

• Precise annotation makes the corpus an especially valuable resource for both linguistic research and speech application development.

• The corpus is large enough for:

– high-quality TTS

– phonetic research of intra- and inter-speaker pronunciation variability

• Our experience shows:

– manual transcription is very expensive, but worth doing.


14

Thank you!

