Copyright 2007, Toshiba Corporation.

How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced

Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz

6th ISCA Workshop on Speech Synthesis, Bonn, Germany

22-24th August 2007

Overview

Text selection for a TTS voice

Random sub-corpus

Phonologically balanced sub-corpus

Phonetic and phonological inventory of full corpus and its sub-corpora

Phonetic and phonological coverage of units in test sentences with respect to the full corpus and its sub-corpora

Voice building - automatic annotation and training

Objective and subjective evaluations

Conclusions

Selection of Text for a TTS Voice

Voice preparation for a TTS system is affected by:

Text domain from which text is selected

Text annotations (phonetic, phonological, prosodic, syntactic)

The linguistic and signal processing capabilities of the TTS system

Unit selection method and the type of units selected for speech synthesis

Corpus training

Speech annotation (automatic/manual; phonetic details, post-lexical effects)

Other factors (time and financial resources, voice talent, recording quality, the target audience of a TTS application, etc.)

Text Selection

Our case study tries to answer the following question: What is the effect of different script selection methods on a half-phone unit selection system, automatic corpus annotation and corpus training?

Full corpus: The ATR American English Speech Corpus for Speech Synthesis (~ 8 h) used in this year’s Blizzard Challenge.

Random sub-corpus (0.8 h); Phonologically-rich sub-corpus (0.8 h)

[Diagram: two sub-corpora are drawn from the full corpus (~8 h): Phonbal, by phonologically balanced selection, and Random, by random selection.]

Phonologically-Rich Sub-Corpus

[Flow diagram; recoverable labels, in approximate order:]

Phonetically and phonologically transcribed full corpus

Lexical units (full corpus)

Set cover algorithm

Sub-corpus A (1133 sentences)

Removed stress in consonants

Lexical units (sub-corpus)

Set cover algorithm

594 sentences covered 1 unit per sentence

Sub-corpus A: 539 sentences (above the cut point)

+ Sub-corpus B: sentences from full corpus (emphasis on interrogative, exclamatory, multisyllabic phrases, consonant clusters before and after silence)

= Sub-corpus (728 sentences, ~2906 sec)
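The slides name a set cover algorithm but do not say which variant was used. As a minimal sketch, assuming the common greedy variant, the selection could look like the following. The function names, the toy unit labels and the `min_gain` cut point are hypothetical and serve only as illustration.

```python
def greedy_set_cover(sentence_units):
    """Greedily pick sentences until every unit type is covered.

    sentence_units maps a sentence id to the set of unit types it contains.
    Returns (sentence_id, gain) pairs in selection order, where gain is the
    number of previously uncovered unit types that the sentence contributed.
    """
    uncovered = set()
    for units in sentence_units.values():
        uncovered |= units
    remaining = dict(sentence_units)
    selected = []
    while uncovered:
        # Choose the sentence covering the most still-uncovered unit types.
        best = max(remaining, key=lambda sid: len(remaining[sid] & uncovered))
        gain = len(remaining[best] & uncovered)
        selected.append((best, gain))
        uncovered -= remaining.pop(best)
    return selected


def apply_cut_point(selected, min_gain):
    """Keep only the sentences that contributed at least min_gain new units."""
    return [sid for sid, gain in selected if gain >= min_gain]


# Toy data (hypothetical): sentence id -> unit types found in that sentence.
toy = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "x-y"},
    "s3": {"b-c"},
    "s4": {"x-y", "y-z"},
}
picked = greedy_set_cover(toy)
kept = apply_cut_point(picked, min_gain=3)
```

Recording the gain per sentence makes it possible to drop sentences that add only a single new unit, which is one way to read the cut point shown in the diagram.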

Random Sub-Corpus

[Flow diagram; recoverable labels, in approximate order:]

Full corpus

Randomized sequence of sentences: Sub-corpus (686 sentences, < 2914 sec)

Removed sentences including foreign words

+ 1 sentence = 2914 sec

Sub-corpus (687 sentences, ~2914 sec)
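A rough sketch of random selection under a duration budget, assuming each sentence carries a duration and a flag for foreign words. The tuple layout, the seed and the stopping rule are assumptions for illustration, not the procedure actually used for the Random sub-corpus.

```python
import random


def random_subcorpus(sentences, budget_sec, seed=0):
    """Pick sentences in random order until the duration budget is reached.

    sentences is a list of (sentence_id, duration_sec, has_foreign_word).
    Sentences flagged as containing foreign words are skipped.
    Returns the chosen ids and their total duration in seconds.
    """
    order = list(sentences)
    random.Random(seed).shuffle(order)  # fixed seed keeps the run repeatable
    chosen, total = [], 0.0
    for sid, dur, foreign in order:
        if foreign:
            continue
        if total + dur > budget_sec:
            break  # stop before the budget would be exceeded
        chosen.append(sid)
        total += dur
    return chosen, total


# Toy pool (hypothetical durations and flags).
pool = [(f"s{i}", 4.0 + (i % 3), i % 7 == 0) for i in range(30)]
chosen, total = random_subcorpus(pool, budget_sec=40.0, seed=1)
```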

Textual and Duration Characteristics of Corpora

                    Full      Arctic    Phonbal   Random
seconds             28,591    2,914     2,906     2,914
sentences           6,579     1,032     728       687
words               79,182    9,196     8,156     8,094
words/sent.         12.0      8.9       11.2      11.8
% sentences with
  1-9 words         37.7      54.9      41.0      38.6
  10-15 words       27.6      45.1      18.6      26.9
  > 15 words        34.8      -         40.4      34.5
‘?’                 868       1         96        94
‘!’                 4         -         -         1
‘,’                 3,977     430       452       410
‘;’                 30        6         4         3
‘:’                 17        -         -         -
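As an illustration of how the words-per-sentence figure and the sentence-length bands in the table can be derived, here is a small sketch. It assumes sentences are already tokenized into non-empty word lists; the function name and the toy sentences are hypothetical.

```python
def length_profile(sentences):
    """Return mean words per sentence and the share of each length band.

    sentences is a non-empty list of non-empty word lists.
    Bands follow the table: 1-9 words, 10-15 words, more than 15 words.
    """
    n = len(sentences)
    lengths = [len(words) for words in sentences]
    bands = {"1-9": 0, "10-15": 0, ">15": 0}
    for k in lengths:
        if k <= 9:
            bands["1-9"] += 1
        elif k <= 15:
            bands["10-15"] += 1
        else:
            bands[">15"] += 1
    return {
        "words_per_sentence": sum(lengths) / n,
        "pct": {band: 100.0 * count / n for band, count in bands.items()},
    }


# Toy sentences of 3, 10, 16 and 9 words.
toy = [["w"] * 3, ["w"] * 10, ["w"] * 16, ["w"] * 9]
profile = length_profile(toy)
```

The same arithmetic reproduces the table's average for the full corpus: 79,182 words over 6,579 sentences rounds to 12.0 words per sentence.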

Distribution of Unit Types in Full Corpus and its Sub-Corpora

Corpus Selection - Considerations

Selection of text based on broad phonetic transcription may be insufficient

Inclusion of phonological, prosodic and syntactic markings: how can it be made effective for a half-phone unit selection system?

Unit types                     Full     Arctic   Phonbal  Random
diph. (no stress)              1607     1385     1510     1322
lex. diphones                  4332     2716     3306     2735
lex. triphones                 17032    7945     8716     8144
sil_CV clusters (no stress)    104      42       46       43
VC_sil clusters (no stress)    184      84       100      75
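The slides do not define how the unit types were extracted. A minimal sketch, assuming that "lexical" units keep a stress digit on vowel labels (ARPAbet-style, e.g. `ae1`) and that the "no stress" variants strip it, could count distinct diphone and triphone types like this. The label format, the `sil` padding and all function names are assumptions.

```python
def strip_stress(phone):
    """Drop a trailing stress digit, e.g. 'ae1' -> 'ae' (assumed label format)."""
    return phone.rstrip("012")


def unit_types(utterances, n, keep_stress=True):
    """Collect distinct phone n-grams (n=2 diphones, n=3 triphones).

    utterances is a list of phone-label lists. Each utterance is padded
    with 'sil' on both sides so that utterance-edge units are counted too.
    """
    types = set()
    for phones in utterances:
        labels = [p if keep_stress else strip_stress(p) for p in phones]
        seq = ["sil"] + labels + ["sil"]
        for i in range(len(seq) - n + 1):
            types.add(tuple(seq[i:i + n]))
    return types


# Toy utterances (hypothetical): the same word with and without stress.
toy = [["k", "ae1", "t"], ["k", "ae0", "t"]]
lexical_diphones = unit_types(toy, 2, keep_stress=True)
plain_diphones = unit_types(toy, 2, keep_stress=False)
lexical_triphones = unit_types(toy, 3, keep_stress=True)
```

Keeping stress in the labels multiplies the number of types, which is consistent with the table listing more lexical diphones than stress-free diphones for every corpus.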

Percentage Distribution of Units in Full Corpus and its Sub-corpora

[Two bar charts. Y-axis: % of full corpus. X-axis: Arctic, Random, Phonol. rich. Series: diphones (no stress), lexical diphones, lexical triphones, sil_CV clusters, VC_sil clusters.]
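The unit-type counts tabulated on the earlier slide can be expressed as percentages of the full-corpus counts. The sketch below does that arithmetic for two rows; whether the charts on this slide were computed in exactly this way is an assumption.

```python
# Unit-type counts copied from the unit-type table (two rows shown).
counts = {
    "diph. (no stress)": {"Full": 1607, "Arctic": 1385, "Phonbal": 1510, "Random": 1322},
    "lex. diphones": {"Full": 4332, "Arctic": 2716, "Phonbal": 3306, "Random": 2735},
}


def pct_of_full(row):
    """Express each sub-corpus count as a percentage of the full-corpus count."""
    full = row["Full"]
    return {name: 100.0 * value / full for name, value in row.items() if name != "Full"}


coverage = {unit: pct_of_full(row) for unit, row in counts.items()}
for unit, row in coverage.items():
    print(unit, {name: round(pct, 1) for name, pct in row.items()})
```

On these counts Phonbal holds roughly 94 % of the stress-free diphone types of the full corpus, against about 86 % for Arctic and 82 % for Random.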

Distribution of Unit Types in Test Sentences

[Bar chart. Y-axis: type occurrences (0 to 4000). X-axis: diph. (no stress), lexical diph., lexical triph. Series: conv, mrt, news, novel, sus.]

Testing distribution of unit types in 400 test sentences, 100 sentences each from: conv = conversational; mrt = modified rhyme test; news = news texts; novel = sentences from a novel; sus = semantically unpredictable sentences

Distribution of Lexical Diphone Types per Corpus per Text Genre

[Bar chart. Y-axis: occurrence of lexical diph. types (0 to 1800). Series: Full corpus, Arctic, Phon. rich, Random.]

Missing Diphone Types from Each Corpus in Relation to Test Sentences

[Two bar charts. Y-axes: missing lexical diphone types (0 to 160) and missing diphone types (no stress) (0 to 30). Series: Full corpus, Arctic, Random, Phonologically rich.]

Diphone Types in Each Corpus but not Required in Test Sentences

[Two bar charts. X-axis: conv, mrt, news, novel, sus. Y-axes: lexical diphone types (0 to 4500) and diphone types (no stress) (0 to 1600). Series: Full corpus, Arctic, Random, Phonologically rich.]
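Both comparisons (types required by the test sentences but missing from a corpus, and types present in a corpus but not required by the test sentences) amount to set differences over unit-type inventories. A minimal sketch with hypothetical toy inventories:

```python
def coverage_gaps(corpus_types, test_types):
    """Compare a corpus inventory with the unit types needed by a test set.

    Returns (missing, unused):
      missing = types required by the test sentences but absent from the corpus
      unused  = types present in the corpus but not required by the test sentences
    """
    missing = test_types - corpus_types
    unused = corpus_types - test_types
    return missing, unused


# Toy inventories (hypothetical diphone labels).
corpus = {"k-ae", "ae-t", "t-sil", "s-ih"}
test_sentences = {"k-ae", "ae-t", "d-ao"}
missing, unused = coverage_gaps(corpus, test_sentences)
```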

Voice Building – Automatic Annotation and Training

Synthesis voices were created from both corpora, Phonbal and Random

Automatic synthesis voice creation encompasses:

• Grapheme-to-phoneme conversion

• Automatic phone alignment

• Automatic prosody annotation

• Automatic prosody training (duration, F0, pause, etc.)

• Speech unit database creation

Automatic phone alignment:

• Depends on the quality of grapheme-to-phoneme conversion

• Depends on the output of text normalisation

• Uses HMMs with a flat start, i.e. depends on corpus size

• Respects pronunciation variants

• Acoustic model topology: three-state Markov, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
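To make the model structure concrete, the following sketch builds the transition matrix of a three-state, left-to-right HMM without skips. The extra exit column and the self-loop probability of 0.6 are illustrative assumptions; the values used in the actual system are not given in the slides.

```python
def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Build an n_states x (n_states + 1) transition matrix.

    Row i holds the probabilities of leaving emitting state i. Only two
    moves are allowed: stay in state i, or advance to state i + 1. The
    last column stands for leaving the model (a non-emitting exit state).
    """
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    matrix = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # advance by exactly one state
        matrix.append(row)
    return matrix


transitions = left_to_right_transitions()
for row in transitions:
    print(row)
```

Entries such as `transitions[0][2]` stay at zero, which is what "no skips" means: the first state cannot jump directly to the third.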

Voice Building – Automatic Annotation and Training

Automatic prosody annotation:

• Prosodizer creates ToBI markup for each sentence

• Rule based

• Depends on quality of phone alignments

• Depends on quality of text analysis module, i.e. uses PoS, etc.

Automatic prosody training:

• Depends on phone alignments, ToBI markup, and text analysis

• Creates prediction models for:

    • Phone duration

    • Prosodic chunk boundaries

    • Presence or absence of pauses

    • The length of previously predicted pauses

    • The accent property of each word: de-accented, accented, high

    • The F0 contour of each word

Quality of predicted prosody is an important factor for overall voice quality

Objective Evaluation – how good are the phone alignments?

Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus

Phone alignment of Random corpus is slightly better than that of Phonbal

Metric                    Phonbal   Random
Overlap rate              95.26     96.35
RMSE of boundaries        6.3 ms    3.3 ms
Boundaries within 5 ms    86.6 %    91.8 %
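The slides do not spell out how the three metrics were computed. The sketch below shows one common way to compute them, assuming paired boundary times in milliseconds and an overlap rate defined as shared duration divided by the combined duration of the two segments. Function names and toy values are hypothetical.

```python
import math


def boundary_rmse(ref, hyp):
    """Root-mean-square error between paired boundary times (same unit)."""
    if len(ref) != len(hyp) or not ref:
        raise ValueError("need two non-empty boundary lists of equal length")
    return math.sqrt(sum((r - h) ** 2 for r, h in zip(ref, hyp)) / len(ref))


def pct_within(ref, hyp, tol):
    """Percentage of boundaries whose absolute error is at most tol."""
    hits = sum(1 for r, h in zip(ref, hyp) if abs(r - h) <= tol)
    return 100.0 * hits / len(ref)


def overlap_rate(ref_seg, hyp_seg):
    """Shared duration of two (start, end) segments over their combined duration."""
    (rs, re), (hs, he) = ref_seg, hyp_seg
    common = max(0.0, min(re, he) - max(rs, hs))
    combined = (re - rs) + (he - hs) - common
    return common / combined if combined > 0 else 0.0


# Toy boundaries in ms (hypothetical): reference vs. sub-corpus alignment.
ref = [100.0, 200.0, 300.0]
hyp = [102.0, 197.0, 300.0]
rmse = boundary_rmse(ref, hyp)
within_2ms = pct_within(ref, hyp, tol=2.0)
ovr = overlap_rate((0.0, 100.0), (10.0, 110.0))
```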



Objective Evaluation – Accuracy of Prosody Prediction

Comparison of the accuracy of pause prediction, prosodic chunk prediction, and word accent prediction by the modules trained on the Phonbal or on the Random sub-corpus, against the automatic markup of 1000 sentences not in either sub-corpus

Some prosody modules trained on Random corpus are better

                      Phonbal   Random
Chunks   Precision    58.9      56.3
         Recall       34.2      38.7
Pauses   Precision    63.1      63.4
         Recall       34.1      38.0
acc      Precision    69.7      69.5
         Recall       78.4      78.9
high     Precision    54.7      57.1
         Recall       38.6      41.1
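For reference, precision and recall of a binary prediction (for example, pause or no pause after each word) can be computed as in this sketch. It assumes one predicted and one reference label per word position; the function name and the toy labels are hypothetical.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) in percent for paired binary labels.

    precision = correct positive predictions / all positive predictions
    recall    = correct positive predictions / all positive reference labels
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): 1 = pause predicted / marked after the word.
predicted = [1, 0, 1, 0, 0, 1]
reference = [1, 1, 0, 0, 1, 1]
precision, recall = precision_recall(predicted, reference)
```

A module with high precision but low recall, as in the chunk and pause rows of the table, predicts few boundaries but is usually right when it does.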

Subjective Evaluation – Preference Listening Test

Subject   Phonbal   Random

Non-American listeners
1         20        33
2         21        32
3         24        29
4         25        28
All       90        122

American English listeners
1         21        32
2         21        32
3         16        37
4         23        30
5         25        28
All       106       159

Result of a preference test comparing 53 test sentences synthesized with voice Phonbal or voice Random

Two groups of listeners: non-American listeners and American English listeners

Columns 2 and 3 show the number of times each subject preferred each voice

Each of the 9 subjects preferred the Random voice
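The group totals in the table can be re-derived from the per-subject counts. The sketch below also adds an exact two-sided sign test over subjects; the slides report no significance test, so that part is only an illustration of how the "9 out of 9" observation could be quantified.

```python
from math import comb

# Per-subject (Phonbal, Random) preference counts copied from the table.
prefs = {
    "non_american": [(20, 33), (21, 32), (24, 29), (25, 28)],
    "american": [(21, 32), (21, 32), (16, 37), (23, 30), (25, 28)],
}


def totals(rows):
    """Sum the Phonbal and Random preference counts over subjects."""
    return sum(p for p, _ in rows), sum(r for _, r in rows)


def sign_test_two_sided(k, n):
    """Exact two-sided sign test p-value for k of n subjects favouring one voice."""
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


all_rows = prefs["non_american"] + prefs["american"]
favour_random = sum(1 for p, r in all_rows if r > p)
p_value = sign_test_two_sided(favour_random, len(all_rows))
```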

Conclusions

Two synthesis voices were compared in this study:

• The two voices are based on two separate selections of sentences from the same source corpus

• The Random corpus was created by a random selection of sentences from the source corpus

• The Phonbal corpus was created by selecting sentences which optimise its phonetic and phonological coverage

Listeners consistently preferred the TTS voice built with our system from the Random corpus

Investigation of the differences of the two sub-corpora revealed:

• Phonbal has better diphone and lexical diphone coverage

• Random has better phone alignments

• Random has slightly better prosody prediction performance

Future

Is the prosody prediction performance only due to better automatic prosody annotation which is due to better phone alignment?

Is the random selection inherently better suited to train prosody models on, e.g. because its distribution of sentence lengths is not as skewed as the Phonbal one?

What exactly is the relation between phone frequency and alignment accuracy?

Why does the Random corpus have so much better pause alignment when it contains fewer pauses?

Is it worth trying to construct some kind of prosodically balanced corpus to boost the performance of the trained modules, or would that result in a similar detrimental effect on alignment accuracy?
