
Olov Engwall, Speech synthesis, 2008

Speech synthesis


Presentations

Work in pairs in 6-minute mini-interviews (3 minutes each).

Ask questions around the topics:
• What is your previous experience of speech synthesis?
• Why did you decide to take this course?
• What do you expect to learn?

Write down the answers of your partner.
Present during the presentation round.
Submit the answers to me.

Why?
• To let me know more about your background and expectations, to be able to adapt the course content.
• To get to know each other.
• To ”start you up”…


The course

This is what the course book will look like…
Until then, refer to http://svr-www.eng.cam.ac.uk/~pat40/book.html

Course pages: www.speech.kth.se/courses/GSLT_SS

Lecture content (impossible to cover the entire book):
1) History, Concatenative synthesis, Unit selection, HMM synthesis, Text issues, Prosody
2) Vocal tract models, Formant synthesis, Evaluation
3) Term paper presentations, assignment correction

To Do until next time:
1) Assignment 1: Unit selection calculations
2) Term paper topic selection


Definition & Main scope

The automatic generation of synthesized sound or visual output from any phonetic string.


Synthesis approaches

By Concatenation
Elementary speech units are stored in a database and then concatenated and processed to produce the speech signal.

By Rule
Speech is produced by mathematical rules that describe the influence of phonemes on one another.


History


von Kempelen

Wolfgang von Kempelen’s book Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791).

The essential parts:
• a pressure chamber = lungs,
• a vibrating reed = vocal cords,
• a leather tube = vocal tract.

The machine
• was hand operated,
• could produce whole words and short phrases.


Wheatstone’s version

Why is it of interest to us?

Charles Wheatstone’s version of von Kempelen's speaking machine

Parametric features!


First electronic synthesis

• Homer Dudley presented VODER (Voice Operating Demonstrator) at the World Fair in New York in 1939

• The device was played like a musical instrument, with voicing/noise source on a foot pedal and signal routed through ten bandpass filters.


First formant synthesizers

1950’s: PAT (Parametric Artificial Talker), Walter Lawrence
• 3 electronic formant resonators
• input signal (noise)
• 6 functions to control the 3 formant frequencies, voicing amplitude, fundamental frequency, and noise amplitude.

1950’s: OVE (Orator Verbis Electris) by Gunnar Fant

From 1950’s: other synthesizers, including the first articulatory synthesizer, DAVO (Dynamic Analog of the Vocal tract)

An excellent historical trip through speech synthesis: Dennis Klatt's History of Speech Synthesis at http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html


Let us take a look at OVE

• OVE I (1953)
• On your computer today, and the original next time + OVE II (1962)


OVE Instructions
http://www.speech.kth.se/courses/GSLT_SS/ove.html

1. Test how the five different source models change the output. What is the difference in the formant pattern between different sources? Look at the number of formants, the peak amplitude, the bandwidth.

2. Alter a) the frequency and b) the shape of the source signal. What happens with the formant frequencies in the two cases? Relate these changes to human speech production.

3. Change the Frequency values F1-F4. Start with a neutral vowel (F1=500 Hz, F2=1500 Hz, F3=2500 Hz, F4=3500 Hz). Explain the attenuation in formant peak amplitude for higher frequencies (hint: try a rectangle source and change the shape to 99).

Now move one of the formant peaks so that it is about 200 Hz from the closest peak. What happens with the neighbour peak? Change the bandwidth of the formants. What is the relation between the bandwidth and the formant peak amplitude?

4. Move around the cursor in the vowel space and see how the shape of the output waveform (green curve in the bottom panel) changes.

If you have time, try to generate the sentences "How are you?" and "I love you!".


Formant amplitudes


Speech analysis & manipulation


Why signal processing?

• Need to separate the source from the filter for modelling (linear predictive analysis)
• Need to model the sound source (prosody, speaker characteristics)
• Need to alter speech units in concatenative synthesis (amplitude, cepstrums)
• Need to make concatenations smooth in concatenative synthesis (PSOLA)


The source-filter theory

The signal (c-d) is the result of a linear filter (b) excited by one or several sources (a).


The source-filter theory

[Figure: source (glottis) → filter (vocal tract) → radiation (lips), shown in both the time and the frequency domain]

More to come on the vocal tract filter in lecture 7

Source functions

• The voiced quasi-periodic source (glottis pulses) – vowels
  Parameters:
  – on/off
  – fundamental frequency F0
  – intensity
  – shape
• Frication source – fricatives
• Transient noise – plosives
• No source – voiceless occlusions


Voice source types

Modal
  Articulatory: Normal, efficient. Complete glottis closures.
  Acoustic: Standard source. Steep spectral slope.

Breathy
  Articulatory: Lack of tension. Never close completely.
  Acoustic: Audible aspiration. Slow “glottal return”. Glottal pulse symmetry. Higher F0 intensity.

Creaky
  Articulatory: High adductive tension and medial compression. Little longitudinal tension.
  Acoustic: Very low F0. Irregular F0 & amplitude.

Whispery
  Articulatory: Low glottal tension. Triangular glottal opening. High medial compression. Medium longitudinal tension.
  Acoustic: High aspiration levels. Greater pulse asymmetry. Less time in open state.


The quasi-periodic source

[Figure: the source in the time domain (pulses with period T0) and in the frequency domain (harmonics at f = nF0 = n/T0)]

Why is there a damping slope in the transfer function?


Simple vowel synthesis

[Figure: source waveform followed by the filter, with formant resonators F1, F2, F3 and F4]

Triangle source and formant filters in cascade: bandpass filters with frequency, bandwidth and level.
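A minimal Python sketch of this kind of vowel synthesis may help fix the idea. It is not the OVE implementation: the resonator is a standard two-pole digital resonator, and the bandwidths, the pulse shape parameters and all function names are assumptions chosen for illustration. The formant frequencies are the neutral-vowel values from the OVE instructions.

```python
import numpy as np

def triangle_source(f0, fs, n_samples, rise=0.4, fall=0.16):
    """Train of triangular glottal pulses; rise/fall are fractions of one period (assumed values)."""
    period = int(round(fs / f0))
    n_rise, n_fall = int(rise * period), int(fall * period)
    pulse = np.zeros(period)
    pulse[:n_rise] = np.linspace(0.0, 1.0, n_rise, endpoint=False)
    pulse[n_rise:n_rise + n_fall] = np.linspace(1.0, 0.0, n_fall, endpoint=False)
    n_periods = int(np.ceil(n_samples / period))
    return np.tile(pulse, n_periods)[:n_samples]

def resonator(x, freq, bw, fs):
    """Two-pole resonator (one formant), scaled to unity gain at 0 Hz."""
    c = -np.exp(-2.0 * np.pi * bw / fs)
    b = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * freq / fs)
    a = 1.0 - b - c
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        y[n] = a * xn + b * y1 + c * y2
        y2, y1 = y1, y[n]
    return y

def formant_cascade(x, formants, bandwidths, fs):
    """Pass the signal through one resonator per formant, in series."""
    for f, bw in zip(formants, bandwidths):
        x = resonator(x, f, bw, fs)
    return x

FS = 16000
NEUTRAL_FORMANTS = [500, 1500, 2500, 3500]   # Hz, neutral vowel
BANDWIDTHS = [60, 90, 120, 150]              # Hz, assumed for illustration

source = triangle_source(f0=100, fs=FS, n_samples=8000)
vowel = formant_cascade(source, NEUTRAL_FORMANTS, BANDWIDTHS, FS)
```

Changing `NEUTRAL_FORMANTS` moves the spectral peaks, while changing `f0` only moves the harmonics under the same envelope.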

So, how do we find the source from a speech signal?


Linear Prediction (LP)

A method to separate the source from the filter.

Predicts the next sample as a linear combination of the past p samples:

$$\tilde{x}[n] = \sum_{k=1}^{p} a_k \, x[n-k]$$

The coefficients a1 … ap describe a filter that is the inverse of the transfer function.

• Minimization of the prediction error results in an all-pole filter which matches the signal spectrum.
• This inverse filter removes the formants and can hence be used to find the source.
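As an illustration of linear prediction, the sketch below estimates the coefficients a_1 … a_p with the autocorrelation method and then applies the inverse filter to recover the excitation. The toy signal (an impulse sent through a known two-pole filter) and the function names are assumptions made for the example; real speech would be analysed frame by frame with a window.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Estimate a_1..a_p by solving the normal equations R a = r."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

def inverse_filter(x, a):
    """Prediction error e[n] = x[n] - sum_k a_k x[n-k]."""
    inv = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, inv)[: len(x)]

# Toy signal: a single impulse (the "source") through a known all-pole filter.
true_a = np.array([1.3, -0.6])
x = np.zeros(512)
x[0] = 1.0
for n in range(1, 512):
    x[n] = true_a[0] * x[n - 1] + (true_a[1] * x[n - 2] if n >= 2 else 0.0)

a = lpc_autocorrelation(x, order=2)   # close to true_a
residual = inverse_filter(x, a)       # close to the original impulse
```

In this toy case the residual is the impulse that excited the filter, which is the sense in which the inverse filter "finds the source".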


Spectral Fourier analysis

• A Fourier transform of the filter coefficients a1 … ap gives the frequency response of the inverse filter.
• A periodic waveform can be described as a sum of harmonics.
• The harmonics are sine waves with different phases, amplitudes and frequencies.
• The frequencies are multiples of the fundamental frequency.
• A periodic signal has a discrete spectrum.


Fourier Transforms

• Fourier transform (FT): A non-periodic signal has a continuous frequency spectrum.
• Discrete FT (DFT): Fourier transform of a sampled signal.
  • N samples in both the time and frequency domain.
  • The spectrum is mirrored around the sampling frequency.
• Fast FT (FFT): Clever algorithm to calculate the DFT.
  • Reduces the number of multiplications:
    DFT: ~N²   FFT: ~(N/2) · log₂(N)


Windowing

• The analysis of a long speech signal is made on short frames.
• The truncation of the signal results in artefacts (sidelobes).
• The artefacts are reduced if the signal is multiplied with a window that gives less weight to the sides.

Analysis window 10 – 50 ms (20 ms in example)


Power spectrum

• The FFT analysis gives complex values: amplitude and phase for each frequency component.
• The phase is often not interesting, only the signal’s energy at different frequencies.
• The power spectrum shows the power at each frequency for a short section of the signal.

Windowing → FFT → Square → Logarithm
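The four-step chain (windowing, FFT, square, logarithm) can be sketched in a few lines of Python. The Hamming window, the 16 kHz sampling rate and the test tone are assumptions for the example; the 20 ms frame length follows the windowing slide.

```python
import numpy as np

def log_power_spectrum(frame):
    """Short-time log power spectrum of one frame, in dB."""
    windowed = frame * np.hamming(len(frame))   # windowing: less weight at the sides
    spectrum = np.fft.rfft(windowed)            # FFT: complex amplitude and phase
    power = np.abs(spectrum) ** 2               # square: the phase is discarded
    return 10.0 * np.log10(power + 1e-12)       # logarithm (small offset avoids log(0))

FS = 16000
FRAME_LEN = 320                                  # 20 ms at 16 kHz
t = np.arange(FRAME_LEN) / FS
frame = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone

spec_db = log_power_spectrum(frame)
freqs = np.fft.rfftfreq(FRAME_LEN, d=1 / FS)
peak_hz = freqs[np.argmax(spec_db)]              # frequency of the strongest component
```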


Cepstrum Analysis

• The dominating method for ASR, used in HMM synthesis
• Inverse Fourier transform of the logarithmic frequency spectrum: “Spectral analysis of spectrum”
• The coarse structure of the spectrum is described with a small number of parameters
• Orthogonal coefficients (uncorrelated)
• Anagram: spectrum-cepstrum, filtering-liftering, frequency-quefrency, phase-saphe


Cepstrum from filterbank

[Figure: the filterbank spectra of /a:/ and /s/ are multiplied with the weight functions W1 to W4, giving the cepstral coefficients C1 to C4 of /a:/ and of /s/]

$$C_j = \sqrt{\frac{2}{N}} \sum_{i=1}^{N} A_i \cos\!\left(\frac{\pi j}{N}\,(i - 0.5)\right)$$
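The cosine transform from filterbank amplitudes to cepstral coefficients can be written out directly. In this sketch A_i is taken to be the (log) amplitude of filterbank channel i and N the number of channels, with the sqrt(2/N) scaling of the common DCT-based definition; the function name and the test inputs are assumptions for illustration.

```python
import numpy as np

def filterbank_cepstrum(amplitudes, n_coeffs):
    """C_j = sqrt(2/N) * sum_i A_i * cos(pi * j * (i - 0.5) / N), for j = 1..n_coeffs."""
    A = np.asarray(amplitudes, dtype=float)
    N = len(A)
    i = np.arange(1, N + 1)
    return np.array([
        np.sqrt(2.0 / N) * np.sum(A * np.cos(np.pi * j * (i - 0.5) / N))
        for j in range(1, n_coeffs + 1)
    ])

# A flat spectrum has no shape to describe: every coefficient is (numerically) zero.
flat = filterbank_cepstrum(np.full(8, 50.0), 4)

# A spectrum shaped like the first weight function loads only on C1.
w1 = np.cos(np.pi * (np.arange(1, 9) - 0.5) / 8)
tilt = filterbank_cepstrum(w1, 2)
```

Each weight function picks out one kind of spectral shape (overall tilt, one bump, two bumps, and so on), which is why a few coefficients describe the coarse structure of the spectrum.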


Mel Frequency Cepstral Coefficients

FFT → Mel filter bank → Cepstrum transform → MFCC

[Figure: Mel-spectrum of /a:/ (dB against Mel, up to ~6000 Hz) and Mel-cepstrum of /a:/ (C1 to C4)]

The Mel scale is perceptually motivated:
• Linear < 1000 Hz
• Log > 1000 Hz
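To make the Mel warping concrete, here is a small sketch using one widely used analytic approximation of the Mel scale, mel = 2595 · log10(1 + f/700). The slides do not say which variant was used, so the formula, the number of filters and the function names are assumptions.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One common Mel approximation: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_max_hz):
    """Filter centres equally spaced on the Mel axis, returned in Hz."""
    mels = np.linspace(0.0, hz_to_mel(f_max_hz), n_filters + 2)
    return mel_to_hz(mels)[1:-1]

centers = mel_center_frequencies(n_filters=20, f_max_hz=6000.0)
```

The centres come out densely spaced at low frequencies and increasingly sparse at high frequencies, mirroring the frequency resolution of hearing.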


Concatenative synthesis


Nothing new under the sun…

• Peterson et al. (1958)

• Dixon and Maxey (1968)

• “Diadic Units”, (Olive, 1977)


Let’s get the terms straight

Concatenative synthesis
Definition: All kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units).
(Everyday use: Concatenation of same-size sound units.)

Unit selection
Definition: All kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable.
(Everyday use: Concatenation of variable-sized sound units.)


Why has concatenation conquered?

• Storing the segment database is no longer an issue
• Advances in ensuring smoothness in concatenations
• Rule-based synthesis output used to be smoother
• Unit selection provides (piece-wise) high quality speech.
• Change of applications.
• Certain sounds are too hard to be produced by rule
  • Vowels are easy to create by rule
  • Bursts, voiceless stops are too difficult, we do not fully understand their production mechanisms

Concatenative Synthesis is the state-of-the-art


Database preparation

• Choose the speech units (Phone, Diphone, Sub-word unit, Cluster based unit selection)
• Compile and record utterances
• Segment signal and extract speech units
• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch mark
  e.g. “ch-l r021 412.035 463.009 518.23”
  (diphone, file, start time, middle time, end)
• Pitch mark file: a list of each pitch mark position in the file
• Extract parameters; create parametric segment database (for data compaction and prosody matching)
• Perform amplitude equalization (prevents mismatches)
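A database entry such as the one quoted above could be read into a small record like this. The field names and the whitespace-separated layout are assumptions based on the single example line; the time unit is not stated in the slides and is left uninterpreted.

```python
from dataclasses import dataclass

@dataclass
class DiphoneEntry:
    diphone: str    # unit label, e.g. "ch-l"
    file: str       # recording the unit was cut from
    start: float    # start time of the unit
    middle: float   # boundary between the two phone halves
    end: float      # end time of the unit

def parse_entry(line):
    """Parse one index line of the form 'diphone file start middle end'."""
    diphone, file, start, middle, end = line.split()
    return DiphoneEntry(diphone, file, float(start), float(middle), float(end))

entry = parse_entry("ch-l r021 412.035 463.009 518.23")
```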


Diphone & Triphone synthesis

[Figure: diphone and triphone segmentation of the words /s ɑː k/, /s ɑː l/ and /r ɑː k/]

Diphone: *s1 | s2ɑ1 | ɑ2k1 | k2*
Triphone: *sɑ1 | ɑ2k*


Diphone synthesis

• Sequences of a particular sound/phone in all its environments of occurrence, or all/most two-phone sequences occurring in a language: auto ’car’ -> _a, au, ut, to, o_

• Rationale: the ’center’ of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.
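The mapping from a phone string to the diphones needed to synthesize it is mechanical, as this small sketch shows. It follows the "auto" example above, treats each letter as one phone, and uses "_" for silence; the function name is an assumption.

```python
def to_diphones(phones, silence="_"):
    """List the diphones needed for a phone sequence, with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [left + right for left, right in zip(padded, padded[1:])]

units = to_diphones("auto")   # ['_a', 'au', 'ut', 'to', 'o_']
```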


Diphone synthesis

• 1200 diphones can already create a quite good sounding synthesis

- Speaker dependence (one set from one speaker)

- Various digital signal processing techniques -> ’robotic’ sound

- Segmental quality, transition between diphones

- Only partial coverage of co-articulation

MBROLA, BT Laureate, Festival


Diphone ”synthesis” lab
http://www.speech.kth.se/courses/GSLT_SS/lab1.html

1. Record the "database", the word list "Dockad, yttern, töm, flöde, möta, lätt, blomster, lyssnarna." in one go, in that order and without pausing.

2. Segment the word list into diphones: cut out each diphone and put them in a new Wavesurfer window, but with pauses separating each diphone.

3. Identify the diphones that you need to create the sentence "Dom flyttade möblerna."

4. Copy and paste diphones from the database window into a new synthesis window.

5. Play the sentence, fine tune durations and concatenations.


Equalization

• Segments extracted from different words, with different phonetic contexts, have amplitude and timbre mismatches.

• Equalization: related endings of segments are given similar amplitude spectra.

• Amplitude equalization: smooth modification of the energy levels at the beginning and at the end of segments. The energy of all the phones of a given phoneme is set to their average value. The difference is distributed over the neighbourhood.

• Timbre conflicts are tackled at run-time, by smoothing individual pairs of segments when necessary, so that some of the phonetic variability is still maintained.


Concatenation with PSOLA

• Time-Domain Pitch-Synchronous-OverLap-Add (TD-PSOLA)
• High speech quality
• Very low computational cost (7 operations/sample).
• A window (2 pitch periods long) is multiplied with the signal
• The signal is broken into a set of localized signals (non-zero only at the window intervals)


Altering pitch with PSOLA

• Relative shifting of localized signals
• Spacing reflects pitch duration
• Good result for modification factor [0.6 – 1.5]

[Figure: localized signals spaced further apart]
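The pitch modification can be sketched as follows. This is a deliberately simplified TD-PSOLA, not a production implementation: it assumes a constant pitch period that is already known in samples, places the pitch marks on a regular grid, and does not renormalize the amplitude when the windows overlap more densely. The function name and parameters are hypothetical.

```python
import numpy as np

def psola_pitch_shift(x, period, factor):
    """Simplified TD-PSOLA: re-space windowed two-period segments by period / factor."""
    x = np.asarray(x, dtype=float)
    n = np.arange(2 * period)
    # Periodic Hann window, two pitch periods long; copies at hop = period sum to 1.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / (2 * period))
    marks = np.arange(period, len(x) - period + 1, period)   # analysis pitch marks
    new_period = int(round(period / factor))                 # synthesis spacing
    out = np.zeros(len(x))
    t = period
    while t + period <= len(x):
        m = marks[np.argmin(np.abs(marks - t))]              # nearest analysis mark
        out[t - period : t + period] += x[m - period : m + period] * window
        t += new_period
    return out

# A signal with a 100-sample pitch period, raised in pitch by a factor 1.25.
voiced = np.tile(np.hanning(100), 20)
higher = psola_pitch_shift(voiced, period=100, factor=1.25)
```

With `factor=1.0` the localized signals are put back where they came from and the interior of the signal is reproduced; with `factor=1.25` the same segments repeat every 80 samples instead of every 100.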

Altering duration & amplitude

• Increase number of PSOLA iterations (overlaps) to increase duration
• Decrease number of PSOLA iterations (overlaps) to decrease duration
• Multiplying the signal by a constant
  • If constant > 1, amplitude increase
  • If constant < 1, amplitude decrease

[Figure: frame duplication]


MBROLA

• Algorithm: Multi-Band Resynthesis OverLap and Add
• A time-domain PSOLA-like algorithm with efficient smoothing of the spectral envelope
• Very high data compression ratios (up to 10)
• Synthesizer: Concatenation of diphones.
  In: List of phonemes and prosodic info (duration of phonemes and a piecewise linear description of pitch)
  Out: Speech samples on 16 bits (linear), at the sampling frequency of the diphone database.
• Project goal: generate a set of speech synthesizers for as many languages as possible, free for non-commercial applications.


Unit Selection

• Larger database of recorded units: e.g. diphones, phones, syllables, words, etc.
• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters
• Units nearest in this space to the targets will be chosen and will require only minor modification
• The corpus is segmented into phonetic units, indexed, and used as-is
• Selection is made on-line
• The trend is towards longer and longer units

[Figure: timeline 1999 to 2005]


Best Unit Selection

Target cost
– Prosodic and spectral closeness to target

Concatenation cost
– Units occurring beside each other in the recorded database are given zero cost

Cost function:
– Target + Concatenation cost (weighted sum)

The Viterbi algorithm is used to find the overall minimum cost path.

Assignment 1: Practical exercises with the calculation of target and concatenation cost.


Target & Concatenation cost

• Target cost = the difference in each frame between the target and the candidates for
  – target pitch
  – power
  – duration

• Manhattan (city block) distance:
$$D = \sum_i |x_i - y_i|$$

• Euclidean distance:
$$D = \sqrt{\sum_i (x_i - y_i)^2}$$

• Concatenation cost = the difference between the end of diphone 1 and the start of diphone 2

• Mahalanobis distance:
$$D = \sqrt{\sum_i \frac{(x_i - y_i)^2}{\sigma_i^2}}$$

• Kullback-Leibler distance:
$$D = \sum_{i=1}^{N} (x_i - y_i) \log \frac{x_i}{y_i}$$

BEWARE OF PITFALL: Oh no! A different number of frames!
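The four distance measures are short enough to write out, which may help with the calculations in Assignment 1. This is a sketch under stated assumptions: the Euclidean and Mahalanobis distances are taken with a square root, the Mahalanobis distance uses one standard deviation per feature (a diagonal covariance), and the Kullback-Leibler distance is the symmetric form, which needs strictly positive values. The length check makes the "different number of frames" pitfall fail loudly instead of silently truncating.

```python
import math

def _check_lengths(x, y):
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of frames/features")

def manhattan(x, y):
    _check_lengths(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    _check_lengths(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mahalanobis_diag(x, y, sigma):
    """Mahalanobis distance with a diagonal covariance (one sigma per feature)."""
    _check_lengths(x, y)
    _check_lengths(x, sigma)
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, sigma)))

def symmetric_kl(x, y):
    """Symmetric Kullback-Leibler distance; all values must be > 0."""
    _check_lengths(x, y)
    return sum((a - b) * math.log(a / b) for a, b in zip(x, y))

target = [1.0, 2.0, 3.0]       # hypothetical feature values
candidate = [2.0, 4.0, 3.0]
d_city = manhattan(target, candidate)   # |1-2| + |2-4| + |3-3| = 3.0
```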


Viterbi – best path search

[Figure: trellis over time with the candidate states for Phone 1 and Phone 2 of the utterance, and a transition from state i to state j]

• All possible sequences are hypothesized in parallel
• Threshold excludes improbable hypotheses

Based on
• previous path probability (getting to state i)
• transition probability (getting from i to j)
• observation likelihood (state j matches input)
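Applied to unit selection, the same search runs over costs instead of probabilities: the accumulated cost of getting to candidate i, the concatenation cost of going from i to j, and the target cost of j. The sketch below is a minimal version without pruning threshold; the toy costs, the database positions and all names are hypothetical.

```python
def select_units(target_costs, join_cost, w_target=1.0, w_join=1.0):
    """Viterbi search for the cheapest candidate sequence.

    target_costs[t][i]: target cost of candidate i at position t.
    join_cost(t, i, j): cost of joining candidate i at t-1 to candidate j at t.
    Returns (list of chosen candidate indices, total cost)."""
    best = [w_target * c for c in target_costs[0]]
    back = []
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            costs = [best[i] + w_join * join_cost(t, i, j) for i in range(len(best))]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + w_target * tc)
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):      # trace the best path backwards
        j = pointers[j]
        path.append(j)
    return path[::-1], total

# Toy example: two candidates per target, identified by their position in the database.
db_pos = [[0, 5], [1, 9], [2, 10]]
target_costs = [[0.5, 0.0], [0.4, 0.1], [0.3, 0.2]]

def join(t, i, j):
    # Zero cost if the two units were adjacent in the recorded database.
    return 0.0 if db_pos[t][j] == db_pos[t - 1][i] + 1 else 1.0

path, total = select_units(target_costs, join)
```

In this toy case the search prefers the three units that were recorded next to each other, although each of them has a worse target cost than its alternative.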


Pros & cons of Unit selection

Advantages:
• Piece-wise very high waveform quality, thanks to minimal signal manipulation
• Non-linguistic features of the speaker's voice built in

Disadvantages:
• Discontinuities between units
• Hit or miss for target selection
• Quality differences between different sized units
• Fixed voice
• Fixed non-linguistic features

Are there any valid alternatives?


HMM synthesis


An example of voice conversion

[Figure: block diagram. Source speech and target speech each pass through an LP model and formant tracking to give formant trajectories; model estimation gives a source speaker HMM model and a target speaker HMM model; formant mapping gives warping factors; LPC spectrum warping / pole rotation and speech reconstruction give the mapped speech]

Sound examples: American male, American female, Transformed (AM M to F)


HMM synthesis

A speech synthesis technique based on HTK (Hidden Markov Model Toolkit)

Developed by the HTS working group at
• the Department of Computer Science, Nagoya Institute of Technology
• the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
http://hts.sp.nitech.ac.jp


Hidden Markov Models

• An HMM is a machine with a limited number of possible states.

• The transition between two states is regulated by probabilities.

• Every transition results in an observation with a certain probability.

• The states are hidden, only the observations are visible.

[Figure: HMM with states i, j, k, l, transition probabilities Pii, Pij, Pjj, Pjk, Pkl, Pll and observations Oi, Oj, Ok, Ol]


HMM in speech synthesis

1. Transcription & segmentation of speech databases
2. Construction of inventory of speech segments
3. Run-time selection of speech segments

High quality speech can be synthesized using waveform concatenation algorithms (e.g., PSOLA).

However, to obtain various voice qualities, a large amount of speech data is necessary.

→ Speech synthesis from HMMs themselves.
Voice quality can be changed by transforming HMM parameters appropriately.
The output is vocoded, but it is always smooth and stable.


Basic idea

Mel-Log-Spectrum-Approximation

Start the training of the HMMs with a good guess on the parameters.

The guess is improved through comparison with training observations.

In the synthesis we should find the optimal sequence of states, through concatenation of HMMs.


The training part

• The training is automatic. You need:
  – The text + recordings of about 1000 sentences
• The training of 1000 sentences
  – takes 24 hours and generates a voice of less than 1 MB
• Separate HMMs for: Spectrum, F0, Duration
• Training in two steps:
  1. Context independent models
  2. Use these models to create context dependent models.
• Clustering:
  – Storing all contexts requires much space
  – It may be difficult to find alternatives for missing models
  – Many models are very similar = redundancy


Clustering

• Groups a large database into clusters
• Three trees: Duration, F0 and Spectrum
• Division based on yes/no questions
  – Grouping acoustically similar phonemes
  – Features
  – Context


Synthesis

For each phoneme we need:
• Mel-cepstrum, with first and second derivative (mcep, Δ, Δ²)
• (F0, Δ, Δ²) + information about voicing
• Duration. Can be generated implicitly by F0 and spectrum HMMs, but the result is more natural with explicit modeling.
• Δ and Δ² are used to smooth the parameter sequences.


Delta, delta-delta...
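A minimal way to compute Δ and Δ² from a parameter track is a central difference over neighbouring frames, as sketched below. This is the simplest variant, chosen for illustration; toolkits such as HTK/HTS typically use regression windows over several frames, and the function name and the toy track are assumptions.

```python
import numpy as np

def delta(features):
    """Central difference over time: d[t] = (c[t+1] - c[t-1]) / 2, with the edge frames repeated."""
    c = np.asarray(features, dtype=float)
    padded = np.concatenate((c[:1], c, c[-1:]))   # repeat first and last frame
    return (padded[2:] - padded[:-2]) / 2.0

mcep_track = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # one coefficient over five frames
d1 = delta(mcep_track)   # first derivative (delta)
d2 = delta(d1)           # second derivative (delta-delta)
```

The same function works on a (frames × coefficients) array, since the difference is taken along the first axis.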


Use of HMM synthesis

• Various voices:
  – Speaker adaptation
  – Speaker interpolation
  – Eigenvoices
• Very low bit rate speech coder
• Security of speaker identification systems


Speaker adaptation


Speaker interpolation

www.sp.nitech.ac.jp/~tokuda/HTS_demo/speaker_inter/index.html


Test of speaker verification


Very low bit-rate speech coding


Swedish HMM synthesis
Master thesis by Anders Lundgren

Language specific parts:
• Text to phoneme transcription (RulSys or Festival)
• Translation of the phonemic transcription to HTK SAMPA-Festival
• Module to generate contextual information (syllable division, word accent placement)
• Decision tree paths for the clustering of HMMs
  – Features
  – Contextual information


Listening test

• Separate evaluation of prosody and spectrum
• Six voice variants:
  – HTS
  – Prosody from HTS, spectrum from MBROLA
  – Prosody from RULSYS, spectrum from HTS
• TMH’s synthesis reference system
  – Prosody from RULSYS, spectrum from MBROLA


Clarity

[Figure: stacked bar chart, 0% to 100%, for the voices M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum; rating categories: Much worse, Worse, No difference, Better, Much better]


Naturalness

[Figure: stacked bar chart, 0% to 100%, for the voices M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum; rating categories: Much worse, Worse, No Difference, Better, Much better]


Previous TTS experience

[Figure: two stacked bar charts, 0% to 100%, for listeners with (Yes) and without (No) previous TTS experience; bars labelled p and s; rating categories: Much worse, Worse, No Difference, Better, Much better]

More on how to evaluate in lecture 9!


The automatic generation of synthesized sound from any text string.

From text


Text-to-speech

”The automatic generation of synthesized sounds...”

[Figure: processing chain from text to sound, with the example “hello”]
text → Linguistic analysis → Prosodic analysis → Phonetic description → Sound generation

Knowledge sources shown beside the chain: morphologic analysis, lexicon and rules, syntax analysis; rules and lexicon; rules and choice of units; joining parts, rules.


Text Analysis Challenges

• Homographs
  – My latest project is to learn how to better project my voice.
  – The girl with the bow in her hair was told to bow deeply when greeting her superiors.
• Numbers (models, dates)
  – On May 5 2005, the university bought 2005 computers
  – A Boeing model 747 can contain 747 people
• Abbreviations
  – Yesterday it rained 3 in. Take 1 out, then put 3 in.
  – St. John St.

Let us try!

Preprocessor

• Sentence end detection (semicolon, period – ratio, time and decimal point, sentence ending respectively)
• Abbreviations (e.g. – for instance): changed to their full form with the help of lexicons
• Acronyms (I.B.M – these can be read as a sequence of characters, or NASA, which can be read following the default way)
• Numbers (once detected, first interpreted as rational, time of the day, dates and ordinal depending on their context)
• Idioms (e.g. “In spite of”, “as a matter of fact” – these are combined into single FSU using a special lexicon)
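Two of the preprocessing steps above, abbreviation expansion and acronym spelling, can be sketched with a lexicon lookup and a regular expression. The mini-lexicon, the regular expression and the function names are assumptions for illustration; a real preprocessor needs context to resolve cases such as "in." or "St.".

```python
import re

# Hypothetical mini-lexicon of abbreviations and their full forms.
ABBREVIATIONS = {"e.g.": "for instance", "etc.": "et cetera"}

def expand_abbreviations(text):
    """Replace each known abbreviation by its full form."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

def spell_out_acronyms(text):
    """Turn dotted acronyms such as I.B.M into separately read letters: 'I B M'."""
    pattern = r"\b(?:[A-Z]\.){2,}[A-Z]?\.?"
    return re.sub(pattern, lambda m: " ".join(re.findall(r"[A-Z]", m.group())), text)

print(expand_abbreviations("fruit, e.g. apples"))
print(spell_out_acronyms("I.B.M makes computers"))
```

An acronym without dots, such as NASA, is left untouched and is therefore read "the default way" by the later letter-to-sound stage.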


Morphological Analysis

The task is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling.

Function words (determiners, pronouns, prepositions, conjunctions...) – limited number.
• Can be stored in a lexicon
• Word he:
  <spel> = he
  <syn cat> = pronoun
  <syn num> =
  <syn gen> = masc
  <phon> = /hΙ/

Content words – infinite in number
• Needs morphology – describes words using a reduced set of abstract semantically bearing units called morphemes.
• Inflectional, derivational and compound words are decomposed into morphemes
• Uses regular grammars with lexicons of stems and affixes


Contextual Analysis

• Considers words in their context

• Reduces the list of their part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighboring words.

• Achieved by N-grams, multi-layer perceptrons (neural networks), local stochastic grammars (provided by expert linguists), etc.


Letter-to-phonemes

• Module responsible for the automatic determination of the phonetic transcription of the incoming text
• Cannot just look up in a pronunciation dictionary
  – Do not follow the rule “one character = one phoneme”
  – Single character corresponding to two phonemes: x as /ks/
  – Several characters producing one phoneme: th in thought
  – Single character pronounced in different ways: c in ancestor, ancient, epic
• Rule based – applied based on spelling, sentence analysis
• Dictionary based – a large dictionary of correct spellings
• Hybrid approach – combines the above, usually used


Dictionary or Rule Based

Dictionary:
• Store a maximum of phonological knowledge in a lexicon.
• Compounding rules describe how the morphemes of dictionary items are modified.
• Hand-corrected, expensive
• The lexicon is never complete: needs an out-of-vocabulary pronouncer, transcribed by rule.

Rules:
• A set of letter-to-sound (grapheme-to-phoneme) rules.
• Words pronounced in such a particular way that they have their own rule are stored in an exceptions dictionary.
• Fast & easy, but lower accuracy
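The hybrid approach (dictionary first, rules for everything else) can be sketched as a lookup with a fallback. The lexicon entry, the three rules and the phone symbols are hypothetical and far too small to be useful; they only show the control flow.

```python
# Hypothetical mini-lexicon and letter-to-sound rules (phone symbols are illustrative).
LEXICON = {"thought": ["T", "O:", "t"]}
RULES = [("th", ["T"]), ("x", ["k", "s"]), ("c", ["k"])]   # longer graphemes first

def letter_to_sound(word):
    """Out-of-vocabulary pronouncer: apply the first matching rule at each position."""
    phones, i = [], 0
    while i < len(word):
        for graph, out in RULES:
            if word.startswith(graph, i):
                phones.extend(out)
                i += len(graph)
                break
        else:
            phones.append(word[i])   # default: the letter stands for itself
            i += 1
    return phones

def transcribe(word):
    """Dictionary lookup first; fall back to the rules for unknown words."""
    word = word.lower()
    return LEXICON.get(word) or letter_to_sound(word)

print(transcribe("thought"))   # found in the lexicon
print(transcribe("tax"))       # by rule: x -> k s
```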


Letter-to-sound difficulties

• Consonants reduced or deleted in clusters (e.g. /t/ in softness)

• Assimilation leads to a change of some phonological features of a given phoneme (eg. obstacle)

• Homographs pronounced differently (eg. record, contrast)

• Phonetic liaisons (e.g. in French words immediately followed by a vocalic sound results in pronunciation of characters that otherwise disappear)

• Unstressed vowels transformed into schwas (short central phonetic elements) or deleted (e.g. interesting)

• New words, proper nouns dependent on the language of origin (e.g. in Swedish “jeans”, “comme il faut”)

Creating rules

• Writing rules by hand is difficult
• Automatic process built from lexicon
– Find alignments:

Letters: c  h  e  c  k  e  d
Phones:  ch -  eh -  k  -  t

– Provides phone string plus stress

Lexicon     Letters correct   Words correct
OALD        95.80%            74.56%
CMUDICT     91.99%            57.80%
BRULEX      99.00%            93.03%
DE-CELEX    98.79%            89.38%
Thai        95.60%            68.76%


Phrasing

Determines where phrase boundaries occur
– insert pauses at phrase boundaries
– determined by a CART (classification and regression tree) trained on a large data corpus

Intonation: Word accent

Word accent: decided depending on word class, position in the sentence and in the phrase, and the word classes of the preceding and following words.

For each syllable of each word: whether it is accented, and with which accent (e.g. Swedish ‘tomten’, ‘stegen’).

Intonation: F0 contour

[Figure: F0 contour of an example utterance, annotated as follows]
– Large pitch range (female)
– Authoritative (final fall)
– Emphasis for "Finance" (H*)
– Final has a rise: more information to come

• Word stress and sentence intonation
– each word has at least one syllable which is spoken with higher prominence
– in each phrase the stressed syllable can be accented depending on the semantics and syntax of the phrase

• Prosody relies on syntax, semantics, pragmatics: personal reflection of the reader.

Pitch contour modeling

• Tonetics (the British school)
– tone groups composed of syllables {unstressed, stressed, accented or nuclear}
– nuclear syllables have nuclear tones {fall, rise, fall-rise, rise-fall}

• ToBI (Tones and Break Indices)
– Phrases split into intermediate phrases composed of syllables
– Relative tone levels: high (H) or low (L) (plus diacritics) at every intonational or intermediate phrase boundary (%) and on every accented syllable

• Stylization method (prosodic pattern measured from natural speech)

Prosody modeling

• Prosody targets (to put emphasis, stress) typically include:
– Pitch
– Phone durations
– Energy

• Prosody parameters can be trained

• Fixed durations, flat F0
• Declining F0
• “Hat” accents on stressed syllables
• Accents and end tones
• Statistically trained

Prosody is critical for obtaining the right intonation (or else speech may sound unnatural or unintelligible)
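As a rough sketch of two of the simpler models listed above, the function below builds one F0 target per syllable from a linearly declining baseline and adds a fixed step on stressed syllables as a crude “hat” accent. The start, end and accent values in Hz are placeholder assumptions, not values from the lecture.

```python
def f0_contour(stressed, start_hz=120.0, end_hz=90.0, accent_hz=20.0):
    """Return one F0 target (Hz) per syllable.

    stressed: list of booleans, one per syllable.
    The baseline declines linearly from start_hz to end_hz;
    stressed syllables are raised by accent_hz.
    """
    n = len(stressed)
    contour = []
    for i, is_stressed in enumerate(stressed):
        frac = i / (n - 1) if n > 1 else 0.0
        baseline = start_hz + (end_hz - start_hz) * frac
        contour.append(baseline + (accent_hz if is_stressed else 0.0))
    return contour
```

Setting `accent_hz=0` gives plain declination, and setting `start_hz == end_hz` as well gives the flat F0 baseline.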


Synthesis markup

[<SABLE> <SPEAKER NAME="male1">

The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.

Some English first and then some Spanish.
<LANGUAGE ID="SPANISH">Hola amigos.</LANGUAGE>
<LANGUAGE ID="NEPALI">Namaste</LANGUAGE>

Good morning <BREAK /> My name is Stuart, which is spelled
<RATE SPEED="-40%"> <SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it <PRON SUB="stoo art">stuart</PRON>. My telephone number is <SAYAS MODE="literal">2787</SAYAS>.

I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place, but no one can pronounce that.

By the way, my telephone number is actually
<AUDIO SRC="http://att.com/sounds/touchtone.2.au"/> …

SABLE: marking emphasis

What will the weather be like today in Boston?
It will be <emph>rainy</emph> today in Boston.

When will it rain in Boston?
It will be rainy <emph>today</emph> in Boston.

Where will it rain today?
It will be rainy today in <emph>Boston</emph>
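To illustrate how a synthesizer front end might pick up such emphasis marks, here is a small sketch that extracts the emphasized words from one marked-up sentence using Python's standard XML parser. The wrapper element `<s>` and the function name are assumptions for the example; a real SABLE processor handles many more tags.

```python
import xml.etree.ElementTree as ET

def emphasized_words(marked_up_sentence):
    """Return the words found inside <emph> elements of one sentence."""
    # Wrap the fragment so the parser sees a single root element.
    root = ET.fromstring("<s>" + marked_up_sentence + "</s>")
    return [word
            for element in root.iter("emph")
            for word in (element.text or "").split()]
```

For the three answers above this returns `["rainy"]`, `["today"]` and `["Boston"]` respectively, which a prosody module could then map to a pitch accent.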

Vocal tract models

Articulatory synthesis

Benefits:
• Produces speech in the same way as humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)

Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad

Articulation as filter

Articulatory models

Functional: geometric parameters control the different parts of the tongue, jaw, lips etc.

Physiological: muscle model. Articulations are created through activation of different muscles.

Articulatory basis

Measurements (X-rays, MRI etc.) are used to model the dimensions of the tube in the midsagittal plane, and to get the relation between midsagittal distance and area in each plane.


3D articulatory synthesis

Why?
• Two-dimensional models simulate the third dimension as $\text{area} = a \cdot (\text{distance})^{d}$, where a and d are decided empirically and vary through the tube.

A three-dimensional model gives
• the cross-sectional area directly
• lateral modeling (/l/)
• visual synthesis (pronunciation training)
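The two-dimensional workaround can be written out in a few lines: each tube section converts its midsagittal distance to an area with its own empirical coefficients. The function name and all numeric values below are assumptions for illustration only.

```python
def area_function(distances_cm, a_values, d_values):
    """Convert midsagittal distances to cross-sectional areas, section by section.

    Each section uses its own empirical pair (a, d): area = a * distance**d.
    """
    return [a * dist ** d
            for dist, a, d in zip(distances_cm, a_values, d_values)]
```

A three-dimensional model makes this conversion unnecessary, because the cross-sectional area can be read off the geometry directly.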

3D MRI measurements

3 × 18 slices orthogonal to the midsagittal plane in 43 s.

Supine position

Corpus: one neutral reference and 43 Swedish articulations.
• 13 vowels: /ɑ:, e:, æ:, i:, y:, u:, ʉ:, o:, ø:, œ:, a, u, ɔ/
• 10 consonants: /p, t, k, l, r, s, f, ʂ, ɧ, ɕ/ in VCV contexts with V = /a, ɪ, ʊ/

3D Reconstruction

One contour per image.

Reconstruct a 3D shape for each articulation

[Figure: reconstructed 3D shapes for example articulations]

Articulatory model

Six articulatory parameters defined using a component analysis of the 3D tongue shapes: jaw height, tongue body, tongue dorsum, tongue tip, tongue advance and tongue width.

Articulatory model

Add vocal tract walls:
• Symmetric walls, extracted from the MR images.
• Collision handling for the tongue against walls, palate and jaw.

Multimodal articulatory synthesis

• Movetrack electromagnetic articulograph: 6 coils on the upper lip, the upper & lower incisors, and three on the tongue, 8, 20 and 52 mm from the tip.
• Qualisys optical motion tracking: 4 IR cameras, 28 reflectors, 3 reference reflectors on a head mount.
• Audio & video recorders.

[Figure: recording setup and coil placement, with labels C, R, Rf, V, UL, J, T1, T2, T3]


Data processing & training

[Diagram: for every item i in the corpus C except the held-out item k (∀i∈C, i≠k): speech signal S_i → LPC analysis → LSP L_i; 3D articulatory data A_i → model fitting → 14 articulatory parameters AP_i → resampling; pitch P_i. Training: linear estimator or neural network.]

Synthesis

[Diagram: 14 articulatory parameters AP_k → linear estimator or neural network → LSP L_k* → LSP filter, with pitch P_k → synthetic speech S_k*]

Multimodal synthesis

From articulation to acoustics

[Diagram: paths from the vocal tract model to the waveform: via the area function, tubes and an electric circuit equivalent; via cross-sections and 2D airflow dynamics; or via 3D air flow calculations]

Area & transfer functions

[Diagram: parameter settings → articulatory model → area function (area vs. distance) → transfer function (amplitude vs. frequency) → formants]

Formants


Vocal tract models lab
http://www.speech.kth.se/courses/GSLT_SS/lab2.html

• Synthesize /aa, ii, uu/ with
a) a two-tube model
b) a three-parameter model
c) an area function model
d) an articulatory model

• Investigate what happens if a nasal tract is added for each model.

• Compare the four methods regarding flexibility, complexity and intuitiveness. What are the advantages and disadvantages of each of them?

• Use the articulatory model to investigate how the seven parameters influence the vocal tract shape and acoustics. Start from a neutral vocal tract (set all values to 0) and vary each parameter.

• Move or place your own articulators in the same way; do your intuitive thoughts about the effect of your articulatory movements correspond to the results in the model?

• Experiment with the parameters in 'Tract Configuration' and 'Physical Constants'. What influence do they have on the synthesis?

Formant values (F1 F2 F3, in Hz) are:
/aa/: 650 1000 2500
/ii/: 290 2050 2400
/uu/: 300 700 2100

Equivalent circuit

[Circuit: a tube section of cross-sectional area A modelled with the series elements L, R and the shunt elements C, G]

• The tube has an acoustic mass ~ $L = \rho / A$
• The air functions as a spring ~ $C = A / (\rho c^2)$
• There are friction losses ~ R, G

A is the cross-sectional area of the tube, ρ is the air density and c is the speed of sound in air.

Acoustically   Mechanically   Electrically
flow           speed          current
pressure       force          voltage
ac. mass       mass           inductance
ac. spring     mech. spring   capacitance

Assumptions

• The tube has rigid walls.
• Since the cross-sections are small compared to the tube length we have a plane wave.
• Two-directional waves with reflections between the tubes: $r = \frac{A_1 - A_2}{A_1 + A_2}$
• The “current” and “voltage” are sinusoidal:

$U(x) = U_+ e^{-\gamma x} + U_- e^{\gamma x}$
$I(x) = I_+ e^{-\gamma x} + I_- e^{\gamma x}$

with $\gamma = \sqrt{(R + j\omega L)(G + j\omega C)}$. The indices + and − indicate the direction.

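In an electric circuit:

[Circuit: T-network with series impedances $Z_a$, $Z_a$ and shunt impedance $Z_b$; input $U_{in}$, $I_{in}$; output $U_{out}$, $I_{out}$; the shunt branch carries $I_{in} - I_{out}$]

$U_{in} = I_{in}(Z_a + Z_b) - I_{out} Z_b$
$U_{out} = I_{in} Z_b - I_{out}(Z_a + Z_b)$

The transfer function is the ratio
$H = \frac{I_{out}}{I_{in}}$

Long calculations give:
$Z_a = Z_0 \tanh\left(\frac{\gamma l}{2}\right)$
$Z_b = \frac{Z_0}{\sinh(\gamma l)}$

If we assume that the tube is lossless, $Z_0$ and $\gamma$ are simplified:
$Z_0 = \sqrt{\frac{R + j\omega L}{G + j\omega C}} = \sqrt{\frac{L}{C}} = \sqrt{\frac{\rho / A}{A / (\rho c^2)}} = \frac{\rho c}{A}$
$\gamma = \sqrt{(R + j\omega L)(G + j\omega C)} = j\omega\sqrt{LC} = \frac{j\omega}{c}$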

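Example: the neutral vowel

[Circuit: a single tube between glottis and lips as a T-network $Z_a$, $Z_b$, $Z_a$, with $I_{in}$ at the glottis, $I_{out}$ at the lips and $I_{out} - I_{in}$ in the shunt branch]

$Z_a I_{out} + Z_b (I_{out} - I_{in}) = 0 \Rightarrow H = \frac{I_{out}}{I_{in}} = \frac{Z_b}{Z_a + Z_b}$

$Z_a = Z_0 \tanh\left(\frac{\gamma l}{2}\right) = Z_0 \frac{\cosh(\gamma l) - 1}{\sinh(\gamma l)}, \quad Z_b = \frac{Z_0}{\sinh(\gamma l)}$

$\Rightarrow H = \frac{Z_0 / \sinh(\gamma l)}{Z_0 \frac{\cosh(\gamma l) - 1}{\sinh(\gamma l)} + \frac{Z_0}{\sinh(\gamma l)}} = \frac{1}{\cosh(\gamma l)}$

Lossless: $\gamma = \frac{j\omega}{c} \Rightarrow H = \frac{1}{\cosh\left(\frac{j\omega l}{c}\right)} = \frac{1}{\cos\left(\frac{\omega l}{c}\right)}$

Poles: $\cos\left(\frac{\omega l}{c}\right) = 0 \Rightarrow \frac{\omega_n l}{c} = \frac{\pi}{2} + n\pi,\ n = 0, 1, 2, \ldots \Rightarrow \omega_n = \left(\frac{\pi}{2} + n\pi\right)\frac{c}{l} \Rightarrow F_n = \frac{c}{4l}(2n - 1),\ n = 1, 2, 3, \ldots$

For a typical male speaker, $l = 17.5$ cm and $c = 350$ m/s, we get $F = \{500, 1500, 2500, 3500, \ldots\}$ Hz.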
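The pole formula for the lossless uniform tube, $F_n = (2n - 1)c/(4l)$, is easy to check numerically. The short sketch below does so; the function name is an assumption for the example.

```python
def uniform_tube_formants(length_m, c=350.0, count=4):
    """Resonances (Hz) of a lossless uniform tube, closed at the glottis
    and open at the lips: F_n = (2n - 1) * c / (4 * l), n = 1, 2, 3, ..."""
    return [(2 * n - 1) * c / (4 * length_m) for n in range(1, count + 1)]
```

With `length_m=0.175` this reproduces the neutral-vowel formants of about 500, 1500, 2500 and 3500 Hz. A shorter tract, for example 0.14 m, shifts every formant upwards by the same factor.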

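Two tubes

Connect two homogeneous tubes.

[Circuit: two T-networks in cascade ($Z_{a2}$, $Z_{b2}$, $Z_{a2}$ and $Z_{a1}$, $Z_{b1}$, $Z_{a1}$) with currents $I_{in}$, $I_1$ and $I_{out}$; the shunt branches at node 1 and node 2 carry $I_{in} - I_1$ and $I_1 - I_{out}$]

Node 1: $(I_{in} - I_1) Z_{b2} = I_1 \left( Z_{a2} + Z_{a1} + \frac{Z_{a1} Z_{b1}}{Z_{a1} + Z_{b1}} \right)$

Node 2: $(I_1 - I_{out}) Z_{b1} = I_{out} Z_{a1} \Rightarrow I_{out} = I_1 \frac{Z_{b1}}{Z_{a1} + Z_{b1}}$

Even longer calculations give:

$H = \frac{I_{out}}{I_{in}} = \frac{1}{\cosh\left(\frac{j\omega l_1}{c}\right) \cosh\left(\frac{j\omega l_2}{c}\right) \left(1 + \frac{A_2}{A_1} \tanh\left(\frac{j\omega l_1}{c}\right) \tanh\left(\frac{j\omega l_2}{c}\right)\right)}$

Poles when $\frac{A_2}{A_1} \tan\left(\frac{\omega l_1}{c}\right) \tan\left(\frac{\omega l_2}{c}\right) = 1$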


Example with two tubes

A male speaker produces a vowel with a constricted pharynx (the pharyngeal area is one eighth of that in the oral cavity). Calculate the first two formant frequencies.

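$l_1 = l_2 = \frac{l}{2}$, $\frac{A_2}{A_1} = \frac{1}{8}$, $l = 17.5$ cm for a male speaker, $c = 350$ m/s

$\frac{A_2}{A_1} \tan\left(\frac{\omega l_1}{c}\right) \tan\left(\frac{\omega l_2}{c}\right) = 1 \Rightarrow \tan^2\left(\frac{\omega l}{2c}\right) = \frac{A_1}{A_2} \Rightarrow \tan^2\left(\frac{2\pi F \cdot 0.175}{2 \cdot 350}\right) = 8$

$\Rightarrow F_n = \frac{2000}{\pi}\left[\pm\arctan\sqrt{8} + n\pi\right] \Rightarrow F_1 = 784$ Hz, $F_2 = 1216$ Hz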
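When the two tubes do not have equal lengths there is no closed-form solution, but the pole condition can be solved numerically. The sketch below scans for sign changes and refines them by bisection. It takes the area ratio as the glottis-side area divided by the lip-side area (1/8 in the example), and it multiplies the condition through by the cosines so that the singularities of the tangent are avoided. The function name, the 1 Hz scan step and the 4 kHz limit are assumptions.

```python
import math

def two_tube_formants(l1, l2, area_ratio, c=350.0, f_max=4000.0, step=1.0):
    """Poles (Hz) of the lossless two-tube model below f_max.

    Poles satisfy area_ratio * tan(w*l1/c) * tan(w*l2/c) = 1, rewritten as
    cos(w*l1/c)*cos(w*l2/c) - area_ratio*sin(w*l1/c)*sin(w*l2/c) = 0.
    """
    def g(f):
        w = 2 * math.pi * f
        return (math.cos(w * l1 / c) * math.cos(w * l2 / c)
                - area_ratio * math.sin(w * l1 / c) * math.sin(w * l2 / c))

    roots, lo = [], 0.0
    while lo < f_max:
        hi = lo + step
        if g(lo) * g(hi) < 0:          # sign change: refine by bisection
            a, b = lo, hi
            for _ in range(50):
                mid = (a + b) / 2
                if g(a) * g(mid) <= 0:
                    b = mid
                else:
                    a = mid
            roots.append((a + b) / 2)
        lo = hi
    return roots
```

Calling `two_tube_formants(0.0875, 0.0875, 1 / 8)` gives first two values close to the 784 Hz and 1216 Hz of the worked example.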

Consonants

The source is somewhere else than at the glottis.

Some cavities may be closed, e.g. the mouth cavity for nasals.

Consonants

Two homogeneous tubes in series

[Circuit: two T-networks in cascade ($Z_{a2}$, $Z_{b2}$, $Z_{a2}$ and $Z_{a1}$, $Z_{b1}$, $Z_{a1}$) with a power source $U$ inside the tract and currents $I_{in}$, $I$ and $I_{out}$]

[Equation: $I_{out}/I_{in}$, with $\sinh\theta_2$ in the numerator and $\cosh\theta_1 \cosh\theta_2 \left(1 + \frac{Z_1}{Z_2} \tanh\theta_1 \tanh\theta_2\right)$ in the denominator]

The same poles as before, but now zeroes as well, when $|\sinh\theta_2| = 0$!

Formant Synthesis

OVE II

Model the poles directly instead!

Digital resonators


Formants and bandwidths

An all-pole model: resonances when the denominator is zero.

Bandwidths: a function of energy losses due to heat conduction, viscosity, cavity-wall motions, radiation of sound from the lips and the real part of the glottal source impedance.
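One common way to model a single pole pair directly is a second-order digital resonator of the kind used in Klatt-style formant synthesizers, where the formant frequency sets the pole angle and the bandwidth sets the pole radius. The sketch below is an illustration under that assumption and is not a description of any specific synthesizer discussed in these slides.

```python
import math

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = (2 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2 * math.pi * freq_hz * t))
    a = 1.0 - b - c                      # normalises the gain at 0 Hz to 1
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Cascading several such resonators, one per formant, gives an all-pole transfer function whose poles are set directly by the formant frequencies and bandwidths.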

Synthesis by rule

Lab exercise 3

Formant synthesis lab
http://www.speech.kth.se/courses/GSLT_SS/lab3.html

• The task is to adapt the synthesis of "Dom flyttade möblerna" to sound like a target speaker.

• Start Wavesurfer and open the reference sentence in the "Speech Analysis" configuration.

• Create a transcription pane: right-click > "Create Pane > Transcription". Download the automatic transcription and load it by right-clicking in the transcription pane > "Load transcription".

• Start a new Wavesurfer (File > New), choose "Formant Synthesis sw".

• Type "Dom flyttade möblerna" into the Text slot and Synthesize.

• Edit the parameters that are displayed: F0, F1-F4.
– Left-click and drag the parameter track.
– Insert new control points: right-click on a parameter track.

• To make a phoneme longer/shorter: click in the transcription window and drag to the left/right.

• How close do you get by just editing pitch, formants and duration?

Data-driven formant synthesis

Keeps the flexibility of the formant synthesis

More natural sounding than rule-driven synthesis

Speaker adaptation

[Figure: formant tracks (0–4000 Hz) over time (0–0.9 s) for the segments M O B I: L sil]

Formant unit selection

Formants are chosen through unit selection from a formant diphone library of about 2000 diphones.

Formant trajectories are scaled and interpolated to fit rule-generated durations.
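The scaling step can be illustrated with plain linear interpolation: a stored formant track with one value per frame is resampled to the number of frames that the rule-generated duration requires. This is a minimal sketch under that assumption. The function name is hypothetical and the actual system may scale and interpolate differently.

```python
def resample_track(track, new_length):
    """Linearly interpolate a formant track (Hz per frame) to new_length frames."""
    if new_length == 1 or len(track) == 1:
        return [float(track[0])] * new_length
    out = []
    for i in range(new_length):
        # Position of output frame i on the original frame axis.
        pos = i * (len(track) - 1) / (new_length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out
```

Stretching a two-frame track `[500, 700]` to three frames inserts the midpoint 600 Hz, and the end points are always preserved.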

Synthesis comparison

[Figure: data-driven and rule-driven formant tracks (0–4000 Hz) over time (0–0.9 s) for the segments M O B I: L sil]


Listening test evaluations

• 15 subjects, 20 sentences, continuous scale from ”Unnatural” to ”Natural”.

• 4 types of stimuli:
1. Rule-based synthesis
2. Data-driven formant synthesis
3. + Data-driven fricative synthesis
4. + Replace the voiceless fricatives (/f/, /s/, /sj/, /tj/, /rs/) and plosives (/k/, /p/, /t/, /rt/) with recorded versions.

• 12 subjects, 10 sentences, binary scale

• Data-driven synthesis with manually corrected formant data was preferred in 73 % of the cases over rule-driven synthesis

Evaluation results

[Figure: ratings of rule-driven and data-driven synthesis in four panels: overall, sentences without critical errors, sentences with critical errors, hand-corrected sentences]

Evaluation

Evaluation: Why?

• Monitoring the development
– Initial: choosing a ”good” voice, a good inventory.
– Progress evaluation
– Diagnostic evaluation: find out where things go wrong and why.

• Performance evaluation
– intelligibility
– comprehensibility
– quality
– adequacy
– usability

For developers: overall quality evaluation

For users: comparative evaluation

Diagnostic evaluation: Different levels

• Segmental: intelligibility tests on the ability to distinguish individual sounds.
– Diagnostic Rhyme Test (DRT)
– Modified Rhyme Test (MRT)
– Minimal Pair Intelligibility Tests (MPIT)
– Phonetically Balanced Word List (PB)
– Nonsense words

• Sentence: comprehension of words or short sentences

• Comprehension: more than one sentence.

• Prosody: assessment of intonation and emotions

• Subjective opinions

Standard procedures are available only for segmental evaluation

Diagnostic Rhyme Test

• Consonant intelligibility in word-initial position.

• 96 word pairs to test 6 characteristics:
– Voicing: veal-feel
– Nasality: reed-deed
– Sustension: vee-bee, sheet-cheat
– Sibilation: sing-thing
– Graveness: weed-reed
– Compactness: key-tea

• Forced choice

• Intelligibility = number of correct identifications compared to all words.

• Diagnostic information given in confusion matrices.
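The scoring described above can be sketched in a few lines: the intelligibility score is the proportion of correct identifications, and the confusions are counted per (presented, chosen) pair. The data layout and function name are assumptions for the example, and no correction for guessing is applied.

```python
from collections import Counter

def drt_score(trials):
    """Score a forced-choice rhyme test.

    trials: list of (presented_word, chosen_word) pairs.
    Returns (proportion correct, Counter of confusions).
    """
    correct = sum(1 for presented, chosen in trials if presented == chosen)
    confusions = Counter((p, c) for p, c in trials if p != c)
    return correct / len(trials), confusions
```

A confusion count such as `("veal", "feel")` points at a voicing problem, which is the diagnostic information the test is designed to give.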


Pros and cons of DRT

Pros
• Limited number of stimuli, not too time consuming
• Naive listeners can take part
• Easy to interpret the results
• Confusion matrices help to localise the problems

Cons
• Consonantal intelligibility only in word-initial position
  => Modified Rhyme Test: initial and final
• Isolated contexts
• Closed response format

Choice: din, sin, fin, pin, win, tin.

Minimal pairs intelligibility test

• Nonsense sentences. Forced choice (two alternatives).
a) ”the uniform towels snitch a sniffer” / b) ”the uniformed towels snitch a sniffer”
– Forced choice between: uniform - uniformed

• Phonetic features:
– Consonant and vowel substitution: copper-chopper, tutor-teeter
– Consonant insertion/deletion: attitudes-altitudes
– One-feature substitutions: ringers-riggers
– Two-feature substitutions: burnish-furnish
– Word initial: gasket-basket
– Word internal: musty-musky
– Word final: familiar-familial

• Segment location: stressed, unstressed

• Word location: initial, medial, final

Evaluation problems

• It is unrealistic to test one level at a time: they are not independent.

• Can we really evaluate the intelligibility of TTS at segmental level?

• Is intelligibility more important than naturalness?

• Limitations of subjective tests
– Learning effects
– Concentration problems
– Choice of listeners: naive or expert?

• Is it possible to build objective tests?

Comprehension test

Single-task performance measure: listen to and understand 2 passages and answer ten multiple-choice questions.
– Subjects: 2 groups, one listening to synthetic speech, one listening to natural speech.
– Results: no significant differences between understanding synthetic and natural speech.

Multiple-task performance measure (checks mental load): listen to and understand 1 passage and at the same time detect clicks occurring in the passage.
– Subjects: same
– Results: subjects who listened to synthetic speech took longer to identify the clicks.

De Logu et al. 1998

Comprehension tests are difficult to construct due to the intervention of cognitive factors.

Subjective opinion tests

• Listeners are presented with a set of stimuli to be rated on:
– Overall impression & acceptance (quality)
– Listening effort, comprehension (intelligibility)
– Pronunciation, speaking rate & voice quality (naturalness)

Mean Opinion Score: evaluates the general speech quality
5 Excellent, 4 Good, 3 Fair, 2 Poor, 1 Bad

Degradation Mean Opinion Score: evaluates how disturbances are perceived
5 Inaudible, 4 Audible, not annoying, 3 Slightly annoying, 2 Annoying, 1 Very annoying
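Computing a Mean Opinion Score is simple arithmetic over the listener ratings. The sketch below also reports an approximate 95% confidence half-width using a normal approximation. That interval is an addition for illustration and is not part of the scale definitions above.

```python
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, approximate 95% confidence half-width).

    ratings: opinion scores on the 1..5 scale, one per listener and stimulus.
    The half-width uses 1.96 * standard error (normal approximation).
    """
    mos = mean(ratings)
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / len(ratings) ** 0.5
    else:
        half_width = 0.0
    return mos, half_width
```

Two systems whose intervals overlap heavily should not be ranked on the basis of their MOS alone.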

Comparing systems

• No standard procedures are available to carry out comparative evaluations of systems.

• Most common is to use preference scores:
(- System A much better)
- System A better
- No difference
- System B better
(- System B much better)

[Figure: stacked bar chart (0–100%) of preference judgements (much worse, worse, no difference, better, much better) for M HTS, M HTS prosody, M HTS spectrum, F HTS, F HTS prosody and F HTS spectrum]


Synthesis of the future

Speaker adaptation

• Why?
– Make the synthesis more human-sounding, more diverse, more personalized.
– Synthesize ordinary speech of ordinary people!

• What?
– The non-linguistic (?) features of the acoustic signal: voice quality, gender, age, dialect, sociolect.

• How?
– Record the speaker as target or adapt the synthesis (by statistics or rules)

• Various contexts: low, raspy voices, strong, commanding voices, children’s and old persons’ voices, promotional voices, emotional voices, etc.

Speaker characteristics: Linguistic vs. Individual components

• The linguistic component: semantic information that is part of the speaker’s language (e.g. question intonation)

• The paralinguistic component: the speaker’s attitudinal or emotional states, sociolect and regional dialect.

• The extralinguistic component: the individuality, gender and age of a certain speaker. It can be judged independently of the language.

To adapt a speech synthesizer to a certain speaker, we need both the para- and extralinguistic components.

Speaker Variability: Dialect

• Different dialects use different phonemes for the same word
– e.g. British vs. American “better”
– British vs. Australian ”say”

• Different dialects use different allophones for the same phoneme:
– Swedish: Öga/Öra, Äga/Ära (Värmland-Östergötland)

• Differences in prosody and accent.

Speaker Variability: Individual Differences

Within-Speaker Variability
• Can change F0 and voice quality

Between-Speaker Variability
• Cannot change basic physiology (lungs, vocal folds, vocal tract…), which limits ranges of F0 and voice qualities
• Difficult to change the
– Sociolect: level of education/social environment
– Personal history

Sociolect

New York City department store study: [r] in 'fourth floor'

SHOP           [r]%   STATUS
Saks 5th Av.   62%    high
Macy's         51%    middle
Klein's        20%    low

Swedish: Liiidingö, sju

[Figure: bar chart of mean normalized F0 in vowels (in Bark) for different languages: RP English, French, Am English, Swedish, Dutch; F0 axis from 0.5 to 1.2]


Sounding Gay: Fricative duration

• Crist (1997) - 5 out of 6 speakers exhibited longer /s/ in gay stereotyped speech
• Linville (1998) - gay speakers had longer /s/
• Rogers, Smyth, and Jacobs (2000) - both /s/ and /z/ were longer in gay-sounding speech
• Levon (2004) - altering sibilant duration alone insufficient to change perception of gayness

Emotions

Creating emotional speech synthesis of a text requires:

a. Signal processing: algorithms for altering the acoustic prosodic parameters of the speech.

b. Prosody modeling: creating typical patterns corresponding to different emotions.

c. Text analysis: finding textual cues to prosody and the expressive intention of a text.

Two approaches

1. To design a general method of assigning a given expressive intention to any text.
– An ongoing and challenging task, involving research on signal processing, speech acoustics and human communication.

2. Enriching synthetic messages with expressive phrases and sounds, which convey expressive intentions.
– A commercially available solution: e.g., Loquendo, IBM.

<speak version=”1.0" xml:lang="en-US">Yes sir, the package will be on your desk tomorrow. And I say that with the utmost confidence. I will take care of it.How will I take care of it? I don’t know how I’m going to take care of it. If I knew how to take care of it </speak>

<prosody emotion=“calm” attitude=“confident”>Yes sir <mm_hmm/> the package will be on your desk tomorrow. And I <prosody ToBI=“H*”/> say that with the utmost confidence. <emphasis> I </emphasis> will take <emphasis> care </emphasis> of it. </prosody><prosody emotion=“despair”> <creakiness=“high”> <groan/> How will I take care of it? I don’t<prosody ToBI=“H*”/> know how I’m going to take care of it. <sigh/> If I knew how to take care of it <sobbing/></prosody>

Emotion analysis

How to determine synthesis parameters for different emotions?
– Professional acting
– Amateur acting
– Read a text with different emotions

• Acted and read speech is widely used, but… does it reflect the way emotions are expressed in spontaneous speech?

Alternatives:
• Wizard of Oz scenarios
• Customer calls to call centres
– Lots of real emotional speech
– But, permissions?
• TV shows (Oprah, Ricki Lake, Dr. Phil etc.)

Emotion databases

Again, two approaches:

1. Create large databases for each emotion you want to synthesize and use the entries as such
• E.g. diphones
• Duplicate the database for each emotion…

2. Modify the default output signal from the synthesizer using emotion rules
• Small set of phonetically balanced sentences (25 or so)
• Sentences without emotional content, e.g. The competitor has made twenty five offers, closing only five contracts
• Compare with a Neutral style.

Emotion correlates

[Figure (Loquendo): four bar charts showing the change in % for angry, happy and sad speech, per syllable type (FS, S, LS, U): syllable duration, mean F0, F0 range and RMS energy]

FS: the first stressed syllable of the sentence or after a speech pause
S: stressed syllable
LS: the last stressed syllable of the sentence or before a speech pause
U: unstressed syllable


Emotion synthesis scheme

[Diagram: input text → acoustic unit selection → signal analysis prosodic parameters (energy, duration, pitch) → “E” rules, “D” rules and “P” rules for the chosen expressive style → synthesis prosodic parameters (energy, duration, pitch) → time and pitch scaling + gain function (PSOLA) → output waveform]
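The rule stage of such a scheme can be sketched as per-parameter relative changes applied to the neutral prosodic targets of each unit. All rule values below are invented placeholders, and the data layout is an assumption. Real “E”, “D” and “P” rules would also depend on the syllable type, and the modified targets would then be realised by time and pitch scaling.

```python
# Hypothetical relative changes versus neutral (placeholder values only).
EMOTION_RULES = {
    "neutral": {"energy": 0.0, "duration": 0.0, "pitch": 0.0},
    "sad": {"energy": -0.05, "duration": 0.08, "pitch": -0.10},
    "angry": {"energy": 0.08, "duration": -0.05, "pitch": 0.20},
}

def apply_emotion(units, emotion):
    """Scale the energy, duration and pitch targets of each unit.

    units: list of dicts with keys "energy", "duration" and "pitch".
    Returns new dicts; the input is left unchanged.
    """
    rules = EMOTION_RULES[emotion]
    return [{key: unit[key] * (1.0 + rules[key])
             for key in ("energy", "duration", "pitch")}
            for unit in units]
```

Keeping the rules as data makes it easy to add a new expressive style without touching the signal-processing stage.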

Examples

[Figure: F0 contours (50–300 Hz) over time (0–1.4 s) for neutral, sad, happy and angry versions of an utterance; Loquendo’s Susan]

Many more on http://emosamples.syntheticspeech.de/

Evaluation results

• Texts without emotional content.
– “The competitor has made twenty five offers, closing only five contracts”

• Volunteers listened to samples in random order and rated from 0 to 5 how sad, angry, happy or neutral each stimulus sounded.

[Figure: bar chart of ratings (0–4.5) on the neutral, angry, happy and sad scales for the neutral, angry, happy and sad TTS stimuli]

Emotional questions

But, the closer we get to “real” emotions, the more difficult it is to recognize them!
• Up to 95% correct identification on acted speech
• Up to 79% on read speech
• Up to 73% on lab-recorded dialogue data

What is the goal of expressive synthesis?
• Convey an emotion?
• Make the synthesized emotion sound natural?

And, how many emotions do we have?
• Four?
• Seven? (Ekman: neutral + sadness, happiness, anger, fear, disgust, surprise)

Why 'real speech' synthesis?

• 'Yeah!', 'Right on!', 'Fantastic!', 'Hi!'

• Why can’t we synthesize 'real speech'?
– Because we assume that words alone carry most of the meaning in speech
– But the '85%' (?) of speech which is non-verbal is largely monosyllabic
– Monosyllables can be very repetitive, unless they vary in another dimension

• Voice quality: spectral features, voice source features and temporal features (e.g., voice on-/offsets, jitter, creak, etc.).

But how to synthesize?

The NATR approach
• 1000 hours of everyday conversation
• Recorded with head-mounted mic to DAT and Minidisc
• Analyzed acoustically, manually transcribed, & perceptually labeled
• No studio use, no recording constraints
• Japanese native-language speakers, mixed ages, in everyday situations

=> A paralinguistic speech corpus


Acoustic Analysis

[Diagram: analysis tracks: boundaries of quasi-syllabic nuclei; quasi-syllable boundaries; F0 contour; sonorant energy contour; (a) variance in delta-cepstrum; (b) formant / FFT cepstral distance; composite (a & b) measure of reliability; glottal AQ (pressed to breathy); estimated vocal-tract area functions; phonetic labels (if available)]

Discourse Act Labelling

a あいさつ greeting
b 会話終了 closing
c 自己紹介 introduce-self
d 話題紹介 introduce-topic
e 情報提供 give-information
f 意見、希望 give-opinion
g 応答肯定 affirm
h 応答否定 negate
i 受け入れ accept
j 拒絶 reject
k 了解、理解、納得 acknowledge
l 割り込み, 相づち interject
m 感謝 thank
n 謝罪 apologize
o 反論 argue
p 提案、申し出 suggest, offer
q 気づき notice
r 依頼、命令 request-action
s つなぎ connector
t 文句 complain
u 褒める flatter
w 独り言 talking-to-self
x 言い詰まり disfluency
y 演技 acting
z 繰り返し repeat
r* 要求 request (a~z)
v* 確認を与える verify (a~z)

Speaking style (and voice) vary greatly, depending upon
(a) the situation
(b) who we are speaking to ...
(c) how we feel about what we are saying!

Concept-to-speech: why?

Ways to say “question about my bill” to AT&T:

105 question about my bill
63 question on my bill
57 calling about my bill
43 talk to somebody about my bill
41 talk to someone about my bill
32 questions about my bill
30 problem with my bill
23 speak to someone about my bill
22 calling about a bill
20 calling about my phone bill
16 questions on my bill
16 question about a bill
15 talk about my bill
11 question about my phone bill
11 question about my billing
11 discuss my bill
10 speak with someone about my bill
10 calling about my billing
9 problem with my phone bill
9 calling about my telephone bill
8 speak to someone in billing
8 question about the bill
7 speak to somebody about my bill
7 speak to a billing
7 question on my phone bill
7 calling regarding my bill
7 calling concerning my bill
6 talk to somebody in billing
6 questions about my billing
6 question on my billing
6 problem with my billing
6 information about my bill
6 calling about my A T and T bill
5 talk to someone about my phone bill
5 talk to someone about a bill
5 talk to somebody about my billing
5 talk to somebody about a bill
5 speak to someone in the billing
5 speak to someone about a bill
5 questions on my billing
5 question on the bill
5 question on a bill
5 question my bill
5 calling in regards to my bill
5 calling about the bill
4 talk to someone about my telephone bill
4 talk to somebody about my account
4 talk to billing
4 speak with someone in billing
4 question about my telephone bill
4 information on my bill
4 calling regarding my statement
..............
1 talk to someo- to someone about my moms telephone bill
1 question about the new A T and T billing

Total 1083 variations in 1912 matches

Humans do not read a text aloud, we talk!

So is future text-to-speech synthesis just cut and paste from an enormous database?

Concept-to-speech: what?

• Input: abstract presentation goal or a machine-generated message.
• Output: syntactic structure for concept-to-speech synthesis
• Language-independent text planning component
• Language-specific domain grammars
• Enriched information passed to synthesis

Concept-to-speech: how?

Slot filling or generation?

Either: put key information into different carrier phrases

Or: generate utterances based on content.
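The slot-filling alternative is the simpler of the two and can be sketched as template lookup plus string formatting. The carrier phrases, concept names and slot names below are invented for illustration. A generation-based system would instead build the utterance from the content and pass the enriched structure on to synthesis.

```python
# Hypothetical carrier phrases, one per concept.
CARRIER_PHRASES = {
    "departure": "The train to {destination} leaves at {time}.",
    "delay": "The train to {destination} is delayed by {minutes} minutes.",
}

def render(concept, **slots):
    """Fill the carrier phrase for a concept with the key information."""
    return CARRIER_PHRASES[concept].format(**slots)
```

Because the carrier phrase is fixed, its prosody can be prepared in advance and only the slot values need to be synthesized freshly.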