6/3/20151 voice transformation : speech morphing gidon porat and yizhar lavner sipl – technion iit...

04/18/23 1

Voice Transformation : Voice Transformation : Speech MorphingSpeech Morphing

Gidon Porat and Yizhar Lavner

SIPL – Technion IITDecember 1 2002

04/18/23 2

Project GoalsProject Goals

Gradually change a source speaker’s voice, to sound like the voice of a target speaker.

The inputs : two reference voice signals,

one for each speaker. The output : N voice signals that gradually

change from source to target.

source targetinterp

04/18/23 3

ApplicationsApplications

● Multimedia and video entertainment: voice Morphing , just like it’s “face” counterpart.

While seeing a face gradually changing from one person to another’s (like often done in video clips) ,we could simultaneously hear his voice changing as well.

● Forensic voice identification by synthesis:Identifying a suspect voice by creating voice-bank of different pitch, rate and timbre. Similar method was developed for face recognition to replace police sketch artists.

04/18/23 4

The ChallengesThe Challenges

The source and target voice signals will never be of the same length. A time varying method is needed.

It is needed to change the source voice’s characteristics to those of the target speaker : pitch,duration,spectral parameters.

Produce a natural sounding speech signal.

source target

04/18/23 5

The Challenges cont.The Challenges cont.

Here are two identical words (“shade”) form source and target

The target speaker’s word lasts longer then the source speaker’s and it’s “shape” is quite different.

04/18/23 6

Speech Modeling Speech Modeling

A1 A2 A3 Ak

Sound transmission in the vocal tract can be modeled as sound passing through concatenated loss less acoustic tubes.

A known mathematical relation between the areas of these tubes and the vocal tract’s filter, will help in the implementation of our algorithm

04/18/23 7

Speech Modeling - Synthesis Speech Modeling - Synthesis

Impulsetrain

generator

Glottal PulsemodelG(z)

RandomNoise

generator

Vocaltract model

V(z)

RadiationmodelR(z)

PitchDetector

LinearPredictor

SpeechProductionSystem

Voiced/unvoiced

The basic synthesis of digital speech is done by the discrete-time system model.

04/18/23 8

sample

align

interpolate

Prototype Waveform Prototype Waveform InterpolationInterpolation

PWI is a speech coding method, which is based on the presentation of a speech signal, or its residual error function, by a 3-D surface.

The creation of such a surface is described by 3 main steps.

04/18/23 9

Surface Construction AlgorithmSurface Construction Algorithm

Error SignalComputation

LPC parametersPitch detection marks

Short TimePitch FunctionComputation

PrototypeWaveformSampler

AlignBetween

0 and 2*pi

Align WithPrevious

Waveform

InterpolateAlong

Time Axis

the errorwaveform surface

04/18/23 10

The SolutionThe Solution

Use 3D surfaces that will capture each vocal phoneme’s residual error signal characteristics and interpolate between the two speakers.

Unvoiced phonemes are not dealt with due to complexity and the fact that they carry little information about the speaker.

04/18/23 11

Algorithm – Block DiagramAlgorithm – Block Diagram

PhonemesBoundaries

Demarcation

LPCAnalysis

PitchDetector

PhonemesBoundaries

Demarcation

PitchDetector

LPCAnalysis

WaveformCreator

WaveformCreator

Interpolator

e(n)reconstruction

new LPCcalculator

Speaker A Speaker B

Synthesizer

New Interpolated Voice

04/18/23 12

The Algorithm

+

Intermediate surface

Speaker A

Speaker B

The new error function, that will be reconstructed from that surface, will then be the input of a new Vocal Tract Filter.

Once the surfaces for both source and target speakers are created (for each phoneme) an interpolated surface is created.

04/18/23 13

After creating a surface for each speaker’s phoneme, an intermediate surface is created.

The Waveform – IntermediateThe Waveform – Intermediate

( , ) ( , ) (1 ) ( , )new s tu t u t u t

( , ) :{ :[0 ]; :[0 2 ]}

( , ) :{ :[0 ]; :[0 2 ]}

( , ) :{ :[0 ]; :[0 2 ]}

/ ; /

s s

t t

new new

new s new t

u t t T

u t t T

u t t T

T T T T

04/18/23 14

The Waveform - ReconstructionThe Waveform - ReconstructionThe new, intermediate error signal can be evaluated from the new surface by the equation :

And that :

( ) ( , ( )), [0 : ]new new new newe t u t t t T

0

'0 '

2( ) ( )

( )

t

newnewt

t t dtp t

( )new t

Assuming the pitch cycle changes slowly in time :

( ) ( ) (1 ) ( )new s tp t p t p t

04/18/23 15

Algorithm – Cont.Algorithm – Cont.

A1 A2t A3t A4t A5t A6t

A1s A2s A3s A4s A5s A6s

A1n A2n A3n A4n A5n A6n

+

The areas of the new tube model will be an interpolation between the source’s and the target’s.

Once the new Areas are computed the LPC parameters and V(z) can be calculated, and the signal can be synthesized.1 2 3 4 5 6 7( , , , , , , )a a a a a a a

04/18/23 16

New Voiced Phoneme Creation New Voiced Phoneme Creation Block DiagramBlock Diagram

evaluatenew p(t)

source and target error surface

source and target short time

pitch function

modificationparameter (a)

calculatenew fi(t)

create newerror surface

recover newerror function

claculate newtube areas

source and targetLPC parameters

calulatenew LPC

parameters

synthesizenew speech

signalnew

phoneme

modificationparameter (a)

04/18/23 17

The transfer function The transfer function

The factor could be invariant, and then the voice produced will be an intermediate between the two speakers, or it could vary in time, (from at t=0 to at t=T), yielding a gradual change from one voice to the other.

In order for one to hear a “linear” change between the source’s and the target’s voices, the coefficient, (the relative part of ) has to vary in time nonlinearly with a fast transition to a = 0.5 and a slow transitions around it.

( , )su t

04/18/23 18

0 1000 2000 3000 4000 5000 6000 7000-0.03

-0.02

-0.01

0

0.01

0.02

0.03

0.04

The Algorithm – cont.The Algorithm – cont.

The final and new speech signal is created by concatenating the new vocal phonemes, in order, along with the source’s/target’s unvoiced phonemes and silent periods.

1

silent

2voiced3

unvoiced

4voiced

04/18/23 19

ConclusionConclusion

The utterances produced by the algorithm were shown to consist of intermediate features of the two speakers. Both intermediate sounds:

And gradually changing sounds were produced.

04/18/23 20

Algorithm’s limitationsAlgorithm’s limitations

Although most of the speaker’s individuality is concentrated in the voiced speech segments, degradation could be noticed, when interpolating between two speech signals that differ greatly in their unvoiced speech segments, such as heavy breathing, long periods of silence, etc.

04/18/23 21

What has been done in the What has been done in the second part of this project?second part of this project? Basic implementation of reconstruction algorithm. Interpolation between two surfaces. Final implementation of the algorithm. Fixes in the algorithm such as:

– Maximization of the crosscoralation between surfaces.

– Modifications to the morphing factor.

04/18/23 22

Future work proposalsFuture work proposals

The effect of the unvoiced/silent segments of speech on voice individuality.

The effect of the Vocal Tract’s formants shift around the unit cycle on the human’s ear, in order to find a robust mapping of the transform function alpha.

04/18/23 23

Appendix - EquationsAppendix - Equations

( , ) ( , ) (1 ) ( , )new s t alignedu t u t u t

( , ) ( , )t aligned t nu t u t 2

0 0

( , ) ( , )

arg max { }( , )e

Tnew

s t e

tn

s t e

u t u t dtd

u u t

( ) ( ) (1 ) ( )new s tp t p t p t

04/18/23 24

Appendix – transfer function.Appendix – transfer function.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

desired morphing coefficient

actual morphing coefficient

6/3/20151 voice transformation : speech morphing gidon porat and yizhar lavner sipl – technion iit...

Documents

sourcetargetinterp slide

sourcetarget slide

waveform intermediate

target voice signals

new surface

voice morphing

source speaker s voice

voice bank