mbe - sharif
TRANSCRIPT
Page 1 of 34
Outline
Introduction to vocoders
MBE vocoder
– MBE Parameters
– Parameter estimation
– Analysis and synthesis algorithm
AMBE
IMBE
Page 2 of 34
Vocoders - analyzer
1. Speech analyzed first by segmenting speech using a
window (e.g. Hamming window)
2. Excitation and system parameters are calculated for
each segment
1. Excitation parameters : voiced/unvoiced, pitch period
2. System parameters: spectral envelope / system impulse
response
3. Sending this parameters
Page 3 of 34
Vocoders - Synthesizer
System
parameters
Excitation Signal
White noise/ unvoiced
Pulse train/voiced
Synthesized
voice
Page 4 of 34
Vocoders
But usually vocoders have poor quality– Fundamental limitation in speech models
– Inaccurate parameter estimation
– Incapability of pulse train/ white noise to produce all voice
• speech synthesized entirely with a periodic source exhibits a “buzzy” quality, and speech synthesized entirely with a noise source exhibits a “hoarse” quality
Potential solution to buzziness of vocoders is to use of mixed excitation models
In these vocoders periodic and noise like excitations are mixed with a calculated ratio and this ration will be sent along the parameters
Page 5 of 34
Multi Band Excitation Speech
Model
Due to stationary nature of a speech signal, a window w(n) is usually applied to signal
The Fourier transform of a windowed segment can be modeled as the product
of a spectral envelope and an excitation spectrum
In most models is a smoothed version of the original speech spectrum
)(ws
)(wH
)()()( nsnwnsw
|)(| wE
|)(|)()(ˆ www EHs
)(wH )(ws
Page 6 of 34
MBE model (Cont’d)
the spectral envelope must be represented accurately enough
to prevent degradations in the spectral envelope from
dominating.
– quality improvements achieved by the addition of a frequency
dependent voiced/unvoiced mixture function.
In previous simple models, the excitation spectrum is totally
specified by the fundamental frequency w0 and a
voiced/unvoiced decision for the entire spectrum.
In MBE model, the excitation spectrum is specified by the
fundamental frequency w0 and a frequency dependent
voiced/unvoiced mixture function.
Page 7 of 34
Multi Banding
In general, a continuously varying frequency dependent voiced/unvoiced mixture function would require a large number of parameters to represent it accurately. The addition of a large number of parameters would severely decrease the utility of this model in such applications as bit-rate reduction.
To further reduce the number of these binary parameters, the spectrum is divided into multiple frequency bands and a binary voiced/unvoiced parameter is allocated to each band.
MBE model differs from previous models in that the spectrum is divided into a large number of frequency bands (typically 20 or more), whereas previous models used three frequency bands at most .
Page 8 of 34
Multi Banding
Original
spectrum
Spectral
envelope
Periodic
spectrum
V/UV
information
Noise
spectrum
Excitation
spectrum
Synthetic
spectrum
Page 9 of 34
MBE Parameters
The parameters used in MBE model are:
1. spectral envelope
2. the fundamental frequency
3. the V/UV information for each harmonic
4. and the phase of each harmonic declared voiced. The phases of harmonics in frequency bands declared unvoiced are not included since they are not required by the synthesis algorithm
Page 10 of 34
Parameter Estimation
In many approaches (LPC based algorithms) the algorithms for
estimation of excitation parameters and estimation of spectral envelope
parameters operate independently.
These parameters are usually estimated based on heuristic criterion
without explicit consideration of how close the synthesized speech will
be to the original speech.
– This can result in a synthetic spectrum quite different from the
original spectrum.
In MBE the excitation and spectral envelope parameters are estimated
simultaneously so that the synthesized spectrum is closest in the least
squares sense to the spectrum of the original speech “analysis by
synthesis”
Page 11 of 34
Parameter Estimation (Cont’d)
the estimation process has been divided into two
major steps.
1. In the first step, the pitch period and spectral
envelope parameters are estimated to minimize the
error between the original spectrum and the synthetic
spectrum.
2. Then, the V/UV decisions are made based on the
closeness of fit between the original and the synthetic
spectrum at each harmonic of the estimated
fundamental.
Page 12 of 34
Parameter Estimation (cont’d)
The parameters estimated by
minimizing the following error
criterion:
– Where
The error in an interval
is minimized at:
dss ww
2
)(ˆ)(2
1
|)(|)()(ˆ www EHs
dEAsm
m
b
a
Wmwm
2
)()(2
1
dEw
dEwSw
Am
m
m
m
b
a
b
a
m 2
)(
)()(
Page 13 of 34
Pitch Estimation and Spectral
Envelope
An efficient method for obtaining a good approximation for the periodic transform P ( w ) in this interval is to precompute samples of the Fourier transform of the window w (n) and center it around the harmonic frequency associated with this interval.
For unvoiced frequency intervals, the envelope parameters are estimated by substituting idealized white noise (unity across the band) for |E (a)| in previous formulas which reduces to averaging the original spectrum in each frequency interval.
For unvoiced regions, only the magnitude of A, is estimated since the phase of A, is not required for speech synthesis.
Page 14 of 34
More about pitch estimation
Experimentally, the error E tends to vary slowly
with the pitch period P
the initial estimate is obtained by evaluating the
error for integer pitch periods
Since integer multiples of the correct pitch period
have spectra with harmonics at the correct
frequencies, the error E will be comparable for the
correct pitch period and its integer multiples.
Page 15 of 34
More about pitch estimation
(Cont’d)
Speech
segment
Original
spectrum
Error/Pitch
Original and
Synthetic
P=42.48
Original and
Synthetic
P=42
Page 16 of 34
V/UV Decision
The voiced/unvoiced decision for each harmonic is made by comparing the normalized error over each harmonic of the estimated fundamental to a threshold
When the normalized error over mth harmonic is below the threshold, this frame will be marked as voiced else unvoiced
dSwm
m
b
a
m
m 2
)(2
1
Page 17 of 34
Analysis Algorithm Flowchart
Window
Speech
segment
start
Compute error vs. pitch period
Autocorrelation approach
Select initial pitch period
(Dynamic programming
Pitch tracker)
Refine initial pitch period
(frequency domain approach)
Make V/UV decision for each
Frequency band
Select V/UV spectral
Envelope parameters
For each freq. band
Stop
Page 18 of 34
Speech Synthesis
The voiced signal can be synthesized as the sum of
sinusoidal oscillators with frequencies at the harmonics of
the fundamental and amplitudes set by the spectral
envelope parameters (The time domain method).
The unvoiced signal can be synthesized as the sum of
bandpass filtered white noise
The frequency domain method was selected for
synthesizing the unvoiced portion of the synthetic speech.
Page 19 of 34
Synthesis algorithm block diagram
Separate
Voiced/Unvoiced
Envelope samples
Bank of
Harmonic
oscillators
STFT Replace
envelope
Weighted
Overlap-add
Linear
interpolation
V/UV
Decision
Envelope
samples
Voiced envelope
samples
Unvoiced envelope
samples
Voiced envelope
samples
Unvoiced envelope
samples
Voiced
speech
Unvoiced envelope
samples
White noise
sequence
Unvoiced
speech
Page 20 of 34
MBE Synthesis algorithm
First, the spectral envelope samples are separated into voiced or unvoiced spectral envelope samples depending on whether they are in frequency bands declared voiced or unvoiced
Voiced envelope samples include both magnitude and phase, whereas unvoiced envelope samples include only the magnitude.
Voiced speech is synthesized from the voiced envelope samples by summing the outputs of a band of sinusoidal oscillators running at the harmonics of the fundamental frequency
m
mmv ttAts ))(cos()()(ˆ
Page 21 of 34
MBE Synthesis algorithm (Voiced)
The phase function is determined by an initial
phase and a frequency track as follows:
The frequency track is linearly interpolated
between the mth harmonic of the current frame and
that of the next frame by:
m
0 )(tm
0
0
)()( t
mm dt
)(tm
mmS
tSm
S
tSmt
)(
)()0()( 00
Page 22 of 34
MBE Synthesis algorithm
(Unvoiced)
Unvoiced speech is synthesized from the unvoiced envelope samples by first synthesizing a white noise sequence.
For each frame, the white noise sequence is windowed and an FFT is applied to produce samples of the Fourier transform
In each unvoiced frequency band, the noise transform samples are normalized to have unity magnitude. The unvoiced spectral envelope is constructed by linearly interpolating between the envelope samples |Am(t)|.
The normalized noise transform is multiplied by the spectral envelope to produce the synthetic transform. The synthetic transforms are then used to synthesize unvoiced speech using the weighted overlap-add method.
Page 23 of 34
MBE Synthesis (Cont’d)
The final synthesized speech is generated by summing the
voiced and unvoiced synthesized speech signals
+Synthesized
speech
Voiced
speech
Unvoiced
speech
Page 24 of 34
Bit Allocation
Parameter Bits
Fundamental
Frequency
9
Harmonic
Magnitude
139-94
Harmonic
Phase
0-45
Voiced/Unvoiced
Bits
12
Total 160
Page 25 of 34
Advanced MBE (AMBE)
MBE coding rate at 2400 bps
AMBE coding rate at 1200/2400 bps
Four new features
1. Enhanced V/UV decision
2. Initial pitch detection
3. Refined pitch determination
4. Dual rate coding
Page 26 of 34
Enhanced V/UV decision
divide the whole speech frequency band
into 4 subbands and 2 subbands for 2.4 kbps
and 1.2 kbps respectively.
That is to say only 4 bits and 2 bits are used
to encode U/V decisions for 2.4 kb/s and
1.2 kb/s vocoder respectively.
Page 27 of 34
Initial pitch detection
MBE takes 2 steps to detect the refined initial pitch period
– Spectrum matching technique to find the initial pitch period
– Using DTW-based (Discrete Time Wrapping) technique to
smooth the estimation
Computational complexity is very high
In MBE, a modified three-level center clipped auto-
correlation method is used to detect the initial pitch period,
and also use a simple smoothing method to correct the
pitch errors.
Page 28 of 34
Redefined pitch determination
To find the best pitch the basic method is to compute the error between the original speech spectrum and the shaped voiced speech spectrum by first supposing a pitch period
The supposed pitch of which the spectrum error is minimum is chosen as the last pitch
To reduce the computational complexity, AMBE uses a 256- point FFT to get the speech spectrum, and 5-point window spectrum is used to form the voiced harmonic spectrum.
To get the refined pitch, AMBE perform seven times of spectrum matching process. In every time. AMBE first set a supposed pitch, then shape a harmonic spectrum over the overall frequency band according to the supposed pitch and window spectrum, and an error can be calculated by subtracting the shaped spectrum from speech spectrum. After the seven times of matching process, the refined pitch can easily be determined
Page 29 of 34
Dual rate coding
Parameter 2400 bps 1200 bps
Pitch
quantization
8 6
V/UV decision 4 2
Amplitude
quantization
41 19
total 53 27
Page 30 of 34
Improved MBE (IMBE)
A 2400 bps coder based on MBE
Substantially better than U.S government
standard LPC-10e
The parameters of the MBE speech model :
– the fundamental frequency
– voiced/unvoiced information
– the spectral envelope.
Page 31 of 34
IMBE algorithm
estimate the excitation and system parameters
which minimize the distance between the original
and synthetic speech spectra (analysis by
synthesis)
Once these parameters are estimated,
voiced/unvoiced decisions are made by
comparing the spectral error over a series of
harmonics to a prescribed threshold
Page 33 of 33
IMBE Coding
IMBE offered in 2.4, 4.8 and 8.0 kbps
Analysis and synthesis routines are the same except the bit allocation
The fundamental frequency needs accuracy of about l Hz. and requires about 9 bits per frame.
The V/UV decisions are encoded with one bit per decision.
The remaining bits are allocated to error control and the spectral envelope information.