Technische Universität Darmstadt
Institut für Automatisierungstechnik
Fachgebiet Regelungstheorie und Robotik
Prof. Dr.-Ing. Jürgen Adamy
Landgraf-Georg-Str. 4
D-64283 Darmstadt
Diplomarbeit
Study of blind dereverberation algorithms for real-time
applications
Xavier Domont
Work in cooperation with:
Honda Research Institute Europe GmbH
D-63073 Offenbach/Main
Tutors:
Dr.-Ing. Martin Heckmann (HRI)
Dipl.-Ing. Bjoern Scholing (TUD)
June 2005
Abstract
At Honda Research Institute Europe, an automatic speech recognition system is
being developed for the humanoid robot ASIMO. Reverberation alters the
perception of speech signals emitted in a room and reduces the performance of
automatic speech recognition. Many methods have been proposed over the past
few decades to enhance reverberant speech signals. This diploma thesis studies
the most promising algorithms and discusses whether they can be implemented in
real time for real environments.
The existing methods can be classified into two families:
1. Those that estimate the clean speech signal directly and treat reverberation
as a disturbance.
2. Those that estimate the room impulse response and invert the estimated
system to recover the clean speech.
These two approaches are compared in this thesis, based on Matlab
implementations of selected algorithms. The focus of this comparison is the
suitability of these algorithms for real environments, where speaker and robot
are moving, and a possible real-time implementation.
Kurzfassung
At Honda Research Institute Europe, an automatic speech recognition system is
being developed for the robot ASIMO. Reverberation degrades speech quality and
significantly lowers speech recognition results. Over the last 30 years many
methods have been proposed to enhance speech signals. This diploma thesis
studies the most promising algorithms with regard to real-time capability and
applicability under real conditions.
There are two approaches to solving this problem:
1. The original speech signal can be estimated directly from the observed
signal. The reverberation effect is treated as a disturbance of the clean signal.
2. The room impulse response can be determined and then inverted to
recover the original speech signal.
These two approaches are compared in this diploma thesis. For this purpose,
selected algorithms were implemented. The main point of the comparison was the
investigation of the methods' usability in real environments, in which
speaker and robot are moving.
Contents
1 Introduction 17
1.1 What is blind dereverberation? . . . . . . . . . . . . . . . . . . . 17
1.2 Motivation of this diploma thesis . . . . . . . . . . . . . . . . 18
1.3 Audio processing architecture on ASIMO . . . . . . . . . . . . . . 19
1.3.1 Overview of the peripheral auditory system . . . . . . . . 20
1.3.2 The Gammatone filterbank, a model of the basilar membrane 21
1.4 Overview of this report . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Model of a reverberant signal 25
2.1 Properties of a speech signal . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 Quick overview of the speech production system . . . . . . 25
2.1.2 Speech segments categorization . . . . . . . . . . . . . . . 27
2.1.3 Harmonicity of a speech signal . . . . . . . . . . . . . . . 28
2.1.4 Linear prediction analysis . . . . . . . . . . . . . . . . . . 28
2.2 Room acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 Measurement of real room impulse responses . . . . . . . . 31
2.2.2 Simulation of the room impulse response . . . . . . . . . . 32
2.2.3 Linear Time-Invariant model of the room . . . . . . . . . . 35
2.2.4 Effect on the spectrogram . . . . . . . . . . . . . . . . . . 36
2.3 Inversion of the room impulse response . . . . . . . . . . . . . . . 37
2.3.1 Conditions on the inversion of FIR filters . . . . . . . . . . 37
2.3.2 Are room transfer functions minimum-phase? . . . . . . . 39
2.3.3 Multiple input inverse filter . . . . . . . . . . . . . . . . . 41
3 Enhancement of a speech signal 45
3.1 Harmonicity based dereverberation . . . . . . . . . . . . . . . . . 45
3.1.1 Effect of reverberation on a sweep signal . . . . . . . . . . 46
3.1.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . 47
3.1.3 Dereverberation operator . . . . . . . . . . . . . . . . . . . 48
3.1.4 The HERB method . . . . . . . . . . . . . . . . . . . . . . 52
3.1.5 Test of the method . . . . . . . . . . . . . . . . . . . . . . 53
3.1.6 Discussion of the method . . . . . . . . . . . . . . . . . . . 57
3.2 Dereverberation using LP analysis . . . . . . . . . . . . . . . . . . 59
3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . 59
3.2.2 The kurtosis as measure of the reverberation . . . . . . . . 60
3.2.3 Maximization of the kurtosis . . . . . . . . . . . . . . . . . 62
3.3 Discussion of the method . . . . . . . . . . . . . . . . . . . . . . . 64
4 Equalization of room impulse responses 67
4.1 Principle of the channel estimation . . . . . . . . . . . . . . . . . 67
4.1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.2 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.3 How can this idea be implemented? . . . . . . . . . . . . . 69
4.1.4 Why do the channels have to be coprime? . . . . . . . . . 70
4.1.5 Estimation of the length of the filters . . . . . . . . . . . . 70
4.2 Batch method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 Extraction of the common part . . . . . . . . . . . . . . . 72
4.2.2 Noisy case . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Iterative method . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Choice of the optimization method . . . . . . . . . . . . . 76
4.3.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4 Improvement of the method . . . . . . . . . . . . . . . . . . . . . 77
4.5 Discussion of the channel estimation methods . . . . . . . . . . . 79
5 Conclusion and outlook 81
5.1 Review of the studied methods . . . . . . . . . . . . . . . . . . . 81
5.1.1 Harmonicity-based dereverberation . . . . . . . . . . . . . 81
5.1.2 Linear prediction analysis . . . . . . . . . . . . . . . . . . 81
5.1.3 Channel estimation . . . . . . . . . . . . . . . . . . . . . . 82
5.1.4 Direct comparison of the methods . . . . . . . . . . . . . . 82
5.2 Speech model based method vs. channel estimation . . . . . . . . 83
5.3 What should we decide for ASIMO? . . . . . . . . . . . . . . . . . 83
A Proofs 87
List of Figures
1.1 Different paths of a sound wave in a room . . . . . . . . . . . . . 17
1.2 General model of a reverberant signal . . . . . . . . . . . . . . . . 18
1.3 General shape of a room impulse response . . . . . . . . . . . . . 18
1.4 Peripheral auditory system [1] . . . . . . . . . . . . . . . . . . . . 20
1.5 Impulse and frequency responses of a Gammatone filter . . . . . . 21
1.6 Analysis filters of a Gammatone filter-bank with 16 channels. . . . 22
2.1 General model of a reverberant signal . . . . . . . . . . . . . . . . 25
2.2 Schematic diagram of the human speech production mechanism
(source: [3]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Block diagram of the human speech production (source: [3]) . . . 27
2.4 Discrete-Time speech production model. (a) True Model. (b)
Model to be estimated using LP analysis. (source [3]) . . . . . . . 30
2.5 System to identify (one microphone case) . . . . . . . . . . . . . . 31
2.6 Measurement method . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Example of measured room impulse response . . . . . . . . . . . . 32
2.8 Image Method: Direct path . . . . . . . . . . . . . . . . . . . . . 33
2.9 Image Method: virtual source . . . . . . . . . . . . . . . . . . . . 33
2.10 Image Method: Sound wave reflecting off two walls . . . . . . . . 33
2.11 Room impulse response simulated with the image method . . . . . 35
2.12 Spectrograms of an anechoic signal (left) and the resulting spec-
trogram of its convolution with the impulse response of figure 2.7
(right). These spectrograms were obtained with a Gammatone filter-
bank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.13 Inversion of a filter . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.14 Pole () and zero () of an all-pass filter . . . . . . . . . . . . . . 40
2.15 Energy of a non-minimum phase system (dashed - blue) and the
corresponding minimum-phase system (red). . . . . . . . . . . . . 41
2.16 Multiple input inverse filter . . . . . . . . . . . . . . . . . . . . . 42
3.1 Spectrograms of a sweeping sinusoid and its reverberant signal. . . 46
3.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . . . . . 47
3.3 Diagram of the HERB dereverberation method . . . . . . . . . . . 52
3.4 Top left: Original signal (sweep with harmonics). Top right: Re-
verberant signal. Bottom left: Harmonic estimate with the
Gammatone filter-bank. Bottom right: Harmonic estimate with
Nakatani's harmonic filter. . . . . . . . . . . . . . . . . . . . . . 54
3.5 Spectrogram of the clean and reverberant signal used to test the
reverberation operator. . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Spectrogram of the enhanced signal computed in the frequency
domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7 Impulse response of the dereverberation operator and spectrogram
of the enhanced signal computed in the time domain. . . . . . . . 57
3.8 Effect of the reverberation on the fundamental frequency. . . . . . 58
3.9 Example of platykurtic (left) and leptokurtic (right) distributions.
Both distributions have the same standard deviation . . . . . . . 60
3.10 On the left, extract of the LP residuals of a speech signal. Note
the strong peaks corresponding to the glottal pulses. On the right,
the same signal impaired by reverberations. . . . . . . . . . . . . 61
3.11 Estimation of the probability density functions of the LP residuals
of a clean speech signal (blue) and of a reverberant signal (red).
Both signals have been centered and normalized such that their
means equal 0 and their standard deviations equal 1. . . . . . . . 61
3.12 (a) A single channel time-domain adaptive algorithm for maximiz-
ing the kurtosis of the LP residuals. (b) Equivalent system, which
avoids LP reconstruction artifacts. . . . . . . . . . . . . . . . . . . 63
3.13 Two-channel frequency-domain adaptive algorithm for maximiza-
tion of the kurtosis of the LP residual. . . . . . . . . . . . . . . . 64
3.14 On the left the LP residual of a clean signal. On the right the LP
residual of the resulting dereverberated signal. The kurtosis of the
dereverberated signal is higher than the kurtosis of the original
signal. The resulting signal is strongly distorted. . . . . . . . . . . 65
4.1 SIMO System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Channel identification with overestimated channel orders. . . . . . 71
4.3 Estimated zeros and real zeros for one channel (left). Zeros of all
the estimated channels (right). On the left, 4 estimated zeros stand
alone; they do not correspond to a real zero of the filter. On the
right it can be noticed that these 4 additional zeros are common
to all the estimated channels. . . . . . . . . . . . . . . . . . . . . 72
4.4 Eigenvalues of the matrix Rx in the noiseless case. On the right:
zoom on the smallest eigenvalues. . . . . . . . . . . . . . . . . . . 73
4.5 Left: 4 of the 11 eigenvectors of the null space. Right: common
part of the null space (blue) and real impulse response (red). The
impulse responses of the 2 channels are concatenated and 10 zeros
(corresponding to the over-estimation of the order) were added. . 74
4.6 Eigenvalues of the correlation matrix in the noisy case. The vari-
ance of the noise is equal to 10^-10 on the left and 10^-6 on the
right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Iterative estimation of the channel impulse responses using two
microphones. On the left the estimated zeros (blue) of one of the
channels are compared with their real values (red). On the right
the remaining impulse response after inversion of the system is
drawn (blue); in the ideal case it should be a Dirac (red) . . . . . 77
4.8 Iterative estimation of the channel impulse responses using 5 mi-
crophones. On the left the estimated zeros (blue) of one of the
channels are compared with their real values (red). On the right
the remaining impulse response after inversion of the system is
drawn (blue); in the ideal case it should be a Dirac (red) . . . . . 78
4.9 Comparison of the position of the zeros when the convolution and
the subsampling are performed in a different order. . . . . . . . . 79
Acronyms
FFT Fast Fourier Transform
DFT Discrete Fourier Transform
STFT Short-Time Fourier Transform
SISO Single-Input Single-Output
SIMO Single-Input Multiple-Output
MIMO Multiple-Input Multiple-Output
ROC Region of Convergence
LTI Linear Time-Invariant
FIR Finite Impulse Response
IIR Infinite Impulse Response
MINT Multiple input inverse filter
Chapter 1
Introduction
1.1 What is blind dereverberation?
Figure 1.1: Different paths of a sound wave in a room
Acoustic signals emitted in a room reflect off the walls and other objects
(see figure 1.1). The direct signal and all the reflected sound waves arrive at
the microphone or listener with different delays and sum up. This effect is called
reverberation. Sometimes the term echo is used instead of reverberation; however,
an echo generally implies a distinct, delayed version of a sound. In a room, the
delayed sound waves arrive within such a short period of time that we do not perceive
each reflection as a copy of the original sound. Even though we can't discern
every reflection, we still hear the effect of the entire series of reflections.
Whereas a human being with normal hearing can cope quite well with these
distortions, reverberation impairs speech intelligibility in devices
such as hands-free conference telephones and automatic speech recognition systems.
The diagram in figure 1.2 shows how the system can be modeled. The effect of
the room is considered as a filter with impulse response h(t) whose input is the
clean speech signal s(t) and the output is the observed reverberant signal x(t).
s(t) h(t) x(t)
Figure 1.2: General model of a reverberant signal
Figure 1.3 shows the general shape of a room impulse response. The reverbera-
tion corrupts the speech by blurring its temporal structure. However, due to the
spectral continuity of speech, the early reflections mainly increase the intensity of
the reverberant speech, whereas the later ones are deleterious to speech quality
and intelligibility.
Figure 1.3: General shape of a room impulse response
The aim of blind dereverberation is to recover the clean signal s(t) from the
observed reverberant signal x(t). The term blind means that neither the clean
signal nor the impulse response of the room is known before the processing.
1.2 Motivation of this diploma thesis
This diploma thesis was written in cooperation with Honda Research Institute
(HRI) Europe. One of the important projects of HRI is the development of the
humanoid robot ASIMO (Advanced Step in Innovative MObility). At HRI Europe
the CARL (Child-like Acquisition of Representation and Language) Group of Dr.
Frank Joublin aims at developing a system of automatic speech recognition (ASR)
and production for ASIMO. As the distortions caused by reverberation degrade
the performance of ASR, we will investigate whether a signal processing method
can be found to dereverberate the signals heard by ASIMO.
During the past decades many dereverberation methods have been proposed.
However, no standard method has yet emerged and this research topic is still
very active. The aim of this diploma thesis is to establish the state of the art
of the existing methods and then to evaluate whether some of them could be
integrated into the audio processing system of ASIMO.
The important requirements for ASIMO are, firstly, that the dereverberation
runs in real time and, secondly, that the system adapts to a real and
changing environment. This means that the algorithms have to adapt to
the room conditions faster than these conditions change. As both ASIMO and
the speaker can be moving, the effects of the room can change very rapidly.
To perform this study we selected, from the recently proposed methods, those
which seemed the most promising. The selected methods were then implemented
in MATLAB in order to determine their advantages and drawbacks.
For the implementation we tried, as often as possible, to use the existing audio
processing architecture of ASIMO, described in section 1.3.
In addition to analyzing their performance, we will discuss whether the studied
methods, while enhancing the perception of speech, alter signal characteristics
that are used by subsequent audio processing on ASIMO. In particular,
the phase spectrum is essential for the localization of a speech source.
1.3 Audio processing architecture on ASIMO
The audio processing system at HRI uses a Gammatone filterbank. This type
of filterbank is widely used in audio signal processing as it simulates the human
auditory system.
Figure 1.4: Peripheral auditory system [1]
1.3.1 Overview of the peripheral auditory system
The aim of the peripheral auditory system (see figure 1.4) is to transform a sound
(which is actually a pressure variation in air) into nerve impulses. These impulses
are then conveyed by the auditory nerve to the brain stem. The nerve cells in
the brain stem act as relay stations, eventually conveying nerve impulses to the
auditory cortex.
The outer ear is composed of the pinna (the visible part) and the auditory canal
or meatus. The pinna significantly modifies the incoming sound in a way that
depends on the angle of incidence of the sound relative to the head. This is
important for sound localization. Sound travels down the meatus and causes
the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted
through the middle ear by three small bones, the ossicles, to a membrane-covered
opening in the bony wall of the spiral-shaped structure of the inner ear, the
cochlea.
The cochlea is shaped like the spiral shell of a snail. It is filled with almost
incompressible fluids and is divided along its length by two membranes, Reissner's
membrane and the basilar membrane. The motion of the basilar membrane
in response to a sound is of primary interest.
1.3.2 The Gammatone filterbank, a model of the basilar
membrane
A point on the basilar membrane is characterized by its impulse response. The
Gammatone function approximates physiologically recorded impulse responses:
g(t) = t^(n-1) exp(-2πbt) cos(2πf0·t + φ)    (1.1)

where t is the time (t ≥ 0), b determines the duration of the impulse response,
n is the order of the filter and determines the slope of the skirts of the filter,
φ is a phase and f0 is the center frequency.
Figure 1.5: Impulse and frequency responses of a Gammatone filter
It can be observed from figure 1.5 that the Gammatone filter is a bandpass with
its center frequency at f0. Its bandwidth depends on b.
To simulate the whole basilar membrane, a bank of Gammatone filters can be
used. Each filter channel represents the frequency response of one point on the
basilar membrane.
The parameters of the Gammatone filters are determined from psychoacoustic
measurements. Glasberg and Moore [2] summarized the equivalent rectangular
bandwidth (ERB) of the human auditory filter. The ERB of a filter is defined as
the width of a rectangular filter whose height equals the peak gain of the filter
and which passes the same total power as the filter.
The relation between the bandwidth and the center frequency of the Gammatone
filters is given by:
ERB = 24.7 + 0.108 f0. (1.2)
Figure 1.6 shows the transfer functions of a bank of 16 filters with center
frequencies spaced between 50 Hz and 8 kHz. As the spectral resolution of the
basilar membrane decreases with increasing frequency, the center frequencies of
the Gammatone filters are not linearly distributed and their bandwidths increase
with the center frequency according to equation (1.2). We can also note that the
pass bands overlap.
Figure 1.6: Analysis filters of a Gammatone filter-bank with 16 channels.
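Equations (1.1) and (1.2) can be combined into a short numerical sketch. This is a minimal illustration, not HRI's implementation; the function name gammatone_ir, the sampling rate, and the ERB-to-b scaling factor 1.019 (a common convention from the literature, not stated in the text) are assumptions.

```python
import numpy as np

def gammatone_ir(f0, fs=16000, n=4, duration=0.05, phase=0.0):
    """Sample the Gammatone impulse response of eq. (1.1).

    f0 is the center frequency in Hz. The bandwidth parameter b is
    derived from the ERB of eq. (1.2); the 1.019 scaling is an assumed
    convention, not given in the text.
    """
    erb = 24.7 + 0.108 * f0                      # eq. (1.2)
    b = 1.019 * erb                              # assumed ERB-to-b mapping
    t = np.arange(int(duration * fs)) / fs       # time axis, t >= 0
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f0 * t + phase)
    return g / np.max(np.abs(g))                 # normalize the peak to 1

# bandpass impulse response centered at 1 kHz
ir = gammatone_ir(1000.0)
```

Evaluating gammatone_ir for 16 center frequencies spaced on an ERB scale between 50 Hz and 8 kHz would yield a filterbank like the one in figure 1.6.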
1.4 Overview of this report
The existing blind dereverberation methods can be classified into two families.
1. We can estimate the clean speech signal directly, or the parameters and
excitation of an appropriate parametric model, as a missing-data problem,
treating reverberation as a disturbance.
2. We can model the effect of the room by a filter. The coefficients of this filter
are estimated by treating the clean speech as a disturbance. The observed
signal is then deconvolved with the estimated filter to recover the clean speech.
In chapter 2 we will discuss how speech signals and room impulse responses can
be modeled. This modeling step is essential to determine what, in the observed
signal, is due to the speech and what is an effect of the room.
In chapter 3 two methods which use the properties of speech to enhance re-
verberant signals will be studied. These methods consider the room effect
as a disturbance and try to restore the characteristics of the speech that the
reverberation has altered.
In chapter 4 the possibility of estimating the room impulse responses will be
discussed. This approach is very interesting since, once the effect of the room on
the signal is known, it becomes possible to invert this effect and recover the clean
signal.
Chapter 2
Model of a reverberant signal
In terms of signal processing a room can be seen as a filter. The original (anechoic)
signal s(n) goes through a filter h(n) and gives the reverberant signal x(n), see
figure 2.1. In the case of blind dereverberation the input signal s(n) and the room
transfer function h(n) are unknown.
s(n) h(n) x(n)
Figure 2.1: General model of a reverberant signal
The task of dereverberation is to find an estimate ŝ(n) of s(n), given the output
x(n) of the system. In order to make this task feasible, a model of the speech
signal and/or a model of the room is required.
In section 2.1 different ways to model a speech signal will be discussed. In section
2.2 the effects of the room on the speech signal will be investigated. Finally,
section 2.3 will discuss the possibility of inverting the effects of the room.
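The LTI model of figure 2.1 can be sketched numerically. In this toy example the sparse impulse response and all parameter values are invented for illustration; the point is only the convolution itself.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
s = rng.standard_normal(fs)            # 1 s stand-in for the clean signal s(n)

# toy room impulse response h(n): direct path plus decaying reflections
h = np.zeros(2000)
h[0] = 1.0                             # direct path
taps = rng.integers(1, len(h), 40)     # invented reflection delays
h[taps] = rng.standard_normal(40) * np.exp(-taps / 500)

x = np.convolve(s, h)                  # reverberant observation x(n) = (h * s)(n)
```

Blind dereverberation has to recover s(n) from x(n) alone, with both s(n) and h(n) unknown.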
2.1 Properties of a speech signal
2.1.1 Quick overview of the speech production system
Figure 2.2: Schematic diagram of the human speech production mechanism (source: [3])
The principal components of the human speech production system are (see figure
2.2) the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal
cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). The
pharyngeal and oral cavities are usually grouped and referred to as the vocal tract.
It is useful to think of speech production in terms of an acoustic filtering oper-
ation. The pharyngeal, oral and nasal cavities comprise the main acoustic filter.
This filter is excited by the organs below it, and is loaded at its main output by
a radiation impedance due to the lips. The articulators are used to change the
properties of the system, its form of excitation, and its output loading over time.
Figure 2.3 shows a simplified acoustic model illustrating these ideas.
-
2.1 Properties of a speech signal 27
Figure 2.3: Block diagram of the human speech production (source: [3])
2.1.2 Speech segments categorization
The spectral characteristics of the speech wave are non-stationary, since the phys-
ical system changes rapidly over time. Speech can therefore be divided into sound
segments which present similar properties over a short period of time. Without
going into further details the main way to classify a speech sound is with the type
of excitation.
The two elementary types of excitation are voiced and unvoiced. There are actu-
ally a few other type of excitation (mixed, plosive, whisper, silence) but they can
be seen just as a combination of the two elementary types.
Voiced sounds are produced by forcing air through the glottis, an opening between
the vocal folds. The vocal cords vibrate in an oscillatory fashion and, therefore, the
produced speech signal is quasi-periodic; its period is called the fundamental period
T0, and the fundamental frequency is defined as F0 = 1/T0.
Unvoiced sounds are generated by forming a constriction at some point along the
vocal tract, and forcing air through the constriction to produce turbulence. The
produced speech signal is a noise-like sound.
Typical human speech communication is limited to a bandwidth of 7-8 kHz. The
main part of the energy is contained in voiced segments.
2.1.3 Harmonicity of a speech signal
A speech signal s(n) can be modeled [4] by using the sum of a harmonic signal
sh(n), derived from a glottal vibration, and a non-harmonic signal sn(n), such as
fricatives and plosives, as
s(n) = sh(n) + sn(n). (2.1)
The harmonic part of the signal is defined by its voiced durations and their fun-
damental frequencies (F0). A voiced duration is the time during which the vocal
cords vibrate to generate a harmonic signal and the fundamental frequency refers
to the frequency of the fundamental component of the signal. Each harmonic
component has a frequency which corresponds to F0 or its multiples.
It can be assumed that F0 is constant within a short time; therefore the harmonic
signal sh(n) can be modeled over a time frame of length T by a sum of sinu-
soidal components whose frequencies coincide with the fundamental frequency of
the signal and its multiples:

sh(n) = Σ_{k=1}^{N} A_k cos( 2π k F0 (n − nc)/fs + φ_k )   for |n − nc| < T/2    (2.2)

where A_k and φ_k are the amplitude and the phase of the k-th harmonic compo-
nent, nc the time index of the center of the frame and fs the sampling rate.
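Equation (2.2) can be evaluated directly; the following is a minimal sketch, where the function name and all parameter values are made up for illustration.

```python
import numpy as np

def harmonic_frame(A, phi, F0, n_c, T, fs=16000):
    """Harmonic signal s_h(n) over one frame of length T, eq. (2.2).

    A and phi hold the amplitudes A_k and phases phi_k of the N
    harmonics; the factor 2*pi/fs converts k*F0 (in Hz) to radians
    per sample.
    """
    n = np.arange(n_c - T // 2, n_c + T // 2)    # |n - n_c| < T/2
    sh = np.zeros(len(n))
    for k, (Ak, phik) in enumerate(zip(A, phi), start=1):
        sh += Ak * np.cos(2 * np.pi * k * F0 * (n - n_c) / fs + phik)
    return sh

# three harmonics of a 200 Hz fundamental over a 512-sample frame
frame = harmonic_frame(A=[1.0, 0.5, 0.25], phi=[0.0, 0.0, 0.0],
                       F0=200.0, n_c=0, T=512)
```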
2.1.4 Linear prediction analysis
A widely used model of speech signals is given by Linear Prediction (LP) analysis.
This model consists of separating the speech signal into an excitation signal and
a model of the vocal tract.
During a stationary frame of speech the model would ideally be characterized by
a pole-zero transfer function of the form

Θ(z) = Θ0 · (1 + Σ_{i=1}^{L} b(i) z^(−i)) / (1 − Σ_{i=1}^{R} a(i) z^(−i))    (2.3)

which is driven by an excitation sequence

e(n) = Σ_{q=−∞}^{+∞} δ(n − qP)    (voiced case)
e(n) = zero-mean, unit-variance, uncorrelated noise    (unvoiced case)    (2.4)

where δ(n) is the discrete Dirac impulse

δ(n) = 1 if n = 0, 0 else.    (2.5)
The principle of the LP analysis is to approximate this pole-zero system with an
all-pole system

Θ̂(z) = 1 / (1 − Σ_{i=1}^{R} a(i) z^(−i))    (2.6)

which can be easily estimated by solving a system of linear equations. The
schematics of the true speech model and of its LP approximation are shown in
figure 2.4.
A magnitude spectrum, but not a phase spectrum¹, can be exactly modeled with
stable poles. This means that LP analysis models the true magnitude
spectrum of the speech, which is, in most cases, enough for speech perception.
For example, a listener moving from room to room within a house is able to clearly
understand the speech of a stationary talker, even if the phase relationships among
the components are changing dramatically [3]. However, for some applications like
the localization of the talker, the temporal dynamics of the sound are essential
and LP analysis should be used with care.
¹Actually the LP model has a minimum-phase characteristic. This notion will be discussed in more detail in section 2.3.
Figure 2.4: Discrete-Time speech production model. (a) True Model. (b) Model tobe estimated using LP analysis. (source [3])
To understand the name Linear Prediction, it is helpful to consider the LP
analysis in the time domain. An all-pole transfer function corresponds to an
autoregressive (AR) model, i.e. the signal s(n) can be expressed as a linear com-
bination of its L past samples:

s(n) = Σ_{k=1}^{L} a_k s(n − k) + e(n)    (2.7)

where a_k are the LP coefficients. The excitation signal e(n) can be seen, in terms
of system identification, as the prediction error signal, also called the LP residual.
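A minimal sketch of eq. (2.7): the coefficients a_k are obtained here from the autocorrelation (Yule-Walker) normal equations, one common way to solve the LP problem (the text does not prescribe a particular solver), and the residual e(n) follows by inverse filtering. The function name and default order are illustrative.

```python
import numpy as np

def lp_residual(s, order=12):
    """Estimate the LP coefficients a_k of eq. (2.7) and the residual e(n).

    Uses the autocorrelation method: solve the normal equations
    R a = r, where R is built from the signal's autocorrelation.
    """
    s = np.asarray(s, dtype=float)
    r = np.correlate(s, s, mode="full")[len(s) - 1:]      # autocorrelation, lags >= 0
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                # LP coefficients a_k
    e = s.copy()                                          # e(n) = s(n) - sum_k a_k s(n-k)
    for k in range(1, order + 1):
        e[k:] -= a[k - 1] * s[:-k]
    return a, e
```

Applied to an AR signal, lp_residual recovers the generating coefficients and a residual close to the driving excitation.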
2.2 Room acoustics
This section first presents how room impulse responses can be measured
(2.2.1) or simulated (2.2.2). The goal is to obtain a set of real and artificial
impulse responses. These data will be useful in the next chapters to test the
dereverberation methods.
Then (2.2.3) we will discuss whether a general model of a room can be found.
Finally (2.2.4), we will use time-frequency analysis to briefly study the effects of
reverberation on speech signals.
2.2.1 Measurement of real room impulse responses
In order to get real impulse responses corresponding to a normal room, we per-
formed measurements in the office of the CARL group at HRI. A sound signal
was played into the room by a loudspeaker. Simultaneously the sound wave
was recorded using a model of ASIMO's head equipped with two microphones.
s(n) h(n) x(n)
Figure 2.5: System to identify (one microphone case)
For each microphone both the input s(n) and the output x(n) of the system are
known; the impulse response h(n) can then be computed by inverting the
convolution

x(n) = h(n) ∗ s(n).    (2.8)

However, the measurement is generally altered by additive noise. To improve the
measurement it is therefore better to use auto- and cross-correlation functions.
Equation (2.8) becomes
Rsx(n) = h(n) ∗ Rs(n)    (2.9)

where Rs(n) is the autocorrelation function of s(n) and Rsx(n) the cross-
correlation function of s(n) and x(n). Equation (2.9) is less sensitive to noise.
Moreover, if s(n) is white noise, its autocorrelation function is equal to δ(n).
Then

h(n) = Rsx(n), when Rs(n) = δ(n).    (2.10)
The impulse response of the room is thus the cross-correlation of the white
noise played by the loudspeaker with the signal recorded by the microphone.
For our measurement we used 1 second of Gaussian white noise as the room
excitation signal.
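The white-noise identification of eq. (2.10) can be sketched with a toy impulse response; the tap positions and gains below are invented, and only the cross-correlation step mirrors the measurement procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16000
s = rng.standard_normal(N)             # 1 s of white noise: Rs(n) ~ delta(n)

h = np.zeros(400)                      # toy room impulse response
h[0], h[120], h[300] = 1.0, 0.4, 0.15  # direct path and two reflections

x = np.convolve(s, h)                  # recorded signal, eq. (2.8)

# eq. (2.10): for white-noise excitation, h(n) ~ Rsx(n) for n >= 0
Rsx = np.correlate(x, s, mode="full")[N - 1:] / N
h_est = Rsx[:len(h)]
```

With 1 s of noise the per-tap estimation error is on the order of 1/√N, which is why a sufficiently long excitation is needed in practice.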
Two different sound cards were used to play and record the signals. In order to
easily synchronize the input and the output of the system, the excitation signal
was directly recorded on another channel of the capture sound card in addition
to the signals of the microphones (see figure 2.6).
Figure 2.6: Measurement method
Moreover, this method makes it possible to compensate for possible effects of the
sound cards. As we only had a stereo sound card at our disposal, the recordings
for the left and right ears had to be performed separately. Figure 2.7 shows one
of the measured impulse responses.
Figure 2.7: Example of measured room impulse response
2.2.2 Simulation of the room impulse response
A technique to simulate the impulse response of a room is the image method,
proposed in 1979 by Allen [5]. It sums the direct path with all the reflections off
walls or objects.
An example in [6] shows the principle of this method. Figure 2.8 shows the direct
path from a sound source () to a microphone (). Another part of the sound wave
Figure 2.8: Image Method: Direct path
is reflected off a wall and then impinges upon the microphone. This reflected
sound seems to come directly from a virtual source located in an adjacent virtual
room, symmetrical to the original room with respect to the wall (see figure 2.9).
In this figure the black line represents the real path of the signal, whereas the
blue line is its perceived path.
Figure 2.9: Image Method: virtual source
This process can be extended to sound waves that are reflected more than once off the walls (see figure 2.10), and can be continued the same way in
Figure 2.10: Image Method: Sound wave reflecting off two walls
three dimensions, yielding an infinite number of virtual sources.
The virtual sources make it easy to compute the distance the sound wave travels to reach the microphone.
Considering a rectangular room with dimensions (Lx, Ly, Lz), the coordinate vector ri,j,k = (xi, yj, zk)^T, (i, j, k) ∈ Z³, of a virtual source is:
xi = (−1)^i xsource + (i + (1 − (−1)^i)/2) Lx
yj = (−1)^j ysource + (j + (1 − (−1)^j)/2) Ly
zk = (−1)^k zsource + (k + (1 − (−1)^k)/2) Lz
(2.11)
where rsource = (xsource, ysource, zsource)T is the coordinate-vector of the source.
The distance from the virtual source to the microphone is
di,j,k = ‖ri,j,k − rm‖ = √((xi − xm)² + (yj − ym)² + (zk − zm)²) (2.12)
where rm = (xm, ym, zm)T is the coordinate-vector of the microphone.
The sound wave corresponding to the (i, j, k) virtual source will arrive at the microphone with a delay
τi,j,k = di,j,k / c (2.13)
where c is the speed of sound. The impulse response of the room is the sum of the delayed impulses corresponding to the signals arriving from each virtual source:
h(t) = Σ_{(i,j,k) ∈ Z³} hi,j,k δ(t − di,j,k/c) (2.14)
The magnitude hi,j,k of the unit impulse is influenced by the distance the sound wave travels to get from the source to the microphone,
bi,j,k = 1 / (4π di,j,k²) (2.15)
and by the number of reflections off the walls,
ci,j,k = β^(|i|+|j|+|k|) (2.16)
where β < 1 is the wall reflection coefficient (which is, in this simple model, considered to be the same for all the walls). Combining both factors gives
hi,j,k = bi,j,k ci,j,k (2.17)
Although the impulse response of the room should contain an infinite number of delayed impulses, corresponding to an infinity of virtual sources, the magnitudes hi,j,k become very small for large |i|, |j| and |k|. The impulse response then has a finite time duration:
h(t) = Σ_{i=−n}^{n} Σ_{j=−n}^{n} Σ_{k=−n}^{n} hi,j,k δ(t − di,j,k/c) (2.18)
Figure 2.11: Room impulse response simulated with the image method
Figure 2.11 shows a simulated room impulse response obtained with the image method. Reverberant sounds generated using such an impulse response sound like signals recorded in real conditions. However, phenomena such as the phase inversion of the sound wave when it reflects off a wall, or the presence of objects and people, are ignored by this model.
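Equations (2.11)-(2.18) can be sketched as follows. The function name, parameter values and the truncation bound are our choices for illustration, not part of the thesis code:

```python
import numpy as np

def image_method_rir(room, src, mic, beta=0.9, n=8, fs=16000, c=343.0):
    """Room impulse response by the image method, eqs. (2.11)-(2.18),
    with the sums over (i, j, k) truncated to [-n, n]."""
    room, src, mic = (np.asarray(v, float) for v in (room, src, mic))
    idx = np.arange(-n, n + 1)
    sign = (-1.0) ** idx
    # Virtual-source coordinates along each axis, eq. (2.11).
    axes = [sign * src[a] + (idx + (1 - sign) / 2) * room[a] for a in range(3)]
    # A generous bound on the longest path fixes the filter length.
    d_max = np.sqrt(sum(((n + 2) * room[a]) ** 2 for a in range(3)))
    h = np.zeros(int(fs * d_max / c) + 1)
    for i, xi in zip(idx, axes[0]):
        for j, yj in zip(idx, axes[1]):
            for k, zk in zip(idx, axes[2]):
                d = np.sqrt((xi - mic[0]) ** 2 + (yj - mic[1]) ** 2
                            + (zk - mic[2]) ** 2)          # eq. (2.12)
                tap = int(round(fs * d / c))               # delay, eq. (2.13)
                if tap < len(h):
                    # Spreading loss times beta per reflection, eqs. (2.15)-(2.17).
                    h[tap] += beta ** (abs(i) + abs(j) + abs(k)) / (4 * np.pi * d ** 2)
    return h

h = image_method_rir((5.0, 4.0, 3.0), (1.0, 1.0, 1.5), (3.0, 2.0, 1.5), beta=0.8, n=4)
```

The first nonzero tap corresponds to the direct path; everything after it is the reverberation modeled by the virtual sources.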
2.2.3 Linear Time-Invariant model of the room
The general shape of the measured and the simulated room impulse responses
corresponds to the one described in figure 1.3. However, when the conditions in the room change (movement of the talker and/or listener), the coefficients of the impulse response fluctuate considerably, especially in the late reverberation tail. As explained in chapter 1, the distortions in the speech signal are mostly due to the late reverberation. Therefore, a model based on the image method, where the room impulse response is modeled by a sum of delayed impulses hi,j,k δ(t − τi,j,k), is not practical for system identification. Actually, the only general properties which can be retained from a room impulse
response are its linearity, its causality (there is no reverberation before the be-
ginning of the signal) and its general exponential decay structure.
In real environments, the talker or the listener may move, so the effect of the room is time-variant. However, if the computation is fast enough, the system can be considered Linear Time-Invariant (LTI).
Moreover, because of the exponential decay, the impulse response of the room has
a finite duration. The room is then modeled by a Finite Impulse Response (FIR)
filter. The relation between the input s(n) and the output x(n) is given by the
convolution
x(n) = h(n) * s(n) = Σ_{k=0}^{L−1} h(k) s(n − k), (2.19)
where L is the length of the impulse response (also called order of the channel).
Actually, the FIR model of the room impulse response is very practical as the
transfer function of the system, i.e. the z-transform of its impulse response,
H(z) = Σ_{k=0}^{L−1} h(k) z^{−k} (2.20)
is defined for all finite z and is a polynomial.
2.2.4 Effect on the spectrogram
It is interesting to study the effect of reverberation on the spectrogram2 of a
speech signal. Figure 2.12 shows the spectrograms of the same speech signal
without and with reverberation.
The problem can be explained in the time-frequency domain as follows: given the spectrogram of the original signal at time frame t and frequency f, S(t, f), what is the influence of the room on the spectrogram of the reverberant signal at time-frequency bin (t, f), X(t, f)?
The value of X at time frame t is only affected by bins of the original signal that lie between time frames t and t − D, where D depends on the reverberation
2 Instead of the spectrogram, the Gammatone filter-bank described in chapter 1 can be used. Contrary to a normal spectrogram, computed using a short-time Fourier transform, the Gammatone filterbank gives an output for each time sample (no subsampling) and the center frequencies of the filters are not linearly distributed, which is closer to the human auditory system.
Figure 2.12: Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filter-bank.
time of the room, i.e. the time for the sound to die away to a level of 60 dB
below its original level. In the frequency domain, the reverberation slightly affects the adjacent channels. According to [7], this effect has the form of a Laplace distribution.
2.3 Inversion of the room impulse response
In this section the theoretical possibility of a perfect dereverberation will be
discussed. The issue can be formulated in the following way: Assuming that the
room impulse response is known, is it possible to remove its effect and to get an
accurate estimate of the original speech signal?.
2.3.1 Conditions on the inversion of FIR filters
The inverse g(n) of a filter h(n) (see figure 2.13) satisfies
ŝ(n) = g(n) * x(n) = g(n) * h(n) * s(n) = s(n) (2.21)
which can be simplified to
g(n) * h(n) = δ(n) (2.22)
Figure 2.13: Inversion of a filter
The inversion problem can be studied with the help of the z-transform. The z-
transform of h(n), also called transfer function of the filter, is defined as the power
series
H(z) = Σ_{k=−∞}^{+∞} h(k) z^{−k} (2.23)
It was shown in section 2.2 that the room can be considered as a FIR filter. Then
its z-transform is a polynomial
H(z) = h0 + h1 z^{−1} + ··· + h_{L−1} z^{−L+1} (2.24)
where L is the length of the room impulse response. Factoring,
H(z) = h0 z^{−L+1} (z − z1)(z − z2) ··· (z − z_{L−1}) (2.25)
H(z) has L − 1 finite zeros at z = z1, z2, ..., z_{L−1}. The transfer function of the inverse filter is then the rational function
G(z) = 1 / (h0 + h1 z^{−1} + ··· + h_{L−1} z^{−L+1}) (2.26)
The Infinite Impulse Response (IIR) filter G(z) is causal and stable if and only if all its poles are inside the unit circle (|z| = 1). As the poles of G(z) are the zeros of H(z), this means that all the zeros of H(z) must be inside the unit circle. Such a system is called minimum-phase.
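Whether a given FIR response is minimum-phase can be checked numerically by factoring H(z). The coefficient vectors below are made-up examples, not measured room responses:

```python
import numpy as np

def is_minimum_phase(h):
    """True if all zeros of H(z) = h[0] + h[1] z^-1 + ... + h[L-1] z^-(L-1)
    lie strictly inside the unit circle."""
    zeros = np.roots(h)          # zeros of the polynomial h[0] z^(L-1) + ... + h[L-1]
    return bool(np.all(np.abs(zeros) < 1.0))

mp = is_minimum_phase([1.0, 0.6, 0.3, 0.4])   # all three zeros inside: True
nmp = is_minimum_phase([1.0, -2.0])           # zero at z = 2: False
```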
In order to understand this problem, we can observe what happens if we want to
invert a simple non minimum-phase system.
Given the FIR filter h(n), defined in the time domain by
h(n) = δ(n) − 2δ(n − 1)
its transfer function is
H(z) = 1 − 2z^{−1}
The Region of Convergence (ROC) of this z-transform is |z| > 0. As this system has a zero at z = 2, it is non-minimum-phase.
The transfer function of its inverse system is
G(z) = 1 / (1 − 2z^{−1}) = z / (z − 2)
G(z) has a zero at the origin and a pole at z = 2. In this case there are two possible regions of convergence and hence two possible inverse systems. If the ROC of G(z) is taken as |z| > 2, then
g(n) = 2^n u(n)
where u(n) is the unit step function
u(n) = 1 if n ≥ 0, 0 else. (2.27)
This is the impulse response of a causal and unstable system. On the other hand, if the ROC is assumed to be |z| < 2, the impulse response of the inverse system is
g(n) = −2^n u(−n − 1).
In this case the inverse system is anti-causal and stable.
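The anti-causal expansion can be verified numerically. A sketch, truncating the stable anti-causal inverse at n = −30 (the truncation length is our choice):

```python
import numpy as np

N = 30
n = np.arange(-N, 0)                  # time indices n = -N, ..., -1
g = -(2.0 ** n)                       # g(n) = -2^n u(-n-1), truncated at -N
h = np.array([1.0, -2.0])             # h(n) = delta(n) - 2 delta(n-1)

y = np.convolve(g, h)                 # supported on n = -N, ..., 0
# y approximates delta(n): its last sample (n = 0) equals 1, the interior
# samples cancel exactly, and truncation leaves only an error of magnitude
# 2^-N at the oldest sample.
```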
2.3.2 Are room transfer functions minimum-phase ?
Any system can be represented as the cascade of a minimum-phase system with
an all-pass system [8]. An all-pass system is defined as a system for which the
magnitude of the transfer function is unity for all frequencies. Thus if Hap(z) denotes the transfer function of an all-pass system, |Hap(e^{jω})| = 1 for all ω. The poles and zeros of an all-pass system occur at conjugate reciprocal locations (see figure 2.14).
Consider a non-minimum-phase system H(z) with, for example, one zero outside the unit circle at z = 1/z0, |z0| < 1, while the remainder of its poles and zeros are inside the unit circle. Then H(z) can be expressed as
H(z) = H1(z)(z^{−1} − z0) (2.28)
Figure 2.14: Pole and zero of an all-pass filter
where H1(z) is minimum-phase. Equivalently, equation (2.28) can be written as
H(z) = H1(z)(z^{−1} − z0) · (1 − z0* z^{−1})/(1 − z0* z^{−1})
     = (H1(z)(1 − z0* z^{−1})) · (z^{−1} − z0)/(1 − z0* z^{−1})
     = Hmin(z) (z^{−1} − z0)/(1 − z0* z^{−1})
     = Hmin(z) Hap(z)
(2.29)
where Hmin(z) is minimum-phase and Hap(z) is all-pass. Any pole or zero of H(z) that is inside the unit circle also appears in Hmin(z). Any pole or zero of H(z) that is outside the unit circle appears in Hmin(z) at the conjugate reciprocal location.
The equivalent minimum-phase system has the same magnitude spectrum as the
original system.
It is interesting to compare the impulse response h(n) of an FIR system with the
impulse response hmin(n) of its equivalent minimum-phase system.
Figure 2.15 shows that the energy of hmin(n) is more concentrated around the origin. This property can be formalized with the following equation3:
Σ_{n=0}^{m} |h(n)|² ≤ Σ_{n=0}^{m} |hmin(n)|², for all m ∈ N (2.30)
The energy of both sequences is the same since the magnitude of their Fourier
3A proof of this property is outlined in [8] page 371.
Figure 2.15: Energy of a non-minimum phase system (dashed - blue) and the corre-sponding minimum-phase system (red).
transforms is the same (by Parseval's theorem). This means that equality occurs in (2.30) as m → ∞.
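Property (2.30) can be illustrated with the two-tap example from the previous section: h(n) = δ(n) − 2δ(n − 1) has its zero at z = 2, and reflecting that zero to z = 1/2 gives the minimum-phase equivalent with the same magnitude response. The coefficients below follow from that reflection:

```python
import numpy as np

h = np.array([1.0, -2.0])       # zero at z = 2: non-minimum-phase
h_min = np.array([-2.0, 1.0])   # zero reflected to z = 1/2: minimum-phase

# Same magnitude spectrum, hence same total energy (Parseval):
w = np.linspace(0, np.pi, 256)
mag = np.abs(1.0 - 2.0 * np.exp(-1j * w))
mag_min = np.abs(-2.0 + np.exp(-1j * w))

# Partial energies, eq. (2.30): the minimum-phase sequence leads at every m.
partial = np.cumsum(h ** 2)          # [1, 5]
partial_min = np.cumsum(h_min ** 2)  # [4, 5]
```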
Room transfer functions often have more energy in the reverberant component of the room impulse response than in the component corresponding to the direct path (see figure 1.3). This implies that room transfer functions are often non-minimum-phase. A causal and stable inverse of a room impulse response is therefore impossible to find in general. The non-causality problem can be solved by introducing a delay, i.e. a delayed inverse filter is computed instead. However, the delay generally has to be quite long, which is not satisfactory for real-time applications.
2.3.3 Multiple input inverse filter
As room transfer functions are most of the time non-minimum-phase, a perfect dereverberation cannot be achieved with a single microphone. It is possible to find a delayed inverse filter, but this solution is not really adequate for real-time processing.
However it is possible to find the exact inverse of a point in the room by using multiple microphones4, if the room transfer functions corresponding to the different sensors are coprime, i.e. they do not share common zeros [9].
This property is actually a direct application of Bezout's theorem on polynomials. Given M FIR filters with transfer functions Hi(z), i = 1, ..., M: if the Hi(z) are coprime polynomials, then there exist Gi(z), i = 1, ..., M, such that
H1(z)G1(z) + H2(z)G2(z) + ... + HM(z)GM(z) = 1 (2.31)
where the orders of the Gi(z) are smaller than or equal to the highest order of the Hi(z). Figure 2.16 shows how equation (2.31) can be used to invert the M channels simultaneously. This method is called the Multiple-input/output INverse Theorem (MINT).
Figure 2.16: Multiple input inverse filter
By using more than one microphone, the issue that room transfer functions are
non minimum-phase is bypassed. Moreover the inverse filters are simple FIR
filters, which can be computed by solving the linear system
d = [H1^T H2^T ··· HM^T] g = H g (2.32)
where d = [1, 0, ..., 0]^T is a vector of length 2L − 1, g is the concatenation of the vectors gi = [gi(0), ..., gi(L − 1)]^T corresponding to the inverse filters
g = [g1^T ... gM^T]^T (2.33)
and the Hi are the L × (2L − 1) Sylvester matrices corresponding to the polynomials Hi(z):
Hi = [ hi(0) ··· hi(L−1)   0  ···  0
          ⋱        ⋱          ⋱
        0 ··· 0   hi(0)  ···  hi(L−1) ]  (2.34)
4 Such a system is called Single-Input Multiple-Output (SIMO).
A Sylvester matrix makes it possible to compute a convolution (or a polynomial multiplication) with a matrix multiplication. Given two signals x(n) and y(n), of lengths Lx and Ly respectively, the convolution z(n) of x(n) and y(n) has Lx + Ly − 1 samples and can be written in vector form as
z = X^T y = Y^T x, (2.35)
where X (resp. Y) is the Ly × (Lx + Ly − 1) (resp. Lx × (Lx + Ly − 1)) Sylvester matrix of x(n) (resp. y(n)), and y (resp. x) is the signal y(n) (resp. x(n)) written as a column vector of length Ly (resp. Lx).
The linear system of equation (2.32) can be solved by computing the Moore-
Penrose pseudo-inverse5 of the matrix H, H+. The inverse filter is then computed
by
g = H+d. (2.36)
As d = [1, 0, . . . , 0], g is actually the first column of H+.
The linear system of equation (2.32) has infinitely many solutions. The pseudo-inverse method gives the solution with the smallest 2-norm.
5 The Moore-Penrose pseudo-inverse is a matrix H+ of the same dimensions as H^T satisfying four conditions: HH+H = H, H+HH+ = H+, HH+ and H+H are Hermitian.
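The MINT computation of equations (2.32)-(2.36) can be sketched with two random length-L channels; random coefficients are coprime with probability one, and the helper name is ours:

```python
import numpy as np

def sylvester(h, L):
    """The L x (2L-1) Sylvester matrix of eq. (2.34) for a length-L filter h."""
    S = np.zeros((L, 2 * L - 1))
    for r in range(L):
        S[r, r:r + L] = h
    return S

L = 8
rng = np.random.default_rng(1)
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)  # two-microphone channels

H = np.hstack([sylvester(h1, L).T, sylvester(h2, L).T])  # (2L-1) x 2L, eq. (2.32)
d = np.zeros(2 * L - 1)
d[0] = 1.0                                               # target: delta(n)

g = np.linalg.pinv(H) @ d           # minimum-norm inverse filters, eq. (2.36)
g1, g2 = g[:L], g[L:]

out = np.convolve(h1, g1) + np.convolve(h2, g2)          # should equal delta(n)
```

Because the system is underdetermined (2L unknowns, 2L − 1 equations), the pseudo-inverse picks the minimum-norm pair of FIR inverse filters, and the summed convolutions reproduce the unit impulse exactly.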
Chapter 3
Enhancement of a speech signal
Reverberation produces a distortion that alters the intelligibility of speech. A
possible approach to the dereverberation problem is to consider the general prop-
erties of a speech signal, which are degraded by the reverberation.
A simple way to improve the reverberant signal is, for example, to detect re-
verberation tails between words. By removing, or attenuating, these parts which
only contain reverberation, the listening comfort is slightly improved. However,
this method, which is used in hearing aids, does not remove the distortion which
alters the words.
The two methods presented in this chapter use, more or less explicitly, the harmonicity property of the voiced segments of a speech signal in order to try and recover the clean signal. In section 3.1 the approach of Nakatani [4], using an adaptive harmonic filter, will be described. In section 3.2 an adaptive algorithm working on the LP residual of the speech signal will be presented.
3.1 Harmonicity based dereverberation
Nakatani et al. propose in [4] an interesting single-microphone dereverberation method called Harmonicity based dEReverBeration (HERB). This method is based on the harmonicity model of speech described in section 2.1.
The principle is to estimate a dereverberation operator using the harmonic parts
of the speech signal. This operator, initially designed for the harmonic parts, is
expected to work on the non-harmonic parts as well.
Figure 3.1: Spectrograms of a sweeping sinusoid and its reverberant signal.
The performance of this method, presented in [4] and [10], is impressive. In this section we will begin by describing the principle of this dereverberation process. Then we will discuss its applicability on ASIMO.
3.1.1 Effect of reverberation on a sweep signal
In order to understand the basic idea of the HERB method, it is useful to observe the effect of reverberation on a sweeping sinusoid. In discrete time the sinusoidal sweep is defined by
s(n) = A sin(2π((k/2)(n/fs)² + fstart (n/fs))) (3.1)
where A is the amplitude, fstart is the frequency at t = 0, fs is the sampling frequency, and k a constant. Its instantaneous frequency varies linearly in time:
f(n) = k (n/fs) + fstart. (3.2)
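Equations (3.1) and (3.2) can be generated directly. The constants below reproduce a 100-4000 Hz half-second sweep like the one in figure 3.1; the choice of k is ours:

```python
import numpy as np

fs = 16000
A, f_start, dur = 1.0, 100.0, 0.5
k = (4000.0 - f_start) / dur          # rate so that f reaches 4000 Hz at t = dur
t = np.arange(int(fs * dur)) / fs
s = A * np.sin(2 * np.pi * (k / 2 * t ** 2 + f_start * t))   # eq. (3.1)
f_inst = k * t + f_start                                     # eq. (3.2)
```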
Figure 3.1 (left) shows the spectrogram of a half-second-long discrete signal whose frequency sweeps from 100 to 4000 Hz. This spectrogram is obtained using a Gammatone filter-bank; the frequency scale is therefore not linear (see (1.2)). This signal is then convolved with the impulse response shown in figure 2.7. The resulting spectrogram is shown in figure 3.1 (right). We can observe that the sinusoidal component corresponding to the original signal can be clearly identified. In each frequency band, the energy corresponding to this
direct signal appears first and is followed by a reverberation tail. At a given
point in time, the energy of the signal is maximum for the frequency corresponding
to the direct signal.
The idea of the HERB method is to track the instantaneous frequency f(l) of a dominant sinusoidal component in the reverberant signal at each short time frame. The amplitude A(l) and phase φ(l) of this dominant sinusoid are extracted and used to synthesize the signal
ŝ(n) = Σ_l g(n − nl) A(l) cos(2π f(l) (n/fs) + φ(l)), (3.3)
where g(n − nl) is a window function for overlap-add synthesis and nl is the time index centered in frame l.
3.1.2 Adaptive harmonic filtering
Although a sweep signal contains only one dominant sinusoid, a harmonic signal contains several sinusoidal components whose frequencies correspond to its fundamental frequency F0 and its multiples (cf. section 2.1). The aim of a harmonic filter is to enhance these components. Since the fundamental frequency of a speech signal changes over time, the properties of the filter have to be adaptively modified according to F0 (see figure 3.2).
Figure 3.2: Adaptive harmonic filtering
A simple approach to harmonic filtering is the comb filter, defined as 1 + z^{−τ}, where τ is the period to be enhanced. The method proposed by Nakatani in [4] is instead to filter the signal by synthesizing a harmonic sound as follows:
1. The fundamental frequency of the observed signal is estimated at each time
frame. If the time frame is short enough this fundamental frequency can be
considered constant.
2. The amplitudes and phases of the individual harmonic components are estimated using the short time Fourier transform (STFT), X(l,m), of x(n):
X(l,m) = Σ_n g1(n − nl) x(n) e^{−j2π m (n − nl)/M}, (3.4)
A_{k,l} = |X(l, [kF0,l])|, (3.5)
φ_{k,l} = ∠X(l, [kF0,l]), (3.6)
where l is the index of the time frame, nl is the time index corresponding to the center of the frame, m is the index of the frequency bin, M is the number of points used for the Discrete Fourier Transform (DFT), A_{k,l} and φ_{k,l} are respectively the estimated amplitude and phase of the k-th harmonic component, F0,l is the fundamental frequency of the time frame, g1(n) is an analysis window function and [·] discretizes a continuous frequency into the index of the nearest frequency bin.
3. The output of the filter, x̂(n), is obtained by adding sinusoids
x̂_l(n) = Σ_k A_{k,l} cos(2π k F0,l (n − nl)/fs + φ_{k,l}), (3.7)
and combining them over succeeding frames
x̂(n) = Σ_l g2(n − (n1 + lT)) x̂_l(n), (3.8)
where x̂_l(n) is a synthesized harmonic sound corresponding to the time frame l, T is the frame shift in samples and g2(n) is a synthesis window function.
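Steps 2 and 3 can be sketched for a single frame with a known F0. The frame length and test signal are our assumptions, and F0 is chosen to fall exactly on a DFT bin so that spectral leakage does not blur the example:

```python
import numpy as np

fs, N = 16000, 1024
F0 = 250.0                                   # step 1: assumed already estimated
rng = np.random.default_rng(0)
t = np.arange(N) / fs

clean = np.sin(2 * np.pi * F0 * t) + 0.5 * np.sin(2 * np.pi * 2 * F0 * t)
frame = clean + 0.1 * rng.standard_normal(N)      # observed frame x(n)

win = np.hanning(N)
X = np.fft.rfft(frame * win)                      # windowed DFT, eq. (3.4)

synth = np.zeros(N)
for k in range(1, int((fs / 2) // F0)):           # harmonics below Nyquist
    m = int(round(k * F0 * N / fs))               # nearest DFT bin, [k F0]
    A_kl = 2.0 * np.abs(X[m]) / win.sum()         # amplitude, eq. (3.5)
    phi_kl = np.angle(X[m])                       # phase, eq. (3.6)
    synth += A_kl * np.cos(2 * np.pi * (m * fs / N) * t + phi_kl)  # eq. (3.7)
```

The resynthesized frame closely matches the clean harmonic part while the additive noise is strongly attenuated; a full implementation would repeat this per frame and overlap-add with g2(n) as in eq. (3.8).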
Actually the harmonic filter itself is easy to implement. The main issue is to find
an accurate estimate of the fundamental frequency of the signal even in case of
strong reverberation.
3.1.3 Dereverberation operator
Harmonic case
The dereverberation operation is computed in the frequency domain using the
short time Fourier transform. Let X(l,m) be the STFT of a reverberant signal.
X(l,m) can be represented as the product of the source signal, S(l,m), and the room transfer function, H(m), which is assumed to be time-invariant (cf. section 2.2). This transfer function can be divided into two functions, D(m) and R(m). The former corresponds to the direct signal, D(m)S(l,m), and the latter to the reverberant part, R(m)S(l,m):
X(l,m) = H(m)S(l,m) = D(m)S(l,m) + R(m)S(l,m) (3.9)
The aim of the dereverberation operator is to estimate the direct signal X̂(l,m) = D(m)S(l,m). It can be obtained by subtracting the reverberant part R(m)S(l,m) from equation (3.9), or by finding the inverse filter W(m) such that
W(m) = D(m)/H(m) (3.10)
Then
X̂(l,m) = W(m)X(l,m) = (D(m)/H(m)) (H(m)S(l,m)) = D(m)S(l,m). (3.11)
The basic idea of the HERB method is the following: if S(l,m) is a harmonic signal, the direct signal contained in X(l,m) can be obtained using an adaptive harmonic filter. At each time frame l an inverse filter W0(l,m) is computed in the frequency domain using the output X̂(l,m) of the harmonic filter:
W0(l,m) = X̂(l,m) / X(l,m) (3.12)
As the signal X̂(l,m) is supposed to contain only the direct part of the signal X(l,m), this filter will remove the reverberation on this time frame.
As the effect of the room is supposed to be constant, the dereverberation operator W(m) is estimated by averaging the inverse filters computed at the different time frames:
W(m) = E{W0(l,m)} (3.13)
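The averaging in (3.13) can be illustrated with a toy frequency-domain simulation in which the harmonic filter is assumed to return the direct part plus a small random error; all quantities below are synthetic assumptions, not thesis data:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n_frames = 64, 500
# Synthetic room transfer function, magnitude kept away from zero.
H = (1.0 + 0.5 * np.cos(2 * np.pi * np.arange(M) / M)) \
    * np.exp(1j * rng.uniform(0, 2 * np.pi, M))
D = np.ones(M)                              # direct-path transfer function (toy)

W0 = np.empty((n_frames, M), dtype=complex)
for l in range(n_frames):
    S = np.exp(1j * rng.uniform(0, 2 * np.pi, M))     # unit-power source frame
    X = H * S                                         # reverberant observation
    err = 0.05 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    X_hat = D * S + err       # harmonic-filter output: direct part + small error
    W0[l] = X_hat / X                                 # eq. (3.12)

W = W0.mean(axis=0)                                   # eq. (3.13)
# As the number of frames grows, W converges to the inverse filter D/H.
```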
General case
This process can be applied to a speech signal S(l,m) by rewriting equation (2.1) in the frequency domain:
S(l,m) = Sh(l,m) + Sn(l,m) (3.14)
where Sh(l,m) is the harmonic part and Sn(l,m) is the non-harmonic part.
The observed reverberant signal X(l,m) is rewritten as
X(l,m) = D(m)Sh(l,m) + (R(m)Sh(l,m) +H(m)Sn(l,m)) (3.15)
where H(m) is the transfer function of the room, H(m) = D(m) +R(m).
The component D(m)Sh(l,m) can be approximately extracted from X(l,m) with an adaptive harmonic filter. This approximated direct signal X̂(l,m) can be modeled as:
X̂(l,m) = D(m)Sh(l,m) + (R̃h(l,m) + H̃n(l,m)) (3.16)
where R̃h(l,m) is a part of the reverberation of Sh(l,m) and H̃n(l,m) is a part of the direct signal and reverberation of Sn(l,m). It can be assumed, if the fundamental frequency is perfectly estimated, that the only estimation errors on X̂(l,m) are caused by these two unexpected remaining parts.
The dereverberation operator is computed as in the harmonic case (3.12):
W(m) = E{W0(l,m)} = E{X̂(l,m)/X(l,m)} (3.17)
Using equation (3.16), the operator over a time frame, W0(l,m), can be rewritten as (the time and frequency indexes have been removed for clarity)
W0 = X̂/X = (D Sh + R̃h + H̃n) / (H S) (3.18)
   = ((D + R̃h/Sh) Sh) / (H (Sh + Sn)) + ((H̃n/Sn) Sn) / (H (Sh + Sn)) (3.19)
   = (D + R̃h/Sh)/H · 1/(1 + Sn/Sh) + (H̃n/Sn)/H · 1/(1 + Sh/Sn) (3.20)
It can be proven (see appendix A) that the expected value of a function 1/(1 + z), where z is a complex random variable, is equal to the probability that |z| < 1, provided that the phase of z is uniformly distributed, the phase of z and its absolute value are statistically independent, and |z| ≠ 1. Then, under the following conditions [10]:
1. The phase of Sn(l,m) and a joint event composed of Sh(l,m), R̃h(l,m) and |Sn(l,m)| are independent,
2. The phase of Sh(l,m) and a joint event composed of |Sh(l,m)|, H̃n(l,m) and Sn(l,m) are independent,
3. The phases of Sh(l,m) and Sn(l,m) are uniformly distributed within [0, 2π),
4. |Sh(l,m)| ≠ |Sn(l,m)|,
equation (3.17) can be derived as
W(m) ≈ (D(m) + R̃(m))/H(m) · P{|Sh(l,m)| > |Sn(l,m)|}, (3.21)
where
R̃(m) = Et{R̃h(l,m)/Sh(l,m)}_{|Sh(l,m)|>|Sn(l,m)|}, (3.22)
P{·} is a probability function, and Et{·}_A represents an average function over time frames under a condition where A holds.
In the derivation of W(m), the term corresponding to the non-harmonic part is neglected. Actually it is expected that, when |Sn(l,m)| > |Sh(l,m)|, i.e. the signal is non-harmonic over the time frame, the output of the harmonic filter is equal to zero, so that
Et{H̃n(l,m)/Sn(l,m)}_{|Sn(l,m)|>|Sh(l,m)|} ≈ 0. (3.23)
Given equation (3.21), W(m) is expected to remove reverberation not only from the harmonic components of the speech signal but also from the non-harmonic ones. It approximates the inverse filter D(m)/H(m), except for a remaining reverberation due to R̃(m).
The enhanced signal is then expected to be the sum of the direct signal and
a reduced reverberation part. However, because of the term P{|Sh(l,m)| > |Sn(l,m)|}, the gain of the filter becomes zero in frequency regions where harmonic components were not included during the estimation of the dereverberation operator.
In addition, even in frequency regions in which harmonic components were present
during the estimation phase, the filter gain is expected to decrease as the frequency increases. The reason is that in a speech signal the energy of the higher-order harmonic components is smaller than the energy of the sinusoidal components at the fundamental frequency and its small multiples.
3.1.4 The HERB method
Figure 3.3 summarizes the dereverberation process using the HERB method.
Figure 3.3: Diagram of the HERB dereverberation method
The system consists of the following sub-procedures:
1. Estimation of the fundamental frequency and the voiced durations.
2. Harmonic filtering.
3. Estimation of the dereverberation operator.
4. Dereverberation of the signal using this operator.
The estimation of the fundamental frequency is the most important point of the
method. F0 must be robustly estimated in order to expect a good dereverberation.
This task is difficult in the presence of strong reverberation. Nakatani [10] proposes
to repeat the dereverberation process of figure 3.3 three times:
STEP 1: The dereverberation process is applied on the observed reverberant
signal.
STEP 2: The dereverberated signal obtained in the first step is used to estimate
the fundamental frequency. This new F0 is used to control the adaptive
harmonic filter, but the input of this filter is the original reverberant signal.
STEP 3: The dereverberated signal obtained in the second step is now used as the new reverberant signal, which is enhanced by reapplying the whole dereverberation process to it.
According to [10], the quality of the dereverberated signal improves at each step. The second step can be repeated several times to further improve the estimation of the fundamental frequency. By contrast, the quality of the signal does not always improve when the third step is repeated; this is due to an accumulation of errors in the dereverberation filters.
3.1.5 Test of the method
The performance of this method, described in [4] and [10], is impressive. We implemented this method to test two points:
1. Can this method easily be adapted to replace the STFT by a Gammatone filterbank?
2. Can the process work in real time and in real environments?
For HRI the interest of this method resides in the fact that the fundamental
frequency of the signal is already computed for other processes. A pitch tracking
algorithm has already been developed at HRI [11] and can readily be used to estimate the fundamental frequency of the signals. As the computation of the
fundamental frequency seems to be the critical point of the HERB algorithm, we
expected a lot from this method.
Harmonic Filter
The aim of this section is to compare the harmonic filter proposed by Nakatani in
[4] and an implementation of the harmonic filter on the Gammatone filter-bank.
The implementation of the harmonic filter on the filter-bank is simple. The outputs of the Gammatone filter-bank are N signals corresponding to the different frequency channels of the cochlea response. Knowing the fundamental frequency, the frequency channels corresponding to F0 and its multiples are determined. At a time sample t the signals of these channels are added.
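The channel-selection step can be sketched as follows. The ERB-rate spacing of the center frequencies is the standard Glasberg-Moore scale, which we assume matches the filter-bank of chapter 1; all names here are ours:

```python
import numpy as np

def erb_center_freqs(n_channels, f_lo=50.0, f_hi=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return erb_inv(np.linspace(erb(f_lo), erb(f_hi), n_channels))

def harmonic_channels(fc, F0, fs=16000):
    """Indices of the filter-bank channels closest to F0 and its multiples."""
    harmonics = np.arange(1, int((fs / 2) // F0) + 1) * F0
    return np.unique([int(np.argmin(np.abs(fc - f))) for f in harmonics])

fc = erb_center_freqs(128)
chans = harmonic_channels(fc, F0=200.0)
# The harmonic estimate at sample t is the sum, over the channels in `chans`,
# of the filter-bank outputs at that sample.
```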
The resulting spectrograms (see figure 3.4) show that both implementations of the harmonic filter give similar results.
Figure 3.4: up-left: Original signal (sweep with harmonics). up-right: Reverberantsignal. bottom-left: Harmonic estimate with the Gammatone filter-bank. bottom-right: Harmonic estimate with Nakatanis harmonic filter.
As expected, the adaptive harmonic filter can be implemented without problems on the Gammatone filterbank. Moreover, the assumption that the fundamental frequency is constant over a short time frame is no longer required, as our filterbank performs the harmonic filtering without the subsampling imposed by the STFT used in [4]. The improvement of the method using time warping, proposed in [12], is unnecessary with a Gammatone implementation.
Dereverberation operator
In order to compute the dereverberation operator a training sequence is required.
During this adaptation phase the room impulse response must not change.
In the simulation the training sequence was composed of several sweeping sinusoids with their harmonics, similar to the one shown in figure 3.4. The operator is computed using a short-time Fourier transform. The restriction on the time window is quite strong: it has to be long enough to contain a whole word (sweep) including the complete reverberation tail.
It is important to note at this point that the window of the short-time Fourier transform for the harmonic filter and for the estimation of the dereverberation operator cannot be the same. For the harmonic filtering the analysis window must be as small as possible in order to respect the assumption that the fundamental frequency of the signal is constant. On the other hand, for the estimation of the filter W(m), the time window of the STFT must be several seconds long.
It is also assumed that, during the adaptation phase of the dereverberation filter, the pauses between the words are long enough; otherwise the reverberation tail of a word would alter the following word. Respecting these conditions, the dereverberation operator can be computed.
In order to estimate the performance of the algorithm, 500 random harmonic
signals (sweeps with harmonics) of 0.5 s each are used as training data. These
signals are convolved with the room impulse response shown in figure 2.7. As the
exact fundamental frequencies of these signals are known, a good estimate of the dereverberation operator can be expected. This dereverberation operator is
then used to enhance a real speech signal convolved with the same room impulse
response (see figure 3.5).
Figure 3.6 shows the spectrogram of the enhanced signal. It is here important
to note that the speech signal used for the test contains only one word. As the
time window of the STFT used to compute the dereverberation operator is long
enough to contain a whole word, the enhanced signal is obtained by multiplying
the FFT of the whole reverberant signal with the dereverberation operator.
The dereverberation works relatively well. However, we can see on the spectrogram that the dereverberation filter is not causal. This is not surprising, as we explained in section 2.3 that room impulse responses are in general non-minimum-phase. Because of the non-causality the beginning of the signal is altered. It can
Figure 3.5: Spectrogram of the clean and reverberant signal used to test the rever-beration operator.
Figure 3.6: Spectrogram of the enhanced signal computed in the frequency domain.
be a problem, as this part of the signal normally contains no reverberation and can, therefore, be valuable for further processing.
We tried to express the dereverberation filter in the time domain, using the inverse Fourier transform of the operator computed in the frequency domain. In this case the dereverberation no longer works (see figure 3.7).
Figure 3.7: Impulse response of the dereverberation operator and spectrogram of theenhanced signal computed in the time domain.
3.1.6 Discussion of the method
This method can theoretically remove very strong reverberation from a signal.
Moreover, as the dereverberation operator is computed in the frequency domain,
the computational cost is quite low.
However, the amount of required adaptation data is prohibitive (about 500 words).
In addition, the pauses between the words have to be very long in the case of long
reverberation. It is therefore impossible to use this method in real-time applications.
It would be quite bothersome to have to speak with ASIMO for several minutes before
it begins to understand what is said, and that assumes that neither the robot nor
the speaker moves during this time.
Given the restrictions on the adaptation phase of the algorithm, this method can-
not be applied in real environments. In addition, a remark can be made on
the use of harmonic filtering to remove reverberation, even for a highly harmonic
signal. The harmonic filter manages quite well to remove the reverberation when
the fundamental frequency changes within the word. However, a large part of the
reverberation remains if the fundamental frequency changes too slowly. Figure 3.8
shows the effect of the reverberation on the fundamental frequency in such a
case.
Figure 3.8: Effect of the reverberation on the fundamental frequency.
In this example the component corresponding to the fundamental frequency is
strongly disturbed by the reverberation. Increasing the frequency resolution (the
number of channels of the filter-bank) solves this problem, but more than 1000
channels are required to reach a frequency selectivity as good as that of the human
auditory system. Due to the resulting computational load this is not feasible
for real-time applications.
3.2 Dereverberation using LP analysis
The dereverberation using the harmonicity of the signal requires too much training
data. Therefore, in this section, another dereverberation method will be discussed.
This method uses the autoregressive (AR) model of speech signals. Several meth-
ods using linear prediction (LP) analysis have been proposed [13] [14] [15].
3.2.1 Problem formulation
In section 2.1.4 it was explained that a speech signal s(n) can be expressed as
a linear combination of its L past sample values. The clean and the reverberant
speech become, respectively,

s(n) = \sum_{k=1}^{L} a_k s(n-k) + e_s(n),   (3.24)

x(n) = \sum_{k=1}^{L} b_k x(n-k) + e_x(n),   (3.25)

where a_k and b_k are the LP coefficients and e_s(n) and e_x(n) the LP residual signals
(or prediction errors).
The important assumption on which dereverberation methods using LP analysis
are based is that the LP coefficients are unaffected by the reverberation:

b_k = a_k, \quad \forall k \in [1, L] \cap \mathbb{N}.   (3.26)

Actually this assumption holds only in a spatially averaged sense [16], i.e. using
several microphones:

E\{b_k\} = a_k, \quad \forall k \in [1, L] \cap \mathbb{N}.   (3.27)
Consequently, the dereverberation process tries to enhance the LP residual of
the signal, whose structure is well known (see 2.1.4). The aim of the dereverber-
ation methods using LP analysis is to improve the LP residual signal such that
\hat{e}(n) \approx e_s(n). A clean speech estimate is then obtained by

\hat{s}(n) = \sum_{k=1}^{L} b_k \hat{s}(n-k) + \hat{e}(n),   (3.28)

i.e. the estimated LP coefficients obtained by linear prediction analysis are used
to synthesize a signal out of the enhanced excitation signal \hat{e}(n).
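As a concrete reference for (3.24)–(3.28), here is a minimal sketch of the LP analysis–synthesis chain in Python (standing in for the thesis' Matlab implementation): coefficients via the autocorrelation (Yule–Walker) method, the residual by inverse filtering, and resynthesis with the all-pole filter. The round trip is exact by construction:

```python
import numpy as np

def lp_coefficients(x, p):
    """Estimate LP coefficients a_1..a_p by the autocorrelation method."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def lp_residual(x, a):
    """Prediction error e(n) = x(n) - sum_k a_k x(n-k)."""
    pred = np.convolve(x, np.r_[0.0, a])[:len(x)]
    return x - pred

def lp_synthesis(e, a):
    """Run the all-pole filter: s(n) = sum_k a_k s(n-k) + e(n)."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s[n] += ak * s[n - k]
        s[n] += e[n]
    return s

# Round trip: analysis followed by synthesis recovers the signal exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
a = lp_coefficients(x, 12)
e = lp_residual(x, a)
print(np.allclose(lp_synthesis(e, a), x))   # → True
```

The dereverberation methods of this section operate between these two steps: the residual e is replaced by an enhanced residual before resynthesis.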
3.2.2 The kurtosis as a measure of the reverberation
Figure 3.9: Example of platykurtic (left) and leptokurtic (right) distributions. Both distributions have the same standard deviation.
Gillespie shows in [14] that the kurtosis of the LP residual is a valid reverberation
metric. The kurtosis \beta_2 of a random signal x(n) is the degree of peakedness of its
distribution, defined as the fourth central moment \mu_4 normalized by the fourth
power of the standard deviation (or the square of the variance):

\beta_2 = \frac{\mu_4}{\sigma^4} = \frac{E\{(x(n) - \mu)^4\}}{\left(E\{(x(n) - \mu)^2\}\right)^2}   (3.29)

where \mu = E\{x(n)\} is the mean value of x(n). As the kurtosis of a normal distribution
is equal to 3, the kurtosis excess, denoted \gamma_2 and defined by

\gamma_2 = \frac{\mu_4}{\sigma^4} - 3,   (3.30)

is often used. A distribution with a high peak (\gamma_2 > 0) is called leptokurtic, a flat-
topped curve (\gamma_2 < 0) is called platykurtic, and the normal distribution is called
mesokurtic.
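A quick numerical check of definitions (3.29) and (3.30), with numpy standing in for the thesis' Matlab: Gaussian samples are mesokurtic, Laplacian samples (peaked with heavy tails) are leptokurtic, and uniform samples (flat-topped) are platykurtic:

```python
import numpy as np

def kurtosis_excess(x):
    """gamma_2 = mu_4 / sigma^4 - 3, cf. (3.29) and (3.30)."""
    mu = np.mean(x)
    m2 = np.mean((x - mu) ** 2)
    m4 = np.mean((x - mu) ** 4)
    return m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
gauss = rng.standard_normal(200_000)         # mesokurtic: gamma_2 near 0
laplace = rng.laplace(size=200_000)          # leptokurtic: gamma_2 near 3
uniform = rng.uniform(-1, 1, size=200_000)   # platykurtic: gamma_2 near -1.2

print(kurtosis_excess(gauss),
      kurtosis_excess(laplace),
      kurtosis_excess(uniform))
```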
Figure 3.9 illustrates the kurtosis measure. The distribution on the right is more
peaked at the center, so we might conclude that it has a lower standard deviation.
On the other hand, it has thicker tails, which usually indicates a higher standard
deviation. If the effect of the peakedness exactly offsets that of the thick tails,
the two distributions have the same standard deviation.
For clean voiced speech, the LP residuals have strong peaks corresponding to the
glottal pulses (see figure 3.10), whereas for reverberated speech such peaks are
spread in time. In figure 3.11, the probability density functions of the LP residuals
of a clean signal and of the convolution of this signal with the room impulse
response computed in the CARL Group's office (see figure 2.7) are estimated.
Both signals have been centered and normalized such that their means equal 0
and their standard deviations equal 1. The probability density functions are
estimated by computing the histograms of the signals and normalizing them by
the number of samples in the signals. The distribution of the LP residuals of
the clean signal (blue) has a higher peakedness (\gamma_2 = 42). The effect of the
room reduces this peakedness in the reverberant signal (red), whose LP residuals
have a kurtosis of 10. By maximizing the kurtosis of the LP residuals we can
therefore expect to improve the quality of the observed signal.

Figure 3.10: On the left, an extract of the LP residuals of a speech signal; note the strong peaks corresponding to the glottal pulses. On the right, the same signal impaired by reverberation.

Figure 3.11: Estimated probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means equal 0 and their standard deviations equal 1.
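The peakedness contrast described above can be reproduced on a toy signal. A sparse pulse train stands in for the glottal excitation, a first-order AR filter for the vocal tract, and a random decaying filter for the room; all names and values here are illustrative, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
e = np.zeros(8000)
e[::100] = 1.0                     # glottal-pulse-like excitation
s = np.zeros_like(e)
s[0] = e[0]
for n in range(1, len(e)):         # toy AR(1) "vocal tract": s(n) = 0.9 s(n-1) + e(n)
    s[n] = 0.9 * s[n - 1] + e[n]
room = rng.standard_normal(200) * 0.97 ** np.arange(200)
x = np.convolve(s, room)[:len(s)]  # reverberant speech

def lp1_residual(v):
    b = (v[1:] @ v[:-1]) / (v[:-1] @ v[:-1])   # order-1 LP coefficient
    return v[1:] - b * v[:-1]

def kurtosis(v):
    return np.mean(v ** 4) / np.mean(v ** 2) ** 2

# The clean residual keeps its sharp pulses; the reverberant one is smeared.
print(kurtosis(lp1_residual(s)) > kurtosis(lp1_residual(x)))
```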
3.2.3 Maximization of the kurtosis
In the time-domain
In order to enhance the reverberant signal x(n), an adaptive filter can be built
which maximizes the kurtosis of its LP residual \tilde{x}(n). Given an L-tap adaptive
filter \mathbf{h}(n) at time n, the output of this filter is \tilde{y}(n) = \mathbf{h}^T(n)\,\tilde{\mathbf{x}}(n), where
\tilde{\mathbf{x}}(n) = [\tilde{x}(n-L+1), \ldots, \tilde{x}(n-1), \tilde{x}(n)]^T. An LP synthesis filter then yields
y(n), the final processed signal. The adaptation of \mathbf{h}(n) is similar to that of a
traditional Least-Mean-Square (LMS) adaptive filter [17], except that the optimized
value is a feedback function f(n), corresponding to the gradient of the kurtosis.
Figure 3.12 (a) shows a diagram of the maximization system. The problem of
this algorithm is the LP reconstruction artifacts. However, the system is linear
and the order of the filters can be arbitrarily exchanged: \mathbf{h}(n) can therefore be
computed from \tilde{x}(n) but applied directly to x(n) (see figure 3.12 (b)).
A gradient method can be used to optimize the kurtosis. The gradient of the
kurtosis is given by

\frac{\partial \gamma_2}{\partial \mathbf{h}} = \frac{4\left(E\{\tilde{y}^2\}\, E\{\tilde{y}^3 \tilde{\mathbf{x}}\} - E\{\tilde{y}^4\}\, E\{\tilde{y}\, \tilde{\mathbf{x}}\}\right)}{E^3\{\tilde{y}^2\}}.   (3.31)

This gradient can be approximated by

\frac{\partial \gamma_2}{\partial \mathbf{h}} \approx \left( \frac{4\left(E\{\tilde{y}^2\}\, \tilde{y}^2 - E\{\tilde{y}^4\}\right) \tilde{y}}{E^3\{\tilde{y}^2\}} \right) \tilde{\mathbf{x}} = f(n)\, \tilde{\mathbf{x}}(n),   (3.32)

where f(n) is the feedback function used to control the filter updates. For continu-
ous adaptation, the expected values E\{\tilde{y}^2\} and E\{\tilde{y}^4\} are estimated recursively
by

E\{\tilde{y}^2(n)\} = \lambda\, E\{\tilde{y}^2(n-1)\} + (1 - \lambda)\, \tilde{y}^2(n),   (3.33)

E\{\tilde{y}^4(n)\} = \lambda\, E\{\tilde{y}^4(n-1)\} + (1 - \lambda)\, \tilde{y}^4(n),   (3.34)
Figure 3.12: (a) A single-channel time-domain adaptive algorithm for maximizing the kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction artifacts.
where \lambda < 1 controls the smoothness of the estimates.
The update equation of the filter is given by

\mathbf{h}(n+1) = \mathbf{h}(n) + \mu\, f(n)\, \tilde{\mathbf{x}}(n),   (3.35)

where \mu controls the speed of adaptation.
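A compact sketch of the kurtosis maximization on a toy residual. For testability, the gradient (3.31) is here evaluated in batch form over the whole signal rather than sample by sample as in (3.35), the filter is renormalized after each step (the kurtosis is scale-invariant), and a simple backtracking rule keeps the ascent monotone; these are implementation choices for the sketch, not part of the original algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LP residual: a sparse glottal-pulse train smeared by a decaying filter.
e = np.zeros(4000)
e[::100] = 1.0
x = np.convolve(e, 0.7 ** np.arange(20))[:len(e)]

L = 10
# Rows are the vectors [x(n), x(n-1), ..., x(n-L+1)].
X = np.stack([x[n - np.arange(L)] for n in range(L - 1, len(x))])

def kurtosis(v):
    return np.mean(v ** 4) / np.mean(v ** 2) ** 2

def kurtosis_gradient(h):
    # Batch version of (3.31): 4 (E{y^2} E{y^3 x} - E{y^4} E{y x}) / E^3{y^2}
    y = X @ h
    Ey2, Ey4 = np.mean(y ** 2), np.mean(y ** 4)
    Ey3x = (y ** 3) @ X / len(y)
    Eyx = y @ X / len(y)
    return 4.0 * (Ey2 * Ey3x - Ey4 * Eyx) / Ey2 ** 3

h = np.zeros(L)
h[0] = 1.0                                   # start from the identity filter
k0 = kurtosis(X @ h)
for _ in range(100):                         # gradient ascent with backtracking
    g = kurtosis_gradient(h)
    step = 1e-3
    while step > 1e-12:
        h_new = h + step * g
        h_new /= np.linalg.norm(h_new)
        if kurtosis(X @ h_new) > kurtosis(X @ h):
            h = h_new
            break
        step /= 2                            # too far: shorten the step

print(kurtosis(X @ h) > k0)                  # → True: the kurtosis has increased
```

As the discussion below notes, an unconstrained maximization of this kind can overshoot the kurtosis of the clean residual, so in practice the filter length and adaptation must be constrained.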
In the frequency domain
However, according to Haykin [17], the convergence of an LMS-like algorithm in
the time domain is very slow. Therefore, Gillespie [14] proposes to adapt the algo-
rithm in the frequency domain. Moreover, by using several microphones and cal-
culating the feedback function on an averaged output of all the channels, the
accuracy and the speed of the adaptation are increased.
The frequency-domain method proposed in [14] uses a modulated complex lapped
transform (MCLT) [18]. This filter-bank structure is close to a Discrete Cosine
Transform (DCT). The general diagram of the method in the frequency domain
for two microphones is shown in figure 3.13.
Figure 3.13: Two-channel frequency-domain adaptive algorithm for maximization of the kurtosis of the LP residual.
3.3 Discussion of the method
The maximization of the kurtosis permits real-time dereverberation. The adap-
tation is quick if a short adaptive filter is used. However, in the case of strong
reverberation the improvement of the signal is not perceptible.
If the length of the adaptive filter is increased, the kurtosis is still maximized and
the algorithm converges to a signal with maximum kurtosis. But the resulting
signal sometimes has a higher kurtosis than the original clean signal. The sound is
then strongly distorted and sometimes no longer intelligible. Figure 3.14 shows
the original LP residual of the clean signal. This signal is artificially reverberated
and then enhanced by maximizing the kurtosis of the LP residuals. The resulting
LP residual has a higher kurtosis than the original one. This means that the
maximization has to be constrained: the clean speech has a higher kurtosis than
the reverberated one, but this does not mean that the signal with the highest
kurtosis is the clean signal.
In practice, the length of the adaptive filter must not be longer than the period
of the glottal pulses. With this constraint, the efficiency of the dereverberation is
limited.
Another drawback of this method is that the LP analysis, as explained in section
2.1.4, is a very good approximation of the magnitude spectrum of the speech
signal but strongly alters the phase spectrum. As the phase is crucial for source
localization, it should be studied whether this method dramatically alters the
phase information of the signal.
Figure 3.14: On the left, the LP residual of a clean signal. On the right, the LP residual of the resulting dereverberated signal. The kurtosis of the dereverberated signal is higher than the kurtosis of the original signal. The resulting signal is strongly distorted.
Chapter 4
Equalization of room impulse responses
In chapter 3 the dereverberation approach considered the effect of the room as
a distortion which alters the harmonicity of the speech signal. This chapter will
discuss methods to estimate room impulse responses. These estimated impulse
responses can then be equalized (inverted) in order to recover the original clean
speech signal (see section 2.3).
In section 4.1 the principle of a channel estimation method using the second
order statistics of the observed signals will be explained. Then, in sections 4.2
and 4.3, two different implementations of this principle will be discussed. Finally,
in section 4.4, some ideas for improvement will be proposed.
4.1 Principle of the channel estimation
Some methods have been proposed to estimate a single channel. For example,
Hopgood proposes in [19] a single-channel estimation method based on the non-
stationarity of speech and the stationarity of the room impulse response. How-
ever, in most cases these methods require the input signal to be white noise,
which is not the case for a speech signal. In contrast, the estimation of several
room impulse responses simultaneously is possible [20]. Moreover, as explained
in section 2.3, it is much easier to find a global inverse for two or more room
impulse responses than the inverse of a single one. In this section a method will
be presented which permits the estimation of the impulse responses of a
Single-Input Multiple-Output (SIMO) system using only second order statistics
(SOS).
4.1.1 Hypotheses
In [20] Tong et al. show that a Single-Input Multiple-Output (SIMO) system
can be identified under the following conditions.
1. The autocorrelation matrix of the source signal is of full rank.
2. The channel transfer functions do not share any common zeros.
4.1.2 Basic idea
The relation between the input and the outputs of a SIMO system (see figure
4.1) is:
x_i(n) = h_i(n) * s(n), \quad \forall i \in [1, M]   (4.1)

Figure 4.1: SIMO system.
In vector/matrix form, this signal model becomes:

\mathbf{x}_i(n) = \mathbf{H}_i\, \mathbf{s}(n)   (4.2)

where

\mathbf{x}_i(n) = [x_i(n), x_i(n-1), \ldots, x_i(n-L+1)]^T,   (4.3)

\mathbf{H}_i = \begin{bmatrix}
h_i(0) & \cdots & h_i(L-1) & & 0 \\
 & \ddots & & \ddots & \\
0 & & h_i(0) & \cdots & h_i(L-1)
\end{bmatrix},   (4.4)

\mathbf{s}(n) = [s(n), s(n-1), \ldots, s(n-2L+2)]^T,   (4.5)
where L is the maximum length of the room impulse responses.
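A quick numerical check of the signal model (4.2)–(4.5), with all values randomly generated for illustration; the band matrix below is the L × (2L − 1) filtering matrix of (4.4):

```python
import numpy as np

rng = np.random.default_rng(1)
L, N = 4, 50
h = rng.standard_normal(L)           # channel taps h_i(0), ..., h_i(L-1)
s = rng.standard_normal(N)           # source signal
x = np.convolve(s, h)[:N]            # observation x_i = h_i * s

# L x (2L-1) filtering matrix of (4.4): row k holds h_i shifted k places.
H = np.zeros((L, 2 * L - 1))
for k in range(L):
    H[k, k:k + L] = h

n = 10                               # any instant with n >= 2L - 2
s_vec = s[n - np.arange(2 * L - 1)]  # s(n) = [s(n), ..., s(n-2L+2)]^T
x_vec = x[n - np.arange(L)]          # x_i(n) = [x_i(n), ..., x_i(n-L+1)]^T

print(np.allclose(H @ s_vec, x_vec)) # → True: x_i(n) = H_i s(n)
```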
The idea of blind SIMO identification is to study the matrix

\mathbf{R}_x = \begin{bmatrix}
\sum_{i \neq 1} \mathbf{R}_{x_i x_i} & -\mathbf{R}_{x_2 x_1} & \cdots & -\mathbf{R}_{x_M x_1} \\
-\mathbf{R}_{x_1 x_2} & \sum_{i \neq 2} \mathbf{R}_{x_i x_i} & \cdots & -\mathbf{R}_{x_M x_2} \\
\vdots & & \ddots & \vdots \\
-\mathbf{R}_{x_1 x_M} & -\mathbf{R}_{x_2 x_M} & \cdots & \sum_{i \neq M} \mathbf{R}_{x_i x_i}
\end{bmatrix}   (4.6)

where \mathbf{R}_{x_i x_j} = E\{\mathbf{x}_i(n)\, \mathbf{x}_j^T(n)\} are the auto- and cross-correlation matrices of
the observed signals. The matrices \mathbf{R}_{x_i x_j} can be written as

\mathbf{R}_{x_i x_j} = \frac{1}{T}\, \mathbf{X}_i \mathbf{X}_j^T   (4.7)

where \mathbf{X}_i is the L \times (T + L - 1) Sylvester matrix of x_i(n) and T is the number
of samples of x_i(n).
If the matrix \mathbf{R}_x is multiplied by the vector \mathbf{h} = [\mathbf{h}_1^T\ \mathbf{h}_2^T \cdots \mathbf{h}_M^T]^T, we obtain
(for the first L rows):

\sum_{i \neq 1} \mathbf{R}_{x_i x_i} \mathbf{h}_1 - \mathbf{R}_{x_2 x_1} \mathbf{h}_2 - \cdots - \mathbf{R}_{x_M x_1} \mathbf{h}_M
= \frac{1}{T} \sum_{i \neq 1} \mathbf{X}_i \left( \mathbf{X}_i^T \mathbf{h}_1 \right) - \frac{1}{T} \mathbf{X}_2 \left( \mathbf{X}_1^T \mathbf{h}_2 \right) - \cdots - \frac{1}{T} \mathbf{X}_M \left( \mathbf{X}_1^T \mathbf{h}_M \right)
= \frac{1}{T} \sum_{i \neq 1} \mathbf{X}_i \left( \mathbf{X}_i^T \mathbf{h}_1 - \mathbf{X}_1^T \mathbf{h}_i \right)

A left multiplication by the transpose of a Sylvester matrix is a convolution; the
term \mathbf{X}_i^T \mathbf{h}_1 - \mathbf{X}_1^T \mathbf{h}_i therefore actually equals

x_i(n) * h_1(n) - x_1(n) * h_i(n) = s(n) * \left( h_i(n) * h_1(n) - h_1(n) * h_i(n) \right)   (4.8)

and, as the convolution of real signals is commutative, this term equals zero. The
same development can be performed for the other rows of the matrix product
\mathbf{R}_x \mathbf{h}, which gives

\mathbf{R}_x \mathbf{h} = \mathbf{0},   (4.9)

which means that the vector \mathbf{h} lies in the null space of the matrix \mathbf{R}_x.
4.1.3 How can this idea be implemented?
There are then two distinct approaches to identify the SIMO system:

1. An eigenvalue decomposition is performed on the matrix \mathbf{R}_x and its null
space is computed [21]. Under the hypothesis that \mathbf{R}_s = E\{\mathbf{s}(n)\mathbf{s}^T(n)\}
is of full rank, this null space corresponds to the unknown system \mathbf{h}. This
batch method is discussed in section 4.2.

2. A set of filters g_i(n) is adaptively estimated such that
\forall (i, j):\ h_i(n) * g_j(n) - h_j(n) * g_i(n) = 0 [22]. This iterative method is
discussed in section 4.3.
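The batch approach can be sketched for M = 2 channels as follows; the channels and the source are randomly generated for illustration, and the estimate is compared to the true concatenated channel vector by normalized correlation (the null-space vector recovers the channels only up to sign and scale):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T = 8, 4000
s = rng.standard_normal(T)                  # source with full-rank autocorrelation
h1 = rng.standard_normal(L)                 # two random channels; their transfer
h2 = rng.standard_normal(L)                 # functions almost surely share no zeros
x1 = np.convolve(s, h1)[:T]
x2 = np.convolve(s, h2)[:T]

def windows(x, L):
    # rows are the vectors [x(n), x(n-1), ..., x(n-L+1)]
    return np.stack([x[n - np.arange(L)] for n in range(L - 1, len(x))])

X1, X2 = windows(x1, L), windows(x2, L)
R11 = X1.T @ X1 / len(X1)                   # R_{x1 x1}
R22 = X2.T @ X2 / len(X2)                   # R_{x2 x2}
R12 = X1.T @ X2 / len(X1)                   # R_{x1 x2}

# The matrix of (4.6) for M = 2; its null space contains h = [h1; h2].
Rx = np.block([[R22, -R12.T], [-R12, R11]])
w, V = np.linalg.eigh(Rx)
h_est = V[:, 0]                             # eigenvector of the smallest eigenvalue

h_true = np.concatenate([h1, h2])
corr = abs(h_true @ h_est) / np.linalg.norm(h_true)  # |cos| between estimate and truth
print(round(corr, 3))                       # → 1.0 (up to sign and scale)
```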
4.1.4 Why do the channels have to be coprime?
The second hypothesis, which requires that the channel transfer functions do not
share any common zeros, can be explained as follows.
Consider for example two channels with impulse responses h_1(n) and h_2(n). If the
transfer functions of these channels share common zeros, then the impulse responses
can be rewritten as

h_1(n) = d(n) * \tilde{h}_1(n),   (4.10)

h_2(n) = d(n) * \tilde{h}_2(n),   (4.11)

where d(n) is, by analogy with polynomials, the greatest common divisor of h_1(n)
and h_2(n), and the transfer functions of \tilde{h}_1(n) and \tilde{h}_2(n) are coprime (they do not
share any common zeros). Then x_1(n) and x_2(n) become

x_1(n) = (s(n) * d(n)) * \tilde{h}_1(n),   (4.12)

x_2(n) = (s(n) * d(n)) * \tilde{h}_2(n),   (4.13)

and, if the correlation matrix of s(n) * d(n) is of full rank, the methods will identify
the system [\tilde{h}_1(n)\ \tilde{h}_2(n)] instead of [h_1(n)\ h_2(n)].
4.1.5 Estimation of the length of the filters
Both the batch and the iterative implementations require that the lengths of the
channels are given. The estimation of these lengths is very important, as will be
explained in this subsection.
In the two-microphone case, the channel estimation tries to find two FIR filters
g_1(n) and g_2(n) of length L_g + 1 such that

h_1(n) * g_2(n) - h_2(n) * g_1(n) = 0,   (4.14)
where h1(n) and h2(n) are the two unknown FIR filters we want to identify. The
lengths of these filters are equal to Lh + 1, which is also unknown.
In the z-domain, the relation between the filters can be written as an operation
on polynomials
H1(z)G2(z) = H2(z)G1(z). (4.15)
The polynomials H1(z)G2(z) and H2(z)G1(z) are equal if and only if they have
exactly the same Lh + Lg zeros.
As H_1(z) and H_2(z) do not share common zeros, each zero of H_1(z), resp. H_2(z),
must also be a zero of G_1(z), resp. G_2(z). G_1(z) and G_2(z) therefore each contain
at least L_h zeros, and thus L_g \geq L_h.
When L_g = L_h, the method directly returns the estimated channels. However,
when the lengths of the filters (the channel order) are over-estimated, additional
zeros appear. Figure 4.2 illustrates the system in the two-m