
Spectro-Temporal Analysis of Speech Using 2-D Gabor Filters

Tony Ezzat, Jake Bouvrie, Tomaso Poggio

Center for Biological and Computational Learning, McGovern Institute for Brain Research

Massachusetts Institute of Technology, Cambridge, MA

Abstract

We present a 2-D spectro-temporal Gabor filterbank based on the 2-D Fast Fourier Transform, and show how it may be used to analyze localized patches of a spectrogram. We argue that the 2-D Gabor filterbank has the capacity to decompose a patch into its underlying dominant spectro-temporal components, and we illustrate the response of our filterbank to different speech phenomena such as harmonicity, formants, vertical onsets/offsets, noise, and overlapping simultaneous speakers.

Index Terms: speech analysis, spectro-temporal filterbanks, 2-D Gabor

1. Introduction

A typical narrowband magnitude spectrogram of speech displays several important and well-known phenomena: harmonicity, which is exemplified by the presence of horizontal lines related to the pitch of the speaker; low-frequency amplitude modulations, related to the formants of the speaker's vocal tract; vertical onset/offset edges in time, related to plosive sounds in speech; and noise, related to fricatives, aspirants, and other phonemes which generate noise.

All of these speech-related phenomena may be viewed as different types of spectro-temporal modulations, and one of the challenges our auditory system faces in processing speech is that it must detect, separate, and recognize these modulations in a fast and reliable manner.

Recent neurophysiological evidence from a number of animals [1] [2] indicates that cells in the auditory cortex are, in fact, tuned to localized spectro-temporal modulations. The spectro-temporal receptive fields (STRFs) of these cortical cells look like 2-D spectro-temporal Gabor filters, and an analogy between these STRFs and a 2-D spectro-temporal Gabor filterbank clearly suggests itself.

Motivated by these studies, we present in this work a simple 2-D Gabor filterbank, and use it to analyze localized patches of a spectrogram. We show how such a filterbank responds to the different types of phenomena commonly occurring in spectrograms, and further argue that, in principle, it has the capacity to detect, separate, and recognize these phenomena.

Our 2-D Gabor filterbank is implemented using the 2-D FFT. We do this for two main reasons. First, the 2-D FFT is computationally fast. Second, the 2-D FFT organizes its outputs into a 2-D grid, which allows us to easily visualize the filterbank's response as an image.

In prior work, Kleinschmidt, Gelbart, and colleagues [3] [4] also applied 2-D spectro-temporal Gabors to mel-spectrograms. However, in their approach, no organizational map of the Gabor filter responses was formed. Instead, the 2-D Gabor outputs were lumped together into a one-dimensional feature vector for use in recognition experiments. As a consequence, it is hard to interpret their results and see how the 2-D Gabor filterbank analyzed the various types of spectrogram phenomena.

Shamma and colleagues [1] [5] have also applied 2-D spectro-temporal Gabor filterbanks for speech discrimination and enhancement. In their work the spectro-temporal responses were organized into a very large multi-dimensional tensorial representation which is very hard to visualize and interpret. We present an alternative filterbank decomposition in our work here which we believe is simpler and easier to interpret.

Finally, in our own previous work [6], we applied 2-D Gabor analysis using the 2-D FFT to spectrogram patches. However, in that work we only noted the filterbank's response to harmonic phenomena, and failed to document how it responds to other very important phenomena such as formants and vertical edges. This was partly due to the fact that patch DC values were not removed prior to performing 2-D Gabor analysis, which made it difficult to see the response of the filterbank for these other phenomena.

In the following sections, we briefly review how our spectrograms are constructed (Section 2), and how our patches are selected (Section 3). Then we describe our 2-D Gabor filterbank in Sections 4 and 5. Finally, in Section 6 we describe the response of the filterbank for different types of speech-related phenomena.

2. 1-D STFT

All of the 16 kHz utterances we consider are first STFT-analyzed using a 25 ms Hamming window with a 1 ms frame step and a zero-padding factor of 4. This yields 1600-dimensional STFT frames, which are truncated to 800 bins due to the symmetry of the Fourier transform. We limit our analysis in this paper to the magnitude spectrogram of each utterance, which we denote $S(f, t)$. Additionally, we limit our analysis to a linear frequency axis, deferring logarithmic frequency analysis to future work.
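The STFT settings above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `magnitude_spectrogram`, its argument names, and the 1 kHz test tone are assumptions made here, not part of the analysis code used in the paper.

```python
import numpy as np

def magnitude_spectrogram(x, fs=16000, win_ms=25.0, hop_ms=1.0, zero_pad=4):
    """Return |STFT| as an array S[f, t] (frequency bins by frames)."""
    win_len = int(round(fs * win_ms / 1000.0))   # 400 samples at 16 kHz
    hop = int(round(fs * hop_ms / 1000.0))       # 16 samples at 16 kHz
    n_fft = zero_pad * win_len                   # 1600-point FFT
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop:k * hop + win_len] * window
                       for k in range(n_frames)])
    spectrum = np.fft.fft(frames, n=n_fft, axis=1)  # zero-pads each frame
    return np.abs(spectrum[:, :n_fft // 2]).T       # keep the first 800 bins

# Hypothetical test signal: a 0.25 s, 1 kHz tone sampled at 16 kHz.
fs = 16000
time = np.arange(fs // 4) / fs
x = np.sin(2 * np.pi * 1000.0 * time)
S = magnitude_spectrogram(x, fs)
```

With these settings each frequency bin spans 16000 / 1600 = 10 Hz and each column of `S` corresponds to 1 ms, so the test tone should appear as a ridge at bin 100.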

3. STFT Patches

At every grid point $(i, j)$ in the spectrogram, we extract a patch $P_{ij}(f, t)$ of the spectrogram of height $d_f$ and width $d_t$. The height $d_f$ and width $d_t$ of the local patch are important analysis parameters: they must be large enough to resolve the underlying spectro-temporal components in the patch, but small enough that the underlying signal is stationary. Suitable parameter ranges are 5-15 ms for the $d_t$ parameter and 600-800 Hz for the $d_f$ parameter. Additional analysis parameters are the window hop sizes in time, $\Delta i$, and in frequency, $\Delta j$. Typically we set $\Delta i$ to 3-5 ms and $\Delta j$ to 150-350 Hz, which creates overlap between the patches. Additionally, we subtract the patch DC value $\frac{1}{d_f d_t} \sum_{f,t} P_{ij}(f, t)$ from the patch $P_{ij}(f, t)$ before any further processing.

Figure 1: Left: Examples of spectro-temporal Gabor filters $G_{F,\theta}(f, t)$ for $F = \{1, 2, 3\}$ and $\theta = \{0, \pi/4, \pi/2, 3\pi/4\}$. Right: The magnitude Fourier transform of each corresponding Gabor filter on the left.
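As a rough illustration of the patch extraction and DC removal described in Section 3, the sketch below slides a window over a spectrogram array. The function name `extract_patches`, the specific sizes (70 bins by 10 frames, with hops of 25 bins and 4 frames, assuming 10 Hz bins and 1 ms frames), and the random stand-in spectrogram are all assumptions chosen to fall inside the stated parameter ranges.

```python
import numpy as np

def extract_patches(S, d_f=70, d_t=10, hop_f=25, hop_t=4):
    """Yield ((i, j), patch): i indexes time frames, j indexes frequency bins."""
    n_bins, n_frames = S.shape
    for i in range(0, n_frames - d_t + 1, hop_t):
        for j in range(0, n_bins - d_f + 1, hop_f):
            patch = S[j:j + d_f, i:i + d_t]
            yield (i, j), patch - patch.mean()   # subtract the patch DC value

rng = np.random.default_rng(0)
S = rng.random((800, 226))                       # stand-in for a real spectrogram
patches = dict(extract_patches(S))
```

Because the hops are smaller than the patch dimensions, neighbouring patches overlap, as in the text.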

4. 2-D Gabor Filterbank Using the 2-D FFT

To perform 2-D Gabor filterbank analysis, we first multiply each patch $P_{ij}(f, t)$ by a 2-D Gaussian window $W(f, t)$ located at the patch center $(f_0, t_0)$:

$$W(f, t) = \frac{1}{2\pi \sigma_f \sigma_t} \exp\left(-\frac{1}{2}\left(\frac{(f - f_0)^2}{\sigma_f^2} + \frac{(t - t_0)^2}{\sigma_t^2}\right)\right) \qquad (1)$$

Typically, we fix the window bandwidth $(\sigma_f, \sigma_t)$ to be one-third of the patch height and width respectively.
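A minimal sketch of the Gaussian window of Equation 1 follows. The function name `gaussian_window`, the `frac` argument, and the choice of placing the center midway between the first and last sample are assumptions made for illustration.

```python
import numpy as np

def gaussian_window(height, width, frac=1.0 / 3.0):
    """2-D Gaussian window centered on the patch (Equation 1)."""
    f = np.arange(height)[:, None]
    t = np.arange(width)[None, :]
    f0, t0 = (height - 1) / 2.0, (width - 1) / 2.0   # assumed patch center
    sigma_f, sigma_t = frac * height, frac * width   # one-third of patch size
    exponent = -0.5 * (((f - f0) / sigma_f) ** 2 + ((t - t0) / sigma_t) ** 2)
    return np.exp(exponent) / (2.0 * np.pi * sigma_f * sigma_t)

W = gaussian_window(70, 10)
```

With a bandwidth of one-third of the patch size, the window is truncated at about 1.5 standard deviations on each side, so it tapers the patch edges without driving them to zero.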

Secondly, a 2-D Fourier transform of size $N_H \times N_W$ is applied to the windowed patch to produce the 2-D spectro-temporal Gabor filterbank output:

$$R_{ij}(\Omega, \omega) = \sum_f \sum_t W(f, t)\, P_{ij}(f, t)\, e^{-j 2\pi \frac{\Omega}{N_H} f}\, e^{-j 2\pi \frac{\omega}{N_W} t}$$

Typical values of $N_H$ and $N_W$ are 512 and 256 respectively. By exchanging the terms $W(f, t)$ and $P_{ij}(f, t)$, we can rewrite the filterbank output as

$$R_{ij}(\Omega, \omega) = \sum_f \sum_t P_{ij}(f, t)\, G^{*}_{\Omega,\omega}(f, t) \qquad (2)$$

where

$$G_{\Omega,\omega}(f, t) = W(f, t)\, e^{j 2\pi \left(\frac{\Omega}{N_H} f + \frac{\omega}{N_W} t\right)} \qquad (3)$$

Equation 3 above is the equation of a 2-D spectro-temporal Gabor filter. $R_{ij}(\Omega, \omega)$ may thus be viewed as the projection of a patch $P_{ij}(f, t)$ on an entire bank of spectro-temporal 2-D Gabor filters $G_{\Omega,\omega}(f, t)$.
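The equivalence between the zero-padded 2-D FFT of the windowed patch and the projection of Equation 2 can be checked numerically. In the sketch below, the function names, the random test patch, the unnormalized Gaussian window, and the tested bin `(a, b)` are all assumptions introduced for illustration.

```python
import numpy as np

def gabor_response(patch, window, n_h=512, n_w=256):
    """Filterbank output: 2-D FFT of the windowed patch, zero-padded to n_h x n_w."""
    return np.fft.fft2(window * patch, s=(n_h, n_w))

def gabor_filter(window, a, b, n_h=512, n_w=256):
    """Gabor filter of Equation 3 at integer frequency indices (a, b)."""
    f = np.arange(window.shape[0])[:, None]
    t = np.arange(window.shape[1])[None, :]
    return window * np.exp(2j * np.pi * (a * f / n_h + b * t / n_w))

rng = np.random.default_rng(0)
patch = rng.standard_normal((70, 10))
patch -= patch.mean()                                   # DC removal
f = np.arange(70)[:, None] - 34.5
t = np.arange(10)[None, :] - 4.5
window = np.exp(-0.5 * ((f / (70 / 3)) ** 2 + (t / (10 / 3)) ** 2))

R = gabor_response(patch, window)
a, b = 37, 5                                            # arbitrary bin to check
projection = np.sum(patch * np.conj(gabor_filter(window, a, b)))
```

Here `R[a, b]` and `projection` should agree to numerical precision, which is why a single FFT call evaluates the whole bank of $N_H \times N_W$ filters at once.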

It is sometimes desirable to re-parameterize the spectro-temporal Gabors $G_{\Omega,\omega}(f, t)$ in terms of their spectro-temporal frequency $F$ and orientation $\theta$. This can be done in a straightforward manner through the forward trigonometric mapping $(F, \theta) = \left(\sqrt{\Omega^2 + \omega^2},\ \tan^{-1}(\omega / \Omega)\right)$ and the backward trigonometric mapping $(\Omega, \omega) = (F \cos\theta, F \sin\theta)$. Shown in Figure 1 on the left are example 2-D Gabors $G_{F,\theta}(f, t)$ for different values of $F$ and $\theta$.

Finally, it is well-known [7] that the Fourier transform of a 2-D Gabor looks like a pair of conjugate Gaussian peaks, whose distance from each other is proportional to $F$, and whose orientation is proportional to $\theta$. Shown in Figure 1 on the right are example 2-D Fourier transforms of the Gabor filters on the left.
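The conjugate-peak property can be illustrated with a small synthetic example. The sketch below builds a real-valued Gabor (a Gaussian envelope times a cosine carrier) from a frequency and orientation using the backward mapping, then locates the two largest bins of its 2-D FFT. The function names, the 64 x 64 size, and the envelope width are assumptions; the first axis is taken to be the spectral axis.

```python
import numpy as np

def real_gabor(F, theta, size=64, sigma=10.0):
    """Real 2-D Gabor with F cycles across the patch at orientation theta."""
    f = np.arange(size)[:, None] - size // 2
    t = np.arange(size)[None, :] - size // 2
    Omega, omega = F * np.cos(theta), F * np.sin(theta)      # backward mapping
    envelope = np.exp(-0.5 * (f ** 2 + t ** 2) / sigma ** 2)
    return envelope * np.cos(2 * np.pi * (Omega * f + omega * t) / size)

def two_largest_bins(image):
    """Return the FFT bin coordinates of the two largest magnitudes."""
    mag = np.abs(np.fft.fft2(image))
    flat = np.argsort(mag.ravel())[-2:]
    return {tuple(int(v) for v in np.unravel_index(k, mag.shape)) for k in flat}

peaks_spectral = two_largest_bins(real_gabor(F=8, theta=0.0))
peaks_temporal = two_largest_bins(real_gabor(F=8, theta=np.pi / 2))
```

In FFT indexing, bin 56 of a 64-point transform is the negative-frequency partner of bin 8, so each result is one conjugate pair whose separation is twice the carrier frequency.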

5. Patch 2-D Gabor Analysis & Synthesis

Shown in Figure 2 on the left is a representative patch from a narrowband magnitude spectrogram. The magnitude of the 2-D spectro-temporal response $|R_{ij}(\Omega, \omega)|$ for that patch is plotted in the middle. In general, 2-D spectro-temporal Gabor responses of spectrogram patches exhibit multiple Gaussian peaks that come in conjugate pairs. Each peak pair corresponds to a different spectro-temporal modulation contained in the patch. In effect, the 2-D Gabor filterbank decomposes a patch into its spectro-temporal components. Peaks with large amplitude in the spectro-temporal response $R_{ij}(\Omega, \omega)$ correspond to dominant spectro-temporal modulations in the patch.

Figure 2: Left: example spectrogram patch $P_{ij}(f, t)$. Middle: magnitude spectro-temporal response $|R_{ij}(\Omega, \omega)|$. Right: reconstructed patch $\hat{P}_{ij}(f, t)$ using the top 5 spectro-temporal components.

In order to examine what spectro-temporal component each of these peak pairs corresponds to, a simple peak-detection strategy is used to obtain a set $C$ of candidate peak locations $(\Omega_{peak}, \omega_{peak})$ and values (amplitude $A_{peak}$ and phase $\phi_{peak}$) from the spectro-temporal response $R_{ij}(\Omega, \omega)$. We match the conjugate peak locations in $C$ with each other into pairs and throw out any peak candidates which do not have matching conjugate peaks.
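The text does not specify the peak detector, so the following is only one possible sketch: local maxima over a circular 3 x 3 neighbourhood, followed by conjugate matching through index negation modulo the FFT size. The function names, the `n_peaks` cutoff, and the synthetic two-component patch are assumptions.

```python
import numpy as np

def find_peaks_2d(mag, n_peaks=10):
    """Return up to n_peaks local maxima of mag (3x3 neighbourhood, circular)."""
    is_peak = np.ones(mag.shape, dtype=bool)
    for da in (-1, 0, 1):
        for db in (-1, 0, 1):
            if (da, db) != (0, 0):
                is_peak &= mag >= np.roll(mag, (da, db), axis=(0, 1))
    coords = np.argwhere(is_peak)
    order = np.argsort(mag[is_peak])[::-1][:n_peaks]     # strongest first
    return [tuple(int(v) for v in coords[k]) for k in order]

def match_conjugates(peaks, shape):
    """Pair each peak with its conjugate location; drop unmatched candidates."""
    n_h, n_w = shape
    peak_set = set(peaks)
    pairs = []
    for a, b in peaks:
        mate = ((-a) % n_h, (-b) % n_w)
        if mate in peak_set and (a, b) < mate:            # keep each pair once
            pairs.append(((a, b), mate))
    return pairs

f = np.arange(64)[:, None]
t = np.arange(64)[None, :]
patch = np.cos(2 * np.pi * 8 * f / 64) + 0.5 * np.cos(2 * np.pi * 3 * t / 64)
R = np.fft.fft2(patch)
mag = np.abs(R)
pairs = match_conjugates(find_peaks_2d(mag, n_peaks=4), mag.shape)
amplitudes = [float(mag[p]) for p, _ in pairs]
phases = [float(np.angle(R[p])) for p, _ in pairs]
```

The strict ordering test `(a, b) < mate` also discards self-conjugate bins such as the DC bin, which has no distinct partner.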

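For each of these matched peak pairs, we can estimate a spectro-temporal orientation $\theta_k$ and frequency $F_k$ associated with that spectro-temporal component $k$ as

$$\theta_k = \tan^{-1}\left(\frac{\Delta\omega_{peak}}{\Delta\Omega_{peak}}\right) \qquad (4)$$

and

$$F_k = \frac{1}{2}\sqrt{\left(\frac{\Delta\Omega_{peak}}{N_H}\right)^2 + \left(\frac{\Delta\omega_{peak}}{N_W}\right)^2} \qquad (5)$$

where $\Delta\Omega_{peak}$ and $\Delta\omega_{peak}$ refer to differences between the conjugate pair location coordinates. The amplitude $A_k$ and phase $\phi_k$ of the spectro-temporal component are just equal to the amplitude $A_{peak}$ and phase $\phi_{peak}$ of the peaks in $R_{ij}(\Omega, \omega)$ itself.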
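A small sketch of Equations 4 and 5 is given below. It assumes that peak locations are FFT bin indices, so they are first converted to signed indices before differencing; the function names and the folding of the orientation into $[0, \pi)$ are choices made here, not details given in the text.

```python
import numpy as np

def signed_index(k, n):
    """Map an FFT bin index in [0, n) to a signed frequency index."""
    return k if k < n // 2 else k - n

def component_params(peak, mate, n_h, n_w):
    """Estimate (F_k, theta_k) from a conjugate peak pair (Equations 4 and 5)."""
    d_Omega = signed_index(peak[0], n_h) - signed_index(mate[0], n_h)
    d_omega = signed_index(peak[1], n_w) - signed_index(mate[1], n_w)
    theta = np.arctan2(d_omega, d_Omega) % np.pi     # fold into [0, pi)
    F = 0.5 * np.hypot(d_Omega / n_h, d_omega / n_w) # cycles per sample
    return float(F), float(theta)

F1, theta1 = component_params((8, 0), (56, 0), 64, 64)
F2, theta2 = component_params((0, 3), (0, 61), 64, 64)
```

The factor of one half reflects that the two peaks of a pair sit at plus and minus the modulation frequency, so their separation is twice that frequency.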

The spectro-temporal component $k$ associated with a certain peak pair may thus be synthesized as $A_k e^{j\phi_k} G_{F_k,\theta_k}(f, t)$. If there are $K$ components in a patch then the patch may be approximately reconstructed as

$$\hat{P}_{ij}(f, t) = \sum_{k=1}^{K} A_k e^{j\phi_k} G_{F_k,\theta_k}(f, t) \qquad (6)$$

Shown in Figure 2 at right is the reconstruction of the patch on the left using the top 5 spectro-temporal components from the 2-D spectro-temporal Gabor response in the middle.
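The synthesis step can be sketched as a sum of Gabor filters weighted by the complex response at each selected peak. The sketch below makes several assumptions: both members of each conjugate pair are included so that the sum is real, no amplitude normalization is applied (so the result matches the windowed patch only up to a scale factor), and the test patch is a synthetic single-component cosine.

```python
import numpy as np

def synthesize(R, peaks, window):
    """Sum A_k exp(j phi_k) G over the given peak bins and return the real part."""
    n_h, n_w = R.shape
    f = np.arange(window.shape[0])[:, None]
    t = np.arange(window.shape[1])[None, :]
    total = np.zeros(window.shape, dtype=complex)
    for a, b in peaks:
        gabor = window * np.exp(2j * np.pi * (a * f / n_h + b * t / n_w))
        total += R[a, b] * gabor        # R[a, b] = A_k * exp(j * phi_k)
    return total.real

height, width = 64, 32
f = np.arange(height)[:, None]
t = np.arange(width)[None, :]
window = np.exp(-0.5 * (((f - 31.5) / (height / 3)) ** 2
                        + ((t - 15.5) / (width / 3)) ** 2))
patch = np.cos(2 * np.pi * 8 * f / height + 0.7) * np.ones((1, width))
R = np.fft.fft2(window * patch, s=(64, 64))
recon = synthesize(R, [(8, 0), (56, 0)], window)
similarity = np.corrcoef(recon.ravel(), (window * patch).ravel())[0, 1]
```

For this single-component patch the correlation between the reconstruction and the windowed patch should be very close to one; real patches need more components, as in Figure 2.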

6. Phenomenological Analysis of 2-D Gabor Responses

We now examine more carefully how different types of commonly-occurring phenomena in spectrograms are analyzed by the 2-D Gabor transform. The phenomena we will examine are harmonicity; low-frequency amplitude modulations related to formants; vertical onset/offset phenomena related to plosives; phonetic noise; the effect of adding white background noise; and finally, the effect of overlapping simultaneous speakers. In each of the following sections, we look at each phenomenon individually.

Figure 3: First column: example harmonic patches $P_{ij}(f, t)$. Second column: magnitude spectro-temporal responses $|R_{ij}(\Omega, \omega)|$ for each patch. Third column: spectro-temporal components corresponding to largest (black X) and next largest (green X) peak pairs. Fourth column: reconstructed patches $\hat{P}_{ij}(f, t)$ using the top 2 spectro-temporal components.

6.1. Harmonicity

Shown in Figure 3 on the far left are two different representative harmonic patches from a spectrogram. In the second column, we plot the magnitude of the spectro-temporal response $|R_{ij}(\Omega, \omega)|$ for each patch. In the third column we plot the re-synthesized spectro-temporal components associated with the top two peak pairs in the spectro-temporal response. The first spectro-temporal component corresponds to the peak pair with the largest amplitude (marked with a black X), while the second spectro-temporal component corresponds to the peak pair with the second largest amplitude (marked with a green X). Finally, in the last column we plot the reconstruction $\hat{P}_{ij}(f, t)$ of the input patch using only the top two components (as in Equation 6).

The spectro-temporal components associated with the largest peaks are in fact the harmonic components. In general, harmonicity in a patch emerges as a set of dominant conjugate peaks spaced with a distance proportional to the spectro-temporal frequency $F$ and with an angle proportional to the spectro-temporal orientation $\theta$. This fact was noticed and employed for the purposes of carrier estimation in [6] and pitch-tracking in [8].

6.2. Formants

Further inspection of the spectro-temporal magnitude responses $|R_{ij}(\Omega, \omega)|$ in Figure 3 reveals that the patches contain a second component exemplified by the presence of two smaller peaks located closer to the origin in the spectro-temporal responses. Synthesizing the spectro-temporal component associated with these secondary peaks reveals that they correspond to the low-frequency amplitude modulations associated with formants.

In general, low-frequency amplitude modulations in a patch emerge as a set of conjugate peaks that are spaced closer to the origin in the spectro-temporal response. As with the harmonic peaks, the distance and angle of the peaks correspond directly to the frequency and orientation of the modulation.
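The harmonic and formant behaviour described in Sections 6.1 and 6.2 can be reproduced on a toy patch. The sketch below is a synthetic illustration only: the patch is a fast cosine along frequency (standing in for harmonics) multiplied by a slow cosine envelope (standing in for a formant), and all sizes and modulation rates are arbitrary assumptions rather than values from the paper.

```python
import numpy as np

height, width, n_h, n_w = 64, 16, 512, 256
f = np.arange(height)[:, None]
t = np.arange(width)[None, :]

carrier = 1.0 + np.cos(2 * np.pi * 12 * f / height)        # 12 harmonic ridges
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 2 * f / height)  # slow formant-like bump
patch = (carrier * envelope) * np.ones((1, width))
patch = patch - patch.mean()                               # DC removal

sigma_f, sigma_t = height / 3.0, width / 3.0
window = np.exp(-0.5 * (((f - (height - 1) / 2) / sigma_f) ** 2
                        + ((t - (width - 1) / 2) / sigma_t) ** 2))
mag = np.abs(np.fft.fft2(window * patch, s=(n_h, n_w)))

harmonic_bin = 12 * n_h // height    # expected spectral-modulation bin of the harmonics
formant_bin = 2 * n_h // height      # expected bin of the slow envelope
strongest = int(np.argmax(mag[:n_h // 2, 0]))
```

The strongest peak should fall at the harmonic bin, while a weaker but clearly separated peak should appear nearer the origin at the formant bin, both on the axis of zero temporal modulation because the toy patch does not change over time.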

It is worthwhile to note here that 2-D Gabor filterbank analysis is capable of separating harmonic from formant spectro-temporal components. Such a property could be used by our auditory system to endow us with the ability to recognize speech in a manner that is invariant to the speaker uttering it: our auditory system could have evolved to focus its attention mainly on the low-frequency components (which carry the phonetic information), ignoring the higher-frequency harmonic components related to the speaker.

Figure 4: First column: example plosive patches $P_{ij}(f, t)$. Second column: magnitude spectro-temporal responses $|R_{ij}(\Omega, \omega)|$ for each patch. Third column: spectro-temporal components corresponding to largest (black X) and next largest (green X) peak pairs. Fourth column: reconstructed patches $\hat{P}_{ij}(f, t)$ using the top 2 spectro-temporal components.

On the other hand, the response to both harmonic and formant components is simultaneously preserved in the Gabor filterbank's outputs. Such a property could be used by our auditory system to enable us to recognize speech in a noise-robust manner: whenever distinct harmonic peaks emerge in the filterbank's response, the auditory system can identify that response as a distinct signature of speech, and it can then attend to the information contained in the low-frequency peaks related to formants.

6.3. Vertical/Plosive Edges

Shown in Figure 4 on the far left are two different representative patches which contain plosive phenomena. These phenomena are characterized by rapid onset/offset of voicing or noise, and look like vertical edges in a patch.

Inspection of the spectro-temporal magnitude responses $|R_{ij}(\Omega, \omega)|$ for these plosive patches reveals the presence of two dominant peaks which are horizontal in their angular orientation, in contrast to the vertical angular orientation of both harmonic and formant spectro-temporal modulations shown in Figure 3. Synthesizing the spectro-temporal component associated with these peaks reveals that they do in fact correspond to a vertical amplitude modulation associated with the plosive edge.

We note that vertical plosive onset/offset edges cannot be detected unless a filterbank is used whose filters have a finite extent in time. Consequently, it is quite heartening that, within the same 2-D Gabor filterbank framework, all three types of harmonic, formant, and plosive phenomena can be separately detected.
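The response to a vertical edge can likewise be checked on a toy patch. In the sketch below, the patch is an idealized onset (silence followed by constant energy at all frequencies); the sizes and the placement of the onset at mid-patch are assumptions made for illustration.

```python
import numpy as np

height, width, n_h, n_w = 64, 32, 512, 256
f = np.arange(height)[:, None]
t = np.arange(width)[None, :]

# Idealized onset: zero before mid-patch, one after, at every frequency.
patch = np.where(t >= width // 2, 1.0, 0.0) * np.ones((height, 1))
patch = patch - patch.mean()                               # DC removal

window = np.exp(-0.5 * (((f - (height - 1) / 2) / (height / 3.0)) ** 2
                        + ((t - (width - 1) / 2) / (width / 3.0)) ** 2))
mag = np.abs(np.fft.fft2(window * patch, s=(n_h, n_w)))

row, col = np.unravel_index(np.argmax(mag), mag.shape)
```

The dominant peak lies at zero spectral modulation (`row == 0`) and non-zero temporal modulation (`col != 0`), which is the opposite axis from the harmonic and formant peaks of a time-invariant patch.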

Figure 5: First column: example noisy patch $P_{ij}(f, t)$. Second column: magnitude spectro-temporal responses $|R_{ij}(\Omega, \omega)|$. Third column: spectro-temporal components corresponding to largest (black X), second largest (green X), and third largest (magenta X) peak pairs. Fourth column: reconstructed patch $\hat{P}_{ij}(f, t)$ using the top 3 spectro-temporal components.

Figure 6: Patch $P_{ij}(f, t)$ with increasing levels of white noise added, along with the corresponding magnitude spectro-temporal response $|R_{ij}(\Omega, \omega)|$ for each patch. (Top left: 10 dB SNR, Top right: 5 dB SNR, Bottom left: -5 dB SNR, Bottom right: -10 dB SNR).

6.4. Phonetic Noise

Shown in Figure 5 on the left is a representative noisy patch corresponding to the phoneme "sh". As may be seen, the spectro-temporal response of a noisy patch contains multiple peaks at multiple orientations and frequencies. In general we have found that this is a signature property of noisy patches, in contrast to harmonic patches or plosive patches which usually contain one or at most two dominant spectro-temporal components. For example, we need at least three spectro-temporal components in order to begin to faithfully reconstruct the patch in Figure 5.

6.5. Background Noise

In Figure 6 we investigate the effect of adding white noise to a sound. Shown on the left of each figure pair is the same patch but with increasing amounts of white noise added (footnote 1). On the right hand side of each figure pair are the corresponding spectro-temporal responses. As may be seen, even though the SNRs are quite low, the spectro-temporal responses show very clear harmonic peaks down to about -5 dB SNR. At -10 dB SNR, the spectro-temporal response begins to lose the dominant harmonic peaks.
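Since the noise is added in the time domain before the STFT is recomputed, the mixing step can be sketched as below. The function name `add_white_noise`, the use of Gaussian white noise, and the 200 Hz test tone are assumptions; the paper does not state how the noise was generated or scaled.

```python
import numpy as np

def add_white_noise(x, snr_db, rng):
    """Add white Gaussian noise to x, scaled to reach the requested SNR in dB."""
    noise = rng.standard_normal(len(x))
    signal_power = np.mean(x ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200.0 * np.arange(16000) / 16000.0)
y = add_white_noise(x, -5.0, rng)
measured_snr = 10.0 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
```

The noisy waveform `y` would then be passed through the same STFT and patch analysis as the clean signal to obtain responses like those in Figure 6.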

Even though white noise adds a large number of uncorrelated peaks to the spectro-temporal response, peaks corresponding to the relevant harmonic, formant, or plosive phenomena may still be detectable. This is because the output of any one spectro-temporal Gabor filter is obtained by integrating information from the entire 2-D patch, which allows the 2-D spectro-temporal filterbank to be more robust to noise than purely 1-D spectral counterparts.

Footnote 1: White noise is added in the time domain before the STFT is recomputed.

6.6. Overlapping Speakers

Finally, we examine the effect of simultaneous speakers on the spectro-temporal response. Shown in the left of Figure 7 is a patch from speaker A and its associated spectro-temporal response. In the middle is the patch and response from speaker B (with the same spectrogram coordinates $(i, j)$ as the patch from speaker A). Finally, in the right of the figure is the patch and response from a spectrogram of speakers A and B together. As may be seen, the spectro-temporal response of the simultaneous speakers contains identifiable peaks from both speakers. A strategy suggests itself for speaker separation which is based on identifying the peaks in the combined spectro-temporal response, and assigning those peaks across frequency and time to either speaker.

Figure 7: Patches $P_{ij}(f, t)$ and associated magnitude spectro-temporal response $|R_{ij}(\Omega, \omega)|$ for speaker A (left), speaker B (middle), and speaker A+B (right).

7. Discussion and Conclusion

In this work, we showed that the 2-D spectro-temporal Gabor response $R_{ij}(\Omega, \omega)$ has many useful and important properties. In particular, we showed that 1) harmonicity emerges as a pair of very dominant vertical peaks; 2) formants emerge as a pair of vertical peaks spaced closer to the origin; 3) plosive onsets/offsets emerge as a pair of horizontal peaks; and 4) noise emerges as a large number of peaks at multiple orientations and frequencies.

While this paper merely presented initial observations, future work will consist of more systematic exploration of the use of 2-D Gabor spectro-temporal responses for applications such as speech recognition, de-noising, separation, and synthesis.

8. References

[1] T. Chi, P. Ru, and S. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," Journal of the Acoustical Society of America, vol. 118, pp. 887-906, 2005.

[2] F.E. Theunissen, K. Sen, and A. Doupe, "Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds," Journal of Neuroscience, vol. 20, pp. 2315-2331, 2000.

[3] M. Kleinschmidt and D. Gelbart, "Improving word accuracy with Gabor feature extraction," in Proc. ICSLP, 2002.

[4] M. Kleinschmidt, "Localized spectro-temporal features for automatic speech recognition," in Proc. Eurospeech, 2003.

[5] N. Mesgarani, M. Slaney, and S. Shamma, "Speech discrimination based on multiscale spectro-temporal features," in Proc. ICASSP, May 2004.

[6] T. Ezzat, J. Bouvrie, and T. Poggio, "Max-Gabor analysis and synthesis of spectrograms," in Proc. ICSLP, Pittsburgh, PA, 2006.

[7] J.G. Daugman, "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional cortical filters," Journal of the Optical Society of America, vol. 2, pp. 1160-1169, 1985.

[8] T.F. Quatieri, "2-D processing of speech with application to pitch tracking," in Proc. ICSLP, September 2001.
