speech analysis using instantaneous frequency deviation · speech analysis using instantaneous...

4
Speech Analysis Using Instantaneous Frequency Deviation Anthony P. Stark and Kuldip K. Paliwal Signal Processing Laboratory, Griffith School of Engineering Griffith University, Brisbane Queensland 4111, Australia {a.stark, k.paliwal}@griffith.edu.au Abstract In this paper, our aim is to derive a phase spectrum represen- tation computed via the short-time Fourier transform. Specif- ically, we are interested in developing a narrow-band speech representation – employing 20-40 ms analysis windows. Fur- thermore, this representation should be as physically meaning- ful as the magnitude spectrum. To achieve these ends, we con- centrate on instantaneous frequency (IF) derived from the phase spectrum. In doing so, we introduce the IF deviation spectrum, and show that this spectrum exhibits pitch and formant structure similar to the magnitude spectrum. Lastly we demonstrate the advantages of the proposed IF deviation spectrum over the IF distribution spectrum proposed earlier in the literature. Index Terms: phase spectrum, instantaneous frequency 1. Introduction Most speech processing applications are based on the short- time magnitude spectrum, while relatively little attention is paid to the short-time phase spectrum. The aversion toward using the phase spectrum can be accounted for by two primary reasons. Firstly, the phase spectrum is difficult to interpret and process. To extract useful information, the phase spectrum requires significant processing. Furthermore it suffers from many tractability issues, including the phase unwrapping problem [1]. In contrast, the magnitude spectrum requires no such post-processing. Also, the visual cues that manifest within the magnitude spectrum correlate very well with our understanding of speech. Formant and pitch frequency are easily identifiable in the magnitude spectrum. The second reason for avoiding phase can be attributed to several well known perceptual experiments [2, 3, 4], which show marginal phase spectrum intelligibility over short (20-40 ms) win- dow durations. Contrary to these findings, it has recently been shown that stimuli constructed from the short-time phase spectrum can convey intelligibility comparable to its magnitude-only counterpart [5]. This result is supported by many studies which highlight a strong relationship between the two spectra - notably the recovery of magnitude spectrum from phase [6] and vice versa [7]. The present paper is motivated by our desire to further pursue this link, and aims to derive a meaningful magnitude-like spectral representation from the phase spectrum. The short-time phase spectrum is a function of time as well as frequency. While there can be many ways to derive mean- ingful representations from the phase spectrum, two possible ways that come to mind are those that are obtained by taking its frequency derivative or its time derivative. Differentiation along the frequency axis yields group delay and differentiation along the time axis gives instantaneous frequency (IF)[8]. Both group delay and instantaneous frequency are much more meaningful than the unprocessed phase, and both can be tied to physically relevant phenomena [9]. However, in this paper, we choose to restrict our focus to include only the IF spectrum branch of phase processing. Though there are several methods (and interpretations) for calculating IF, we further restrict our focus to include only IF via the short-time Fourier transform (STFT) phase spectrum. The STFT of signal x(t) is [10]: X(ω,t)= Z T 0 x(t + τ )w(τ )e jωτ dτ, (1) where w(t) and T are the window function and window dura- tion, respectively. The short-time IF spectrum ν (ω,t) is given as the time-derivative of the short-time phase spectrum as fol- lows, ν (ω,t)= ∂t ARG [X(ω,t)] . (2) For discrete-time signal processing, Eq. 2 can be given as: ν (ω,n)= ARG [X(ω,n + 1)] ARG [X(ω,n)] . (3) To avoid some of the phase unwrapping problems, we compute IF by using the Kay’s method as follows [11]: ν (ω,n)= ARG [X(ω,n + 1)X (ω,n)] . (4) This method offers a number of advantages over the direct discretization of Eq. 2. Firstly, it requires only a single complex phase (ARG) function. Secondly, it produces IF values within the range of π ν (ω,n) π. The majority of other meth- ods, do not guarantee this constraint – making interpretation of IF values falling outside this interval somewhat problematic. The IF spectrum has been extensively studied in the literature; it has been applied to formant extraction [10, 12], pitch extraction [13, 14] and speech recognition [15, 16, 17]. For the computation of IF spectrum from the STFT, two types of analysis procedures have been reported in the literature: narrow-band analysis and wide-band analysis. For narrow-band analysis, the duration of the analysis window is taken to be 20 to 40 ms. Since the duration is much longer than the typical pitch period of speech, the narrow-band analysis is generally used for pitch-asynchronous analysis (where the beginning of the analysis frame is needed to be synchronized with the pitch epoch). It has been shown in the literature [14, 18] that the IF spectrum obtained from narrow-band analysis contains information about the excitation source, but not about the vocal tract system. As a result, the narrow-band IF spectrum has been Accepted after peer review of full paper Copyright © 2008 ISCA September 22 - 26, Brisbane Australia 2602

Upload: others

Post on 14-Apr-2020

37 views

Category:

Documents


0 download

TRANSCRIPT

Speech Analysis Using Instantaneous Frequency Deviation

Anthony P. Stark and Kuldip K. Paliwal

Signal Processing Laboratory, Griffith School of EngineeringGriffith University, Brisbane Queensland 4111, Australia

{a.stark, k.paliwal}@griffith.edu.au

AbstractIn this paper, our aim is to derive a phase spectrum represen-tation computed via the short-time Fourier transform. Specif-ically, we are interested in developing a narrow-band speechrepresentation – employing 20-40 ms analysis windows. Fur-thermore, this representation should be as physically meaning-ful as the magnitude spectrum. To achieve these ends, we con-centrate on instantaneous frequency (IF) derived from the phasespectrum. In doing so, we introduce the IF deviation spectrum,and show that this spectrum exhibits pitch and formant structuresimilar to the magnitude spectrum. Lastly we demonstrate theadvantages of the proposed IF deviation spectrum over the IFdistribution spectrum proposed earlier in the literature.Index Terms: phase spectrum, instantaneous frequency

1. IntroductionMost speech processing applications are based on the short-time magnitude spectrum, while relatively little attention ispaid to the short-time phase spectrum. The aversion towardusing the phase spectrum can be accounted for by two primaryreasons. Firstly, the phase spectrum is difficult to interpret andprocess. To extract useful information, the phase spectrumrequires significant processing. Furthermore it suffers frommany tractability issues, including the phase unwrappingproblem [1]. In contrast, the magnitude spectrum requiresno such post-processing. Also, the visual cues that manifestwithin the magnitude spectrum correlate very well with ourunderstanding of speech. Formant and pitch frequency areeasily identifiable in the magnitude spectrum. The secondreason for avoiding phase can be attributed to several wellknown perceptual experiments [2, 3, 4], which show marginalphase spectrum intelligibility over short (20-40 ms) win-dow durations. Contrary to these findings, it has recentlybeen shown that stimuli constructed from the short-timephase spectrum can convey intelligibility comparable to itsmagnitude-only counterpart [5]. This result is supported bymany studies which highlight a strong relationship between thetwo spectra - notably the recovery of magnitude spectrum fromphase [6] and vice versa [7]. The present paper is motivatedby our desire to further pursue this link, and aims to derive ameaningful magnitude-like spectral representation from thephase spectrum.

The short-time phase spectrum is a function of time as wellas frequency. While there can be many ways to derive mean-ingful representations from the phase spectrum, two possibleways that come to mind are those that are obtained by takingits frequency derivative or its time derivative. Differentiationalong the frequency axis yields group delay and differentiation

along the time axis gives instantaneous frequency (IF)[8].Both group delay and instantaneous frequency are much moremeaningful than the unprocessed phase, and both can be tiedto physically relevant phenomena [9]. However, in this paper,we choose to restrict our focus to include only the IF spectrumbranch of phase processing.

Though there are several methods (and interpretations) forcalculating IF, we further restrict our focus to include only IFvia the short-time Fourier transform (STFT) phase spectrum.The STFT of signal x(t) is [10]:

X(ω, t) =

Z T

0

x(t+ τ )w(τ )e−jωτdτ, (1)

where w(t) and T are the window function and window dura-tion, respectively. The short-time IF spectrum ν(ω, t) is givenas the time-derivative of the short-time phase spectrum as fol-lows,

ν(ω, t) =∂

∂tARG [X(ω, t)] . (2)

For discrete-time signal processing, Eq. 2 can be given as:

ν(ω,n) = ARG [X(ω, n+ 1)]− ARG [X(ω,n)] . (3)

To avoid some of the phase unwrapping problems, we computeIF by using the Kay’s method as follows [11]:

ν(ω,n) = ARG [X(ω, n+ 1)X∗(ω,n)] . (4)

This method offers a number of advantages over the directdiscretization of Eq. 2. Firstly, it requires only a single complexphase (ARG) function. Secondly, it produces IF values withinthe range of −π ≤ ν(ω, n) ≤ π. The majority of other meth-ods, do not guarantee this constraint – making interpretation ofIF values falling outside this interval somewhat problematic.

The IF spectrum has been extensively studied in theliterature; it has been applied to formant extraction [10, 12],pitch extraction [13, 14] and speech recognition [15, 16, 17].For the computation of IF spectrum from the STFT, two typesof analysis procedures have been reported in the literature:narrow-band analysis and wide-band analysis. For narrow-bandanalysis, the duration of the analysis window is taken to be 20to 40 ms. Since the duration is much longer than the typicalpitch period of speech, the narrow-band analysis is generallyused for pitch-asynchronous analysis (where the beginningof the analysis frame is needed to be synchronized with thepitch epoch). It has been shown in the literature [14, 18] thatthe IF spectrum obtained from narrow-band analysis containsinformation about the excitation source, but not about the vocaltract system. As a result, the narrow-band IF spectrum has been

Accepted after peer review of full paperCopyright © 2008 ISCA

September 22-26, Brisbane Australia2602

used in the past for pitch extraction [13, 14]. The narrow-bandmagnitude spectrum, on the contrary, contains informationabout the excitation source as well as the vocal tract systemand has been used for both pitch and formant extraction [19].For wide-band analysis, the duration of the analysis window istypically taken to be 2 to 4 ms. Since it is much smaller thanthe typical pitch period expected in normal speech, wide-bandanalysis has to be used in a pitch-synchronous manner, other-wise the placement of analysis window with respect to pitchepochs will introduce significant frame-to-frame variability.The resulting IF spectrum contains information about the vocaltract system, but not about the excitation source [10, 20].Friedman [10] has derived a novel representation (known as theIF distribution representation) from the wide-band IF spectrumand shown that this representation provides information aboutthe vocal tract system (i.e., formant structure).

In this paper, our aim is to derive a representation fromthe narrow-band IF spectrum in such a way that it containsinformation about the source excitation as well as the vocaltract system – similar to the narrow-band magnitude spectrum.With this aim in mind, we examine the properties of IFdeviation – the measure of how far each IF value strays fromits center frequency. By taking the inverse absolute valueof the IF deviation, we have developed a novel phase-basedspectral representation. This new representation (referred hereas the IF deviation spectrum) gives clear visual indication ofunderlying pitch and formant structure, and is more correlatedto the narrow-band magnitude spectrum than the current IFdistribution representations.

The rest of this paper is organized as follows. In section 2,we briefly describe Friedman’s IF distribution representation. Insection 3, we describe our proposed a IF deviation based spec-tral representation. And lastly in section 4, we present someconcluding remarks and direction for future research.

2. The IF distribution representationBy noting the fact that the IF values tend to cluster arounddominant frequencies of the signal, Friedman [10] computedthe histogram of IF values (i.e., IF density) across the frequencyaxis. The IF density as a function of the bin’s center frequency,gives rise to the IF distribution spectrum (or, spectral repre-sentation). This representation when used with for widebandanalysis (2-4 ms window duration) was shown to displayformant detection. However, for narrow-band analysis (20-40ms window duration), the IF distribution spectrum loses theformant structure, displaying only the pitch information in theform of fine harmonic detail.

To illustrate this property of the narrow-band IF distributionspectrum1, we take a 32 ms long speech signal of the vowel ayuttered by a male speaker. We show the magnitude spectrum,the IF spectrum and the IF distribution spectrum in Fig. 1 (a),(b) and (c), respectively. We can see from Fig. 1(a) that themagnitude spectrum provides both pitch and formant informa-tion. The IF spectrum shown in Fig. 1(b) has a diagonal stair-case shape with horizontal stairs occurring at pitch harmonics(indicating high density of IF values near dominant harmonic

1For computing the IF distribution spectrum, we use the Hann anal-ysis window originally proposed by Friedman [10].

20

40

60

80

Pow

er (d

B) (a)

π

0

−π

ν(r

ads)

(b)

0 0.1 0.2 0.3 0.4 0.50

1

2

3

Sm

ooth

edIF

−den

sity

Normalized Frequency

(c)

Figure 1: Illustration of IF distribution spectrum for narrow-band analysis of a 32 ms long speech signal of vowel ay froma male speaker sampled at 8 kHz using a 32 ms Hann window.(a) Magnitude spectrum, (b) IF spectrum, and (c) IF distributionspectrum.

frequencies). Lastly, the IF distribution spectrum shown in Fig.1(c) has information about pitch harmonics, but does not exhibitany formant structure. In order to illustrate the spectro-temporalproperties of the IF distribution based spectral representation,we use a speech utterance of the sentence “He knew the skillof the great young actress.” spoken by a male speaker sam-pled at 8 kHz and perform a narrow-band analysis with a 32ms Hann window and 4 ms frame shift. The magnitude andthe IF distribution spectrograms of this utterance are shown inFig. 3 (a) and (b), respectively. Again, note that the magnitudespectrogram displays pitch as well as formant information. Onthe otherhand, the IF distribution spectrogram shows only pitchharmonics and does not exhibit formant structure.

3. Proposed IF deviation representationAs mentioned earlier, our aim in this paper is to derive a repre-sentation from the IF spectrum computed via the narrow-band(20-40 ms long analysis window) STFT. Like the magnitudespectrum, this representation should display the pitch as well asthe formant structure. The IF distribution spectrum describedin the preceding section exhibits pitch structure but not the for-mant structure. In this section, we introduce a new representa-tion based on IF deviation to achieve this aim. We define the IFdeviation ψ(ω, t) as:

ψ(ω, t) = ν(ω, t)− ω. (5)

For discrete-time signals, it becomes:

ψ(ω, n) = ν(ω,n) − ω. (6)

Or following Kay’s method (Eq. 4):

ψ(ω,n) = ARG [X(ω, n+ 1)X∗(ω, n)]− ω

= ARGhX(ω, n+ 1)X∗(ω, n)e−jω

i.

(7)

2603

20

40

60

80

Pow

er (d

B)

(a)

0 0.1 0.2 0.3 0.4 0.5

−10

10

30

50

20lo

(dB

)

Normalized Frequency

(b)

Figure 2: Illustration of IF deviation spectrum for narrow-bandanalysis of a 32 ms long speech signal of vowel ay from a malespeaker sampled at 8 kHz using a 32 ms Chebyshev 50 dB win-dow. (a) Magnitude spectrum and (b) IF deviation spectrum.

We have seen earlier from Fig. 1(b) that the IF spectrum be-haves like a diagonal stair-case, with horizontal stairs occurringat pitch peaks. The IF values basically track the frequencies ofpitch harmonic peaks. A closer inspection of this figure revealsthat the accuracy of the tracking (the IF deviation magnitude) isinversely proportional to the spectral magnitude; i.e., the higherthe magnitude of a harmonic peak, the better the IF value tracksits corresponding harmonic frequency. In other words, the ab-solute value of IF deviation at a particular frequency is inverselyproportional to the value of magnitude spectrum at that fre-quency; i.e.,

1

|ψ(ω, t)|∝ |X(ω, t)|. (8)

Therefore, we use the inverse, absolute IF deviation value todefine a new function,

α(ω, t) = |ψ(ω, t)|−1. (9)

We use the function α(ω, t) at a given time t to define a newspectral representation and call it the IF deviation spectrum2.We should point out that the relationship given by Eq. 8 isnot influenced by scaling |X(ω, t)| globally. Instead it refersto the relative spectral magnitudes within a given analysisframe. Figure 2(b) shows proposed spectrum. We can see fromthis figure that the IF deviation spectrum shows both pitchand formant structure similar to that shown by the magnitudespectrum in Fig. 2(a).

Figure 4(b) shows the IF deviation spectrogram for thesame speech utterance as used in the preceding section. We cansee that this spectrogram exhibits pitch and formant structure ina manner similar to the magnitude spectrogram. Again, com-parisons with Fig. 3 show its advantage over the IF distributionrepresentation in terms of formant structure preservation.

2For computing the IF deviation spectrum, we use the the Chebyshevequiripple windows [22].

4. ConclusionIn this paper we have developed a new spectral representationderived from the STFT phase spectrum. Past IF distributionrepresentations when used for narrow-band analysis were oftenlimited – being ill-suited for formant extraction. By focusingour attention on the IF deviation quantity, we have been ableto develop a narrow-band IF representation that capturesinformation for both excitation source as well vocal tract –mirroring much of the information provided by the magnitudespectrum. Visual inspection of the IF deviation spectrumsuggests its potential to many ASR applications in which theshort-time magnitude spectrum is traditionally used.

As a final note, while this paper has focused on creating amagnitude-like representation from phase, it still remains to beseen if phase can provide useful features not possible from themagnitude spectrum. Deriving such phase-exclusive featuresremains an interesting and challenging problem for the future.

5. References[1] H. Al-Nashi, “Phase unwrapping of digital signals,” Acoustics,

Speech, and Signal Processing [see also IEEE Transactions onSignal Processing], IEEE Transactions on, vol. 37, no. 11, pp.1693–1702, Nov 1989.

[2] M. Schroeder, “Models of hearing,” IEEE, vol. 63, pp. 1332–1350, 1975.

[3] A. V. Oppenheim and J. S. Lim, “The importance of phase in sig-nals,” Proceedings of the IEEE, vol. 69, no. 5, pp. 529–541, May1981.

[4] L. Liu, J. He, and G. Palm, “Effects of phase on the perceptionof intervocalic stop consonants,” Speech Commun., vol. 22, no. 4,pp. 403–417, 1997.

[5] L. D. Alsteris and K. K. Paliwal, “Further intelligibility resultsfrom human listening tests using the short-time phase spectrum,”Speech Communication, vol. 48, no. 6, pp. 727–736, 2006.

[6] M. Hayes, J. Lim, and A. Oppenheim, “Phase-only signal recon-struction,” Acoustics, Speech, and Signal Processing, IEEE Inter-national Conference on ICASSP ’80., vol. 5, pp. 437–440, Apr1980.

[7] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” Acoustics, Speech, and Signal Process-ing [see also IEEE Transactions on Signal Processing], IEEETransactions on, vol. 32, no. 2, pp. 236–243, Apr 1984.

[8] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing.Prentice–Hall, 1975.

[9] B. Boashash, “Estimating and interpreting the instantaneous fre-quency of a signal,” Proceedings of the IEEE, vol. 80, no. 4, pp.520–568, 1992.

[10] D. Friedman, “Instantaneous frequency distribution vs time: Aninterpretation of the phase structure of speech,” in Proc. Interna-tional Conf. Acoustics, Speech, Signal Processing, March 1985,pp. 1121–1124.

[11] S. Kay, “A fast and accurate single frequency estimator,” Acous-tics, Speech, and Signal Processing [see also IEEE Transactionson Signal Processing], IEEE Transactions on, vol. 37, no. 12, pp.1987–1990, Dec 1989.

[12] A. Potamianos and P. Maragos, “Speech formant frequency andbandwidth tracking using multiband energy demodulation.” [On-line]. Available: citeseer.ist.psu.edu/potamianos96speech.html

[13] T. Abe, T. Kobayashi, and S. Imai, “Harmonics tracking and pitchextraction based on instantaneous frequency,” Acoustics, Speech,and Signal Processing, 1995. ICASSP-95., 1995 InternationalConference on, vol. 1, pp. 756–759 vol.1, 9-12 May 1995.

2604

Nor

mal

ized

Freq

uenc

y(a)

0

0.2

0.4

Time (s)

Nor

mal

ized

Freq

uenc

y

(b)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.2

0.4

Figure 3: Illustration of IF distribution spectrogram for a speech utterance sampled at 8 kHz from a male speaker using narrow-bandanalysis with 32 ms Hann window with 4 ms frame shift. (a) Magnitude spectrogram and (b) IF distribution spectrogram representation.

Nor

mal

ized

Freq

uenc

y

(a)

0

0.2

0.4

Nor

mal

ized

Freq

uenc

y

Time (s)

(b)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

0.2

0.4

Figure 4: Illustration of IF deviation spectrogram for a speech utterance sampled at 8 kHz from a male speaker using narrow-bandanalysis with 32 ms Chebyshev 50 dB window with 4 ms frame shift. (a) Magnitude spectrogram and (b) IF deviation spectrogramrepresentation.

[14] F. Charpentier, “Pitch detection using the short-term phase spec-trum,” Acoustics, Speech, and Signal Processing, IEEE Interna-tional Conference on ICASSP ’86., vol. 11, pp. 113–116, Apr1986.

[15] K. Paliwal and B. Atal, “Frequency-related represention ofspeech,” in Eurospeech, 2003, pp. 65–68.

[16] A. Potamianos and P. Maragos, “Time-frequency distributionsfor automatic speech recognition,” Speech and Audio Processing,IEEE Transactions on, vol. 9, no. 3, pp. 196–200, Mar 2001.

[17] Y. Wang, J. Hansen, G. Allu, and R. Kumaresan, “Average instan-taneous frequency (aif) and average log-envelopes (ale) for asrwith the aurora 2 database,” in Eurospeech, September 2003, pp.25–28.

[18] J. Flanagan and R. Golden, “Phase vocoder,” The Bell SystemTechnical Journal, pp. 388–404, 1966.

[19] L. Rabiner and R. Schafer, Digital Processing of Speech Signals.Prentice Hall, 1978.

[20] L. D. Alsteris and K. K. Paliwal, “Short-time phase spectrum in

speech processing: A review and some experimental results,” Dig-ital signal processing, vol. 17, pp. 578–616, May 2007.

[21] L. Cohen, “Time-frequency distributions-a review,” Proceedingsof the IEEE, vol. 77, no. 7, pp. 941–981, Jul 1989.

[22] F. Harris, “On the use of windows for harmonic analysis withthe discrete fourier transform,” Proceedings of the IEEE, vol. 66,no. 1, pp. 51–83, Jan. 1978.

[23] L. Mandel, “Interpretation of instantaneous frequencies,” Am. J.Phys, vol. 42, pp. 840–846, 1974.

[24] D. Gabor, “Theory of communication,” IEE, vol. 93, pp. 429–457,1946.

[25] M. Sun and R. Sclabassi, “Discrete-time instantaneous frequencyand its computation,” IEEE trans. signal processing, pp. 1867–1880, May 1993.

[26] D. Friedman, “Formulation of a vector distance measure for theinstantaneous-frequency distribution (ifd) of speech,” Acoustics,Speech, and Signal Processing, IEEE International Conferenceon ICASSP ’87., vol. 12, pp. 1748–1751, Apr 1987.

2605