
Media Lab Perceptual Computing Technical Report #252

presented at the 126th meeting of the Acoustical Society of America, Denver, October 1993

A simulation of vowel segregation based on across-channel glottal-pulse synchrony

Daniel PW Ellis
MIT Media Lab Perceptual Computing, Cambridge, MA 02139

[email protected]

As part of the broader question of how it is that human listeners can be so successful at extracting a single voice of interest in the most adverse noise conditions (the ‘cocktail-party effect’), a great deal of attention has been focused on the problem of separating simultaneously presented vowels, primarily by exploiting assumed differences in fundamental frequency (f0) (see de Cheveigné, 1993, for a review).

While acknowledging the very good agreement with experimental data achieved by some of these models (e.g. Meddis & Hewitt, 1992), we propose a different mechanism that does not rely on the different periods of the two voices, but rather on the assumption that, in the majority of cases, their glottal pitch pulses will occur at distinct instants.

A modification of the Meddis & Hewitt model is proposed that segregates the regions of spectral dominance of the different vowels by detecting their synchronization to a common underlying glottal pulse train, as will be the case for each distinct human voice. Although phase dispersion from numerous sources complicates this approach, our results show that with suitable integration across time, it is possible to separate vowels on this basis alone.

The possible advantages of such a mechanism include its ability to exploit the period fluctuations due to frequency modulation and jitter in order to separate voices whose f0s may otherwise be close and difficult to distinguish. Since small amounts of modulation do indeed improve the prominence of voices (McAdams, 1989), we suggest that human listeners may be employing something akin to this strategy when pitch-based cues are absent or ambiguous.

1. INTRODUCTION

Although we are exposed to many sounds in everyday life, arguably the most important input to our sense of hearing is the speech of others. As with all hearing tasks, a major obstacle to the recognition and interpretation of such sounds is the isolation of the signal of interest (e.g. a particular spoken phrase) from any simultaneous interfering sound. These may be extraneous noise such as traffic or ringing telephones, or they may be other voices which are of less interest than the particular object of our attention. This latter case is particularly common, and in signal detection terms it is the most difficult: isolating a target from noise with the same average characteristics. This makes it difficult to design a mechanism to remove the interference from the target.

Yet this is something we do with an almost incredible effectiveness. The loosely-defined set of processes referred to as the ‘Cocktail-Party Effect’ enable us to engage in spoken communication in the most adverse circumstances.

Although this process has many parts, none of them perfectly understood, we can list a few of the major techniques presumably employed in such circumstances:

(a) Visual verbal cues such as lip-reading;

(b) Other visual cues such as facial expression;

(c) Context, limiting the range of interpretations that can possibly be placed upon other evidence;

(d) Acoustic information isolated by spatial cues (interaural time differences and interaural spectral intensity differences). This is most often what is suggested by the ‘Cocktail Party Effect’;

(e) Acoustic information isolated from the total sound mixture on the basis of cues other than those related to spatial location.

This last category, though defined in a rather indirect manner, is of considerable interest since it is the only method available (apart from semantic context) when visual and spatial information is removed, for instance


over a telephone link or listening to a radio broadcast. Experience in such situations reveals that we are able to do a reasonable job of ‘hearing out’ individual voices without binaural or other cues.

There has been considerable interest in and research into this problem in the past few decades, motivated by the potential usefulness of devices that could enhance ‘desired’ but corrupted speech, as well as perhaps by curiosity about the strategy employed by our own perceptual mechanisms. One popular line of inquiry starts with the observation that a major attribute of much speech is its pseudoperiodic nature: ‘voiced’ speech is perceived as having a pitch, which corresponds to the approximate periodicity of the sound pressure waveform. When we are presented with a mixture of voiced speech from two or more independent sources, we can perhaps exploit the fact that the different voices are likely to have distinct pitches as a basis for attending to one and discarding the rest. Informal introspection reinforces this idea of pitch as a primary organizing attribute in speech.

While a number of different approaches have been taken to the problem of automatically separating voices based on the differences in their periodicity, this paper describes what we believe to be a new, additional method to help in separation. We have been particularly struck by the enhanced prominence of sounds with slight frequency modulation compared to completely unmodulated tones, as very plainly demonstrated by McAdams (1984, 1989). Considering the frequency modulation characteristics of natural speech sounds, we speculated that there may be mechanisms in the auditory system that are able to detect the short-term cycle-to-cycle fluctuations in the fundamental period of real speech, and use these to help separate distinct, simultaneous voices. We will describe the algorithm that we have developed to exhibit these qualities, and its effectiveness at this task, which exceeded our preliminary expectations. Due to its dependence on variability over a very short time scale, this approach may be considered a time-domain algorithm in contrast to the harmonic-tracking frequency-domain approaches which have been popular.

In section 2, we discuss the nature of pitch-period variation in real speech as motivation for the new approach. In section 3, we briefly review some previous work in vowel separation, then explain the basis of our new technique. Section 4 gives some examples of the preliminary results we have obtained. The issues raised by the model are discussed in section 5. We conclude in section 6 with suggestions concerning how this work might be further validated and developed.

2. A MOTIVATION — PITCH-PULSE VARIATION IN REAL SPEECH

McAdams made a very compelling demonstration of the capacity of the auditory system to segregate vowels based on differences in fundamental frequency (McAdams 1984, 1989). Three different synthetic vowels with different fundamental frequencies are mixed together. The resulting sound is a dense chord with no obvious vowel identity. However, by adding frequency modulation at a rate of a few Hertz and a depth of a few percent to any one of the vowels, it can be made to ‘jump out’ of the mixture, gaining a very clearly discernible pitch and phonemic identity. This demonstrates that the auditory system is able to hear out different vowels without any visual, spatial or contextual cues, but that some other cue, such as frequency modulation, is needed to identify the spectral energy belonging to a particular voice.

This result has generated a great deal of interest. In the context of ‘Auditory Scene Analysis’, it is interpreted as the common frequency modulation attribute of the distinct spectral regions causing the vowel to be grouped together. This suggests a mechanism where modulation rate is calculated for each frequency channel in the auditory system and compared across channels for matches. However, experiments by Carlyon (1991) showed the situation to be more complex, since prominence due to modulation can only be obtained for harmonically-related partials, not for inharmonic complexes. Still other results by Summerfield & Culling (1992) found that, while modulation of one vowel against a static background aided identification, if both target and background are modulated, even at different rates, the segregation falls back to that achievable with static target and masker. Thus the rules governing the benefit of frequency modulation must be quite complex to account for these results.

Nevertheless, the original McAdams demonstration remains a fascinating piece of evidence. While there are doubtless several different reasons why a modulated tone might be more ‘prominent’ than a static one, we were interested in the possibility that there might be grouping or fusion mechanisms in the auditory system that actually worked better in the presence of modulation than on purely static tones. This speculation is based in part on the informal observation that even for nominally unmodulated tones (such as a slowly-pronounced word, or a sung musical note), there is a significant qualitative difference between the sound produced by a human speaker and the sound of an absolutely static machine-generated tone, although their spectral envelopes may be closely matched. The major factor in this distinction is that even when producing a steady pitch, there is a certain amount of cycle-to-cycle variation in human vocalization which is absent in the machine-generated version. Thus we were curious to investigate the possibility of an auditory grouping scheme that might be particularly well suited to handling this kind of natural sound, where every cycle has some variation. We sought to develop a model of a mechanism of this kind.

If we are interested in a process that is sensitive to the short-term fluctuations found in natural speech, it is useful to have some quantitative results describing the nature of this variation, which McAdams called ‘jitter’. We recorded some examples of human-produced steady vowels (where the speaker was attempting to produce an unmodulated sound). We then extracted the cycle lengths for these sounds using a matched filter based on an ‘average’ pitch cycle waveform to locate a fixed reference point in each cycle. Since we used the vowel /ah/ with a relatively flat spectrum, it was possible to extract cycle lengths with good accuracy.
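The cycle-length extraction just described can be sketched in a few lines of Python. This is illustrative only: it assumes a pre-selected template waveform (the ‘average’ pitch cycle) and a known lower bound on the period, and the greedy peak-picking strategy is our own choice rather than anything specified in the text.

```python
import numpy as np

def extract_cycle_lengths(x, template, sr, min_period_s=0.005):
    """Matched-filter pitch-mark extraction: cross-correlate the waveform with
    an 'average cycle' template, keep correlation peaks at least one minimum
    period apart, and return the spacing between successive marks in seconds."""
    score = np.correlate(x, template, mode='valid')
    min_gap = int(min_period_s * sr)
    # Greedy peak picking: take the strongest points first, enforce spacing.
    order = np.argsort(score)[::-1]
    marks = []
    for idx in order:
        if score[idx] <= 0:
            break
        if all(abs(idx - m) >= min_gap for m in marks):
            marks.append(idx)
    marks = np.sort(np.array(marks))
    return np.diff(marks) / sr
```

The percentage modulation contour then follows as `100 * (lengths - lengths.mean()) / lengths.mean()`.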

The result was a series of cycle times with peak variation of about 2% of the average length. Treating these as samples of a continuous modulation function (a slight approximation since the sampling was not exactly uniform), we were able to calculate the spectrum of the modulation. A typical modulation contour and its spectrum are shown in figure 1.

We see that the modulation is broad-band. While there is a strong low-frequency component to the pitch variation peaking at around 3 Hz (which we might call vibrato or pitch modulation, since the tendencies extend across enough cycles to be perceived as a change in pitch), there is almost as much variation on a cycle-to-cycle basis, with a largely white (flat) spectrum. This is the particular component to which we were referring by the term ‘jitter’ above.

Smoothing (i.e. low-pass filtering) the modulation contour produces a pitch modulation curve, and the difference between this and the actual modulation contour can be thought of as the high-frequency (6-30 Hz) jitter residual. We experimented with constructing new vowels based on combinations of different amounts of both these components, resynthesizing vowels that varied from having no modulation at all, to vowels where either or both of the modulation components were exaggerated by a factor of up to eight. Informal listening suggested that while either type of modulation increased the naturalness of the synthetic vowels, it was only the combination of them both — i.e. a broadband cycle length variation — that sounded truly natural and non-mechanical. (Other research describing the nature of modulation in natural voicings includes the work by Cook (1991) on singing voices.)
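As a rough illustration of this decomposition, the sketch below splits a cycle-length contour into a slow component and a jitter residual and lets each be rescaled independently before resynthesis. The moving-average smoother and its 15-cycle window are stand-ins for whatever low-pass filter was actually used; both are assumptions made for illustration.

```python
import numpy as np

def split_modulation(cycle_lengths_s, smooth_cycles=15):
    """Split a cycle-length contour into a slow 'pitch modulation' component
    (moving-average smoothing, standing in for a low-pass filter) and the
    residual high-frequency 'jitter'."""
    kernel = np.ones(smooth_cycles) / smooth_cycles
    slow = np.convolve(cycle_lengths_s, kernel, mode='same')
    jitter = cycle_lengths_s - slow
    return slow, jitter

def rescale_modulation(cycle_lengths_s, pitch_mod_gain=1.0, jitter_gain=1.0):
    """Rebuild a cycle-length contour with the two components scaled
    independently (0 = remove, up to ~8 = exaggerate, as in the text)."""
    mean = cycle_lengths_s.mean()
    slow, jitter = split_modulation(cycle_lengths_s)
    return mean + pitch_mod_gain * (slow - mean) + jitter_gain * jitter
```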

This result is significant because the experiments of Carlyon and Summerfield that measured the impact of ‘frequency modulation’ on various perceptual abilities used simple sinusoidal modulation, in order to have stimuli that were easy to describe, generate and control. However, it might be that the auditory system has evolved to make use of some of the properties of natural sounds that were not well duplicated by such narrowband modulation, such as the short term variation we have termed jitter. If this is the case (and we will argue that the main model of this paper has this property) then it may be that these experiments have involved stimuli that prevented the full segregation benefits of modulation from being obtained.

A possible distinction between slow frequency modulation and jitter stimuli

By way of supporting our suggestion that complex tones with low frequency (≤ 8 Hz) narrowband frequency modulation might not allow the optimum performance of modulation-detecting mechanisms in the auditory system, we will sketch an idea of the kind of processing that might find broadband jitter-like modulation easier to segregate than smooth fm. Consider a system that can detect period changes between adjacent cycles that exceed some fixed percentage. With sinusoidal modulation, the peak adjacent cycle-length variation occurs at the maximum slope, i.e. at the zero crossing of the modulation function. Broadband ‘white’ modulation - for instance, samples from a Gaussian distribution - will achieve the same level of cycle-to-cycle modulation with a far smaller ‘mean’ variation; however, if the threshold for detecting adjacent-cycle variation is bigger than the mean variation, steps of this size will occur relatively infrequently, with the expected time between such extreme modulation events depending on how far up the ‘tail’ of the distribution the threshold lies.
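A minimal numerical sketch of this argument: generate cycle-length modulation of equal standard deviation under the two schemes and count how often the adjacent-cycle change exceeds a fixed threshold. The modulation rate, thresholds and normalization below are illustrative assumptions; the point is simply that the sinusoid's step sizes are bounded, so detection is essentially all-or-nothing as the threshold passes that bound, whereas Gaussian jitter produces occasional large steps whose rate varies smoothly with the threshold.

```python
import numpy as np

def detection_rate(cycle_variation, threshold):
    """Fraction of adjacent-cycle steps whose change exceeds the threshold."""
    steps = np.abs(np.diff(cycle_variation))
    return float(np.mean(steps > threshold))

rng = np.random.default_rng(1)
n_cycles = 10000
sd = 1.0                                    # normalize both modulations to unit sd
sinusoidal = np.sqrt(2) * sd * np.sin(2 * np.pi * np.arange(n_cycles) * 0.03)
gaussian = sd * rng.standard_normal(n_cycles)

for thr in (0.1, 0.5, 1.0, 2.0):
    print(thr, detection_rate(sinusoidal, thr), detection_rate(gaussian, thr))
```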

Now consider an auditory grouping unit that is observing various partials and trying to organize them into separate sounds. The aspect of this unit crucial to the current argument is that it accumulates evidence across time. In particular, if it has evidence that certain partials should be grouped together in time slice $t_n$, it can remember this evidence and continue to group the partials involved for many subsequent time steps, even though no further evidence may be observed. (In the interests of robustness, this association should ‘wear off’ eventually, in some unspecified fashion).

figure 1: Cycle length modulation function extracted from real speech (a sustained /ah/), and its Fourier transform. [Left panel: ‘/ah/ cycle lengths (223 cycles)’, cycle length in ms against time in s; right panel: ‘Fourier transform of /ah/ cycle length variation’, level re dc in dB against modulation rate in Hz.]

We present this unit with a combination of two modulated vowels. If the vowels are sinusoidally modulated to a sufficient depth that the adjacent-cycle-variation detector is triggered, or if the sensitivity of this detector is adjusted to trigger at some minimum rate, this triggering will occur for a significant chunk of every half cycle (since the second derivative of this modulation, the rate at which the cycle-to-cycle variation changes, is smallest when the variation itself is at a maximum). We might imagine that the modulation detector is overloaded, and since both vowels are highly modulated for quite a large proportion of each cycle, it may be harder to find disambiguating time-slices where the harmonics of only one of the vowels are highly modulated, thereby establishing the independent identity of that vowel. The size of the hypothetical time slice used for evidence collection relative to the modulation period will be a factor in this situation.

By contrast, if the modulation of the two vowels is according to a white Gaussian jitter function, then the threshold of the adjacent-cycle-variation detector can be adjusted to provide events that are sufficiently infrequent to allow the evidence observer a clear look at each event, yet sufficiently regular that enough evidence is collected to maintain a confident arrangement of the stimuli across time.

This arises because of the different probability distribution of instantaneous cycle-to-cycle variation under the two schemes, as illustrated in figure 2 below: the sinusoidal modulation is sharply bimodal, making it very difficult to adjust a threshold to achieve a particular density of cycle-variation events. By contrast, the Gaussian modulation has long, shallow tails, making such a task far easier.

This example is intended merely to illustrate a simple scenario that highlights an important difference between narrowband and broadband cycle length modulation. This example is not a particularly good description of the model that is the main subject of this paper, since the tracking of individual components is a very inappropriate way to consider the perception of formant-structured sounds like vowels. However, the general idea of irregularly spaced disambiguation events is central to our model, and will be re-introduced later.

figure 2: Examples of modulation patterns according to sinusoidal and Gaussian functions, along with their theoretical distribution functions. [Panels: ‘Sinusoidal cycle length variations (sd = 1.0)’ and ‘Gaussian cycle length variations (sd = 1.0)’, normalized variation against an arbitrary timebase, each accompanied by its probability distribution.]


3. THE GLOTTAL-PULSE SYNCHRONY MODEL AS AN ALTERNATIVE TO PERIODICITY METHODS

A brief review of previous work in double vowel separation

Vowel separation seems a tantalizingly tractable problem, since the presumed different fundamental frequencies should form a powerful basis for segregation. Such a technique was described by Parsons (1976), which sought to separate two voices by identifying both pitches and using this to segregate the harmonics belonging to each talker in a narrow-band short-time Fourier transform.

de Cheveigné (1993) describes this and many subsequent models based both on that kind of frequency domain representation, and on equivalent time-domain processing such as comb filters, which he prefers largely for reasons of physiologic plausibility. He classifies the processing performed by these methods as either enhancement (where the target is identified and boosted relative to the interference) or cancellation (where the interfering voice is identified and subtracted). He proposes a model of the latter, including hypothetical neural circuitry, on the basis that signal separation is most useful when the interference, but not the target, is readily identified.

Some recent research has focused on modeling the separation strategies employed by the auditory system, which is also our objective in the current paper. Some approaches seek to duplicate the behavior of human subjects in given psychoacoustic tests. The models of Assmann and Summerfield (1989, 1990) match human identification rates for static vowel pairs by finding vowel pitches and extracting the harmonic spectra, but work within the constraints of an ‘auditory filterbank’ (Moore & Glasberg, 1983). Since this filterbank is unable to resolve the individual high-frequency harmonics, this information is extracted from a subsequent autocorrelation stage at the lag corresponding to the appropriate pitch.

The Meddis & Hewitt (1992) model similarly consists of an auditory filterbank followed by a nonlinear ‘inner hair cell’ model providing rectification and dynamic range compression and generating auditory nerve ‘firing probabilities’ which are autocorrelated in each channel. Their vowel separation strategy sorted channels of the filterbank according to the dominant periodicity observed in the autocorrelation, i.e. the ‘pitch’ in that particular channel. Autocorrelation provides a particularly effective mechanism for projecting the shared fundamental periodicity in different frequency channels onto a common feature, since every formant of a given vowel will have a peak in its autocorrelation centered at the lag corresponding to the period, reflecting the fact that the vowel sound, and hence any subrange in its spectrum, is periodic in that cycle time. In the Meddis & Hewitt model, the dominant periods from each channel are pooled to provide the pitch estimates, and then individual channels are attached to one or other of the detected pitches accordingly. This model was capable of very close agreement with the performance of human listeners for identification rate as a function of fundamental frequency separation of static vowels.

Summerfield and Assmann (1991) attempted to isolate which particular consequence of pitch difference aided the separation of two voices. By careful control of each harmonic, they constructed artificial stimuli which exhibited either harmonic frequency separation or formant burst temporal separation without most other attributes of double vowels. They found that neither of these qualities was sufficient to obtain the gain in recognition accuracy afforded by conventional synthesis of vowels with 6% frequency difference. However, they did not experiment with modulation of the artificial vowel spectra, which would have been particularly relevant to this paper.

All the methods make a pseudoperiodic assumption for the input signal, i.e. they treat the signal as perfectly periodic, at least over some short analysis window. (However, the time-varying ‘correlograms’ of Duda et al (1990) are intended to exploit the dynamic nature of the sound as a basis for separation). All the separation procedures involve the explicit extraction of the pitch of one or both vowels. These aspects are very reasonable, since aperiodic ‘vowels’ (a formant synthesizer excited by a free-running Poisson process perhaps) would not be perceived as voice-like. However, we were curious about the possibility of techniques for detecting and extracting sounds that did not assume perfect periodicity over some analysis window, with the attendant loss of efficiency when handling real, jittered, sounds — particularly as such real sounds are sometimes easier to segregate than their truly periodic imitations.

The new method - detecting formants by glottal-pulse synchrony (GPS)

The central idea behind the algorithm we are going to describe is that the different regions of spectral energy that originate in a particular voice may be grouped together by observing the time synchronization of different formant regions excited by the same glottal pulse. Whereas methods such as those mentioned above have exploited features in the Fourier domain or the autocorrelation arising from the periodic nature of glottal excitation, we are interested in grouping the energy associated with each individual pulse, not the combined effect of successive pulses. This follows naturally from our interest in variations between individual cycles, but it seems to preclude us from using the strong cues of periodicity normally employed. We will explain how we can manage without them below, but first we review some important aspects of voiced sounds upon which the new method relies.

The method of production of voiced sounds in people is approximately as follows: air pressure from the lungs is obstructed by the vocal cords in the glottis, which periodically open and flap shut in a form of relaxation oscillator. The resulting pulses of air flow form an excitation to the upper throat, mouth and nose cavities which act as resonators, so that the resulting pressure radiation has the same periodicity as the glottal pulses, but with a dynamically modified spectrum. By moving the tongue and other articulators we change the nature of this spectrum and thereby produce the different vowels of speech. Mathematically, we may describe this production in the well-known source-filter formulation:


$$s(t) = e(t) \ast v(t) \qquad (1)$$

where $s(t)$ is the radiated voice signal, $e(t)$ is the glottal pulse excitation, and $v(t)$ is the slowly-varying vocal tract transfer function, here approximated as a time-invariant impulse response. $e(t)$ is well approximated as an impulse train, at least over a bandwidth of a few kilohertz, yielding:

$$e(t) = \sum_i A_i\, \delta(t - t_i) \qquad (2)$$

where $\{t_i\}$ are the near-periodic instants of glottal pulsation, each of amplitude $A_i$, and thus

$$s(t) = \sum_i A_i\, v(t - t_i) \qquad (3)$$

i.e. the output waveform is the superposition of individual vocal tract impulse responses shifted in time by the glottal pulses. Ignoring the nasal branch, the vocal tract can be modeled as a single concatenation of tubes of varying cross-section, which results in an all-pole structure for the filter's Laplace transform,

$$V(s) = \frac{K}{\prod_j (s - s_j)} \qquad (4)$$

or, equivalently, an impulse response consisting solely of damped resonances:

$$v(t) = \sum_j a_j \exp(-t/\tau_j)\cos(\omega_j t + \phi_j) \qquad (5)$$

where the resonant frequencies $\{\omega_j\}$, amplitudes $\{a_j\}$ and decay times $\{\tau_j\}$ are determined by the geometry and losses associated with the resonant cavities in the vocal tract.
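Equations (2)-(5) translate directly into a small synthesis sketch: superpose copies of a damped-resonance impulse response at jittered glottal instants. The formant frequencies, decay times, amplitudes, f0 and 2% jitter below are illustrative values chosen for this sketch, not parameters taken from the paper.

```python
import numpy as np

def damped_resonance_ir(freqs_hz, taus_s, amps, sr, dur_s):
    """Vocal-tract impulse response v(t) as a sum of damped cosines (eq. 5)."""
    t = np.arange(int(dur_s * sr)) / sr
    v = np.zeros_like(t)
    for f, tau, a in zip(freqs_hz, taus_s, amps):
        v += a * np.exp(-t / tau) * np.cos(2 * np.pi * f * t)
    return v

def synthesize_vowel(sr=16000, dur_s=0.5, f0_hz=110.0, jitter_pct=2.0,
                     formants_hz=(700, 1100, 2500), taus_s=(0.004, 0.003, 0.002),
                     amps=(1.0, 0.7, 0.4), seed=0):
    """Superpose time-shifted impulse responses at jittered glottal instants (eqs. 2-3)."""
    rng = np.random.default_rng(seed)
    v = damped_resonance_ir(formants_hz, taus_s, amps, sr, 0.03)
    s = np.zeros(int(dur_s * sr) + len(v))
    t_i = 0.0
    while t_i < dur_s:
        idx = int(round(t_i * sr))
        s[idx:idx + len(v)] += v             # one impulse response per glottal pulse
        period = (1.0 / f0_hz) * (1.0 + (jitter_pct / 100.0) * rng.standard_normal())
        t_i += period                         # jittered cycle length
    return s[:int(dur_s * sr)]

vowel = synthesize_vowel()
print(vowel.shape, float(np.max(np.abs(vowel))))
```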

If we now look at the time waveform and time-frequency analysis (spectrogram) of some example speech, we can see the kinds of features we would expect according to this analysis. Figure 3 shows 100 milliseconds (about 10 cycles) of a male voice pronouncing the vowel /ah/. The top panel is the pressure waveform as a function of time, showing the large amplitude burst at the start of each cycle, as the vocal tract is initially excited by a release of pressure through the glottis. This then decays away approximately exponentially in a complex mixture of resonances until the start of the next cycle. The lower panel shows a wideband short-time Fourier analysis on the same time scale, with a vertical (linear) frequency scale, and gray density showing the energy present at each time-frequency co-ordinate. Since the analysis window was much shorter than the cycle length, we do not see any of the harmonics of the pseudoperiodic voice, but rather we see a broadband stripe of energy at each glottal pulse. This stripe varies in intensity with frequency, showing peaks at (in this case) around 1000, 4000 and 7000 Hz; these are the center frequencies of the vocal tract resonances we have described above, normally known as the formants of the voiced speech.

The decomposition of the voice into a succession of time-shifted impulse responses implied in (3) above leads us to consider this same display applied to just one glottal pulse. The signal in figure 4 was generated by exciting a 10-pole LPC model of the voice above with a single impulse. What we see is an approximation to an isolated stripe out of the pattern on a somewhat stretched time-base. It is not hard to imagine reconstructing the entire pattern by superposing time-shifted copies of this lone example, as suggested by equation 3.

In the context of this presentation of speech sound, the idea behind the glottal-pulse synchrony (GPS) technique is simple to explain: the problem of extracting a voice from interfering sound is essentially the problem of identifying the formant peaks of that voice, since the particular phoneme being conveyed is encoded in the frequencies of those peaks. If there are numerous energy peaks in the received spectrum, how can we segregate target and interference? In light of the images and analysis above, perhaps we can group together all the formants that belong to a particular voice by detecting the way that they are all time-aligned in vertical stripe structures, which is to say that they are all synchronized to the same pulse of glottal energy exciting the speaker’s vocal tract. If we can at least organize the composite sound into disjoint sets of formants arising from different speakers, our segregation problem is simplified into choosing which of these is the target voice, and which are interference. While this process is clearly easier to postulate than to accomplish, it is the basis of the technique we have developed, which we will describe in more practical terms presently. Specifically, our algorithm detects sets of spectral regions that appear to exhibit synchronized peaks in energy, as if in response to a common excitation. Although these associations exist only over a very brief time, it is possible to use them to separate out voices across time by imposing weak continuity constraints on the formants in a voice to extend these near-instantaneous spectral associations across time into complete phonemes or words.

figure 3: Wideband spectrogram of a section of the long vowel /ah/. Time goes left to right; the top panel shows the actual waveform; the bottom panel has (linear) frequency as the vertical axis, and the gray density shows the amount of energy at each time-frequency point.

figure 4: As figure 3, but showing the impulse response of an all-pole model based on the /ah/ vowel. Note the expanded timescale (22 milliseconds across the figure compared with 170 ms in figure 3).

It is interesting to contrast this approach with autocorrelation methods, in particular the vowel segregation model of Meddis and Hewitt (1992) (MH), since their model was the inspiration for this work and shares several components (thanks to their generous policy of making source code available). Both models start with an approximation to the cochlea filterbank followed by an element that calculates the probability of firing of a given inner hair cell on the Basilar Membrane. The MH model then calculates the autocorrelation of this probability function in each channel to find the dominant periodicity within each channel. Thus each channel is first processed along the time dimension to measure the period between moments of high firing probability, and only then are the channels that share a given periodicity grouped across frequency to form the complete vowel. (In the two vowel case, this is done for the ‘dominant’ periodicity or autocorrelation peak of the entire population; the remaining channels are presumed to belong to the other vowel). However, in the GPS method the first dimension of processing for each burst is across frequency to find other, co-synchronous events. Only after grouping on this basis are the spectra traced through time to form more complete sounds. Thus there is a contrast between the order of across-frequency and along-time processing between the two models.

The importance of this difference becomes apparent when one considers the processing of signals with significant cycle-to-cycle variation (such as the jitter discussed in section 2). In order to obtain a reliable and consistent estimate of the periodicity in a given channel, the MH autocorrelation function must have a time window extending over several cycles. But variation within those cycles will compromise the calculation of a single peak periodicity; since the MH method is rather sensitive to the form of the peaks in the autocorrelation, any jitter in the input signal can only serve to hinder their grouping scheme by blurring the period estimates for each channel. This appears to be at odds with the enhanced perceptual prominence and naturalness conferred by moderate jitter in signals.

On the other hand, the GPS method is quite impervious to the spacing between successive pulses, and any variation that might occur. All it relies upon is the consistent alignment of the formant bursts on the individual glottal pulses, and sufficiently frequent bursts to allow tracking through time. In this way, cycle length variation will not compromise the detection of the voice. Indeed, as we will argue below, it can be a strong cue to the segregation between two close voice sounds.

Synchrony skew

We have suggested an algorithm which distinguishes a voice burst from background noise by the consistent time alignment of each of its spectral formant peaks. Implicit in this is the assumption that ‘time alignment’ denotes exact synchronization, and if this was in fact the nature of the signal, the grouping of formants might be a simple matter: whenever a formant burst is detected, simply look across the spectrum for other simultaneous bursts. However, the real situation is not so simple, as formant bursts at different frequencies, despite originating in a single compact glottal event, may be quite widely dispersed in time before they reach the auditory processing centers. Some of the factors contributing to this dispersion are:

(a) The phase spectrum of the vocal tract. Even the simple all-pole cascade model can introduce significant dispersion across frequency;

(b) The dispersive effects of the transmission path from speaker to ear, including effects of reflections and diffractions;

(c) Frequency or place-dependent delays in the function of the ear, most notably the propagation delay down the Basilar membrane.

While (c) should be constant and is doubtless subject to static compensation in the auditory system, it does strike a final blow against the superficially attractive idea that we might be able to implement the GPS algorithm without resorting to delay lines. But in any case the unpredictable and very significant effects of (a) and (b) mean that no simple solution will suffice; rather than simply causing simultaneous formant bursts in different channels, the cue of common glottal excitation will appear as a constant, often small, but initially unknown time skew between the channels. Thus the problem is complicated into finding both the channels that carry bursts in a synchronized pattern, and the interchannel timing relations that constitute that pattern.

Evidently, we will not be able to deduce these timings by observing a single vowel burst, even if we make assumptions about the maximum possible interchannel timing difference; we cannot distinguish between formants of the target voice and energy peaks from the interfering noise which happen to fall inside our window, suggesting a completely spurious formant. (In the two-vowel situation, such an occurrence is very likely.) The solution is to store all the timings implied by the energy bursts in the window as potential components of a formant spectrum, but to refrain from actually using them as the basis for a separately grouped voice until the timings have been confirmed by repeated observations.

This process may be described more formally as seeking to estimate the conditional probability of a firing in one channel at a given timing relative to a firing in another channel. If peripheral channel $n$ exhibits firing events at times $\{t_{n,i}\}$, we can define a ‘firing function’ for that channel:

$$F_n(t) = \begin{cases} 1 & t_{n,i} \le t < t_{n,i} + \Delta t \text{ for some } i \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

(where $\Delta t$ accounts for the temporal resolution of the detection system). We would like to estimate the conditional probability of a firing occurring in channel $m$ at a specific time-skew $\tau$ relative to a firing in channel $n$, i.e.

$$PF(n,m,\tau) = \Pr\{\, F_n(t+\tau) = 1 \mid F_m(t) = 1 \,\} \qquad (7)$$

If we have a set of formants at different frequencies being excited by the same glottal pulses, we would expect this conditional probability to be very high for the pairs of channels involved at the appropriate relative timings. Thus the approach undertaken by the GPS method is to continually estimate this conditional probability for many channel combinations and time skews to find the pairs of channels that do exhibit a high correlation by this measure.

Since the vocal tract and hence the formant spectrum of a given voice are dynamic, the function $PF(n,m,\tau)$ is time-varying and can be estimated at a given instant based on nearby events. We can calculate the estimate at a particular time $T$ as the ratio of coincidences to the total time the condition was true:

$$\widehat{PF}_T(n,m,\tau) = \frac{\int e^{-|T-t|/\alpha}\, F_n(t+\tau)\, F_m(t)\, dt}{\int e^{-|T-t|/\alpha}\, F_m(t)\, dt} \qquad (8)$$
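For concreteness, a discrete-time reading of equation (8) might look like the sketch below, with the firing functions represented as 0/1 arrays sampled at some convenient rate. The handling of the τ shift at the array boundary and the restriction to τ ≥ 0 are our own simplifications, not details from the paper.

```python
import numpy as np

def estimate_pf(F_n, F_m, tau_samples, T_idx, alpha_samples):
    """Discrete-time version of equation (8): exponentially weighted ratio of
    coincidences (channel n firing tau samples after channel m) to the total
    weighted time channel m was firing. F_n and F_m are 0/1 arrays."""
    t = np.arange(len(F_m))
    w = np.exp(-np.abs(T_idx - t) / alpha_samples)
    Fn_shifted = np.roll(F_n, -tau_samples)        # F_n(t + tau)
    Fn_shifted[len(F_n) - tau_samples:] = 0        # discard wrap-around (tau >= 0)
    num = np.sum(w * Fn_shifted * F_m)
    den = np.sum(w * F_m)
    return num / den if den > 0 else 0.0
```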


In equation (8), each event has been weighted by $e^{-|T-t|/\alpha}$, which increases the influence of events close to the time for which we are evaluating the estimated probability. The time constant for this simple weighting function is $\alpha$, which must be large enough to span several pitch cycles if we are to include a sufficient number of events in the basis for our estimate to make it reasonably stable, while still remaining short enough to allow the formant spectrum to be considered invariant for those events. (Note however that the stability of the pitch period over this window has practically no influence on the result, as we desire). Thus $\alpha$ is around 10-20 milliseconds, larger than any $\tau$ we will use, and far larger than $\Delta t$. We may simplify this calculation as the sampling of one firing function $F_m$ at the firing times of the other channel, $\{t_{n,i}\}$, normalized to the total number of firings:

$$\widehat{PF}_T(n,m,\tau) \approx \frac{\sum_i e^{-|T-t_{n,i}|/\alpha}\, F_m(t_{n,i}-\tau)}{\sum_i e^{-|T-t_{n,i}|/\alpha}} \qquad (9)$$

If we assume that the estimate at time $T$ can only look backwards in time, i.e. depend on events for which $t_{n,i} < T$, and if we only update our estimate each time we get a conditioning event, i.e. at each of the $\{t_{n,i}\}$, we can recursively calculate $\widehat{PF}_{t_{n,i}}(n,m,\tau)$ from its previous value, $\widehat{PF}_{t_{n,i-1}}(n,m,\tau)$, as:

$$\widehat{PF}_{t_{n,i}}(n,m,\tau) = F_m(t_{n,i}-\tau) + e^{-(t_{n,i}-t_{n,i-1})/\alpha}\cdot \widehat{PF}_{t_{n,i-1}}(n,m,\tau) \qquad (10)$$

This is effectively the calculation we perform with the histogram ‘plane update’ described in the next subsection.

The complete Glottal Pulse Synchrony strategy for grouping vowels

We have now explained the principle of operation of the model, and can present the functional structure in detail. Figure 5 shows the entire model in block-diagram form, from the acoustic signal at the left to a resynthesized isolated vowel on the right. As noted above, the first two stages are taken from the MH model, but the remainder is different. We explain each block in turn.

figure 5: Block diagram of the complete Glottal Pulse Synchrony (GPS) vowel separation scheme. [Signal flow: vowel mixture → gammatone filterbank → bandpass signals → inner hair cell model → firing probabilities → ‘firing’ event generator → formant burst events → co-synchrony detector → conditional probability matrix → formant clustering → formant signatures → continuity tracking → time-frequency mask → mask & resynthesis → vowel.]

The first stage simulates the frequency analysis of the cochlea with a linear filterbank based on the ‘gammatone’ approximation to auditory filter shapes (Patterson & Moore, 1986, as referenced in Meddis & Hewitt, 1991). We used a filter spacing of between two and five per ERB unit, over a range 100 Hz to 4 kHz - on the order of 60 overlapping channels. The filters are real (not complex), and the outputs are undecimated, so the data carried to the next stage is a set of bandpass signals, one per filter.

The next stage is the MH cochlea inner hair cell model, nominally translating the basilar membrane displacement (the output of the linear filter) into a probability of firing for the associated nerve fiber. This stage approximates the rectification and adaptation effects observed in the firings of actual auditory nerves in response to acoustic stimuli.

Following that is our pseudo-‘firing’ module. The GPS model is described in terms of detecting formant bursts for each pitch pulse at different frequencies, therefore we desire the input spectrum transformed into a representation that has discrete events for each such burst (at least within each channel). We experimented with a variety of models of hair-cell firing, ranging from a simple Poisson event generator driven by the firing probability, to models that included refractory effects such as the Ross (1982) 3rd-order Poisson model. However, the noise added by such stochastic elements was quite overwhelming, presumably because the effective number of nerve fibers was very small (one per peripheral channel). Eventually we decided to sacrifice physiological resemblance at this stage and devised a deterministic formant-burst event generator that produces one firing for each local maximum in the firing probability function over a sliding time window of typically 5 ms. This does a very satisfactory job of producing one pulse every pitch cycle in channels where given formants are dominant. It is not clear how this function could be generated in an actual physiological system; nerve fibers do of course generate firings synchronized to their inputs at this level of resolution, though not with such consistency.
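A sketch of such a deterministic formant-burst event generator is given below: it emits one event wherever the firing probability is a local maximum over a sliding window (5 ms by default, as in the text). It uses scipy's 1-D maximum filter for brevity; plateaus of exactly equal values would yield duplicate events, which a fuller implementation would want to suppress.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def firing_events(prob, sr, window_s=0.005):
    """One event per local maximum of the firing-probability function
    over a sliding window of roughly window_s seconds."""
    size = max(1, int(window_s * sr)) | 1           # force an odd window length
    local_max = maximum_filter1d(prob, size=size, mode='nearest')
    return (prob > 0) & (prob == local_max)         # boolean event train per channel
```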

It is thus a very reduced spectral representation, consisting of a set of frequency channels containing binary events once every few milliseconds, that is passed to the next stage. This is the co-synchrony detector that is the heart of the method. Essentially, it is a three-dimensional histogram, as illustrated in figure 6. The two major dimensions are both indexed by peripheral channel (i.e. frequency), and the third axis is time skew or lag. Thus each bin is calculating a value of the estimated conditional probability function $\widehat{PF}_T(n,m,\tau)$ described above, i.e. the likelihood that a firing in peripheral channel $n$ (indexed along the vertical axis of the histogram) will follow a firing in channel $m$ (the horizontal axis) at a relative time $\tau$ (the short axis going into the page). The actual method of updating this matrix is as follows: when an event is detected in channel $n$, an entire plane of the histogram is updated. Every other channel, indexed across $m$, corresponds to a row in this plane with a bin for each possible time skew $\tau$. If there is an event in the other channel within the range of time skews considered (i.e. within a couple of milliseconds of the burst causing the update), the bin appropriate to that time skew is incremented up to some saturation limit. All the bins that do not correspond to events are decremented unless they are already empty. Thus every bin in the plane is changed, unless it is already at a limit of its excursion. (In practice, most bins will spend most of the time at zero). This saturating count is a crude approximation to the exponentially-weighted estimate of equation (10), with the advantage that it permits the use of small integers for each element in the array.
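The plane update can be sketched as follows. Channel indices, event times measured in integer histogram bins, the saturation limit of 15 and the unit increment/decrement steps are all illustrative assumptions; only the overall behavior (increment matching-skew bins up to saturation, decrement the rest down to zero) follows the description above, as a crude stand-in for equation (10).

```python
import numpy as np

class CoSynchronyHistogram:
    """3-D histogram of estimated conditional firing probabilities, indexed by
    (firing channel n, other channel m, relative timing tau), with saturating
    integer counts approximating the exponentially weighted estimate."""

    def __init__(self, n_channels, max_skew_bins, saturation=15):
        self.sat = saturation
        self.max_skew = max_skew_bins            # tau runs from -max_skew to +max_skew
        self.hist = np.zeros((n_channels, n_channels, 2 * max_skew_bins + 1),
                             dtype=np.int8)

    def plane_update(self, n, event_time, event_times_per_channel):
        """Called when an event occurs in channel n at event_time (integer bins):
        for every channel m, bump the bin matching each observed time skew and
        decrement all other bins, clipped to [0, saturation]."""
        plane = self.hist[n]
        plane -= 1                                # decay every bin by one...
        for m, times in enumerate(event_times_per_channel):
            for t in times:
                skew = t - event_time
                if abs(skew) <= self.max_skew:
                    plane[m, skew + self.max_skew] += 2   # ...so hits net +1
        np.clip(plane, 0, self.sat, out=plane)

    def formant_signature(self, n):
        """Maximum score across tau for each other channel, given seed channel n."""
        return self.hist[n].max(axis=-1)
```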

We note in passing the similarity of this matrix, detecting co-occurrences of firings in different frequency channels, to the binaural coincidence detector of Colburn (1977), which detects consistent timing skew between channels of the same center frequency, but from opposing ears, as a factor contributing to perceived spatial origin.

figure 6: The three-dimensional histogram for estimating the conditional probabilities of firing in pairs of channels at particular relative timings. [Axes: firing channel, other channels, and relative timing (negative to positive); a time-skewed event increments the corresponding bin.]

This procedure behaves as follows: presented with a succession of pitch bursts with a consistent formant pattern, the bins corresponding to the relative timings between the formants for each pair will be incremented on every pitch cycle, and the other bins will remain at zero. Thus by looking for the maximum bin contents along the time skew axis for each channel $m$ compared to a single channel $n$ (i.e. one ‘plane’), we will see which channels have occurred with a consistent relative timing. If channel $n$ contains a formant of a particular voice, the $m$s that have large maximum bin contents (across $\tau$) will be the other formants of the same voice, and for each channel pair the $\tau$ of the bin that contains the maximum will give the relative timing (phase difference) between those two formants (of no particular interest to the grouping mechanism, which is mainly interested in the existence of some consistent relative timing).

An example of the values in this matrix during a typical mixed voice sound is shown in figure 7. Here we see the histogram collapsed along the time skew dimension. The vertical and horizontal co-ordinates are the two frequency channels $n$ and $m$ respectively, and the level of gray shows the maximum histogram score across $\tau$ for that pair of channels. Because the histogram is updated by planes (horizontal rows in the figure), the image is not quite symmetric about the $n = m$ axis; if there are more firings in channel A than channel B, then there will be more updates to the bins in channel A’s row that correspond to channel B than there are of the bin for channel A in channel B’s row - although for positive counts, we expect symmetry, i.e. an event in channel A at a time t relative to an event in channel B will be recorded as counts in the two histogram bins (A, B, t) and (B, A, -t).

figure 7: A ‘snapshot’ of the histogram collapsed along the time skew axis 0.171 seconds into a mixed vowel. Each axis is labeled by the center frequencies of the channels (100 Hz to 4.1 kHz). Gray density shows the estimated conditional firing probability.

In figure 7, we can see the presence of two independent sets of synchronized formants, appearing as grid patterns of dark spots in the grayscale image. Thus frequencies 120 Hz, 2000 Hz and 6000 Hz (approx.) tend to occur with consistent time skew, as do 400 Hz, 1000 Hz and 4000 Hz, but not the intersection, i.e. the bins indexed by 1000 Hz and 2000 Hz are pale, indicating no identified regular timing skew, even though there are evidently firings in both channels, synchronized to other frequencies.

The next stage of processing is to convert this large map of conditional probability estimates into a few inferred formant spectra, accomplished by the next block from figure 5, the ‘formant clustering’ unit. This considers each row in the matrix from the previous stage and attempts to find a small number of formant spectra that, between them, resemble most of the rows in the matrix. The algorithm is a simplified version of k-means clustering: starting with an empty set of output formant spectra or ‘signatures’, we go through each row of the matrix. If it has a peak value above a threshold (indicating that it contains some useful co-occurrence information), it is compared using a distance metric to each of the current signatures. If it is close to one, it is added to that cluster, and the signature is modified to reflect the contribution of the new row (a shift of the mean of the cluster). If it is not sufficiently close to any of the signatures, a new signature is created based on that row. The result of this clustering is a small number (between zero and five) of candidate formant signatures for each time step when the clustering is run (every 10 or 20 milliseconds). Note that since these signatures have a frequency axis, we refer to them as spectra. However, the function of frequency is average conditional firing probability, and does not reflect the relative intensity of the actual formants in the signal (except in so far as more intense formants will be less affected by interference and thus more easily detected by synchrony).
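A sketch of this simplified clustering follows. The paper does not specify the distance metric or the thresholds; the normalized Euclidean comparison, the peak threshold and the distance threshold below are placeholders chosen for illustration.

```python
import numpy as np

def cluster_signatures(rows, peak_threshold=5.0, distance_threshold=0.5):
    """Incremental (k-means-like) clustering of co-synchrony rows into a few
    candidate formant 'signatures'. Rows below the peak threshold are ignored;
    others either join the nearest existing signature (shifting its running
    mean) or start a new signature."""
    signatures, counts = [], []
    for row in rows:
        if row.max() < peak_threshold:
            continue                                   # no useful co-occurrence info
        unit = row / (np.linalg.norm(row) + 1e-9)       # compare shapes, not levels
        if signatures:
            dists = [np.linalg.norm(unit - s / (np.linalg.norm(s) + 1e-9))
                     for s in signatures]
            k = int(np.argmin(dists))
            if dists[k] < distance_threshold:
                counts[k] += 1
                signatures[k] += (row - signatures[k]) / counts[k]   # shift the mean
                continue
        signatures.append(row.astype(float).copy())
        counts.append(1)
    return signatures
```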

The penultimate stage of the model attempts to form complete voices by tracking formant signatures through time. The procedure is that once a voice has been identified, the formant signature most similar to the last ‘sighting’ of the voice is chosen as a continuation. Exactly when to start tracking a new voice is still an unresolved issue; currently, we track only one voice at a time, and the system is given a ‘seed’ (a known formant of the target voice at a particular time) so that it may lock on to the correct set of signatures. The result of this tracking is a two dimensional map, a function of frequency (channel) and time, showing the contribution of each time-frequency tile to the separately-identified voice. This can then be used as a ‘mask’ for the time-frequency decomposition from the original linear filterbank, and subband signals masked this way can be passed through a reconstruction filter to resynthesize a sound approximating the separated voice. (This approach to resynthesis is borrowed from Brown (1992)). This is represented by the final block of figure 5, ‘Mask & resynth’.
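The final masking step can be sketched as below, under the simplifying assumption that summing the masked subbands is an acceptable stand-in for the reconstruction filterbank borrowed from Brown (1992); the hold interpolation of the frame-rate mask is likewise our own choice.

```python
import numpy as np

def mask_and_resynthesize(subbands, mask, hop):
    """Apply a time-frequency mask to the analysis filterbank outputs and sum
    the masked subbands to approximate the separated voice. `subbands` is
    (channels, samples); `mask` is (channels, frames) with frames spaced `hop`
    samples apart, values in [0, 1] (e.g. the binary masks of figure 9)."""
    channels, n = subbands.shape
    # Upsample the frame-rate mask to sample rate by simple hold interpolation.
    frame_idx = np.minimum(np.arange(n) // hop, mask.shape[1] - 1)
    gains = mask[:, frame_idx]                    # (channels, samples)
    return np.sum(subbands * gains, axis=0)       # crude inverse filterbank: sum
```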

4. RESULTS FROM A PRELIMINARY IMPLEMENTATION

In this section we describe our first results with this method. We distinguish this implementation from the description in the previous section owing to some simplifications made in the interests of computational efficiency. The most significant of these is that we did not maintain the full three-dimensional histogram described above; rather, for each pair of frequency channels we tracked the score only for the two most recently observed inter-firing time skews, but not all the other possible skews. If, during a plane update, the calculated time skew to another channel was close or equal to one of the time skews being tracked, the histogram score for that skew was incremented accordingly. If the new relative timing measurement was not one of the two being tracked, the worse-scoring of those two was discarded, and the new time skew tracked instead from then on. While this modification complicates the analysis of the process, it is intended that the ‘correct’ time skew will be detected in situations where there is a strong correlation between channels, even if there is significant interference between two voices in one of the channels.
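This per-channel-pair simplification can be sketched as a tiny tracker of the two best-scoring skews. The matching tolerance and the score assigned to a newly adopted skew are not specified in the text and are assumptions here.

```python
import numpy as np

class TwoSkewTracker:
    """Track only the two most recently adopted inter-firing time skews for one
    channel pair, with a score for each, instead of a full histogram of skews."""

    def __init__(self, tol=1):
        self.tol = tol              # how close a new skew must be to count as a match
        self.skews = [None, None]
        self.scores = [0, 0]

    def update(self, observed_skew):
        for i, s in enumerate(self.skews):
            if s is not None and abs(observed_skew - s) <= self.tol:
                self.scores[i] += 1            # matched a tracked skew: reinforce it
                return
        worst = int(np.argmin(self.scores))    # otherwise replace the worse slot
        self.skews[worst] = observed_skew
        self.scores[worst] = 1
```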

The other simplification concerns the way that voices are formed through time. Rather than creating all possible component voices automatically, or seeding a voice at a particular time-frequency point and following it forward from there, our tests involved static vowels that could be reliably associated with a particular formant throughout their extent. It was not necessary to track; we simply formed voices by taking the vowel signature that contained the known formant at each time. Thus these results do not test the mechanism of tracking voices through time suggested for the ‘Continuity tracking’ block in figure 5.

Despite these simplifications, the results are in line with our intentions. Figure 8 illustrates two extended vowels both separately and mixed together. The top image of each pair shows the output of the inner-hair-cell firing probability stage, which is approximately a rectified version of the output of the filterbank. The horizontal dimension is time and the vertical axis shows the center frequency of the corresponding channels of the filterbank. The firing probability is shown as gray density, and the resulting image is similar to a wideband spectrogram, with the addition that individual cycles of the ringing of each filter can sometimes be resolved (as a result of the half-wave rectification). The second image of each pair is the same sound as represented by the firing event generator. The axes are the same as for the upper image. Here we can see how each pitch cycle gives rise to an approximately vertical structure of burst events, with the white space showing the absence of firing events between pitch bursts.

In the bottom pair of images, we see the analysis of amixture of the vowels at approximately equal perceivedloudness. The resulting sound is quite complex, althoughcertain frequency regions are visibly dominated by one orother of the vowels, such as the 700 Hz formant for the

[Figure 8 image panels omitted (panel files orfm.hex, eefm.hex, orvm.hex, eevm.hex, oefm.hex, oevm.hex); each panel plots time/s from 50 ms to 450 ms against channel center frequency frq/Hz from 100 Hz to 4.6 kHz.]

figure 8: Firing probability (top of each pair) and firing events (bottom) for the vowels /or/ and /ee/, and their mixture.

[Figure 9 image panels omitted (panel files ormk.hex, eemk.hex); each panel plots time/s from 50 ms to 450 ms against channel center frequency frq/Hz from 100 Hz to 4.6 kHz.]

figure 9: Extracted time-frequency masks for the vowels in figure 8.

[Figure 10 image panels omitted (panel files or.aif, ee.aif, or+ee.aif, or-sy.aif, ee-sy.aif); each panel plots time/s from 50 ms to 450 ms against frequency frq/Hz from 0 to about 5 kHz.]

figure 10: Wideband spectrograms for the two original vowels (top row), their mixture (middle), and the reconstructed vowels from the separation (bottom row).


The discriminating formants in these two regions were in fact the basis of the simplified continuity matching used in these tests. Figure 9 shows the time-frequency masks derived for the two vowels, by selecting vowel signatures from the output of the clustering stage that exhibited the relevant formants. These masks are dark for time-frequency regions exhibiting high consistency of timing skew to the ‘seed’ formants, and light for areas that show negligible correlation. The masks are then multiplied by the corresponding outputs of the original cochlear filterbank, which are then combined into a single output signal by an inverse filterbank. The sounds produced by these two masks, i.e. the separated vowels, are shown in figure 10 as conventional wideband spectrograms, along with the spectrograms of the original vowels and their mixture. As can be seen, although each separated signal is missing significant regions of energy where it was masked by the other vowel, the net result has captured much of the character of the sound, and successfully rejected the interfering sound. Listening to the outputs confirms that the separation is good, and the vowels are easily recognized, although obviously distorted in comparison to the originals.

5. DISCUSSION

We have introduced a detailed model with complex behavior. It is useful to consider some of the tuning parameters of the model and their influence.

The cochlear filterbank and inner hair cell firing probability models have a profound influence on the model behavior, since they determine what information from the raw stimulus is available to the rest of the processing. The frequency selectivity, level adaptation and signal synchrony of these stages are particularly important. However, we hope that by using the relatively established Meddis and Hewitt modules for these stages we can circumvent extensive arguments and justifications.

The simple operation of the burst-event generation stage, described in section 3, is dependent only on the time window over which it operates. This was set to 5 milliseconds, to be approximately the period of the highest pitch for which the system will operate. Since at most one event can be generated within this window, this helps ensure the desired behavior of a single event per channel per pitch cycle.
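
To make the constraint concrete, the sketch below emits at most one event per consecutive 5 ms window of a channel's firing-probability signal. The threshold and the peak-picking rule are assumptions of this illustration, not the report's actual burst-event generator.

    import numpy as np

    def pick_burst_events(firing_prob, fs, window_s=5e-3, threshold=0.1):
        # firing_prob: firing-probability signal for one channel.
        # fs:          sample rate in Hz.
        # Emits at most one event per non-overlapping window, placed at the
        # window's maximum if that maximum exceeds the (assumed) threshold.
        window = max(1, int(round(window_s * fs)))
        event_times = []
        for start in range(0, len(firing_prob), window):
            segment = firing_prob[start:start + window]
            peak = int(np.argmax(segment))
            if segment[peak] >= threshold:
                event_times.append((start + peak) / fs)
        return event_times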

In the implementation of the timing-skew histogram, there are two important parameters of the relative timing dimension: the maximum skew that is registered, and the timing resolution. In our implementation, we looked for events within 5 milliseconds on either side of the update event. Again, this is intended to capture all events within one pitch cycle, since it is difficult to imagine a reliable mechanism for tracking time skews that exceed the pitch cycle time. The resolution was about one-hundredth of this, i.e. 50 microseconds. If this is made too large, then fine jitter effects may not be noticed, and the segregation ability of the process is reduced. If it is too small, it becomes overly sensitive to timing noise in the burst detection process, and also raises questions of plausibility. However, the human ability to discriminate azimuth to a few degrees based on interaural timing differences proves that this level of timing sensitivity can occur in at least some parts of the auditory system. The actual implementation of resolution in the histogram bins was side-stepped by the parameterized way in which timing was represented in our implementation. For a real histogram with actual timing bins, it would be necessary to use some kind of point-spread function when incrementing bins to avoid artifactual behavior on the boundaries between bins.
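
For example, with a ±5 ms skew range and 50 microsecond bins, an explicit histogram update with a simple triangular point-spread might be sketched as follows (the spread width is an illustrative choice):

    import numpy as np

    MAX_SKEW = 5e-3        # register skews of up to +/- 5 ms
    RESOLUTION = 50e-6     # 50 microsecond bins
    N_BINS = int(round(2 * MAX_SKEW / RESOLUTION)) + 1   # 201 bins

    def add_skew(histogram, skew, weight=1.0, spread_bins=1):
        # Spread the increment over neighbouring bins with triangular
        # weighting, so that a skew falling near a bin boundary does not
        # flip arbitrarily between adjacent bins.
        centre = (skew + MAX_SKEW) / RESOLUTION     # fractional bin index
        lo = int(np.floor(centre)) - spread_bins
        hi = int(np.ceil(centre)) + spread_bins
        for b in range(lo, hi + 1):
            if 0 <= b < N_BINS:
                w = max(0.0, 1.0 - abs(b - centre) / (spread_bins + 1))
                histogram[b] += weight * w

    # One skew axis for a single pair of channels:
    pair_hist = np.zeros(N_BINS)
    add_skew(pair_hist, 1.23e-3)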

Perhaps the most interesting parameter is the sluggishness of the probability estimation process, equivalent to the time window upon which the estimate is based. In the integer-only histogram implementation, the exponential behavior of the recursive estimator in equation (10) has been approximated with a straight line segment that causes the time support of the estimate to scale with the pitch cycle, such that a sudden change in the vocal tract parameters will be fully reflected in the histogram after five complete cycles. While this only approximates the form of adaptation to new evidence, and makes an implicit assumption that the spacing between events is some constant period, it possibly provides adaptive behavior more suitable to our ultimate task of separating the vowels. A more careful analysis of the nature of this approximation is required.

In the original formulation of equations (8) through (10), the sluggishness was represented by the time window of the conditional firing probability estimates. A larger α (which would be approximated by a shallower line segment, i.e. saturation at a larger level) would give a more conservative result, requiring more instances to fully accept a particular timing. While this would be wrong less often, it would also be in danger of losing track of a rapidly modulated voice. In fact, we consider the most interesting aspect of the present model to be its adaptation to evidence across time, in contrast to the pseudo-static nature of previous models. We can be quite confident that the actual perceptual system has far more sophisticated strategies for temporal evidence accumulation than simple one-pole integrators. It is interesting nonetheless to begin to model these effects.
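
Since equation (10) is not reproduced in this section, the contrast can only be illustrated under an assumption about its form. The sketch below assumes a conventional one-pole update, p <- alpha*p + (1 - alpha)*x, with larger alpha being slower and more conservative as described above, alongside a saturating integer count whose support scales with the number of pitch cycles.

    def one_pole_update(p, x, alpha):
        # Exponential estimator assumed to be in the spirit of equation (10):
        # a larger alpha adapts more slowly and is more conservative.
        return alpha * p + (1.0 - alpha) * x

    def saturating_count_update(count, consistent, full_scale=5):
        # Integer approximation in the spirit of the histogram implementation:
        # the score ramps linearly and saturates after `full_scale` consistent
        # pitch cycles, so its time support follows the pitch period rather
        # than an absolute time constant. (The symmetric decrement on an
        # inconsistent observation is an assumption of this sketch.)
        if consistent:
            return min(count + 1, full_scale)
        return max(count - 1, 0)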

Given that the GPS model relies exclusively on the temporal repetition of between-channel patterns to make its groupings, it is worth asking whether real speech is sufficiently static to make such a principle feasible. Our current model requires some four pitch cycles with essentially constant vocal tract response to lock on to the pattern. This is only 50 milliseconds even at low voice pitches, a reasonable minimum syllabic duration. However, transitions can be shorter than this, so it will be interesting to see how a complete system, including voice tracking through time, deals with speech that modulates at realistic rates. With very fast transitions, the speech may be chopped into sections whose internal connection is reasonably evident, but with no clear continuity across the transitions. However, other mechanisms may be postulated to mend these discontinuities: two sections of speech with abutting end and start times are themselves showing evidence of a common origin.

In section two we explored the usefulness of jitter, arguing that a modulation detector with a variable threshold but limited time resolution would be able to adjust itself to an optimal rate of detected events if the modulation was governed by a Gaussian jitter function (with gradual roll-off at the extremes of the modulation probability density function), but such adjustment would be difficult or impossible for sinusoidal modulation with its sharp, bimodal distribution. There is another qualitative argument that may be made in favor of the utility of jitter, in the context of a sensitive relative-timing detection scheme such as the current one. If a periodic signal is composed of energy in different frequency bands which show a consistent relative timing, then a system to detect relative timing will correctly observe the signal, as will an alternative system that first calculates periodicities within each channel (such as the Meddis & Hewitt model). However, if the signal departs from periodicity by some jitter mechanism that, say, elongates a particular cycle, the relative-timing-based scheme will not be in the least perturbed, since the delayed glottal pulse will most likely still excite the same relative timing amongst the formants: each channel will have its burst delayed by the same amount, and the pattern is preserved. If the system is attempting to segregate two voices that are close in pitch, this consistency of the jitter-derived displacement of one voice’s formants can effectively disambiguate the voices where period measurements fail. Thus the presence of jitter, and a detection scheme that is impervious to its effects, can be particularly crucial for the perceptual isolation of voices with poor pitch separation.

It is important to recognize that the current model is completely insensitive to pitch cycle length and its stability. This is quite unrealistic, since random jitter at high levels will rapidly disrupt the stability and clarity of a sound; something approximating smooth pitch variation is important for the perception of voice. Since the GPS model cannot provide this behavior, it cannot be considered a sufficient model of voice separation. Rather, it may be one of a variety of techniques available to the auditory system, each used when appropriate to the particular problem at hand. Periodicity-based methods may be among the other alternatives, and there may be still other principles. This is entirely consistent with a theory of perception and cognition based on corroboration between multiple, disparate mechanisms of indifferent individual reliability, but capable of excellent performance when successfully combined (a ‘society’ in the sense of Minsky (1986)).

A final comment regards the physiological plausibility of the method. One inspiration for the design was the known cross-correlation mechanisms identified in bat echolocation systems by Suga (1990). There, correlations between certain features of a reflected sonar pulse were mapped across two dimensions of individual neurons. The three-dimensional histogram of the GPS model could perhaps be implemented with the same building blocks, but the extra dimension certainly reduces the likelihood of it being a direct description of a neurological system, since the number of elements involved would be considerable. It might be that a low-resolution time-skew axis would make the third dimension relatively shallow, yet still retain useful behavior. The ‘two-and-a-half’ dimensional modification we actually used in our implementation (where the timing skew is represented explicitly rather than implicitly in an array of bins) requires less computation, but relies on a different kind of representation that also compromises its plausibility. It may also be unnecessary to compare every frequency channel against every other; instead, computation could be reduced by using a local window to calculate a diagonal ‘band’ from the middle of our histogram. This would still find adjacent pairs of formants (provided both lay within the window), and these could then be assembled into complete vowels by transitivity.
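
A sketch of this banded comparison and the subsequent transitive grouping follows (Python; the band half-width and the union-find grouping are illustrative choices, not details of the implemented model).

    def banded_pairs(n_channels, half_width=8):
        # Channel pairs restricted to a diagonal band of the pairwise
        # comparison matrix: only channels within `half_width` of each
        # other are compared.
        return [(i, j) for i in range(n_channels)
                       for j in range(i + 1, min(i + half_width + 1, n_channels))]

    def group_by_transitivity(n_channels, linked_pairs):
        # Assemble channels into putative vowels: if channel A groups with
        # B and B with C, all three end up in one group (simple union-find).
        parent = list(range(n_channels))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i

        for a, b in linked_pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        groups = {}
        for ch in range(n_channels):
            groups.setdefault(find(ch), []).append(ch)
        return list(groups.values())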

6. CONCLUSIONS AND FUTURE WORK

Future tests of the model

This report describes a very preliminary stage of development of this model. There are a number of tests of immediate interest whose results would guide further development and refinement. There are also some interesting psychoacoustical experiments suggested by the presentation of this algorithm as a potential strategy employed by the brain. We will describe some of this planned future work.

The one example we tried was not terribly difficult: the two vowels were very distinct in formant structure, and they had wide pitch separation. However, the prediction made in section 5, that exploiting jitter should allow segregation even with little difference in fundamental frequency, is relatively easy to test and could reveal a particularly advantageous application of this technique.

While Meddis & Hewitt’s model achieved an extremely good match to experimental results on the confusion of vowels as a function of pitch separation by actual listeners, their artificial stimuli lacked the kind of jitter that should be very important in such situations, but which is not well detected by their model. Our contention that the currently proposed system gains an advantage by exploiting this jitter would be tested by generating sets of artificial vowels with and without simulated jitter of various intensities and types. We would expect the GPS model to perform much better in the presence of wideband jitter, and the MH model to perform perhaps a little worse compared to unmodulated vowels. To be able to conduct the same kinds of tests as described in the Meddis and Hewitt paper, we would need to add some kind of vowel classification stage to the output of the GPS model. This could be done in several ways, for instance with a low-order cepstral analysis of the separated, resynthesized signal.

The perceptual assertion upon which this entire paper is founded is that the human auditory system is indeed sensitive to correlated modulation as caused by glottal pulse train jitter. We have argued circumstantially for this, but it would be more satisfying to have direct psychophysical evidence. One test would be to use the same collection of synthetic vowel combinations with various degrees of modulation of different kinds, and to test human ability to segregate as a function of these parameters. This evidence could provide very strong validation of the physiological relevance of our model, or alternatively demonstrate its irrelevance!

Another prediction of this kind of mechanism is that excessive phase dispersion should degrade the coherence of a signal; if there is more than one pitch cycle of additional timing skew between different formants, then it becomes much harder to correlate between appropriate pairs of events (originating from the same glottal pulse). Although this could still be achieved with a sufficiently large time dimension for the three-dimensional histogram, short cuts of the kind we actually used in section 4 would not be applicable. We know that moderate phase dispersion has little impact on the perceptual quality of a vowel, but it would be interesting to investigate the point at which dispersion becomes deleterious, and the perceptual characterization of the impact.

On a similar theme, and in the spirit of Summerfield & Assmann (1991), it would be possible to generate synthetic vowels where each formant was excited by a separate pulse train. If these pulse trains had the same underlying period but different superimposed jitter functions, the resulting synthetic voice would have average jitter characteristics similar to those of real speech, but without any of the coherent timing skew detected by the GPS model. Such sounds would be poorly fused by the model; it would be interesting to test whether they were similarly ‘fragile’ for human listeners.
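
Such stimuli are straightforward to specify. The sketch below generates, for each formant, pulse instants that share a common underlying period but carry independent Gaussian jitter; the period, jitter depth and number of formants are illustrative values only.

    import numpy as np

    def jittered_pulse_times(n_pulses, period_s, jitter_std_s, rng):
        # Pulse instants on a common period, with independent Gaussian
        # jitter added to every pulse.
        return np.arange(n_pulses) * period_s + rng.normal(0.0, jitter_std_s, n_pulses)

    # One independently jittered excitation per formant, all sharing a
    # 100 Hz underlying period (illustrative parameter values).
    rng = np.random.default_rng(0)
    excitations = [jittered_pulse_times(50, 1 / 100.0, 0.3e-3, rng)
                   for _ in range(3)]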

Conclusions

(1) An argument was made that for certain possible perceptual mechanisms the random nature of cycle-to-cycle variation in natural speech could be materially different from the sinusoidal modulation typically employed in experiments.

(2) We described in some detail an algorithm by which the human auditory system might be able to segregate the contributions of each of several voices in an acoustic mixture based upon the coherent timing skew between different regions of spectral dominance of a single voice. A particular aspect of this strategy is that it must combine evidence over time (i.e. several pitch cycles) in a nonlinear fashion to form a confident output.

(3) Results from a preliminary implementation of this model, tested with real voice samples, show a surprising ability to segregate voices on this basis alone.

There cannot, however, be any doubt that other cues such as voice pitch are extremely important to the perception and segregation of voices. Therefore we propose that if the current model does in fact reflect some genuine aspect of human auditory processing (as our future tests may reveal), it is only one of a collection of voice-segregation methods available to human listeners. Future research should concentrate not only on the discovery and refinement of individual techniques, but also on the interesting question of how the auditory system might combine the results of different processes.

ACKNOWLEDGMENT

This work was generously supported by the MIT Media Laboratory, in particular the Television of Tomorrow consortium. The author is in the United States as a Harkness Fellow of the Commonwealth Fund of New York, whose support is gratefully acknowledged.

We would like to thank Dr. Lowel P. O'Mard and the Speech and Hearing Laboratory at Loughborough University of Technology for making their ‘LUTEar’ auditory model software available; it constituted a significant portion of this project.

REFERENCES

Assmann, P. F. and Summerfield, Q. (1989). “Modeling the perception of concurrent vowels: Vowels with the same fundamental frequency,” JASA 85(1), 327-338.

Assmann, P. F. and Summerfield, Q. (1990). “Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies,” JASA 88(2), 680-697.

Brown, G. J. (1992). “Computational auditory scene analysis: A representational approach,” Ph.D. thesis CS-92-22, CS dept., Univ. of Sheffield.

Carlyon, R. P. (1991). “Discriminating between coherent and incoherent frequency modulation of complex tones,” JASA 89(1), 329-340.

de Cheveigné, A. (1993). “Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing,” JASA 93(6), 3271-3290.

Colburn, H. S. (1977). “Theory of binaural interaction based on auditory-nerve data. II. Detection of tones in noise,” JASA 61(2), 525-533.

Cook, P. R. (1991). “Identification of control parameters in an articulatory vocal tract model with applications to the synthesis of singing,” Ph.D. thesis, CCRMA, Stanford Univ.

Duda, R. O., Lyon, R. F. and Slaney, M. (1990). “Correlograms and the separation of sounds,” Proc. IEEE Asilomar conf. on sigs., sys. & computers.

McAdams, S. (1984). “Spectral fusion, spectral parsing and the formation of auditory images,” Ph.D. thesis, CCRMA, Stanford Univ.

McAdams, S. (1989). “Segregation of concurrent sounds. I: Effects of frequency modulation coherence,” JASA 86(6), 2148-2159.

Meddis, R. and Hewitt, M. J. (1991). “Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification,” JASA 89(6), 2866-2882.

Meddis, R. and Hewitt, M. J. (1992). “Modeling the identification of concurrent vowels with different fundamental frequencies,” JASA 91(1), 233-245.

Minsky, M. (1986). The Society of Mind (Simon and Schuster, New York).

Moore, B. C. J. and Glasberg, B. R. (1983). “Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,” JASA 74(3), 750-753.

Parsons, T. W. (1976). “Separation of speech from interfering speech by means of harmonic selection,” JASA 60(4), 911-918.

Patterson, R. D. and Moore, B. C. J. (1986). “Auditory filters and excitation patterns as representations of frequency resolution,” in Frequency Selectivity in Hearing, edited by B. C. J. Moore (Academic, London).

Ross, S. (1982). “A model of the hair cell-primary fiber complex,” JASA 71(4), 926-941.

Suga, N. (1990). “Cortical computational maps for auditory imaging,” Neural Networks 3, 3-21.

Summerfield, Q. and Assmann, P. F. (1991). “Perception of concurrent vowels: Effects of harmonic misalignment and pitch-period asynchrony,” JASA 89(3), 1364-1377.

Summerfield, Q. and Culling, J. F. (1992). “Auditory segregation of competing voices: absence of effects of FM or AM coherence,” Phil. Trans. R. Soc. Lond. B 336, 357-366.