Spectral-temporal factors in the identification of environmental sounds



Brian Gygi,a) Gary R. Kidd, and Charles S. Watson
Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana 47405

(Received 25 February 2003; revised 28 October 2003; accepted 3 November 2003)

Three experiments tested listeners' ability to identify 70 diverse environmental sounds using limited spectral information. Experiment 1 employed low- and high-pass filtered sounds with filter cutoffs ranging from 300 to 8000 Hz. Listeners were quite good (>50% correct) at identifying the sounds even when severely filtered; for the high-pass filters, performance was never below 70%. Experiment 2 used octave-wide bandpass filtered sounds with center frequencies from 212 to 6788 Hz and found that performance with the higher bandpass filters was from 70%–80% correct, whereas with the lower filters listeners achieved 30%–50% correct. To examine the contribution of temporal factors, in experiment 3 vocoder methods were used to create event-modulated noises (EMN), which had extremely limited spectral information. About half of the 70 EMN were identifiable on the basis of the temporal patterning. Multiple regression analysis suggested that some acoustic features listeners may use to identify EMN include envelope shape, periodicity, and the consistency of temporal changes across frequency channels. Identification performance with high- and low-pass filtered environmental sounds varied in a manner similar to that of speech sounds, except that there seemed to be somewhat more information in the higher frequencies for the environmental sounds used in this experiment. © 2004 Acoustical Society of America. [DOI: 10.1121/1.1635840]

PACS numbers: 43.66.Lj, 43.66.Mk, 43.66.Ba [NFV]  Pages: 1252–1265


I. INTRODUCTION

The study of the perception of environmental sounds has lagged behind that of the three classes of stimuli that have been the subject of the great majority of auditory research: speech, music, and laboratory-generated test sounds. Nevertheless, research has shown that humans have a remarkable ability to easily and quickly identify a large range of naturally occurring sounds (Lass et al., 1982; Ballas, 1993) representing a wide range of objects and interactions in the world (Gaver, 1993a). Because of the variety of sources, the range of spectral and temporal variation among nonspeech environmental sounds greatly exceeds that of speech (see Attias and Scheiner, 1997), for which the source, either actual or simulated (in the case of synthesized speech), is the human vocal tract.

Their tremendous variety makes it difficult to arrive at generalizations that apply to the entire class of environmental sounds. Several studies have instead focused on the perception of particular classes of sounds (e.g., bouncing and breaking bottles, Warren and Verbrugge, 1984; hands clapping, Repp, 1987; mallets striking metal pans, Freed, 1990; footsteps, Li et al., 1991). Many of these investigations have examined the associations between the acoustic properties and the identifiability of specific sounds. Some have looked at how specific physical features of objects are specified by acoustics, such as the amount of liquid in a cylinder (Cabe et al., 2000) and the dimensions and shapes of objects (Carello et al., 1998; Kunkler-Peck and Turvey, 2000; Laka-

a) Current affiliation: East Bay Institute for Research and Education, Martinez, CA. Electronic mail: [email protected]

1252  J. Acoust. Soc. Am. 115 (3), March 2004  0001-4966/2004/115(3)/1252/14/$20.00  © 2004 Acoustical Society of America


tos et al., 1997). Others have focused on subjective (or cognitive) judgments of sound qualities or properties in an attempt to relate these judgments to acoustic properties of the sounds (Halpern et al., 1986; Ballas and Howard, 1987; Ballas, 1993; Gaver, 1993b).

Many identification studies have examined recognition accuracy of a moderately sized corpus (20–40 sounds) in quiet at suprathreshold levels (Lass et al., 1982; Vanderveer, 1979), often with an emphasis on the effect of cognitive factors on listeners' performance (Ballas, 1993). Gaver (1993a) developed a preliminary taxonomy of environmental sounds, based on dimensions of the type of interacting material (gas, liquid, solid) and the type of interaction (e.g., collision, explosion, splash). These varied approaches have demonstrated the ready identifiability of a diverse body of environmental sounds, and provided some evidence of the acoustic properties listeners use to identify specific sounds, such as damping of the envelope for bouncing and breaking bottles (Warren and Verbrugge, 1984) and centroid (spectrum mean) for footsteps (Li et al., 1991). In contrast, Lutfi and colleagues (Lutfi and Oh, 1997; Lutfi, 2001) have shown that under certain conditions, listeners do not attend to the appropriate acoustic features, giving undue weight to frequency at the expense of other, more informative features.

In the present series of experiments, an approach used in early studies of the perception of speech is applied to a catalog of environmental sounds. Research by telephone scientists and engineers devoted a considerable effort to determining the importance of various frequency regions for the identification of speech. One method was to measure the identifiability of low-, high-, and bandpass filtered speech, which eventually led to the development of the Articulation


TABLE I. List of sounds used in the environmental sound identification study.

Sound  Code  Dur (ms)  Sound  Code  Dur (ms)

Airplane flying AIR 3426 Harp being strummed HAR 2839

Baby crying BAB 3119 Helicopter flying HCP 3100

Basketball bouncing BBA 3184 Horse neighing HRN 1464

Beer can opening BER 431 Horse running HRR 2046

Bells chiming BLS 3100 Ice dropping into glass ICE 738

Billiard balls colliding BIL 1684 Laughing LAF 1836

Bird calling BRD 1283 Match being lit MAT 1366

Bowling BOW 3062 Paper being crumpled PAP 1697

Bubbling BUB 2212 Phone ringing PHN 3100

Buzzer sounding BZZ 2523 Ping-pong ball bouncing PNG 3945

Camera shutter clicking CAM 1398 Printing paper PRN 3154

Car accelerating CRA 3209 Projector running PRJ 2909

Car starting CRS 3551 Rain RAI 3362

Cars crashing CRC 3110 Rocking chair RCK 3234

Cars honking CRH 3367 Rooster crowing ROO 2004

Cash register closing REG 2841 Scissors cutting paper SCI 280

Cat meowing CAT 1054 Screen door closing SCR 3292

Chimp calling CHP 2315 Sheep baaing SHP 1365

Chopping wood AXE 2624 Shoveling SHV 1585

Clapping CLP 1035 Siren blaring SIR 1534

Clock ticking CLK 2882 Sneezing SNZ 844

Coughing COF 1043 Splash SPL 2338

Cow mooing COW 1290 Stapling STA 822

Cymbals being hit CYM 1204 Tennis ball being hit TEN 2761

Dog barking DOG 1007 Thunder rolling THU 2895

Door opening and closing DOR 1802 Toilet flushing TOI 2494

Drumming DRM 2747 Train moving TRN 3333

Electric guitar (strum) GTR 2087 Tree falling TRE 3424

Electric saw cutting ESW 2600 Typing on keyboard TYK 1033

Flute playing FLU 2566 Typing on typewriter TYT 2755

Footsteps FST 2700 Water pouring POU 3302

Gargling GRG 2523 Waves crashing WAV 2876

Glass breaking GLS 1018 Whistle blowing WHI 2047

Gun shot GUN 902 Windshield wipers WWP 3023

Hammering a nail HAM 1833 Zipper ZIP 1733


Index (AI) (French and Steinberg, 1947). The primary goal was to determine the overall bandwidth required to communicate human conversations, and the frequency content of individual speech sounds was not of concern.

In the first two experiments described below, the identifiability of 70 environmental sounds was measured under conditions of low-, high-, and bandpass filtering. Although the methodology is similar to those of the early speech studies, the primary goal was not to determine the overall bandwidth needed to recognize the majority of environmental sounds because, as with vowels and consonants, it is clear that many environmental sounds occupy small portions of the audible range. Instead, the purpose was to identify the spectral re-


gions that include the most useful identifying information for a wide range of environmental sounds by measuring changes in identification performance under different filtering conditions. Listeners will not necessarily attend to the same frequency band under all listening conditions (depending on context), and the attended-to spectral regions may be narrower, wider, or may vary over time. However, the results provide an indication of where the most useful information is located, as well as a measure of listeners' abilities to utilize information in different spectral regions (including regions that may provide little identifying information). This information may aid in the assessment of listening strategies observed in other contexts.


Many environmental sounds are really a collection of repeated brief sounds, such as a ping-pong ball bouncing, chopping wood, or a clock ticking. These sounds usually have fairly broad spectra, being akin to noise bursts, and it is likely their temporal patterning that allows identification. Consequently, they may be recognizable even with extremely limited spectral information. This hypothesis was tested by creating what are termed here event-modulated noises (EMNs). An EMN is a broadband noise whose temporal envelope is modulated by the envelope of a particular environmental sound; the resulting sound preserves the temporal structure of the environmental sound, but has a uniform spectrum. This is similar to the noise-band vocoder techniques used to create vocoded speech (e.g., Shannon et al., 1995). Together, the two types of stimulus manipulations (filtering and EMN) demonstrate the spectral and temporal information listeners use to identify familiar environmental sounds.

II. THE CATALOG OF ENVIRONMENTAL SOUNDS

The 70 environmental sounds used in the experiments reported here are listed in Table I. They were taken from high-quality sound effects CDs (Hollywood Leading Edge and Sound FX The General) sampled at 44.1 kHz. The sounds were selected on the basis of familiarity, pairwise discriminability, and ready identifiability in the quiet, based on pilot studies. An effort was made to select a fairly representative sampling of the classes of meaningful sounds encountered in everyday listening: nonverbal human sounds, animal vocalizations, machine sounds, the sounds of various weather conditions, and sounds generated by human activities. The sounds were equated for overall rms, with a correction for pauses of greater than 50 ms (the pause-corrected rms for these sounds is discussed in Appendix A), and stored as binary files. While no formal study of the recognition of the stimuli in their unfiltered forms was conducted, all but a few of the 70 unfiltered sounds were correctly identified on the first trial in the training phase of experiment 1. All were identified with greater than 95% accuracy by the fourth presentation. This, as well as performance in the widest filter condition reported below, demonstrates that all sounds are easily identified in their unfiltered form. The mean duration of the sounds was 2.3 s, with the shortest sound being beer can opening (431 ms) and the longest being ping-pong ball bouncing (3945 ms).
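The pause-corrected rms equalization described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' MATLAB code; the −40 dB silence threshold and the target rms are assumptions (the actual procedure is given in Appendix A).

```python
# Sketch of rms equalization that ignores pauses longer than 50 ms.
# The -40 dB silence floor and 0.05 target rms are illustrative assumptions.
import numpy as np

def pause_corrected_rms(x, fs=44100, thresh_db=-40.0, min_pause_ms=50.0):
    """rms over all samples, excluding silent stretches longer than 50 ms."""
    floor = np.max(np.abs(x)) * 10 ** (thresh_db / 20)
    quiet = np.abs(x) < floor
    keep = np.ones(len(x), dtype=bool)
    # locate runs of consecutive quiet samples via edges in the 0/1 sequence
    edges = np.flatnonzero(np.diff(np.concatenate(([0], quiet.view(np.int8), [0]))))
    for start, stop in zip(edges[::2], edges[1::2]):
        if (stop - start) / fs * 1000 > min_pause_ms:
            keep[start:stop] = False  # drop pauses longer than 50 ms
    return np.sqrt(np.mean(x[keep] ** 2))

def equate_rms(x, target_rms=0.05, fs=44100):
    """Scale a sound so its pause-corrected rms equals the target."""
    return x * (target_rms / pause_corrected_rms(x, fs))
```

For a sound containing long silences, the pause-corrected rms is higher than the plain rms, so equating on it prevents sparse, impulsive sounds (e.g., a ticking clock) from being scaled up unrealistically.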

III. EXPERIMENT 1. IDENTIFICATION WITH LOW- AND HIGH-PASS FILTERING

A. Methods

1. Stimuli

The set of 70 environmental sounds was filtered with third-order Chebyshev type I filters, with slopes of 48 dB/octave and a level of −60 dB in the stopbands. The high-pass filter cutoffs (fc, measured at the 3-dB down point) were 300, 600, 1200, 2400, 4800, and 8000 Hz. Low-pass fc were 300, 600, 1200, 2400, and 4800 Hz. The 8000-Hz high-pass filter was added after pilot testing, which indicated that an fc higher than 4800 Hz would be necessary to properly estimate


the full range of performance. The filter coefficients were generated by MATLAB's signal-processing module (sptool).
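The filtering stage can be sketched with SciPy in place of MATLAB's sptool. The filter order and cutoffs follow the text; the 1-dB passband ripple is an assumption, since the paper does not report the ripple setting.

```python
# Sketch of the experiment-1 filtering: third-order Chebyshev type I
# low- and high-pass filters applied to each stored sound.
# The 1-dB passband ripple is an assumed value, not taken from the paper.
import numpy as np
from scipy import signal

FS = 44100  # sampling rate of the sound-effects recordings, Hz

def filter_sound(x, fc, kind):
    """Apply a third-order Chebyshev type I filter; kind is 'low' or 'high'."""
    sos = signal.cheby1(N=3, rp=1, Wn=fc, btype=kind, fs=FS, output='sos')
    return signal.sosfilt(sos, x)

HP_CUTOFFS = [300, 600, 1200, 2400, 4800, 8000]  # Hz
LP_CUTOFFS = [300, 600, 1200, 2400, 4800]        # Hz

x = np.random.randn(FS)  # stand-in for one stored sound file
filtered = {('hp', fc): filter_sound(x, fc, 'high') for fc in HP_CUTOFFS}
filtered.update({('lp', fc): filter_sound(x, fc, 'low') for fc in LP_CUTOFFS})
print(len(filtered))  # 11 filter conditions per sound
```

The 11 conditions per sound (6 high-pass + 5 low-pass) match the 770 stimuli (70 sounds × 11 filter types) presented in the testing phase.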

2. Listeners

The listeners were four females between the ages of 18 and 30, all with normal hearing as measured by pure-tone audiograms (thresholds < 15 dB HL). All were undergraduates at Indiana University, and were paid for their participation.

3. Procedure

The listeners were seated in listening booths, facing VT100 terminals and keyboards. The stimuli were played over Sennheiser HD 250 Linear headphones. The headphones' frequency response (measured using pure tones and an acoustic coupler) varied by no more than 3 dB from 10 to 25 000 Hz.¹ The stimuli were generated from the stored raw files by TDT 16-bit D/A converters and amplified by a Crown headphone amplifier. The presentation level was set so that the unfiltered, equated stimuli were presented at 80 dB SPL at the headphones.

It should be noted that there was no attempt to equate the rms of the filtered stimuli. A result was that in some cases, usually with the highest or lowest bandpass filters, an extremely small amount of energy was output by the filter. While this meant that a few stimuli were barely audible, this approach was preferred to equating levels of the filtered sounds, because of the greater ecological validity of preserving the relative levels of all frequency components across the filter conditions.

The listeners were given a sheet listing the sounds to be presented and the three-letter codes with which to respond (shown in Table I). The labels for the sounds were intended to give a sense of the source objects and the events involved. The response codes were selected to be as distinctive as possible; with a few exceptions, codes differed by two or three letters (one exception was TYT and TYK, for typing on a typewriter and typing on a keyboard, respectively).

Listeners were instructed to identify the sound they heard by typing the appropriate code on the keyboard. The list of codes was always within view in front of each subject. If the listener did not respond within 7 s, a prompt would flash on the screen, encouraging them to respond. If the listener responded with a code that was not valid, they were prompted to reenter the code. Throughout the training stages of all experiments reported here, right–wrong feedback was provided, but not in the testing stages.

The experiment proceeded in four sequential phases.

(a) Pretraining: Listeners were told that their ability to identify various common environmental sounds was going to be tested, and that they were to respond using three-letter codes. Each listener was then encouraged to look over the sheet containing the list of sounds to be used and the associated codes with which to respond to each. Then, five practice trials were given in which listeners were played an unfiltered sound and asked to identify it. These practice sounds were alternative examples of ones that were used in later testing (e.g., a different cat sound).


(b) Familiarization: The set of 70 unfiltered sounds was next presented in the order that it appeared on the response sheet, and the listeners identified each. The listeners were told that in this phase they should attempt to learn the sounds and the associated codes.

(c) Training: The unfiltered sounds were presented in random order. The complete list was presented a sufficient number of times for listeners to reach >95% accuracy with relatively short latencies for entering the three-letter codes. This required four iterations of the whole list, about 1.5 h of testing time. Although no systematic effort was made to distinguish between sounds which were recognized perfectly and those for which an error or two was made, on the fourth iteration of the list only nine sounds were recognized with less than 100% accuracy, and those all had only one error among the four subjects. Listeners were given a 1-min break after every block of 35 trials, which averaged about 8 min to complete, and a 5-min break after every four blocks.

(d) Testing: The listeners were told that they were going to be tested with filtered sounds, and they then heard examples of high- and low-pass filtering of a sound that was not used in the later tests. The filtered stimuli were then presented in random order, and the listeners identified them using the three-letter codes. Over 9 days of testing, the complete list of 770 stimuli (70 sounds × 11 filter types) was presented five times. Breaks were given as in the training phases.

B. Results and analysis

Figure 1 shows the probability of correct responses for each listener across all fc for both the high- and low-pass conditions. For both high- and low-pass filter conditions, performance was nearly perfect (95%–100% correct) in the least severe filter conditions, but recognition gradually declined as more information was filtered out. However, performance at the most extreme filter cutoffs was still fairly good, especially in the high-cutoff conditions, where performance never dropped below 70% correct. This is consistent

FIG. 1. Performance by four listeners (designated L1–L4) under low- and high-pass filtering conditions. The approximate crossover frequency is the isointelligibility point for the low- and high-pass functions (see the text).


with some results from the voice identification literature, such as Compton (1963), in which identifiability was not affected by high-pass filtering but was substantially impaired by low-pass filtering. The proportion correct for all sounds is listed in Appendix B, along with the mean and s.d. for each filter condition. In general, the s.d. of the sounds is inversely related to the filter width. However, in the widest filter conditions the low variance is due to a ceiling effect, as listeners achieved near-perfect recognition scores in these conditions.

One measure speech scientists have used to characterize the importance of various frequency regions for speech perception is the crossover point, a single frequency that divides the speech spectrum into equally intelligible halves. Various values have been found for the crossover point, from 1900 Hz for nonsense syllables (French and Steinberg, 1947) to 1189 Hz for continuous discourse (Studebaker et al., 1987). In general, according to Bell et al. (1992), the crossover point tends to decrease as the redundancy of the speech material increases. When defined as the isoperformance point, the crossover frequency for low- and high-pass environmental sounds is about 1300 Hz (noted in Fig. 1), which is in the lower end of the range for speech sounds. However, since the high-pass function has more area below it (due to its much shallower slope than the low-pass function), the frequency dividing the spectrum into equally intelligible halves would likely be somewhat higher than 1300 Hz.
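Locating an isoperformance crossover of this kind amounts to interpolating the two identification curves on a log-frequency axis and finding where they intersect. A sketch with made-up p(c) values (illustrative only, not the measured data):

```python
# Hypothetical illustration: find the frequency at which interpolated
# low-pass and high-pass identification curves meet. All p(c) values
# below are invented for the example.
import numpy as np

fcs_lp = np.array([300, 600, 1200, 2400, 4800])
p_lp   = np.array([0.35, 0.60, 0.78, 0.90, 0.97])        # illustrative only
fcs_hp = np.array([300, 600, 1200, 2400, 4800, 8000])
p_hp   = np.array([0.99, 0.95, 0.80, 0.74, 0.72, 0.70])  # illustrative only

# interpolate both curves on a common log2-frequency grid
grid = np.linspace(np.log2(300), np.log2(4800), 2000)
lp = np.interp(grid, np.log2(fcs_lp), p_lp)
hp = np.interp(grid, np.log2(fcs_hp), p_hp)
crossover_hz = 2 ** grid[np.argmin(np.abs(lp - hp))]
print(round(crossover_hz))  # frequency at which the two curves meet
```

With these invented values the curves cross near 1300 Hz; with the actual data the same procedure yields the isoperformance point marked in Fig. 1.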

Despite the overall tendency for high-pass filtered sounds to be better recognized than low-pass, there was considerable variation in the identifiability of individual sounds in both conditions. Figures 2 and 3 show the range of identification performance for a group of specific sounds in the low- and high-pass conditions. The selected sounds were either easily identifiable, not identifiable at all, or showed a graded loss in identifiability as more and more high or low frequencies were eliminated. As might be expected, under high-pass filtering, sounds such as thunder and waves are more difficult to identify, while most of the other sounds are easily recognized. The differences in identifiability are not due to audibility; both thunder and waves are audible even at the most severe high-pass fc. Overall, for both filter conditions (low- and high-pass), the correlation between the rms output by the filter and p(c) was close to zero. Under low-pass filtering there are sounds, such as glass breaking, for which the essential information is almost totally removed by moderate filtering. And, in both conditions there are sounds that retain their identifiability as long as any portion of their spectrum is audible.

FIG. 2. Identifiability of selected sounds under high-pass filtering.

However, high- and low-pass filtering only indicates general regions that are necessary for identification. It is also possible that specific frequency bands within their overall spectrum are particularly informative for certain sounds. The second experiment, therefore, examined the identification of bandpass filtered environmental sounds.

IV. EXPERIMENT 2. IDENTIFICATION WITH BANDPASS FILTERING

A. Methods

1. Stimuli

The sounds used were those in experiment 1. They were filtered with third-order Chebyshev type I filters, with slopes of 48 dB/octave and a level of −60 dB in the stopbands. Six different passbands (fc measured at the 3-dB down points) were used, as listed in Table II. The procedure for creating the stimuli was the same as in experiment 1.

FIG. 3. Identifiability of selected sounds under low-pass filtering.

TABLE II. Bandpass filter specifications.

fc Low (Hz)  fc High (Hz)  Log midpoint (Hz)

BP1 150 300 212.13

BP2 300 600 424.26

BP3 600 1200 848.53

BP4 1200 2400 1697.06

BP5 2400 4800 3394.11

BP6 4800 9600 6788.23
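The "Log midpoint" column in Table II is the geometric mean of the band edges (the midpoint on a logarithmic frequency axis); a quick check:

```python
# Verify Table II's log midpoints: geometric mean of each band's edges.
import math

bands = [(150, 300), (300, 600), (600, 1200),
         (1200, 2400), (2400, 4800), (4800, 9600)]
mids = [math.sqrt(lo * hi) for lo, hi in bands]
print([round(m, 2) for m in mids])
# → [212.13, 424.26, 848.53, 1697.06, 3394.11, 6788.23]
```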


2. Listeners

Six women and two men were recruited who had not participated in experiment 1. The criteria for age range and hearing thresholds were the same as in experiment 1. With one exception, they were all undergraduates at Indiana University.

3. Procedure

The procedure was the same as in experiment 1.

B. Results and analysis

Figure 4 shows the mean probability of correct responses to all of the sounds for each listener in each filter condition. The eight subjects all performed quite similarly, although there are noticeable and consistent individual differences: the best listener's scores were 10%–20% better than those of the worst listener at every filter setting. The most difficult filter settings were the two lowest, BP1 and BP2 (31% and 51% correct, respectively), whereas mean performance in the four highest bandpass settings was between 70% and 80%. Scheffé's post hoc test showed that performance under each filter setting differed significantly (p < 0.05) from every other filter setting, except for BP4 vs BP5 and BP3 vs BP6. If the overall function were symmetrical, a level of performance comparable to BP1, based on the log axis shown in Fig. 4, would be achieved with a high bandpass filter spanning the frequency region 19.2–38.4 kHz. Since this is above the range of human hearing, it is clear that the function is not symmetrical. An empirical question for a later date is whether the decline in performance at filter settings higher than 9600 Hz is gradual or shows a sharp drop-off.

Not all of the 70 sounds, however, followed the group pattern. Some sounds were much better recognized on the basis of the lower frequencies (e.g., bubbles), some were recognized well only at one specific filter setting (bowling), and several were identified perfectly across almost the whole range of bandpass filter settings. Across filters there were

FIG. 4. Listener performance by bandpass filter type.


FIG. 5. Single-channel event-modulated noise (EMN) created using the bubbles sound and broadband noise.


quite large differences in identifiability between sounds: in general, the between-sound performance was only significantly correlated in contiguous filters (e.g., BP3 and BP4).

The preceding experiments focused on the informative value of different spectral regions for identifying environmental sounds. However, it may be that in many cases the temporal information is more important than the spectral patterning for identifying these sounds. Certainly, there are several sounds with strong characteristic periodicity (drums, gallop, helicopter), and others with distinctive, nonperiodic temporal structure (baby crying, laugh, bowling). It may be that such sounds could be identified by the energy in any spectral region as long as the temporal information is preserved. In the next experiment the importance of the temporal patterning was tested with environmental sounds that had most of the fine-grained frequency information removed.

V. EXPERIMENT 3. IDENTIFICATION OF EVENT-MODULATED NOISES

There has been a good deal of research using speech that has had most of the spectral information removed using vocoder methods (see, e.g., Dudley, 1940, 1958; Carra, 1984; Eremenko and Ermakov, 1985). One of the most striking findings is that a relatively small number of channels (sometimes as few as three or four) is needed for almost perfect speech recognition, if the envelopes of each channel are low-pass filtered at a cutoff frequency of at least 50 Hz (Shannon et al., 1995). This implies that temporal cues are sufficient to identify speech with a very coarse grain of spectral information. Given the results in the previous section, which suggested that the same might be true of environmen-


tal sounds, it was decided to test this with a methodology similar to Shannon's, using environmental sounds processed with vocoder techniques.

There are many ways of implementing vocoder principles; one recently used by Shannon et al. (e.g., 1995, 1998) is a simplification of the original design. Called a noise-band vocoder, it uses multiple noise-band carriers to represent the speech envelope signals from broad spectral bands without implementing the voicing detector and pitch estimator employed by Dudley. Figure 5 is a schematic showing how this was used with the envelope of an environmental sound to create an example of what will be termed here event-modulated noise (EMN). In the simplest case, there was only one broadband filter applied to the original sound, so there is little or no spectral information remaining. This is referred to as a single-channel EMN, and as Fig. 5 shows, the temporal envelope of the original waveform (bubbles) is preserved, while the spectrum is almost flat.

Frequency information can be reintroduced by partitioning the signal with several bandpass filters, and then using the envelope obtained from each filter to modulate a noise, the bandwidth of which is that of the corresponding filter. The several noises are summed, and the resulting spectrum is closer to that of the original sound, although it is perceptibly "noisy."

The primary goal of the following experiments was to determine the identifiability of the set of environmental sounds used in the previous studies when all or nearly all of the spectral information was removed, leaving only temporal information. Three experiments were conducted using EMN. Experiments 3a and 3b used single-channel EMN. However, because the various studies by Shannon and others have


shown remarkably accurate identification of speech when only modest spectral information was reintroduced, it was of interest to learn whether environmental sounds showed a similar pattern. In a third study, 3c, listeners were tested on the identification of six-channel EMN.

A. Methods

1. Stimuli

The same 70 environmental sounds were used as in the previous experiments. Single-channel event-modulated noises were created by

(a) Extracting the envelope by half-wave rectifying the waveform and passing it through a four-pole Butterworth LP (fc = 20 Hz) filter; and

(b) Multiplying the envelope by a broadband noise of the same bandwidth (0–22.5 kHz).

Six-channel event-modulated noises were created by

(a) Extracting the envelopes of all six bandpass filtered versions of each sound used in the previous experiments by the method described above;

(b) Multiplying each envelope by a bandpass noise of the same bandwidth; and

(c) Adding the resulting six waveforms back together.

All manipulations were carried out in MATLAB. The EMN were then equated for rms minus silences of more than 50 ms, as was done previously for the original sounds.
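The construction steps above can be sketched as follows. This is a minimal SciPy sketch, not the authors' MATLAB code; in particular, Butterworth bandpass filters stand in here for the experiment-2 analysis filters, which is an assumption.

```python
# Sketch of single- and six-channel EMN construction: half-wave rectify,
# smooth with a 4-pole Butterworth LP at 20 Hz, then modulate noise.
# Band edges follow Table II; the bandpass filter family is an assumption.
import numpy as np
from scipy import signal

FS = 44100
BANDS = [(150, 300), (300, 600), (600, 1200),
         (1200, 2400), (2400, 4800), (4800, 9600)]  # Table II passbands

def envelope(x, fs=FS):
    """Half-wave rectify, then low-pass at 20 Hz (four-pole Butterworth)."""
    sos = signal.butter(4, 20, btype='low', fs=fs, output='sos')
    return signal.sosfilt(sos, np.maximum(x, 0))

def single_channel_emn(x, rng):
    noise = rng.uniform(-1, 1, len(x))  # broadband carrier
    return envelope(x) * noise

def six_channel_emn(x, rng):
    out = np.zeros(len(x))
    for lo, hi in BANDS:
        sos = signal.butter(4, [lo, hi], btype='bandpass', fs=FS, output='sos')
        band = signal.sosfilt(sos, x)                            # analysis band
        noise = signal.sosfilt(sos, rng.uniform(-1, 1, len(x)))  # matching-band carrier
        out += envelope(band) * noise                            # modulate and sum
    return out
```

The single-channel version flattens the spectrum while keeping the temporal envelope; the six-channel version reintroduces coarse spectral shape, analogous to the noise-band vocoder of Shannon et al.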

2. Listeners

Three different studies were carried out, each using different sets of eight listeners with the same age and hearing threshold criteria as experiments 1 and 2. Listeners were 18 women and six men, all undergraduates at Indiana University. One group had participated in experiment 2. All were paid for their participation.

3. Procedure

The setting, apparatus, and presentation level were the same as in experiments 1 and 2. Three EMN identification studies using different listeners were conducted.

Experiment 3a. Single-channel EMN identified by experienced listeners. The listeners were those who took part in the bandpass filter studies. This study was run immediately after the bandpass filter trials, so the listeners were extremely familiar with the original sounds. They were told that in this study the sounds would be altered in a manner similar to that of some electronic musical instruments, and were played examples of EMN created from a sound that was not used in the subsequent testing. The 70 EMN were then presented as in experiment 2. In one 1-1/2-h session, the complete list of 70 stimuli was presented five times.

Experiment 3b. Single-channel EMN identified by naive listeners. A new set of eight listeners was recruited, consisting of four males and four females. None of them had taken part in any of the previous studies. Two of the listeners, both males, dropped out after the first day of testing, so their data


were not included in any of the subsequent analyses. The listeners were given instructions and pretraining as in the first phase of the bandpass filter experiment. The 70 EMN were then presented as in experiment 3a. This experiment took place over two 1-1/2-h sessions on subsequent days. On the first day, the listeners did not have a familiarization phase, so they had no knowledge of the original sounds before they identified the EMN. At the start of the second day, before further testing, they heard the original unprocessed 70 sounds and had to respond as in the familiarization phase of experiment 2. Then, they were tested again as on the first day. On the first day, the set of 70 EMN was presented four times; on the second day, five times.

Experiment 3c. Six-channel EMN identified by naive listeners. The participants were eight new listeners, three men and five women, none of whom had taken part in any of the previous studies. The stimuli were six-channel EMN rather than single-channel EMN. The procedure was the same as in experiment 3b. On the first day, the set of 70 EMN was presented four times; on the second day, five times. Familiarization with the original set of unprocessed sounds took place at the start of the second day only.

B. Results

Table III shows the p(c) for each group of listeners in experiments 3a, 3b, and 3c on each day. There was a substantial improvement for both groups of naive listeners on day 2 after they heard the original waveforms, even though they heard each one only once. Because the day 1 single-channel EMN performance for the naive group was so poor, all further discussion of the EMN results will refer to the day 2 findings for both groups of naive listeners. It should be noted, however, that 15 sounds, among them helicopter, gallop, and ping-pong, were identified with better than 50% accuracy even on day 1 by naive listeners, who at that point had no experience with the original sounds.

There was a strong effect of experience with the sounds. Experienced listeners, who were trained to criterion on the original unprocessed sounds, identified the single-channel EMN significantly better than the naive listeners, who had only heard the original sounds once, on day 2: t(12) = 6.74, p < 0.00002. The sounds with a strong temporal pattern (scissors, axe, footsteps) tended to show the greatest improvement in identifiability due to learning of the sounds.

Comparing the results from the single-channel and six-channel EMN conditions, there was significant improvement as spectral information was added, t(12) = 12.61, p < 0.000001. The identification of harmonic sounds, such as bubble, bird, and dog, showed the greatest improvement between the two conditions. There were some sounds that were not affected by experience or by the amount of spectral information:

TABLE III. P(c) by day for each listener group in the EMN experiments.

          Experienced (3a),   Naive (3b),       Naive (3c),
          single-channel      single-channel    six-channel

Day 1     0.46                0.13              0.36
Day 2                         0.23              0.66


TABLE IV. Most and least identifiable sounds for all groups of listeners, along with their p(c) in each condition. The italicized sounds were recognized either nearly perfectly or not at all across the three groups. The boldfaced sound was not identified at all in the single-channel EMN conditions, and nearly perfectly in the six-channel condition.

Most identifiable sounds

Single-channel EMN,       Single-channel EMN,     Six-channel EMN,
experienced listeners     naive listeners         naive listeners

FOOTSTEPS 1.00 HELICOPTER 0.93 BABY CRY 0.98

GALLOP 1.00 RAIN 0.69 BIRD 0.98

HELICOPTER 1.00 CLOCK 0.64 CLOCK 0.98

MATCH 1.00 GALLOP 0.63 PING PONG 0.98

PING PONG 1.00 PING PONG 0.63 GALLOP 0.94

Least identifiable

Single-channel EMN,       Single-channel EMN,     Six-channel EMN,
experienced listeners     naive listeners         naive listeners

BIRD 0.00 BIRD 0.00 FLUTE 0.16

FLUTE 0.00 CAR ACCEL. 0.00 CARS HONK 0.18

CRASH 0.00 CRASH 0.00 ELECTRIC SAW 0.18

ELECTRIC SAW 0.00 ELECTRIC SAW 0.00 PHONE 0.18

GUITAR 0.00 GUITAR 0.00 BUZZER 0.20


Helicopter and gallop were among the most easily recognized sounds in each condition (helicopter was identified correctly at least 90% of the time), whereas electric saw and flute were never identified more than 18% of the time by any of the groups. A listing of the five most identifiable and five least identifiable sounds for each group is in Table IV.

Overall, the EMN identification studies showed that, as with speech, in the absence of frequency information temporal information is sufficient to identify many environmental sounds (35 out of 70) with at least 50% accuracy (chance being 1.4% correct). The sounds that tend to be best identified under these conditions are those with more salient temporal patterning. There was also a strong effect of experience with the sounds, presumably because listeners with the more extensive training from experiment 2 became more attuned to the temporal features important for identification.

VI. ACOUSTIC FACTORS IN EMN IDENTIFICATION

A. Factors considered

In an attempt to quantify temporal structure, several measurements of the original waveforms (preprocessing) were made. These variables reflected different spectral-temporal aspects of the sounds, including statistics of the envelope, autocorrelation statistics, and moments of the long-term spectrum. The measures and a brief description of each are given below.

1. Envelope measures: long-term rms/pause-corrected rms, number of peaks, number of bursts, burst duration/total duration

The first measure roughly corresponds to the amount of silence in a waveform. Peaks are quick transients; bursts are more sustained rises in amplitude, consisting of a 4-dB gain



sustained for at least 20 ms [based on an algorithm developed by Ballas (1993)]. Burst duration/total duration yields a measure of the "roughness" of the envelope.
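This envelope-based burst measure can be sketched as follows. This is our illustration rather than Ballas's published algorithm: the frame-based RMS envelope, the use of the median envelope level as the baseline for the 4-dB criterion, and the function names are all assumptions.

```python
import numpy as np

def envelope_db(x, frame=256):
    # Frame-wise RMS envelope of a waveform, in dB.
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    return 20 * np.log10(rms)

def burst_stats(x, fs, gain_db=4.0, min_dur=0.020, frame=256):
    # A "burst" is a run of envelope frames at least gain_db above the
    # median envelope level, sustained for at least min_dur seconds.
    # Returns (number of bursts, burst duration / total duration).
    env = envelope_db(x, frame)
    above = env >= np.median(env) + gain_db
    min_frames = max(1, int(np.ceil(min_dur * fs / frame)))
    n_bursts = burst_frames = run = 0
    for flag in np.append(above, False):   # sentinel closes a final run
        if flag:
            run += 1
        else:
            if run >= min_frames:
                n_bursts += 1
                burst_frames += run
            run = 0
    return n_bursts, burst_frames * frame / len(x)
```

For a sound that is mostly quiet except for a single loud event, the burst count is 1 and the duration ratio approximates the event's share of the waveform.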

2. Autocorrelation statistics: Number of peaks,maximum, mean, s.d.

The autocorrelation matrix reveals periodicities in the waveform, and the statistics measure different features of these periodicities, such as their intensity and consistency.
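These statistics can be sketched as below; this is our own illustration, and the peak picker (a strict local-maximum test on the normalized autocorrelation) is an assumption about how the peaks were counted:

```python
import numpy as np

def autocorr_peak_stats(x):
    # Normalized autocorrelation (lag 0 == 1) and summary statistics of
    # its local maxima at positive lags: count, maximum, mean, s.d.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]
    interior = ac[1:-1]
    peaks = interior[(interior > ac[:-2]) & (interior > ac[2:])]
    if peaks.size == 0:
        return 0, 0.0, 0.0, 0.0
    return peaks.size, peaks.max(), peaks.mean(), peaks.std()
```

A strongly periodic waveform yields many large, regularly spaced peaks; an aperiodic one yields few and small ones.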

3. Correlogram-based pitch measures (from Slaney, 1995): mean pitch, median pitch, s.d. pitch, max pitch, mean pitch salience, max pitch salience

The correlogram measures the pitch and pitch salience by autocorrelating in sliding 16-ms time windows, and so is a combined spectral and temporal variable.
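A much-simplified stand-in for the correlogram computation (Slaney's model does considerably more; the window stepping, the pitch search limits, and the function name here are our assumptions):

```python
import numpy as np

def correlogram_pitch(x, fs, win=0.016, fmin=60.0, fmax=1000.0):
    # In each 16-ms window, normalize the autocorrelation; the pitch is
    # the reciprocal of the lag of the highest peak in the search range,
    # and the pitch salience is that peak's height (0..1).
    n = int(win * fs)
    lag_lo = int(fs / fmax)
    pitches, saliences = [], []
    for start in range(0, len(x) - n + 1, n):
        seg = x[start:start + n] - np.mean(x[start:start + n])
        ac = np.correlate(seg, seg, mode="full")[n - 1:]
        if ac[0] <= 0:
            continue
        ac = ac / ac[0]
        lag_hi = min(n - 1, int(fs / fmin))
        lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi + 1]))
        pitches.append(fs / lag)
        saliences.append(ac[lag])
    return np.array(pitches), np.array(saliences)
```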

4. Moments of the spectrum: mean (centroid), s.d., skew, kurtosis

These were included to see if some spectral characteristics were preserved when the spectral information was drastically reduced.
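Treating the magnitude spectrum as a distribution over frequency, the four moments can be computed as follows (our sketch; the paper does not specify the exact estimator):

```python
import numpy as np

def spectral_moments(x, fs):
    # Mean (centroid), s.d., skew, and kurtosis of the magnitude
    # spectrum treated as a probability distribution over frequency.
    mag = np.abs(np.fft.rfft(x))
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    p = mag / mag.sum()
    mean = np.sum(f * p)
    var = np.sum((f - mean) ** 2 * p)
    sd = np.sqrt(var)
    if sd == 0:
        return mean, 0.0, 0.0, 0.0
    skew = np.sum((f - mean) ** 3 * p) / sd ** 3
    kurt = np.sum((f - mean) ** 4 * p) / var ** 2
    return mean, sd, skew, kurt
```

A pure tone gives a centroid at the tone's frequency with a near-zero s.d.; a broadband noise gives a centroid near the middle of the band with a large s.d.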

5. Spectral shift in time measures: centroid mean, centroid s.d., mean centroid velocity, s.d. centroid velocity, max centroid velocity

The centroid mean and s.d. are based on consecutive 50-ms time windows throughout the waveform. The spectral centroid velocity was calculated by measuring the spectral centroid in sliding 50-ms rectangular time windows. As with the correlogram, the spectral shift in time is a spectral-


temporal feature which may be represented as temporal discontinuities in an EMN, since in the real world changes in the spectrum of a sound are usually accompanied by temporal discontinuities.
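The centroid-velocity statistics can be sketched as below (our illustration, using non-overlapping 50-ms rectangular windows; the original analysis may have used a different hop):

```python
import numpy as np

def centroid_track(x, fs, win=0.050):
    # Spectral centroid in consecutive 50-ms rectangular windows.
    n = int(win * fs)
    f = np.fft.rfftfreq(n, 1.0 / fs)
    cents = []
    for start in range(0, len(x) - n + 1, n):
        mag = np.abs(np.fft.rfft(x[start:start + n]))
        if mag.sum() > 0:
            cents.append(np.sum(f * mag) / mag.sum())
    return np.array(cents)

def centroid_velocity_stats(x, fs, win=0.050):
    # Mean, s.d., and max of |d(centroid)/dt| across successive windows,
    # in Hz per second.
    v = np.abs(np.diff(centroid_track(x, fs, win))) / win
    return v.mean(), v.std(), v.max()
```

A sound whose spectrum jumps abruptly produces one large velocity value against a near-zero background, so the max greatly exceeds the mean.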

6. Cross-channel correlation

This measures the consistency of the envelope across frequency channels.
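One way to compute such a measure is to correlate frame-wise RMS envelopes of a small filterbank; the FFT-masking bandpass, the frame size, and the band edges below are our choices for illustration, not the paper's channel definitions:

```python
import numpy as np

def band_envelope(x, fs, lo, hi, frame=80):
    # Envelope of one frequency band: zero FFT bins outside [lo, hi),
    # invert, then take the frame-wise RMS of the band-limited signal.
    spec = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    spec[(f < lo) | (f >= hi)] = 0
    band = np.fft.irfft(spec, n=len(x))
    m = len(band) // frame
    return np.sqrt(np.mean(band[:m * frame].reshape(m, frame) ** 2, axis=1))

def cross_channel_correlation(x, fs, edges=(150, 300, 600, 1200, 2400)):
    # Mean pairwise correlation of band envelopes: near 1 when all bands
    # share one temporal pattern, near 0 when they are unrelated.
    envs = [band_envelope(x, fs, lo, hi)
            for lo, hi in zip(edges[:-1], edges[1:])]
    rs = [np.corrcoef(a, b)[0, 1]
          for i, a in enumerate(envs) for b in envs[i + 1:]]
    return float(np.mean(rs))
```

Gating a broadband noise on and off imposes the same envelope on every band, so the measure is high; a steady noise has independent envelope fluctuations per band, so it is near zero.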

B. Modeling results

Each of these measures was correlated with the EMN identification results for each group (for the single-channel and six-channel conditions with naive listeners, only the day 2 data were used). The magnitude of the correlation may suggest the degree to which each measure captures a temporal feature that listeners use to identify sounds in the absence of spectral information, or alternatively, the perceptual "weight" given to that measure. Multiple regression analysis of the experienced listeners' results using forward stepwise regression retained seven variables which were largely independent: the highest correlation between any two of the variables was r = 0.39. The partial correlations of all retained variables (ordered by decreasing magnitude) are shown in Table V. The variable with the largest r, spectrum s.d., suggests that sounds that have broader spectra are better recognized as EMN than sounds with narrower spectra. As mentioned, several of the rhythmic sounds, such as scissors and gallop, have long-term spectra similar to a broadband noise, and so they are not changed appreciably as EMN. Alternatively, the spectrum s.d. may reflect a spectral attribute of the original sound that is correctly inferred from the EMN by listeners familiar with the original sounds. The next three variables more directly represent temporal aspects of the waveform, such as the amount of silence, the "roughness" of the envelope, the periodicity, and the presence of amplitude bursts. Together, the seven variables accounted for 63% of the variance in the identification thresholds.

TABLE V. Single-channel EMN, experienced listeners' partial r.

Spectrum s.d. 0.485

Burst duration/total duration 0.445

Cross-channel correlation 0.367

Maximum spectral centroid velocity 0.319

# of autocorrelation peaks 0.298

Pause-corrected rms/total rms -0.196

Spectral centroid velocity s.d. -0.155

Multiple regression solution, R = 0.79, R² = 0.63
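The forward stepwise procedure can be sketched as a greedy search that repeatedly adds the predictor giving the largest R² gain; the stopping criterion (`min_gain`) is our assumption, not the authors' exact rule:

```python
import numpy as np

def r_squared(X, y):
    # R^2 of an ordinary least-squares fit with an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, min_gain=0.01):
    # Greedily add the predictor that most increases R^2, stopping when
    # the best remaining predictor adds less than min_gain.
    chosen, r2 = [], 0.0
    while len(chosen) < X.shape[1]:
        best_r2, best_j = max(
            (r_squared(X[:, chosen + [j]], y), j)
            for j in range(X.shape[1]) if j not in chosen)
        if best_r2 - r2 < min_gain:
            break
        chosen.append(best_j)
        r2 = best_r2
    return chosen, r2
```

When the outcome really depends on only a few of the candidate predictors, the procedure picks those out and stops before admitting the irrelevant ones.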

TABLE VI. Single-channel EMN, naive listeners' partial r.

# of autocorrelation peaks 0.360

Cross-channel correlation 0.359

Burst duration/total duration 0.340

Spectrum s.d. 0.139

Pause-corrected rms/total rms -0.019

Multiple regression solution, R = 0.62, R² = 0.39


Multiple regressions on the naive listeners' single-channel EMN results accounted for less variance (39%), although there was a problem of restricted range, since the naive listeners' performance was overall not very good. The solution yielded five independent variables, all of which had been included in the solution for experienced listeners (see Table VI).

The multiple regression solutions for the six-channel EMN data with naive listeners retained nine independent variables and accounted for half of the variance. Table VII shows the partial correlations for the variables with the six-channel EMN. The starred variables in Table VII were the same as for the experienced listeners' single-channel EMN.

Three variables which occurred in all three solutions were the number of autocorrelation peaks, the ratio of burst duration to total duration, and cross-channel correlation. These are all temporal features, reflecting periodicity, amount of silence, and coherence of envelope across channels. There are two notable differences between the solutions for the single- and six-channel EMN for the naive listeners. All the statistics of the autocorrelation peak (number, maximum, mean, and s.d.) were significant predictors of the identification of six-channel EMN, but not single-channel EMN. In addition, the spectrum mean (also called the centroid) was a significant predictor only in the six-channel condition, reflecting the presence of more spectral information in six-channel EMN.

In general, the amount of variance accounted for and the relatively few outliers (points outside the 95%-confidence limits of the regression line) in the various multiple regression solutions suggest that in each case these variables likely represent acoustic features to which listeners pay some attention. The different variables retained in each solution point to the different features listeners were attending to in each condition. Most of the variables used are fairly low order, so it is possible there are higher-order variables incorporating the lower-order ones that would do a better job of capturing the temporal structure of these sounds.

VII. DISCUSSION AND CONCLUSIONS

These studies illustrate that some methods borrowed from speech research can contribute to our understanding of the recognition of environmental sounds. Although determining importance functions and something akin to an

TABLE VII. Six-channel EMN, naive listeners' partial r.

# of autocorrelation peaks 0.381*

Maximum autocorrelation peak 0.319

Burst duration/total duration 0.303*

s.d. autocorrelation peak -0.293

Cross-channel correlation -0.230*

Spectrum mean -0.230

Mean salience -0.171

Mean spectral centroid velocity 0.170

Mean autocorrelation peak 0.170

Multiple regression solution, R = 0.70, R² = 0.50


Articulation Index for all environmental sounds is probably not feasible, given the diversity of environmental sounds, the findings do suggest some frequency regions that are, on average, more informative than others for identification. The most important frequencies are in the 1200–2400-Hz region, comparable to those found for speech. Like speech, a wide variety of environmental sounds can be either severely high- or low-pass filtered and still be identifiable. Even at the most extreme low- and high-pass filter settings, 300 and 8000 Hz, the percent of correct identification was never less than 50%. For speech, by comparison, the important frequency bands for calculating the AI span a more restricted range, from 300 to 6400 Hz.

The crossover point for environmental sounds, 1300 Hz (see Fig. 1), is approximately the same as that for continuous discourse. As mentioned earlier, several studies have found that the crossover frequency for speech materials tends to decrease as the redundancy of the speech materials increases (Studebaker et al., 1987; Bell et al., 1992). This may be because the redundancy increases the intelligibility of the high-frequency end of the speech signal by reducing the effective size of the possible catalog of speech sounds which may be present. The high-frequency region of the speech spectrum has less energy than the low-frequency end, and typically would yield lower p(c) under low-redundancy listening.

The crossover point might indicate that environmental sounds, too, have a high redundancy. However, for speech materials redundancy is generally defined as linguistic context, in which the possibilities for a given speech sound (phoneme, morpheme, word) are highly constrained by the preceding or subsequent speech sounds. It is not clear what units, if any, within a sound might act as phonemes or words with sequential dependencies that a listener can detect. Although meaningful sequences of environmental sounds surely exist, Ballas and Mullins (1991) have shown that the "grammar" for such sequences is much less constrained than it is for speech, especially in experiments such as those described here. This is also likely to be the case in the present experiments, in which single environmental sounds selected at random are presented outside of any relevant context. It is certainly possible that the experimental setting could lead to underestimation of the identifiability of sounds that would be much more recognizable in an appropriate setting.

A concern with the closed-set format (even though it was 70 AFC) is that it could lead to overestimation of the identifiability of some sounds, due to response biases for certain sounds. Although there may have been some response biases, they were greatly reduced by the training on the unprocessed forms of the 70 sounds. One consequence of continuing that training until the listeners could respond accurately with short latencies was that they certainly came to have far more homogeneous expectations for all the sounds than they did before the training. In fact, an analysis of the confusion matrices for each of the experiments showed no large or consistent response biases for any particular sounds.

The greater recognizability of high-pass filtered environmental sounds as compared to speech may be due to the


presence of relatively more high-frequency information in the environmental sounds. Figure 6 plots the summed spectra of the environmental sounds used in these studies versus the summed long-term spectra of multiple talkers, equated for overall level. The slope for the environmental sounds is lower than that for speech, suggesting that there is relatively more high-frequency energy, at least for the set of sounds used in this study. Clearly, a wide variety of spectra is possible with environmental sounds, because of the greater variety of objects and events that produce these sounds. Almost all speech is created by the movement of air through cavities of various sizes. With very rare exceptions, impacts of solid bodies are not involved. However, of the 70 environmental sounds used in these studies, 25 of them were created at least in part by the impact of two solid bodies. Since brief impacts, by definition, create broadband transients (the brevity being proportional to the rigidity of the bodies involved), it would be expected that environmental sounds would have a flatter spectrum (thus more high-frequency energy) than speech.

The ability to recognize many environmental sounds is robust despite removal of fine-grained spectral information. When listeners' decisions were based on temporal information and limited spectral information, half of the environmental sounds in this sample set were still recognized at better than 50% correct by experienced listeners. It may be that somewhat more spectral information is necessary for near-perfect identification of a wide range of environmental sounds than for near-perfect identification of speech; Shannon et al. (1995) showed that speech was recognized nearly perfectly with four channels, whereas identification of environmental sounds was approximately 66% with six channels. A listener who was extremely familiar with the sounds (the first author) had 90% or better recognition on only two-thirds of the six-channel EMN, and some EMN, like phone, electric saw, and harp, were almost unidentifiable. Like speech, environmental sounds have a multiplicity of acoustic cues available to listeners, due to their complex spectral-temporal composition. For both classes of sounds this redundancy enables identification with a degraded signal.

FIG. 6. Long-term spectra of 70 summed environmental sounds and 6 summed talkers, equated for overall level.


The speech studies which motivated this paper, such as vocoder studies and the development of the AI, have greatly assisted the development of voice communication devices. Recently there has been quite a bit of work involving computer use of natural sounds, such as "auralization," auditory warning devices, and work with virtual environments. The results reported here may be useful to investigators and applications developers in these areas.

As these filtered-sound and EMN studies have shown, the differences between individual sounds make it difficult to draw generalizations about all environmental sounds. The spectral-temporal variation in environmental sounds is immense, from steady-state inharmonic sounds (buzzers) to rapidly varying harmonic sounds (birds singing). Even within types of sounds there can be a great deal of variation, depending on the particular token used. So, they may in fact not all be listened to in the same way. Lewicki (2002) suggested a separation between "animal vocalizations" and "other environmental sounds" in terms of coding by the auditory system, and Gaver's (1993a) taxonomy may relate to listening strategies, as well as to source distinctions.

A more productive approach to understanding the perception of environmental sounds may be to consider subclasses of those sounds. Some possible bases for such classification are:

Acoustic features, such as the harmonic/inharmonic continuum; the frequency region which allows the best identifiability; or, perhaps, the salience of temporal structure.

The sources and events which produced the sound, as proposed by Gaver (1993a). Thus acoustic commonalities


might be specified that differentiate between "impact," "water-based," and "air-based," or "simple," "hybrid," and "complex" environmental sounds.

Higher-order semantic features, such as causal ambiguity, ecological frequency, or importance to the listener. Examples: "alerting sounds," "rewarding sounds," "ignorable sounds."

Future research with environmental sounds will almost certainly reveal a combination of bottom-up and top-down processing, as has been observed with many other categories of familiar stimuli. As in the case of speech, the listeners' knowledge of likely sequences of sounds will greatly aid both detection and recognition. While there may be no formal grammar constraining the sequences of such sounds, nature clearly imposes a great many limitations on the spectral-temporal structure of environmental sounds and the probabilities of various sequences of environmental sounds. These constraints operate on many different levels, from the moment-to-moment details of a single event (such as a door closing) to constraints on the types of acoustic events that are likely to occur in a given context. The extent to which listeners are sensitive to these different constraints remains to be determined.

ACKNOWLEDGMENTS

This research was supported by Grant No. R01 DC00250 from the National Institute on Deafness and Other Communication Disorders and by Grant No. MH12436-01


TABLE VIII. RMS when omitting pauses of different lengths, shown below for 16 of the sounds used in these studies.

Sound          Pause length: 15 ms   25 ms   35 ms   50 ms   65 ms   100 ms  | Full sound

Door opening 2176.21 2158.5 2129.79 2090.01 2090.01 2030.68 1637.1

Baby 1490.27 1483.97 1483.97 1483.97 1483.97 1483.97 1446.7

Horse gallop 2769.68 2769.68 2769.68 2769.68 2769.68 2769.68 2705.

Drums 4448.92 4448.92 4448.92 4448.92 4448.92 4448.92 4368.8

Electric saw 2354.36 2354.36 2354.36 2354.36 2354.36 2354.36 2318.6

Clock 2883.74 2883.74 2818.6 2818.6 2818.6 2818.6 1344.0

Helicopter 3153.31 3153.31 3153.31 3153.31 3153.31 3153.31 3092.2

Cough 2223.79 2223.79 2223.79 2223.79 2223.79 2223.79 2115.6

Cow 1513.16 1513.16 1513.16 1513.16 1513.16 1513.16 1453.1

Cymbal 2698.41 2698.41 2698.41 2698.41 2698.41 2698.41 2617.3

Bird 2508.29 2508.29 2508.29 2508.29 2508.29 2508.29 2496.7

Bubbles 2600.22 2600.22 2600.22 2600.22 2600.22 2600.22 2543.5

Dog 2412.54 2340.24 2340.24 2286.93 2286.93 2286.93 2287.6

Car starting 1656.27 1656.27 1656.27 1656.27 1656.27 1656.27 1638.

Rooster 4161.07 4161.07 4161.07 4161.07 4161.07 4161.07 4057.9

Water pouring 765.64 765.64 765.64 765.64 765.64 765.64 797.6

Mean rms 2643.66 2637.64 2633.88 2630.16 2628.85 2621.48 2481.2


APPENDIX B: PROPORTION CORRECT FOR ALL SOUNDS IN EXPERIMENT 1 BY FILTER TYPE

Sound | High-pass filter cutoff (Hz): 300 600 1200 2400 4800 8000 | Low-pass filter cutoff (Hz): 300 600 1200 2400 4800

AIRPLANE 0.50 0.76 0.71 0.73 0.65 0.78 | 0.87 0.95 0.91 0.94 1.
AXE 1.00 0.86 0.93 0.73 0.41 0.56 | 0.47 0.76 0.77 0.73 0.
B-BALL 0.83 0.73 0.71 0.73 0.58 0.78 | 1.00 0.89 0.89 1.00 0.9
BABY 1.00 1.00 1.00 1.00 1.00 1.00 | 0.95 1.00 1.00 1.00 1.0
BEER 1.00 1.00 1.00 1.00 0.96 1.00 | 0.12 0.87 1.00 1.00 1.
BELLS 1.00 1.00 1.00 1.00 1.00 1.00 | 0.86 1.00 1.00 1.00 1.
BILLIARDS 1.00 0.96 0.92 1.00 0.73 0.11 | 0.53 0.87 1.00 1.00 1.0
BIRD 1.00 1.00 1.00 1.00 1.00 1.00 | 0.00 0.00 0.73 1.00 1.
BOWLING 1.00 0.82 0.77 0.09 0.18 0.44 | 0.47 0.81 0.94 0.94 1.
BUBBLES 0.94 0.95 0.67 0.48 0.38 0.22 | 1.00 1.00 1.00 0.94 1.
BUZZER 0.75 1.00 1.00 1.00 0.94 1.00 | 0.31 0.73 1.00 0.90 1.
CAMERA 1.00 1.00 1.00 0.94 1.00 0.86 | 0.00 0.71 0.93 1.00 1.
CAR START 1.00 0.95 1.00 0.90 0.95 1.00 | 0.82 0.75 1.00 1.00 0.
CAT 1.00 0.91 1.00 1.00 1.00 0.78 | 0.00 0.78 1.00 1.00 1.
CHIMP 1.00 1.00 1.00 1.00 1.00 1.00 | 0.00 1.00 1.00 1.00 0.
CLAPS 1.00 1.00 0.67 0.60 0.42 0.29 | 0.13 0.35 0.92 1.00 1.
CLOCK 1.00 1.00 1.00 1.00 1.00 1.00 | 0.69 0.76 0.93 1.00 1.
COPTER 1.00 1.00 0.91 0.93 0.74 0.89 | 0.80 0.95 0.85 1.00 1
COUGH 1.00 1.00 0.83 0.95 0.85 1.00 | 0.80 1.00 1.00 1.00 1.
COW 1.00 1.00 1.00 1.00 0.82 0.29 | 1.00 1.00 1.00 1.00 1.
CRASH 1.00 0.94 0.73 0.81 0.80 0.57 | 0.47 0.71 0.87 1.00 0.
CYMBAL 1.00 1.00 1.00 1.00 1.00 1.00 | 0.94 0.88 0.95 0.86 1.0
DOG 1.00 1.00 1.00 0.80 0.65 0.67 | 0.36 0.80 1.00 1.00 1.
DOOR 1.00 0.73 0.60 0.71 0.59 1.00 | 0.40 0.92 0.89 1.00 1.
DRUMS 0.88 0.92 0.93 1.00 1.00 0.89 | 1.00 1.00 0.95 0.96 1.
FLUTE 1.00 1.00 1.00 1.00 1.00 0.71 | 1.00 1.00 1.00 1.00 1.
FOOTSTEP 1.00 1.00 1.00 1.00 0.82 0.86 | 1.00 1.00 1.00 1.00 1
GALLOP 1.00 1.00 1.00 1.00 1.00 1.00 | 0.94 1.00 1.00 1.00 1.
GARGLE 1.00 1.00 1.00 0.88 0.43 0.44 | 0.82 1.00 0.95 1.00 0.
GLASS 1.00 1.00 1.00 1.00 1.00 1.00 | 0.00 0.00 0.00 0.71 1.
GUITAR 1.00 1.00 1.00 0.95 1.00 0.29 | 1.00 1.00 1.00 1.00 1.
GUN 1.00 1.00 1.00 0.77 0.47 0.33 | 0.67 0.88 1.00 1.00 1.
HAMMER 1.00 0.93 0.79 0.64 0.69 0.67 | 0.62 0.87 1.00 0.95 1.
HARP 1.00 1.00 1.00 1.00 1.00 0.57 | 1.00 1.00 1.00 1.00 1.
ICE DROP 1.00 1.00 1.00 1.00 1.00 0.56 | 0.13 0.85 1.00 1.00 1
KEYBOARD 1.00 1.00 0.74 0.64 0.54 0.29 | 0.07 0.82 0.80 0.95 0.
LAUGH 1.00 1.00 1.00 1.00 1.00 1.00 | 1.00 1.00 1.00 1.00 1.
MATCH 1.00 1.00 1.00 0.94 1.00 0.89 | 0.40 0.65 0.73 0.94 1.
NEIGH 1.00 1.00 1.00 1.00 0.93 1.00 | 0.96 0.88 1.00 1.00 1.
PAPER 0.92 1.00 0.90 0.93 0.81 0.71 | 0.05 0.05 0.47 0.80 0
PEELOUT 1.00 0.91 0.76 0.80 0.05 0.14 | 0.82 0.94 1.00 1.00 1
PHONE 1.00 1.00 1.00 1.00 1.00 1.00 | 0.00 0.67 1.00 1.00 0
PINGPONG 1.00 1.00 1.00 1.00 1.00 1.00 | 1.00 1.00 0.94 0.80 1
POUR 1.00 1.00 1.00 1.00 0.95 1.00 | 0.80 0.87 0.93 1.00 1
PRINTER 1.00 0.95 1.00 1.00 1.00 0.89 | 0.93 0.91 1.00 1.00 1
PROJECTOR 0.94 1.00 0.93 1.00 1.00 1.00 | 0.64 0.77 1.00 1.00 1
RAIN 1.00 1.00 1.00 1.00 0.95 0.86 | 0.09 0.18 0.69 0.91 1.
REGISTER 1.00 1.00 1.00 1.00 1.00 1.00 | 0.82 0.86 1.00 1.00 1
ROCKING 1.00 1.00 1.00 1.00 1.00 1.00 | 0.64 0.68 1.00 1.00 1.
ROOSTER 1.00 1.00 1.00 1.00 0.96 1.00 | 0.67 0.95 1.00 1.00 1
SAW 0.92 1.00 0.93 1.00 1.00 0.57 | 0.00 0.87 0.93 0.94 0.
SCISSOR 1.00 1.00 1.00 1.00 1.00 0.86 | 0.47 0.73 0.92 0.93 1
SCREEN 1.00 0.94 1.00 1.00 0.88 1.00 | 0.08 0.79 1.00 0.96 0
SHEEP 0.92 1.00 0.94 0.80 0.64 0.57 | 0.05 0.73 0.76 0.89 1
SHOVEL 1.00 0.94 0.93 0.93 0.79 1.00 | 0.88 0.91 1.00 1.00 1.
SIREN 1.00 1.00 1.00 1.00 0.94 1.00 | 0.95 1.00 1.00 1.00 1.
SNEEZE 1.00 1.00 1.00 0.80 0.56 0.67 | 0.84 0.93 1.00 0.96 1
SPLASH 1.00 1.00 0.68 0.75 0.41 0.43 | 0.15 0.53 0.76 1.00 1
STAPLER 1.00 1.00 0.93 1.00 0.94 1.00 | 0.88 0.82 0.92 1.00 1
TENNIS 1.00 1.00 1.00 0.95 0.79 0.86 | 0.24 0.74 1.00 1.00 1.
THUNDER 0.50 0.35 0.18 0.23 0.29 0.00 | 0.93 1.00 0.73 0.91 1.
TOILET 1.00 1.00 1.00 0.88 0.76 0.43 | 0.86 1.00 1.00 1.00 1.
TRAFFIC 1.00 1.00 1.00 1.00 0.93 1.00 | 0.94 1.00 1.00 1.00 1.
TRAIN 1.00 1.00 0.86 1.00 0.64 0.89 | 0.00 0.64 0.82 1.00 1.
TREE FALL 1.00 1.00 1.00 1.00 0.67 0.44 | 0.26 0.73 0.95 1.00 1.
TYPEWRITER 1.00 1.00 0.87 0.89 0.89 1.00 | 0.35 0.69 0.91 0.95 0
WAVES 0.75 0.73 0.55 0.18 0.27 0.14 | 0.20 0.53 0.67 0.82 0.
WHISTLE 1.00 1.00 1.00 1.00 1.00 1.00 | 0.00 0.00 1.00 1.00 1.
WIPERS 1.00 1.00 0.93 1.00 0.89 0.78 | 1.00 1.00 1.00 1.00 1
ZIPPER 1.00 1.00 1.00 1.00 1.00 1.00 | 0.53 0.53 1.00 1.00 1

Mean 0.97 0.96 0.92 0.89 0.81 0.76 | 0.57 0.79 0.92 0.97 0
s.d. 0.10 0.10 0.15 0.20 0.24 0.29 | 0.37 0.25 0.15 0.06 0


APPENDIX C: PROPORTION CORRECT FOR ALL SOUNDS IN EXPERIMENT 2 BY FILTER TYPE

SOUND | Bandpass cutoffs (Hz): 150–300 300–600 600–1200 1200–2400 2400–4800 4800–9600

AIRPLANE 1.00 0.89 0.85 0.73 0.60 0.72
AXE 0.42 0.63 0.60 0.48 0.55 0.63
BABY 0.33 1.00 1.00 1.00 1.00 1.00
B-BALL 0.72 0.75 0.81 0.63 0.56 0.53
BEER 0.15 0.33 0.75 0.44 0.60 0.63
BELLS 0.06 0.95 1.00 1.00 1.00 0.93
BILLIARDS 0.00 0.10 0.47 0.85 0.98 0.55
BIRD 0.00 0.03 0.14 0.97 1.00 0.95
BOWLING 0.08 0.50 0.78 0.23 0.10 0.00
BUBBLES 0.93 1.00 0.98 0.93 0.53 0.18
BUZZER 0.03 0.19 0.98 0.90 0.88 0.78
CAMERA 0.00 0.48 1.00 1.00 1.00 0.93
CAR START 0.10 0.72 0.78 0.98 0.98 0.98
CAT 0.03 0.70 0.83 1.00 1.00 1.00
CHIMP 0.00 0.38 1.00 1.00 0.98 1.00
CLAPS 0.45 0.50 0.58 0.31 0.58 0.58
CLOCK 0.45 0.93 0.98 1.00 1.00 1.00
COPTER 0.43 0.83 0.95 0.90 0.90 0.80
COUGH 0.10 0.78 0.93 0.98 0.83 0.58
COW 0.48 1.00 0.98 0.95 0.93 0.88
CRASH 0.08 0.05 0.03 0.05 0.20 0.25
CYMBAL 0.23 0.68 0.80 0.98 0.92 1.00
DOG 0.00 0.98 0.98 0.98 0.75 0.43
DOOR 0.50 0.23 0.33 0.40 0.38 0.35
DRUMS 0.91 0.83 0.85 0.93 1.00 0.90
FLUTE 0.58 0.97 1.00 1.00 1.00 0.83
FOOTSTEP 0.63 0.78 0.61 0.93 0.89 0.80
GALLOP 0.95 0.83 1.00 0.95 0.97 1.00
GARGLE 0.28 0.97 0.95 0.95 0.95 0.45
GLASS 0.00 0.00 0.00 0.75 0.98 0.98
GUITAR 0.83 0.97 0.97 1.00 1.00 0.65
GUN 0.78 0.85 0.75 0.78 0.60 0.53
HAMMER 0.70 0.88 0.88 0.90 0.75 0.55
HARP 1.00 1.00 0.98 0.98 1.00 0.98
ICE DROP 0.00 0.28 0.90 0.98 1.00 0.98
KEYBOARD 0.05 0.65 0.68 0.58 0.63 0.80
LAUGH 0.98 1.00 1.00 1.00 1.00 1.00
MATCH 0.35 0.42 0.70 0.83 0.83 0.88
NEIGH 0.00 0.25 1.00 1.00 0.98 0.93
PAPER 0.03 0.05 0.18 0.88 0.80 0.42
PEELOUT 0.80 0.94 0.69 0.55 0.69 0.00
PHONE 0.00 0.03 1.00 1.00 1.00 1.00
PINGPONG 0.85 1.00 1.00 1.00 1.00 1.00
POUR 0.00 0.60 0.68 1.00 0.83 0.65
PRINTER 0.03 1.00 0.98 0.89 1.00 0.83
PROJECTOR 0.33 0.14 0.53 1.00 0.97 0.93
RAIN 0.00 0.25 0.28 0.53 0.88 0.93
REGISTER 0.00 0.18 0.90 0.90 1.00 1.00
ROCKING 0.33 0.65 0.93 0.95 0.98 0.98
ROOSTER 0.03 0.83 1.00 1.00 0.98 0.93
SAW 0.03 0.65 0.80 0.93 0.92 0.97
SCISSOR 0.06 0.45 0.93 1.00 0.95 1.00
SCREEN 0.15 0.69 1.00 0.98 0.98 0.89
SHEEP 0.20 0.38 0.78 0.67 0.90 0.83
SHOVEL 0.35 0.41 0.58 0.70 0.73 0.83
SIREN 0.00 1.00 1.00 1.00 1.00 1.00
SNEEZE 0.65 0.67 0.98 0.94 0.85 0.58
SPLASH 0.08 0.08 0.28 0.30 0.58 0.43
STAPLER 0.58 0.64 0.60 0.70 0.63 0.50
TENNIS 0.13 0.08 0.75 0.68 0.70 0.65
THUNDER 0.73 0.43 0.08 0.03 0.00 0.00
TOILET 0.28 0.48 0.66 0.45 0.18 0.15
TRAFFIC 0.43 0.92 1.00 1.00 1.00 1.00
TRAIN 0.00 0.15 0.85 0.97 0.93 0.83
TREE FALL 0.00 0.03 0.55 0.63 0.65 0.47
TYPEWRITER 0.10 0.08 0.58 0.81 0.93 0.88
WAVES 0.23 0.48 0.55 0.58 0.22 0.22
WHISTLE 0.00 0.00 0.93 1.00 0.98 0.97
WIPERS 0.30 0.95 0.78 0.65 0.58 0.63
ZIPPER 0.13 0.00 0.05 0.48 0.73 0.73

Mean 0.31 0.57 0.75 0.81 0.81 0.73
s.d. 0.32 0.35 0.28 0.25 0.25 0.28


from the National Institute of Mental Health. The bandpass filter and EMN conditions were conducted as part of the doctoral dissertation at Indiana University of the first author. Numerous conversations with Diane Kewley-Port and Donald Robinson contributed materially to this research, and Eric Healey supplied the MATLAB code for the bandpass filters. Pierre Divenyi reviewed the finished product and made several helpful suggestions and comments.

APPENDIX A: EQUATING THE RMS OF ENVIRONMENTAL SOUNDS

Obtaining SPL measurements for spectrally complex, time-varying sounds such as the sounds used in this study is problematic. Various measures have been used in the speech literature for measuring speech level, such as ``fast,'' ``slow,'' and ``impulse'' detector-indicator characteristics, as well as long-term rms. Ludvigsen (1992), comparing these various measures of speech level, concluded that long-term rms integration was the preferred method, except when the signal consisted of a series of words with pauses between them, in which case the long-term rms with a correction for the known pauses, what he termed ``pause-corrected'' rms, most accurately represented the energy in the signal.

For the identification studies reported in this paper, the rms was computed with gaps of silence longer than 50 ms omitted from the calculation. The 50-ms value was arrived at partially through intuition and partially through comparing the rms when omitting pauses of different lengths, shown in Table VIII for 16 of the sounds used.
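A pause-corrected rms of this kind can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function name, the amplitude-based silence criterion, and the `silence_thresh` parameter are assumptions introduced here for concreteness.

```python
import numpy as np

def pause_corrected_rms(x, fs, max_gap_ms=50.0, silence_thresh=0.01):
    """RMS of signal x, excluding silent gaps longer than max_gap_ms.

    Silence is defined here (an assumed criterion) as samples whose
    magnitude falls below silence_thresh times the signal's peak.
    """
    x = np.asarray(x, dtype=float)
    silent = np.abs(x) < silence_thresh * np.abs(x).max()
    keep = np.ones(len(x), dtype=bool)
    max_gap = int(fs * max_gap_ms / 1000.0)
    # Locate runs of consecutive silent samples via the edges of the
    # silence indicator; drop only runs longer than max_gap samples.
    edges = np.diff(np.concatenate(([0], silent.astype(int), [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    for s, e in zip(starts, ends):
        if e - s > max_gap:
            keep[s:e] = False
    return np.sqrt(np.mean(x[keep] ** 2))
```

With this scheme, a 200-ms silent gap is excluded from the energy calculation, while a 40-ms gap (shorter than the 50-ms criterion) is retained, so brief articulatory-scale pauses still count toward the level.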

1The use of headphones clearly limits the ecological validity of these studies to some degree. However, it was decided that headphones would probably yield stimuli with spectral characteristics closer to those of the recorded sounds than would loudspeakers in the multisubject facility available for this research.

Attias, H., and Schreiner, C. E. (1997). "Temporal low-order statistics of natural sounds," in Advances in Neural Info Processing Systems, edited by M. Mozer (MIT Press, Cambridge, MA), Vol. 9, pp. 27-33.

Ballas, J. A. (1993). "Common factors in the identification of an assortment of brief everyday sounds," J. Exp. Psychol. Hum. Percept. Perform. 19(2), 250-267.

Ballas, J. A., and Howard, J. H. (1987). "Interpreting the language of environmental sounds," Environ. Behav. 19(1), 91-114.

Ballas, J. A., and Mullins, T. (1991). "Effects of context on the identification of everyday sounds," Hum. Perform. 4(3), 199-219.

Bell, T. S., Dirks, D. D., and Trine, T. D. (1992). "Frequency-importance functions for words in high- and low-context sentences," J. Speech Hear. Res. 35, 950-959.

Cabe, P. A., and Pittenger, J. B. (2000). "Human sensitivity to acoustic information from vessel filling," J. Exp. Psychol. Hum. Percept. Perform. 26(1), 313-324.

Carello, C., Anderson, K. L., and Kunkler-Peck, A. J. (1998). "Perception of object length by sound," Psychol. Sci. 9(3), 211-214.


Carrat, R. (1984). "Analysis and synthesis of speech regarding cochlear implant," Acta Oto-Laryngol., Suppl. 411, 85-94.

Compton, A. J. (1963). "Effects of filtering and vocal duration on the identification of speakers, aurally," J. Acoust. Soc. Am. 35(11), 1748-1752.

Dudley, H. (1940). "Remaking speech," J. Acoust. Soc. Am. 11, 169-177.

Dudley, H. (1958). "Phonetic pattern recognition vocoder for narrow-band speech transmission," J. Acoust. Soc. Am. 30, 733-739.

Eremenko, Y. I., and Ermakov, V. P. (1985). "Theoretical aspects of the construction of reading devices for the blind using vocoder techniques," Defektologiya 1, 81-86.

Freed, D. (1990). "Auditory correlates of perceived mallet hardness for a set of recorded percussive sound events," J. Acoust. Soc. Am. 87(1), 311-322.

French, N. R., and Steinberg, J. C. (1947). "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am. 19, 90-119.

Gaver, W. W. (1993a). "What in the world do we hear?: An ecological approach to auditory event perception," Ecological Psychol. 5, 1-29.

Gaver, W. W. (1993b). "How do we hear in the world?: An ecological approach to auditory event perception," Ecological Psychol. 5, 285-313.

Halpern, D., Blake, R., and Hillenbrand, B. (1986). "Psychoacoustics of a chilling sound," Percept. Psychophys. 39(2), 77-80.

Kunkler-Peck, A. J., and Turvey, M. T. (2000). "Hearing shape," J. Exp. Psychol. Hum. Percept. Perform. 26(1), 279-294.

Lakatos, S., McAdams, S., and Caussé, R. (1997). "The representation of auditory source characteristics: Simple geometric form," Percept. Psychophys. 59(8), 1180-1190.

Lass, N. J., Eastman, S. K., Parrish, W. C., Scherbick, K. A., and Ralph, D. (1982). "Listeners' identification of environmental sounds," Percept. Mot. Skills 55, 75-78.

Lewicki, M. S. (2002). "Efficient coding of natural sounds," Nat. Neurosci. 5(4), 356-363.

Li, X., Logan, R., and Pastore, R. (1991). "Perception of acoustic source characteristics: Walking sounds," J. Acoust. Soc. Am. 90(6), 3036-3049.

Ludvigsen, C. (1992). "Comparison of certain measures of speech and noise level," Scand. Audiol. 21, 23-29.

Lutfi, R. A., and Oh, E. (1997). "Auditory discrimination of material changes in a struck clamped bar," J. Acoust. Soc. Am. 102(6), 3647-3656.

Lutfi, R. A. (2001). "Auditory detection of hollowness," J. Acoust. Soc. Am. 110(2), 1010-1019.

Repp, B. (1987). "The sound of two hands clapping: An exploratory study," J. Acoust. Soc. Am. 81, 1100-1109.

Shannon, R. V., Zeng, F. G., Wygonski, J., Kamath, V., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303-304.

Shannon, R. V., Zeng, F. G., and Wygonski, J. (1998). "Speech recognition with altered spectral distribution of envelope cues," J. Acoust. Soc. Am. 104(4), 2467-2476.

Slaney, M. (1995). "Auditory Toolbox: A MATLAB toolbox for auditory modeling work," Apple Computer Technical Report #45.

Studebaker, G. A., Pavlovic, C. V., and Sherbecoe, R. L. (1987). "A frequency importance function for continuous discourse," J. Acoust. Soc. Am. 81, 1130-1138.

Vanderveer, N. J. (1979). "Ecological acoustics: Human perception of environmental sounds," Dissertation Abstracts International 40: 4543B (University Microfilms No. 8004002).

Warren, W. H., and Verbrugge, R. R. (1984). "Auditory perception of breaking and bouncing events: A case study in ecological acoustics," J. Exp. Psychol. Hum. Percept. Perform. 10, 704-712.
