
DESCRIPTION

Study of an algorithm for real-time audio onset detection based on a constant-Q transform. The work was developed within the Orchestra Meccanica Marinetti project, which consists of two robots playing drums, controlled by human gestures via MIDI. The developed algorithm detects the perceived attack of the sound, so that the delay between the generation of a MIDI note and the sound produced by the hit on the drum can be calculated and compensated during a live performance.


POLITECNICO DI TORINO

III Facoltà di Ingegneria dell’Informazione
Corso di Laurea in Ingegneria Elettronica

Master's Thesis

A perceptually grounded approach to sound analysis

An application for Orchestra Meccanica Marinetti

Supervisor: Prof. Marco Masoero

Corrado SCANAVINO

July 2009


Acknowledgements

Àlaleria.


Contents

Acknowledgements

1 Introduction

2 A Perceptually Grounded Approach...
  2.1 Auditory Cognition (Reminding Psychoacoustics)
    2.1.1 Limits of Perception, Perception of Intensity. Loudness
    2.1.2 The Human Ear
    2.1.3 Perception of Time and Periods
    2.1.4 Perception of Frequency, the Sensation of Pitch
    2.1.5 Perception of Timbre

3 Digital Audio Concepts
  3.1 Toward Digital Representation of Sound
  3.2 Digital Filters
    3.2.1 Filters Background
    3.2.2 Introduction to Digital Audio Processing with Filters
    3.2.3 Digital Implementation of Filters
    3.2.4 FIR Filters
    3.2.5 IIR Filters

4 ...To Sound Spectrum Analysis
  4.1 Introduction to Sound Analysis in the Frequency Domain
  4.2 Introduction to the Fourier Analysis
    4.2.1 Fourier Transform (FT), Classic Formulation
    4.2.2 Discrete Fourier Transform (DFT)
  4.3 The Short Time Fourier Transform (STFT)
    4.3.1 The Filterbank View
    4.3.2 Windowing: Length and Shape of the Window Function
    4.3.3 Computation of the DFT (via FFT)
    4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis
  4.4 Constant-Q Analysis
    4.4.1 Implementation of Constant-Q Analysis

5 Real-Time Audio Applications
  5.1 Max/MSP
  5.2 CSound
  5.3 SuperCollider
  5.4 ChucK

6 Perceptual Onset Detection
  6.1 The Curious Case of ·O M M·
  6.2 From Transient to Attack and Onset Definitions
  6.3 General Scheme for Onset Detection
    6.3.1 Energy Based Approach
    6.3.2 Phase Based Approach
  6.4 Introduction to the Perceptual Based Approach to Onset Detection
  6.5 Onset Detection in ·O M M·
    6.5.1 The bonk∼ Method
    6.5.2 Result of the Analysis in ·O M M·
  6.6 From Onset Analysis to Sound Classification
    6.6.1 Learning Results

7 Conclusion

A MSP, Anatomy of the Object

B bonk∼ Source Code
  B.1 The bonk∼ Method

C Writing Externals for Max/MSP with XCode

Bibliography


List of Tables

5.1 Musical software for realtime synthesis and control
6.1 Filterbank design in our method based on bonk∼
6.2 Results in detecting the onsets of the five soundtracks created for analysis purposes, played at different bpm (·O M M·)
6.3 Numerical results in detecting onsets and recognizing the three sounds (A/B/C) produced by the ·O M M·


List of Figures

2.1 Winckel's threshold of hearing [1967].
2.2 Equal-loudness contours for the human ear, determined experimentally by Fletcher and Munson, published in Loudness, its definition, measurement and calculation [1933].
2.3 Peripheral auditory system.
2.4 Part of the inner ear: the cochlea is shaped like a 32 mm long snail and is filled with two different fluids separated by the basilar membrane.
2.5 Cochleagrams, expressed in Bark units as a function of time. On the left, the spoken Italian word "ape"; on the right, a short excerpt of Moondog's "Pigmy pig".
3.1 Simple digital audio system.
3.2 Amplitude (A) response versus frequency, for the four basic types of filters.
3.3 The pass-band or bandwidth of a band-pass filter is the difference between the upper and lower cutoff frequencies. The cutoff frequencies are defined as the frequencies at which the amplitude (more precisely, the energy) is half the pass-band value. In the figure, 40 dB is assumed as the maximum amplitude level in the pass-band.
3.4 Example of application of a constant-Q filter. Here the center frequencies are tuned around generic musical octaves. In music, an octave is the interval between one musical pitch and another with half or double its frequency.
3.5 Alteration of the envelope of a tone (INPUT) passed through a narrow filter (OUTPUT). The output envelope has been stretched in time during the onset and offset components of the tone (initial and final portions).
3.6 Echo and reverberation effects explained by convolution.
3.7 Simple delay line.
3.8 Circular buffer.
4.1 Digital sound synthesis and sound analysis.
4.2 Two plots of static spectrum. The image represents the SPL against frequency of a drum hit played by a robot (on the left) and a note of a violin (on the right). The difference is noticeable: while the robot hit has apparently no harmonically related frequency components, in the violin note these are clear.
4.3 Basic operation of the STFT used for sound analysis.
4.4 Waterfall spectrum, a 3D representation of the STFT spectrum. The graph was obtained with the Spectutils package for GNU Octave. The analysis parameters of the STFT are shown above the figure; the audio sample analyzed is extracted from Laurie Anderson's Violin Solo.
4.5 Types of windows used in the STFT for audio analysis. No ideal window exists; the term "optimal window" is preferred. Several types of windows are used; for musical purposes the Kaiser window usually has a preferential use.
4.6 Spacing of filters for the STFT (filterbank view) on the top and for a constant-Q filterbank on the bottom. The advantage of the constant-Q filterbank method is clear: it places the filters linearly against log(frequency), which is similar to the frequency response of the human ear.
4.7 Waterfall spectrogram of a constant-Q transform of a violin glissando from 578 Hz to 880 Hz (D5 to A5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [A glissando is a glide from one pitch to another. It is an Italianized musical term derived from the French glisser, to glide; it is also where the pianist slides up the piano with his or her hands. From Wikipedia.]
4.8 Waterfall spectrogram of a constant-Q transform of a flute playing a diatonic scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [In music theory, a diatonic scale is a seven-note musical scale comprising five whole steps and two half steps, in which the half steps are maximally separated. From Wikipedia.]
5.1 Max 5 patcher window.
5.2 Max 5 window.
6.1 The two robots on the sides, SCS + the performer in the middle.
6.2 On the top, the waveform corresponding to a hit of a robot percussionist of ·O M M·. On the bottom, the intensity profile of the hit (using Praat), where the onset, the attack and the transient/steady-state separation are highlighted.
6.3 From top to bottom: waveform, static spectrum (FFT) and time-varying spectrum (STFT). From right to left: one hit of an ·O M M· robot, one hit of a snare drum.
6.4 Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k is the unwrapped phase deviation. For the simpler case represented by a steady-state sinusoid, the phase deviation is approximately constant (close to 0) across the analysis frames, while during transients the phase deviation should be extremely large and easy to detect.
6.5 Graphical representation of the bounded-Q filterbank. Only the octaves are geometrically spaced; within each octave the spacing between analysis bins is linear. This allows the application of an FFT-like algorithm to calculate the spectrum of each component.
B.1 Max patcher window showing our test patch realized to analyze the ·O M M· sounds with bonk∼ 3.0.
C.1 XCode main window.
C.2 A Bundle.


Chapter 1

Introduction

Sound analysis is a very wide area of research; its typical applications range from studies of environmental impact, vibration models and bioacoustics to... music. Each of these fields has its own specific characteristics, so the best approach must be identified case by case.

The specific field of application of this thesis is musical: the method has been tested inside the Orchestra Meccanica Marinetti project, or ·O M M·. This thesis is about a specific sound analysis approach, defined as perceptually grounded because it mimics the human perception (auditory system) of sound. Perceptual sound analysis is an ideal candidate for applications in this context. The idea was to extend or improve some of the musical characteristics of the robotic Orchestra.

·O M M· is a project about a robotic orchestra, controlled in real time by a performer. The project was conceived by the programmer and digital artist Angelo Comino AKA Motor, and consists mainly of two robots which play drums, conducted by a performer through a gestural controller, via MIDI. The two robots are more than 2 meters high, and the drums consist of oil cans (such as those used by petrol companies) of standard size. These devices were designed and built with industrial components, with special care taken to emulate the movement of a real drummer, by the people of the Mechatronics Lab (LIM) of Politecnico di Torino, thanks to the collaboration of local robotics companies: Prima Electronics, ERXA and ACTUA. Each robot has two arms, moved by two powerful electric motors controlled by dedicated FPGA-DSP hardware, while the interaction with the performer is adjusted in real time by the Show Control System, developed in Max/MSP (a typical development environment for this kind of application) and running on an Apple laptop. A picture of the ensemble, robots + performer, can be seen in figure 6.1.

·O M M· was presented in October 2008 in Turin, attracting great visibility in the national media (press and television); it is currently completing its engineering development and starting its artistic deployment.

The idea of a robot musician is not new. In the early 80s, at Waseda University (Japan), WABOT-2 was developed: a robot keyboardist able to converse with a person, read a musical score with its eyes (a camera) and play it on an electronic organ. A robot drummer has been developed, from 2005 to the present, at the Georgia Tech College of Computing. Haile, that is its name, is able to listen to live musicians and accompany them, playing a drum. Haile's output is based on live analysis and processing of the sounds produced by the other musicians playing at the same time, not on pre-recorded sequences. Other examples are the recent robotic trumpeter and violinist by TMC (Toyota Motor Corporation).

One of the critical points within a mechanical orchestra is to synchronize the execution of the musical score among real and virtual instruments. The human ear is particularly sensitive to timing, but physical devices, such as the electromechanical arms of the robots, have variable delays when activated. These variations depend on the requested note intensity or on the execution rhythm. Each robot basically receives a message on a serial line (MIDI) stating what kind of hit should be executed. Besides the message generation and data transmission delays, usually negligible from a human point of view, we have the delay introduced by the physical movement of the arms. Thus, we need to measure the time interval between the digital command and the perceived strike on the can.

We propose a "perceptually grounded approach" to recognize the different strikes, in order to compute the delay matrix of a generic score. The robots can play approximately two hits per second with each arm, and the sounds they can play consist of three different variations. That is, each robot arm must be positioned at the correct height and will hit the drum after a delay which is primarily related to the distance of the arm from the drum and the acceleration by which the arm is driven. Hence, the delays are variable; moreover, vibrations not yet absorbed between one hit and the next can cause unwanted changes in the perceived loudness and pitch. That is why a perceptually based approach was needed to characterize the robots' performance behaviour and their response to the digitally applied stimuli.
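As a rough sketch of the bookkeeping involved (this is not the actual ·O M M· implementation; the function name, hit labels and timing values below are purely illustrative), one could pair each MIDI command with the first detected onset that follows it and average the delays per hit type:

from collections import defaultdict

def delay_table(midi_events, onset_times):
    # midi_events: list of (command_time_s, hit_type); onset_times: detected onset
    # times in seconds. Each command is paired with the first onset following it,
    # and the delays are averaged per hit type.
    delays = defaultdict(list)
    onsets = iter(sorted(onset_times))
    onset = next(onsets, None)
    for t_cmd, hit in sorted(midi_events):
        while onset is not None and onset < t_cmd:
            onset = next(onsets, None)
        if onset is None:
            break
        delays[hit].append(onset - t_cmd)
    return {hit: sum(d) / len(d) for hit, d in delays.items()}

# Illustrative values only: commands for hit types A and B, and the onsets heard later.
cmds = [(0.00, "A"), (0.50, "B"), (1.00, "A")]
onsets = [0.11, 0.64, 1.12]
print(delay_table(cmds, onsets))   # e.g. {'A': ~0.115, 'B': ~0.14}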

In its first part, the thesis contains an overview of the phenomena occurring in our auditory system, in particular when a new musical event occurs (i.e. a robot hit in our case), in which our ear encodes both the time and frequency phenomena related to the human perception of sound. In the specific case of percussion, the sound produced can for convenience be subdivided into two parts, the transient (the origin of the sound) and the steady-state portion (which can be intended as the extended support of the transient). Two important points should be considered. First, during the transient, the precise instant at which the sound originates (the onset of the sound) corresponds to a rapid increase in sound energy, which can reach its peak in less than 5 ms. Detecting onsets, our first aim, is not a trivial task. Second, the time at which the onset occurs is the meaningful component of a percussive sound; performing a sort of spectrum analysis at this point, the information obtained is usually considered sufficient to predict the entire sound, i.e. the steady-state portion of the sound should be derivable from it.


Efforts have then been extended to the subfield of signal processing dealing with specific treatments of the musical audio signal (DSP). Therefore we present the digital representation of audio and the realization of digital filters, a basic (but still very advanced) component of every digital signal processing task. For this purpose the books by Roads, Dodge, Rocchesso and Beauchamp were a good starting point. Later we introduce the methods used for audio analysis, in particular those systems which perform harmonic analysis, usually grouped under the name of Fourier analysis. The Fourier transform, applied to a musical signal, can be viewed as a decomposition of the sound into a finite number of harmonics, each of which is represented by a complex value. This value is sufficient to extract all the information needed to derive the frequency, intensity and phase of each harmonic; but it is not the only solution for musical analysis. The method we suggest considering, under certain conditions (in particular for music or speech), is based on the Constant-Q Transform, which approximates well some features of the human auditory system.

Composers of the 20th century have contributed to the evolution of electronic music in ways that even they would not have expected. From Luigi Russolo and the intonarumori in 1918, mechanical instruments producing non-harmonic sounds (the Art of Noises), musical schemes have been continuously redefined. Russolo influenced even Stravinsky (Paris, 1921), and after Stravinsky and the expressivity and richness of his music, musicians also became technicians and explored new electro-acoustic machines for producing sound. Several years passed between the Russian composer and the first Moog, but the works of other composers like Bartók, Varèse, Messiaen, Schaeffer, Ligeti and Cage kept the state of innovation moving forward.

In parallel, the studies of certain mathematicians and physicists of the 19th century (primarily Helmholtz and Fourier) led technicians to the discoveries that made possible the realization of the first electronic instruments, which became famous with movies like "Forbidden Planet" (Louis and Bebe Barron) or "2001: A Space Odyssey" (Ligeti, and the HAL voice inspired by the computer synthesis experiments of Max Mathews), or with the experimental works of Norman McLaren.

Other experiments between art and technology have been proposed, such as the electroacoustic compositions "A man sitting in a cafeteria" by Charles Dodge and "I am sitting in a room" by Alvin Lucier, very attractive for their expressivity and their educational approach. The first is one of the earliest experiments in reproducing speech with a computer, and the second is a brilliant example of the application of the different impulse responses of a room. This music comes from the 60s and 70s.

Thank you!


Chapter 2

A Perceptually Grounded Approach...

There are no theoretical limitations to the performance of the computer as a source of musical sounds, in contrast to the performance of ordinary instruments. At present, the range of computer music is limited principally by cost and by our knowledge of psychoacoustics.

M. V. Mathews1

2.1 Auditory Cognition (Reminding Psychoacoustics)

Some of the subjects treated in this section require notions of acoustics. Intensity, frequency, duration and spectrum are physical attributes used in the literature to describe the acoustical properties of a sound. These attributes do not form music by themselves, but they can vary the perception of each sound component of a musical flow. The perceptual attributes (pitch, loudness and timbre) describe how the physical attributes related to sound are perceived and interpreted as mental constructs by the brain, through our hearing system.

The composer needs to know how to construct and balance the physical attributes of sound in a way that corresponds, more or less, to the composer's musical concept [44].

Since sound is carried by vibrations2, propagating through a medium such as air, the detection of these vibrations constitutes our sense of hearing. The physical information conveyed by sounds has been successfully applied, in the study of natural auditory systems (the human ear), to derive the relationship between physical stimuli and the induced mental constructs.

1 Appeared in the article The Digital Computer as a Musical Instrument, in the journal Science, dated 1 Nov. 1963. Now computers cost a little bit less. And some psychoacoustics will follow.

2 Other representations of sound are possible; a microscale was proposed, i.e. sound can be decomposed into smaller time units called microsounds or sound particles. See [45] and Gabor, Schaeffer, Schoenberg et al. for more on different sound decompositions.

The subfield of psychophysics (the study of psychological responses to physical stimuli) describing these phenomena is psychoacoustics.

2.1.1 Limits of Perception, Perception of Intensity. Loudness

Intensity is proportional to the energy carried by a sound wave, i.e. to the variance of the air pressure. Sound intensity is measured in terms of sound pressure level (SPL) on a logarithmic scale, so the result can be expressed in dB:

SPL [dB] = 20 · log10(p / p0)

where p0 corresponds to the estimated threshold of hearing at 1 kHz. The threshold of hearing is generally reported as an RMS3 sound pressure of 20 µPa, which is approximately the quietest sound a young human with undamaged hearing can detect at 1 kHz. Sound pressure, and therefore SPL, decreases with the distance from the sound source.
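As a quick numerical check of the formula above (a sketch, not part of the thesis code), taking p0 = 20 µPa:

import math

P0 = 20e-6  # reference RMS pressure: 20 micropascal, the threshold of hearing at 1 kHz

def spl_db(p_rms):
    # Sound pressure level in dB for an RMS pressure p_rms expressed in pascal.
    return 20.0 * math.log10(p_rms / P0)

print(spl_db(20e-6))   # 0.0 dB: the threshold of hearing itself
print(spl_db(1.0))     # ~94 dB: the SPL of a 1 Pa RMS tone, a common calibration level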

Loudness is the perceptual attribute related to changes in intensity; that is, increases in sound intensity are perceived as increases in loudness. Unfortunately the relationship is not trivial: loudness also depends on other factors such as spectrum, duration and the presence of background sounds.

Winckel4 in 1967 proposed the range of hearing for a young adult human ear, shown in figure 2.1; this range can vary with age and individual sensitivity. Winckel's range of hearing is valid for sustained sine tones. For shorter tones the threshold can rise: approaching the borders of the threshold, the ear seems to integrate energy for short tones, at least for those shorter than 200 ms. Other studies have shown that the human body can sense very low frequencies, although the ears do not, and that the upper limit of sensitivity may be well beyond 20 kHz.

Another useful tool is the set of Fletcher-Munson curves. They proposed a graph similar to that of Winckel, introducing the concept of the constant-loudness contour, easy to identify in figure 2.2. The meaning of this graph is that all points along a curve have roughly the same loudness.

These constant-loudness curves are labelled in phons. A phon is intended as the "number of dB at 1 kHz"; in other words, a sine tone at 1 kHz with an intensity of 50 dB has a loudness level of 50 phons. Therefore, if we want to produce a sine tone at 300 Hz with the same loudness as the 1 kHz tone, it is necessary to follow the 50-phon curve to 300 Hz and use the corresponding value of SPL; the two tones will then sound equally loud to the listener.

3 Root mean square, a statistical measure of the magnitude of a varying quantity.

4 Fritz Winckel, Austrian acoustician, is considered one of the pioneers of electronic music. He published in 1967 the book Music, Sound and Sensation: A Modern Exposition.

Figure 2.1: Winckel's threshold of hearing [1967].

Obviously, the perfect sine wave is an artifact: no sound exists in nature as the expression of a single frequency. However, it has been demonstrated that it is possible to decompose5 a sound into a sum of perfect sine waves. We can therefore assume that each of them, weighted according to the Fletcher-Munson curves and then summed, will contribute to the total loudness. But this too is a theoretical situation, since linearity cannot actually be assumed, at least not over the whole spectrum, because of the presence of critical bands6.

Before introducing the time and frequency perception of human hearing, the most advanced features of our auditory system, it may be better to understand how the ear works.


Figure 2.2: Equal-loudness contours for the human ear, determined experimentally by Fletcher and Munson, published in Loudness, its definition, measurement and calculation [1933].

2.1.2 The Human Ear

The peripheral auditory system is the medium by which sound waves are detected, encoded, and retransmitted through nerve cells to the brain, where humans finally render sound. Although very sophisticated, the process can be intuitively subdivided into three steps, each accomplished in a different place in the ear.

• The outer ear: amplifies and conveys incoming sound waves, i.e. air vibrations.

Here the sound waves enter the auditory canal, which can amplify sounds containing frequencies in the range between 3 kHz and 12 kHz. At the far end of the auditory canal is the eardrum (or tympanic membrane), which marks the beginning of the middle ear.

5 See chapter 4, The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis, for an explanation of this fact.

6 See section 2.1.4 for an explanation of critical bands.

Figure 2.3: Peripheral auditory system.

• The middle ear: transduces air vibrations into mechanical vibrations.

Sound waves coming from the auditory canal are now hitting the tympanic membrane. Here, three delicate bones, the malleus (hammer), incus (anvil) and stapes (stirrup), convert the low-pressure sound vibrations of the eardrum into higher-pressure sound vibrations on another, smaller membrane, called the oval or elliptical window. The stapedius muscle has the role of preventing damage to the inner ear. The middle ear still contains the sound information in wave form; it is converted to nerve impulses in the cochlea. Higher pressure is necessary because the inner ear beyond the oval window contains liquid rather than air.


• The inner ear: processes mechanical vibrations and transduces them mechanically, hydrodynamically and electrochemically. The results are then transmitted through nerves to the brain.

The inner ear consists of the cochlea and several non-auditory structures. The cochlea has three fluid-filled sections, and supports a fluid wave driven by pressure across the basilar membrane separating two of the sections. Strikingly, one section, called the cochlear duct or scala media, contains an extracellular fluid similar in composition to endolymph, which is usually found inside cells. The organ of Corti is located in this duct, and transforms mechanical waves into electric signals in neurons. The other two sections are known as the scala tympani and the scala vestibuli; these are located within the bony labyrinth, which is filled with a fluid called perilymph. The chemical difference between the two fluids (endolymph and perilymph) is important for the function of the inner ear.

Additional processes occur at the brain level; for example, other neurally encoded information is used in order to combine the signals coming from both ears and fuse them into one sensation. However, although complex, these mechanisms do not by themselves yield the information necessary for the brain to understand, for example, a single note, a harmony, a rhythm, or higher-level musical structures. It appears that the low-level time and frequency perceptual mechanisms both operate on the musical signal in parallel. Thus the nature of a sound is not determined only by the physical properties of the sound and of the human ear: all this information is combined at high level (i.e. in the brain), where the sound takes its musical form.

2.1.3 Perception of Time and Periods

Higher-level perceptual processes can be obtained only because other mechanisms, in the inner ear, encode both time and frequency. In this section we look at temporal features; two of them seem to be the most prominent: period detection and temporal integration.

Period detector

The period detection mechanism inside the auditory system operates on the fine structure of the neurally translated incoming waveform. The neural pattern is obtained by nerve cells (in the organ of Corti) firing individually or in groups, at a rate which corresponds to the wave's period. Individually, each cell can operate in this manner only up to a certain period: if the period is too small, the cell cannot recover quickly enough. However, groups of cells can rotate or stagger their firing, so that they, in effect, follow submultiples of the sound's period.


A special feature is that the ear can encode variations in the envelope of the wave: studies have demonstrated the existence of a mechanism in the central auditory system to detect amplitude modulation (AM), although only in a small range of frequencies (75 to 500 Hz) and only for a significant depth of modulation.

Event detector

Another time-related mechanism, deep inside the human ear, is the perception of events. A musical event occurs every time there is a variation of the vibration pattern, that is, something happens nearby and we hear a new sound. The sound onset7 is the perception that a new sound is born. At onset time other nerve cells fire, and different cells operate on different onset slopes. A model for onset detection, developed by Gordon in 1984 [26], showed that the moment of perceptual onset of a musical event can be significantly delayed with respect to the physical onset. Another problem is that it is not possible to establish unequivocally the threshold over which an event becomes audible to the ear, that is, the definition of the threshold over which the ear recognizes the onset. What does the human ear consider an audible event? Bilmes proposed these questions: does it refer to the time when the physical energy in a signal increases infinitesimally? the time of peak firing rate of the cochlear nerve? the time when we first notice a sound? the time when we first perceive a musical event? or something else? [8] Whatever it means, it has been demonstrated that the perceptual onset time is not necessarily coincident with the initial increase in physical energy.

Again, since other cells respond to the temporal intervals between events, the human auditory system is able to connect single events into a rhythmic stream.

Temporal integration

Temporal integration is another important feature in the perception of time. The human ear seems to integrate two or more events if they are too close together. This is the principal limit to the resolution of perceived rhythm. The minimal time between events that the human ear can sense separately is variable, depending on the duration of each event. For example, this period can be a few milliseconds if the events are very short, but it can also be much greater than 50 ms.

What happens when a succession of sound events cannot be perceived separately in time by the human ear? They smear together to form one sensation; in other words, temporal resolution is lost. Therefore, the human ear has no fixed "time resolution". Many phenomena are related to temporal integration; one above all is the (sometimes desired) effect of reverberation.

7 Onset is the point at which a musical event becomes audible. For percussive sounds it is considered to be the same as the attack time, the instant at which the stick strikes the drum.


The case of reverberation

Reverberation is different from echo, and also from sequences of echoes. If a sound is reflected by a surface we hear the sound and its echo; if the surface is irregular, or other surfaces are present in the room, several echoes can be heard. The number of echoes per second is normally referred to as the echo density. When the echo density is greater than 30, individual echoes are separated by less than 35 ms and the ear cannot perceive them separately. The fusion of the echoes into a unique sensation is what we call reverberation.

Not only do short time intervals between events affect the probability of smearing, but so does frequency: if two successive notes in a musical stream have similar frequencies, they will probably smear together.

2.1.4 Perception of Frequency, the Sensation of Pitch

Figure 2.4: Part of the inner ear: the cochlea is shaped like a 32 mm long snail and is filled with two different fluids separated by the basilar membrane.

Frequency is a physical parameter associated with each wave that carries the sound energy to the ear. Pitch is the perceived parameter related to frequency; it can be thought of as the quality of a sound, governed by the rate of the vibrations producing the sound [44].

In the inner ear, the oscillations of the oval window assume the form of traveling waves which move along the basilar membrane, i.e. along the entire length of the cochlea.


The mechanism for detecting frequencies is located in the basilar membrane. A simple correspondence occurs: when a single sine tone excites the ear, a region of the basilar membrane oscillates around its equilibrium position. Since real sounds are never made of a single frequency, this region will show a place where the excitation has a maximum, corresponding to the fundamental frequency. The distance of this maximum from the end of the basilar membrane is directly related to frequency, so that each frequency is mapped to a precise place along the membrane. The mechanical properties of the cochlea (wide and stiff at the base, narrower and much less stiff at the apex) produce a roughly logarithmic decrease in bandwidth as we move linearly away from the cochlear opening (the oval window), as shown in figure 2.4. Thus, the auditory system acts as a spectrum analyzer, detecting the frequencies in the incoming sound at every moment in time. In the inner ear, the cochlea can be understood as a set of band-pass filters, each filter letting only frequencies in a very narrow range pass. This mechanism can be associated with a filterbank of constant-Q filters8, because of their property of being linearly spaced on a logarithmic scale9. However, the sensation of pitch is not only related to the perceived fundamental frequency: other contributions, related to the temporal mechanisms encoded in the ear, such as period detection, can alter the sensation of pitch.

The sounds that the ear can sense cover a wide frequency range, approximately from 20 Hz to 20 kHz. The perceived pitch, also expressed in Hz, has a more limited range, approximately from 60 Hz to 5 kHz.

Critical Bands

Since each frequency stimulates a region of the basilar membrane, a limit to the frequency resolution of the ear is imposed. This limit is reflected in another characteristic of perception, known as the critical band.

A simple example helps to understand how the ear works within the critical band. Think of, or better listen to, two sine waves very close in frequency: they have a total loudness which is less than the sum of the two loudnesses we would hear if they were well separated in frequency. Now, if we slowly separate them in frequency, we perceive the same loudness up to a point; then, beyond a certain frequency difference, the total loudness increases approximately to the value of the sum of the individual loudnesses. The frequency difference needed to perceive the loudness as the sum of the individual loudnesses is the critical band.

The ear's behavior in this region can be thought of as a kind of frequency integration, because it is similar to the temporal integration we have seen earlier. Inside the critical band reside other important factors of perception: roughness and beating. Roughness is a sensation of dissonance; its presence is particularly strong near the lower and upper bounds of the critical band, where the two tones are almost separated but not yet perceived as two sounds. In the middle of the critical band the two tones are heard as one, with a frequency that lies between the two frequencies, and there we can clearly perceive the sensation of beating. When the two tones are separated by 1 Hz we perceive a single beat per second. The width of the critical bands (their bandwidths) increases with frequency.

8 A constant-Q filterbank is a set of band-pass filters which fit their bandwidths according to their center frequencies, so as to maintain a fixed ratio (Q).

9 See chapter 4, the section on Constant-Q analysis, which bases its benefit on the similarity with the human pitch-detection mechanism occurring in the basilar membrane.

Figure 2.5: Cochleagrams, expressed in Bark units as a function of time. On the left, the spoken Italian word "ape"; on the right, a short excerpt of Moondog's "Pigmy pig".
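A short NumPy sketch of the beating effect for two tones 1 Hz apart (the sampling rate and frequencies are illustrative):

import numpy as np

sr = 8000                       # sampling rate in Hz
t = np.arange(0, 2.0, 1.0 / sr)
f1, f2 = 440.0, 441.0           # two tones 1 Hz apart, well inside one critical band
x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# Trigonometric identity: x(t) = 2*cos(pi*(f1-f2)*t) * sin(pi*(f1+f2)*t),
# i.e. a single ~440.5 Hz tone whose amplitude envelope oscillates at |f1 - f2| = 1 Hz:
# exactly the one beat per second described above.
envelope = 2 * np.abs(np.cos(np.pi * (f1 - f2) * t))
assert np.all(np.abs(x) <= envelope + 1e-9)   # the slow envelope bounds the summed signal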

The Bark scale was proposed to represent the behavior of the human ear inside the critical bands. An example of such a representation is proposed in figure 2.5, where the spectrogram produced is plotted against frequency on a Bark scale; in this case the term cochleagram is appropriate. The Bark scale (of human hearing) ranges from 1 to 24 Barks, corresponding to the first 24 critical bands. The proposed Bark center frequencies, in Hz, are:

50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500

while their corresponding bandwidths, in Hz, are:

100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240, 280, 320, 380, 450, 550, 700, 900, 1100, 1300, 1800, 2500, 3500, 5000

These center frequencies and bandwidths should be interpreted as being associated with a specific fixed filterbank in the ear. Note that since the Bark scale is defined only up to 15.5 kHz, the highest sampling rate for which the Bark scale covers the whole range up to the Nyquist limit is 31 kHz.
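As a minimal sketch, the listed bandwidths can be accumulated into band edges and used to map a frequency in Hz to its Bark band (the helper below is illustrative, not a standard library function):

import bisect

# Critical bandwidths in Hz for the 24 Bark bands, as listed above.
BARK_BANDWIDTHS = [100, 100, 100, 100, 110, 120, 140, 150, 160, 190, 210, 240,
                   280, 320, 380, 450, 550, 700, 900, 1100, 1300, 1800, 2500, 3500]

# Upper band edges obtained by accumulating the bandwidths: 100, 200, ..., 15500 Hz.
edges = []
upper = 0
for bw in BARK_BANDWIDTHS:
    upper += bw
    edges.append(upper)

def hz_to_bark_band(f_hz):
    # Return the 1-based Bark band (1..24) containing f_hz, or None above 15.5 kHz.
    k = bisect.bisect_left(edges, f_hz)
    return k + 1 if k < len(edges) else None

print(hz_to_bark_band(440))    # 5  (the band centered at 450 Hz)
print(hz_to_bark_band(4000))   # 18 (the band centered at 4000 Hz)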

When many frequencies are present (fundamental tones and harmonics) the auditory system works on all of them simultaneously, with the limit of resolution introduced by the critical bands. This effect on the overall spectrum is another contribution to the perceived pitch. Inevitably, the pitch is also influenced by inharmonic spectra, which are a characteristic of noise.

Perception of noise

White noise does not evoke a pitch because it is completely random and has a flat spectrum that does not produce any pitch sensation, if not annoyance. Since colored noises are created by modulating white noise, some of them can yield a vague pitch sensation, depending on the modulation applied. For example, for an AM modulation of white noise, there may be a pitch corresponding to the modulation frequency. Other sensations of pitch can be achieved by filtering white noise or applying digital effects to it.

We have seen the most important factors characterizing the sensation of pitch; now we can introduce the last perceived attribute of a sound, the timbre.

2.1.5 Perception of Timbre

The generic definition of timbre is this: the attribute by which we can distinguish two sounds with the same loudness and pitch. Thus timbre is the character or quality of a musical sound, distinct from but influenced by its pitch and loudness. Sometimes, in a more evocative way, it is also referred to as the color of sound.

The characteristics determining timbre reside in the constantly changing spectrum of a musical sound, produced for example by an instrument. The steady-state spectrum is not enough to distinguish a sound produced by one instrument from another; the attack and decay portions of the spectrum are also very important. Therefore, timbre must have more than one dimension, because it involves the temporal envelope and the evolution of the spectral distribution over time [44].


Chapter 3

Digital Audio Concepts

Prior to the study of specific aspects of sound analysis (chapters 4 and 6), it is better to clarify the basic concepts behind audio representation on digital computers. In the following paragraphs, the main attributes of a digitized sound are dealt with, from basic terms (sampling and quantization) to advanced applications (digital filters).

This chapter is therefore divided into two sections: the first contains a brief illustration of the theories behind the digital representation of music, while the second gives a deeper explanation of how filters are implemented on digital computers.

3.1 Toward Digital Representation of Sound

Analog audio signal

Sounds come in the form of vibrations, carried to our ears by a physical medium such as air. Referring to an electrical system, thus replacing the ears with a microphone, sounds are transduced into a time-varying voltage in accordance with the vibration patterns present in the air. This is what is called the analog audio signal, a continuous signal which consists of a continuum of values. In physics, an analog sound signal is usually considered a one-dimensional signal representing the air pressure on the microphone membrane.

The Sampling Theorem (Nyquist/Shannon)

In order to perform any sort of sound processing on a digital computer, the analog signal must be reduced to digital data, each datum representing a discrete value of the signal's instantaneous voltage. The operation that transforms the analog signal into a digital signal is, ladies and gentlemen, sampling. The sampling theorem states that, in order to accurately represent sound digitally, the sampling rate, defined as the frequency in Hz at which the sampling operation is performed, has to be at least twice the frequency band of the analog signal. The frequency band is determined by the maximum frequency contained in the signal. Since the (average) upper frequency limit of human hearing is considered to be 20 kHz, a sampling rate higher than 40 kHz must be chosen. This is enough to allow reconstruction of the original signal, starting from the samples, in a way that the human ear cannot distinguish from the original.

The device performing this operation is called an analog-to-digital converter (ADC). At each period (i.e. the inverse of the sampling rate), the ADC produces a string of binary digits, called a sample, which is stored in memory in the exact order it is received. The inverse operation, from digital to analog, is realized by the digital-to-analog converter (DAC).

The sampling rates normally used in computers to represent digital audio signals are 44.1 kHz and 48 kHz. The frequency equal to half the sampling rate is called the Nyquist frequency. The higher the sampling rate, the greater the Nyquist frequency and consequently the range of frequencies that can be represented (but also the demands on the speed and power consumption of the hardware).

Aliasing

Like any other analog-to-digital conversion, audio conversion may be affected by the problem of aliasing. Aliasing occurs because frequencies higher than half the sampling rate (the Nyquist frequency) may be present at the input of the ADC. This results in a distortion of the original signal and can be heard, in acoustical terms, as an unwanted change in pitch1, because frequencies above Nyquist are folded back to lower frequencies. The problem can be easily overcome by placing an anti-aliasing filter2 before the ADC, which ensures that only signals below Nyquist enter the converter. This system is also replicated at the end of the audio chain, between the DAC and the speaker, for the same reason. A generic audio system of this kind is shown in figure 3.1.
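A small Python sketch of the folding effect (an illustration only, not a model of any specific converter):

def alias_frequency(f_in, sample_rate):
    # Frequency at which a sinusoid of f_in Hz is heard after sampling at
    # sample_rate Hz with no anti-aliasing filter (folding about the Nyquist frequency).
    nyquist = sample_rate / 2.0
    f = f_in % sample_rate           # sampling cannot distinguish f from f +/- k*fs
    return f if f <= nyquist else sample_rate - f

print(alias_frequency(30000, 44100))   # 14100.0 Hz: above Nyquist, folded down in pitch
print(alias_frequency(10000, 44100))   # 10000.0 Hz: below Nyquist, left unchanged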

Dynamic Range and Signal-to-Noise Ratio

The dynamic range is the difference between the softest and the loudest sound that can exist in the system. It is expressed in decibels, because of their useful logarithmic compression of large numbers (e.g. doubling a power ratio corresponds to a 3 dB increment). Since the decibel is a unit of measurement for ratios, in acoustics dB is used to represent the ratio between the actual intensity level and a reference level; the reference level is normally taken as the threshold of hearing, 10⁻¹² W/m². The maximum value of dynamic range for human hearing is called the threshold of pain3 and is estimated above 120 dB (the Hiroshima explosion was 180 dB). If the sound is particularly short the threshold of pain can increase, but it is better not to try...

1 Pitch is a sound attribute, perceptually related to frequency. Pitch and the other musical perceptual attributes are presented in chapter 2.

2 The simplest realization of an anti-aliasing filter is a low-pass filter with cutoff frequency equal to the Nyquist frequency. See the next section for an explanation of filters.

Figure 3.1: Simple digital audio system.

While recording music, it is important to capture the widest possible dynamic range, in order to reproduce music in its fully expressive way. For example, recording an orchestra will require a wider dynamic range than a solo instrument [44]. The number of bits (Nbit) used to represent each sample4 has a direct influence on the maximum dynamic range of a digital audio system. The following simple formula can be used for this purpose:

(DR)max = Nbit · 6.11 [dB]

Therefore, a 24-bit system may reach 147 dB, much more than the threshold of pain.

Consideration should be given to noise when speaking of dynamic range, because noisy sound components (not only the noise introduced by the electronic devices, but real noisy sounds) are always present in the proximity of the audio system, and they can raise, for example, the lower end of the dynamic range. The signal-to-noise ratio (SNR) compares the level of a given signal to the level of noise in the system. Noise can have a wide variety of meanings and also depends on the environment and the sensibility of the listener. SNR is also expressed in dB, so a high dB value means a clear sound. The SNR of a good audio system is often higher than 90 dB. Dynamic range and SNR are good indicators of the quality of an audio system, but not the only ones.
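A quick numerical sketch of these two figures of merit, using the rule-of-thumb dynamic-range formula given above (the RMS values passed to the SNR function are illustrative):

import math

def dynamic_range_db(n_bits):
    # Maximum dynamic range of an n-bit system, using the rule of thumb above.
    return n_bits * 6.11

def snr_db(signal_rms, noise_rms):
    # Signal-to-noise ratio in dB, computed from RMS amplitudes.
    return 20.0 * math.log10(signal_rms / noise_rms)

print(dynamic_range_db(16))   # ~97.8 dB
print(dynamic_range_db(24))   # ~146.6 dB, the ~147 dB quoted above
print(snr_db(1.0, 3e-5))      # ~90.5 dB, in the range of a good audio system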

Quantization Error

Now we present the last (and one of the most important) factors determining digital audio quality: quantization. How many bits are needed to represent the sampled amplitude of the signal? Normally the answer is given by the maximum resolution of the ADC used to perform the sampling. Obviously, the higher the resolution of the converter, the better the quality of the digitized sound. Since the number of bits is a finite integer n, only 2^n values can be used to represent the original value; these are called quantization levels. When the system has to convert a value which does not fall exactly on a level, a round-off is necessary. The quantization error is the difference between the real value and the binary string used to represent it; it affects almost every sample and introduces the quantization noise. In 16- and 24-bit ADCs the quantization noise is negligible.

3 Levels higher than this threshold can seriously damage the human hearing system.

4 That is, the quantization, explained in the next section.
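The round-off mechanism described above can be sketched in a few lines of Python (an illustration, not the behavior of any particular converter; the sample value is arbitrary):

def quantize(x, n_bits):
    # Round a sample x in [-1.0, 1.0) to the nearest of the 2**n_bits uniform levels;
    # returns the quantized value and the quantization error.
    levels = 2 ** n_bits
    step = 2.0 / levels            # the full-scale range from -1.0 to +1.0 is 2.0 wide
    q = round(x / step) * step
    return q, x - q

for bits in (8, 16, 24):
    q, err = quantize(0.300001, bits)
    print(bits, q, err)   # the maximum round-off error halves for each extra bit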

Digital Audio Signal, File Format and Perceptual Codec

Digital audio samples are finally grouped into files, to be stored on the hard drive of a digital computer. In this case we can distinguish between two different file formats: those obtained after a compression algorithm is applied and those that are not. In the non-compressed case, all the samples are stored without applying changes and a preamble is added to the beginning of the file, which includes the information required by a player in order to correctly reproduce the musical content (sampling frequency, number of channels, modulation etc.). Compressed files were obviously introduced to save the amount of space required to store the files. The goodness of the algorithm determines the quality of the compressed file (against the original, non-compressed one). In MP3, for example, the algorithm takes into account phenomena occurring in the perception of frequency (frequency masking) by the human ear. Therefore, what it does is a decimation (and adaptive bit re-allocation) of the samples used to represent frequencies that the human ear cannot sense or distinguish with the resolution imposed by the sampling. For this reason, compressors such as MP3 are sometimes called perceptual codecs; under certain conditions, e.g. bitrates higher than 128 kbps, perceptual codecs ensure a remarkable level of transparency, so their quality should not be distinguishable from the original.


3.2 Digital Filters

The most general definition of a digital filter is: a computational algorithm that converts one sequence of numbers into another [18]. Thus, any digital device with an input and an output is a filter [44]. The advanced design of filters is beyond the aim of this thesis, but the basics have to be understood with respect to chapters 5 and 6, where digital filters are implemented for specific purposes in sound analysis. In the following pages, instead of "implement", a term specifically used in computer language, we may prefer the use of "design" to convey more or less the same information, that is, the way the filters are realized.

Digital filters became part of integrated circuits starting in the 1950s. From the 60s, the so-called Z-transform5 was introduced to standardize the mathematical representation of a filter's behavior. In sound synthesis programming languages, discussed in the next chapter, digital filters appeared in the early 60s, with MUSIC IV6. Only later, in the 80s, when the cost of hardware fell, did real-time digital filters play the most important role in widespread low-cost applications such as synthesizers, effects units and digital mixers.

Historically, the most common use of filters, at least in computer music, was that of boosting, attenuating or separating regions of the sound spectrum. All these operations imply processing in the frequency domain. However, since filters also carry out other important sound processing techniques, such as reverberation and delay, the effect of filtering should not be understood as frequency-domain-only. As we will see very soon, the time structure of the signal can also be altered by means of filtering operations.

3.2.1 Filters Background

The frequency response

All filters may be characterized by their frequency response. The well-known frequency responses are: low-pass, high-pass, band-pass and band-reject.

The frequency response consists of two parts: the amplitude response, shown in figure 3.2 for the four basic types of filters, and the phase response. The amplitude response is the ratio of the amplitude of the output signal to that of the input signal, varying along the frequency range. The phase response (also varying with frequency) is the amount of phase alteration in the signal passing through the filter. Sometimes it is expressed in terms of phase delay, that is, the amount of phase change with respect to the original phase, expressed in ms.
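As a small illustration (not taken from the thesis), the following NumPy sketch evaluates the amplitude and phase response of a deliberately simple two-point moving-average FIR filter; the filter choice and sampling rate are arbitrary:

import numpy as np

sr = 44100                               # sampling rate in Hz
f = np.linspace(0, sr / 2, 1024, endpoint=False)

# Two-point moving average y[n] = 0.5*x[n] + 0.5*x[n-1]: a very crude low-pass FIR.
# Its frequency response is H(f) = 0.5 + 0.5*exp(-j*2*pi*f/sr).
H = 0.5 + 0.5 * np.exp(-2j * np.pi * f / sr)

amplitude_db = 20 * np.log10(np.abs(H))            # amplitude response in dB
phase = np.angle(H)                                # phase response in radians
phase_delay_ms = np.zeros_like(f)
phase_delay_ms[1:] = -phase[1:] / (2 * np.pi * f[1:]) * 1000.0   # phase delay in ms

print(amplitude_db[0], amplitude_db[-1])   # 0 dB at DC, falling towards the Nyquist frequency
print(phase_delay_ms[1])                   # ~0.011 ms, i.e. half a sample period at 44.1 kHz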

5 The Z-transform converts a discrete time-domain signal, a sequence of real or complex numbers, into a complex frequency-domain representation. See later in the text for more details.

6 The 4th release of the MUSIC saga, the last developed by Max Mathews at Bell Labs.



Figure 3.2: Amplitude (A) response versus frequency, for the four basic types of filters.

Pass-band and stop-band, cutoff and center frequency

The pass-band or bandwidth of a filter is defined as the frequency region in which the filter has no effect (or at most a little attenuation) on the signal, while the stop-band is the frequency region where a great attenuation is applied to the signal. In all kinds of filters there is always a smooth transition between the pass-band and the stop-band (and vice versa), which is normally called the transition band. The most important characteristic associated with the transition band is the cutoff frequency fc. Conventionally, fc is chosen as the frequency at which the transmitted power is reduced to one half (i.e. -3 dB) of the maximum power in the pass-band.

In low-pass and high-pass realizations, the cutoff frequency determines the bandwidth of the filter, i.e. the extension over frequency of the signal passed through the filter. In band-pass and band-reject filters, the bandwidth is limited by two cutoff frequencies, fu and fl, which stand for the upper and lower limits of the bandwidth. Consequently, the center frequency f0 of a band-pass (and band-reject) filter is defined as (fu + fl)/2. These characteristics are shown in figure 3.3 for the case of a band-pass filter. The rate at which the attenuation slope increases in the stop-band is called the rolloff. In filters for musical applications, the rolloff is frequently measured in dB/octave7 of decrement during the transition band. The slope of the transition band is determined by the order8 of the filter. In the analog counterpart, the order is determined by summing all the electronic components used to realize the filter, whereas for digital filters it is much more elaborate, as explained later in the chapter.

Figure 3.3: The pass-band or bandwidth of a band-pass filter is the difference between the upper and lower cutoff frequencies. The cutoff frequencies are defined as the frequencies at which the amplitude (more precisely, the energy) is half the pass-band value. In the figure, 40 dB is assumed as the maximum amplitude level in the pass-band.

Selectivity and quality factor (Q)

The bandwidth of a band-pass filter is also called the selectivity of the filter, and is useful in quantifying the quality factor, Q.

7 An octave is the interval between two points where the frequency at the second point is twice the frequency of the first.

8 The order is the mathematical measure of complexity.


Figure 3.4: Example of application of a constant-Q filter. Here the center frequencies are tuned around generic musical octaves. In music, an octave is the interval between one musical pitch and another with half or double its frequency.

In filters for musical purposes, the Q can be interpreted as the degree of resonance of a band-pass filter [18] and is given by:

Q = fc / BW

where BW stands for bandwidth. Hence, a higher Q implies a narrower bandwidth. As an example, a high Q is needed when the target is to extract a single frequency component from a signal. A solution to this task will be presented further on in this thesis.

A special type of band-pass filter is the constant-Q filter, which is widely used in sound analysis. This filter lets the bandwidth vary as a function of the center frequency, by keeping its Q fixed. For example, with a fixed Q = 5 and fc = 300 Hz the bandwidth is 300/5 = 60 Hz. Shifting the center frequency up to 6 kHz, the bandwidth becomes much wider, 6000/5 = 1200 Hz. This type of filter has two fascinating characteristics for musical applications. Since the energy of a sound is normally concentrated at low frequencies and spreads towards the high frequencies, a band-pass constant-Q filter extracts a narrower bandwidth while its center frequency is set in the low frequency region, and a wider one as the center frequency moves toward high frequencies. The second is that the bandwidths of this kind of filter, plotted against log frequency, appear to be constant. This behavior, presented in figure 4.6, is quite similar to the perception of frequency of the human auditory system, as we aimed to explain in the first part of chapter 2.

Filter combination, parallel and cascade

The four basic types of filters can be combined to form more complex filter designs. Two fundamental methods of combination are possible: parallel and cascade.


Parallel connection allows the filters to operate on the same signal at the same time. The output signal is given by the sum of all the filters' outputs; that means the frequency response of a parallel connection is the sum of all the frequency responses. For instance, a band-reject filter can be obtained by connecting a low-pass and a high-pass filter in parallel. An interesting example of parallel connection is represented by the constant-Q filterbank, which consists of an array of constant-Q band-pass filters that separates the input signal into several components, each one carrying a single frequency subband9 of the original signal. For musical purposes, these subbands are normally non-overlapping and exponentially distributed, e.g. over the whole frequency range of human hearing, between 20 Hz and 20 kHz. A special type of constant-Q filterbank is historically represented by the octave filterbank; in particular, third-octave filterbanks have also been standardized for use in audio analysis10 [12]. In a third-octave filter bank, the center frequencies of the various bands are exponentially spaced along the frequency axis, in the way described by the formula:

fc[k] = 2^(k/3) · 1000 Hz

where fc[k] are the center frequencies of an array of k filters, the first of which is centered at fc[0] = 1000 Hz, as an example. The bandwidth of the kth filter is proportional to the kth center frequency, as the following formula states:

BW[k] = fc[k] · (2^(1/3) − 1) / 2^(1/6)

Therefore, since the expression of the bandwidth contains the center frequency, the quality factor Q[k] = fc[k]/BW[k] is constant for all k filters.
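As a quick numerical check of the relations above, the following sketch (not part of the original thesis; plain Python, no external libraries) computes a few third-octave center frequencies and bandwidths and verifies that the resulting Q is the same for every band.

```python
# A minimal sketch: center frequencies and bandwidths of a third-octave
# filter bank, showing that the quality factor Q is constant for every band.
# fc[0] = 1000 Hz follows the example given in the text.
for k in range(-6, 7):                               # a few bands below and above 1 kHz
    fc = 2 ** (k / 3) * 1000.0                       # fc[k] = 2^(k/3) * 1000 Hz
    bw = fc * (2 ** (1 / 3) - 1) / 2 ** (1 / 6)      # BW[k] proportional to fc[k]
    q = fc / bw                                      # quality factor, the same for all k
    print(f"k={k:+d}  fc={fc:8.1f} Hz  BW={bw:7.1f} Hz  Q={q:.2f}")
```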

Cascade connection, also called series connection, is the other way to connect filters to each other. In this case, the signal passes through a series of filters, one by one, respecting the linking order. The direct consequence is that the overall amplitude response becomes the multiplication (thus the sum, in dB) of the individual filter responses, while the overall filter order becomes the sum of the individual filter orders [18]. For instance, cascading two or more low-pass filters with the same frequency response makes it easy to obtain a higher rolloff, i.e. a greater attenuation around the crossover frequency. Cascade connection of filters may be critical in some cases; for example, much care must be taken when designing a series of filters with different bandwidths. Each filter of the cascade must guarantee that significant energy passes through at least a common range of frequencies, otherwise the output could be inaudible.
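To illustrate the cascade rule, here is a small sketch (my own illustration, assuming NumPy is available): convolving the impulse responses of two identical first-order low-pass filters gives the series connection, and its frequency response is exactly the product of the individual responses.

```python
# A minimal sketch: the frequency response of a cascade of two filters equals
# the product of the individual responses (the sum of their responses in dB).
import numpy as np

h = np.array([0.5, 0.5])                     # IR of y[n] = 0.5*(x[n] + x[n-1])
h_cascade = np.convolve(h, h)                # series connection: convolve the IRs

w = np.linspace(0, np.pi, 512)               # normalized frequency, 0 .. Nyquist
H1 = np.array([np.sum(h * np.exp(-1j * wk * np.arange(len(h)))) for wk in w])
H2 = np.array([np.sum(h_cascade * np.exp(-1j * wk * np.arange(len(h_cascade)))) for wk in w])

# the cascade response is the product of the two individual responses
assert np.allclose(H2, H1 * H1)
```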

9A submultiple of the signal's bandwidth.

10Third-octave filters are useful because they have a good correlation with the subjective response of the human ear.


Impulse response and time-domain effect of filtering

As well as being characterized by the frequency response, every filter has an impulse response.

The impulse response is the time-domain description of the filter's response to very short pulses, approximations of the mathematical unit impulse function (or Dirac delta11). Studying the filter's response to a unit impulse is useful to determine the filter's response to any short-time change of the input signal. Therefore, the impulse response can be interpreted as the adaptive response of a filter, the one most related to the time-varying characteristics of the signal. That is why filters are sometimes designed to have a specific impulse response instead of a specific frequency response. What the frequency response shows us is the filter's response after a stable output is reached, that is, after a long enough time has elapsed from the beginning of the filter's operation. However, in order to understand the effect of filtering, a time-domain point of view is equally important.

To anticipate what follows, which includes music-related content, figure 3.5 is proposed as an example. What we will say about sound in the figure refers to a pure sinusoidal tone, the simplest sound produced by an oscillator12.

By definition, every sound can be delimited between the so-called attack and decay portions, which roughly correspond to the sudden start and the softer end of a sound. It has been demonstrated that their role leads to one of the most important time-related features of the human auditory system, the perception of events. We recall this observation here because it will be important later in this thesis. Thus, an alternative way to design a filter is to derive its impulse response, and one way to do this is to observe how the filter reacts at the beginning and at the end of the signal passing through it. These are two particular cases of the so-called transient response, which introduces another fundamental aspect related to sound. Transients will be discussed in more detail in chapter 6. In any case, a close relation exists between impulse and transient response [18].

Still referring to figure 3.5, the transient response is evident, acting as a time-stretching operation applied to the initial portion of the sine wave and, similarly, at the end. The transient response, at least in this simple case, can be associated with the two time intervals required by the filter to produce the steady-state output after the attack transient occurs, and to let the sound die out after the decay transient. This behavior becomes very dangerous when a large number of filters

11The Dirac delta is defined as δ(t) = ∞ for t = 0 and δ(t) = 0 otherwise, with the constraint ∫_{−∞}^{+∞} δ(t) dt = 1.

12The digital oscillator is the simplest sound generator in computer music. It represents the propagation (in the air) of a single sound wave at a certain frequency and amplitude. It is still present in current computer synthesis programs and was introduced in 1960 as the fundamental Unit Generator in Music 3 by Max Mathews.


Figure 3.5: Alteration of the envelope of a tone (INPUT) passed through a narrow filter (OUTPUT). The output envelope has been stretched in time during the onset and offset components of the tone (initial and final portions).

are connected in cascade, since each one affects the time duration of the sound and unwanted distortions can occur.

3.2.2 Introduction to Digital Audio Processing with Filters

In the first part of the chapter we described the process by which a generic audio signal is transformed into digital samples and stored on the hard disk drive of a digital computer. Thus, the simplest use of the computer in an audio system is the digital recording and reproduction of sound files. The next step is represented by all the useful and/or aesthetic operations that can be performed on digital audio signals. This is the vast area of digital audio processing, in which filters play a determinant role. In digital audio processing, four fields of particular interest can be isolated because of their extensive treatment in the literature: sound mixing ([44]), delays and effects ([63] for all), sound spatialization and reverberation13 ([63], and [41][15][55] for implementation examples) and sound modeling14 ([44][52][33]). Useful applications of sound processing can be found in all the widespread digital audio systems, e.g. in digital music players (mp3,

13Sound spatialization is essentially the movement of sound through space. Dolby Digital Surround is one of the best known techniques.

14An example of sound modeling is modeling sound by applying physical knowledge.


CD, MiniDisc, etc.). Compressors, limiters, expanders, noise gates and noise reduction are only a few examples of the sound processing normally applied to the dynamic range of the music we hear, coming from almost every digital music medium.

Digital filters play a primary role in nearly all sound processing applications. Classical filter theory having been quickly presented in the previous sections, the hard task is now to transpose some of those concepts to the discrete world of quantized samples.

For that purpose, a general and expressive definition coming from LTI system theory will be generalized to the case of digital filters, but some clarifications must precede it. Systems that do not change their behavior in time and fulfill the superposition property15 are called linear time-invariant (LTI); their most important property is that they can be completely characterized by the impulse response, that is, their behavior for a short impulse input. Hence, the general definition we will adopt for explicit filter realizations is the following: the output signal of an LTI system is given by the convolution of the impulse response with the input signal.

Hence, the assumption here is to consider filters as LTI systems.

Impulse Response

The general definition of the impulse response of a filter is the response of such a filter when fed with a short pulse. The short pulse can be considered a test signal through which the characteristics of the filter are revealed. The common test signal used for LTI systems, such as filters, is the unit impulse, defined as:

UI(t) = { 1, if t = 0; 0, otherwise }

In the case of discrete systems, such as digital filters, the unit impulse is obtained by substituting t with n and delimiting the sample index in square brackets [·], to avoid ambiguity. Therefore, for discrete LTI systems, the unit impulse can be rewritten as follows:

UI[n] = { 1, if n = 0; 0, otherwise }

which can be seen as a one-sample impulse. In digital terms, the briefest possible signal (the approximation of the unit impulse) is exactly a single sample, which contains energy

15In the case of filters, the superposition property states: when two signals are added together and fed to the filter, the filter's output is the same as if the two signals were put through the filter separately and their outputs then added.


at all the frequencies below the Nyquist frequency16. By definition, the output signal of a filter fed with the unit impulse is the impulse response, henceforth simply called IR.

Since we can say that the unit impulse contains all the frequencies of the signal, the IR can also be seen as the time-domain representation of the amplitude-versus-frequency response, earlier presented as the frequency response. The bridge between the two domains is represented by convolution.

Convolution

Convolution is a generic signal processing operation, like addition or multiplication, but of much greater interest, because convolving two signals in the time domain is equivalent to multiplying them in the frequency domain. That is why the convolution operation is considered the bridge between the two domains. Convolution is a fundamental operation in digital audio processing as well as in filters. Let's see how it works, starting from the formula representing the previous definition given for LTI systems, now stated for the case of filters: the output signal y[n] of every digital filter is given by the convolution of the impulse response of the filter with the input signal x[n]:

y[n] = x[n] ∗ h[n]

where ∗ denotes convolution and h[n] is the impulse response. When the impulse response is the one-sample impulse acting as unit impulse, convolution proves to be an identity operation:

y[n] = x[n] ∗ UI[n] = x[n]

That is, every function convolved with the unit impulse remains the same.

While speaking of convolution in terms of signal processing, two other properties deserve attention. Convolving the input signal with a scaled version of the unit impulse gives:

y[n] = x[n] ∗ (c · UI[n]) = c · x[n]

and convolving the input signal with a delayed copy of the unit impulse, by means of time-shifting, gives:

y[n] = x[n] ∗ UI[n− t] = x[n− t]

16According to Fourier theory, an inverse relationship exists between the duration of a signal and its frequency content: the shorter the signal, the wider the spectrum.


Figure 3.6: Echo and reverberation effects explained by convolution.

That is, the result of the convolution between the input signal and a scaled or time-shifted unit impulse is the same as scaling or time-shifting the input signal. Consequently, any input signal can be represented by a sequence of scaled and delayed unit impulse functions. Not only that: easily recognizable effects in sound systems, such as echo and reverberation, can be recreated by very simple but appropriate designs of the IR function, as shown in figure 3.6.

In the case of the reverberation effect, shown on the right side of figure 3.6, the time-smearing17 effect occurs when the two time-shifted unit impulse functions are too close relative to the duration of the sounds, so that the first sound cannot be separated from its following replica. These effects, when dense, assume the form of reverberation.

The law of convolution, applied to computer music, states that the convolution of two waveforms18 in the time domain is equal to the multiplication of the two spectra in the frequency domain. This is a fundamental concept in sound processing techniques, because any transformation applied to a sound in the time domain has a direct correspondence in the frequency domain, and vice versa.
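The following sketch (my own illustration, assuming NumPy is available) demonstrates the properties listed above on a toy signal: convolution with the unit impulse is an identity, scaling and delaying the impulse scales and delays the signal, and an IR made of two scaled, delayed impulses produces a simple echo.

```python
# A minimal sketch of the convolution properties discussed in the text.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])                 # a short input signal

unit = np.array([1.0])                              # UI[n]: one-sample unit impulse
assert np.allclose(np.convolve(x, unit), x)         # identity property

assert np.allclose(np.convolve(x, 0.5 * unit), 0.5 * x)   # scaling property

delayed = np.array([0.0, 0.0, 1.0])                 # unit impulse delayed by 2 samples
print(np.convolve(x, delayed))                      # -> [0. 0. 1. 2. 3. 4.], a delayed copy

echo_ir = np.zeros(8)
echo_ir[0] = 1.0                                    # direct sound
echo_ir[5] = 0.6                                    # attenuated, delayed replica (echo)
print(np.convolve(x, echo_ir))                      # input followed by its echo
```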

Finally, the mathematical definition of discrete convolution, applied to two generic

17See chapter 3. Time-smearing is a phenomenon which occurs when two sounds close in time cannot be separated by the time resolution of the ear.

18From chapter 3 onward, waveform is used to denote the analogue sound signal in the time domain.


finite-length input signals a[n1] and b[n2], is:

y[k] = (a ∗ b)[k] = Σ_{m=0}^{n1−1} a[m] · b[k − m]

To enhance the analogy with filters, the formula may be interpreted this way: a[n] acts as a weighting function (such as the IR) for each delayed copy of b[n] (i.e. the input signal). The result of the operation, y[k], is K samples long, with

K = length(a) + length(b) − 1

This way of computing convolution, one sum for each value of k, is called direct convolution. The direct form is computationally intensive, requiring on the order of N² operations, where N is the length of the longer of the two inputs. A faster way to implement convolution on digital computers was found: the FFT19 algorithm is applied to both operands, the resulting spectra are multiplied, and the product is finally brought back to the time domain through the IFFT20 algorithm. Fast convolution drastically reduces the computational complexity to the order of N·log N operations.
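A small sketch of both approaches follows (my own illustration, assuming NumPy is available): the direct double loop against the FFT–multiply–IFFT route, which returns the same result.

```python
# A minimal sketch comparing direct convolution (O(N^2)) with fast convolution
# via FFT/IFFT (O(N log N)), as described above.
import numpy as np

def direct_convolution(a, b):
    K = len(a) + len(b) - 1                  # output length
    y = np.zeros(K)
    for k in range(K):
        for m in range(len(a)):
            if 0 <= k - m < len(b):
                y[k] += a[m] * b[k - m]      # one sum per output sample
    return y

def fast_convolution(a, b):
    K = len(a) + len(b) - 1
    nfft = int(2 ** np.ceil(np.log2(K)))     # FFT size: next power of two
    A = np.fft.rfft(a, nfft)                 # FFT of both operands (zero-padded)
    B = np.fft.rfft(b, nfft)
    return np.fft.irfft(A * B, nfft)[:K]     # multiply spectra, back with the IFFT

rng = np.random.default_rng(0)
a, b = rng.standard_normal(100), rng.standard_normal(50)
assert np.allclose(direct_convolution(a, b), fast_convolution(a, b))
```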

Transfer function and frequency response

The frequency-domain description of a digital filter reflects its ability to pass, reject or enhance certain frequencies included in the input signal spectrum. The common terms used to describe the characteristics of filters in the frequency domain are the transfer function H(z) and the frequency response H(f). Both can be obtained by means of mathematical transforms applied to the impulse response. Since the transfer function is considered a useful tool when designing filters, especially in the electronics literature, the Z-transform must first be introduced. Here is the Z-transform definition for a generic digital signal x[n]:

X(z) = Σ_{n=−∞}^{+∞} x[n] · z^(−n)

which is related to the Discrete Fourier Transform, introduced later, by the substitution z = e^(jω). The transfer function is obtained by applying the Z-transform to the IR h[n] of the filter, as follows:

19Fast Fourier Transform
20Inverse Fast Fourier Transform


H(z) = Σ_{n=−∞}^{+∞} h[n] · z^(−n)

while the frequency response can be achieved by applying the DFT to the IR of the filter.
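As an illustration (my own sketch, assuming NumPy is available), the frequency response of a filter can be estimated numerically by taking the DFT of its impulse response; here this is done for the two-tap averaging filter y[n] = 0.5·(x[n] + x[n−1]) that appears later in the chapter.

```python
# A minimal sketch: frequency response H(f) obtained as the DFT of the
# impulse response h[n] of a simple first-order low-pass FIR filter.
import numpy as np

fs = 44100.0
h = np.array([0.5, 0.5])                    # impulse response (IR)

H = np.fft.rfft(h, 1024)                    # zero-padded DFT of the IR
freqs = np.fft.rfftfreq(1024, d=1.0 / fs)   # bin frequencies in Hz

magnitude_db = 20 * np.log10(np.abs(H) + 1e-12)
phase = np.angle(H)
print(f"gain at DC: {magnitude_db[0]:.1f} dB, at Nyquist: {magnitude_db[-1]:.1f} dB")
```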

3.2.3 Digital implementation of filters

Traditionally, digital filter realizations have been classified into two large families:

• non-recursive filters

• recursive filters

These names come from the nature of the algorithms used to realize those filters. From the transfer function point of view, the two categories can be reformulated as follows:

• filters whose transfer function does not have a denominator

• filters whose transfer function has a denominator

In both cases, every output sample is calculated as a combination of previous input and/or output samples. Following the order of the previous classification, the possible combinations are:

• current input samples with past input samples

• present and past input samples together with past output samples

Since digital filters base their output on combinations of past input/output samples, they imply the concept of "memory". On computers, the first kind of realization is easier than the second. The concept of the delay line21 should be introduced. A generic scheme used to represent a delay line is presented in figure 3.7. The amount of delay required depends on the memory we need for the specific filter design. This delay determines the number of memory cells dedicated to the storage of the delayed samples. As an obvious consequence, the storage space required is greater and the computational cost higher when we take into account many past samples.

Finally, since the two possible realizations of filters depend on the nature of their IR, the last and most commonly used subdivision of filters is the following:

• Finite Impulse Response filter (FIR filter)

• Infinite Impulse Response filter (IIR filter)

21A recirculating memory unit, whose purpose is to delay the incoming signal by an established number of samples. See [41] for more on delay lines and their digital implementation.


Figure 3.7: Simple delay line.

3.2.4 FIR Filters

In a FIR filter, the response due to an impulse input decays within a finite time. Conversely, in IIR filter realizations, the impulse response theoretically never dies out. In comparison, FIR filters are easier to implement but slower than IIR filters; IIR filters are fast, but their practical implementation is a bit tougher than that of FIR filters.

A FIR filter is a linear combination of a finite number of samples of the input signal.

y[n] = Σ_{m=0}^{N} h[m] · x[n−m]

In the equation above, the convolution formula given earlier can easily be recognized. Here h[m] is the finite impulse response, typical of FIR filter realizations. The time extension of the impulse response determines the length of the filter, which is N + 1. As introduced above, the transfer function is obtained by applying the Z-transform to the impulse response, which results in:

H(z) = Σ_{m=0}^{N} h[m] · z^(−m) = h[0] + h[1]·z^(−1) + h[2]·z^(−2) + … + h[N]·z^(−N)

The simplest example of a FIR filter is the first-order low-pass filter, which takes into account only the previous input sample. The formula of this kind of filter is the following:

y[n] = 0.5(x[n] + x[n− 1])


Besides, to obtain a high-pass filter, again of the first order, we must simply change the sign of the second term, like this:

y[n] = 0.5(x[n]− x[n− 1])

The general formula for FIR filters follows:

y[n] = a0 · x[n]± a1 · x[n− 1]± . . .± ai · x[n− i]

In order to run an N-th order FIR filter we need to have, at any instant, the current input sample together with the sequence of the N preceding samples. These N samples constitute the memory of the filter. In practical implementations, it is customary to allocate the memory in contiguous cells of the data memory or, in any case, in locations that can be easily accessed sequentially. At every sampling instant, the state must be updated in such a way that x(k) becomes x(k + 1), and this seems to imply a shift of N data words in the filter memory. Indeed, instead of moving data, it is convenient to move the indexes that access the data.

As an example, three memory words are put in an area organized as a circular buffer (see figure 3.8). The input is written to the word pointed to by the index, and the three preceding values of the input are read with the three preceding values of the index. At every sampling instant, the four indexes are incremented by one, with the trick of restarting from location 0 whenever the length M of the buffer is exceeded (this ensures the circularity of the buffer). The counterclockwise arrow indicates the direction taken by the indexes, while the clockwise arrow indicates the movement that the data would have to make if the indexes stayed in a fixed position.

As a matter of fact, a FIR filter contains a delay line, since it stores N consecutive samples of the input sequence and uses each of them with a delay of N samples at most. The points where the circular buffer is read are called taps, and the whole structure is called a tapped delay line.
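A minimal sketch of this structure follows (my own illustration, plain Python): the write index moves around a circular buffer while the taps read the most recent samples, so no data words are ever shifted.

```python
# A minimal sketch (not the thesis' code) of a FIR filter implemented with a
# circular buffer: only the write index moves, the stored samples never do.
class FIRFilter:
    def __init__(self, coefficients):
        self.h = list(coefficients)          # impulse response h[0..N]
        self.buffer = [0.0] * len(self.h)    # circular buffer (the delay line)
        self.pos = 0                         # write index

    def process(self, x):
        self.buffer[self.pos] = x            # store the current input sample
        y = 0.0
        for m, hm in enumerate(self.h):      # read the taps going backwards in time
            y += hm * self.buffer[(self.pos - m) % len(self.buffer)]
        self.pos = (self.pos + 1) % len(self.buffer)   # advance the write index
        return y

# usage: the first-order low-pass of the text, y[n] = 0.5*(x[n] + x[n-1])
lp = FIRFilter([0.5, 0.5])
print([lp.process(x) for x in [1.0, 0.0, 0.0, 1.0]])   # -> [0.5, 0.5, 0.0, 0.5]
```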

3.2.5 IIR Filters

The filters of the second family admit only recursive realizations; thus the impulse response of these filters is infinitely long, justifying their name, Infinite Impulse Response (IIR) filters. In general, an IIR filter is represented by a difference equation in which the output signal at a given instant is obtained as a linear combination of samples of the input and output signals at previous time instants. The simplest nontrivial IIR filter that can be conceived, the one-pole filter having coefficients a1 = 1/2 and b0 = 1/2, is defined by:

y[n] = 0.5(y[n− 1] + x[n])


Figure 3.8: Circular buffer.

and the transfer function of this filter is:

H(z) = (1/2) / (1 − (1/2)·z^(−1))

Due to the advanced forms in which digital filters can be designed, the result obtainedcould be even more precise than the analog counterpart.
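As a quick sketch (my own illustration, plain Python), the one-pole filter above can be run sample by sample; feeding it the unit impulse shows the exponentially decaying, theoretically infinite impulse response.

```python
# A minimal sketch of the one-pole IIR filter y[n] = 0.5*(y[n-1] + x[n]).
def one_pole(signal, a1=0.5, b0=0.5):
    y_prev = 0.0
    out = []
    for x in signal:
        y = b0 * x + a1 * y_prev             # recursion: uses the past output sample
        out.append(y)
        y_prev = y
    return out

# feeding the unit impulse shows the infinite (exponentially decaying) IR
print(one_pole([1.0, 0.0, 0.0, 0.0, 0.0]))   # -> [0.5, 0.25, 0.125, 0.0625, 0.03125]
```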


Chapter 4

...To Sound Spectrum Analysis


4.1 Introduction to Sound Analysis in the Frequency Domain

In this chapter, when we speak of an analog signal we will usually refer to a waveform. The spectral analysis of musical sounds is the legacy of Fourier analysis. Although other methods have been explored, Fourier concepts are still applied in every digital sound application. The soundness of Fourier analysis resides in its representation, which highlights characteristics similar to those of the auditory system as described by psychoacoustics.

We mentioned before, at the end of chapter 2, that all the operations that can be performed on the sampled signal pursue goals coming from the necessity to produce a different output (i.e. sound processing, in which filters have a dominant role), to generate output from nothing (i.e. synthesis), or to analyze the digital samples to predict some of the sound's characteristics. The principal efforts in sound analysis are pitch recognition, timbre perception, rhythm recognition and bpm extraction. In some cases analysis is only the first stage of a reconstruction of the original signal, that is, resynthesis. Audio synthesis can be definitely different from the above, because it is the process of generating sounds, and this can be obtained in so many ways that a suitable discussion in this thesis would be too long.

In general, in digital music systems, we can distinguish between:

• Audio synthesis, the process of generating a stream of audio samples by algorithmic means.

• Audio analysis, which takes a digital signal (leaving the stream of samples unaltered) and mathematically determines its characteristics.

Those systems are represented in figure 4.1.

Digital audio synthesis

Audio synthesis is the other great challenge in digital audio. Since no sound is generated by synthesis for the purposes of this thesis, we will treat the argument only marginally.

Everything started with the necessity of letting the computer speak in a human (and not humanoid) way. Building on psychoacoustics and vocoder techniques1, several studies made it possible to implement the physical knowledge behind the human auditory system. Hence the idea: why not try to replicate with the computer the most sophisticated sound produced by the human body, speech?

1A phase vocoder is a type of vocoder which can scale both the frequency and time domains of audio signals by using phase information. The computer algorithm allows frequency-domain modifications to a digital sound file (typically time expansion/compression and pitch shifting).


Sound synthesis made by computer started in 1957 with Max Mathews' MUSIC 1. MUSIC 3, in 1960, introduced the concept of the Unit Generator (UG), the simplest instrument for the computer and the greatest change in the approach of the computer sound programmer. With a UG one can create a sine wave to produce an oscillator; with logical and arithmetic UGs one can multiply two oscillators to produce another sound, design filters' frequency and impulse responses, combine filters with oscillators to create new, more complex sounds, and so on, to infinity. The UGs so created quickly increased in complexity, in parallel with the rapid rise of microelectronics, becoming one of the typical features of most music programming languages. With the consequent development of faster algorithms, music synthesis has been widely extended into many research areas.

Curtis Roads says about synthesis: "After Max Mathews in 1957, dozens of sound synthesis techniques have been invented. As in the field of computer graphics, it is difficult to say at any time which techniques will flourish and which will fade over time. This situation is fueled by competitive pressure in the music industry, making it inevitable that synthesis methods fall in and out of fashion, because no one of these methods can satisfy" [44]. As just a reminder, some of the synthesis methods are reported here, in no particular order:

• wavetable synthesis

• sampling synthesis

• additive synthesis

• subtractive synthesis

• sinusoidal synthesis

• granular synthesis

• modulation synthesis

• physical modeling synthesis

• formant synthesis

• residual synthesis

• graphic synthesis

• stochastic synthesis


Figure 4.1: Digital sound synthesis and sound analysis. In the synthesis chain, the computer (algorithms, wavetables, oscillators) feeds a DAC followed by a low-pass filter to produce the analog audio output; in the analysis chain, the analog audio input passes through a low-pass filter and an ADC into the computer, where sound analysis extracts attributes such as loudness, pitch, rhythm and bpm.

Digital audio analysis

Any sound can be interchangeably represented in the time domain by a waveform or in the frequency domain by a set of spectra [54]. Thus, in the following pages, waveform is the term adopted for the audio signal. Three main aspects are treated in analyzing sounds: pitch recognition, rhythm detection and spectrum analysis (also helpful for the other two). Many synthesis techniques are based on data output by sound analysis; in these cases we can speak of analysis/synthesis or analysis/resynthesis techniques. To understand this purpose, imagine a peak detector applied to a musical stream, and think of using the frequencies and amplitudes of every detected peak (derived from the analysis) to drive digital oscillators.

4.2 Introduction to the Fourier Analysis

Spectral modelling techniques are the legacy of Fourier analysis theory. Originally developed in the nineteenth century, Fourier analysis considers that a pitched sound is made up of various sinusoidal components, where the frequencies of higher components are integral multiples of the frequency of the lowest component. The pitch of a musical note is then assumed to be determined by the lowest component, normally referred to as the fundamental frequency. In this case, timbre is the result of the presence of specific components and their relative amplitudes, as if it were the result of a chord over a prominently loud fundamental with notes played at different volumes. Despite the fact


that not all interesting musical sounds have a clear pitch, and that the pitch of a sound may not necessarily correspond to the lowest component of its spectrum, Fourier analysis still constitutes one of the pillars of acoustics and music [36].

Origin

Since Jean Baptiste Joseph, Baron de Fourier, published his revolutionary theory in 1822, the events that made its history can be traced in rapid succession:

• 1870: first mechanical harmonic analyzer;

• 1898: first mechanical harmonic analyzer that could be reversed into a waveform synthesizer;

• 1930: the advent of analog filters made spectrum analysis possible;

• 1940: first digital implementations on computers;

• 1960: the advent of the FFT algorithm enormously reduced the computation needed for the Fourier transform;

• 1977: the advent of the STFT, short-time Fourier transform, widely used in music systems.

What Fourier stated, in a few mathematical words, was that a complex but periodic signal can be seen as a sum of simple signals. In a musical context this means that a periodic waveform can be deconstructed into a combination of simple sinusoidal waves, each one with its own amplitude, frequency and phase. On digital computers a sine wave is generated by an oscillator (the first UG was an oscillator) able to produce sound as a sine wave with only three parameters: amplitude, frequency and phase. In engineering mathematics an oscillator is normally expressed in another form through Euler's relations, which allow sine and cosine functions to be expressed by means of complex exponentials.
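A small sketch of this idea (my own illustration, assuming NumPy is available): a digital oscillator written as a complex exponential, whose real part is the familiar cosine wave defined by amplitude, frequency and phase.

```python
# A minimal sketch: an oscillator expressed through Euler's relation; the real
# part of A*e^(j(2*pi*f*n/fs + phi)) is the cosine wave with the same parameters.
import numpy as np

fs, f, A, phi = 44100, 440.0, 0.8, 0.0       # sample rate, frequency, amplitude, phase
n = np.arange(fs)                            # one second of samples
oscillator = A * np.exp(1j * (2 * np.pi * f * n / fs + phi))

assert np.allclose(oscillator.real, A * np.cos(2 * np.pi * f * n / fs + phi))
```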

In this chapter we will introduce a particular Fourier-based analysis and synthesis system, called the short-time Fourier transform (STFT), due to Allen and Rabiner (1977). This is a very general technique, useful in the study of time-varying signals such as musical sounds, which can be used as the basis for more specialized techniques. In the following chapters the STFT is taken as the basis for several analysis/synthesis systems.

In musical contexts, the Fourier Transform is applied to analog signals having a limited bandwidth (FT) or to a finite number of digital samples (DFT or STFT).

We can summarize here the techniques used to compute FT over analog and digitalinput signals:

• FT, time-continuous signal input, frequency-continuous spectrum output.


• DTFT, time-discrete signal input, frequency-continuous spectrum output.

• DFT, time-discrete input signal, frequency-discrete spectrum output.

• STFT, short-time segments of a time-discrete input signal, time-varying frequency-discrete spectrum output.

4.2.1 Fourier Transform (FT), Classic Formulation

In signal processing, the Fourier Transform is intended as the mathematical transformation by which a time-domain signal is converted into its frequency spectrum. Usually, the result of the FT is simply called the spectrum of the input signal. In its original formulation, the FT extends the spectrum of the signal over the whole frequency range, from 0 to ∞. Here is the definition:

X(ω) = ∫_{−∞}^{+∞} x(t) · e^(−jωt) dt

where x(t) is a generic waveform, and t and ω are the continuous time index and the continuous frequency index. ω is the angular frequency, expressed in radians per second; the simple relationship with the corresponding frequency in Hz is f = ω/(2π).

The FT also leads to another interpretation, more interesting in a musical context, that is, the decomposition of the waveform into an infinite number of sinusoidal components.

The result of the FT is a complex value X(ω) for every value of ω, and X(ω) as a whole is usually considered the spectrum of x(t). Each complex value, expressed in the form (a + jb), with a and b the real and imaginary parts, reveals the three fundamental components of a sinusoid: frequency, amplitude and phase. Obviously, ω is the frequency, and the other two can be computed with the following simple formulas:

amplitude: |X(ω)| = √(a² + b²)

phase: arg[X(ω)] = arctan(b/a)

X(ω) is again the whole spectrum of x(t), and the original signal x(t) can be reconstructed by means of the Inverse Fourier Transform, defined as follows:

x(t) = (1/2π) ∫_{−∞}^{+∞} X(ω) · e^(jωt) dω

The Fourier Transform applies only to time-continuous signals, i.e. waveforms. Let's see how things work with digital signals.


4.2.2 Discrete Fourier Transform (DFT)

Figure 4.2: Two plots of static spectra. The image represents the SPL against frequency of a drum hit played by a robot (on the left) and of a violin note (on the right). The difference is noticeable: while the robot hit has apparently no harmonically related frequency components, in the violin note the harmonic structure is clear.

In digital computers, waveforms are transformed into discrete samples by means of sampling, therefore in this case the Discrete Fourier Transform is computed in place of the FT. A signal that has a discrete-valued representation in the frequency domain is a periodic signal, which means that its spectrum shows isolated spectral lines.

The DFT formula can be written as follows:

X[k] = Σ_{n=−N/2}^{N/2−1} x[n] · e^(−j ωk n)

where x[n] is the nth value of a discrete-time signal N samples long (that is why the sum in the formula goes from −N/2 to N/2 − 1), ωk = 2π · (k/N) is the discrete angular frequency, k is an integer going from 0 to N − 1, and N must be chosen even.

While X[k] is called the discrete spectrum, the k-th discrete frequency sample X[k] is called the k-th frequency bin. In the DFT, the relationship between discrete angular frequency and frequency in Hz is:

f = fs · ωk/(2π) = k · fs/N

where fs = 1/T is the sampling frequency and T the period between samples.


Due to the discrete values of k, the DFT assumes that x[n] can be represented by a finite number of sinusoids, which means that the signal x[n] is band-limited in frequency. Moreover, the frequencies of the sinusoids are equally distributed between 0 Hz and the sampling rate fs or, in radians, between 0 and 2π.

The DFT internally hides the frequency-domain sampling operation, because there is a direct correspondence between the number of input samples and the number of output frequencies.

The inverse DFT is defined as:

x[n] = (1/N) · Σ_{k=−N/2}^{N/2−1} X[k] · e^(j ωk n)

There is a computationally faster version of the DFT, called the FFT. The algorithm used to compute the FFT substitutes complex products with weighted sums, so that the computational cost is reduced from N² to N·log N. It is still one of the most used and advanced techniques for implementing the DFT on digital computers, especially where a real-time DFT is needed or memory space is a critical point (i.e. on chips).
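A small numerical check (my own sketch, assuming NumPy is available): computing the DFT directly from its definition and comparing it with the FFT, which returns the same spectrum at a much lower cost.

```python
# A minimal sketch: DFT computed directly from its definition (N^2 complex
# products) against the FFT, which gives the same result in about N*log(N).
import numpy as np

def dft_direct(x):
    N = len(x)
    n = np.arange(N)
    # one complex exponential per bin k, summed over all samples n
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.random.default_rng(1).standard_normal(256)
assert np.allclose(dft_direct(x), np.fft.fft(x))
```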

Unfortunately, both the FT and the DFT work only for periodic signals: in music, only an accurate note coming out of a tuned musical instrument can be treated as a periodic waveform, while most sounds are non-periodic and time-varying waveforms.

So, let’s now introduce to the most used FT technique for musical purpose on digitalcomputers, the Short Time Fourier Transform.


4.3 The Short Time Fourier Transform (STFT)

One of the main problems with the original Fourier transform theory is that it does not take into account that the components of a sound spectrum vary substantially during its course. In this case, the result of the analysis of a sound of, for example, 5 minutes' duration would report the various components of the spectrum but would not tell when and how they developed in time [36].

The Short Time Fourier Transform represents a solution to this problem. It splits the sound into short-time segments, performing an operation called windowing, and sequentially analyses each segment. Normally the FFT technique is applied to compute the FT on each windowed portion, because its computational cost is much lower. The reasons for the wide use of this technique can be summarized as follows: the spectrum derives from a sequence of individual analysis windows that can trace the time evolution of the sound. Thus the spectrum can be seen as a set of spectra equally spaced in time, one for each windowed portion of the waveform, giving a more sophisticated and convenient representation than the DFT; this can be seen in figure 4.4. The STFT is also a focal point in sound analysis, because a time-varying spectrum is closer to the behavior of the human auditory system, and therefore it can be helpful in determining perceptual attributes, above all pitch and timbre.

How short should the short-time segment be? Typically, less than 1/10 of a second.

Figure 4.3: Basic operation of the STFT used for sound analysis: the waveform is windowed, an FFT is computed for each window, and the rectangular-to-polar conversion yields the magnitude and phase spectra.

The STFT operation over a waveform can be interpreted in two ways:

• windowed DFT, where the DFT (via the FFT) is computed over each windowed segment; windows may overlap

• filterbank view, a bank of bandpass filters equally spaced across the frequencydomain (i.e. from 0Hz to Nyquist frequency)


Windowed DFT (realized by FFT)

Windowing means that the incoming signal is segmented into temporal windows, each of which has the same duration, although the windows may overlap. Then the segmented portions of the signal are analyzed separately with the DFT (or FFT). The general formula of the STFT is:

X[n,k] = Σ_{m=−∞}^{+∞} x[m] · h[n−m] · e^(−j ωk m)

where the output X[n,k] is the DFT of the windowed input at each discrete time n, for each discrete frequency bin k. h[n−m] is the time-shifting window function that follows the signal. In the general formulation m varies from −∞ to +∞, but it can be restricted to the actual length of the window. N is the number of points in the spectrum.

The angular frequency associated with each bin k is:

ωk = 2π · k/N, corresponding to the frequency fk = k · fs/N in Hz

Another formulation of the STFT is that of X. Serra, where H, the hop size, is the time advance of the analysis along the incoming signal, substituting the time-shifting window function.

It is also a function of two variables; the definition follows:

X[l,k] = Σ_{n=0}^{N−1} w[n] · x[n + lH] · e^(−j ωk n)

Now w(n) is a real window, l indicates the frame passed through the window and, again, the exponential function is the same. X[l,k], the spectrum, is the DFT of the sequence w(n)·x(n + lH) for 0 ≤ n ≤ N − 1. The spectrum is computed at every frame l, advancing by H along the input signal x(n).
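A minimal sketch of this formulation (my own illustration, assuming NumPy is available): each frame is multiplied by the window and transformed with the FFT, while the frame start advances by the hop size H.

```python
# A minimal sketch of the STFT: window each frame, take its FFT, advance by H.
import numpy as np

def stft(x, window, hop):
    N = len(window)
    frames = []
    for start in range(0, len(x) - N + 1, hop):
        frame = x[start:start + N] * window          # windowing
        frames.append(np.fft.rfft(frame))            # DFT (via FFT) of the frame
    return np.array(frames)                          # one spectrum per frame

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)                    # a 440 Hz test tone
X = stft(x, np.hanning(1024), hop=256)               # 75% overlap
magnitude = np.abs(X)                                # |X[l,k]|: time-varying spectrum
print(X.shape)                                       # (number of frames, 513 bins)
```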

4.3.1 The Filterbank View

The other possible view of the STFT is that of a group of band-pass filters (a filterbank2), equally spaced in order to cover the whole frequency band of the input signal. This method is similar to the one used in a spectrum equalizer, in which the shape of the spectrum can be modeled by the user by controlling the level of each filter. Here, however, all the filters have the same bandwidth and the center frequencies are equally spaced up to the Nyquist frequency.

2See chapter 3 for more details on filterbanks.


Figure 4.4: Waterfall spectrum, a 3D representation of the STFT spectrum. The graph was obtained with the Spectutils package for GNU Octave. The analysis parameters of the STFT are shown above the figure; the audio sample analyzed is extracted from Laurie Anderson's Violin Solo.

The STFT interpreted in this way can be seen as a filterbank which performs analysis in parallel on each windowed segment of the input signal. For every frame of the input signal, the complex values returned by the n filters describe n sinusoids. The filterbank view was the basis of the phase vocoder analysis/resynthesis technique, and inspired the constant-Q method proposed later. The filterbank view is therefore an abstraction, used in computing the STFT with a programming language on a digital computer.

The STFT output (in both views) is a series of spectra, one for each frame of the input signal. Each spectrum has a real and an imaginary part, which can easily be converted into magnitude and phase values. In the filterbank view, frequency plays a more important role than phase. The instantaneous frequency is therefore calculated by converting the phase value with the method of phase unwrapping3, to obtain a sinusoid with the

3Phase unwrapping ensures that all appropriate multiples of 2π have been included in Θ(ω)


frequency thus obtained and the magnitude given by the spectrum magnitude.

Some steps are necessary in the calculation of the STFT on digital computers. They

are presented in the following sections.

4.3.2 Windowing: Length and Shape of the Window Function

The meaning of this operation is that every input waveform must be time-limited (windowed) in order to calculate its digital FT. The windowing operation is so called because it consists in the multiplication of the input waveform by a window function, which allows us to extract values from a segment (frame) of the waveform, depending on the window function and the window length. Since multiplication in the time domain corresponds to convolution in the frequency domain, the product operation inside the Fourier transform gives a resulting spectrum that is the convolution of the spectrum of the waveform with the spectrum of the window.

The window is a mathematical function which has non-zero values over a limited time range. The simplest one is the rectangular window, which is 1 inside the window length and 0 outside.

The choice of the window length is very important because it determines the frequency resolution and the time resolution of the analysis. In the filterbank view of the STFT, the frequency resolution is the frequency band of each band-pass filter. It can be derived from the ratio fs/N, with N the number of samples in the window: for example, for fs = 44.1 kHz and 1024 samples per window, we obtain a frequency resolution of about 43 Hz. Thus, in this case the waveform is decomposed into sinusoids having frequencies that are integer multiples of 43 Hz (harmonics). The analysis of the frequencies in between the analysis bands is obviously just as important, but this is beyond the scope of this thesis. We recommend [3] for further reading.

The window shape is the other important criterion when choosing the appropriate window function. All the standard windows adopted for computing the STFT on waveforms are real and symmetric functions, and their spectra look like the sinc function sin(t)/t. Figure 4.5 presents the most commonly used window functions and their spectra. In the spectrum of a window we can see two features: the main lobe and the side lobes. The width of the main lobe, defined as the number of bins it extends across, determines the frequency resolution, i.e. a narrower lobe allows better resolution. The attenuation of the side lobes, the difference in dB between the height of the main lobe and the height of the adjacent side lobe, determines the level of cross-talk interference between adjacent analysis channels. Typically, reducing the side lobes results in an increase of the width of the main lobe, so a compromise must be chosen.

A generic rule could be the following: when the waveform is mainly composed of a distinct number of sine waves, a narrow main lobe is preferable; when the waveform is made of noise-like components, a wide main lobe is preferred.


Figure 4.5: Types of windows used in the STFT for audio analysis. No ideal window exists; the term "optimal window" is preferred. Several types of windows are used; for musical purposes the Kaiser window is usually preferred.

Another point in the choice of the window length is the choice between an odd and an even length. For phase detection a zero-phase window is better, because the windowing process will not modify the phase of the analyzed waveform. Therefore an odd window length is preferred, with the middle sample centered at the time origin of the analysis window.

There are many standard window functions used for STFT purposes:

• rectangular

• Hamming

• Hanning or Hann

• Gaussian

• Blackman

• Blackman-Harris

• Kaiser

Gaussian, Hamming and Kaiser are the most often used. The Kaiser window has characteristics which are well suited to the musical context.


4.3.3 Computation of the DFT (via FFT)

The discrete spectrum of each portion of the windowed waveform can now be calculated using the DFT. In practice, when possible, the FFT algorithm is used for this purpose. The implementation of the FFT algorithm will be discussed below.

FFT Size, Zero Padding

The problem with the FFT is that it requires the analyzed signal to be N samples long, with N a power of two, called the FFT size. Since the signal length is fixed by the window length, and this is chosen to obtain a desired frequency resolution (and can also be variable, i.e. expanded for high frequencies), the analysis window hardly ever fits the FFT size. In practice this problem can easily be overcome by appending zeros for the rest of the length required to match the FFT size; this operation is called zero-padding.

This method also has other benefits: since zero-padding in time means interpolating in the frequency domain, the spectrum obtained will be denser (more spectral lines, an oversampled spectrum). Note that zero-padding does not increase the frequency resolution of the spectrum, because the analysis window length remains the same, but it can make it easier to track spectral peaks which do not fall exactly on bin frequencies.

Usually the FFT size N is chosen as the first power of two at least twice the window length M; therefore N − M points will be set to 0. N/M is called the zero-padding factor. This factor should be large enough to enable estimation of the maximum of the main lobe, that is, of the spectral peak. Since the window length is not an exact number of periods for every frequency, the center frequencies of the spectral peaks will rarely match the frequency bins; hence an appropriately chosen zero-padding factor can alleviate this problem and the peak can be found.

As said before, choosing an odd window length helps with phase detection. An odd window length means that the windowed waveform can be centered at the time origin, thus half of the samples will be positioned before the time origin (negative-time values) and the other half after the analysis time origin (positive-time values). The FFT input buffer of this windowed waveform will contain the positive-time values at the beginning and the negative-time values at the end, while the rest of the length (in the middle) will be zero-padded.
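The buffer arrangement just described can be sketched as follows (my own illustration, assuming NumPy is available; the window length and FFT size are arbitrary example values).

```python
# A minimal sketch of the zero-phase buffer layout: positive-time samples at
# the start of the FFT buffer, negative-time samples at the end, zeros between.
import numpy as np

M = 1023                                     # odd window length
N = 2048                                     # FFT size (power of two >= M)
frame = np.hanning(M) * np.random.default_rng(2).standard_normal(M)

buffer = np.zeros(N)
half = M // 2                                # samples on each side of the centre
buffer[:half + 1] = frame[half:]             # centre + positive-time samples first
buffer[-half:] = frame[:half]                # negative-time samples at the end

spectrum = np.fft.rfft(buffer)               # phase now refers to the window centre
```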

Hop Size, Overlap Factor

The hop size determines the time advance of the analysis window, i.e. the time difference (in samples) between two adjacent analysis windows. The hop size is normally said to assume the unitary value when it is equal to the window length (in samples). The analysis windows can be overlapped in order to have more analysis points and therefore more time resolution over the input waveform.


The inverse of the hop size is called the overlap factor (if H > M the analysis windows do not overlap). For example, if H = M = 1024 and fs = 44100 Hz, the time resolution over the input waveform is 1024/44100 ≈ 23 ms; if the overlap factor is 8, the time resolution becomes about 2.9 ms.

A greater overlap factor will generally give better analysis results, but also a greater computational cost. Hence, the overlap factor has to be chosen according to the characteristics of the input waveform, i.e. fast-changing waveforms need more overlap. There are some general criteria for determining an efficient overlap factor; the most general is to choose the overlap in such a way that all the data are equally weighted, as in the case of the overlap-add synthesis presented later.

Another criterion is to choose the overlap factor according to the nature of the window function, that is, so that the overlapping windows add up perfectly to a constant value, e.g. 1. For a rectangular window this is easy to obtain: the hop size can simply be M/i, with i any positive integer. If consecutive analysis windows add up to a constant, no amplitude deviation is possible, hence successive windowing operations will not apply amplitude modulation to the input waveform.
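This "add to a constant" condition can be checked numerically, as in the following sketch (my own illustration, assuming NumPy is available), which overlaps copies of a Hann window at 50% overlap and inspects the summed envelope away from the edges.

```python
# A minimal sketch of the "add to a constant" check: Hann windows overlapped
# with hop size M/2 sum to an approximately constant envelope.
import numpy as np

M, hop = 1024, 512                            # window length and hop size (50% overlap)
w = np.hanning(M)                             # symmetric Hann; a periodic Hann sums exactly to 1

total = np.zeros(M * 8)
for start in range(0, len(total) - M + 1, hop):
    total[start:start + M] += w               # overlap and add the windows

middle = total[M:-M]                          # ignore the edges, where fewer windows overlap
print(middle.min(), middle.max())             # approximately constant (close to 1.0)
```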

To summarize, the STFT operation is applied to a stream of input samples and results in a series of frames that, one after another, produce a time-varying spectrum, giving the impression of a continuous spectrum.

The four parameters to choose in designing an efficient STFT can be summarized as follows:

• window shape

• window length

• hop size / overlap factor

• FFT size

4.3.4 The Inverse Short Time Fourier Transform & Overlap-Add Resynthesis

The ISTFT is the process by which the original waveform can be reconstructed starting from the frequency-domain analysis data produced by the STFT. This is the typical feature of a synthesis process; thus, what will be presented in this section is the use of the ISTFT in the particular synthesis method called overlap-add resynthesis.

But first, recall the expression of the STFT of the mth frame (with hop size R), which is the quantity the ISTFT inverts frame by frame:

X_m(ωk) = e^(−j ωk mR) · Σ_{n=−N/2}^{N/2−1} x(n + mR) · w(n) · e^(−j ωk n)


The overlap-add resynthesis method, due to Allen and Rabiner (1977), says that we can reconstruct each windowed segment of the original waveform starting from the spectral components, by using the ISTFT over each frame. It takes the magnitude and phase values of each spectrum to generate a time-domain waveform with the same envelope as the analysis window used to compute the STFT. Then each resynthesized time-domain segment is overlapped and added to reconstruct the original waveform.

In theory, the overlap-add process is an identity operation in the mathematical sense (i.e. the reconstructed signal equals the original) only if the overlapped and added windows sum to a constant. That would mean we could pass the signal countless times through the STFT and back with the ISTFT; however, even good implementations of the STFT lose a small amount of information, showing that this is not achievable in practice.

OA resynthesis is not the only STFT-based method for resynthesis of the original waveform; many others are possible. The weighted overlap-add method is similar to OA, the difference residing in the transformation applied to the window function before resynthesis. The analysis window and the synthesis window must maintain the identity property, which is achieved by respecting the relationship:

Σ_{n=−∞}^{+∞} w[m − nH] = c

where c is a constant.

The synthesis window is needed when, before resynthesis, a transformation is applied to the phase spectrum, which can create phase discontinuities at the frame boundaries.

Oscillator-bank resynthesis (also called sinusoidal additive resynthesis, SAR) is another method, in which the analysis data (magnitude and phase) are converted into synthesis data (amplitude and frequency) in order to drive one oscillator per component in each frame; the oscillator outputs are then summed to recreate the original signal. This method is freed from the add-to-constant rule of OA, because the converted spectrum is more robust against digital processing transformations possibly applied before synthesis.

The SAR method can be applied to the filterbank interpretation of the STFT, by matching each frequency bin to a sine wave and then summing all the sine waves for synthesis [54].

4.4 Constant-Q analysis

Constant-Q filterbank analysis is another technique descending from harmonic analysis, like the Fourier transform. It has been in use since the late 1970s and inspired other techniques such as the bounded-Q transform, the auditory transform and the wavelet transform. Q, the ratio between the center frequency and the bandwidth of a filter, and the filterbank, a connection of filters in parallel, were introduced in chapter 3. Now both of these concepts will be applied to the case of constant-Q analysis, as an alternative to strict Fourier


Figure 4.6: Spacing of the filters for the STFT (filterbank view) on the top and for a constant-Q filterbank on the bottom. The advantage of the constant-Q filterbank method is clear: it places the filters linearly against log(frequency), which is similar to the way the human ear resolves frequency.

analysis. Our aim in this thesis will be to demonstrate its worth on a specific task of sound analysis, onset detection, applied to the case of ·O M M· (next chapter), and on its perceptually grounded approach applied to the recognition of which sound has been played (final chapter).

The constant-Q transform has advantages over the Fourier transform which lie in musical aspects. The STFT computes frequency components (frequency bins) on a linear scale, that is, it expresses frequencies with a fixed resolution or bandwidth. This method has an inconvenience, because it frequently results in inadequate resolution for low musical frequencies and exaggerated resolution for high frequencies. The choice of the frequency resolution comes down to the choice of the appropriate window length, which best fits the resolution needed (i.e. the lowest frequency content it can resolve). Moreover, high frequency resolution means poor time resolution and vice versa; that is, there is always a tradeoff between time and frequency resolution in STFT-based analysis.

This problem will be discussed through an example: suppose the sampling frequency is fs = 44100 Hz and the window length is N = 1024 samples. The frequency


The frequency bins that can be analyzed will then be 512, equally spaced over the analysis bandwidth, i.e. from 0 up to the Nyquist frequency (about 22 kHz). Increasing the sample rate, e.g. to fs = 96 kHz, will not increase the frequency resolution of the analysis but will only widen the bandwidth up to 48 kHz. To increase the frequency resolution one must choose a larger window length. As a limit example, to obtain a frequency resolution of 1 Hz a window length of 44100 samples must be chosen, sacrificing the time resolution down to 1 s! Conversely, if the time resolution needed is 1 ms, i.e. to analyze 1000 events per second, the window length should be 44 samples, and the frequency resolution will then be about 1000 Hz!

Now let us see a practical example, which introduces the need for a different analysis tool. Suppose the task is to resolve (analyze separately) the fundamental frequencies of the notes of a piano, and suppose the two lowest notes are spaced, for example, 2.5 Hz apart. The analysis window must then be chosen N = 16384 samples long, so that for fs = 44100 Hz the frequency resolution is fs/N ≈ 2.5 Hz. Not only does this result in poor time resolution (about 400 ms); the real problem is that this fine frequency resolution is wasted on the higher notes, where the spacing between notes is much larger than 2.5 Hz. Moreover, what was said in the previous chapter about the perception of frequencies is completely neglected here. Nevertheless, since the STFT is computed via the FFT, the time required to produce the output is extremely low and real-time implementation is not a problem, even though much of the data is useless and must be discarded after analysis. The complexity reduction achieved by the FFT algorithm is, therefore, the principal reason behind the wide use of the STFT for sound analysis purposes.
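A minimal sketch (Python, illustrative only) of the tradeoff quoted above, computing time and frequency resolution for the window lengths mentioned in the text:

fs = 44100  # sampling rate in Hz

# Frequency resolution = fs / N, time resolution (window duration) = N / fs.
for N in (44, 1024, 16384, 44100):
    df = fs / N          # frequency resolution in Hz
    dt = N / fs * 1000   # window duration in ms
    print(f"N = {N:6d}  ->  df = {df:8.2f} Hz,  window = {dt:8.1f} ms")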

The constant-Q transform constitutes an alternative to the fixed-frequency representation of the Fourier transform: in a constant-Q transform the bandwidth of each frequency bin varies proportionally with its frequency. In the next section we will see a typical implementation of constant-Q analysis for musical purposes, applied to the case of a piano. But first, we should take a look at the waterfall spectrograms shown in figures 4.7 and 4.8, taken from Brown [10]. The two pictures clearly point out the advantage, from a musical point of view, of representing a musical signal with the constant-Q transform, especially when compared to the earlier image (4.4) showing an STFT waterfall spectrum.

4.4.1 Implementation of Constant-Q Analysis

One example of a constant-Q transform implementation is the 1/24th-octave bank of filters. The constant-Q filter bank and its similarity to the auditory system have been explored in [42] and [38]. Various schemes for implementing constant-Q spectral analysis outside a musical context have been published, for example that of Gambardella, who proposed an inverse function to bring the constant-Q representation back to the time domain. This is important when manipulation of the signal in the spectral domain followed by transformation back to the time domain is desired.


Figure 4.7: Waterfall spectrogram of a constant-Q transform of a violin glissando from 578 Hz to 880 Hz (D5 to A5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [A glissando is a glide from one pitch to another; the term is an Italianized form of the French glisser, to glide. It also refers to a pianist sliding a hand up the keyboard. From Wikipedia.]

For musical analysis, we would like frequency components corresponding to quarter-tone spacing of the equal tempered scale4. The frequency of the kth spectral componentis thus:

\[ f_k = \left(2^{1/24}\right)^{k} f_{\min} \]

where fk varies from fmin up to an upper frequency chosen below the Nyquist frequency. The minimum frequency fmin can be chosen to be the lowest frequency about which information is desired.

4 Equal temperament is a musical temperament, or a system of tuning, in which every pair of adjacent notes has an identical frequency ratio. In equal temperament tunings an interval, usually the octave, is divided into a series of equal steps (equal frequency ratios).


Figure 4.8: Waterfall spectrogram of a constant-Q transform of a flute playing a diatonic scale from 262 Hz to 523 Hz (C4 to C5). Taken from Judith Brown's Calculation of a constant Q spectral transform. [In music theory, a diatonic scale is a seven-note musical scale comprising five whole steps and two half steps, in which the half steps are maximally separated. From Wikipedia.]

For example, a frequency just below that of the G string can be chosen for calculations on sound produced by a violin. The resolution or bandwidth δf of the discrete Fourier transform is equal to the sampling rate divided by the window size (the number of samples analyzed in the time domain). For the ratio of frequency to bandwidth to be constant (constant Q), the window size must therefore vary inversely with frequency. More precisely, the quality factor required for quarter-tone resolution is:

\[ Q = \frac{f}{\delta f} = \frac{f}{0.029\, f} \approx 34 \]

where the quality factor Q is defined by δf = f/Q, i.e. the bandwidth is δf = f/Q. With a sampling rate S = 1/T, where T is the sample period, the length of the window in samples at frequency fk is

\[ N[k] = \frac{S}{\delta f_k} = \frac{S}{f_k}\, Q \]


Note also from this equation that the window contains Q complete cycles of each frequency fk, since the period in samples is S/fk. This has a physical meaning: in order to distinguish between fk+1 and fk when their ratio is, e.g., 2^{1/24} ≈ 34/33, we must look at at least 33 cycles. It is also interesting, for comparison, to consider the conventional discrete Fourier transform in terms of the quality factor Q = f/δf. We find that f/δf is equal to the coefficient number k, which is the number of periods of that frequency contained in the fixed window.
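A minimal sketch of the quantities defined above (Python with NumPy; the sampling rate S, the choice of fmin = 65.4 Hz and the number of bins are assumptions made only for illustration):

import numpy as np

S = 44100.0          # sampling rate (Brown's S), assumed
f_min = 65.4         # assumed lowest analysis frequency (roughly C2)
Q = 34               # quality factor for quarter-tone resolution

k = np.arange(0, 96)                        # four octaves of quarter tones
f_k = (2.0 ** (1.0 / 24.0)) ** k * f_min    # geometrically spaced bin centres
N_k = np.ceil(S / f_k * Q).astype(int)      # window length per bin, N[k] = (S/f_k) Q

for i in (0, 24, 48, 95):
    print(f"bin {i:2d}: f = {f_k[i]:8.1f} Hz, window = {N_k[i]:6d} samples")

The printout makes visible how the window length shrinks as the center frequency rises, which is the defining property of the constant-Q analysis.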

The constant-Q transform has demonstrated good results in sound analysis tasks; especially for the identification of musical notes, it proves to be a more appropriate spectral representation thanks to its geometrically spaced frequency channels. It should not, however, be considered a good starting point for musical synthesis, because of its problematic inverse: inverse functions have been proposed but not successfully implemented for musical purposes.

This method will be proposed, in the final chapter, to achieve the goal of onset detection.


Chapter 5

Real-Time Audio Applications

To generate1 and process acoustical signals is to compose music, more directly than inscribing ink on paper. Curtis Roads [44].

For our purpose, that is, to analyze the musical flow of the robotic orchestra, we need a flexible and extensible platform. For this reason we discarded a priori the use of numerical analysis software like Matlab/Octave (we used them, but for other purposes). Although they offer plenty of packages (free in the case of GNU Octave) devoted to sound analysis and processing ([63], [47] and [32]), our need is to integrate the analysis into the Show Control System2 of the Orchestra; for this musical context other software is therefore recommended. One of the pillars in this category is Max/MSP, the software already used for the development of the ·O M M· SCS.

In this chapter the main software environments for real-time audio applications are treated, from basic features to advanced ones (in the case of Max/MSP, the software we chose for our purpose). At the end of the chapter an overview is given of the typical applications these tools are designed for, together with a comparison between text-based (unix style, terminal-like window) and graphical environments.

1 Musical sound synthesis.
2 Show control system (entertainment) is a generic term for a system (which can be very complex) whose main feature is to coordinate all the different subsystems (audio, video, MIDI, OSC, ...) controlling the hardware of which a show is made. In our case, the SCS coordinates the robot musicians, the gestural controller and possibly other (still to be experimented) features.


5.1 Max/MSP

Max/MSP owes the first part of its name to Max Mathews3, who in 1957 wrote the first ever computer program specifically for sound generation4. Max was also the original name of the software, developed by Miller S. Puckette at IRCAM in the mid 80s and first commercially distributed in the early 90s. MSP is a package for real-time DSP (standing for Max Signal Processing, or the initials of Miller S. Puckette), added to the software in 1997. Due to its graphical (but minimal) nature, Max/MSP differs from most MUSIC-N languages: Max can be considered a visual programming language. Visual programming lets you graphically connect objects together with patch cords to design interactive software. This is the usual way of designing programs, think of flowcharts or more modern techniques such as UML; but the difference is that with flowcharts the blocks represent code that will be written, while in Max the code is already written.

Since Max uses icons to represent objects written in a high-level language, Max is a meta-language, following the paradigm "programs can write programs". Max/MSP distinguishes between two levels of timing: that of an "event" scheduler and that of the DSP (similar to the distinction between control-rate and audio-rate processes in Csound, a direct descendant of MUSIC).

With Max you can also control external hardware, read data from sensors and exchange audio and data with other software, besides generating and analyzing sounds and creating musical instruments, video and animation. All these features make Max a popular choice for composing interactive media works, above all for its approachable graphical interface, extensive bindings to media processes and protocols, and open-ended philosophy.

A short description of the principal Max features follows:

The patcher windows

Max is designed to look familiar to composers who have worked with patchable synthesizers. What you see on the screen is a set of boxes with lines connecting them: the boxes are called objects and the lines are patch cords. Each object represents a process, and the results of the process are passed along the patch cords to the next object. Ultimately, there is an object that sends MIDI data, audio or video out.

Each window full of objects is called a patcher. Several patchers may be open at once and they can all be active, even if their window is hidden. Patchers can be saved and then entered as an object in another patcher. There is also a patcher object that can be opened up and filled with objects which will continue to work after the patcher object is folded up again.

3 Max Vernon Mathews is unequivocally considered one of the pioneers of computer music. He received his Sc.D. in 1954 at the Massachusetts Institute of Technology (MIT); while working at Bell Labs he created MUSIC, the first computer-based programming language for synthesis (1957).

4 MUSIC 1


Figure 5.1: Max 5 patcher window

The action flows from the top down. When an object is tweaked by the user or MIDI comes in, a message is sent to any connected objects, which react with messages of their own. Only one thing happens at a time, but it is all so fast that it seems instantaneous. When a pathway branches, messages are sent to right-hand destinations before left-hand ones.

Objects

The name of the object represents what it does. There are a few hundred objects included with Max, ranging in complexity from simple math to full featured sequencers. Arguments, if present, specify initial values for the object to work with. Data comes into the object via the inlets, and results are put out the outlets. Each inlet or outlet on an object has a specific meaning. This will be displayed in a flag as the mouse passes by (further details are in the manual). Usually, input to the left inlet triggers the operation of the object. For instance, the delay object (as shown) will send a bang message out the outlet 500 milliseconds after a bang is received in the left inlet. Data applied to the right inlet will change the delay time.

Messages

Data bytes sent down the patch cords are called messages, which fall into one of the following types:

• int A number without a decimal point.

61

Page 71: A Perceptually Grounded Approach to Sound Analysis

5 – Real-Time Audio Applications

• float A number with a decimal point.

• symbol A character string1 such as “stop” that may be understood by certain objects. A symbol may be followed by further information in the message.

• list Several of the above, separated by spaces. The first element of a list must be a number.

• bang A message that triggers the action of an object.

Audio signals are sent in yellow patch cords. These are little packets of data, but sent so fast as to be effectively continuous. Jitter signals are sent via green patch cords. Jitter messages are names of matrices that hold data for jit.objects to process. Every object responds to a variety of messages. If a message won’t work, a warning will appear in the Max window.

Max windows

The Max window contains information sent from Max (like error messages) or things you might like to print. It’s sort of a terminal window.

Max runtime

Max is not required to run a finished patch. Anyone can download Max/MSP Runtime for free, which will run patches but not edit them. There is also a process for converting patches into stand-alone applications.

Pure Data

PD is the open source twin of Max, released under a BSD license and developed by the same author, Miller Puckette, since 1996. It offers essentially the same capabilities as Max, with small differences explained by the author in [42] and [38].

5.2 CSound

CSound is one of the better-known textual interfaces for computer music composition. CSound was originally written by Barry Vercoe at MIT in 1985, based on languages of the Music-N family, and continues to be developed today. At its core, CSound is “designed around the notion that the composer creates a synthesis orchestra and a score that references the orchestra.”


Figure 5.2: Max 5 window

Csound files were originally processed in non real-time to render sonic output, in a “process referred to as ‘sound rendering’ as analogous to the process of ‘image rendering’ in the world of computer graphics.” [55]. Csound instruments are defined in the orchestra file as directed graphs of unit generator types (called ‘opcodes’). Flexible sound routing can be achieved using control and audio busses via the Zak objects. Control rate is evident in CSound through the a-rate and k-rate notations.

The strong separation of synthesis and temporal event definition imposes a strict limitation on the scope for algorithmic composition: new synthesis processes cannot be defined in response to temporal events, and new temporal events cannot occur in response to the synthesis output. “Csound is very powerful for certain tasks (sound synthesis) while not particularly suited to others (data management and manipulation, etc.).” [55]


5.3 SuperCollider

SuperCollider is a high-level music programming language, designed specifically for dynamic and generative structures and for computer music synthesis. It can be applied to many different approaches to composition and improvisation rather than to any particular preconceived model. It features an application-specific high-level programming language, SCLang (inspired by C++), with extensive data-description and functional programming capabilities and support functions for common musical needs.

SuperCollider also provides several libraries of unit generators for signal processing. Sample-rate and control-rate distinctions are made explicit via the .ar and .kr notation. A key distinction from CSound is that code can be evaluated in real time as the program runs.

SuperCollider is ideal for algorithmic composition. Since version 3.0 (the currently available version), graphs of unit generators are defined textually and compiled at run-time into dynamic libraries (‘SynthDefs’) to be loaded as instruments (‘synths’) by the synthesis engine (‘SCServer’), all under control of the language.

The separation of language and synthesis into distinct processes in version 3.0 introduces compilation and performance optimizations, but also implies limitations in the degree of temporal control: “Because instruments are compiled into code, it is not possible to generate patches programmatically at the time of the event as one could in SC2. In SC2, an audio trigger could suspend the signal processing, run some composition code, and then resume signal processing. In SC Server, messaging between the engines causes a certain amount of latency.” [34]

SuperCollider 3.0 therefore represents a return to the CSound model of orchestra and score, in which however the score is procedural rather than declarative.

5.4 ChucK

ChucK represents one of the only contemporary options that avoids latency in the procedural control of synthesis. It also provides a library of unit generators to be freely instantiated and connected into graphs within ChucK scripts. The authors refer to ChucK as ‘strongly timed’, which can be defined as follows:

• supports sample accurate events

• has no fixed control rate (or supports dynamically arbitrary control rates)

• supports concurrent functional logic

• control logic can be placed at any granularity relative to synthesis

• supports run-time interaction and script execution


Like SuperCollider’s SCLang, the ChucK language was written especially for the ChucK software. It is a high-level interpreted programming language.


Comparison between Graphical User Interface and Textual Language Interface Musical Software

GUI, advantages: easier to view and input quantitatively rich data such as control envelopes; common tasks can be immediately and intuitively represented; interaction can be more rapid.

GUI, disadvantages: interfaces tend to be more specific; complex data structures, if made visible, can be visually overwhelming; precise qualitative specification can be difficult at fine granularity.

TLI, advantages: compact description of complex data structures; high degree of precision and control; textual elements may more easily refer to or embed each other.

TLI, disadvantages: tiresome to specify by data entry when precision is not required; simple tasks may require detailed code; interaction can be time-consuming, particularly if text must be compiled.


Table 5.1: Musical software for real-time synthesis and control

Max/MSP — creator: Miller Puckette; typical purposes: real-time synthesis, hardware control; first release: mid-1980s; recent release (2009): Max 5.0.7; license: commercial software (Cycling ’74); development status: mature.

Pure Data — creator: Miller Puckette; typical purposes: real-time synthesis, hardware control; first release: 1990s; recent releases (2009): pd-extended (0.41.4), pd-vanilla (0.42.5); license: BSD-like; development status: stable.

Csound — creator: Barry Vercoe; typical purposes: real-time synthesis, algorithmic composition (a), audio rendering; first release: 1986; recent release (2009): Csound 5.10; license: LGPL; development status: mature.

SuperCollider — creator: James McCartney; typical purposes: real-time synthesis, live coding (b), algorithmic composition; first release: 1996; recent release (2009): SC 3.3.1; license: GPL; development status: stable.

ChucK — creators: Ge Wang and Perry Cook; typical purposes: real-time synthesis, live coding, algorithmic composition; first release: 2004; recent release (2009): ChucK 1.2.1.2 (dracula); license: GPL; development status: immature.

(a) Algorithmic composition is the technique of using algorithms to create music.
(b) Live coding is the name given to the process of writing code to modify software in real time as part of a performance. Most generally, writing (parts of) programs while they run. It is sometimes known as "interactive programming" or "on-the-fly programming".


Chapter 6

Perceptual Onset Detection

In the first part of chapter 2 we presented the attributes of sound as perceived through the human auditory system; now we are going to focus the discussion on a particular aspect of that mechanism, the perception of time events, especially the ones related to the initial portion of sounds. Using the computer music tools discussed in chapters 3 and 4, in particular digital filters and constant-Q analysis, we will try to simulate the functionality of the auditory system to achieve the goal of event detection. The topic of our research is widely treated in the literature and must be preceded by some fundamental definitions, first of all the attack and onset of a sound. Then some of the most advanced techniques for onset and attack detection will be presented and, finally, our method based on bonk∼ for Max/MSP will complete this chapter.

First of all, however, the project we are working on will be presented at a glance.

6.1 The Curious Case of ·O M M·

Two robots, loud sounds, and one computer: the reason I decided to write this thesis. The project ·O M M· consists of a show in which a performer conducts two robot drummers. Many people worked for more than one year, developing the electromechanical parts of the robots at the LIM laboratories and developing the software, in Max/MSP, to control the robots. The two robotic percussionists (ten in the original concept) are directed by the performer through a gestural controller, the GipsyMIDI1 exoskeleton. The ·O M M· orchestra, named after the Italian futurist Filippo Tommaso Marinetti2 in the centenary of Futurism,

1 http://www.sonalog.com/framesets/gypsymidi_frame.htm
2 Filippo Tommaso Marinetti (1876 – 1944) was an Italian poet and musician, founder of the Futurist movement. He is responsible for the Futurist Manifesto, published in the French journal Le Figaro in 1909. He also introduced the presence "on stage" of humanoid forms of life, mechanical bodies considered a primordial example of the "robot" (ten years before Karel Capek introduced the concept in Rossum’s Universal Robots).


Figure 6.1: The two robots on the sides, SCS + the performer in the middle.

is ready for its launch in November, after having been presented in October 2008.

In the organization of this thesis, this chapter is the right place to introduce the project we are working on, before going into the discussion of the recognition of particular sound onsets. Figure 6.1, showing the whole show control system of ·O M M· , is proposed to help the visual imagination. The two robotic drummers, each having two arms, can play at a maximum of 120 bpm per arm (i.e. up to four simultaneous3 hits) a score sent to them via MIDI4 from a computer. On the same computer a complex Max/MSP patch elaborates the input data received from the exoskeleton, calibrates it, and lets the output modify the rhythmic pattern in real time5. Since the robots play big and cumbersome drums, two oil bins of regular dimensions, the music produced consists of loud percussive sounds, in the style of the French band Les tambours du Bronx.

3 Perceptually simultaneous. A minimum delay (less than 15 ms) between the mechanical actions of the two arms of a robot must be ensured between two electrical transmissions.
4 Musical Instrument Digital Interface.
5 Quasi-realtime, to be precise.


6.2 From Transient to Attack and Onset Definitions

Every musical signal can be subdivided6 into smaller units of sound. The two subdivisions considered here are the transient and the steady-state portions of a sound. The transient portion is located at the beginning of a sound and is originated by a stimulus (e.g. a chord played on a guitar, a stick striking a drum) causing a sudden change in the perceptual attributes. The duration of the transient is assumed to be very short, essentially related to the duration of the stimulus. The steady-state portion, coming after the transient, is a kind of support for the sound; it can be considered the natural evolution of the transient according to the instrument design and the environment. Figure 6.2 shows this separation for the case of a drum hit.

To give more precise definitions regarding the terms used in figure 6.2, the words of Bello [4] are proposed:

• transient: period during which the excitation is applied and then damped

• attack: time-lag during which the amplitude of a sound increases

• onset: single instant chosen to mark the start of the transient

Another interesting (and more intuitive) definition of onset is that given by Bilmes in [8]: the onset is the point when a musical event becomes audible. He also referred to the onset as the attack time. Indeed, attack and onset are sometimes used as synonyms for the same information. To circumvent this ambiguity, we propose what follows, again from Bilmes: “each musical event in percussive music is called a drum stroke or just a stroke, it corresponds to a hand or stick hitting a drum, two sticks hitting together, two hands hitting together, a stick hitting a bell, etc. The attack time of a drum stroke is its onset time, therefore the terms attack and onset have synonymous meanings, and are used interchangeably.” This is not a gamble, because it has been demonstrated [26] that the time between zero and maximum energy in a percussive musical event is almost instantaneous.

From the onset recognition point of view, the simplest case is therefore represented by the drum, where the sounds generated are percussive events, well characterized by the noticeable change of the sound parameters associated with the strokes.

6 This subdivision is normally indicated with the name "segmentation". Segmentation (also present in many applications other than sound) is an operation which preserves the temporal structure of the signal and can be used to identify, separate and organize the smallest rhythmic events found in the audio signal. It is a complex operation and can be done in several ways (choice of the smallest unit) and in several domains (time, frequency, time-frequency, complex), according to the target to achieve. The main applications are transcription, onset detection and BPM/tempo calculation. Transcription allows the musical flow to be represented by the notes that generate it; for this reason it is one of the most explored and advanced areas of study.


Figure 6.2: On the top, the waveform corresponding to a hit of a robot percussionist of ·O M M· . On the bottom, the intensity profile of the hit (using Praat), where onset, attack and the transient/steady state separation are highlighted.

Hence, transients are normally considered the principal part of a percussive sound; once they are characterized, the steady-state portion can be derived, although approximately, by applying an appropriate synthesis stage which recreates the slow decay at the estimated resonance of the drum.

The actual meaning of onset, coming from psychoacoustic knowledge, allows us to define the transient according to the behavior of the three perceptual attributes presented in chapter 2. In correspondence with a transient, we can note the following behavior:

• abrupt change in perceived loudness

• abrupt change in perceived pitch

• abrupt variation of the perceived timbre

For the purpose of this thesis, the transient/steady-state separation is left out of the following discussion (see [56] for more depth) and attention will be focused on the transient region, in particular on onset and attack recognition. But first of all, let us spend


Figure 6.3: From top to bottom: waveform, static spectrum (FFT) and time-varying spectrum (STFT). From right to left: one hit of an ·O M M· robot, one hit of a snare drum.

a few words on the importance of onset detection, often not made explicit, in most musical software applications.

Importance of onset detection in musical applications

Several commercial software packages for musical applications exploit onset detection for many special purposes, very often not explicitly. With a quick survey, we found that the use of onset detection techniques is particularly important in all of the following applications:


cut’n’paste (audio editing), audio/video synchronization, estimation of rhythm and temporal features (audio analysis), compression, content delivery7, indexing8, music information retrieval9, time-stretching10 and pitch-shifting11 (audio processing). Moreover, in other musical applications, detected and modeled onsets can be used for further musical processing or even synthesis [50], [3]. Detected onsets can, for instance, trigger a synthesis algorithm in order to create a new musical piece from the percussive rhythm found in a recording.

6.3 General Scheme for Onset Detection

Several studies and different approaches have been proposed for the solution of the onset detection task. Here, in order, are the three general steps which can be found in almost all the techniques [4]:

1. preprocessing: before analysis, transformations are sometimes applied to the sound (e.g. compression and limiting) in order to emphasize or de-emphasize some of the characteristics of the signal. Preprocessing is usually considered optional, but it can be extremely helpful, for example to accentuate sudden changes in the waveform. Normally, when high-speed performance is required by the system, i.e. for real-time applications, preprocessing is avoided.

2. reduction: reduction is a process through which the complex signal, here considered as a sum of sinusoids or oscillators, is simplified for analysis (e.g. subsampling). The simplified signal (which must reflect the local structure of the original) should enhance the transient characteristics while de-emphasizing the steady state. This operation is critical and has been proposed in many ways, which can be summarized in two categories: methods that make explicit use of signal features (i.e. energy, frequency, phase) and methods based on probabilistic models, which approximate the signal's behavior. The function obtained after reduction of the original signal can be called the detection function [4] or observation function [50].

3. peak-picking: the final stage of onset detection is historically entrusted to a peak-picking algorithm, which localizes onsets as local maxima (i.e. peaks) of the detection function. This stage is also critical, because it depends on the "goodness" of the detection function and must be "robust" against possible misinterpretations.

7 Content delivery describes the delivery of audio "content" through a delivery medium such as the Internet; onset detection may improve, for example, the organization of sound into UDP packets.
8 Indexing is a database feature. While creating a database of sounds, onset detection may be important to determine similar characteristics to be used to subdivide the database into categories.
9 MIR, see http://en.wikipedia.org/wiki/Music_information_retrieval
10 Time stretching is a way to change the speed or duration of an audio signal without affecting its pitch.
11 Pitch shifting is a way to change the pitch of a signal without changing its length.


It is not difficult to imagine that overlapping sounds, noise, musical effects (e.g. vibrato and tremolo12) or modulations are just some examples of the difficulties that a peak-picking stage can encounter. That is why the final decision of the peak-picking algorithm may, in certain cases, be preceded by post-processing and thresholding stages.

In the next sections we will adopt the term detection function, described in point 2; every detection scheme has its own detection function. Our method based on bonk∼ will follow, after a discussion of the modern techniques, widely treated in the literature and well summarized in [4], [50] and [31].

Choice of the Appropriate Detection Function

A brief guideline is proposed here, before entering the discussion of our method for recognizing onsets in ·O M M· . What follows can be useful in determining the appropriate methods for approaching the onset detection task; the choice strongly depends on the sound being analyzed.

Good general practice usually requires a balance of complexity between preprocessing, construction of the detection function, and peak-picking [4]. In this summary, the methods proposed (in order of complexity) are followed by the corresponding reference where implementation details can be found.

• If the signal is strongly percussive (e.g. drums), time-domain methods are also adequate (i.e. methods based on thresholding the amplitude).

• Spectral methods, such as those based on phase distributions and spectral difference, perform relatively well on strongly pitched transients. [4]

• The complex-domain spectral difference seems to be a good choice in general, at the cost of a slight increase in computational complexity. [21][56]

• When precise time localization is required, wavelet methods can be useful, possibly in combination with another method. [21][22]

• If a high computational load is acceptable, and a suitable training set is available, then statistical methods give the best overall results and are less dependent on a particular choice of parameters. [4] for an introduction and [24][2] for more detail.

12 Vibrato and tremolo are two important musical effects. Vibrato is produced, in singing and musical instruments, by a regular pulsating change of pitch, and is used to add expression and vocal-like qualities to instrumental music. Tremolo usually refers to periodic variations in the amplitude of a musical note (or of singing). The depth and speed of vibrato/tremolo determine the amount and speed of the pitch/amplitude changes. With the singing voice it is difficult to achieve separate variations in pitch and amplitude, so they usually occur at the same time; that is why the two terms are sometimes confused. In digital signal processing, vibrato and tremolo are easier to achieve separately.


6.3.1 Energy Based Approach

Log Energy Approach

Since the energy of a sound is the most prominent quantity that varies in a musical flow during transients, the energy-based approach can be considered the easiest way to achieve the goal of event detection. Usually, the introduction of a new generic note (e.g. that of a piano) leads to an increase in the energy of the signal, and in the specific case of strong percussive note attacks (e.g. the case of ·O M M· ) this increase in energy is very sharp. For this reason, energy methods prove to be useful and efficient for a lot of onset detection applications, in particular for detecting percussive transients. The local energy of a frame x(n) of the signal is defined as:

\[ E[n] = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} \big(x[n+m]\big)^2\, h(m) \]

where h(m) is a window of length N, centered at m = 0. Taking the first difference13 of E[n] produces a detection function in which peaks can be localized in time by a peak-picking algorithm to find onset locations. An improvement to this equation, which follows from psychoacoustic knowledge, is to consider that loudness is perceived logarithmically. Hence, computing the first difference of log E[n] roughly simulates the ear's perception of loudness. These are usually considered the simplest approaches to note onset detection.
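A minimal sketch of the log-energy detection function described above (Python with NumPy; frame length, hop size, the Hann weighting and the rectification of the difference are assumptions made for illustration, not part of any specific published implementation):

import numpy as np

def log_energy_detection(x, frame=1024, hop=512, eps=1e-10):
    """First difference of the log of the local energy over overlapping
    Hann-weighted frames, a simple energy-based detection function."""
    w = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    E = np.empty(n_frames)
    for i in range(n_frames):
        seg = x[i * hop:i * hop + frame]
        E[i] = np.sum((seg ** 2) * w) / frame
    d = np.diff(np.log(E + eps))       # first difference of log E[n]
    return np.maximum(d, 0.0)          # keep only energy increases

# Example: a synthetic burst (onset) at 0.5 s inside a quiet noise floor
fs = 44100
x = 0.01 * np.random.randn(fs)
x[fs // 2:fs // 2 + 2000] += 0.5 * np.random.randn(2000)
det = log_energy_detection(x)
print("frame with largest increase:", int(np.argmax(det)))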

Spectral Difference

This idea can be extended to obtain a more appropriate detection function by considering frames of the STFT. We recall that a generic STFT frame of a waveform is given by:

\[ X[n,k] = \sum_{m=-\infty}^{\infty} x[m]\, h[n-m]\, e^{-j\omega_k m} \]

where k = 0, 1, ..., N − 1 is the frequency bin index and h the finite-length sliding window14. As in the previous method, we now take the first difference between the magnitudes of consecutive STFT frames, that is:

\[ \delta X = \sum_{k=1}^{N} \Big( |X[n,k]| - |X[n-1,k]| \Big) \]

13 The first difference is the difference between two consecutive samples; in this case each sample describes the energy content of the sampled waveform.

14 See chapter 4 for details on the STFT.


this measure, known as the spectral difference, can be used to build an efficient onset detection function. Energy-based algorithms are fast and easy to implement, but their effectiveness decreases for nonpercussive sounds or when the transient energy comes from overlapping (and more complex, e.g. strongly pitched) sounds.
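A minimal sketch of the spectral difference (Python with NumPy; the half-overlapping Hann frames and the rectification, which keeps only magnitude increases, are assumptions commonly made but not mandated by the formula above):

import numpy as np

def spectral_difference(x, frame=1024, hop=512):
    """Sum over bins of the increase in STFT magnitude between consecutive
    frames (rectified spectral difference)."""
    w = np.hanning(frame)
    frames = [x[i:i + frame] * w for i in range(0, len(x) - frame, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))        # |X[n, k]|
    diff = np.diff(mags, axis=0)                      # |X[n,k]| - |X[n-1,k]|
    return np.sum(np.maximum(diff, 0.0), axis=1)      # rectified sum over k

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
x[fs // 2:] += 0.8 * np.sin(2 * np.pi * 880 * t[fs // 2:])   # new component at 0.5 s
print("peak of the detection function at frame", int(np.argmax(spectral_difference(x))))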

High Frequency Content

This technique is very interesting and was successfully applied by Masri in [33]. The considerations on which it is grounded were proposed by Rodet and Jaillet: energy increases linked to transients tend to appear as broadband events, and since the energy of the signal is usually concentrated at low frequencies, changes due to transients are more noticeable at high frequencies. To emphasize this, the spectrum obtained by the STFT can be weighted preferentially toward high frequencies before summing, to obtain a weighted energy measure. The following formula was proposed:

\[ E[n] = \frac{1}{N} \sum_{k=-N/2}^{N/2-1} W_k\, |X[n,k]|^2 \]

where Wk is the (frequency dependent) weighting function. Masri proposed Wk = |k|, called the high frequency content (HFC), a linear weighting function by which each frequency bin gives a contribution proportional to its frequency.

Energy increments related to transient components are more noticeable at higher frequencies (although the total energy is usually concentrated at lower frequencies), therefore HFC should be considered one of the pillars of energy-based onset detection. However, the HFC method does not take into account temporal features of the waveform, which can be equally important. That is why alternative methods (e.g. considering phase spectrum information) should also be considered.
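A minimal sketch of the HFC measure with Masri's weighting Wk = |k| (Python with NumPy; frame/hop values and the test signal are illustrative assumptions):

import numpy as np

def hfc(x, frame=1024, hop=512):
    """High frequency content: |k|-weighted sum of the squared STFT
    magnitudes of each frame."""
    w = np.hanning(frame)
    frames = [x[i:i + frame] * w for i in range(0, len(x) - frame, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # |X[n, k]|^2
    k = np.arange(mags.shape[1])                      # bin index used as weight
    return (mags * k).sum(axis=1) / frame

fs = 44100
x = 0.05 * np.random.randn(fs)                        # quiet noise floor
x[fs // 4:fs // 4 + 512] += np.random.randn(512)      # broadband burst at 0.25 s
print("HFC peaks at frame", int(np.argmax(hfc(x))))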

6.3.2 Phase Based Approach

The use of phase spectra for the onset detection task is a relatively recent introduction [20]. Let us see how it works.

The starting points are the definition of unwrapped phase15 and its application to the case of a stationary sinusoid (i.e. one extracted from the steady-state portion of the signal).

15 The unwrapped phase is the phase not restricted to the limited interval between 0 and 2π. It is calculated from the instantaneous frequency in the following manner:
\[ \omega(t) = \varphi'(t) = \frac{d}{dt}\varphi(t) \quad \text{(instantaneous angular frequency)} \]
\[ f(t) = \frac{1}{2\pi}\,\varphi'(t) \quad \text{(instantaneous frequency in Hz)} \]
\[ \varphi(t) = 2\pi \int_0^t f(\tau)\, d\tau + \varphi(0) \quad \text{(the } 2\pi\text{-unwrapped phase)} \]


For a steady-state sinusoid, the phase in a single frame of the STFT, together with the phase in the previous frame, is used to calculate a value for the instantaneous frequency. An estimate of the instantaneous frequency of the STFT frame within this window is:

\[ f_k(n) = \left( \frac{\varphi_k(n) - \varphi_k(n-1)}{2\pi h} \right) f_s \]

where h is the hop size between windows and fs the sampling rate.

For a stationary sinusoid, the instantaneous frequencies are expected to be approximately constant over adjacent windows. This is equivalent to saying that the phase increment between adjacent windows remains approximately constant, which is expressed as follows:

ϕk(n)− ϕk(n− 1) ' ϕk(n− 1)− ϕk(n− 2)

Equivalently, the phase deviation can be defined as the second difference of the phase, which is:

\[ \Delta\varphi_k(n) = \varphi_k(n) - 2\varphi_k(n-1) + \varphi_k(n-2) \simeq 0 \]

During a transient region the instantaneous frequency is usually not well defined, and hence the phase deviation will tend to a large value. This is illustrated in figure 6.4. Bello proposes a method that analyzes the instantaneous distribution (in the sense of a probability distribution or histogram) of phase deviations across the frequency domain.

During the steady-state part of a sound, deviations tend to zero, so the distribution is strongly peaked around this value. During attack transients, values increase, widening and flattening the distribution. However, this method, although showing some improvement for complex signals, is susceptible to phase distortion and to noise introduced by the phases of components with no significant energy. Finally, why not combine the phase and energy approaches? Again Bello gives the answer, proposing this solution to the detection task in [5].
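A minimal sketch of a phase-deviation detection function (Python with NumPy; averaging the absolute second difference across all bins, rather than building Bello's full histogram, is a simplifying assumption):

import numpy as np

def phase_deviation(x, frame=1024, hop=512):
    """Mean absolute second difference of the STFT phase, unwrapped over time
    in each bin; it stays near zero on steady tones and grows during transients."""
    w = np.hanning(frame)
    frames = [x[i:i + frame] * w for i in range(0, len(x) - frame, hop)]
    phases = np.unwrap(np.angle(np.fft.rfft(frames, axis=1)), axis=0)
    second_diff = phases[2:] - 2 * phases[1:-1] + phases[:-2]   # second difference
    return np.mean(np.abs(second_diff), axis=1)

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 330 * t)
x[int(0.6 * fs):] *= 0.0                                        # abrupt change at 0.6 s
x[int(0.6 * fs):] += np.sin(2 * np.pi * 495 * t[int(0.6 * fs):])
print("largest phase deviation at frame", int(np.argmax(phase_deviation(x))))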

Phase, energy, and combined phase/energy approaches are not the only methods applied to the onset detection task. Several other methods have been proposed, in particular stochastic and statistical methods ([2] for example and [4] for a comparison), very different from the ones above. A Deterministic Plus Stochastic model (such as that described by Serra in [54]) for specific onset detection has recently been presented by Gifford and Brown in [24].

Let us now introduce the perceptually based approach we used to address the task of onset detection.


Figure 6.4: Unwrapped phase deviation between two adjacent analysis frames. ∆ϕn,k is the unwrapped phase deviation. In the simpler case of a steady-state sinusoid, the phase deviation is approximately 0 and constant across all analysis frames, while during a transient the phase deviation should be extremely large and easy to detect.

6.4 Introduction to the Perceptual Based Approach to Onset Detection

Perceptually based approaches have demonstrated their strength in the detection of both pitched and non-pitched sounds16, as summarized in [13]. Since the perceptual attributes of sound are subject to judgment that varies from person to person, perceptual onset detection should always be considered a good method when it manages to approximate the onset detection mechanism of the human ear. The different underlying approach may justify a possible divergence of results compared with other methods; usually the aim here is not to find the most efficient algorithm that recognizes every onset, but only the perceptually meaningful onsets.

16 Pitched and non-pitched sound is a common way to describe sounds with strongly harmonically related component frequencies (pitched sounds, e.g. piano, violin...) and sounds with sparse, unrelated harmonics (non-pitched sounds, e.g. drums...). At a perceptual level, the difference is that when a pitched sound occurs, a pitch is clearly associated with the fundamental frequency of the sound. Non-pitched sounds trigger different mechanisms in the human auditory system; sometimes a pitch is still associated, but it will not actually correspond to the fundamental frequency.


At the human auditory level, mechanisms which encode both time and frequency effects determine the subjective perception of sound onsets. The principal limits are imposed by time resolution and frequency masking effects (both explained in chapter 2). Moreover, overlapping pitched and non-pitched sounds (even percussive pitched sounds) can obscure the perception of pitch and also delay or hide one or more adjacent onsets. Let us see an example before introducing the perceptual method applied to ·O M M· , based on bonk∼ for Max/MSP.

Band-Wise Processing

Scheirer in [51] was the first to clearly demonstrate that an onset detection algorithm should follow the human auditory system by treating frequency bands separately and then combining the results at the end. An earlier system described by Bilmes was similar, but it only used a high-frequency and a low-frequency band, which he himself judged not very effective [8].

Scheirer in [51] described a psychoacoustic demonstration of beat perception which shows that certain kinds of signal simplification can be performed without affecting the perceived rhythmic content of a musical signal. When the signal is divided into at least four frequency bands and the corresponding bands of a noise signal are controlled by the amplitude envelopes of the musical signal, the noise signal will have a rhythmic percept showing significant similarities to the original signal. On the other hand, this does not hold if only one band is used, in which case the original signal is no longer recognizable from its simplified form (detection function).

The method proposed by Klapuri [30] is the most significant example of the successful application of psychoacoustic knowledge to the onset detection task. It utilizes the band-wise processing principle as introduced by Bilmes and Scheirer. The procedure is the following:

1. the overall loudness of the signal is normalized to a 70 dB level (pre-processing) using the equal loudness contour model17,

2. a filterbank divides the signal into 21 non-overlapping bands and, in each band, the onset components are detected and their time and intensity are determined,

3. in the final phase, the onset components are combined to yield onsets.

This method uses psychoacoustic models in onset component detection, in the determination of its time and intensity, and in combining the results. The design of the filterbank is the core of this system: Klapuri proposed a filterbank which approximates the critical band distribution and covers the frequencies from 44 Hz to 18 kHz. This is obtained with 21 filters, where the lowest three are one-octave band-pass filters and the remaining eighteen are third-octave band-pass filters18.

17 See chapter 2 for more on equal loudness contours.


All subsequent calculations can be done one band at a time. This reduces the memory requirements of the algorithm in the case of long input signals, assuming that parallel processing is not desired.

The output of each filter is full-wave rectified and then decimated by a factor of 180 to ease the following computations. Amplitude envelopes are calculated by convolving the band-limited signals with a 100 ms half-Hanning (raised cosine) window. This window performs much the same energy integration as the human auditory system, preserving sudden changes but masking rapid modulation [30].
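A minimal sketch of one band of such a band-wise front end (Python with NumPy/SciPy; the Butterworth band-pass, the plain decimation without anti-aliasing, the window alignment and the band edges are simplifying assumptions, not Klapuri's exact design):

import numpy as np
from scipy.signal import butter, sosfilt

def band_envelope(x, fs, lo, hi, decim=180, win_ms=100):
    """One illustrative band: band-pass, full-wave rectify, decimate,
    then smooth with a 100 ms half-Hanning (raised cosine) window."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = np.abs(sosfilt(sos, x))             # full-wave rectification
    band = band[::decim]                       # crude decimation by 180
    fs_env = fs / decim
    n = int(win_ms / 1000 * fs_env)
    half_hann = np.hanning(2 * n)[n:]          # decaying half of a Hann window
    half_hann /= half_hann.sum()
    return np.convolve(band, half_hann)[:len(band)]   # causal-ish smoothing

fs = 44100
x = np.zeros(fs)
x[fs // 3:fs // 3 + 200] = np.random.randn(200)       # a short percussive burst
env = band_envelope(x, fs, 100.0, 200.0)
print("envelope peak near sample", int(np.argmax(env)) * 180)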

6.5 Onset Detection in ·O M M·

From a Listener's Point of View

One of the critical points in a mechanical orchestra is synchronizing the execution of the musical score among real and virtual instruments; this problem is assigned to the Show Control System of ·O M M· . We recall that the human ear is particularly sensitive to timing; hence, with physical devices such as the electromechanical arms of the robots, we have to ensure correct timing, as expected by the listener. In our system, the variable delays afflicting the orchestra while playing must be overcome to make a synchronized execution possible.

The variation of the delays strongly depends on the requested note intensity and the imposed execution rhythm. Each robot basically receives a message on a serial line (MIDI) stating what kind of hit should be executed. In addition to the message generation and data transmission delay, usually negligible from a human point of view, there is the delay introduced by the physical movement of the arms. That is, the robot arm must be positioned at the correct height and will hit the drum after a time proportional to the distance of the arm from the drum and the acceleration by which it is driven. Furthermore, the delays are not only variable in a predictable manner: problems related to vibrations not absorbed between one hit and the next can also occur. This can cause unwanted changes in the loudness and pitch19 perceived by the listener. That is why we need a perceptually based approach to analyze (in real time) the sound produced by the robot and its response to the different, digitally applied, stimuli.

What we need is to measure the time elapsed between the digital command and the perceived strike on the can, and to compensate for it during the robotic performance. For this purpose we decided to create a delay matrix, where each field represents the delay of a note to be executed. Once we have this delay matrix,

18 Octave and third-octave filters are presented in chapter 3.
19 Since the sound produced by ·O M M· can be considered non-pitched, we do not provide a pitch detector stage in the orchestra, although one might expect that a pitch will be perceived. We tried to understand this behaviour with the fiddle∼ and pitch∼ external objects for Max/MSP.


we can take into account the delay required for the note to be completed, and anticipate the execution of that note by exactly the time of the delay.

To calculate the delay we needed to provide an onset detector stage in our Show Control System. We know exactly the time of the MIDI event, hence what is missing is the detection of the note actually executed.

Integration in the SCS

Our need was to integrate the onset detection stage into the SCS, developed in Max/MSP and running on an Apple laptop. We initially tried to implement a new method, to familiarize ourselves with development in Max. It was based on an envelope follower of the signal with a variable threshold applied to it: when the amplitude envelope of the signal exceeds the threshold, an onset is detected. Given its practical inefficiency, this method was soon set aside and other methods were explored.

We found a very interesting approach in the bonk∼ object, an external library for Max/MSP available open source on the web. The original code was written by the author of Max himself, Miller Puckette, for Pure Data in 1989. It has since been revised by other people over the years: Ted Apel ported bonk∼ to the Max/MSP platform and later Barry Threw applied the latest modifications (2008). The version we have used is called bonk∼ 1.4, found in M. Puckette's repository, with permission to apply changes.

6.5.1 The bonk∼ Method

The bonk∼ method works essentially on a specialization of the constant-Q filter bank analysis, called bounded-Q analysis. This method has the advantage of drastically reducing the complexity of the constant-Q transform, as well described in [17] after [29] and [11]. In this kind of analysis the value of Q is limited (bounded) to approximately 5, and a small number of filters can be used to obtain a filterbank which gives at least the same results as a constant-Q analysis. In addition, bounded-Q analysis takes advantage of an FFT-like algorithm applied within each frequency channel. This is possible because the octaves are geometrically separated, but within each octave the frequency bins are equally spaced, as shown in figure 6.5. This channel distribution becomes a good approximation to the geometric scale with a proper number of channels per octave.

Puckette in [38] says: the bonk∼ object was written for dealing with sound sources for which sinusoidal decomposition breaks down; the first application has been to drums and percussion. That is exactly our case. The bandwidths of the filters subdivide the sound spectrum into regions which are approximately tuned around the critical bands, in a manner similar to the Klapuri approach above. This should mimic the auditory system behavior well.


Figure 6.5: Graphical representation of the bounded-Q filterbank. Only the octaves are geometrically spaced; within each octave the spacing between analysis bins is linear. This allows the application of an FFT-like algorithm to calculate the spectrum of each component.

We found that an implementation with 15 (non-overlapping) filters was successful for our case. See table 6.1 for details of the filters used for the band-wise analysis. In this table the filter spacing of two filters per octave can be easily recognized, except where prohibited (the first two filters do not respect this spacing20). The details of the filterbank implementation can be found in the appendix of this thesis.
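As a quick numerical check of the bounded-Q property, the center frequencies and bandwidths listed in table 6.1 can be used to compute the Q of each filter (a minimal Python sketch over the tabulated values; the interpretation of the first two filters as exceptions follows the remark above):

# Center frequencies and bandwidths of the 15 filters listed in table 6.1.
fc = [86, 150.5, 220.59, 312.18, 441.18, 623.93, 882.36, 1247.86,
      1764.72, 2495.72, 3529.44, 4991.44, 7059.31, 9983.31, 14118.19]
bw = [32.25, 32.25, 37.84, 53.32, 75.68, 107.07, 151.36, 214.14,
      302.72, 428.28, 605.44, 856.56, 1211.31, 1712.69, 2422.19]

for i, (f, b) in enumerate(zip(fc, bw), start=1):
    print(f"filter {i:2d}: fc = {f:9.2f} Hz, Q = fc/bw = {f / b:4.2f}")
# From the third filter upward the ratio stays around 5.8 (the bounded Q);
# the first two filters deliberately break the two-filters-per-octave spacing.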

The final stage, what we earlier called the peak-picking stage, works in bonk∼ essentially through the definition of a growth function.

6.5.2 Results of the Analysis in ·O M M·

After several months spent debugging, optimizing and adapting the source code to our needs, we subjected it to several tests. These tests, performed with the sounds recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this approach. We recorded 3 tracks, containing 300 sounds each, then we did a lot of cut'n'paste to obtain five scores, called soundtracks in the result summary. Each soundtrack, very different from the others, was realized at a different bpm21, from 100 to 120 (which is the maximum value that a single robot of ·O M M· can reach). These tracks, created with Ableton Live, were sent to bonk∼ by a dedicated Max patch realized for testing. The sounding objects are sent to bonk∼ via a special object for Max, called Elastic, which allows variation of the pitch and tempo of the execution. The testing patch we realized for this purpose is presented in the appendix of the thesis.

20 See chapter 2 for critical band description.
21 Beats Per Minute.


Table 6.1: Filterbank design in our method based on bonk∼

Filter number | fc [Hz]  | Bandwidth [Hz] | Filter points | Number of hops | Hop size
 1            |    86    |   32.25        | 1024          |   1            | 512
 2            |   150.5  |   32.25        | 1024          |   1            | 512
 3            |   220.59 |   37.84        |  873          |   1            | 436
 4            |   312.18 |   53.32        |  617          |   2            | 308
 5            |   441.18 |   75.68        |  436          |   3            | 218
 6            |   623.93 |  107.07        |  308          |   5            | 154
 7            |   882.36 |  151.36        |  218          |   8            | 109
 8            |  1247.86 |  214.14        |  154          |  12            |  77
 9            |  1764.72 |  302.72        |  109          |  17            |  55
10            |  2495.72 |  428.28        |   77          |  25            |  39
11            |  3529.44 |  605.44        |   55          |  36            |  27
12            |  4991.44 |  856.56        |   39          |  52            |  19
13            |  7059.31 | 1211.31        |   27          |  72            |  14
14            |  9983.31 | 1712.69        |   19          | 101            |  10
15            | 14118.19 | 2422.19        |   14          | 145            |   7

The final configuration of bonk∼ gave great results: all the hits produced by the orchestra can be located in time, and the value of the CDR (Correct Detection Result), proposed in [30], was very easy to calculate.

The CDR is given by:

\[ \mathrm{CDR} = \frac{\text{total} - \text{undetected} - \text{false detected}}{\text{total}} \cdot 100\% \]

and the results of the analysis are the following: only above 110 bpm does bonk∼ fail in some cases, that is, spurious onsets are reported (but all the provided onsets are recognized).
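A trivial sketch of the CDR computation, using one row of table 6.2 as input (Python; purely illustrative):

def cdr(total, undetected, false_detected):
    """Correct Detection Result, as defined above."""
    return (total - undetected - false_detected) / total * 100.0

# The 115 bpm soundtrack from table 6.2: 120 onsets, 0 missed, 2 spurious.
print(f"CDR = {cdr(120, 0, 2):.1f}%")   # about 98%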


Table 6.2: Results in detecting the onsets of the five soundtracks created for analysis purposes, played at different bpm.

·O M M· soundtrack (bpm) | Total onsets | Detected | Undetected | False detected | CDR [%]
100                      | 120          | 120      | 0          | 0              | 100
105                      |  80          |  80      | 0          | 0              | 100
110                      |  90          |  90      | 0          | 0              | 100
115                      | 120          | 120      | 0          | 2              |  98
120                      |  60          |  60      | 0          | 5              |  95

6.6 From Onset Analysis to Sound Classification

In this last section we aim to demonstrate that the use of bonk∼ analysis in detecting onsets can be extended to the more ambitious task of sound classification. As mentioned before, the ·O M M· robots are able to produce three different sounds, with substantially different intensity and duration. What we present in this section is the ability of bonk∼ to predict which of the three sounds has been played, using only the loudness content of each frequency band at onset detection time. This, if true, would mean that the spectral analysis performed by bonk∼ , especially when it detects an onset, is enough to predict which kind of sound has been produced by the orchestra.

Puckette has provided another tool in bonk∼ , called learn mode, which allows the pattern originated by the output of each filter used for the analysis to be stored into a template. The learn mode essentially works in the following way:

1. enter learn mode and choose the number of (equal) strikes to be averaged for each template

2. start the bonk∼ analysis and stop it after all the onsets provided are recognized

3. store the spectral template into a file and read it back

4. exit learn mode and continue to analyze the signal

5. bonk∼ will report the number corresponding to the sound which best fits the spectral template read from the file
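As a concrete example, the message sequence sent to bonk∼ from the test patch could look like the following sketch (the strike count and the file name are example values of ours; learn, write and read are the messages exposed by the source code in Appendix B):

    learn 10                  <- enter learn mode, averaging 10 strikes per template (example value)
    ... play each of the three ·O M M· sounds the requested number of times ...
    write omm-templates.txt   <- store the spectral templates to a file
    read omm-templates.txt    <- read the templates back
    learn 0                   <- exit learn mode; each further detected onset is matched against
                                 the stored templates and the best-fitting number is reported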

6.6.1 Learning Results

The results of bonk∼ learning for the case of ·O M M· are summarized in Table 6.3:


Table 6.3: Numerical results in detecting onsets and recognizing the three sounds (A/B/C) produced by the ·O M M·

·O M M· soundtrack (bpm)   Total onsets   Total A/B/C notes   A/B/C notes recognized   A/B/C notes confused
100                             120           30/60/30             25/64/31                  5/5/0
105                              80           30/30/20             25/32/23                  5/6/4
110                              90           33/26/31             32/28/30                  2/1/1
115                             120           100/12/8             97/16/7                   4/1/3
120                              60           25/20/15             30/15/15                  4/0/7

The result is quite unexpected: more than 80% (on average; in particular cases higher than 95%) of correct correspondences were found, simply by looking at the onset analysis results.
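For instance, reading the last column of Table 6.3 as the number of misclassified onsets, the 100 bpm soundtrack yields

$$\frac{120 - (5+5+0)}{120} \approx 92\%$$

of correct correspondences, the best case (110 bpm) yields $(90-4)/90 \approx 96\%$, and the worst case (120 bpm) still yields $(60-11)/60 \approx 82\%$.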


Chapter 7

Conclusion

A specific component of the human ear, the basilar membrane inside the cochlea, located in the inner ear, is responsible for the detection of the frequency components of a sound. In this thin membrane, 32 mm long, the frequencies cause oscillations around specific points of the basilar membrane. The mechanical properties of the cochlea (wide and stiff at the base, narrower and much less stiff at the end), in which the basilar membrane is located, result in a roughly logarithmic decrease in bandwidth as we move linearly away from the cochlear opening (the oval window).

Therefore we propose a different approach to sound analysis, known as the constant-Q filterbank method. This method is typically implemented with a bank of band-pass filters (a filterbank) with constant Q ratio. We recall the definition of Q, which is the ratio between the center frequency and the bandwidth of a filter. This method mimics the behaviour of the auditory system in detecting frequency, i.e. the filters are linearly spaced along a logarithmic frequency axis. This can be obtained with filters that keep their Q ratio constant. It is established that Q must be chosen approximately equal to 37 to perform a rigorous scan over the frequency range from 20 Hz to 20 kHz, and many filters (at least one hundred) must be used.

Our approach has been compared with the one provided by an external library for Max/MSP, called bonk∼, developed by the author of Max/MSP himself, Miller Puckette. We found in it a very interesting approach, very helpful for our purpose. The code has been revised by other people over the years (the original code dates from 1989) and the version we have taken into account is bonk∼ 3.0. The code was found in M. Puckette's repository, with permission to apply changes.

The bonk∼ method works essentially on a specialization of the constant-Q filterbank analysis, called bounded-Q analysis. This method has the advantage of reducing the complexity of the constant-Q transform. In this kind of analysis the value of Q is limited (bounded) to approximately 5, and a small number of filters is enough to obtain at least the same results.


We found that the implementation of 15 non-overlapping filters was successful for our case. The bandwidths of the filters subdivide the sound spectrum into regions which are approximately tuned around musical octaves, thus respecting what appears to be the auditory system's response. This method, implemented for the first time in the late 80s, has shown good results in various fields of musical analysis, in particular segmentation and transcription.
After several months spent debugging, optimizing and adapting the source code to our needs, we subjected it to several tests. These tests, performed with the sounds recorded at the LIM laboratories in Verres (AO), have demonstrated the soundness of this approach. All the hits produced by the orchestra can be located in time. Moreover, we also trained bonk∼ to recognize which kind of sound has been produced, and we obtain more than 85% of successful correspondences.
The percentage of recognized hits indicates the validity of the approach; other possible musical applications can be foreseen inside or outside the ·O M M·.


Appendix A

MSP, anatomy of the object

Source 1: main.c

/**
	@page chapter_msp_anatomy Anatomy of a MSP Object

	An MSP object that handles audio signals is a regular Max object with a few
	extras. Refer to the plussz~ example project source as we detail these
	additions. plussz~ is simply an object that adds 1 to a signal, identical in
	function to the regular MSP +~ object if you were to give it an argument of 1.

	Here is an enumeration of the basic tasks:

	1) additional header files

	After including ext.h and ext_obex.h, include z_dsp.h

	@code
	#include "z_dsp.h"
	@endcode

	2) C structure declaration

	The C structure declaration must begin with a #t_pxobject, not a #t_object:

	@code
	typedef struct _mydspobject
	{
		t_pxobject m_obj;
		// rest of the structure's fields
	} t_mydspobject;
	@endcode

	3) initialization routine

	When creating the class with class_new(), you must have a free function. If
	you have nothing special to do, use dsp_free(), which is defined for this
	purpose. If you write your own free function, the first thing it should do is
	call dsp_free(). This is essential to avoid crashes when freeing your object
	when audio processing is turned on.

	@code
	c = class_new("mydspobject", (method)mydspobject_new, (method)dsp_free,
	              sizeof(t_mydspobject), NULL, 0);
	@endcode

	After creating your class with class_new(), you must call class_dspinit(),
	which will add some standard method handlers for internal messages used by
	all signal objects.

	@code
	class_dspinit(c);
	@endcode

	Your signal object needs a method that is bound to the symbol "dsp" -- we'll
	detail what this method does below, but the following line needs to be added
	while initializing the class:

	@code
	class_addmethod(c, (method)mydspobject_dsp, "dsp", A_CANT, 0);
	@endcode

	4) new instance routine

	The new instance routine must call dsp_setup(), passing a pointer to the
	newly allocated object pointer plus a number of signal inlets the object will
	have. If the object has no signal inlets, you may pass 0. The plusz~ object
	(as an example) has a single signal inlet:

	@code
	dsp_setup((t_pxobject *)x, 1);
	@endcode

	dsp_setup() will make the signal inlets (as proxies) so you need not make
	them yourself.

	If your object will have audio signal outputs, they need to be created in the
	new instance routine with outlet_new(). However, you will never access them
	directly, so you don't need to store pointers to them as you do with regular
	outlets. Here is an example of creating two signal outlets:

	@code
	outlet_new((t_object *)x, "signal");
	outlet_new((t_object *)x, "signal");
	@endcode

	5) The dsp method and perform routine

	The dsp method specifies the signal processing function your object defines
	along with its arguments. Your object's dsp method will be called when the
	MSP signal compiler is building a sequence of operations (known as the DSP
	Chain) that will be performed on each set of audio samples. The operation
	sequence consists of pointers to functions (called perform routines) followed
	by arguments to those functions.

	The dsp method is declared as follows:

	@code
	void mydspobject_dsp(t_mydspobject *x, t_signal **sp, short *count);
	@endcode

	To add an entry to the DSP chain, your dsp method uses dsp_add(). The dsp
	method is passed an array of signals (#t_signal pointers), which contain
	pointers to the actual sample memory your object's perform routine will be
	using for input and output. The array of signals starts with the inputs (from
	left to right), followed by the outputs. For example, if your object has two
	inputs (because your new instance routine called dsp_setup(x, 2)) and three
	outputs (because your new instance created three signal outlets), the signal
	array sp would contain five items as follows:

	@code
	sp[0]	// left input
	sp[1]	// right input
	sp[2]	// left output
	sp[3]	// middle output
	sp[4]	// right output
	@endcode

	The #t_signal data structure (defined in z_dsp.h) contains two important
	elements: the s_n field, which is the size of the signal vector, and s_vec,
	which is a pointer to an array of 32-bit floats containing the signal data.
	All t_signals your object will receive have the same size. This size is not
	necessarily the same as the global MSP signal vector size, because your
	object might be inside a patcher within a poly~ object that defines its own
	size. Therefore it is important to use the s_n field of a signal passed to
	your object's dsp method.

	You can use a variety of strategies to pass arguments to your perform routine
	via dsp_add(). For simple unit generators that don't store any internal state
	between computing vectors, it is sufficient to pass the inputs, outputs, and
	vector size. For objects that need to store internal state between computing
	vectors such as filters or ramp generators, you will pass a pointer to your
	object, whose data structure should contain space to store this state. The
	plus1~ object does not need to store internal state. It passes the input,
	output, and vector size to its perform routine. The plus1~ dsp method is
	shown below:

	@code
	void plus1_dsp(t_plus1 *x, t_signal **sp, short count)
	{
		dsp_add(plus1_perform, 3, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
	}
	@endcode

	The first argument to dsp_add() is your perform routine, followed by the
	number of additional arguments you wish to copy to the DSP chain, and then
	the arguments.

	The perform routine is not a "method" in the traditional sense. It will be
	called within the callback of an audio driver, which, unless the user is
	employing the Non-Real Time audio driver, will typically be in a
	high-priority thread. Thread protection inside the perform routine is
	minimal. You can use a clock, but you cannot use qelems or outlets. The
	design of the perform routine is somewhat unlike other Max methods. It
	receives a pointer to a piece of the DSP chain and it is expected to return
	the location of the next perform routine on the chain. The next location is
	determined by the number of arguments you specified for your perform routine
	with your call to dsp_add(). For example, if you will pass three arguments,
	you need to return w + 4.

	Here is the plus1 perform routine:

	@code
	t_int *plus1_perform(t_int *w)
	{
		t_float *in, *out;
		int n;

		in = (t_float *)w[1];	// get input signal vector
		out = (t_float *)w[2];	// get output signal vector
		n = (int)w[3];		// vector size

		while (n--)		// perform calculation on all samples
			*out++ = *in++ + 1.;

		return w + 4;		// must return next DSP chain location
	}
	@endcode

	6) Free function

	The free function for the class must either be dsp_free() or it must be
	written to call dsp_free() as shown in the example below:

	@code
	void mydspobject_free(t_mydspobject *x)
	{
		dsp_free((t_pxobject *)x);
		// can do other stuff here
	}
	@endcode

*/


Appendix B

bonk∼ source code

No substantial modification has been applied to the original bonk∼ code. Previous modifications to the original were made by Barry Threw for the latest version of bonk∼, which is the one we used¹.

B.1 The bonk∼ Method

Source 2: main.c

/*

############################################################################ bonk~ − a pd and Max/MSP e x t e r n a l# by m i l l e r pucke t t e and ted appe l# ht tp : // c r c a . ucsd . edu/~msp/

6 # Max/MSP po r t by ba r r y threw ( me@barrythrew . com)# ht tp ://www. ba r r y t h r ew . com# San F ranc i s co , CA 2008# f o r Kesumo − ht tp : //www. kesumo . com# Max 5 op t im i z ed v e r s i o n f o r l oud p e r c u s s i v e sounds , by Zeng i . BETA v e r s i o n

11 # Turin , June 2009###########################################################################// bonk~ d e t e c t s a t t a c k s i n an aud io s i g n a l###########################################################################This s o f twa r e i s c o p y r i g h t e d by M i l l e r Pucket te and o t h e r s . The f o l l o w i n g

16 te rms ( the " Standard Improved BSD L i c e n s e ") app l y to a l l f i l e s a s s o c i a t e d wi ththe s o f twa r e u n l e s s e x p l i c i t l y d i s c l a im e d i n i n d i v i d u a l f i l e s :

R e d i s t r i b u t i o n and use i n s ou r c e and b i n a r y forms , w i th or w i thoutmod i f i c a t i o n , a r e p e rm i t t ed p r o v i d ed tha t the f o l l o w i n g c o n d i t i o n s a r e

21 met :

1 . R e d i s t r i b u t i o n s o f s ou r c e code must r e t a i n the above c o p y r i g h tno t i c e , t h i s l i s t o f c o n d i t i o n s and the f o l l o w i n g d i s c l a i m e r .2 . R e d i s t r i b u t i o n s i n b i n a r y form must r ep roduce the above

26 c o p y r i g h t no t i c e , t h i s l i s t o f c o n d i t i o n s and the f o l l o w i n g

¹ bonk∼ 3 can be found in M. Puckette's repository (ask him) or in other unspecified locations on the web. The one we used was found on Barry Threw's website, but it is no longer available.


d i s c l a i m e r i n the documentat ion and/ or o t h e r m a t e r i a l s p r o v i d edwi th the d i s t r i b u t i o n .3 . The name o f the autho r may not be used to endo r s e or promotep roduc t s d e r i v e d from t h i s s o f twa r e w i thout s p e c i f i c p r i o r

31 w r i t t e n p e rm i s s i o n .

THIS SOFTWARE IS PROVIDED BY THE AUTHOR ‘ ‘AS IS ’ ’ AND ANYEXPRESS OR IMPLIED WARRANTIES, INCLUDING , BUT NOT LIMITED TO,THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A

36 PARTICULAR PURPOSE ARE DISCLAIMED . IN NO EVENT SHALL THE AUTHORBE LIABLE FOR ANY DIRECT , INDIRECT , INCIDENTAL , SPECIAL ,EXEMPLARY, OR CONSEQUENTIAL DAMAGES ( INCLUDING , BUT NOT LIMITEDTO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES ; LOSS OF USE ,DATA, OR PROFITS ; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND

41 ON ANY THEORY OF LIABILITY , WHETHER IN CONTRACT, STRICTLIABILITY , OR TORT ( INCLUDING NEGLIGENCE OR OTHERWISE) ARISINGIN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OFTHE POSSIBILITY OF SUCH DAMAGE.∗/

46/∗d o l i s t :decay and o th e r t imes i n msec // s t i l l to do∗/

51#inc lude <math . h>#inc lude <s t d i o . h>#inc lude <s t r i n g . h>

56 //#i f d e f NT//#pragma warn ing ( d i s a b l e : 4305 4244)//#e n d i f

#inc lude " ex t . h"61 #inc lude "z_dsp . h"

#inc lude "math . h"#inc lude " ext_suppor t . h"#inc lude " ext_proto . h"#inc lude " ext_obex . h"

66typedef double t_ f l o a t a r g ; /∗ from m_pd. h ∗/

#def ine f l o g l o g#def ine f e xp exp

71 #def ine f s q r t s q r t#def ine t_ r e s i z e b y t e s ( a , b , c ) t_ r e s i z e b y t e s ( ( char ∗) ( a ) , ( b ) , ( c ) )

vo id ∗ bonk_class ;#def ine g e t b y t e s t_getby te s

76 #def ine f r e e b y t e s t_ f r e e b y t e s

//BONK ATTRIBUTE SETTINGS , YOU CAN OVERRIDE THEM IN MAX PATCH#def ine DEFNPOINTS 1024#def ine MAXCHANNELS 2 /∗ 8 ∗/

81 #def ine MINPOINTS 64#def ine DEFPERIOD 256 /∗ 128 ∗/#def ine DEFNFILTERS 15#def ine DEFHALFTONES 6#def ine DEFOVERLAP 1

86 #def ine DEFFIRSTBIN 2 /∗ mod i f i c a t o , e r a 1∗/#def ine DEFHITHRESH 10 /∗ 5 ∗/#def ine DEFLOTHRESH 5 /∗ 2 .5 ∗/


#def ine DEFMASKTIME 4#def ine DEFMASKDECAY 0 .7

91 #def ine DEFDEBOUNCEDECAY 0#def ine DEFMINVEL 7#def ine DEFATTACKBINS 1#def ine MAXATTACKWAIT 4

96 //DATA STRUCTUREStypedef s t r u c t _ f i l t e r k e r n e l{

i n t k_ f i l t e r p o i n t s ;i n t k_hoppoints ;

101 i n t k_sk i ppo i n t s ;i n t k_nhops ;f l o a t k_cen t e r f r eq ; /∗ c e n t e r f r equency , b i n s ∗/f l o a t k_bandwidth ; /∗ bandwidth , b i n s ∗/f l o a t ∗ k_s tu f f ;

106 } t _ f i l t e r k e r n e l ;

// f i l t e r b a n k s t r u c t u r e , implements a l i n k e d l i s t o f f i l t e r w i th s t r u c _ f i l t e r b a n l ∗b_next .

typedef s t r u c t _f i l t e r b a n k{

111 i n t b_n f i l t e r s ; /∗ number o f f i l t e r s i n bank ∗/i n t b_npoints ; /∗ i n pu t v e c t o r s i z e ∗/f l o a t b_ha l f t one s ; /∗ f i l t e r bandwidth i n h a l f t o n e s ∗/f l o a t b_over lap ; /∗ o v e r l a p ; d e f a u l t 1 f o r 1/2−power p t s ∗/f l o a t b_ f i r s t b i n ; /∗ f r e q o f f i r s t f i l t e r i n b in s , d e f a u l t 1 ∗/

116 t _ f i l t e r k e r n e l ∗b_vec ; /∗ f i l t e r k e r n e l s ∗/i n t b_refcount ; /∗ number o f bonk~ o b j e c t s u s i n g t h i s ∗/s t r u c t _f i l t e r b a n k ∗b_next ; /∗ next i n l i n k e d l i s t ∗/

} t_ f i l t e r b a n k ;

121 /∗ 1 .3 r e v i ew ∗/#def ine MAXNFILTERS 50#def ine MASKHIST 8

s t a t i c t_ f i l t e r b a n k ∗ b o n k_ f i l t e r b a n k l i s t ;126

typedef s t r u c t _his t{

f l o a t h_power ;f l o a t h_before ;

131 f l o a t h_outpower ;i n t h_countup ;f l o a t h_mask [MASKHIST ] ;

} t_h i s t ;

136 typedef s t r u c t t emp la t e{

f l o a t t_amp [MAXNFILTERS ] ;} t_template ;

141 typedef s t r u c t _in s i g{

t_h i s t g_his t [MAXNFILTERS ] ; /∗ h i s t o r y f o r each f i l t e r ∗/vo id ∗ g_out l e t ; /∗ o u t l e t f o r raw data ∗/f l o a t ∗ g_inbuf ; /∗ b u f f e r e d i npu t samples ∗/

146 t_ f l o a t ∗ g_invec ; /∗ new i npu t samples ∗/} t_ i n s i g ;

typedef s t r u c t _bonk


{151

t_pxob jec t x_obj ;vo id ∗obex ;vo id ∗ x_cookedout ;vo id ∗ x_clock ;

156 shor t x_vol ;

/∗ pa ramete r s ∗/i n t x_npoints ; /∗ number o f p o i n t s i n i n pu t b u f f e r ∗/i n t x_per iod ; /∗ number o f i n pu t samples between a n a l y s e s ∗/

161 i n t x_ n f i l t e r s ; /∗ number o f f i l t e r s r e qu e s t e d ∗/f l o a t x_ha l f t one s ; /∗ nomina l h a l f t o n e s between f i l t e r s ∗/f l o a t x_over lap ;f l o a t x_ f i r s t b i n ;

166 f l o a t x_h i th r e sh ; /∗ t h r e s h o l d f o r t o t a l growth to t r i g g e r ∗/f l o a t x_ lo th r e sh ; /∗ t h r e s h o l d f o r t o t a l growth to re−arm ∗/f l o a t x_minvel ; /∗ minimum v e l o c i t y we output ∗/f l o a t x_maskdecay ;i n t x_masktime ;

171 i n t x_use loudnes s ; /∗ use l o udn e s s s p e c t r a i n s t e a d o f power ∗/f l o a t x_debouncedecay ;f l o a t x_debounceve l ;double x_learndebounce ; /∗ debounce t ime ( i n " l e a r n " mode on l y ) ∗/i n t x_at tackb in s ; /∗ number o f b i n s to wa i t f o r a t t a c k ∗/

176t_ f i l t e r b a n k ∗ x_ f i l t e r b a n k ;t_h i s t x_h i s t [MAXNFILTERS ] ;t_template ∗ x_template ;t_ i n s i g ∗ x_ in s i g ;

181 i n t x_n in s i g ;i n t x_ntemplate ;i n t x _ i n f i l l ;i n t x_countdown ;i n t x_w i l l a t t a c k ;

186 i n t x_attacked ;i n t x_debug ;i n t x_learn ;i n t x_lea rncount ; /∗ countup f o r " l e a r n " mode ∗/i n t x_spew ; /∗ i f t rue , a lways g en e r a t e output ! ∗/

191 i n t x_maskphase ; /∗ phase , 0 to MASKHIST−1, f o r mask h i s t o r y ∗/f l o a t x_sr ; /∗ c u r r e n t sample r a t e i n Hz . ∗/i n t x_hit ; /∗ next " t i c k " c a l l e d because o f a h i t , not a p o l l

∗/} t_bonk ;

196 //PROTOTYPES// p r o t o t y p e s f o r methods : need a method f o r each incoming messages t a t i c vo id ∗bonk_new( t_symbol ∗ s , long ac , t_atom ∗av ) ;s t a t i c vo id bonk_tick ( t_bonk ∗x ) ;s t a t i c vo id bonk_doit ( t_bonk ∗x ) ;

201 s t a t i c t_ in t ∗bonk_perform ( t_int ∗w) ;s t a t i c vo id bonk_dsp ( t_bonk ∗x , t_ s i g n a l ∗∗ sp ) ;vo id bonk_as s i s t ( t_bonk ∗x , vo id ∗b , long m, long a , char ∗ s ) ;s t a t i c vo id bonk_free ( t_bonk ∗x ) ;vo id bonk_setup ( vo id ) ;

206 vo id main ( ) ;

//methods f o r t r e s h o l d and o th e r f e a t u r e ss t a t i c vo id bonk_thresh ( t_bonk ∗x , t_ f l o a t a r g f1 , t_ f l o a t a r g f 2 ) ;s t a t i c vo id bonk_pr int ( t_bonk ∗x , t_ f l o a t a r g f ) ;


211 s t a t i c vo id bonk_bang ( t_bonk ∗x ) ;s t a t i c vo id bonk_learn ( t_bonk ∗x , i n t n ) ;s t a t i c vo id bonk_forget ( t_bonk ∗x ) ;

//methods f o r r e ad s and w r i t e t emp l a t e s216 s t a t i c vo id bonk_write ( t_bonk ∗x , t_symbol ∗ s ) ;

s t a t i c vo id bonk_read ( t_bonk ∗x , t_symbol ∗ s ) ;

//method f o r a t t r i b u t e s s e t t e rvo id bonk_minvel_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;

221 vo id bonk_lothresh_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_hi thresh_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_masktime_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_maskdecay_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_debouncedecay_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;

226 vo id bonk_debug_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_spew_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_use loudness_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;vo id bonk_attackb ins_set ( t_bonk ∗x , vo id ∗ a t t r , long ac , t_atom ∗av ) ;

231 f l o a t q r s q r t ( f l o a t f ) ;double c l o ck_ge t s y s t ime ( ) ;double c l o c k_ge t t ime s i n c e ( double p r e v s y s t ime ) ;char ∗ s t r c p y ( char ∗ s1 , const char ∗ s2 ) ;

236 // c l o c k f u n c t i o ns t a t i c vo id bonk_tick ( t_bonk ∗x ) ;

#def ine HALFWIDTH 0.75 /∗ h a l f peak bandwidth at h a l f power po i n t i n b i n s ∗/

241 //CONTANT Q FILTERBANK IMPLEMENTATIONs t a t i c t_ f i l t e r b a n k ∗ bonk_newf i l t e rbank ( i n t npo in t s , i n t n f i l t e r s , f l o a t h a l f t o n e s ,

f l o a t ove r l ap , f l o a t f i r s t b i n ){

i n t i , j ;f l o a t cf , bw , h , r e l s p a c e ;

246 t_ f i l t e r b a n k ∗b = ( t_ f i l t e r b a n k ∗) g e t b y t e s ( s i z e o f (∗b ) ) ;b−>b_npoints = npo i n t s ;b−>b_n f i l t e r s = n f i l t e r s ;b−>b_ha l f t one s = h a l f t o n e s ;b−>b_over lap = ov e r l a p ;

251 b−>b_ f i r s t b i n = f i r s t b i n ;b−>b_refcount = 0 ;b−>b_next = b o n k_ f i l t e r b a n k l i s t ;b o n k_ f i l t e r b a n k l i s t = b ;b−>b_vec = ( t _ f i l t e r k e r n e l ∗) g e t b y t e s ( n f i l t e r s ∗ s i z e o f (∗b−>b_vec ) ) ;

256// i n con s t an t Q f i l t e r b a n k , s p a c i n g between f i l t e r s i s implemented by t h i s wayh = exp ( ( l o g ( 2 . ) /12 . ) ∗ h a l f t o n e s ) ; /∗ specced i n t e r v a l between f i l t e r s ∗///h=h a l f t o n e s ∗5 ;r e l s p a c e = (h − 1) /( h + 1) ; /∗ nomina l spac ing−per−f f o r f i l t e r b a n k ∗/

261 // r e l s p a c e=h /2 ;

c f = f i r s t b i n ; // f i r s t c e n t e r f r e q o f the f i l t e r b a n k// bandwidthbw = c f ∗ r e l s p a c e ∗ o v e r l a p ;

266 i f (bw < HALFWIDTH)bw = HALFWIDTH;

// c r e a t e s ( i ) f i l t e r s , MAX( i ) =50.// s t op s c r e a t i n g f i l t e r s when c f exceed npo i n t s /2 , r e t u r n i .

271 f o r ( i = 0 ; i < n f i l t e r s ; i++)


{f l o a t ∗ fp , newcf , newbw ;f l o a t no rma l i z e r = 0 ;i n t f i l t e r p o i n t s , s k i p p o i n t s , hoppo in t s , nhops ;

276f i l t e r p o i n t s = 0 .5 + npo i n t s ∗ HALFWIDTH/bw ;// f i l t e r p o i n t s = 0 .5 + npo i n t s / bw ;i f ( c f > npo i n t s /2){

281 pos t ( "bonk ~: ␣ on l y ␣ u s i n g ␣%d␣ f i l t e r s ␣ ( ran ␣ pa s t ␣ Nyqu i s t ) " , i +1) ;break ;

}i f ( f i l t e r p o i n t s < 4){

286 pos t ( "bonk ~: ␣ on l y ␣ u s i n g ␣%d␣ f i l t e r s ␣ ( k e r n e l s ␣ got ␣ too ␣ s h o r t ) " , i +1) ;break ;

}e l s e i f ( f i l t e r p o i n t s > npo i n t s )

f i l t e r p o i n t s = npo i n t s ;291

hoppo i n t s = 0 .5 + 0 .5 ∗ npo i n t s ∗ HALFWIDTH/bw ;// hoppo i n t s = 0 .5 + 0 .5 ∗ npo i n t s /bw ;

nhops = 1 . + ( npo in t s− f i l t e r p o i n t s ) /( f l o a t ) hoppo i n t s ;296 s k i p p o i n t s = 0 .5 ∗ ( npo in t s− f i l t e r p o i n t s − ( nhops−1) ∗ hoppo i n t s ) ;

// F i l l the k e r n e l o f the f i l t e r s i n f i l t e r b a n k −>f i l t e r k e r n e lb−>b_vec [ i ] . k_s tu f f =

( f l o a t ∗) g e t b y t e s (2 ∗ s i z e o f ( f l o a t ) ∗ f i l t e r p o i n t s ) ;301 b−>b_vec [ i ] . k _ f i l t e r p o i n t s = f i l t e r p o i n t s ;

b−>b_vec [ i ] . k_nhops = nhops ;b−>b_vec [ i ] . k_hoppoints = hoppo i n t s ;b−>b_vec [ i ] . k_sk i ppo i n t s = s k i p p o i n t s ;b−>b_vec [ i ] . k_cen t e r f r eq = c f ;

306 b−>b_vec [ i ] . k_bandwidth = bw ;

//BANDPASS FILTER DESIGN : vede r ef o r ( fp = b−>b_vec [ i ] . k_stu f f , j = 0 ; j < f i l t e r p o i n t s ; j++, fp+= 2){

311 f l o a t phase = j ∗ c f ∗ (2∗3 .14159/ npo i n t s ) ;f l o a t wphase = j ∗ (2∗3 .14159 / f i l t e r p o i n t s ) ;f l o a t window = s i n ( 0 . 5∗ wphase ) ;fp [ 0 ] = window ∗ cos ( phase ) ;fp [ 1 ] = window ∗ s i n ( phase ) ;

316 n o rma l i z e r += window ;// pos t (" cosphase %.2 f s i n ph a s e %.2 f wphase %.2 f window %.2 f norm %.2 f

fp0 %.2 f fp1 %.2 f " , cos ( phase ) , s i n ( phase ) , wphase , window ,no rma l i z e r , f p [ 0 ] , f p [ 1 ] ) ;

}n o rma l i z e r = 1/( n o rma l i z e r ∗ nhops ) ;f o r ( fp = b−>b_vec [ i ] . k_stu f f , j = 0 ;

321 j < f i l t e r p o i n t s ; j++, fp+= 2)fp [ 0 ] ∗= no rma l i z e r , f p [ 1 ] ∗= no rma l i z e r ;

po s t ( " i ␣%d␣␣ c f ␣%.2 f ␣␣bw␣%.2 f ␣Q␣%.2 f ␣nhops ␣%d , ␣hop␣%d , ␣ s k i p ␣%d , ␣ npo i n t s ␣%d , ␣n o rma l i z e r ␣%.8 f ␣ fp0 ␣%.6 f , ␣ fp1 ␣%.6 f " , i , c f , bw , c f /bw , nhops , hoppo in t s, s k i p p o i n t s , f i l t e r p o i n t s , no rma l i z e r , &fp [ 0 ] , &fp [ 1 ] ) ;

326 newcf = ( c f + bw/ o v e r l a p ) /(1 − r e l s p a c e ) ;newbw = newcf ∗ o v e r l a p ∗ r e l s p a c e ;i f (newbw < HALFWIDTH){


newbw = HALFWIDTH;331 newcf = c f + 2 ∗ HALFWIDTH / o v e r l a p ;

}c f = newcf ;bw = newbw ;

}336 // s e t s to 0 the r ema in i ng f i l t e r s , i f l e s s than 50 f i l t e r s a r e used

f o r ( ; i < n f i l t e r s ; i++)b−>b_vec [ i ] . k_s tu f f = 0 , b−>b_vec [ i ] . k _ f i l t e r p o i n t s = 0 ;

re tu rn ( b ) ;}

341s t a t i c vo id b o n k_ f r e e f i l t e r b a n k ( t_ f i l t e r b a n k ∗b ){

t_ f i l t e r b a n k ∗b2 , ∗b3 ;i n t i ;

346 i f ( b o n k_ f i l t e r b a n k l i s t == b )b o n k_ f i l t e r b a n k l i s t = b−>b_next ;

e l s e f o r ( b2 = b o n k_ f i l t e r b a n k l i s t ; b3 = b2−>b_next ; b2 = b3 )i f ( b3 == b)

{351 b2−>b_next = b3−>b_next ;

break ;}f o r ( i = 0 ; i < b−>b_n f i l t e r s ; i++)

i f (b−>b_vec [ i ] . k_s tu f f )356 f r e e b y t e s (b−>b_vec [ i ] . k_stu f f ,

b−>b_vec [ i ] . k _ f i l t e r p o i n t s ∗ s i z e o f ( f l o a t ) ) ;f r e e b y t e s (b , s i z e o f (∗b ) ) ;

}

361 s t a t i c vo id bonk_donew ( t_bonk ∗x , i n t npo in t s , i n t pe r i od , i n t ns i g , i n t n f i l t e r s ,f l o a t h a l f t o n e s , f l o a t ove r l ap , f l o a t f i r s t b i n , f l o a t s amp l e r a t e )

{i n t i , j ;t_h i s t ∗h ;f l o a t ∗ f p ;

366 t_ i n s i g ∗g ;t_ f i l t e r b a n k ∗ f b ;f o r ( j = 0 , g = x−>x_in s i g ; j < n s i g ; j++, g++){

f o r ( i = 0 , h = g−>g_his t ; i −−; h++)371 {

h−>h_power = h−>h_before = 0 , h−>h_countup = 0 ;f o r ( j = 0 ; j < MASKHIST ; j++)

h−>h_mask [ j ] = 0 ;}

376 /∗ we ought to check f o r f a i l u r e to a l l o c a t e memory he r e ∗/g−>g_inbuf = ( f l o a t ∗) g e t b y t e s ( npo i n t s ∗ s i z e o f ( f l o a t ) ) ;f o r ( i = npo in t s , f p = g−>g_inbuf ; i −−; f p++) ∗ f p = 0 ;

}i f ( ! p e r i o d ) p e r i o d = npo i n t s /2 ;

381 x−>x_npoints = npo i n t s ;x−>x_per iod = pe r i o d ;x−>x_n in s i g = n s i g ;x−>x_n f i l t e r s = n f i l t e r s ;x−>x_ha l f t one s = h a l f t o n e s ;

386 x−>x_template = ( t_template ∗) g e t b y t e s (0 ) ;x−>x_ntemplate = 0 ;x−>x_ i n f i l l = 0 ;x−>x_countdown = 0 ;x−>x_w i l l a t t a c k = 0 ;


391 x−>x_attacked = 0 ;x−>x_maskphase = 0 ;x−>x_debug = 0 ;x−>x_h i th r e sh = DEFHITHRESH ;x−>x_lo th r e sh = DEFLOTHRESH;

396 x−>x_masktime = DEFMASKTIME;x−>x_maskdecay = DEFMASKDECAY;x−>x_learn = 0 ;x−>x_learndebounce = c l o ck_ge t s y s t ime ( ) ;x−>x_lea rncount = 0 ;

401 x−>x_debouncedecay = DEFDEBOUNCEDECAY;x−>x_minvel = DEFMINVEL ;x−>x_use loudnes s = 0 ;x−>x_debounceve l = 0 ;x−>x_at tackb in s = DEFATTACKBINS ;

406 x−>x_sr = samp l e r a t e ;x−>x_ f i l t e r b a n k = 0 ;x−>x_hit = 0 ;f o r ( fb = b o n k_ f i l t e r b a n k l i s t ; f b ; fb = fb−>b_next )

i f ( fb−>b_n f i l t e r s == x−>x_n f i l t e r s &&411 fb−>b_ha l f t one s == x−>x_ha l f t one s &&

fb−>b_ f i r s t b i n == f i r s t b i n &&fb−>b_over lap == ov e r l a p &&fb−>b_npoints == x−>x_npoints )

{416 fb−>b_refcount++;

x−>x_ f i l t e r b a n k = fb ;break ;

}i f ( ! x−>x_ f i l t e r b a n k )

421 x−>x_ f i l t e r b a n k = bonk_newf i l t e rbank ( npo in t s , n f i l t e r s , h a l f t o n e s , o v e r l ap ,f i r s t b i n ) , x−>x_f i l t e r b ank−>b_refcount++;

}

s t a t i c vo id bonk_tick ( t_bonk ∗x ){

426 t_atom at [MAXNFILTERS ] , ∗ap , at2 [ 3 ] ;i n t i , j , k , n ;t_h i s t ∗h ;f l o a t ∗pp , v e l = 0 , t empe ra tu r e = 0 ;f l o a t ∗ f p ;

431 t_template ∗ tp ;i n t n f i t , n i n s i g = x−>x_nins ig , n temp la te = x−>x_ntemplate , n f i l t e r s = x−>

x_n f i l t e r s ;t_ i n s i g ∗gp ;

#i f d e f _MSC_VERf l o a t powerout [MAXNFILTERS∗MAXCHANNELS ] ;

436 #e l s ef l o a t ∗powerout = a l l o c a ( x−>x_n f i l t e r s ∗ x−>x_n in s i g ∗ s i z e o f (∗ powerout ) ) ;

#end i f

f o r ( i = n i n s i g , pp = powerout , gp = x−>x_in s i g ; i −−; gp++)441 {

f o r ( j = 0 , h = gp−>g_his t ; j < n f i l t e r s ; j++, h++, pp++){

f l o a t power = h−>h_outpower ;f l o a t i n t e n s i t y = ∗pp = ( power > 0 ? 100 . ∗ q r s q r t ( q r s q r t ( power ) ) : 0) ;

446 v e l += i n t e n s i t y ;t empe ra tu r e += i n t e n s i t y ∗ ( f l o a t ) j ;// pos t (" power %.12 f i n t e n s i t y %.6 f " , power , i n t e n s i t y ) ;

}}


451 i f ( v e l > 0) t empe ra tu r e /= v e l ;e l s e t empe ra tu r e = 0 ;v e l ∗= 0.5 / n i n s i g ; /∗ fudge f a c t o r ∗/i f ( x−>x_hit ){

456 /∗ i f h i t nonzero i t ’ s a c l o c k c a l l b a c k . i f i n " l e a r n " mode update thet emp la t e l i s t ; i n any even t match the h i t to known t emp l a t e s . ∗/

i f ( v e l < x−>x_debounceve l ){

461 i f ( x−>x_debug )pos t ( "bounce␣ c a n c e l l e d : ␣ v e l ␣%f ␣debounce ␣%f " ,

v e l , x−>x_debounceve l ) ;re tu rn ;

}466 i f ( v e l < x−>x_minvel )

{i f ( x−>x_debug )

pos t ( " low␣ v e l o c i t y ␣ c a n c e l l e d : ␣ v e l ␣%f , ␣m inve l ␣%f " ,v e l , x−>x_minvel ) ;

471 re tu rn ;}x−>x_debounceve l = v e l ;i f ( x−>x_learn ){

476 double l a s t t i m e = x−>x_learndebounce ;double msec = c l o c k_ge t t ime s i n c e ( l a s t t i m e ) ;i f ( ( ! n temp la te ) | | ( msec > 200) ){

i n t countup = x−>x_lea rncount ;481 /∗ no rma l i z e to 100 ∗/

f l o a t norm ;f o r ( i = n f i l t e r s ∗ n i n s i g , norm = 0 , pp = powerout ; i −−; pp++)

norm += ∗pp ∗ ∗pp ;i f ( norm < 1.0 e−15) norm = 1.0 e−15;

486 norm = 100 . f ∗ q r s q r t ( norm ) ;/∗ check i f t h i s i s the f i r s t s t r i k e f o r a new temp la t e ∗/i f ( ! countup ){

i n t o ldn = ntemp la te ;491 x−>x_ntemplate = ntemp la te = o ldn + n i n s i g ;

x−>x_template = ( t_template ∗) t_ r e s i z e b y t e s ( x−>x_template , o ldn∗ s i z e o f ( x−>x_template [ 0 ] ) , n temp la te ∗ s i z e o f ( x−>

x_template [ 0 ] ) ) ;f o r ( i = n i n s i g , pp = powerout ; i −−; o l dn++)

f o r ( j = n f i l t e r s , f p = x−>x_template [ o ldn ] . t_amp ; j−−;pp++, fp++)

496 ∗ f p = ∗pp ∗ norm ;}e l s e{

i n t o ldn = ntemp la te − n i n s i g ;501 i f ( o ldn < 0) pos t ( " bonk_tick ␣bug" ) ;

f o r ( i = n i n s i g , pp = powerout ; i −−; o l dn++){

f o r ( j = n f i l t e r s , f p = x−>x_template [ o ldn ] . t_amp ; j−−;pp++, fp++)

506 ∗ f p = ( countup ∗ ∗ f p + ∗pp ∗ norm )/( countup + 1 .0 f ) ;

}}countup++;


511 i f ( countup == x−>x_learn ) countup = 0 ;x−>x_lea rncount = countup ;

}e l s e re tu rn ;

}516 x−>x_learndebounce = c l o ck_ge t s y s t ime ( ) ;

i f ( n temp la te ){

f l o a t b e s t f i t = −1e30 ;i n t t emp la t e coun t ;

521 n f i t = −1;f o r ( i = 0 , t emp la t ecoun t = 0 , tp = x−>x_template ;

t emp la t e coun t < ntemp la te ; i++){

f l o a t dotprod = 0 ;526 f o r ( k = 0 , pp = powerout ;

k < n i n s i g && temp la t ecoun t < ntemp la te ;k++, tp++, temp la t ecoun t++)

{f o r ( j = n f i l t e r s , f p = tp−>t_amp ;

531 j−−; f p++, pp++){

i f (∗ f p < 0 | | ∗pp < 0) pos t ( " bonk_tick ␣bug␣2" ) ;dotprod += ∗ f p ∗ ∗pp ;

}536 }

i f ( dotprod > b e s t f i t ){

b e s t f i t = dotprod ;n f i t = i ;

541 }}i f ( n f i t < 0) pos t ( " bonk_tick ␣bug" ) ;

}e l s e n f i t = 0 ;

546 }e l s e n f i t = −1; /∗ h i t i s z e r o ; t h i s i s the "bang" method . ∗/

x−>x_attacked = 1 ;i f ( x−>x_debug )

551 pos t ( "bonk␣ out : ␣number␣%d , ␣ v e l ␣%f , ␣ t empe ra tu r e ␣%f " , n f i t , v e l , t empe ra tu r e );

SETFLOAT( at2 , n f i t ) ;SETFLOAT( at2+1, v e l ) ;SETFLOAT( at2+2, t empe ra tu r e ) ;

556 o u t l e t _ l i s t ( x−>x_cookedout , 0 , 3 , at2 ) ;

f o r ( n = 0 , gp = x−>x_in s i g + ( n i n s i g −1) ,pp = powerout + n f i l t e r s ∗ ( n i n s i g −1) ; n < n i n s i g ;

n++, gp−−, pp −= n f i l t e r s )561 {

f l o a t ∗pp2 ;f o r ( i = 0 , ap = at , pp2 = pp ; i < n f i l t e r s ;

i ++, ap++, pp2++){

566 ap−>a_type = A_FLOAT;ap−>a_w. w_float = ∗pp2 ;

}o u t l e t _ l i s t ( gp−>g_out let , 0 , n f i l t e r s , a t ) ;

}571 }


// r e p o r t the a t t a c ks t a t i c vo id bonk_doit ( t_bonk ∗x ){

576 i n t i , j , ch , n ;t _ f i l t e r k e r n e l ∗k ;t_h i s t ∗h ;f l o a t growth = 0 , ∗ fp1 , ∗ fp3 , ∗ fp4 , h i t h r e s h , l o t h r e s h ;i n t n i n s i g = x−>x_nins ig , n f i l t e r s = x−>x_n f i l t e r s ,

581 maskphase = x−>x_maskphase , nextphase , o ldmaskphase ;t_ i n s i g ∗gp ;nex tphase = maskphase + 1 ;i f ( nex tphase >= MASKHIST)

nex tphase = 0 ;586 x−>x_maskphase = nex tphase ;

o ldmaskphase = nex tphase − x−>x_at tackb in s ;i f ( o ldmaskphase < 0)

o ldmaskphase += MASKHIST ;i f ( x−>x_use loudnes s )

591 h i t h r e s h = q r s q r t ( q r s q r t ( x−>x_h i th r e sh ) ) ,l o t h r e s h = q r s q r t ( q r s q r t ( x−>x_lo th r e sh ) ) ;

e l s e h i t h r e s h = x−>x_hi th resh , l o t h r e s h = x−>x_lo th r e sh ;f o r ( ch = 0 , gp = x−>x_in s i g ; ch < n i n s i g ; ch++, gp++){

596 f o r ( i = 0 , k = x−>x_f i l t e r b ank−>b_vec , h = gp−>g_his t ;i < n f i l t e r s ; i ++, k++, h++)

{f l o a t power = 0 , maskpow = h−>h_mask [ maskphase ] ;f l o a t ∗ i n b u f= gp−>g_inbuf + k−>k_sk i ppo i n t s ;

601 i n t countup = h−>h_countup ;i n t f i l t e r p o i n t s = k−>k_ f i l t e r p o i n t s ;/∗ i f the u s e r asked f o r more f i l t e r s t ha t f i t under theNyqu i s t f r equency , some f i l t e r s won ’ t a c t u a l l y be f i l l e d i nso we s k i p runn ing them . ∗/

606 i f ( ! f i l t e r p o i n t s ){

h−>h_countup = 0 ;h−>h_mask [ nex tphase ] = 0 ;h−>h_power = 0 ;

611 cont inue ;}// f o r each f i l t e r :/∗ run the f i l t e r r e p e a t e d l y , s l i d i n g i t f o rwa rd by hoppo in t s ,

f o r nhop t imes ∗/616 f o r ( fp1 = inbu f , n = 0 ; n < k−>k_nhops ; fp1 += k−>k_hoppoints , n++)

{f l o a t rsum = 0 , isum = 0 ;f o r ( fp3 = fp1 , fp4 = k−>k_stuf f , j = f i l t e r p o i n t s ; j −−;){

621 // //////////////////////////////////////// c a l c u l a t i n g the power f o r each f i l t e r ///g=the i npu t b u f f e r ////////////////////// fp4= fp [ 0 ] e fp [ 1 ] //////////////// //////////////////////////////////////

626 f l o a t g = ∗ fp3++;rsum += g ∗ ∗ fp4++;isum += g ∗ ∗ fp4++;

}power += rsum ∗ rsum + isum ∗ isum ;

631 // pos t (" power %.12 f " , power ) ; // c a p i r e se p o s s i b i l e dec imare iv a l o r i d i power da po s t a r e i n max window )

}


i f ( ! x−>x_w i l l a t t a c k )h−>h_before = maskpow ;

636 i f ( power > h−>h_mask [ o ldmaskphase ] ){

i f ( x−>x_use loudnes s )growth += q r s q r t ( q r s q r t ( power /(h−>h_mask [ o ldmaskphase ] + 1 .0 e

−15) ) ) − 1 . f ;e l s e growth += power /(h−>h_mask [ o ldmaskphase ] + 1 .0 e−15) − 1 . f ;

641 // pos t (" power %.12 f h−>h_mask [ o ldmaskphase ] %.12 f growth %.12 f " ,power , h−>h_mask [ o ldmaskphase ] , growth ) ;

}i f ( ! x−>x_w i l l a t t a c k && countup >= x−>x_masktime )

maskpow ∗= x−>x_maskdecay ;646

i f ( power > maskpow ){

maskpow = power ;countup = 0 ;

651 }countup++;h−>h_countup = countup ;h−>h_mask [ nex tphase ] = maskpow ;h−>h_power = power ;

656 }}i f ( x−>x_w i l l a t t a c k ) //an a t t a c k i s r e p o r t e d{

// however we won ’ t a c t u a l l y r e p o r t the a t t a c k u n t i l the spectrum stopgrowing . growth must d e c r e a s e below l o t h r e s h .

661 i f ( x−>x_w i l l a t t a c k > MAXATTACKWAIT | | growth < x−>x_lo th r e sh ){

/∗ i f haven ’ t yet , and i f not i n spew mode , r e p o r t a h i t ∗/i f ( ! x−>x_spew && ! x−>x_attacked ){

666 f o r ( ch = 0 , gp = x−>x_in s i g ; ch < n i n s i g ; ch++, gp++)f o r ( i = n f i l t e r s , h = gp−>g_his t ; i −−; h++)

h−>h_outpower = h−>h_mask [ nex tphase ] ;x−>x_hit = 1 ;// s e t s a c l o c k to go o f f n m i l l i s e c o n d s from the c u r r e n t l o g i c a l

t ime wi th c l o ck_de l ay ( c l o c k to s chedu l e , n (ms) )671 // Schedu l e the e x e c u t i o n o f a Clock

c l o ck_de l ay ( x−>x_clock , 0) ;}

}i f ( growth < x−>x_lo th r e sh )

676 x−>x_w i l l a t t a c k = 0 ;e l s e x−>x_w i l l a t t a c k++;

}e l s e i f ( growth > x−>x_h i th r e sh ){

681 i f ( x−>x_debug ) pos t ( " a t t a c k : ␣ growth ␣=␣%f " , growth ) ;x−>x_w i l l a t t a c k = 1 ;x−>x_attacked = 0 ;f o r ( ch = 0 , gp = x−>x_in s i g ; ch < n i n s i g ; ch++, gp++)

f o r ( i = n f i l t e r s , h = gp−>g_his t ; i −−; h++)686 h−>h_mask [ nex tphase ] = h−>h_power , h−>h_countup = 0 ;

}

// spew mode a lways output data f o r e v e r y per fo rmed a n a l y s i s/∗ i f i n " spew" mode j u s t a lways output ∗/


691 i f ( x−>x_spew ){

f o r ( ch = 0 , gp = x−>x_in s i g ; ch < n i n s i g ; ch++, gp++)f o r ( i = n f i l t e r s , h = gp−>g_his t ; i −−; h++)

h−>h_outpower = h−>h_power ;696 x−>x_hit = 0 ;

c l o ck_de l ay ( x−>x_clock , 0) ;}x−>x_debounceve l ∗= x−>x_debouncedecay ;

}701

//4//PERFORM ROUTINE// I t r e c e i v e s a p o i n t e r to a p i e c e o f the DSP cha i n and i t i s e xpec t ed to r e t u r n

the l o c a t i o n o f the next pe r fo rm r o u t i n e on the cha i n .//The next l o c a t i o n i s de te rm ined by the number o f arguments s p e c i f i e d f o r the

per fo rm r o u t i n e w i th the c a l l to dsp_add ( ) .// For example , i f we pas s t h r e e arguments , we need to r e t u r n w + 4 .

706 s t a t i c t_ in t ∗bonk_perform ( t_int ∗w){

t_bonk ∗x = ( t_bonk ∗) (w [ 1 ] ) ;i n t n = ( i n t ) (w [ 2 ] ) ; // v e c t o r s i z ei n t onse t = 0 ;

711 i f ( x−>x_countdown >= n)x−>x_countdown −= n ;

e l s e{

i n t i , j , n i n s i g = x−>x_n in s i g ;716 t_ i n s i g ∗gp ;

i f ( x−>x_countdown > 0){

n −= x−>x_countdown ;on s e t += x−>x_countdown ;

721 x−>x_countdown = 0 ;}whi le ( n > 0){

i n t i n f i l l = x−>x_ i n f i l l ;726 i n t m = (n < ( x−>x_npoints − i n f i l l ) ?

n : ( x−>x_npoints − i n f i l l ) ) ;f o r ( i = 0 , gp = x−>x_in s i g ; i < n i n s i g ; i ++, gp++){

f l o a t ∗ f p = gp−>g_inbuf + i n f i l l ;731 t_ f l o a t ∗ i n 1 = gp−>g_invec + onse t ;

f o r ( j = 0 ; j < m; j++)∗ f p++ = ∗ i n 1++;

}i n f i l l += m;

736 x−>x_ i n f i l l = i n f i l l ;//when i npu t i s f i l l e d w i th npo i n t samples , bonk_doit !i f ( i n f i l l == x−>x_npoints ){

bonk_doit ( x ) ;741

/∗ s h i f t o r c l e a r the i n pu t b u f f e r and update c oun t e r s ∗/i f ( x−>x_per iod > x−>x_npoints )

x−>x_countdown = x−>x_per iod − x−>x_npoints ;e l s e x−>x_countdown = 0 ;

746 i f ( x−>x_per iod < x−>x_npoints ){

i n t o v e r l a p = x−>x_npoints − x−>x_per iod ;f l o a t ∗ fp1 , ∗ fp2 ;f o r ( n = 0 , gp = x−>x_in s i g ; n < n i n s i g ; n++, gp++)


751 f o r ( i = ove r l ap , fp1 = gp−>g_inbuf ,fp2 = fp1 + x−>x_per iod ; i −−;)

∗ fp1++ = ∗ fp2++;x−>x_ i n f i l l = o v e r l a p ;

}756 e l s e x−>x_ i n f i l l = 0 ;

}n −= m;onse t += m;

}761 }

re tu rn (w+3) ;}

//3//DSP METHOD766 //From MAX 5 API : The dsp method s p e c i f i e s the s i g n a l p r o c e s s i n g f u n c t i o n your

o b j e c t d e f i n e s a l ong wi th i t s arguments .//The o b j e c t ’ s dsp method w i l l be c a l l e d whenever the MSP s i g n a l c omp i l e r i s

b u i l d i n g a sequence o f o p e r a t i o n s ( known as the DSP Chain ) t ha t w i l l beper fo rmed on each s e t o f aud io samples .

//The op e r a t i o n sequence c o n s i s t s o f a p o i n t e r s to f u n c t i o n s ( c a l l e d per fo rmr o u t i n e s ) f o l l ow e d by arguments to tho s e f u n c t i o n s .

s t a t i c vo id bonk_dsp ( t_bonk ∗x , t_ s i g n a l ∗∗ sp ){

771 i n t i , n = sp [0]−>s_n , n i n s i g = x−>x_n in s i g ;t_ i n s i g ∗gp ;

x−>x_sr = sp [0]−>s_sr ;

776 f o r ( i = 0 , gp = x−>x_in s i g ; i < n i n s i g ; i ++, gp++)gp−>g_invec = (∗ ( sp++))−>s_vec ;

// adds your o b j e c t ’ s pe r fo rm method to the DSP c a l l c ha i n w i th dsp_add ( o b j e c t ’ spe r fo rm rou t i n e , #, . . . ) and s p e c i f i e s the arguments i t w i l l be pas sed .

//The per fo rm r o u t i n e i s used f o r p r o c e s s i n g aud io .781 //#=The number o f arguments tha t w i l l f o l l o w

// . . . the argumentsdsp_add ( bonk_perform , 2 , x , n ) ;

}

786 s t a t i c vo id bonk_thresh ( t_bonk ∗x , t_ f l o a t a r g f1 , t_ f l o a t a r g f 2 ){

i f ( f 1 > f2 )pos t ( "bonk : ␣warn ing : ␣ low␣ t h r e s h o l d ␣ g r e a t e r ␣ than ␣ h i ␣ t h r e s h o l d " ) ;

x−>x_lo th r e sh = ( f1 <= 0 ? 0.0001 : f 1 ) ;791 x−>x_h i th r e sh = ( f2 <= 0 ? 0.0001 : f 2 ) ;

}

s t a t i c vo id bonk_pr int ( t_bonk ∗x , t_ f l o a t a r g f ){

796 i n t i ;po s t ( " t h r e s h ␣%f ␣%f " , x−>x_lothre sh , x−>x_h i th r e sh ) ;po s t ( "mask␣%d␣%f " , x−>x_masktime , x−>x_maskdecay ) ;po s t ( " at tack−b i n s ␣%d" , x−>x_at tackb in s ) ;po s t ( " debounce ␣%f " , x−>x_debouncedecay ) ;

801 pos t ( "m inve l ␣%f " , x−>x_minvel ) ;po s t ( " spew␣%d" , x−>x_spew ) ;pos t ( " u s e l o udn e s s ␣%d" , x−>x_use loudnes s ) ;

po s t ( "number␣ o f ␣ t emp l a t e s ␣%d" , x−>x_ntemplate ) ;806 i f ( x−>x_learn ) pos t ( " l e a r n ␣mode" ) ;

i f ( f != 0)


{i n t j , n i n s i g = x−>x_n in s i g ;t_ i n s i g ∗gp ;

811 f o r ( j = 0 , gp = x−>x_in s i g ; j < n i n s i g ; j++, gp++){

t_h i s t ∗h ;i f ( n i n s i g > 1) pos t ( " i npu t ␣%d : " , j +1) ;f o r ( i = x−>x_n f i l t e r s , h = gp−>g_his t ; i −−; h++)

816 pos t ( "pow␣%f ␣mask␣%f ␣ b e f o r e ␣%f ␣ count ␣%d" ,h−>h_power , h−>h_mask [ x−>x_maskphase ] ,h−>h_before , h−>h_countup ) ;

}pos t ( " f i l t e r ␣ d e t a i l s ␣ ( f r e q u e n c i e s ␣ a r e ␣ i n ␣ u n i t s ␣ o f ␣%.2 f−Hz . ␣ b i n s ) : " ,

821 x−>x_sr ) ;f o r ( j = 0 ; j < x−>x_n f i l t e r s ; j++)

pos t ( "%2d␣␣ c f ␣%.2 f ␣␣bw␣%.2 f ␣␣nhops ␣%d␣hop␣%d␣ s k i p ␣%d␣ npo i n t s ␣%d" ,j ,x−>x_f i l t e r b ank−>b_vec [ j ] . k_cente r f r eq ,

826 x−>x_f i l t e r b ank−>b_vec [ j ] . k_bandwidth ,x−>x_f i l t e r b ank−>b_vec [ j ] . k_nhops ,x−>x_f i l t e r b ank−>b_vec [ j ] . k_hoppoints ,x−>x_f i l t e r b ank−>b_vec [ j ] . k_sk ippo in t s ,x−>x_f i l t e r b ank−>b_vec [ j ] . k _ f i l t e r p o i n t s ) ;

831 }i f ( x−>x_debug ) pos t ( "debug␣mode" ) ;

}

s t a t i c vo id bonk_forget ( t_bonk ∗x )836 {

i n t ntemp la te = x−>x_ntemplate , newn = ntemp la te − x−>x_n in s i g ;i f ( newn < 0) newn = 0 ;x−>x_template = ( t_template ∗) t_ r e s i z e b y t e s ( x−>x_template ,

x−>x_ntemplate ∗ s i z e o f ( x−>x_template [ 0 ] ) ,841 newn ∗ s i z e o f ( x−>x_template [ 0 ] ) ) ;

x−>x_ntemplate = newn ;x−>x_lea rncount = 0 ;

}

846 s t a t i c vo id bonk_bang ( t_bonk ∗x ){

i n t i , ch ;x−>x_hit = 0 ;t_ i n s i g ∗gp ;

851 f o r ( ch = 0 , gp = x−>x_in s i g ; ch < x−>x_n in s i g ; ch++, gp++){

t_h i s t ∗h ;f o r ( i = 0 , h = gp−>g_his t ; i < x−>x_n f i l t e r s ; i ++, h++)

h−>h_outpower = h−>h_power ;856 }

bonk_tick ( x ) ;}

s t a t i c vo id bonk_read ( t_bonk ∗x , t_symbol ∗ s )861 {

FILE ∗ f d = fopen ( s−>s_name , " r " ) ;f l o a t vec [MAXNFILTERS ] ;i n t i , n t emp la te = 0 , r ema in i ng ;f l o a t ∗ fp , ∗ fp2 ;

866 i f ( ! fd ){

pos t ( "%s : ␣open␣ f a i l e d " , s−>s_name) ;re tu rn ;


}871 x−>x_template = ( t_template ∗) t_ r e s i z e b y t e s ( x−>x_template ,

x−>x_ntemplate ∗ s i z e o f ( t_template ) , 0) ;whi le (1 ){

f o r ( i = x−>x_n f i l t e r s , f p = vec ; i −−; f p++)876 i f ( f s c a n f ( fd , "%f " , fp ) < 1) goto nomore ;

x−>x_template = ( t_template ∗) t_ r e s i z e b y t e s ( x−>x_template ,n temp la te ∗ s i z e o f ( t_template ) ,

( n temp la te + 1) ∗ s i z e o f ( t_template ) ) ;f o r ( i = x−>x_n f i l t e r s , f p = vec ,

881 fp2 = x−>x_template [ n temp la te ] . t_amp ; i −−;)∗ fp2++ = ∗ f p++;

ntemp la te++;}

nomore :886 i f ( r ema in i ng = ( ntemp la te % x−>x_n in s i g ) )

{pos t ( "bonk_read : ␣%d␣ t emp l a t e s ␣ not ␣a␣ mu l t i p l e ␣ o f ␣%d ; ␣ d ropp ing ␣ e x t r a s " ) ;x−>x_template = ( t_template ∗) t_ r e s i z e b y t e s ( x−>x_template ,

n temp la te ∗ s i z e o f ( t_template ) ,891 ( ntemp la te − r ema in i ng ) ∗ s i z e o f ( t_template ) ) ;

n temp la te = ntemp la te − r ema in i ng ;}pos t ( "bonk : ␣ read ␣%d␣ t emp l a t e s \n" , n temp la te ) ;x−>x_ntemplate = ntemp la te ;

896 f c l o s e ( fd ) ;}

s t a t i c vo id bonk_write ( t_bonk ∗x , t_symbol ∗ s ){

901 FILE ∗ f d = fopen ( s−>s_name , "w" ) ;i n t i , n t emp la te = x−>x_ntemplate ;t_template ∗ tp = x−>x_template ;f l o a t ∗ f p ;i f ( ! fd )

906 {pos t ( "%s : ␣ cou ldn ’ t ␣ c r e a t e " , s−>s_name) ;re tu rn ;

}f o r ( ; ntemplate−−; tp++)

911 {f o r ( i = x−>x_n f i l t e r s , f p = tp−>t_amp ; i −−; f p++)

f p r i n t f ( fd , "%6.2 f ␣" , ∗ f p ) ;f p r i n t f ( fd , "\n" ) ;

}916 pos t ( "bonk : ␣wrote ␣%d␣ t emp l a t e s \n" , x−>x_ntemplate ) ;

f c l o s e ( fd ) ;}

// f r e e f u n t i o n921 s t a t i c vo id bonk_free ( t_bonk ∗x )

{

i n t i , n i n s i g = x−>x_n in s i g ;t_ i n s i g ∗gp = x−>x_in s i g ;

926 #i f d e f MSPdsp_free ( ( t_pxob jec t ∗) x ) ;

#end i ff o r ( i = 0 , gp = x−>x_in s i g ; i < n i n s i g ; i ++, gp++)

f r e e b y t e s ( gp−>g_inbuf , x−>x_npoints ∗ s i z e o f ( f l o a t ) ) ;931 c l o c k_ f r e e ( x−>x_clock ) ;


i f (!−−(x−>x_f i l t e r b ank−>b_refcount ) )b o n k_ f r e e f i l t e r b a n k ( x−>x_ f i l t e r b a n k ) ;

}

936 //1// INITILIZATION ROUTINEvo id main ( ){

t_c l a s s ∗c ;t_ob jec t ∗ a t t r ;

941 long a t t r f l a g s = 0 ;t_symbol ∗ sym_long = gensym ( " long " ) , ∗ sym_float32 = gensym ( " f l o a t 3 2 " ) ;

//NEW INSTANCE ROUTINE// c r e a t e s new i n s t a n c e o f the c l a s s w i th c lass_new (name , mnew , mfree , s i z e ( i n

b y t e s ) o f the data s t r u c t u r e , mmenu , type ( most o f t e n A_GIMME, 0) , 0 . )946 //mnew=The i n s t a n c e c r e a t i o n f u n c t i o n

// mfree=The i n s t a n c e f r e e f u n c t i o n//mmenu=The f u n c t i o n c a l l e d when the u s e r c r e a t e s a new ob j e c t o f the c l a s s

from the Patch window ’ s p a l e t t e ( UI o b j e c t s on l y ) , 0L i f you ’ r e notd e f i n i n g a UI ob j e c t ,

// type=A s tanda rd Max type . The f i n a l argument o f the type l i s t s hou l d be a 0 .Gen e r a l l y , obex o b j e c t s have a s i n g l e type argument , A_GIMME, f o l l ow e d by a0 .

951 c = class_new ( "bonk3~" , (method )bonk_new , (method ) bonk_free , s i z e o f ( t_bonk ) , (method ) 0L , A_GIMME, 0) ;

c l a s s_ob e x o f f s e t_ s e t ( c , c a l c o f f s e t ( t_bonk , obex ) ) ; // c a l c o f f s e t c a l c u l a t e sbyte−o f f s e t from the b eg i nn i n g o f bonk s t r u c t u r e . The v a l u e i s s t o r e i nobex f i e l d o f same s t r u c t u r e .

//NEW ATTRIBUTES956 // c r e a t e s ( new ) a t t r i b u t e w i th at t r_of f se t_new (name , type , a t t r i b u t e i s f o r

s e t t i n g / query f l a g , method (NULL i s d e f a u l t method ) get , method se t , byte−o f f s e t ) .

// adds a t t r i b u t e to the o b j e c t o f the c l a s s . w i th c l a s s_adda t r ( )a t t r = at t r_of f se t_new ( " npo i n t s " , sym_long , a t t r f l a g s , (method ) 0L , (method ) 0L ,

c a l c o f f s e t ( t_bonk , x_npoints ) ) ;c l a s s_adda t t r ( c , a t t r ) ;

961 a t t r = at t r_of f se t_new ( "hop" , sym_long , a t t r f l a g s , (method ) 0L , (method ) 0L ,c a l c o f f s e t ( t_bonk , x_per iod ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " n f i l t e r s " , sym_long , a t t r f l a g s , (method ) 0L , (method ) 0L ,c a l c o f f s e t ( t_bonk , x_ n f i l t e r s ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;966

a t t r = at t r_of f se t_new ( " h a l f t o n e s " , sym_float32 , a t t r f l a g s , (method ) 0L , (method) 0L , c a l c o f f s e t ( t_bonk , x_ha l f t one s ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " o v e r l a p " , sym_float32 , a t t r f l a g s , (method ) 0L , (method ) 0L , c a l c o f f s e t ( t_bonk , x_over lap ) ) ;

971 c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " f i r s t b i n " , sym_float32 , a t t r f l a g s , (method ) 0L , (method )0L , c a l c o f f s e t ( t_bonk , x_ f i r s t b i n ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

976 a t t r = at t r_of f se t_new ( "minve l " , sym_float32 , a t t r f l a g s , (method ) 0L , (method )bonk_minvel_set , c a l c o f f s e t ( t_bonk , x_minvel ) ) ;


c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " l o t h r e s h " , sym_float32 , a t t r f l a g s , (method ) 0L , (method )bonk_lothresh_set , c a l c o f f s e t ( t_bonk , x_ lo th r e sh ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;981

a t t r = at t r_of f se t_new ( " h i t h r e s h " , sym_float32 , a t t r f l a g s , (method ) 0L , (method )bonk_hithresh_set , c a l c o f f s e t ( t_bonk , x_h i th r e sh ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( "masktime" , sym_long , a t t r f l a g s , (method ) 0L , (method )bonk_masktime_set , c a l c o f f s e t ( t_bonk , x_masktime ) ) ;

986 c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( "maskdecay" , sym_float32 , a t t r f l a g s , (method ) 0L , (method) bonk_maskdecay_set , c a l c o f f s e t ( t_bonk , x_maskdecay ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

991 a t t r = at t r_of f se t_new ( " debouncedecay " , sym_float32 , a t t r f l a g s , (method ) 0L , (method ) bonk_debouncedecay_set , c a l c o f f s e t ( t_bonk , x_debouncedecay ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( "debug" , sym_long , a t t r f l a g s , (method ) 0L , (method )bonk_debug_set , c a l c o f f s e t ( t_bonk , x_debug ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;996

a t t r = at t r_of f se t_new ( "spew" , sym_long , a t t r f l a g s , (method ) 0L , (method )bonk_spew_set , c a l c o f f s e t ( t_bonk , x_spew ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " u s e l o udn e s s " , sym_long , a t t r f l a g s , (method ) 0L , (method )bonk_use loudness_set , c a l c o f f s e t ( t_bonk , x_use loudnes s ) ) ;

1001 c l a s s_adda t t r ( c , a t t r ) ;

a t t r = at t r_of f se t_new ( " a t t a c k b i n s " , sym_long , a t t r f l a g s , (method ) 0L , (method )bonk_attackb ins_set , c a l c o f f s e t ( t_bonk , x_at tackb in s ) ) ;

c l a s s_adda t t r ( c , a t t r ) ;

1006 //METHODS!// adds method to o b j e c t o f the c l a s s w i th c lass_addmethod ( c l a s s po i n t e r , m,

name , type , 0)//m=f u n c t i o n ge t c a l l e d when method i s invoquedc lass_addmethod ( c , (method ) bonk_dsp , " dsp" , A_CANT, 0) ;

1011 class_addmethod ( c , (method ) bonk_bang , "bang" , A_CANT, 0) ;c lass_addmethod ( c , (method ) bonk_forget , " f o r g e t " , 0) ;c lass_addmethod ( c , (method ) bonk_learn , " l e a r n " , A_LONG, 0) ;c lass_addmethod ( c , (method ) bonk_thresh , " t h r e s h " , A_FLOAT, A_FLOAT, 0) ;c lass_addmethod ( c , (method ) bonk_print , " p r i n t " , A_DEFFLOAT, 0) ;

1016 class_addmethod ( c , (method ) bonk_read , " read " , A_DEFSYM, 0) ;c lass_addmethod ( c , (method ) bonk_write , " w r i t e " , A_DEFSYM, 0) ;c lass_addmethod ( c , (method ) bonk_ass i s t , " a s s i s t " , A_CANT, 0) ;

// adds s p e c i a l obex methods1021 class_addmethod ( c , (method ) object_obex_dumpout , "dumpout" , A_CANT, 0) ;

c lass_addmethod ( c , (method ) ob jec t_obex_qu ick re f , " q u i c k r e f " , A_CANT, 0) ;

// adds some s t anda rd method h and l e r s f o r i n t e r n a l messages used by a l l MSPo b j e c t s w i th c l a s s_ d s p i n i t ( c l a s s p o i n t e r )

c l a s s_ d s p i n i t ( c ) ;1026


    // registers a previously defined object class with class_register(name_space, class pointer).
    // This function is required, and should be called at the end of main().
    // namespace = the desired class's name space. Typically CLASS_BOX for obex classes,
    // or CLASS_NOBOX for classes which will only be used internally.
    class_register(CLASS_BOX, c);
    bonk_class = c;

    post("\n");
    post("BonkOMM~ v 1.0 - detects attacks in audio signals");
    post("Zengi revision");
    post("Original by Miller Puckette and Ted Appel, http://crca.ucsd.edu/~msp/");
    post("\n");
}

//2// NEW INSTANCE ROUTINE
static void *bonk_new(t_symbol *s, long ac, t_atom *av)
{
    short j;
    t_bonk *x;

    // CREATE INSTANCE
    // creates an instance of the object class by allocating memory with object_alloc(class pointer):
    //     void *object_alloc(t_class *c);
    // Its use is required with obex-class objects, inside the object's new instance routine.
    if (x = (t_bonk *)object_alloc(bonk_class)) {
        t_insig *g;

        x->x_npoints = DEFNPOINTS;
        x->x_period = DEFPERIOD;
        x->x_nfilters = DEFNFILTERS;
        x->x_halftones = DEFHALFTONES;
        x->x_firstbin = DEFFIRSTBIN;
        x->x_overlap = DEFOVERLAP;
        x->x_ninsig = 1;

        x->x_hithresh = DEFHITHRESH;
        x->x_lothresh = DEFLOTHRESH;
        x->x_masktime = DEFMASKTIME;
        x->x_maskdecay = DEFMASKDECAY;
        x->x_debouncedecay = DEFDEBOUNCEDECAY;
        x->x_minvel = DEFMINVEL;
        x->x_attackbins = DEFATTACKBINS;

        if (!x->x_period) x->x_period = x->x_npoints/2;
        x->x_template = (t_template *)getbytes(0);
        x->x_ntemplate = 0;
        x->x_infill = 0;
        x->x_countdown = 0;
        x->x_willattack = 0;
        x->x_attacked = 0;
        x->x_maskphase = 0;
        x->x_debug = 0;
        x->x_learn = 0;
        x->x_learndebounce = clock_getsystime();
        x->x_learncount = 0;


        x->x_useloudness = 0;
        x->x_debouncevel = 0;

        x->x_sr = sys_getsr();    /* gets the sample rate */

        /* something useful for debug
        if (ac) {
            switch (av[0].a_type) {
                case A_LONG:
                    x->x_ninsig = av[0].a_w.w_long;
                    break;
            }
        }
        if (x->x_ninsig < 1) x->x_ninsig = 1;
        if (x->x_ninsig > MAXCHANNELS) x->x_ninsig = MAXCHANNELS; */

        // takes an atom list and properly sets any attributes described within. The simplest
        // way to do this is to use the function attr_args_process(object whose attributes
        // will be processed, ac, av)
        // ac = the count of t_atoms in av
        // av = an atom list
        // The function attr_args_process(x, ac, av) is typically used in the object's new
        // instance routine to conveniently process attribute arguments.
        attr_args_process(x, ac, av);

        x->x_insig = (t_insig *)getbytes(x->x_ninsig * sizeof(*x->x_insig));

        // CREATE INLETS
        // creates the signal inlets with dsp_setup((cast to t_pxobject) object pointer, nsignals),
        // so you need not make them yourself!
        // nsignals = the number of signal/proxy inlets to create for the object. If the object
        // has no signal inlets, you may pass 0.
        dsp_setup((t_pxobject *)x, x->x_ninsig);

        // CREATE OUTLETS
        // stores the dumpout outlet in the obex with the generic function
        // object_obex_store(object pointer, key, val). The dumpout outlet is the one used by
        // attributes to report data in response to 'get' queries.
        // key = a symbolic name for the data to be stored
        // val = a t_object *, to be stored in the obex, referenced under the key
        // The generic case is normally adapted to be used as follows:
        //     object_obex_store(x, _sym_dumpout, outlet_new(x, NULL));
        // creates new outlets with outlet_new(object, s).
        // s = a C string specifying the message that will be sent out this outlet, or NULL to
        // indicate that the outlet will be used to send various messages.
        object_obex_store(x, gensym("dumpout"), outlet_new(x, NULL));

        // creates an outlet that will ALWAYS send the list message with listout((t_object *) object).
        x->x_cookedout = listout((t_object *)x);

        for (j = 0, g = x->x_insig + x->x_ninsig - 1; j < x->x_ninsig; j++, g--) {
            g->g_outlet = listout((t_object *)x);
        }

        // CLOCK
        // creates a new Clock object with clock_new(object pointer, (method)fn). This function
        // is normally called in the new instance routine function.
        // fn = function to be called when the clock goes off; conventionally named object_tick
        // (here bonk_tick).
        // The Clock object is used as interface to the Max scheduler.


        x->x_clock = clock_new(x, (method)bonk_tick);

        bonk_donew(x, x->x_npoints, x->x_period, x->x_ninsig, x->x_nfilters,
            x->x_halftones, x->x_overlap, x->x_firstbin, sys_getsr());
    }
    return (x);
}

/* Attribute setters. */
void bonk_minvel_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < 0) f = 0;
        x->x_minvel = f;
    }
}

void bonk_lothresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f > x->x_hithresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_lothresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_hithresh_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        if (f < x->x_lothresh)
            post("bonk: warning: low threshold greater than hi threshold");
        x->x_hithresh = (f <= 0 ? 0.0001 : f);
    }
}

void bonk_masktime_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_masktime = (n < 0) ? 0 : n;
    }
}

void bonk_maskdecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_maskdecay = f;
    }
}

void bonk_debouncedecay_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        float f = atom_getfloat(av);
        f = (f < 0) ? 0 : f;
        f = (f > 1) ? 1 : f;
        x->x_debouncedecay = f;
    }
}

void bonk_debug_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_debug = (n != 0);
    }
}

void bonk_spew_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_spew = (n != 0);
    }
}

void bonk_useloudness_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        x->x_useloudness = (n != 0);
    }
}

void bonk_attackbins_set(t_bonk *x, void *attr, long ac, t_atom *av)
{
    if (ac && av) {
        int n = atom_getlong(av);
        n = (n < 1) ? 1 : n;
        n = (n > MASKHIST) ? MASKHIST : n;
        x->x_attackbins = n;
    }
}
/* end attr setters */

void bonk_assist(t_bonk *x, void *b, long m, long a, char *s)
{
    if (m == ASSIST_INLET)
        strcpy(s, "(Signal) Audio Input, Analysis Attributes");
    else if (m == ASSIST_OUTLET) {
        switch (a) {
            case 0: strcpy(s, "(List) Raw Filter Amplitudes"); break;
            case 1: strcpy(s, "(List) Instrument Number, Loudness, Temperature"); break;
            case 2: strcpy(s, "Dump"); break;
        }
    }
}

static void bonk_learn(t_bonk *x, int n)
{
    if (n < 0) n = 0;
    if (n)
    {
        x->x_template = (t_template *)t_resizebytes(x->x_template,
            x->x_ntemplate * sizeof(x->x_template[0]), 0);
        x->x_ntemplate = 0;
    }
    x->x_learn = n;
    x->x_learncount = 0;
}

/* get current system time */
double clock_getsystime()
{
    return gettime();
}

/* get the elapsed time since the given system time, in milliseconds */
double clock_gettimesince(double prevsystime)
{
    return ((gettime() - prevsystime));
}

float qrsqrt(float f)
{
    return 1/sqrt(f);
}


Figure B.1: Max patcher window showing our test patch realized to analyze the OMM sounds with bonk∼ 3.0


Appendix C

Writing External for Max/MSP with XCode

Max 5 and XCode 3.X

This section is presented to familiarize the reader with the Max/MSP environment and to show how it can be extended by creating external objects.

For this thesis, the tutorial of Zicarelli [62] has been taken into account, together with two other available tutorials; since no existing paper was found on how to write externals for Max 5, the three tutorials have been adapted. In writing an external object for Max, the task is to write a shared library in C that is loaded and called by the “master environment” and in turn calls upon helpful routines back in the master environment. You create a class, i.e. a template for the behaviour of an object; instances of this class “do the work” of the object when they are sent messages. Your external object definition will:

Define the class: its data structure, size, and how instances are to be created and destroyed

Define functions (called methods) that will respond to various messages, performing some action

The name externals refers to all the external objects of Max/MSP, i.e. those not included in the standard distribution. An external can therefore be any object created by yourself (in some programming language) or developed by somebody else. In chapters 4 and 5, bonk∼ , an external object originally developed by Miller Puckette, is extensively analyzed and modified for the purpose of onset detection. The need to write an external arises when one or more specific tasks must be added to the logical and arithmetic units or to the DSP chain of the software.

First, we downloaded the Max 5 Software Development Kit (SDK) from cycling74.com, which includes the frameworks, the API reference and some examples. The frameworks contain the header files where the Max/MSP standard functions and structs are defined. The various parts of the frameworks are described in the API reference, so new externals must be written, and existing ones modified, according to the Max/MSP API reference.

Since most objects are written in C, we now describe the process of developing objects in C, although externals can also be created in other programming languages. To develop the external we used Xcode (version 3.1.2), the latest version of the native IDE of Apple Mac OS X. Let us look at an Xcode example to understand the most significant contents of a project:

Figure C.1: XCode main window

The Source folder includes the source code you develop, typically a single file named yourobject.c. The External Frameworks and Libraries folder is the place where to add the MaxAPI and MaxAudioAPI (MSP) frameworks. The Products folder contains the external created after compiling the source code, while Targets hold the options of the compiler. The objects created are single files with the .mxo extension, but they only seem to be plain files, because they hide their contents: this is what under Mac OS is called a "bundle", or simply a package. A bundle contains a list of files and folders, like the ones shown in this view:

Figure C.2: a Bundle

You can create either Max or MSP externals, depending on the requirements, with some differences in their structure: essentially, MSP externals are the ones which involve audio DSP, such as the one used here, while Max externals are logic and arithmetic objects.
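As a rough illustration of this structural difference, consider the following sketch (the myobj names are hypothetical and not part of the OMM code; the SDK headers ext.h, ext_obex.h and z_dsp.h and the calls dsp_setup() and class_dspinit() are the same ones that appear in the bonk∼ listing of Appendix B):

/* Sketch only: hypothetical structs showing the Max vs. MSP object headers. */
#include "ext.h"        /* standard Max header */
#include "ext_obex.h"   /* obex / attribute support */
#include "z_dsp.h"      /* MSP (signal) support, defines t_pxobject */

/* A plain Max (logic/arithmetic) object embeds a t_object header: */
typedef struct _myobj {
    t_object  m_obj;     /* Max object header */
    long      m_value;   /* object-specific data */
} t_myobj;

/* An MSP (audio DSP) object embeds a t_pxobject header instead, so that it can
   be inserted in the DSP chain: */
typedef struct _myobj_tilde {
    t_pxobject m_obj;    /* MSP signal-object header */
    float      m_gain;   /* object-specific data */
} t_myobj_tilde;

/* In addition, an MSP external calls (as in the bonk~ listing of Appendix B):
     class_dspinit(c);                in main(), to add the standard DSP handlers
     dsp_setup((t_pxobject *)x, 1);   in the new instance routine, to create the signal inlets */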

In order to use the external in the Max patcher window, you have to add the .mxo package produced by building the source code to the msp-externals (or max-externals) folder inside the application folder. You can do this automatically by telling Xcode where to build your object in the build target.

To better understand what targets are, you can think of them as the options of the compiler. Most of the options are predefined when compiling an external for Max/MSP with Xcode. One example of configuring the target manually is writing a file with the .xcconfig extension and adding it to the project; you can then use it in the target field in Xcode.

A Max external source code has three basic components:

1. the entry point: the main() function

2. the description of the object: the structs

3. the definition of the functionality: the methods (the behaviour of the object)

Some methods and elements of the structs are required by Max, and are explained in the Max API reference. The development of the source code can be summarized in five points (a minimal skeleton following these steps is sketched after the list):

1. including the right header files (usually ext.h and ext_obex.h for MSP objects)

2. declaring a C structure for your object

3. writing an initialization routine called main that defines the class

4. writing a new instance routine that creates a new instance of the class, when someone makes one or types its name into an object box

5. writing methods (or message handlers) that implement the behavior of the object.
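As a purely illustrative sketch of these five steps (the object name mini, its counter behaviour and the m_ member names are hypothetical and not part of the OMM code; only standard Max 5 SDK calls of the same family used in the bonk∼ listing of Appendix B are assumed), a bare Max external could look like this:

/* Minimal sketch of a (non-signal) Max 5 external following the five steps above. */

// 1. include the right header files
#include "ext.h"
#include "ext_obex.h"

// 2. declare a C structure for the object
typedef struct _mini {
    t_object  m_obj;      // Max object header (t_pxobject for MSP objects)
    long      m_count;    // object-specific state: number of bangs received
    void     *m_outlet;   // outlet used to report the count
} t_mini;

static t_class *mini_class;   // pointer to the class, filled in main()

static void *mini_new(t_symbol *s, long argc, t_atom *argv);
static void  mini_bang(t_mini *x);

// 3. initialization routine main() that defines the class
int main(void)
{
    t_class *c = class_new("mini", (method)mini_new, (method)NULL,
                           (long)sizeof(t_mini), 0L, A_GIMME, 0);
    class_addmethod(c, (method)mini_bang, "bang", 0);   // see step 5: register a message handler
    class_register(CLASS_BOX, c);                        // required, at the end of main()
    mini_class = c;
    return 0;
}

// 4. new instance routine, called when the object name is typed into an object box
static void *mini_new(t_symbol *s, long argc, t_atom *argv)
{
    t_mini *x = (t_mini *)object_alloc(mini_class);
    if (x) {
        x->m_count = 0;
        x->m_outlet = intout((t_object *)x);   // create an outlet for integers
    }
    return x;
}

// 5. method (message handler) implementing the behaviour of the object
static void mini_bang(t_mini *x)
{
    x->m_count++;
    outlet_int(x->m_outlet, x->m_count);   // send the updated count out the outlet
}

Building such a project produces a .mxo bundle, which can then be copied into the externals folder as described above; an MSP version would additionally follow the pattern shown in the earlier sketch and in the bonk∼ listing.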


Bibliography

[1] Cycling '74. Max 5 API Reference, 2009.

[2] Samer Abdallah and Mark Plumbley. Unsupervised onset detection: A probabilistic approach using ICA and a hidden Markov classifier, 2003.

[3] James Beauchamp. Analysis, Synthesis, and Perception of Musical Sounds: The Sound of Music (Modern Acoustics and Signal Processing). Springer, December 2006.

[4] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. Speech and Audio Processing, IEEE Transactions on, 13(5):1035–1047, 2005.

[5] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler. On the use of phase and energy for musical onset detection in the complex domain. Signal Processing Letters, IEEE, 11(6):553–556, 2004.

[6] Juan Pablo Bello and Mark Sandler. Phase-based note onset detection for music signals. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), pages 441–444. IEEE Computer Society, 2003.

[7] Luciano Berio. Intervista sulla musica. Laterza, 1981.

[8] J. Bilmes. Timing is of the essence: Perceptual and computational techniques for representing, learning, and reproducing expressive timing in percussive rhythm. Master's thesis, MIT, Cambridge, MA, 1993.

[9] Paul Brossier. Automatic Annotation of Musical Audio for Interactive Applications. PhD thesis, Queen Mary University of London, UK, August 2006.

[10] Judith C. Brown. Calculation of a constant Q spectral transform. J. Acoust. Soc. Am., 1991.

[11] Judith C. Brown and Miller S. Puckette. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Am., 1992.

[12] Ryan J. Cassidy and J. O. Smith III. Auditory filter bank lab.

[13] Nick Collins. A comparison of sound onset detection algorithms with emphasis on psychoacoustically motivated detection functions. In AES Convention 118, pages 28–31, 2005.

[14] P. Cosi, G. De Poli, and G. Lauzzana. Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23(1):71–98, March 1994.

[15] Roger B. Dannenberg. Nyquist Reference Manual. Carnegie Mellon University School of Computer Science, Pittsburgh, PA 15213, U.S.A., 2007.

[16] Alain de Cheveigné. Pitch perception models - a historical review. Technical report, CNRS - Ircam, Paris, 2004.

[17] Filipe Diniz, Iuri Kothe, Sergio L. Netto, and Luiz W. P. Biscainho. High-selectivity filter banks for spectral analysis of music signals. EURASIP Journal on Advances in Signal Processing, 2007.

[18] C. Dodge and T. Jerse. Computer Music: Synthesis, Composition and Performance. Thomson Learning, 1985.

[19] Carlo Drioli and Nicola Orio. Elementi di acustica e psicoacustica, 1999.

[20] C. Duxbury, M. Sandler, and M. Davis. A hybrid approach to musical note onset detection. In Proc. Digital Audio Effects Workshop (DAFx), 2002.

[21] Chris Duxbury, Juan Pablo Bello, Mike Davies, and Mark Sandler. Complex domain onset detection for musical signals. In Proc. Digital Audio Effects Workshop (DAFx), 2003.

[22] Chris Duxbury, Juan Pablo Bello, Mark Sandler, and Mike Davies. A comparison between fixed and multiresolution analysis for onset detection in musical signals. In Proc. Digital Audio Effects Workshop (DAFx), 2004.

[23] Ichiro Fujinaga. Max/MSP Externals Tutorial, 2005.

[24] Toby Gifford and Andrew R. Brown. Listening for noise: An approach to percussive onset detection. In The Australasian Computer Music Conference, 2008.

[25] M. Gimenes, E. R. Miranda, and C. Johnson. A memetic approach to the evolution of rhythms in a society of software agents. In Proceedings of the 10th Brazilian Symposium of Musical Computation (SBCM), Belo Horizonte (Brazil), 2005.

[26] John William Gordon. Perception of Attack Transients in Musical Tones. PhD thesis, CCRMA, Department of Music, Stanford University, 1984.

[27] Paul Gurnig. An Introduction to Writing Externs in C for Max/MSP. University of Chicago, 2005.

[28] Kurt Jacobson. A metric for music similarity derived from psychoacoustic features in digital music signals. PhD thesis, University of Miami, 2006.

[29] K. L. Kashima and B. Mont-Reynaud. The bounded-Q approach to time-varying spectral analysis. Tech. Rep. STAN-M-28, Stanford University, Department of Music, 1985.

[30] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In ICASSP '99: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3089–3092, Washington, DC, USA, 1999. IEEE Computer Society.

[31] Alexandre Lacoste and Douglas Eck. A supervised classification algorithm for note onset detection. EURASIP J. Appl. Signal Process., 2007(1):153, January 2007.

[32] Kai Lassfolk and Jaska Uimonen. Spectutils, an audio signal analysis and visualization toolkit for GNU Octave. In 11th Int. Conference on Digital Audio Effects (DAFx-08), 2008.

[33] Paul Masri. Computer Modeling of Sound for Transformation and Synthesis of Musical Signals. PhD thesis, University of Bristol, UK, 1996.

[34] James McCartney. Rethinking the computer music language: SuperCollider. Computer Music Journal, volume 26, pages 61–68. MIT Press, Cambridge, MA, USA, 2002.

[35] Jon McCormack. A developmental model for generative media. In Advances in Artificial Life, pages 88–97. 2005.

[36] E. R. Miranda. Computer Sound Design: Synthesis Techniques and Programming. Focal Press, 2002.

[37] Eduardo R. Miranda. Artificial phonology: Disembodied humanoid voice for composing music with surreal languages. Leonardo Music Journal, 15(1):8–16, 2005.

[38] M. S. Puckette, T. Apel, and David Zicarelli. Real-time audio analysis tools for Pd and MSP. In Proceedings of the ICMC, 1998.

[39] Miller Puckette. Is there life after MIDI? ICMA, 1994.

[40] Miller Puckette. Max at seventeen. Comput. Music J., 26(4):31–43, 2002.

[41] Miller Puckette. The Theory and Technique of Electronic Music. World Scientific Publishing Co. Pte. Ltd., 2007.

[42] Miller S. Puckette. Pure Data: recent progress, 1997.

[43] Arunan Ramalingam and Sridhar Krishnan. Gaussian mixture modeling of short-time Fourier transform features for audio fingerprinting. IEEE Transactions on Information Forensics and Security, 1(4):457–463, December 2006.

[44] Curtis Roads. The Computer Music Tutorial. The MIT Press, February 1996.

[45] Curtis Roads. Microsound. The MIT Press, 2004.

[46] D. Rocchesso and F. Fontana. The Sounding Object. Mondo Estremo, 2003.

[47] Davide Rocchesso. Introduction to Sound Processing. GNU Free Documentation License, 2003.

[48] Davide Rocchesso. Programmazione visuale, versione 1.3, 2007.

[49] Davide Rocchesso. Sound to sound, sense to sense, 2008.

[50] X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. In International Computer Music Conference (ICMC), pages 30–33, 2001.

[51] E. D. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588–601, 1998.

[52] X. Serra. Musical Sound Modeling with Sinusoids plus Noise, pages 91–122. Swets and Zeitlinger, 1997.

[53] Xavier Serra. Parshl: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation, 1985.

[54] Xavier Serra. A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition. PhD thesis, Stanford University, 1989.

[55] Barry Vercoe. The Csound Book: Perspectives in Software Synthesis, Sound Design, Signal Processing, and Programming. The MIT Press, March 2000.

[56] Tony S. Verma and Teresa H. Y. Meng. Extending spectral modeling synthesis with transient modeling synthesis. Comput. Music J., 24(2):47–59, 2000.

[57] Gil Weinberg and Scott Driscoll. Robot-human interaction with an anthropomorphic percussionist. In CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1229–1232, New York, NY, USA, 2006. ACM.

[58] Gil Weinberg and Scott Driscoll. Toward robotic musicianship. Comput. Music J., 30(4):28–45, 2006.

[59] Gil Weinberg and Scott Driscoll. The interactive robotic percussionist: new developments in form, mechanics, perception and interaction design. In HRI '07: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 97–104, New York, NY, USA, 2007. ACM.

[60] Gil Weinberg, Mark Godfrey, Alex Rae, and John Rhoads. A real-time genetic algorithm in human-robot musical improvisation. Computer Music Modeling and Retrieval: Sense of Sounds, pages 351–359, 2008.

[61] Stephen Wilson. Information Arts: Intersections of Art, Science, and Technology (Leonardo Books). The MIT Press, April 2003.

[62] David Zicarelli. Writing External Objects for Max 4.0 and MSP 2.0. Cycling '74, 2001.

[63] Udo Zölzer, Xavier Amatriain, Daniel Arfib, Jordi Bonada, Giovanni De Poli, Pierre Dutilleux, Gianpaolo Evangelista, Florian Keiler, Alex Loscos, Davide Rocchesso, Mark Sandler, Xavier Serra, and Todor Todoroff. DAFX: Digital Audio Effects. John Wiley & Sons, May 2002.