Visual enhancement using multiple audio streams in live music performance

Rozenn Dahyot1, Conor Kelly1, and Gavin Kearney2

1School of Computer Science and Statistics, Trinity College Dublin, Ireland

2Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland

Correspondence should be addressed to Rozenn Dahyot ([email protected])

ABSTRACT

The use of multiple audio streams from digital mixing consoles is presented for application to real-time enhancement of synchronised visual effects in live music performances. The audio streams are processed simultaneously and their temporal and spectral characteristics can be used to control the intensity, duration and colour of the lights. The efficiency of the approach is tested on rock and jazz pieces. The result of the analysis is illustrated by a visual OpenGL 3-D animation illustrating the synchronous audio-visual events occurring in the musical piece.

1. INTRODUCTION

Visual effects such as stage lighting or fog machines are widely used in live music performances to enhance the emotion and mood of the music played. Such schemes are designed to visually immerse the audience in the feeling of the song. Video displays such as TV screens or video projectors are now standard facilities in small to large size venues, and recent trends in art tend to design computer programs to automatically allow interactions between the music and the visual effects [1].

Typically in small to medium sized auditoriums, sound reinforcement for jazz and rock ensembles performing on stage involves the use of around 8 microphones, a mixing console and loudspeaker amplification. The microphone signals are pre-amplified and processed at the console, and a stereo mix for 'Front of House' amplification is generated. This stereo mix is fed to the lighting desk, which allows control over several effects of the stage lights (colour, flash, intensity, direction, etc.). Often artificial intelligence is involved in the making of shows to assist the work of sound and lighting engineers. For example, one popular automatic process is real-time beat detection, implemented at a basic level on lighting desks [2]. The visual effects can then be synchronised to the music. Such algorithms tend to focus primarily on the low-frequency content of the stereo mix to infer tempo, since the mid and upper frequency ranges are generally 'cluttered' by the mix of sources.

However, current lighting systems do not avail of the multiple audio streams available from digital mixing consoles via protocols such as the popular Tascam Digital Interface (TDIF) or Alesis Digital Audio (ADAT). We propose here to process in real time the multi-channel audio streams from a digital mixing console to perform reliable lighting enhancement through temporal beat detection and frequency analysis. The advantage of such a setting is that the musical content of each instrument is well separated: in well engineered performances, the sound pressure level of a particular instrument at its corresponding microphone is much greater than the 'spill' from the other instruments. Thus, no source separation processing is required for the different instruments. The temporal and spectral characteristics of these signals can then be analysed simultaneously to generate enhanced visual effects. Another advantage of using the separated sources is that the mid to high frequency components, which are crucial in determining signal attack, are uncluttered. A high audio resolution is thus important for accurate detection of pitch and of the temporal properties of higher-frequency percussive instruments such as hi-hats, as well as for visual enhancement of spatial effects such as reverbs or delays on vocals or guitars.

We propose here to create a portable, affordable system that will help automatically generate in real time a visual artistic rendering of the music being played live in a small or medium venue, without the undesirable budget constraints that face many working artists. As an alternative to lighting, we illustrate our multi-stream music analysis by creating a real-time OpenGL animation that reacts to events in the music piece. Such a system could be used to increase the exposure of not yet well-known artists to an international audience in the virtual world (e.g. by simultaneously performing in Second Life, http://secondlife.com/).

Our smart system has been tested on jazz and rock pieces. We show that real-time, high-resolution multi-stream music analysis can be performed with reliable accuracy. We use various methods such as frequency spectrum analysis, beat detection and amplitude analysis to get a feel for the mood and tempo of the song (see section 4). An animation is then created representing, with more or less accuracy, the members of the band with their instruments on a stage. The lighting and the motion of the characters in the rendering change in real time according to the song (see section 5). Section 6 comments on the current performance of our system.

2. RELATED WORKS

The work presented in this paper mixes different domains of computer science: digital music processing and computer graphics. In the following paragraphs, references are given for both areas of research.

2.1. Analysis of music

Digital processing of music has attracted a lot of attention, mainly due to the high commercial value of online sales of songs. A huge literature exists on features that are efficient for processing music, and a good review can be found in [3]. Some of them, such as loudness, the Fourier transform, band energy or the median frequency, have been used in our system and are presented in section 4.

Beat detection significantly aids the classification of music genre, and is an elementary step for more thorough analysis of the music [4]. Further methods to infer the structure of popular music have been proposed by Maddage [4] and include music transition, voice detection and repeated pattern detection. Applications of these methods can be found in music transcription, music summarisation and retrieval, and also in music streaming.

2.2. Visual music

Perception of music. By stimulating a second sense along with music, visual effects have the ability to contribute to the communication that takes place between performers and their listeners [5]. Examples of visual contributions in music performances include facial expression or body gesture and movement of the performers, video projections, light and pyrotechnic shows, amongst others. They amplify the emotion of the music and completely immerse the observer in the feeling of the song.

Visuals & Graphics. In the following paragraphs, we report several computer-aided systems that have been proposed to generate visuals for music. Visuals can be created in many ways, e.g. by films and light shows, and, as an illustration, the reader can visit the web exhibition Visual Music [6], which presents several visual expressions explored by artists to extend the perception of music.

One important visual cue to music is the natural movement of the body expressed by performers [5] or by the listeners. For instance, human foot taps are inferred from the perceived beat. Dancing is also a natural illustration of music. Denman et al. [7] have proposed to synchronise the beat of a song with the visual motion of a dancer in a video (performed for another song). Time-scale changes of the video feed are then performed to synchronise the detected beat of the new song with the movements of the dancer. The beat detection from a monophonic audio stream and the extraction of the motion from the video are performed offline. The created video can have applications for generating visuals in nightclubs or for postproduction in music videos.

Several applications using music to synchronise computer graphics animations in real time have been proposed [8, 9, 10]. Applications can be found in entertainment or in learning music. In [9], motion curves are synchronised to music in computer animations. In [10], the graphics animation mimics the expressiveness of a drummer. However, contrary to [7], no audio analysis is performed, as the relevant cues are already available in the complementary MIDI (Musical Instrument Digital Interface) stream of the soundtrack [8, 9, 10]. MIDI stores the events that would create the sound instead of the sound itself. This allows easy access to pitch, velocity, instrument and timing information. Using a MIDI file instead of a raw audio signal avoids the need to perform digital music processing in real time. Unfortunately, this information is not always accessible in real time from every instrument playing in a musical piece.

3. OVERVIEW OF OUR SYSTEM

We consider live performances of small rock or jazz bands with a few musicians and instruments. On stage, several microphones are placed close to each instrument. We assume the availability of multichannel audio streams from TDIF or ADAT interfaces on consoles such as the Yamaha 02R96, or from the standalone analog-to-digital conversion capabilities of units such as the MOTU 2408MK2. Figure 1 shows an overview of the system. Several audio recordings coming from different microphones (mic1, mic2, etc.) are available for analysis. The final mix of the song is used only as the soundtrack in the final rendering.

Fig. 1: Overview of our system.

The main advantage of considering separate audio sources from each microphone, instead of the mixed track, resides in the fact that the different sources are well separated. In fact, the closest instrument to each microphone is the one mainly audible on the corresponding audio stream, with spill from the other instruments in the order of 30 to 50 dB lower. Using only the mix to analyse the music would lessen the amount of data to process in real time, but would require computationally expensive routines to separate the contribution from each musician [11]. As an alternative, we propose to take advantage of the direct out or bussing facilities available on most mixing consoles, so that the separated audio is already presented for analysis. This choice allows us to use simple and fast algorithms to extract relevant music features reliably, but has the drawback of requiring the analysis of several audio streams in parallel.

4. MUSIC ANALYSIS

Currently four audio channels are analysed simultaneously. These correspond most of the time to the microphones of the singer (voice), the drums, the guitarist and the bass. For the jazz piece we analysed, the saxophone is selected instead of the guitar. In our simulation, independent audio streams are stored as mono WAV files sampled at 44100 Hz.
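As a rough illustration of this setup, the sketch below steps through several mono WAV files in lock-step, one 1024-sample window at a time. The file names, the NumPy/SciPy tooling and the PCM normalisation are our own assumptions for the sketch, not part of the original system.

```python
import numpy as np
from scipy.io import wavfile

WIN = 1024  # analysis window length in samples (about 23.2 ms at 44100 Hz)

def stream_windows(paths):
    """Yield synchronous 1024-sample windows from several mono WAV files.

    paths: list of WAV file names, e.g. ["voice.wav", "drums.wav",
    "guitar.wav", "bass.wav"] (hypothetical names).
    """
    signals = []
    for p in paths:
        fs, x = wavfile.read(p)
        assert fs == 44100, "streams are assumed to be sampled at 44.1 kHz"
        signals.append(x.astype(np.float64) / 32768.0)  # normalise 16-bit PCM

    n_windows = min(len(x) for x in signals) // WIN
    for n in range(n_windows):
        # one synchronous frame per channel, to be analysed simultaneously
        yield [x[n * WIN:(n + 1) * WIN] for x in signals]
```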

Figure 2 shows an example of these different recordings for an extract of a rock song.

Fig. 2: 10 seconds of a rock song. From top to bottom: pressure signals of the singer's voice, guitar, bass and drums.

4.1. Beat detection

The beat detection algorithm is performed on the drum audio stream. For the audio signal x(t), the loudness is computed for each window of 1024 samples (i.e. ∆ = 1024/44100 = 0.0232 s) by:

l(n\Delta) = \int_{n\Delta}^{(n+1)\Delta} x^2(t)\, dt, \quad \text{starting at } n = 0 \qquad (1)


The detection of the beat is performed by thresholding the loudness information. To be independent of background noise during the performance, or of the different loudness of the different sources, the threshold T is adaptive for each frame:

T(n\Delta) = \gamma \cdot \frac{1}{1/\Delta} \int_{n\Delta - 1}^{n\Delta} x^2(t)\, dt \qquad (2)

where γ is a proportional coefficient set by hand at γ = 1.4, and the normalised integral corresponds to the average loudness in the second preceding the window n∆. A beat is then detected when l(n∆) > T(n∆). Sometimes several successive temporal windows are detected above the threshold; consequently, only the first detected beat amongst a successive sequence of detected beats is actually labelled as a beat (i.e. the rule is that a beat is impulsive and cannot be detected in successive windows). Figure 3 shows the results of our beat detection performed on ten seconds of a rock song performed live. As can be noticed, the detected beat is sometimes one temporal window in advance of the actual peak in the loudness signal. This means that when a beat is detected, it is with an accuracy of ∆ = 0.0232 second. This temporal precision in the audio analysis is largely sufficient, as the visual rendering only changes every 0.04 second (i.e. the animation runs at 25 frames per second).

Fig. 3: Result of beat detection performed on 10 s of the drum audio track for a rock song recorded in a live session (cf. Fig. 2). Red dots indicate detected beats and the blue curve corresponds to the loudness computed every ∆.
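A minimal sketch of how equations (1) and (2) and the impulsive-beat rule could be implemented on a discrete drum signal is given below; the NumPy implementation and variable names are our own assumptions rather than the authors' code.

```python
import numpy as np

def detect_beats(x, fs=44100, win=1024, gamma=1.4):
    """Adaptive-threshold beat detection on a mono drum track.

    x     : 1-D float array (mono pressure signal)
    win   : analysis window length in samples (Delta = win/fs seconds)
    gamma : hand-tuned proportionality coefficient (1.4 in the paper)
    Returns a list of detected beat times in seconds.
    """
    n_windows = len(x) // win
    # Loudness l(n*Delta): energy of each non-overlapping window, eq. (1).
    loudness = np.array([np.sum(x[n * win:(n + 1) * win] ** 2)
                         for n in range(n_windows)])

    windows_per_sec = fs // win  # about 43 windows in one second
    beats = []
    prev_above = False
    for n in range(1, n_windows):
        # Average loudness per window over the preceding second, eq. (2).
        start = max(0, n - windows_per_sec)
        T = gamma * np.mean(loudness[start:n])
        above = loudness[n] > T
        # Impulsive rule: only the first window of a run above threshold is a beat.
        if above and not prev_above:
            beats.append(n * win / fs)
        prev_above = above
    return beats

# Example usage on the drum track (hypothetical file name):
# from scipy.io import wavfile
# fs, x = wavfile.read("drums.wav")
# print(detect_beats(x.astype(np.float64) / 32768.0, fs))
```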

4.2. Fourier Analysis

For each audio stream, a Fast Fourier Transform (FFT) is computed every 23.2 ms (or 1024 samples) as follows:

X(n\Delta, f) = \int_{n\Delta}^{(n+1)\Delta} x(t) \cdot \exp(-2\pi i f t)\, dt \qquad (3)

Using an adapted bandpass filter, each instrument is separated from any possible spill coming from other sources, and information such as the band energy of the instrument (or the voice) is recorded:

A(n\Delta) = \int_{f_0}^{f_1} |X(n\Delta, f)|\, df \qquad (4)

where [f_0, f_1] defines the frequency band of the instrument. Without much additional computational cost, the median frequency, or mean of the spectrum, is also computed as follows [3]:

\bar{f}(n\Delta) = \frac{\int_{f_0}^{f_1} |X(n\Delta, f)| \cdot f\, df}{\int_{f_0}^{f_1} |X(n\Delta, f)|\, df} \qquad (5)

These are the measures used in our system. Other informative features such as pitch can also be computed if the computation time remains low for the hardware used.
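For illustration, the band energy (4) and spectral centroid (5) of one analysis window might be computed as in the following sketch; the band limits shown are arbitrary placeholder values, not those used in our system.

```python
import numpy as np

def spectral_features(frame, fs=44100, band=(80.0, 1200.0)):
    """Band energy A and spectral centroid f-bar for one 1024-sample frame.

    frame : 1-D float array of length 1024 (one analysis window)
    band  : (f0, f1) frequency band of the instrument in Hz (illustrative values)
    """
    spectrum = np.abs(np.fft.rfft(frame))           # |X(n*Delta, f)|
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs) # frequency of each bin in Hz

    f0, f1 = band
    in_band = (freqs >= f0) & (freqs <= f1)

    band_energy = np.sum(spectrum[in_band])                  # eq. (4), discretised
    centroid = (np.sum(spectrum[in_band] * freqs[in_band])
                / max(np.sum(spectrum[in_band]), 1e-12))     # eq. (5), discretised
    return band_energy, centroid
```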

5. REAL-TIME ANIMATION

A simulation of a stage, complete with lighting effects, is rendered on screen. This rendering is created and drawn using the OpenGL graphics library. The graphics methods used in the render include 3D modelling, texture mapping and tessellated objects (see figure 4), as explained in the following paragraphs.

5.1. 3D Modeling

The musicians and many of the stage props, such as the drums, guitars, microphones, lights and light rig, were modelled in 3D Studio Max as 3D models and then imported into OpenGL. Figure 5 shows a screen shot of the drums being modelled in 3D Studio Max. 3DS models are composed of many thousands of vertices and their texture coordinates, and they are quite computationally expensive to draw. For this reason, there is a trade-off between the detail represented in the simulation and the speed at which it runs.

5.2. Texture mapping

Texture mapping is the process of applying textures (stored as JPEGs) to drawn shapes in order to add colour and realism to the scene. The less complicated objects in the render, such as the enclosing walls of the stage and the front of the stage floor, can be represented with far fewer vertices and so are drawn with their static coordinates specified in the code. The texture mapping coordinates are also specified, so at render time these textures are applied to the shapes to give them a realistic look. This is far more efficient than drawing 3DS models and so is used wherever possible.

Fig. 4: A screen shot of the render: the vocalist and guitarist are illuminated in the foreground; ambient lighting shines red in the background, indicating an up-tempo beat.

Fig. 5: Modelling the drum kit in 3D Studio Max.

5.3. Tessellated objects

OpenGL uses the Phong illumination model to calculate lighting in scenes. This calculates light reflections at each vertex of an object and interpolates the light across the surrounding polygons. As spot lights will be shining down onto the stage floor, it is required to display and reflect them realistically. To do this, the stage floor is drawn as a very fine mesh of vertices in a process known as tessellation. However, as this is a computationally expensive process, a trade-off between realism and computation has been found.
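One possible way to build such a finely tessellated floor is sketched below; the grid resolution and the function itself are illustrative assumptions, not the implementation used in the actual render.

```python
def tessellate_floor(width, depth, divisions=64):
    """Return the vertices of a floor plane subdivided into a fine quad grid.

    A denser grid gives smoother per-vertex lighting from the spot lights,
    at a higher drawing cost; 'divisions' is an arbitrary choice here.
    """
    quads = []
    dx, dz = width / divisions, depth / divisions
    for i in range(divisions):
        for j in range(divisions):
            x0, z0 = i * dx, j * dz
            x1, z1 = x0 + dx, z0 + dz
            # one quad on the y = 0 plane, counter-clockwise winding
            quads.append([(x0, 0.0, z0), (x1, 0.0, z0),
                          (x1, 0.0, z1), (x0, 0.0, z1)])
    return quads
```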

5.4. Animation to render the music feel

The visualisation of the information extracted from the music played is done in three ways:

1. Spot lighting. The most important of these is the concentration of spot lighting on any musician who is currently active. This uses information from the FFT performed on each musician's channel. The energies A are calculated and, when a certain threshold level is breached, a spot light is shone on the musician.

2. Ambient lighting. Rather than concentrating on individual channels, ambient lighting focuses on the behaviour of the song as a whole and so considers the energies of all channels. It analyses the predominant frequencies of the FFT (i.e. f̄) and uses tempo information to attempt to provide ambient lighting in accordance with the mood of the song. The interpretation of the songs is based on the generalisation that lower frequencies and lower tempos indicate a more relaxed mood; this triggers low-key colours such as purple or dark red. Brighter colours illuminate the stage when songs occupy higher frequency bands and have faster tempos, to reflect the more excited performance.

3. Physical movement of the musicians on stage. Spot lighting will demonstrate that a musician's level has breached a certain threshold and that they are deemed to be playing or singing. Simple animation of the characters is performed to give more information on exactly how loud their part is. This is shown by the speed at which the musicians' arms and bodies move: the faster they move, the bigger the part they are playing in the overall mix. This is done using hierarchical animation of the 3D models (see the sketch after this list). The models seen in the render are actually made up of several models (body, head, legs, etc.) which are drawn together in OpenGL to simulate a person. These can be rotated and moved around each other to animate movement. This animation is controlled in accordance with the music analysis, so the movement is directly linked to the music the musicians are producing.
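The sketch below shows one plausible mapping from the per-window audio features to these three visual controls; the thresholds, the tempo cut-off and the colour interpolation are illustrative assumptions, not the values used in the actual system.

```python
def frame_visuals(band_energy, centroid, tempo_bpm,
                  energy_threshold=0.05, f_low=80.0, f_high=1200.0):
    """Map per-window audio features to simple visual parameters.

    band_energy : A(n*Delta) for one musician's channel
    centroid    : f-bar(n*Delta), the spectral centroid of that channel
    tempo_bpm   : current tempo estimate from the beat detector
    The threshold, band limits and tempo cut-off are illustrative values.
    """
    # 1. Spot lighting: switch the musician's spotlight on above a threshold.
    spotlight_on = band_energy > energy_threshold

    # 2. Ambient lighting: interpolate from blue (low centroid) to red (high
    #    centroid), dimmed for slow tempos to suggest a more relaxed mood.
    t = min(max((centroid - f_low) / (f_high - f_low), 0.0), 1.0)
    brightness = 0.4 if tempo_bpm < 100 else 1.0
    ambient_rgb = (t * brightness, 0.0, (1.0 - t) * brightness)

    # 3. Movement: scale the speed of the character animation with loudness.
    movement_speed = min(band_energy / energy_threshold, 4.0)

    return spotlight_on, ambient_rgb, movement_speed
```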

6. PERFORMANCE AND OPTIMIZATION

6.1. Hardware

The system as described in this paper currently runs on a standard laptop (model HP NX9420) with an Intel Core Duo CPU at 1.66 GHz, 512 MB of RAM and an ATi Radeon X1600 graphics card with 256 MB of memory. The music analysis and the animation are created in real time, at 25 frames per second for the video rendering.

6.2. Computational efficiency

Computational efficiency of the code is a major issue for our system to work. As a lot of calculation takes place for the analysis of the multi-stream music, we also need a reasonable reaction time in the rendering to avoid desynchronisation artefacts. Various methods are used in the graphics component of the project to ensure optimal performance. One such method is the use of hardware pre-caching of OpenGL display lists, which allows some commands to be precompiled into the graphics card's memory and so removes the need for the CPU to perform repeated expensive calculations. This takes advantage of the dedicated memory and computational power of modern GPUs. In a direct comparison between the code with no hardware pre-caching and the code which makes use of display lists, a speed-up (measured in the frames-per-second count) of roughly 40% was achieved.
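As an indication of how display-list pre-caching works, here is a minimal sketch using PyOpenGL; it is offered purely for illustration of the technique and is not the code of the actual render.

```python
from OpenGL.GL import (glGenLists, glNewList, glEndList,
                       glCallList, GL_COMPILE)

def build_display_list(draw_static_geometry):
    """Precompile static stage geometry into an OpenGL display list.

    draw_static_geometry: a function issuing the immediate-mode GL calls
    (walls, stage floor, light rig) that never change between frames.
    Returns the list id, to be replayed each frame with glCallList.
    """
    list_id = glGenLists(1)         # reserve one display list name
    glNewList(list_id, GL_COMPILE)  # record, but do not execute, the calls
    draw_static_geometry()          # the expensive geometry is issued once here
    glEndList()
    return list_id

# Each frame, replaying the cached commands avoids re-issuing them from the CPU:
# glCallList(stage_list_id)
```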

6.3. Perception of the animation

Some results of the system are shown as videos at https://www.cs.tcd.ie/Rozenn.Dahyot/Demos/DemosMusic.html. The system has been successively tested using four simultaneous audio channels from rock and jazz bands, mainly in live performance situations but also in less noisy environments such as studio recordings. The perceived animation is well synchronised to the beat, in particular the lights and the movement of the drummer.

Figure 6 shows two images from a recorded animation. A rock song is played in which at first only one guitar and the drums are playing; then a second guitar starts soloing. The yellow lighting on the musicians indicates whether they are currently playing. The reddish lighting on the soloist illustrates the measure f̄ computed in real time, changing linearly from blue to red. For better visualisation, a bar with a moving and changing colour spot indicates the value of f̄ within its range in real time.

Fig. 6: (a) The singer and one guitarist are not playing and are in the dark. (b) The guitarist is playing a solo and, using the median frequency f̄, the colour of the lights varies from blue (low values of f̄) to red (high values of f̄).


7. CONCLUSION AND FUTURE WORK

We have presented an innovative system using multi-channel music recordings for real-time rendering. Using the computational power of a recent laptop, we have shown how to simultaneously perform music analysis and render a graphics animation expressing some aspects of the music being played. Both CPU and GPU capabilities have been used to speed up the system.

Future directions of this research will look at creating other animations that better illustrate the music, such as using changes of facial expression on a virtual face [5], or animating a virtual dancer [8], or more generally creating more expressive animations. The music processing part of the system can also be improved by using prior information, for instance for beat detection, where currently no past information is used (i.e. beats are detected without knowledge of when the last beat was detected). The use of other informative audio features such as pitch will also be investigated.

ACKNOWLEDGEMENTS

Part of this work has been funded by the European Network of Excellence on Multimedia Understanding through Semantics, Computation and Learning (MUSCLE), FP6-507752, http://www.muscle-noe.org/.

8. REFERENCES

[1] T. Winkler, Composing Interactive Music: Techniques and Ideas Using Max. MIT Press, 1998.

[2] U. Sandstrom, Stage Lighting Controls. Focal Press, 1997.

[3] M. Davy and S. Godsill, "Audio information retrieval: a bibliographical study," University of Cambridge, UK, Tech. Rep., November 2001.

[4] N. Maddage, "Automatic structure detection for popular music," IEEE Multimedia, vol. 13, no. 1, pp. 65–77, 2006.

[5] W. F. Thompson, P. Graham, and F. A. Russo, "Seeing music performance: Visual influences on perception and experience," Semiotica, pp. 203–227, 2005.

[6] "Visual music," http://www.hirshhorn.si.edu/visualmusic/, Hirshhorn Museum, 2005.

[7] H. Denman and A. Kokaram, "Dancing to a different tune," in 2nd IEE European Conference on Visual Media Production (CVMP), 30 Nov.–1 Dec. 2005, pp. 147–153.

[8] D. Reidsma, A. Nijholt, R. Poppe, R. Rienks, and G. Hondorp, "Virtual rap dancer: Invitation to dance," in CHI '06 Extended Abstracts on Human Factors in Computing Systems. ACM, 2006, pp. 263–266.

[9] M. Cardle, L. Barthe, S. Brooks, and P. Robinson, "Music-driven motion editing: Local motion transformations guided by music analysis," in 20th IEEE Eurographics UK Conference (EGUK), 2002.

[10] A. M. Wood-Gaines, "Modelling expressive movement of musicians," Master's thesis, MSc Computing Science, Simon Fraser University, February 1997.

[11] S. Choi, A. Cichocki, H. Park, and S.-Y. Lee, "Blind source separation and independent component analysis: A review," Neural Information Processing – Letters and Reviews, vol. 6, no. 1, pp. 1–57, January 2005.
