Synthesizing Expressive Music through the Language of Conducting

Dr. Teresa Marrin Nakra
Artistic Director, Immersion Music Inc., Boston, MA, USA

Journal of New Music Research, 2001, Vol. 22, No. 5 © Swets & Zeitlinger

Accepted: 1 July, 2001
Correspondence: Dr. Teresa Marrin Nakra, 381 Farrwood Drive, Ward Hill, MA 01835, USA. Tel.: +1-617-686-4898, Fax: +1-978-372-1078, E-mail: [email protected]


Abstract

This article presents several novel methods that have been developed to interpret and synthesize music to accompany conducting gestures. The central technology used in this project is the Conductor’s Jacket, a sensor interface that gathers its wearer’s gestures and physiology. A bank of software filters extracts numerous features from the sensor signals; these features then generate real-time expressive effects by shaping the note onsets, tempos, articulations, dynamics, and note lengths in a musical score. The result is a flexible, expressive, real-time musical response. This article features the Conductor’s Jacket software system and describes in detail its architecture, algorithms, implementation issues, and resulting musical compositions.

1. Introduction

1.1 Premise: trying to make new electronic instruments more expressive

In recent decades there have been tremendous innovations in the development of interfaces and methods for performing real-time, interactive music with computers. However, most of these systems (with a few notable exceptions) have not been adopted by performing musicians. This indicates a problem or delay in the acceptance of these new tools by the larger musical culture. Why is it that most musicians have not yet traded in their guitars, violins, or conducting batons for new technologies? Perhaps this is because the new technologies do not yet convey the most deeply meaningful aspects of human expression. That is, they do not capture and communicate the significant and emotional aspects of the gestures that are used to control them.

“One of the most telling, and annoying, characteristics of performance by computers is precisely its ‘mechanical’ nature – the nuance and expression performed instinctively by human players after many years of intensive practice is completely squeezed out of a quantized sequencer track. If more of the musical sensibilities informing human expressive performance can be imparted to computer programs, the range of contexts into which they can usefully be inserted will grow markedly.” (Rowe, 1996)

Identifying and quantifying the qualities that make certain musical instruments significant and successful is a complex problem. It’s not clear, for example, that computer technology will ever achieve the fine sound and responsiveness of a great violin. But before attempting to solve the most difficult cases, such as equaling or surpassing the violin, there are several basic technical issues that should be addressed.

1.2 Prescription: where improvement is needed

First of all, many current interfaces do not satisfy fundamental criteria for real-time interaction; they do not sample their input data fast enough or with enough degrees of freedom to capture the speed and complexity of human movement. Secondly, their sensing environments often inappropriately constrain the range and style of movements that the performer can make. For example, whereas perhaps a piano keyboard provides an appropriate and comfortable constraint to the limbs, perhaps a keyboard and mouse do not. Thirdly, the software that maps the gestural inputs to musical outputs is often not powerful enough to respond appropriately to the structure, quality, and character in the gestures. Finally, there tend to be simplistic assumptions about what level of complexity and analysis of the data is sufficient. Ultimately, all musical systems could benefit from greater information about how their input parameters vary dynamically and what the variations “mean” in human terms. Without this knowledge, interactive systems cannot anticipate and respond appropriately or meaningfully to their control inputs.

In order for new computer technologies to replace, enhance, or transform the capabilities of traditional instruments, they need to convey affect, interpretation, and skill. In order for performers to express themselves artfully, they require musical instruments that are not only sensitive to subtle variations in input but also capable of controlling multiple modulation streams in real-time; these instruments must also be repeatable and deterministic in their output. What often happens with the replacement of traditional mechanical instruments with sensor-based interfaces is that the many dimensions in the input stream are effectively reduced in the transduction process and projected down to minimal axes of control. While the reduction of dimensionality in an input stream often improves the responsiveness of automatic recognition and other engineering tasks, it is not always ideal for music, where subtlety in the “microstructural variation” (Clynes, 1990) provides much of the interesting content for the performer and the listener. Therefore, the first two aspects to focus and improve upon are: (1) acquiring the full dimensionality of the input gestures with sensor systems, and (2) acquiring the sensory data at high rates, so as to preserve the information without aliasing and other subsampling pathologies.

It is not only the modeling and sensing of input data that needs improvement; musical outputs (the sonic results of mapping gestures to sound) also tend to cluster in either of two opposite extremes: on the one side, they can be too simple and obvious, and on the other too complex or confusing for the audience to follow. The issue of mappings (the way that the musical response reflects the gesture that activates and controls it) is one of the hardest things to get right with electronic instruments:

“Often, the controller feels right, but the shape of the response does not fit your musical idea. This is a huge area . . .” (Ryan, 1991)

This commonly encountered problem might also be described as a perceptual disconnect between the way the gesture looks and the way the music sounds. This can happen because of brittle, unnatural, or overly constrained mappings between gesture and sound. A mapping that is difficult for the public to understand can dampen their enthusiasm or response to the performance. Such performances can quickly become frustrating for people on both sides of the stage. Some people feel that this problem is inherent to the craft and that computers will never be able to interpret gestures. However, I am much more sanguine about future developments. The musicians in Sensorband share my opinion:

“Pierre Boulez . . . mentioned that there always will be a difference between computer music and traditional instrumental music, because musicians can perform (or interpret) gestures, and computers cannot. We think that this is the previous generation’s view on computers. One of the most important goals for us is to make gestural music performances with computers.” (Bongers, 1998)

Gestural music performances will certainly need to become more stirring and communicative in order for interactive music to become a widely appreciated art form. The inherent problems in sensing and gesture recognition must be solved before sensor-based instruments will be widely adopted in the performing arts. The musical response from an instrument should, of course, always fit the musical gesture that generates it, but designing a system where that is always true could be infinitely difficult. Thus, one of the biggest challenges in designing real-time musical systems is to carefully choose which issues to target first.

1.3 Focus: framing a project around conducting

One way to improve upon the live performance of computer music is to use powerful computer technologies to observe and measure the movements that expert performers make, correlate them to actual musical results, and build a real-time model with enough variation in it to synthesize new music with a new performer. I undertook such a process in an attempt to build a more expressive and responsive gestural interface, focusing on the performance parameters of orchestral conducting. The result of this project, called the Conductor’s Jacket, reflected aspects of both research and performance perspectives. The method I developed combined data-gathering, analysis, and synthesis. Many hours’ worth of data was collected from conductors in real-life situations, a feature-set was identified from the data, and the results were fed back into a real-time music system. The approach was designed to achieve practical answers that would be meaningful to active musicians while remaining respectful of the complexity of the subject.

Why observe conductors for clues about gestural interface design? The answer is straightforward: established techniques provide a baseline from which to analyze and evaluate performers. The existence and documentation of an established conducting technique provides a framework within which gestures hold specific meanings and functions. That is, there exists a commonly understood (although sometimes hard to quantify) mapping between gesture and sound in the context of musical conducting, and it provides a rich and sophisticated framework for the design of a gestural interface.

Conducting is a gestural art form, a craft for skilled practitioners. It resembles dance in many ways, except that it is generative, not reflective, of the music that accompanies it. Without an instrument to define and constrain their gestures, conductors are free to express themselves exactly as they wish, and so there is enormous variety in the gestural styles of different individuals. In addition, conducting is a mature form that has developed over 250 years and has an established, documented technique. The gesture language of conducting is understood and practiced by many musicians, and commonly used as a basis for evaluating their skill and artistry. In order to be able to understand the meaning and significance of gestures, it helps to have a shared foundation of understanding; the technique of conducting conveniently provides this with its widely understood gesture system.

Studying the phenomenon of conducting from a quantitative perspective also presents a number of challenging technical problems to solve. Analyzing conducting gestures in real-time requires very quick response times, high sampling rates, little hysteresis or external noise, and powerful feature recognition of highly complex, time-varying functions. For example, conductors demand a response time on the order of 1 kHz, which is almost two orders of magnitude faster than the 10–30 Hz response time of many existent gesture-recognition technologies. Also, conductors should not be loaded down with any heavy or bulky equipment that might limit their expressive movements. These sensitive and difficult conditions inherently require creative and robust solutions.

Another reason to study conducting is that it is a historically successful method for broadcasting and communicating information in real-time. It is an optimized language of signals whose closest analogues are perhaps sign and semaphore languages, or mime. John Eliot Gardiner, the well-known British conductor, relates conducting to the electrical realm:

“the word ‘conductor’ is very significant because the idea of a current being actually passed from one sphere to another, from one element to another is very important and very much part of the conductor’s skill and craft.” (Teldec video, 1994)

Finally, conductors themselves are inherently interesting subjects. They represent a small minority of the musical population, and are considered to be among the most skillful, expert, and expressive of all musicians. They amplify and exaggerate their gestures in order to be easily seen by many people, and they are quite free to move around during rehearsals and performances, unencumbered by an instrument. The gestures of conducting influence and facilitate the higher-level functions of music, such as tempo, dynamics, phrasing, and articulation; a conductor’s efforts are expended not in playing notes, but in shaping those played by others. Finally, conductors have to manipulate reality; they purposefully (if not self-consciously) modulate the apparent viscosity of the air around them in order to communicate expressive effects – two gestures that have the same trajectory and velocity but different apparent frictions will provide different musical impressions. For all the above reasons, we found conducting to be an extremely interesting and rich area to study.

1.4 Conducting technique

Conducting is a precise system of symbolically mapped gestures. While styles can vary greatly across individuals, conductors do share an established technique. There also exists a canon of literature that clarifies and defines the basic elements of the technique; some of the more prominent authors are Max Rudolf, Elizabeth A.H. Greene, Sir Adrian Boult, Hermann Scherchen, Gunther Schuller, and Harold Farberman. The technique has been widely adopted and consistently used in classical music, which means that any skilled conductor should be capable of conducting any ensemble. This is, for the most part, true.

Conducting technique involves gestures of the whole body: posture in the torso, rotations and hunching of the shoulders, large arm gestures, smaller forearm and wrist gestures, delicate hand and finger movements, and facial expressions. The baton functions as an extension of the arm, providing an extra, elongated limb and extra joint with which to provide expressive effects. Conductors’ movements sometimes have the fluidity and naturalness of master Stanislavskian actors, combined with musical precision and score study. It is a gestalt profession; it involves all of the faculties simultaneously, and should not be done halfheartedly. Leonard Bernstein once answered the question, “How does one conduct?” with the following:

“Through his arms, face, eyes, fingers, and whatever vibrations may flow from him. If he uses a baton, the baton itself must be a living thing, charged with a kind of electricity, which makes it an instrument of meaning in its tiniest movement. If he does not use a baton, his hands must do the job with equal clarity. But baton or no baton, his gestures must be first and always meaningful in terms of the music.” (Bernstein, 1988)

The skill level of a conductor is also discernible by musicians; individual conductors are often evaluated based on their technical ability to convey information. Elizabeth Greene wrote that skillful conductors have a certain ‘clarity of technique,’ and described it in this way:

“While no two mature conductors conduct exactly alike, there exists a basic clarity of technique that is instantly – and universally – recognized. When this clarity shows in the conductor’s gestures, it signifies that he or she has acquired a secure understanding of the principles upon which it is founded and reasons for its existence, and that this thorough knowledge has been accompanied by careful, regular, and dedicated practice.” (Greene, 1987)

So, while there are many aspects to conducting that are fuzzy and hard to define, it nonetheless has several features that make it promising for rigorous analysis. The specific methods and gestures used by conductors will not be discussed here; for more information, please see (Marrin Nakra, 2000) or (Rudolf, 1994).

2. Background and related work

This section presents the work that inspired and informed the Conductor’s Jacket project, focusing on prior interfaces and systems for conducting. The author’s Digital Baton device is described, along with recent developments in expression rules and theoretical frameworks for musical mappings.

2.1 Interfaces and interactive systems for conducting

During the past thirty years there have been many attempts to build systems to “conduct” music using electronics. These have varied widely in their methods, gesture sensors, and results. They include systems using light-pens, radio transmitters, ultrasound reflections, sonar, video tracking, sensor gloves, accelerometers, keyboards and mice, pressure sensors, and infrared tracking systems. Many of these are described in detail in (Marrin, 1996) and (Marrin Nakra, 2000). Important examples include the Radio Baton (Mathews, 1991) (Boie, Mathews, & Schloss, 1989) and the Virtual Orchestra (http://www.virtualorchestra.com), a commercial venture run by Fred Bianchi and David Smith. Two other teams have recently implemented Hidden Markov Models to track conducting gestures with impressive results: Satoshi Usa of the Yamaha Musical Instrument Research Lab (Usa & Mochida, 1998), and Andrew Wilson of the MIT Media Lab (Wilson & Bobick, 1998). Other interfaces for interactive music that influenced the development of the Conductor’s Jacket include the BodySynth (Pareles, 1993), BioMuse (Lusted & Knapp, 1989), Lady’s Glove (Bean, 1998), Dancing Shoes (Paradiso, Hu, & Hsiao, 1998), and Miburi (Marrin, 1996).

2.2 The digital baton and hyperinstruments

Beginning in 1987 at the MIT Media Lab, Professor Tod Machover and his students began to bring ideas and techniques from interactive music closer to the classical performing arts traditions with hyperinstruments (Machover & Chung, 1989), which were designed for expert and practiced performers. My primary contribution to this effort was the Digital Baton (Marrin, 1996), a ten-ounce molded polyurethane device that was designed to resemble a traditional conducting baton. Designed and built with the help of Machover, Joe Paradiso, and a team of students, the baton incorporated eleven sensory degrees of freedom: 2.5 degrees of position, three orthogonal degrees of acceleration, and five points of pressure. The sensors were extremely robust and durable, particularly the infrared position tracking system that worked under a variety of stage lighting conditions.

The Digital Baton software system enabled a number of conducting-like functions; while it did not explicitly make use of traditional conducting gestures, it made use of musical behaviors that lay in the continuum between actuating individual notes and shaping their higher-level behaviors. In the sense that a performer did not have to “play” every note directly, the control functions were more “conductorial” in nature. We also made use of “shepherding” behaviors; these were used to shape higher-level features in the music without controlling every discrete event.

Machover wrote two pieces of original music for the Digital Baton and we first performed them in a concert of his music in London’s South Bank Centre in March of 1996. Later, he incorporated it into the Brain Opera performance system, where it was used to trigger and shape multiple layers of sound in the live, interactive show. Having designed and contributed to the construction of the instrument, I also performed with it as one of three instrumentalists in more than 150 live performances of the Brain Opera. Despite several important contributions that were made and its success in performance, however, there were a few fatal flaws in the design of the Digital Baton. Perhaps its most severe drawback was that it was too large and heavy to easily allow for graceful, comfortable gestures; it was five times the weight of a normal cork-and-balsa wood conducting baton. This meant that it was too heavy for a conductor to use in place of a traditional baton, and therefore wasn’t practical for use by other conductors. Secondly, it lacked a sophisticated software layer to interpret gestures; it was unable to apply expression rules and theoretical frameworks to the many degrees of freedom of conducting gestures. Nonetheless, the Digital Baton was an excellent early attempt to create a useful gestural interface, and made an important contribution to the state-of-the-art at the time.

Fig. 1. The author performing with the Digital Baton, February 1996 (photo by Webb Chappell).

2.3 Expression rules research

It was argued above that new musical technologies may not yet convey the most deeply meaningful aspects of human expression, and for that reason are uninteresting to performing musicians. However, in the area of Musical Expression research (defining the rules and patterns of musical expression), significant progress has been made recently. Several researchers have shown that variations in tempo and dynamics conform to structural principles and have demonstrated them in terms of expression rules that describe the expressive strategies that musicians use. Major contributors to this area include Caroline Palmer (Palmer, 1988), David Epstein (Epstein, 1995), MaryAnn Norris (Norris, 1991), Peter Desain and Henkjan Honing (Desain & Honing, 1997), J. Sundberg (Sundberg, 1998), Neil Todd (Todd, 1992), Carol Krumhansl, and Giuseppe De Poli (De Poli, Roda, & Vidolin, 1998). Robert Rowe has also written that one of the most important motivations we have for improving the state of the art in interactive music systems is to include greater musicianship into computer programs for live performance. He feels that not only should the programs be more sensitive to human nuance, but they should also become more musical (Rowe, 1996). Recent progress made by Max Mathews in improving the musical responsiveness of his Radio Baton system can be attributed to important collaborations with J. Sundberg (personal communication with Mathews, December 2000) and the incorporation of “expression rules” into the real-time musical mappings.

2.4 Theoretical frameworks for mappings between gestures and music

There does not yet exist an established set of rules about how to map gestural inputs to musical outputs; the task of creating meaningful relationships between gesture and sound remains primarily an ad hoc art. Few have attempted to formalize theories and frameworks for mappings, which Barry Schrader described as action/response mechanisms (Schrader, 1991), into systematic grammars, definitions, and categories. One of the first to try to formulate a coherent theory about gestural music mappings was Joel Ryan, who was interested in using empirical methods “to recover the physicality of music lost in adapting to the abstractions of technology” (Ryan, 1991).

In 1996 I developed a theoretical framework for musical mappings that would make sense for the Digital Baton (Marrin, 1996), describing how an entire gestural language, such as conducting, is constructed from its most primitive elements. The simplest formulation of my framework divides all controls into two categories: continuous and discrete. Discrete gestures were defined as single impulses or static symbols that represent one quantity that is repeatable and deterministic; an example would be flipping a switch or pressing a key on a keyboard. More elaborate examples of discrete gestures can be found in the class of semaphores and fixed postures, such as in American Sign Language. All of these gestures, no matter how complicated they are, generate one discrete symbolic mapping – for example, a binary flip, a single note, or a single word. On the other side, continuous gestures were defined to be gestures that did not have a simple, one-to-one relationship but rather might have an ambiguous trajectory that would need to be interpreted. It is relatively simple to make mappings between discrete phenomena, but more continuous controls hold the richest and most elusive information. After making use of this discrete/continuous formalism, it was extended into the realm of regular grammars. A multiple-state beat model was developed to account for all the possible variations on conducting beat patterns; ultimately, however, this proved to be too brittle to implement. After attempting to find a framework through theory, a framework was eventually developed from a quantitative data analysis of conducting technique, which is described in the following section.
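As a simplified illustration of this division (this is not code from the Digital Baton system; the type names are only placeholders), a mapping layer might expose the two categories through different interfaces:

    #include <functional>

    // A discrete control fires a single symbolic event, such as a key press
    // emitting one note; a continuous control streams values that must be
    // interpreted over time, such as an EMG envelope.
    struct DiscreteControl {
        std::function<void()> onTrigger;
    };

    struct ContinuousControl {
        std::function<void(float)> onNewSample;
    };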

3. Approach

The Conductor’s Jacket project was begun in the spring of 1997 and carried out in five interwoven stages over two years. The first task was to model the body as a signal generator and design a system to optimally sense the most important signals. This involved several initial pilot tests and investigations into physiological sensors and data-gathering methods, as well as constructing several versions of the interface and sensor hardware and collecting data from numerous subjects. The final version of the device was a fitted shirt that contained sixteen physiological and positional sensors (1 respiration, 1 heart rate, 1 skin conductivity, 1 temperature, 4 electromyogram, and 8 Polhemus motion sensors). The data from each channel was acquired reliably at rates of 3 kHz. The jacket was designed to sense as many potentially significant signals from a working conductor as possible without changing or constraining his behavior (Marrin & Picard, 1998). Six local orchestra conductors with a wide range of expertise and styles agreed to wear a personalized jacket and let us collect data during their rehearsals and performances (Marrin & Picard, 1998). Next, a visual analysis was used to extract the most promising features from the data and to explore useful filtering, segmentation, and recognition algorithms for exposing the underlying structural detail in those features (Marrin Nakra, 1999). Afterwards, a more in-depth interpretation project was done to explain the stronger underlying phenomena in the data; this consisted of interpreting the results of the analysis phase and making decisions about which features were most salient and meaningful. In the final stage, a software system was built to synthesize expressive music by recognizing and reflecting musical features and structures in conducting gestures (Marrin Nakra, 2000).

It should be acknowledged that the use of physiological sensors in the Conductor’s Jacket required a significant amount of additional work to ensure that their outputs were as noise-free as possible. There were also many additional questions regarding the expressive intentions behind various signals generated by muscular activity, respiration, heart rate, etc. Since relatively little work had been previously done on the physiological states of performing musicians and conductors, establishing a ground truth from which to analyze the data was not always possible. Deep questions of emotion and expression motivated our analysis, but were not always answerable. A more in-depth discussion of these issues can be found in (Harrer, 1975), (Marrin Nakra, 2000), (Marrin Nakra, 1999), and (Marrin & Picard, 1998).

3.1 Integration of data analysis results

The analysis phase of the Conductor’s Jacket project yielded thirty-five salient features (Marrin Nakra, 2000); these features were identified with intuitive and natural gestural tendencies. Most of these features involved muscle tension and respiration patterns, since those signals reflected more conscious control on the part of the subjects. Eight of these features were later implemented and synthesized in software: right arm beat detection, left arm expressive variation, one-to-one mappings between muscle tension and dynamic intensity, predictive indications, minimization of repetitive signals, recognition of information-bearing gestures, rate encoding, and correlations between respiration and phrasing.

After these results were determined, the Conductor’s Jacket sensor interface was then redesigned to emphasize the muscle tension (electromyogram, or EMG) and respiration signals. We explored various configurations of muscle groups on both arms, finally settling on the following seven EMG measurements: right biceps, right forearm extensor, right hand (the opponens pollicis muscle), right shoulder (trapezius), left biceps, left forearm extensor, and left hand (opponens pollicis). All sensors were held in place on the surface of the skin by means of elastic bands, and the leads were sewn onto the outside of the jacket with loops of thread. Additional loops of thread were used to strain-relieve the cables. The jacket was attached to the computer by means of a cable that was plugged into sensor connections at the wearer’s belt. The final Conductor’s Jacket controller is shown in Figure 2.

Besides integrating power, ground, and numerous signals, the jacket needed to be practical to use. It was therefore designed to be easy to wear; it took about as much time to put on as a turtleneck sweater. After it was on, the sensors had to be adjusted under the elastics so that they reliably contacted the skin. Elastics on the wrists ensured that the EMG sensors would sit in place on the hand. In order to plug into the computer, the user would wear a belt that was attached to the connecting cable. The eight sensor connections were then made on the belt and the grounding strap was put in place next to the skin on the inside of the belt. For more information on the hardware architecture of the system, please see (Marrin Nakra, 2000).

4. Expressive synthesis

This section presents numerous examples from the musical software system that was built for the Conductor’s Jacket. Its basic function was to detect incoming signals from the jacket sensors in real-time, interpret their expressive features, and apply them expressively in a musical context. The goal was for the music to convey qualities that general audiences would recognize as being similar to and reflective of the original control gestures. Our test for the success of each algorithm was whether it sounded “right” and “intuitive” to both the performers and the audience. The following section describes the system architecture and real-time digital signal processing techniques that were implemented. It also presents the C++ algorithms that mapped the resulting signals to sound. Several etudes and compositions are described in detail at the end, along with descriptions of their public performances.


Fig. 2. The final form of the Conductor’s Jacket.


4.1 System architecture

The final Conductor’s Jacket system included an updated jacket interface (described in section 3), two networked computers, and MIDI-controllable sound production equipment (synthesizers, effects processors, and a mixing board). To handle the fast data acquisition, filtering, and mapping tasks of the system, two separate computers were used. This modularized the different tasks, avoided overtaxing a single processor, optimized data acquisition rates, and allowed for easier debugging. Also, it provided for a more flexible development environment, whereby the many software mappings on one computer could access any of the parallel data streams from the other. Both machines used the Windows 95 operating system and were connected via Ethernet over a TCP/IP socket. They could be connected via local-area network or crossover Ethernet cable; the data rates were equivalent whether the data was sent over a network or directly between machines.

Computer 1 ran National Instruments’ Labview programming environment, which filtered, processed, and streamed the data to an open socket connection. Computer 2 ran Microsoft’s Visual Developer Studio C++ environment, which read the incoming data streams off of the socket and mapped them to algorithms that generated MIDI output. The software architecture followed the basic model of a typical computer vision or gesture recognition system, with optimizations for real-time responsiveness. Despite several stages of multi-channel data processing, the lag time between gesture and sound was imperceptible. The data flowed through several stages in a straightforward trajectory: sensor inputs → preprocessing → filtering → mapping → musical output.

The EMG and respiration sensors in the jacket output up to eight lines of data as raw voltages. The signals were acquired by a ComputerBoards ISA-bus data acquisition card (CIO-DAS 1602/16) in Computer 1, which converted them to 16-bit digital signals at a specified sampling rate (usually 3 kHz/channel) and voltage range (+/-10 volts). After acquiring each successive array of sensor data, a customized Labview application processed and filtered the data points in parallel, converting them from 16-bit fixed-point values to byte integers by first splitting each value into an 8-bit exponent and an 8-bit mantissa. The mantissas and exponents were then each converted into byte strings and concatenated into one long byte string. Labview then opened an Ethernet socket connection to Computer 2 and sent out a new byte string every 50 milliseconds. A C++ program on Computer 2 read the data from the socket connection, split off the sensor value components in the right order, scaled each mantissa by its exponent, and then assigned each final value to a local variable, defined in an associated header file (“sensor.h”). Each pointer was overwritten by its new value at every execution of the program, although the precise timing of each execution was not controlled due to Windows system-level interrupts and timer inconsistencies. Additional functions in the code took those variables and applied them to musical mappings that used the Rogus MIDI library (Denckla & Pelletier, 1996). A menu-driven graphical user interface allowed the performer to dynamically choose which piece to perform and select one of several graphing modes for visual feedback of the data.
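The receiving code itself is not reproduced here, but the decoding step can be sketched roughly as follows; the packet layout and the reconstruction of each value as a mantissa scaled by a power-of-two exponent are assumptions, since the exact format is not specified above.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Sketch of the unpacking step on Computer 2, assuming one exponent byte
    // followed by one mantissa byte per sensor channel.
    std::vector<double> unpackSensorPacket(const uint8_t* packet, int numChannels) {
        std::vector<double> values(numChannels);
        for (int ch = 0; ch < numChannels; ++ch) {
            uint8_t exponent = packet[2 * ch];
            uint8_t mantissa = packet[2 * ch + 1];
            // Approximate reconstruction of the original 16-bit fixed-point value.
            values[ch] = static_cast<double>(mantissa) * std::pow(2.0, exponent);
        }
        return values;
    }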

A number of tricks were required to get the system to respond in real-time. For example, the different channels of sensor data were acquired and processed in small buffers (window lengths of 30 samples). The two applications also had to be launched and executed in a particular order, so that the socket connection was correctly established. Despite the extra steps that were required to get it working, this architecture had a few advantages. Most importantly, it allowed for the most recent value to be available at each execution of the C++ code, which ensured a smoother transition between successive sensor values and fewer perceptible glitches in the musical result. While the exact latency between gesture and resultant sound was never measured, there was no perceptible delay.

4.2 Real-time signal processing

Several real-time filters were built in Labview to extract relevant features from the sensor data, including rectification, beat (peak) detection, beat time-stamps, beat amplitudes, first derivatives, second derivatives, envelopes (smoothed instantaneous values), power spectral densities, spectral analyses and spectrograms (using both the Fast Fourier Transform and the Fast Hartley Transform), and noise analysis. The algorithms and details of all these filters are given below. Most of these were used in the final performance system; some were used to analyze the data in order to build appropriate mappings.

In order to make the EMG sensor data more usable for mappings, it was first run through a zero-mean filter, which estimated the mean from a fixed number of samples and subtracted it from the signal. This yielded a signal with a baseline of zero. The mean was continually re-estimated at every new window of samples.

After giving the data a zero mean, full rectification was done by taking the absolute value of every sample. Beat detection was done for both the right and left biceps signals. A simple peak detector algorithm was used with a fixed threshold of 0.1 volt (a value of 600 out of 65,536) and a running window size of 150. The algorithm picked peaks by fitting a quadratic polynomial to sequential groups of data points. The number of data points used in the fit was specified by a set width control of 50. For each peak the quadratic fit was tested against the threshold level (600); peaks with heights lower than the threshold were ignored. Peaks were detected only after about width/2 (25) data points were processed beyond the peak location. If more than one peak was found in a window, only one peak would be registered.
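A rough C++ sketch of this preprocessing chain is given below. It is a simplification rather than the code actually used: the real system relied on Labview's quadratic-fit peak detector, whereas this sketch only shows zero-mean removal, rectification, and a plain threshold test on one window of samples.

    #include <cmath>
    #include <vector>

    // Remove the estimated DC offset of one window of samples, then rectify.
    void zeroMeanAndRectify(std::vector<double>& window) {
        double mean = 0.0;
        for (double s : window) mean += s;
        mean /= static_cast<double>(window.size());
        for (double& s : window) s = std::fabs(s - mean);
    }

    // Simplified beat detection: report at most one peak per window,
    // and only when it exceeds the fixed threshold (600 out of 65,536 here).
    bool detectBeat(const std::vector<double>& window, double threshold, double& peakAmplitude) {
        peakAmplitude = 0.0;
        for (double s : window) {
            if (s > peakAmplitude) peakAmplitude = s;
        }
        return peakAmplitude > threshold;
    }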


At every peak registered, a beat would be sent with a value of 1. A consistent feature of this beat detection method was the problem of multiple triggers, which occurred when one beat overlapped adjacent sample buffers. Increasing the buffer size solved the problem to a certain extent, but also slowed down the overall response time of the system.

In order to gather a measure of the force that was used to generate each beat, a filter detected beat intensity by calculating amplitudes for each peak. When one or more peaks were detected in a window of 150 samples, this filter returned the value of the greatest one. This value was frequently used to determine the volume of the notes for a given beat. A beat timer was also developed that simply time-stamped the moment of a beat to the millisecond. This time-stamp was used to find the inter-onset interval (in milliseconds) between consecutive beats. Each successive time value was sent to Computer 2, which computed the difference between it and its predecessor to determine the millisecond interval between beats. That information was converted to a local tempo value, as the number of beats per minute.
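For example, if two consecutive beats are time-stamped at 4,000 ms and 4,500 ms, the inter-onset interval is 500 ms, which corresponds to a local tempo of 60,000 / 500 = 120 beats per minute.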

One method we used to smooth the EMG signal was a time-domain envelope follower. This was created by first convolving an impulse response with a decaying exponential. The result was then scaled and convolved with the EMG signal. Many aspects of this algorithm were experimented with, and we achieved some success in removing extraneous jitter in the EMG signal, at some expense in processing time.
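The envelope follower itself is not listed here; the sketch below shows one common way to realize a decaying-exponential envelope as a simple peak-hold filter over the rectified signal, with an illustrative decay constant rather than the value used in the original system.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Decaying-exponential envelope follower over a rectified EMG signal.
    // 'decay' close to 1.0 gives a slow, smooth envelope; smaller values track faster.
    std::vector<double> envelopeFollow(const std::vector<double>& rectified, double decay = 0.99) {
        std::vector<double> env(rectified.size());
        double state = 0.0;
        for (std::size_t i = 0; i < rectified.size(); ++i) {
            state = std::max(rectified[i], state * decay);  // rise instantly, decay exponentially
            env[i] = state;
        }
        return env;
    }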

Another group of filters performed spectral analyses of the signals. These enabled observation of the frequency-domain behavior of EMG signals in real-time. The first computed a fast Fourier transform on an unrectified signal. (Rectification was not applied in this case, since the loss of negative frequencies would have taken out half of the DC component.) Since the input sequence was not a power of 2, our algorithm used an efficient Real Discrete Fourier Transform routine. Another filter graphed the power spectrum of an input signal. The formula we used for this function is:

    power spectrum = |FFT(X[n])|^2

A third filter computed a spectrogram, a graphical representation of a signal in which each vertical slice represents the magnitude (intensity) of each of its component frequencies. This utility computed the real FFT and plotted the magnitudes of the frequencies (organized in discrete bins and sorted according to frequency value) versus time. The full frequency spectrum was divided into either 256, 512, or 1024 bins, depending on how much delay was tolerable at that time. A variant of this utility sorted the bins according to their magnitudes. This gave a sense of the overall magnitude of activity across all the frequencies, producing a smoother result that was easier to read and resembled a Power Spectral Density graph. In the end it turned out that the frequency domain correlated well with the amplitude domain; at a moment of high amplitude (i.e., at the moment of a beat), the frequencies would all increase markedly in magnitude. There were few instances where frequencies didn’t respond together, and sometimes this provided a helpful indicator of noise. For example, if the magnitudes of the frequencies were visibly elevated at around 60 Hz, it indicated that external sources of electrical noise were too high and needed to be removed.
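As an illustration of the binning step only (assuming the complex spectrum has already been produced by some real-FFT routine; this is not the original Labview code), one vertical slice of such a spectrogram can be computed like this:

    #include <complex>
    #include <cstddef>
    #include <vector>

    // Collapse a complex spectrum into a fixed number of magnitude bins
    // (e.g., 256, 512, or 1024), forming one vertical slice of a spectrogram.
    std::vector<double> binMagnitudes(const std::vector<std::complex<double>>& spectrum,
                                      std::size_t numBins) {
        std::vector<double> bins(numBins, 0.0);
        std::size_t perBin = spectrum.size() / numBins;
        if (perBin == 0) perBin = 1;
        for (std::size_t i = 0; i < spectrum.size(); ++i) {
            std::size_t b = i / perBin;
            if (b >= numBins) b = numBins - 1;
            bins[b] += std::abs(spectrum[i]);  // accumulate magnitude into its bin
        }
        return bins;
    }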

Finally, we wrote a noise analysis function to fit the frequency distribution of the signal to a Gaussian curve; if the two matched closely, then the noise on the sensor was determined to be white and minimal. Otherwise, if the two distributions didn’t fit, that indicated a large source of either signal or noise. If the sensor was placed properly on the surface of the skin and the muscle was not contracting, it was safe to say that the anomaly came from a source of electrical noise. This utility was useful for finding sources of noise that were not detectable from a visual analysis of the time-domain component of the signals.

A constant challenge of the signal processing effort was to keep the sampling rates high and the window sizes small in order to ensure real-time response; here we frequently bumped up against the uncertainty principle. The uncertainty principle defines the tradeoff between frequency resolution and time resolution – in order to obtain high frequency resolution you need to maximize the number of samples collected in one buffer. Conversely, to have a fine degree of time resolution, you need to minimize the number of samples per buffer. Optimizing conditions vis-à-vis time and frequency was a constant theme of the signal processing and filtering tasks on Computer 1. In general we tended to compromise in the frequency domain in order to ensure real-time responsiveness of the system.
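To make the tradeoff concrete: at the 3 kHz per-channel sampling rate used here, a 150-sample buffer spans only 50 ms but yields a frequency resolution of just 3000/150 = 20 Hz, while a 1024-sample buffer improves the resolution to roughly 3 Hz at the cost of waiting about 340 ms for the window to fill.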

4.3 Code interfacing between filters and mapping structures

After the filters were written in Labview on Computer 1, a layer of code was needed that could transfer the filtered data to be mapped on Computer 2. We developed a utility using sockets and TCP/IP over Ethernet. It grabbed each line of sensor data in the form of an 8-bit unsigned integer, converted each one to a string, and concatenated all the individual strings into one long string. Then, using a built-in Labview utility called TCPWrite.vi, the data was given a connection ID. A second built-in function provided an Ethernet address and a remote port number with which to set up the socket. At the first execution of this function, the socket was opened; at every successive execution, each line of data was concatenated into one long string and sent out to the specified Ethernet address. On Computer 2, special code was written in C++ to open up a socket and read data from the same port number. When the expected number of data bytes was received, the C++ application printed the number of bytes received to a terminal window. The C++ code did not fully execute unless the right amount of data was detected at the socket.
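The behavior described above, in which the mapping code only proceeds once a complete packet has arrived, can be sketched with a generic blocking-read loop on a Berkeley-style socket; the original system ran on Windows 95 with Winsock rather than the POSIX calls shown here, and the function name is a placeholder.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cstddef>

    // Block until exactly 'expected' bytes have been read from the socket.
    bool readFullPacket(int sock, unsigned char* buffer, std::size_t expected) {
        std::size_t total = 0;
        while (total < expected) {
            ssize_t n = recv(sock, buffer + total, expected - total, 0);
            if (n <= 0) return false;  // connection closed or error: skip this cycle
            total += static_cast<std::size_t>(n);
        }
        return true;                   // full packet received; safe to unpack and map
    }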


4.4 C++ mapping algorithms and etudes

Finally, after the stages of acquisition, filtering, conversion, and transfer, the data was available to be mapped to musical structures in C++. A range of algorithms was designed to explore a variety of musical and gestural possibilities with the data. We initially decided to build a general-purpose toolkit of Etudes, simple test pieces that applied mapping structures one-by-one. Later, several full pieces were implemented in MIDI, and finally, a software package called Superconductor (by Microsound International and Manfred Clynes) was included to enable any piece of classical or computer music to be performed with the system.

Like traditional instrumental “studies,” the idea for the first Conductor’s Jacket Etudes was that these short pieces could be practiced in order to hone some aspect of the technique. The basic functions were written as a series of low-level abstractions so that they could be useful for creating a broad range of musical styles; they comprised a construction kit for mapping gestures. The Etudes were intended to be simple and clear – not motivated by any overriding aesthetic considerations, but rather by an attempt to build straightforward relationships between gesture and sound.

Etude 1, entitled “Tuning,” was written to provide auditory feedback to the performer to indicate the state and calibration of the sensor system. This was intended to be similar to the function of instrumental tuning, which traditional acoustic musicians do before starting to play. In the classical tradition, tuning is done not only to make sure that all the instruments match the same reference pitch, but also for the musicians to play a little bit to test the state of their instruments, with enough time to make a few minor adjustments. The “Tuning” etude played out repeated notes sonifying the raw output of each successive sensor, mapping the raw voltage of the sensor signal directly to pitch and volume. The sounds clue the performer in to the state of the system by sonifying the range and responsiveness of the sensor; if the sensor is not behaving as expected, the performer will notice an audible difference from the expected norm.
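A minimal sketch of this kind of sonification mapping is given below; the scaling constants are illustrative, and the note-sending function is a placeholder standing in for the Rogus MIDI calls used in the actual Etude.

    #include <algorithm>
    #include <cstdio>

    // Placeholder for a real MIDI note-on sender (channel, pitch, velocity).
    void sendMidiNote(int channel, int pitch, int velocity) {
        std::printf("note-on ch=%d pitch=%d vel=%d\n", channel, pitch, velocity);
    }

    // Map one raw sensor reading (0..65535) directly to pitch and volume,
    // so the performer hears the range and responsiveness of the sensor.
    void sonifySensor(int channel, int rawValue) {
        int pitch    = 36 + (rawValue * 60) / 65536;               // roughly a five-octave range
        int velocity = std::clamp((rawValue * 127) / 65536, 1, 127);
        sendMidiNote(channel, pitch, velocity);
    }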

Etude 2 mapped the filtered output of the right biceps EMG sensor to different parameters of a single sound; the volume, timbre, and pitch were modified directly and simultaneously (although with slightly different algorithms) by the signal. If the muscle was not activated, no sound played; as the muscle contracted, a sound grew in volume and increased in pitch and brightness. This very direct relationship provided both the performer and the observer with an immediate sense of the profile of the muscle tension signal. In Etude 3, issues of beats, cutoffs, and sustained crescendo/diminuendo were explored. The right biceps EMG generated beats (note-ons) and cutoffs (note-offs) for a set of chords. Once a chord was initiated it was sustained up to a maximum time of 18 seconds until a beat gesture again cut it off. While the note was playing, the left biceps value scaled the sustained channel volume of the note; tensing the left biceps muscle made the note get louder, relaxing it made the note get softer.

Ultimately, more than twenty Etudes were developed. However, many of the mappings seemed simple and spare when performed one at a time, and the series of one-to-one mappings quickly became repetitive. It became clear that bigger, multi-layered systems were needed in order to manipulate many aspects of a piece of music simultaneously. We decided to focus on developing a software system for the Conductor’s Jacket that would be general-purpose enough to incorporate either existent pieces of music from the classical repertoire or new works composed specifically for it. The remainder of this article describes the structure and functionality of the final performance system that was implemented.

4.5 Expressive musical functions

Using C++ algorithms in combination with a MIDI library developed at the MIT Media Lab (Denckla & Pelletier, 1996), a number of real-time expressive controls were developed for the Conductor’s Jacket. These included features such as beat detection, tempo tracking, cues, cutoffs, holds, pauses, note volumes, channel volumes, articulations, panning, octave doublings, triggers, pitch choices, accents, timbre morphing, balances, meter, and voicings. These real-time musical performance parameters are described in detail in this section.

Perhaps the most important musical quantities for conductors are the individual beats that they generate with their gestures. These beats are indicated by an accelerational downward motion, called an ictus, which changes direction and appears to “bounce” back upward. The moment of the beat is indicated when the change in direction occurs. There are several theories about whether or not conductors are in direct control of the individual beats of an orchestra, and so a few different methods were explored at the outset. We ended up preferring a ‘direct-drive’ model, meaning that each beat was directly given by a specific gesture, and the music stops and starts instantaneously. If no beats occur, then the variable emg0.beatData remains at 0 and no notes play. If a beat occurs, then emg0.beatData changes to 1 and one beat’s worth of notes play. The current beat number is kept track of by a counter (i) that increments at every beat and pushes the current point in the score forward. The code for incrementing the beat pointer is given below:

    if (emg0.beatData > 0) {
        i += 1;               // advance the pointer to the next beat in the score
        emg0.beatData = 0;    // reset the beat flag until the next detected beat
        . . .
    }

In most implementations, the notes of a piece were encoded in a fixed order and rhythm, reflecting the Western classical convention of the written score. Once started, the score could only progress in one direction until complete, although no beats would play unless first given by the performer.

Secondly, tempo detection was an essential element to give to the control of the performer. For this we developed a ratio-based structure for keeping track of the timing differentials between consecutive notes. This was done on two different levels: within beats (such as the relationship between consecutive sixteenth-notes) and across beats (such as between consecutive quarter-notes). As described earlier, Computer 1 detected beats and sent time-stamps when they occurred. When a beat was received at Computer 2, the previous beat’s time-stamp was subtracted from the most recent beat’s time-stamp. This provided the inter-onset interval in milliseconds. The across-beat timing difference, known as tempo, was then defined in terms of the inter-onset interval:

    tempo = 60,000 milliseconds (1 minute) / interonset_interval between beats (in ms)

We also defined the variable waittime to be the distance between successive sixteenth notes, which were the shortest notes that we implemented. An elaborate smoothing function was developed so that the rapid tempo adjustments (particularly in the presence of some noise in the beat-generating signals) would not sound too abrupt and so that the tempo couldn’t get too fast too quickly:

    waittime = (((current_interonset_interval / 4) * 2) + (oldwaittime * 3)) / 5 + 50;

We also set an upper limit for waittime (approximately 220 milliseconds), which prevented problems that might arise if the distance between successive beats was very large – as when the performer might stop temporarily and want to start again without the tempo starting up impossibly slowly. Also, the tempo was defined ahead of time for the very first beat of the piece, since tempo cannot be determined from only one Boolean beat. (Human conductors usually use only one beat to cue the orchestra to begin, but in that case the orchestra has the positional information about the velocity and placement of the arm that helps it anticipate when the first beat should be played.)

Thirdly, cutoffs were implemented. Similar to beats, they are strong gestures that indicate the stop time of a held note or a pause. A cutoff function was therefore created so that if a particular score had a musical fermata indicated (i.e., a note of arbitrary length to be decided by the conductor), then the end of the fermata would be triggered by a beat. Beats and cutoffs were triggered by the same gestures, but the context provided by the score determined how the gestures were interpreted and used.

The individual note volumes (MIDI velocities) were also controlled dynamically by the performer. Although methods for doing this varied significantly depending upon the musical context, we commonly used a weighted combination of three values: the force with which the current beat was made, the current respiration value, and the previous volume (to average or smooth the contour). The reason for including the breathing information in the dynamic contours came from a feature found in the analysis of the original conductor data; we found a strong correlation between the respiration and phrasing patterns. Therefore, a multi-modal mapping was developed to incorporate both muscle tension and respiration values together. An elaborate algorithm would typically be used only for the first two notes within a beat. For subsequent notes within a beat, the current value of the right bicep signal would be compared to its value at the beginning of the beat; if the current value was greater, then the new note’s volume value would be the volume of the first note incremented by 8 (out of a total range of 0–127). Otherwise, it would be decremented by 8. This method of linking note volumes within one beat was modeled after the work of Manfred Clynes (Clynes, 1990).
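A rough sketch of the within-beat volume rule described above might look like the following; the function and variable names are placeholders, not those of the original code.

    #include <algorithm>

    // After the first notes of a beat, nudge each new note's velocity up or down
    // by 8 depending on whether the right biceps signal has risen or fallen
    // since the beginning of the beat.
    int nextNoteVolume(int firstNoteVolume, double currentBiceps, double bicepsAtBeatStart) {
        int volume = (currentBiceps > bicepsAtBeatStart) ? firstNoteVolume + 8
                                                         : firstNoteVolume - 8;
        return std::clamp(volume, 0, 127);  // keep within the MIDI velocity range
    }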

In some cases it was found to be useful to combine the control of individual note volumes with MIDI channel volumes; this allowed the performer to have a lot more dynamic range of control between a subtle whisper and a loud sweep. In one case this function was called morph, and each morph function was identified with a channel number. Its code for channel 0 is given below:

Combining channel volumes with note velocities allowed for a new degree of control over the dynamic range; in the hands of skillful performers, the effect could be extremely dramatic and compelling. Articulation was another musical structure controlled in real-time by the Conductor’s Jacket. Articulations are defined as “the characteristics of attack and decay of single tones or groups of tones” (Randel, 1986). Since our previous quantitative study of conductors had determined that the forearm EMG signal indicated the quality of articulation, we decided to determine the lengths of individual notes by the usage of the forearm. We approximated the articulation of each note to be its relative length – staccato notes would be short, legato notes long. The code below shows how the EMG1 sensor directly determines the length of each sixteenth-note (weighted equally with the previous value for length, which smoothes the result):

length = (emg1.rectified(hWnd) + length) / 2;   // equal weighting of the new forearm reading and the previous length
if (length > 2000) {                            // cap the note length
    length = 2000;
}
r->s.play_note(m, length);                      // play the note m with the computed length


Other musical algorithms were constructed with which to control panning, octave doublings, pitch, accents, timbre morphing, balances, and voicings. These are described in detail in (Marrin Nakra, 2000), and were used in several compositions that will be described in the next section.

4.6 Musical compositions for the Conductor’s Jacket

From March of 1999 through the present, several pieces have been developed for performance by the Conductor’s Jacket. Numerous public performances and demonstrations have been given. Several of the more successful completed works are described in this section.

The first full piece that was implemented for the Conductor’s Jacket was intended to be a realistic simulation of conducting. The piece that was chosen was J.S. Bach’s Toccata and Fugue in D minor, a composition for the organ. The mapping strategy applied several of the musical algorithms described above, including beat tracking, tempo tracking, use of the left hand for expressive variation, one-to-one correlation between muscle tension and dynamic intensity, division of labor between biceps, triceps, and forearm, and the link between respiration and phrasing. This implementation represented a proof of concept that the physiological signals of a musician can be used to drive the expressive aspects of a real-time performance. In the performance of this piece the action of the right biceps muscle determines the beats, tempo, beat volumes, and cutoffs. The right forearm gives articulations; sustained contraction of the muscle yields longer notes and therefore a legato quality, whereas shorter contractions of the muscle yield shorter notes with a staccato quality. The use of the left biceps muscle causes octave doublings to fill out the bass, and the aggregate values of the left arm muscles versus the right arm muscles determine the panning of all the voices. The piece was implemented as a full score and, once initiated, cannot be reversed. It can be paused in the middle if the performer does not generate new beats, although this direct-drive model is susceptible to noise. An MPEG clip of this implementation is available for download at: http://immersionmusic.org/Teresa_Marrin_Nakra.mpg (16 Megabytes)

Another piece written by the author is entitled “Song for the End,” a composition that uses improvisational techniques. The set of notes is predetermined but can be performed with any rhythm; this is related to the concept of “recitative.” The opening section is a melismatic vocal passage in which the notes are sequenced in a particular order but every other facet of their performance (start time, volume, accent, length, and timbre) is under the direct control of the performer. This is followed by a fixed sequence in which different muscles can raise and lower the note velocities, channel volumes, and timbral characteristics of four different voices. The combination of controlling note volumes and channel volumes in parallel provided the possibility for extreme crescendo/diminuendo effects, which was fun for the performer but also required greater practice and concentration. At the end of the piece there is a rhapsodic conclusion that is also under the performer’s direct mix control.

In October of 1999, a version of J.S. Bach’s Brandenburg Concerto no. 4 in G major, movement 3 (Presto), was implemented. This piece was developed in collaboration with Manfred Clynes and Microsound International. By combining the functions of the Conductor’s Jacket software with Clynes’ Superconductor software package, we developed a system that performed rate-based tempo following with the right biceps, dynamics shaping with the left biceps, bass voicing with the right forearm, and treble voicing with the left forearm. The premiere performance of this system was featured on CNN Headline News in November of 1999. An MPEG clip of this performance is available for download at: http://immersionmusic.org/teresa-sensbles.mpg (16 Megabytes)

More recently, additional pieces have been implemented for the Conductor’s Jacket system, including a commissioned Concerto for Conductor by composer John Oswald that received its premiere in Boston’s Symphony Hall in May of 2001 with the Boston Modern Orchestra Project and conductor Gil Rose. Additional pieces have been written and implemented, including an experimental piece called SP/RING in which the jacket controls not only the flow of the music but also video images projected on a screen. Active development and commissions of new works for the Conductor’s Jacket continue through the efforts of Immersion Music, an artistic production company I founded in June of 2000.

5. Evaluation and future work

“In music, the instrument often predates the expression it authorizes.” (Attali, 1985)


Fig. 3. The author performing with the Conductor’s Jacket, October 1999. Photo credit: Webb Chappell.


The Conductor’s Jacket project has yielded several important contributions to the state-of-the-art in hardware and software for computer music. The initial data collection and analysis project established patterns in the data of real conductors at work, which enabled the interactive software to react in meaningful ways to the gesture and physiology of its users. Secondly, the Conductor’s Jacket project demonstrated the utility of using an established technique as the basis of an interactive paradigm. The gestural language of conducting is well documented, and its structure has a regular grammar. Without such a framework, it would not have been possible to establish a norm and interpret deviations from it in any musically meaningful way. Another advantage of the Conductor’s Jacket architecture is that its software system was optimized to work within its own real-time constraints. That is, all the data filters run continually, but only make the computationally expensive mapping calculations when they are called for. This means that “rather than having all gestures tracked at all times, gestures with significance for particular sections of the composition were interpreted as needed” (Rowe, 1996). The most important advancements of the software system are in gesture-based tempo tracking, left-arm crescendo techniques, and wrist-based control of articulation. It has successfully served as a proof-of-concept for the method that was used to develop it.
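That architectural point, cheap feature filters running on every tick while the expensive mappings are evaluated only when the current section calls for them, could be sketched roughly as follows (illustrative structure only, not the original code):

#include <functional>
#include <vector>

struct Features { double tempo = 0.0, rightBicep = 0.0, respiration = 0.0; };

struct Mapping {
    bool activeInCurrentSection = false;           // toggled by the score position
    std::function<void(const Features&)> apply;    // the computationally expensive part
};

void tick(Features& features, std::vector<Mapping>& mappings) {
    // ...inexpensive filters update the feature values on every tick...
    for (auto& m : mappings)
        if (m.activeInCurrentSection)              // expensive work only when called for
            m.apply(features);
}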

The Conductor’s Jacket also succeeded in achieving a certain feeling of emphasis. Emphasis might be defined as the property of an instrument that reflects the amount of effort and intensity with which the performer works to create the sound. Most sensor-based instruments fall very short of this property because their sensing mechanisms are not designed to gather this quantity and the synthesis models do not reflect this parameter. The choice to use muscle tension sensors seems to have enabled the jacket to react to the emphasis of the performer. Joel Ryan described this phenomenon as effort:

“Physical effort is a characteristic of the playing of all musical instruments. Though traditional instruments have been greatly refined over the centuries, the main motivation has been to increase ranges, accuracy, and subtlety of sound and not to minimize the physical. Effort is closely related to expression in the playing of traditional instruments. It is the element of energy and desire, of attraction and repulsion in the movement of music. But effort is just as important in the formal construction of music as in its expression: effort maps complex territories onto the simple grid of pitch and harmony. And it is upon such territories that much of modern musical invention is founded.” (Ryan, 1991)

Watanabe and Yachida reflect this idea about emphasis using the word degree:

“Methods such as DP matching, HMMs, neural networks, and finite state machines put emphasis on classifying a kind of gesture, and they cannot obtain the degree information of gesture, such as speed, magnitude and so on. The degree information often represents user’s attitude, emotion and so on, which play an important role in communication. Therefore the interactive systems should also recognize not only the kind of gesture but also the degree information of gesture.” (Watanabe & Yachida, 1998)

Another advantage of the Conductor’s Jacket is that it doesn’t constrain the gestures of its wearer. The constraints of an instrument are the aspects of its design that force the human performer to gesture or pose in a particular way. Some sensor-based instruments have the problem of being overly constrained; the nature of their sensing mechanisms forces the performer to make contorted, unnatural gestures to achieve particular effects. All instruments by their very natures will impose some constraints on their performers, but in order for the instrument to be successful the constraints have to limit the physical movement in such a way as to not hamper the expressive capabilities of the performer. The Conductor’s Jacket seems to have succeeded by this measure. The only frequent awkwardness was with quick schedule changes or last-minute opportunities to perform. In those cases, unlike with a typical instrument, the wearer would have to leave the room to put on the instrument. This was not ideal for instantaneous requests, but otherwise didn’t present any serious problem.

The jacket was also not subject to the typical problem of Cartesian Reliance, which is the property of an instrument whereby the different degrees of freedom are set up in a coordinate space for the performer. An example of such a device would be a mixing board, where the individual levers each move in one direction and map to one quantity. For a musical instrument to be usable by a performer, it should not rely too heavily on the Cartesian paradigm. That is, humans do not naturally perceive the world in Cartesian terms; our intuitive ways of gesturing are often invariant to aspects such as translation, rotation, and scale. Certainly the expressive information in conducting is not contained in the absolute positional information, and this doctoral project has shown that quantities such as muscle tension can sometimes be more useful than acceleration information gathered from the position of the moving arm. The common reliance on orthogonal axes doesn’t map well to meaningful, natural musical responses. For example, in most traditional music, pitch and loudness are coupled – they can’t just be manipulated as independent parameters. Unlike car controls, there is no inherent meaning in increasing a single value independently. While they work for the Theremin, orthogonal relationships may perhaps be too simplistic and arbitrary; they make the engineering easier, but force the musician to conform to an unnatural structure.

The jacket also partially solves the disembodiment problem. Disembodiment is a property of many sensor-based instruments that don’t themselves include the source of their own sound. Unlike with traditional instruments, where the physical action is what generates the vibrations, many sensor instruments are indirectly connected through a long chain of processes to the actuation of their sounds. Many sensor instruments have no resonant cavity and depend upon synthesizers and speakers to convey their sounds; the sounds come out of a set of speakers that might be physically far removed from the actions that generate them. A second component of the disembodiment problem originates from the mapping layer that separates the transduction and actuation processes of interactive music systems. Mappings allow for any sound to be mapped to any input arbitrarily, and the extreme freedom and range of possibility make it hard to construct mappings that look and sound “real” to an audience. It is still not well understood how to construct mappings such that they intuitively map well to an action; this is because interactive music is still an extremely new art form.

For example, the Digital Baton was extremely sensitive to the disembodiment problem. Perhaps this was because of certain properties in its sensory system – the 2D space in front of the performer was interpreted as a large grid, divided into an arbitrary number of cells that acted as triggers. This method worked well algorithmically, but frequently confused audiences because they could not see the virtual grid in front of the performer and were not able to connect the actions of the performer with the musical responses that they heard. It might be said that this caused alienation – that is, the nature of the instrument made it especially difficult to construct mappings that sounded “embodied.” This can cause audiences to become distracted and detached, particularly if they can’t identify how the performer is controlling the sounds. The problem of disembodiment resembles the situation of conducting, where the performer gestures silently and an external source generates the sound. Therefore, it might be said that conducting is, by definition, disembodied, and provides a useful model for sensor-based systems. For this reason it was decided not to add a local source of auditory and tactile feedback to the jacket. Instead, the Conductor’s Jacket project attempted to address this issue by bringing the sensors closer to the source of the physical gestures, namely in the performer’s physiology.

“The most important requirements are immediacy and depth of control, coupled with the greatest flexibility possible; immediacy and depth because the user needs feedback on his action, flexibility because the user wants a predictable but complex response.” (Gelhaar, 1991)

One problem with the Conductor’s Jacket was that internal muscular movements did not always correlate with the perceived character of the movement. That is, it is possible to change the position and velocity of a limb without having that movement reflected in the tension of the muscle. And conversely, it is possible to create tension in a muscle without that tension being visually obvious to observers. While it seems that observers can “see” muscle tension to a certain extent, it is not known how much of this is due to other movements associated with the tension. For example, the moment of a beat is not always visually obvious to an untrained eye, but absolutely clear from the right biceps EMG signal. And the amount of emphasis in the beat, not always proportional to the visually observed velocity of the arm, is nonetheless clear in the muscle tension signal.

This discrepancy between the visual and physiological events in a gesture sometimes causes confusion for the audience and remains an issue to be resolved. Perhaps one solution will be to add additional sensors and measurements to the jacket system. Also, it seems that the intuitiveness and naturalness of a mapping have mostly to do with the audience’s perception, and not with the performer’s ease of use. That is, a performer can quickly train to a system, even a difficult system, to the point where the mapping and gestures feel normal and natural. But it is the audience that ultimately decides if a mapping ‘works’ or not. If the audience is confused about the relationship between the gesture and the music, then the mapping does not work. If people feel a cognitive dissonance between their expectations and the response, then the mapping does not work. If they distrust whether or not the performer is miming to the sound of the music, then the mapping does not work.

In the case of established art forms this phenomenon does not happen; for example, in a performance of a symphony orchestra, the audience does not need to be able to see the precise finger movements of each musician. This is because they trust that all the sound is being created in front of them; they are comfortable about what to expect. With performances of new interactive music, by comparison, there are very few shared expectations. However, the answer is not always to “prove” that the performer is indeed controlling the sound. For example, the Conductor’s Jacket software was purposefully made to be “direct-drive” on the beat level so that when the beats stopped, the music stopped playing. This, we thought, would convince people of the connection between my gesture and the sound. However, it is profoundly unmusical to stop a piece in the middle just to show that you can. This area remains full of open questions and is ripe for much future work and debate.

A related area that was not fully investigated was that of evaluating the results that were obtained for “correctness” and “expressivity.” Our ultimate criterion for success was the generic evaluation of our audiences; deeper procedures for evaluation were not attempted in the scope of this project. A careful study of audience reactions (perhaps from listeners both trained and untrained in music) could yield very important information about the validity of our results. Future work has been targeted in this area with the help of methods from cognitive psychology. These will help establish important ground truths and reference points by which to validate (or refute) our method and results.

Future work on the Conductor’s Jacket will extend and improve upon the mappings that convert gestural signals into music. These mappings must become even more intuitive, and respond even more appropriately to the structure, quality, and character in the gestures. They must not only satisfy the performer’s needs for intuition and naturalness, but also make the gestural-musical relationships clear to the audience that is witnessing the performance. Another improvement to be made for the Conductor’s Jacket system will be a reliable, non-direction-dependent, high bandwidth wireless connection. This is not so much a research question as it is a design issue that could be solved with more expertise and time. This wireless connection could provide the added benefit of proper electrical isolation for the wearer. Finally, the Conductor’s Jacket would benefit tremendously from a positional sensing system. Full 3D positional sensing would allow the system to anticipate the placement of beats by providing the trajectory of the arm in the phases leading up to each beat. (This trajectory is essential in giving the musicians a sense of where the midpoint of the gesture is, so that they can anticipate the time at which the beat will occur.) If properly modeled, trajectories between beats give valuable information about the current tempo, future changes in tempo, and upcoming changes in emphasis. It could also improve the accuracy of the beat detection system by having a second mechanism running in parallel.

In addition to the work that remains to be done, there are several practical improvements that could be implemented quickly. For example, the problem of double triggers is annoying and detracts from the overall effect of the performance. Max Mathews’ suggestion for this problem was to add a refractory period after each beat (personal conversation, May 1999). According to Mathews, a space of approximately 100 milliseconds or so after a beat in which the system does not look for new beats should take care of the problem. He also explained that nerves operate under a similar principle and have a natural recovery period after each firing. Another option is a phase-locked loop, but that is generally not sufficient because it is not flexible enough to account for immediate tempo changes. Another, simpler solution might be found by improving the real-time envelope or smoothing filters that have been written for the right biceps EMG signal, so that its upper values would not be so jagged. In addition, the performance of my beat detection system might have improved if the signal had first been low-pass filtered and then analyzed for inflection points (peak and trough events with associated strengths).
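Mathews’ refractory period amounts to a small gate placed in front of the beat detector; a sketch, with the roughly 100 millisecond window taken from his suggestion and everything else invented for illustration:

struct RefractoryGate {
    double refractoryMs = 100.0;   // approximate window suggested by Mathews
    double lastBeatMs = -1.0e9;

    // Returns true only if the candidate beat falls outside the refractory window.
    bool accept(double nowMs) {
        if (nowMs - lastBeatMs < refractoryMs) return false;   // ignore the double trigger
        lastBeatMs = nowMs;
        return true;
    }
};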

Nearly as troublesome as double-triggered beats is the problem of missed beats. The Conductor’s Jacket system sometimes fails to detect weak beats because their signals fall within the noise band; this can be inconvenient for the conductors if they have to exaggerate their beats in order for them to be recognized by the system. One solution would be to add a positional measurement system into the hardware and run an inflection point recognition system in parallel with the beat recognition system; this would allow the detection threshold to be lowered in the beat recognition system. Only when an EMG spike and an inflection point co-occur would the final system send a beat.
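In other words, the proposal is a coincidence detector: lower the EMG threshold, but report a beat only when an EMG spike and a positional inflection point arrive close together. A sketch, with an assumed co-occurrence window and invented names:

#include <cmath>

struct CoincidenceDetector {
    double windowMs = 60.0;                                  // assumed co-occurrence window
    double lastEmgSpikeMs = -1.0e9, lastInflectionMs = -1.0e9;

    bool onEmgSpike(double tMs)   { lastEmgSpikeMs = tMs;   return check(); }
    bool onInflection(double tMs) { lastInflectionMs = tMs; return check(); }

    // True means: send a beat now.
    bool check() {
        if (std::fabs(lastEmgSpikeMs - lastInflectionMs) <= windowMs) {
            lastEmgSpikeMs = lastInflectionMs = -1.0e9;      // reset so each beat fires only once
            return true;
        }
        return false;
    }
};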

Another issue is that the Conductor’s Jacket software currently lacks a flexible notion of pulse; within each beat, the volumes of consecutive sixteenth notes conform to a fixed profile, scaled by the volume of the first one. This worked reasonably well to “humanize” the sound of the music on the microstructural level, but could have perhaps included more variety. The system would also benefit from a more flexible pulse framework on the level of the measure and the phrase. It could also have controlled other musical qualities such as onset articulations (attack envelopes) and vibrato amplitudes and rates. Finally, more than one tempo framework should be available at any time to the performer. Some observers found the “direct-drive” beat model too direct; they wanted the relationship between the gesture and the performance to be less tightly coupled. To them, it was distracting to have the “orchestra” stop instantly, knowing that a human orchestra would take a few beats to slow down and stop. So perhaps a final system should have a built-in switch between direct-drive, direct-drive with latency, and rate-based tempo tracking modes.
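The fixed within-beat profile mentioned at the start of that paragraph amounts to scaling each sixteenth note by a preset factor of the first note’s volume; a sketch with an invented profile shape:

#include <algorithm>
#include <array>

// Returns the four sixteenth-note velocities for one beat, given the first note's volume.
std::array<int, 4> beatVolumes(int firstNoteVolume) {
    const std::array<double, 4> profile = {1.00, 0.80, 0.90, 0.70};   // assumed profile shape
    std::array<int, 4> v{};
    for (int i = 0; i < 4; ++i)
        v[i] = std::clamp(static_cast<int>(firstNoteVolume * profile[i]), 0, 127);
    return v;
}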

Ultimately, it is not clear whether the Conductor’s Jacket is better suited for a more improvisational or score-based interpretational role. Perhaps if it is to be a successful instrument, it will need to do both. The majority of the pieces that have been implemented for the jacket require the performer to interpret pre-existing musical materials. One issue is that the jacket so far does not include a good way to pick discrete notes, since there is no intuitive gesture for note-picking or carving up the continuous gesture-space into discrete regions. Perhaps if it included a positional system then a two-dimensional map could be used to pick out notes and chords, but perhaps that is not the best way to use the affordances of the jacket. Conductors do not use their gestures to define notes, and there is no intuitive gestural vocabulary for such an action.

The Conductor’s Jacket is potentially very promising as a tool for teaching conducting skills to students. To this end, a project has been launched to introduce the Conductor’s Jacket to students of conducting pedagogy at Arizona State University. Entitled the Digital Conducting Laboratory, the project will enable students to practice their assignments with a simulated orchestral response. The system is intended to recreate many of the behaviors of a live orchestra rehearsal, particularly by reacting to the tempo, articulation, and dynamic line generated by the conductor. In addition to immediate aural feedback, the system will allow the students to review their performances via sound files, video playback, and analysis of muscle-tension profiles. The musical materials consist of a set of etudes that systematically take a conductor through a review of basic conducting gestures. Compiled and arranged by the laboratory’s founder and director, ASU Professor Gary W. Hill, the etudes are principally derived from familiar classical music, freeing the user to focus primarily on his or her conducting gestures and the responses that they generate.

In summation, the goal of the Conductor’s Jacket project was to model and build a system to automatically recognize the expressive content in performed gestures. The resultant system has contributed to the state of the art in interactive musical performances and gesture-based musical instruments. There now stands an immensely rich area for investigation and experimentation, which could yield answers to our most basic questions about musical expression and performance. The Conductor’s Jacket represents a successful proof-of-concept and instrument with which many future projects and performances can be created.

Hopefully, the Conductor’s Jacket presents an idea or implementation that performing musicians will find useful and expressive. Perhaps performers in the future will use gesture to seamlessly interact with computers. If so, perhaps those interactions will relate to the performance traditions that came before. Truly expressive electronic instruments are possibly still decades away, and getting expression right by design is a genuinely hard problem. The traditional musical arts have great potential to provide insight into how to improve and humanize our electronic performances. Established techniques can serve as models from which extensions can be slowly developed, allowing audiences to appreciate them because of historically and culturally shared expectations. The future remains open to be filled with intelligent, responsive, gestural music! The Conductor’s Jacket is just one idea about how that will be accomplished.

“music is significant for us as human beings principally because it embodies movement of a specifically human type that goes to the roots of our being and takes shape in the inner gestures which embody our deepest and most intimate responses. This is of itself not yet art; it is not yet even language. But it is the material of which musical art is made, and to which musical art gives significance.” (Sessions, 1950)

The tools of our medium have not yet been perfected or standardized, but we have to begin somewhere. To paraphrase (Rothenberg, 1996), to the extent that this new musical instrument, the Conductor’s Jacket, makes the performer happy by providing an opportunity for direct expression, it has a chance as a musical tool.

Acknowledgements

The work described in this article was done while I was a doctoral student at the Massachusetts Institute of Technology’s Media Laboratory; I gratefully acknowledge the support of Professors Rosalind W. Picard and Tod Machover, who advised me and oversaw this project with diligence, generosity, and dedication. Thanks also go to John Harbison and David Wessel, whose cogent and probing questions as members of the doctoral committee challenged and improved this work. Thanks also to IBM and Motorola, who provided fellowships that helped to fund the research, and Gregory Harman and Noshirwan Petigara, who contributed excellent work as undergraduate assistants. Finally, I would like to acknowledge the support of my husband Jahangir Nakra, whose care and constancy have made it all worthwhile.

References

••. (1994). The Art of Conducting: Great Conductors of the Past. Teldec video. ASIN: 6303276849.

Attali, J. (1985). Noise: The Political Economy of Music. Theory and History of Literature, volume 16, University of Minnesota Press.

Bean (1998). A Soft Touch. Electronic Musician: 106–109.

Boie, B., Max Mathews, & Andy Schloss (1989). The Radio Drum as a Synthesizer Controller. Proceedings of the International Computer Music Conference, ICMA.

Bongers, B. (1998). “An Interview with Sensorband.” Computer Music Journal 22(1): 13–24.

Boult, Sir Adrian. (1968). A Handbook on the Technique of Conducting. London, Paterson’s Publications, Ltd.

Clynes, M. (1990). “Some guidelines for the synthesis and testing of Pulse Microstructure in relation to musical meaning.” Music Perception 7(4): 403–422.

Clynes, M. (1995). “MicroStructural Musical Linguistics: Composer’s pulses are liked best by the best musicians.” COGNITION, International Journal of Cognitive Science, vol. 55, 1995, pp. 269–310.

De Poli, G., Antonio Roda, & Alvise Vidolin. (1998). A Model of Dynamics Profile Variation, Depending on Expressive Intention, in Piano Performance of Classical Music. XII Colloquium on Musical Informatics, Gorizia, Italy.

Denckla, B. & Pelletier, P. (1996). Rogus McBogus Documentation. Cambridge, MA, M.I.T. Media Laboratory. http://theremin.media.mit.edu/Rogus.

Desain, P. & Honing, H. (1997). Structural Expression Component Theory (SECT), and a Method for Decomposing Expression in Music Performance. Society for Music Perception and Cognition.

Epstein, D. (1995). Shaping Time: Music, the Brain, and Performance. New York, Schirmer, Inc.

Fernandez, R. & Picard, R. (1997). Signal Processing for Recognition of Human Frustration.

Gelhaar, R. (1991). “SOUND = SPACE: an interactive musical environment.” Contemporary Music Review 6(1).

Greene, E.A.H. (1987). The Modern Conductor. Englewood Cliffs, NJ, Prentice Hall.

Harrer, G. (1975). Grundlagen der Musiktherapie und Musikpsychologie. Stuttgart, Gustav Fischer Verlag.

Hawley, M. (1993). Structure out of Sound. Ph.D. Thesis, Media Laboratory. Cambridge, MA, M.I.T.

Healey, J. & Picard, R. (1998). Digital Processing of Affective Signals. ICASSP, Seattle, Washington.

Lusted, H.S. & Knapp, R.B. (1989). “Music Produced by Human Bioelectric Signals.” American Association for the Advancement of Science.

Machover, T. (1996). The Brain Opera. Paris, Erato Disques, http://brainop.media.mit.edu.

Machover, T. & Chung, J. (1989). Hyperinstruments: Musically Intelligent and Interactive Performance and Creativity Systems. International Computer Music Conference, pages 186–190.

Marrin Nakra, T. (2000). Inside the Conductor’s Jacket: Analysis, Interpretation and Musical Synthesis of Expressive Gesture. Ph.D. Thesis, Media Laboratory. Cambridge, MA, Mass. Inst. of Technology. http://www.media.mit.edu/~marrin/HTMLThesis/Dissertation.htm

Marrin Nakra, T. (1999). Searching for Meaning in Gestural Data: Interpretive Feature Extraction and Signal Processing for Affective and Expressive Content. Trends in Gestural Control of Music. M. Wanderley, ed. Paris, IRCAM.

Marrin, T., Joseph Paradiso, et al. (1999). Apparatus for Controlling Continuous Behavior Through Hand and Arm Gestures. United States Patent no. 5,875,257, issued February 23, 1999.

Marrin, T. & Picard, R. (1998). Analysis of Affective Musical Expression with the Conductor’s Jacket. XII Colloquium for Musical Informatics, Gorizia, Italy, pages 61–64.

Marrin, T. & Picard, R. (1998). The Conductor’s Jacket: a Device for Recording Expressive Musical Gestures. International Computer Music Conference, Ann Arbor, MI, pages 215–219.

Marrin, T. (1996). Toward an Understanding of Musical Gesture: Mapping Expressive Intention with the Digital Baton. M.S. Thesis, Media Laboratory. Cambridge, MA, Mass. Inst. of Technology. http://www.media.mit.edu/~marrin/Thesis.htm

Mathews, M.V. (1991). The Conductor Program and Mechanical Baton. Current Directions in Computer Music Research. M.V. Mathews & J.R. Pierce, eds. Cambridge, MIT Press.

Norris, M.A. (1991). The Cognitive Mapping of Musical Intention to Performance. Media Laboratory. Cambridge, MA, MIT.

Palmer, C. (1988). Timing in Skilled Music Performance. Department of Psychology. Ithaca, NY, Cornell University.

Paradiso, J., Eric Hu, & Kai-Yuh Hsiao. (1998). Instrumented Footwear for Interactive Dance. XII Colloquium on Musical Informatics, Gorizia, Italy.

Paradiso, J.A. & Neil Gershenfeld. (1997). “Musical Applications of Electric Field Sensing.” Computer Music Journal 21(2): 69–89.

Pareles, J. (1993). Vaudeville, Complete with a Tornado. New York Times. New York.

Picard, R.W. & Healey, J. (1997). Affective Wearables. IEEE ISWC.

Randel, D., ed. (1986). The New Harvard Dictionary of Music. Cambridge, MA, The Belknap Press of Harvard University Press.

Rothenberg, D. (1996). “Sudden Music: Improvising across the Electronic Abyss.” Contemporary Music Review 13(2): 35.

Rowe, R. (1996). “Incrementally Improving Interactive Music Systems.” Contemporary Music Review 13(2): 49.

Rudolf, M. (1994). The Grammar of Conducting. New York, Schirmer Books.

Ryan, J. (1991). “Some remarks on musical instrument design at STEIM.” Contemporary Music Review 6(1): 7.

Scheirer, E.D. (1998). “Tempo and Beat Analysis of Acoustic Musical Signals.” Journal of the Acoustical Society of America 103:1, pp. 588–601.

Scherchen, H. (1989). Handbook of Conducting. Oxford, Oxford University Press.

Schrader, B. (1991). “Live/Electro-Acoustic History.” Contemporary Music Review 6(1): 86.

Sessions, R. (1950). The Musical Experience of Composer, Performer and Listener. Princeton, NJ, Princeton University Press.

Sundberg, J. (1998). “Synthesis of Performance Nuance.” Journal of New Music Research 27(3).

Todd, N. (1992). “The dynamics of dynamics: a model of musical expression.” Journal of the Acoustical Society of America 91(6): 3540–3550.

Usa, S. & Mochida, Y. (1998). “A conducting recognition system on the model of musicians’ process.” Journal of the Acoustical Society of Japan 19(4).

Watanabe, T. & M. Yachida. (1998). Real Time Gesture Recognition Using Eigenspace from Multi Input Image Sequences. IEEE Conference on Face and Gesture, Nara, Japan, page 428.

Wilson, A. & Bobick, A.F. (1998). Nonlinear PHMMs for the Interpretation of Parameterized Gesture. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. http://vismod.www.media.mit.edu/people/drew/watchandlearn/
