<ul><li><p>Proceedings of the Second Vienna Talk, Sept. 19-21, 2010, University of Music and Performing Arts Vienna, Austria</p><p>ROBUST REAL-TIME MUSIC TRACKING</p><p>Andreas Arzt</p><p>Department of Computational Perception, Johannes Kepler University Linz</p><p>Gerhard Widmer</p><p>Department of Computational Perception, Johannes Kepler University Linz; The Austrian Research Institute for Artificial Intelligence (OFAI)</p><p>ABSTRACT</p><p>This paper describes our any-time real-time music tracking system, which is based on an on-line version of the well-known Dynamic Time Warping (DTW) algorithm and includes some extensions to improve both the precision and the robustness of the alignment (e.g. a tempo model and the ability to reconsider past decisions). A unique feature of our system is the ability to cope with arbitrary structural deviations (e.g. jumps, re-starts) on-line during a live performance.</p><p>1. INTRODUCTION</p><p>In this paper we describe the current state of our real-time music tracking system. The task of a music tracking system is to follow a musical live performance on-line and to output at any time the current position in the score.</p><p>While most real-time music tracking systems are based on statistical approaches (e.g. the well-known systems by Raphael [8] and Cont [4]), our alignment algorithm is based on an on-line version of the Dynamic Time Warping algorithm first presented by Dixon in [5]. We subsequently proposed various improvements and additional features for this algorithm (e.g. a tempo model and the ability to cope with jumps and re-starts by the performer), which are the topic of this paper (see also Figure 1 for an overview of our system).</p><p>2. DATA REPRESENTATION</p><p>Rather than trying to transcribe the incoming audio stream into discrete notes and align the transcription to the score, we first convert a MIDI version of the given score into a sound file by using a software synthesizer. 
Due to the information stored in the MIDI file, we know the time of every event (e.g. note onsets) in this machine-like, low-quality rendition of the piece and can treat the problem as a real-time audio-to-audio alignment task.</p><p>The score audio stream and the live input stream to be aligned are represented as sequences of analysis frames, computed via a windowed FFT of the signal with a Hamming window of size 46 ms and a hop size of 20 ms. The data is mapped into 84 frequency bins, spread linearly up to 370 Hz and logarithmically above, with semitone spacing. In order to emphasize note onsets, which are the most important indicators of musical timing, only the increase in energy in each bin relative to the previous frame is stored.</p><p>3. ON-LINE DYNAMIC TIME WARPING</p><p>This algorithm is the core of our real-time music tracking system. ODTW takes two time series describing the audio signals, one known completely beforehand (the score) and one coming in in real time (the live performance), computes an on-line alignment, and at any time returns the current position in the score. In the following we only give a short, intuitive description of this algorithm; for further details we refer the reader to [5].</p><p>Dynamic Time Warping (DTW) is an off-line alignment method for two time series based on a local cost measure and an alignment cost matrix computed using dynamic programming, where each cell contains the costs of the optimal alignment up to this cell. 
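</p><p>To make this concrete, here is a minimal off-line DTW sketch (illustrative code of our own, not the system's implementation; the Euclidean local cost and the unconstrained full matrix are simplifying choices): it fills the cumulative cost matrix and then traces the optimal path back from the final cell.</p>

```python
import numpy as np

def dtw_align(score_feats, perf_feats):
    """Off-line DTW: fill the cumulative cost matrix via dynamic
    programming, then trace the recursion backwards to recover the
    optimal alignment path (the 'backward path')."""
    n, m = len(score_feats), len(perf_feats)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # local cost measure: Euclidean distance between feature frames
            cost = np.linalg.norm(score_feats[i - 1] - perf_feats[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # each cell of D holds the cost of the optimal alignment up to it,
    # so the optimal path falls out of a backward trace from (n, m)
    path, i, j = [(n - 1, m - 1)], n, m
    while i > 1 or j > 1:
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
        path.append((i - 1, j - 1))
    return path[::-1]
```

<p>On identical inputs the path is simply the main diagonal; in the setting described above, the 84-bin feature vectors of Section 2 would take the place of the frames.</p><p>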
After the matrix computation is completed, the optimal alignment path is obtained by tracing the dynamic programming recursion backwards (backward path).</p><p>Originally proposed by Dixon in [5], the ODTW algorithm is based on the standard DTW algorithm, but has two important properties making it usable in real-time systems: the alignment is computed incrementally by always expanding the matrix into the direction (row or column) containing the minimal costs (forward path), and it has linear time and space complexity, as only a fixed number of cells around the forward path is computed.</p><p>At any time during the alignment it is also possible to compute a backward path starting at the current position, producing an off-line alignment of the two time series which generally is much more accurate. This constantly updated, very accurate alignment of the last couple of seconds is used heavily in our system to improve the alignment accuracy (see Section 4). See also Figure 2 for an illustration of the above-mentioned concepts.</p><p>Figure 1: Overview of our Any-time On-line Music Tracking System (inputs: score representation and live performance; components: Rough Position Estimator, Decision Maker, Structure Model, Trackers 1 to n; output: score position)</p><p>4. THE FORWARD-BACKWARD STRATEGY</p><p>We presented some improvements to this algorithm, focusing both on increasing the precision and the robustness of the algorithm in [3]. 
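</p><p>Before detailing these improvements, the ODTW expansion decision they build on can be sketched as follows (a deliberately simplified reading of our own; Dixon's actual algorithm [5] additionally bounds how often the same direction may be chosen in a row):</p>

```python
import numpy as np

def expansion_direction(D, i, j):
    """Decide where to grow the partially computed cost matrix D:
    into a new row, a new column, or both. Following the ODTW idea,
    we expand towards the direction containing the cell with minimal
    cumulative cost on the current frontier (row i / column j)."""
    best_row = D[i, :j + 1].min()    # cheapest cell in the frontier row
    best_col = D[:i + 1, j].min()    # cheapest cell in the frontier column
    if D[i, j] <= best_row and D[i, j] <= best_col:
        return "both"                # minimum sits in the corner cell
    return "row" if best_row < best_col else "column"
```

<p>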
Most importantly, this includes the forward-backward strategy, which reconsiders past decisions (using the backward path) and tries to improve the precision of the current score position hypothesis.</p><p>More precisely, the method works as follows: after every frame of the live input a smoothed backward path is computed, starting at the current position (i, j) of the forward path. By following this path b = 100 steps backwards on the y-axis (the score) one gets a new point which, with high probability, lies nearer to the globally optimal alignment than the corresponding point of the forward path.</p><p>Starting at this new point another forward path is computed until a border of the current matrix (either column i or row j) is reached. If this new path ends in (i, j) again, this can be seen as a confirmation of the current position. If the path ends in a column k &lt; i, new rows are calculated until the current column i is reached again. If the path ends in a row l &lt; j, the calculation of new rows is stopped until the current row j is reached.</p><p>5. A SIMPLE TEMPO MODEL</p><p>We also introduced a tempo model to our system [1], a feature which has so far been neglected by music trackers based on DTW.</p><p>5.1. Computation of the Current Tempo</p><p>The computation of the current tempo of the performance (relative to the score representation) is based on a constantly updated backward path starting in the current position of the forward calculation. Intuitively, the slope of such a backward path represents the relative tempo differences between the score representation and the actual performance. Given a perfect alignment, the slope between the last two onsets would give a very good estimation of the current tempo. 
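</p><p>This slope really is all there is to the basic estimate; as a minimal illustration (a hypothetical helper of our own, not the system's code):</p>

```python
def relative_tempo(onset_a, onset_b):
    """Relative tempo from two aligned note onsets on the backward path.
    Each onset is a pair (score_time, performance_time) in seconds; the
    slope, performance delta over score delta, is the performance's tempo
    relative to the score rendition (1.0 = identical tempo, values below
    1.0 mean the performer is faster than the score audio)."""
    s0, p0 = onset_a
    s1, p1 = onset_b
    return (p1 - p0) / (s1 - s0)
```

<p>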
But as the correctness of the alignment of these last onsets generally is quite uncertain, one has to discard the last few onsets and use a larger window over more note onsets to come up with a reliable tempo estimation.</p><p>In particular, our tempo computation algorithm uses a method described in [6]. It is based on a rectified version of the backward alignment path, where the path between note onsets is discarded and the onsets (known from the score representation) are instead linearly connected. In this way, possible instabilities of the alignment path between onsets (as, e.g., between the 2nd and 3rd onset in the lower left corner of Fig. 2) are smoothed away.</p><p>After computing this path, the n = 20 most recent note onsets which lie at least 1 second in the past are selected, and the local tempo for each onset is computed by considering the slope of the rectified path in a window of size 3 seconds centered on the onset. This results in a vector v<sub>t</sub> of length n of relative tempo deviations from the score representation.</p><p>Figure 2: Illustration of the ODTW algorithm, showing the iteratively computed forward path (white), the much more accurate backward path (grey, also catching the one onset that the forward path misaligned), and the correct note onsets (yellow crosses, annotated beforehand). In the background the local alignment costs for all pairs of cells are displayed. Also note the white areas in the upper left and lower right corners, illustrating the constrained path computation around the forward path.</p><p>Finally, an estimate of the current relative tempo t is computed using Eq. 1, which emphasizes more recent tempo developments while not discarding older tempo information completely, for robustness considerations.</p><p>t = &sum;<sup>n</sup><sub>i=1</sub> (t<sub>i</sub> &middot; i) / &sum;<sup>n</sup><sub>i=1</sub> i &nbsp;&nbsp;&nbsp;(1)</p><p>Of course, due to the simplicity of the procedure, and especially the fact that only information older than 1 second is used, this tempo estimation can recognize tempo changes only with some delay. However, the computation is very fast, which is important for real-time applications.</p><p>5.2. Feeding Tempo Information to the ODTW</p><p>Based on the observation that both the alignment precision and the robustness directly depend on the similarity between the tempo of the performance and the score representation, we now use the current tempo estimate to alter the score representation on the fly, stretching or compressing it to match the tempo of the performance as closely as possible. This is done by altering the sequence of feature vectors representing the score audio. The relative tempo is directly used as the probability to compress or extend the sequence by either adding new vectors or removing vectors.</p><p>6. LEARNING TEMPO DEVIATIONS FROM DIFFERENT PERFORMERS</p><p>The introduction of this very simple tempo model already leads to considerably improved tracking results. But especially at phrase boundaries with huge changes in tempo, the above-mentioned delay in the recognition of tempo changes still results in large alignment errors. Furthermore, such tempo changes are very hard to catch instantly, even with more reactive tempo models. 
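</p><p>A minimal sketch of the stretching mechanism of Section 5.2 (our own illustrative reading; exactly how the relative tempo maps to duplication and removal probabilities is an assumption of ours):</p>

```python
import random

def adapt_score_features(frames, rel_tempo, rng=None):
    """Stretch or compress the score feature sequence on the fly.
    rel_tempo > 1: the performance is slower, so frames are duplicated
    with probability (rel_tempo - 1); rel_tempo < 1: the performance is
    faster, so frames are kept only with probability rel_tempo."""
    rng = rng or random.Random(0)
    out = []
    for f in frames:
        if rel_tempo >= 1.0:
            out.append(f)
            if rng.random() < rel_tempo - 1.0:
                out.append(f)        # add an extra feature vector
        elif rng.random() < rel_tempo:
            out.append(f)            # otherwise the vector is removed
    return out
```

<p>At rel_tempo = 1.0 the sequence passes through unchanged; at 2.0 it is stretched to roughly twice its length on average.</p><p>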
To cope with this problem we came up with an automatic and very general way to provide the system with information about possible ways in which a performer might shape the tempo of the piece.</p><p>First we extract tempo curves from various performances (audio recordings) of the piece in question. Again, as for the real-time tempo estimation, this is done completely automatically using the method described in [6] (see Section 5.1), but as the whole performance is known beforehand and the tempo analysis can be done off-line, there is now no need for further smoothing of the tempo computation. These tempo curves are directly imported into our real-time tracking system.</p><p>More precisely, as before, after every iteration of the path computation algorithm the vector v<sub>t</sub> containing tempo information at note onsets is updated based on the backward path and the above-mentioned local tempo computation method. But now the tempo curve of the live performance over the last w = 50 onsets, again located at least 1 second in the past, is compared to the previously stored tempo curves at the same position. To do this, all n tempo curves are first normalized to represent the same mean tempo over these w onsets as the live performance. Then the Euclidean distances between the curve of the live performance and the stored curves are computed. These distances are inverted and normalized to sum up to 1, thus now representing the similarity to the tempo curve of the live performance.</p><p>Based on the stored tempo curves our system can now estimate the tempo at the current position. 
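</p><p>The distance-to-similarity step just described can be sketched like this (illustrative code; the epsilon guard against zero distance is our own choice):</p>

```python
import numpy as np

def curve_similarities(live_curve, stored_curves):
    """Turn stored tempo curves into similarity weights for the live
    performance: scale each stored curve to the live curve's mean tempo,
    take Euclidean distances, invert them, and normalize to sum to 1."""
    live = np.asarray(live_curve, dtype=float)
    inv = []
    for curve in stored_curves:
        c = np.asarray(curve, dtype=float)
        c = c * (live.mean() / c.mean())      # same mean tempo as the live curve
        dist = np.linalg.norm(live - c)
        inv.append(1.0 / (dist + 1e-9))       # invert; epsilon guards dist == 0
    weights = np.asarray(inv)
    return weights / weights.sum()
```

<p>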
As the current position should be somewhere between the last aligned onset o<sub>j</sub> and the onset o<sub>j+1</sub> to be aligned next, we compute the current tempo t according to Formula 2, where t<sub>i,j</sub> and t<sub>i,j+1</sub> represent the (scaled) tempo information of curve i at onsets o<sub>j</sub> and o<sub>j+1</sub> respectively, and s<sub>i</sub> is the similarity value of tempo curve i.</p><p>t = &sum;<sup>n</sup><sub>i=1</sub> [(t<sub>i,j</sub> + t<sub>i,j+1</sub>) &middot; s<sub>i</sub>] / 2 &nbsp;&nbsp;&nbsp;(2)</p><p>Intuitively, the tempo is estimated as the mean of the tempo estimates at these two onsets, which in turn are computed as a weighted sum of the (scaled) tempi in the stored performance curves, with each curve contributing according to its local similarity to the current performance. Please note that this approach differs somewhat from typical ways of training a score follower to follow a particular performance. We are not feeding the system with rehearsal data by a particular musician, but with many different ways of how to perform the piece in question, as the analysed performances may be by different performers and differ heavily in their interpretation style. The system then decides on-line at every iteration how to weigh the curves, effectively selecting a mixture of the curves which represents the current performance best.</p><p>7. ANY-TIME MUSIC TRACKING</p><p>Our system (see Figure 1 for an overview) also includes a unique feature, namely the ability to cope with arbitrary structural deviations (e.g. large omissions, (re-)starts in the middle of a piece) during a live performance. While previous systems, if they dealt at all with serious deviations from the score, had to rely on explicitly provided information about the structure of a piece of music and points of possible deviation (e.g., notated repeats, which a performer might or might not obey), our system does without any such information and continuously checks all (!) 
time points in the score as alternatives to the currently assumed score position, thus theoretically being able to react to arbitrary deviations (jumps etc.) by the performer.</p><p>At the core is a process (the Rough Position Estimator, see Figure 1) that continually updates and evaluates high-level hypotheses about possible current positions in the score, which are then verified or rejected by multiple instances of the basic alignment algorithm (Trackers 1 to n, each including its own tempo model) described above. To guide our system in the face of possible repetitions and to avoid random jumps between identical parts in the score, we also introduced automatically computed information about the structure of the piece to be tracked. We chose to call our new approach Any-time Music Tracking, as the system is continuously ready to receive input and find out what the performers are doing, and where they are in the piece. For more details we refer the reader to [2].</p><p>8. EVALUATION</p><p>All above-mentioned improvements to the original algorithm were extensively evaluated. We have shown that the resulting system is both more robust and more accurate than the original system. Furthermore, we have already demonstrated the system's accuracy and reliability live on stage during a real piano performance. For detailed evaluation results we refer the reader to [3], [1] and [2].</p><p>9. FUTURE WORK</p><p>An important direction for future work is the introduction of explicit event detection into our system, based on both an estimation of the timing and an analysis of the incoming audio frames. This would increase the alignment precision, especially for sparse and/or monophonic music.</p><p>A possible future scenario would be to extend our any-...</p></li></ul>