
Page 1

Teasing the Music out of Digital Data
Matthias Mauch

November 2012

Page 3

Centre for Digital Music

✤ C4DM is part of the School of Electronic Engineering and Computer Science at Queen Mary, University of London

✤ ~10 years old (founded in 2003)

✤ led by Mark Plumbley; more than 50 full-time members:

✤ academics (professors, lecturers)
✤ research staff
✤ research students
✤ guests

Page 5

Areas in the C4DM

✤ Audio Engineering: auto-mixing, feedback elimination, ... (Josh Reiss)

✤ Interactional Sound & Music: interfaces for interaction with music, ... (Nick Bryan-Kinns)

✤ Machine Listening: sparse models of audio, object coding, non-musical/speech sound classification, ... (Mark Plumbley)

✤ Music Informatics: automatic transcription, music classification and retrieval (by genre, mood, similarity), segmentation, ... (Simon Dixon) (my area)

✤ Music Cognition: models of music in human brains, ... (Geraint Wiggins, Marcus Pearce)

✤ New research areas include performance studies and augmented musical instruments (Elaine Chew, Andrew McPherson)

Page 8

Music Informatics

✤ “my” area, led by Simon Dixon

✤ harmony analysis: automatic chord transcription, chord progressions, key detection

✤ transcription: multiple fundamental frequency estimation, semi-automatic techniques

✤ music classification: genre classification, mood classification

✤ other stuff: analysis of violoncello timbre in recordings, automatic classification of harpsichord temperament, beat tracking, drum patterns...

Page 9

My work

✤ PhD — audio chord transcription

✤ post-doc — lyrics-to-audio alignment, Songle chord/key/beat tracking

✤ Research Fellow at Last.fm — Driver’s Seat

✤ harpsichord tuning estimation

✤ DarwinTunes — analysis of musical evolution

Page 10

Audio Chord Transcription I

‣ a dynamic Bayesian network (DBN) models musical context [1][2]

‣ bass, key, metric position

‣ 2012 state-of-the-art adaptation: Ni et al. [3]
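As a rough illustration of the decoding side only, here is a minimal sketch: a plain HMM over 24 major/minor chords with chroma-template emissions and Viterbi decoding. This is a drastic simplification of the DBN from [1][2] (no bass, key or metric position layers); the templates, the cosine emission score and the self-transition weight are illustrative assumptions, not values from the model.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def chord_templates():
    """24 binary chroma templates: 12 major and 12 minor triads (illustrative)."""
    labels, templates = [], []
    for root in range(12):
        for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates.append(t / np.linalg.norm(t))
            labels.append(f"{NOTE_NAMES[root]}:min" if quality == "min" else NOTE_NAMES[root])
    return labels, np.array(templates)

def viterbi_chords(chroma, self_transition=0.9):
    """Decode one chord label per chroma frame (frames x 12) with a sticky HMM."""
    labels, templates = chord_templates()
    n_states = len(labels)
    # emission log-scores: cosine similarity between frame and template
    norm = np.linalg.norm(chroma, axis=1, keepdims=True) + 1e-9
    emissions = np.log(1e-6 + (chroma / norm) @ templates.T)
    # transition matrix: strong preference for staying on the same chord
    trans = np.full((n_states, n_states), (1 - self_transition) / (n_states - 1))
    np.fill_diagonal(trans, self_transition)
    log_trans = np.log(trans)

    n_frames = chroma.shape[0]
    delta = np.zeros((n_frames, n_states))
    psi = np.zeros((n_frames, n_states), dtype=int)
    delta[0] = emissions[0] - np.log(n_states)
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emissions[t]
    path = [int(delta[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [labels[s] for s in reversed(path)]

# usage (random data, purely illustrative):
# chords = viterbi_chords(np.random.rand(100, 12))
```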

[Figure 4.2: Our network model topology, represented as a 2-TBN with two slices and six layers: metric position (M), key (K), chord (C), bass (B), bass chroma (X^bs) and treble chroma (X^tr). The clear nodes represent random variables, while the observed ones are shaded grey. The directed edges represent the dependency structure. Intra-slice dependency edges are drawn solid, inter-slice dependency edges are dashed.]

Page 11

Audio Chord Transcription I


significant improvement in chord transcription

[Figure 4.11: Friedman significance analysis based on the song-wise RCO rank of the four configurations (plain, M, MB, MBK) with (a) full chords and (b) majmin chords. In each case the MIREX results of Weller et al. are given for comparison. Where confidence intervals overlap, the difference between methods cannot be considered significant; in both settings two groups have mean column ranks significantly different from Weller et al., and our best model actually ranks higher on average than Weller's, if not significantly so. See discussion in Section 4.4.1.]

[Figure 4.12: Relative overlap for individual chords in the MBK setting with the full chord dictionary, shown as percentage of correct overlap per chord family (maj, min, maj/3, maj/5, maj6, 7, maj7, min7, dim, aug, N); maj chords reach 68.73%.]

full context model

Page 13

Audio Chord Transcription II

‣ averaging features across repeated song segments [4]

‣ non-systematic noise is attenuated ➟ better results
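A minimal sketch of the idea behind [4], assuming segment boundaries and labels are already available: beat-synchronous chroma frames are averaged across all instances of the same segment label, so that non-systematic noise cancels out. The frame-index bookkeeping and the truncation to the shortest instance are my simplifications, not the method's actual alignment.

```python
import numpy as np
from collections import defaultdict

def average_repeated_segments(chroma, segments):
    """chroma: (n_frames, 12) beat-synchronous chroma.
    segments: list of (label, start_frame, end_frame) covering the song,
    e.g. [("A", 0, 32), ("B", 32, 64), ("A", 64, 96), ...].
    Returns a copy of chroma where the frames of repeated segments are
    replaced by the average over all instances of that segment label."""
    enhanced = chroma.copy()
    by_label = defaultdict(list)
    for label, start, end in segments:
        by_label[label].append((start, end))
    for label, spans in by_label.items():
        if len(spans) < 2:
            continue  # nothing to average for unique segments
        # simplification: truncate every instance to the shortest one
        length = min(end - start for start, end in spans)
        stacked = np.stack([chroma[start:start + length] for start, _ in spans])
        mean_frames = stacked.mean(axis=0)
        for start, _ in spans:
            enhanced[start:start + length] = mean_frames
    return enhanced

# usage (random data, purely illustrative):
# chroma = np.random.rand(128, 12)
# segs = [("n1", 0, 16), ("A", 16, 48), ("B", 48, 80), ("A", 80, 112), ("n2", 112, 128)]
# enhanced = average_repeated_segments(chroma, segs)
```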

[Figure 6.1: "It Won't Be Long": effect of using the repetition information (see discussion in Section 6.4.1), comparing the fully-automatic autobeat-autoseg method to the baseline method that does not use repetition information. (a) Whole song: the two top bars display the manual segmentation (chorus, verse, bridge, outro) and the automatic segmentation (parts n1, A, B, n2) used to obtain the autobeat-autoseg results; the two bottom bars are black at times where the chord has been recognised correctly (using MIREX-style evaluation). (b) Excerpt between 100 s and 120 s: manually-annotated and automatically-extracted chord sequences.]

Page 14

Audio Chord Transcription II

[Figure 6.7: Song-wise improvement in RCO for the methods using segmentation cues over the respective baseline methods. The lower part of each panel shows the performance difference per song, and the upper part summarises the same information in a histogram. (a) autobeat-autoseg against autobeat-noseg: performance improves on 71% of songs. (b) manbeat-manseg against manbeat-noseg: RCO scores improve for 63% of songs.]

[…] resulting from segmentation information. Here we can see the performance improvement on a song-wise basis. For example, Figure 6.7a allows us to see immediately why the difference between using and not using segmentation is significant: the autobeat-autoseg method performs better than the autobeat-noseg method on the majority of songs (71%). Similarly, Figure 6.7b shows the song-wise improvement of manbeat-manseg over manbeat-noseg, where the method using repetition cues increases performance in 63% of the songs.

Our main hypothesis, that repetition cues can be used to improve chord transcription, has been confirmed with statistical significance. Next, we consider the difference between manual and automatic beat extraction to find out whether the more practicable automatic approach yields significantly lower results.

accuracy of most songs improves

Page 15

Chordino & NNLS Chroma

✤ “NNLS Chroma” [5] Vamp plugin (e.g. for Sonic Visualiser):

✤ download: http://isophonics.net/nnls-chroma

✤ source: http://code.soundsoftware.ac.uk/projects/nnls-chroma

✤ contains Chordino – a basic chord estimator
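A hedged sketch of calling the plugin from Python using the "vamp" host module (pip install vamp) together with librosa. The plugin key "nnls-chroma:chordino", the output identifier "simplechord" and the structure of the returned dictionary are assumptions based on the plugin and host documentation; check them against your installation.

```python
# Minimal sketch: extract chord labels with the Chordino Vamp plugin from Python.
# Assumption: the NNLS Chroma plugin library is installed where the Vamp host
# can find it, and the Chordino output we want is called "simplechord".
import librosa
import vamp

audio, sample_rate = librosa.load("song.wav", sr=None, mono=True)

# vamp.collect runs the plugin over the whole signal and gathers its output.
result = vamp.collect(audio, sample_rate, "nnls-chroma:chordino", output="simplechord")

# For a sparse, label-valued output the host returns a list of events; each
# event should carry a timestamp and a chord label such as "C", "G:min" or "N"
# (key names follow the host's documented conventions; verify for your version).
for event in result["list"]:
    print(event["timestamp"], event["label"])
```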

Page 16

SongPrompter

...

Verse:
Bm G D A
Once you were my love, now just a friend,
Bm G D A
What a cruel thing to pretend.
Bm G D A
A mistake I made, there was a price to pay.
Bm G D A
In tears you walked away,

Verse:
When I see you hand in hand with some other
I slowly go insane.
Memories of the way we used to be ...
Oh God, please stop the pain.

Chorus:
D G Em A
Oh, once in a life time
D/F# G A
Nothing can last forever
D/F# G
I know it's not too late
A7 F#/A#
Would you let this be our fate?
Bm G Asus4
I know you'd be right but please stay
A
Don't walk away

Instrumental:
Bm G D A Bm G A

Chorus:
Oh, once in a life time
Nothing can last forever
I know it's not too late
Would you let this be our fate?
I know you'd be right but please stay.
Don't walk away.

...

(figure annotations: first verse: all lyrics and chords given; subsequent verses: only lyrics, chords are omitted; blank line separates song segments; heading defines segment type)

Figure 2: Excerpt adapted from "Once In A Lifetime" (RWC-MDB-P-2001 No. 82 [9]) in the chords-and-lyrics format similar to that found in many transcriptions on the Internet.

The motivation for the integration of chords is the vast availability of paired textual chord and lyrics transcriptions on the Internet through websites such as "Ultimate Guitar"¹ and "Chordie"². Though there is no formal definition of the format used in the transcriptions appearing on the Internet, they will generally look similar to the one shown in Figure 2. It contains the lyrics of the song with chord labels written in the line above the corresponding lyrics line. Chords are usually written exactly over the words they start on, and labels written over whitespace denote chords that start before the next word. In our example (Figure 2) the lyrics of the verses are all accompanied by the same chord sequence, but the chord labels are only given for the first instance. This shortcut can be applied to any song segment type that has more than one instance, and transcribers usually use the shorter format to save space and effort. Song segment names can be indicated above the first line of the corresponding lyrics block. Song segments are separated by blank lines, and instrumental parts are given as a single line containing only the chord progression.
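To make the format concrete, here is a small parsing sketch, not the paper's parser: it splits such a text into segments on blank lines, treats a trailing colon as a segment heading, and tells chord lines from lyric lines with a regular-expression heuristic (the chord regex is an illustrative assumption).

```python
import re

# Heuristic: tokens like "Bm", "D/F#", "A7", "Asus4" or "N.C." count as chords.
CHORD_TOKEN = re.compile(r"^[A-G][#b]?(m|min|maj|dim|aug|sus)?\d*(/[A-G][#b]?)?$|^N\.?C\.?$")

def is_chord_line(line):
    tokens = line.split()
    return bool(tokens) and all(CHORD_TOKEN.match(t) for t in tokens)

def parse_chord_lyrics(text):
    """Split an internet-style transcription into segments.
    Returns a list of dicts: {"name": str or None, "lines": [(chords, lyric), ...]}.
    Blank lines separate segments; a trailing ':' marks a segment heading."""
    segments, current = [], {"name": None, "lines": []}
    pending_chords = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():                      # blank line: close the segment
            if current["lines"] or current["name"]:
                segments.append(current)
            current, pending_chords = {"name": None, "lines": []}, None
        elif line.strip().endswith(":"):          # heading defines the segment type
            current["name"] = line.strip().rstrip(":")
        elif is_chord_line(line):                 # remember chords for the next lyric line
            pending_chords = line.split()
        else:                                     # lyric line, possibly with pending chords
            current["lines"].append((pending_chords, line))
            pending_chords = None
    if current["lines"] or current["name"]:
        segments.append(current)
    return segments
```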

The rest of the paper is structured as follows. Section 2 describes the hidden Markov model we use for lyrics alignment and also provides the results in the case of complete chord information. Section 3 deals with the method that compensates for incomplete chord annotations by locating phrase-level boundaries, and discusses its results. Future work is discussed in Section 4, and Section 5 concludes the paper.

¹ http://www.ultimate-guitar.com
² http://www.chordie.com

2. USING CHORDS TO AID LYRICS-TO-AUDIO ALIGNMENT

This section presents our technique to align audio recordings with textual chord and lyrics transcriptions such as the ones described in Section 1. To show that the chord information does indeed aid lyrics alignment, we start with the case in which complete chord information is given. More precisely, we make the following assumptions:

complete lyrics: Repeated lyrics are explicitly given.

segment names: The names of song segments (e.g. verse, chorus, ...) are given above every lyrics block.

complete chords: Chords for every song segment instance are given.

This last assumption is a departure from the format shown in Figure 2, and in Section 3 we will show that it can be relaxed.

An existing HMM-based lyrics alignment system is used as a baseline and then adapted for the additional input of 12-dimensional chroma features using an existing chord model [10]. We will give a short outline of the baseline method (Section 2.1), and then explain the extension to chroma and chords in Section 2.2. The results of the technique used in this section are given in Section 2.3.

2.1 Baseline Method

The baseline method [1] is based on a hidden Markov model (HMM) in which each phoneme is represented by three hidden states, and the observed nodes correspond to the low-level feature, which we will call the phoneme feature. To be precise, given a phoneme state s, the 25 elements of the phoneme feature vector x_m with the distribution P_m(x_m | s) consist of 12 MFCCs, 12 ΔMFCCs and 1 element containing the power difference (the subscript m stands for MFCC). These 12+12+1 elements are modelled as a 25-dimensional Gaussian mixture model with 16 mixture components. The transition probabilities between the three states of a phoneme and the Gaussian mixtures are trained on Japanese singing. For use with English lyrics, phonemes are retrieved using the Carnegie Mellon University Pronouncing Dictionary³ and then mapped to their Japanese counterpart. A left-to-right layout is used for the HMM, i.e. all words appear in exactly the order provided. The possibility of pauses between words is modelled by introducing optional "short pause" states, whose phoneme feature emissions are trained from the non-voiced parts of the songs.
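The chord extension described in this paper attaches a chroma-based chord model to this phoneme HMM. Purely as an illustration of the alignment mechanics, the sketch below runs a left-to-right Viterbi forced alignment over a fixed state sequence in which each frame's emission score is a weighted sum of a phoneme log-likelihood and a chord/chroma log-likelihood; the weight, the stay probability and the assumption that both log-likelihood matrices are precomputed are mine, not the paper's parameterization.

```python
import numpy as np

def align_states(phoneme_loglik, chord_loglik, stay_logprob=np.log(0.7), weight=0.5):
    """Left-to-right forced alignment.
    phoneme_loglik, chord_loglik: (n_frames, n_states) log-likelihoods of each
    frame under each state of the fixed word/chord state sequence.
    Returns the state index assigned to every frame."""
    assert phoneme_loglik.shape == chord_loglik.shape
    n_frames, n_states = phoneme_loglik.shape
    emit = weight * phoneme_loglik + (1.0 - weight) * chord_loglik
    move_logprob = np.log1p(-np.exp(stay_logprob))  # log-probability of advancing one state

    delta = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0, 0] = emit[0, 0]                       # alignment must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = delta[t - 1, s] + stay_logprob
            move = delta[t - 1, s - 1] + move_logprob if s > 0 else -np.inf
            if stay >= move:
                delta[t, s], back[t, s] = stay + emit[t, s], s
            else:
                delta[t, s], back[t, s] = move + emit[t, s], s - 1
    path = [n_states - 1]                          # alignment must end in the last state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))
```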

Since the main lyrics are usually present only in the predominant voice, the audio is pre-processed to eliminate all other sounds. To achieve this the main melody voice is segregated in three steps: first, the predominant fundamental frequency is detected using PreFEst [11]. […]

³ http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Page 17

SongPrompter

LYRICS-TO-AUDIO ALIGNMENT AND PHRASE-LEVEL SEGMENTATION USING INCOMPLETE INTERNET-STYLE CHORD ANNOTATIONS

Matthias Mauch, Hiromasa Fujihara, Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST), Japan

{m.mauch, h.fujihara, m.goto}@aist.go.jp

ABSTRACT

We propose two novel lyrics-to-audio alignment methods which make use of additional chord information. In the first method we extend an existing hidden Markov model (HMM) for lyrics alignment [1] by adding a chord model based on the chroma features often used in automatic audio chord detection. However, the textual transcriptions found on the Internet usually provide chords only for the first among all verses (or choruses, etc.). The second method we propose is therefore designed to work on these incomplete transcriptions by finding a phrase-level segmentation of the song using the partial chord information available. This segmentation is then used to constrain the lyrics alignment. Both methods are tested against hand-labelled ground truth annotations of word beginnings. We use our first method to show that chords and lyrics complement each other, boosting accuracy from 59.1% (only chroma feature) and 46.0% (only phoneme feature) to 88.0% (0.51 seconds mean absolute displacement). Alignment performance decreases with incomplete chord annotations, but we show that our second method compensates for this information loss and achieves an accuracy of 72.7%.

1. INTRODUCTION

Few things can rival the importance of lyrics to the character and success of popular songs. Words and music come together to tell a story or to evoke a particular mood. Even musically untrained listeners can relate to situations and feelings described in the lyrics, and as a result very few hit songs are entirely instrumental [2]. Provided with the recording of a song and the corresponding lyrics transcript, a human listener can easily find out which position in the recording corresponds to a certain word. We call detecting these relationships by means of a computer program lyrics-to-audio alignment, a music computing task that has so far been solved only partially. Solutions to the problem have a wide range of commercial applications such as the computer-aided generation of annotations for karaoke, song-browsing by lyrics, and the generation of audio thumbnails [3], also known as audio summarization.



[Figure 1: Integrating chord information in the lyrics-to-audio alignment process (schematic illustration; the audio is analysed into MFCCs and chroma). The chords printed black represent chord changes, grey chords are continued from a prior chord change. Word-chord combinations are aligned with two audio features: an MFCC-based phoneme feature and chroma.]

The first system addressing the lyrics-to-audio alignment problem was a multimodal approach proposed by Wang et al. [4], which has since been developed further [5]. The method makes relatively strong assumptions on the form and meter (time signature) of the songs, which enables preliminary chorus-detection and beat-tracking steps to aid the final low-level lyrics alignment step. In [1] a left-to-right hidden Markov model (HMM) architecture is used to align lyrics to audio, based on observed mel frequency cepstral coefficients (MFCCs). Here too, several preprocessing steps such as singing melody segregation and vocal activity detection are employed, but these make fewer assumptions, and the more complex HMM models the evolution of single phonemes in time. Similar HMM-based approaches have been used in [6] and [7]. A special case of lyrics alignment based on the speech melody of Cantonese has been presented in [8]. These existing lyrics-to-audio alignment systems have used only two information sources: the audio file and the lyrics.

In this paper we propose two novel techniques that integrate additional textual chord information (Figure 1) into the alignment framework:

1. an extension of the lyrics-to-audio alignment paradigm to incorporate chords and chroma features in the ideal case of complete chord information, and

2. a three-step method that can recover missing chord information by locating phrase-level boundaries based on the partially given chords.

Page 19

SongPrompter

‣ automatic alignment works best with speech and chord features [6]

‣ visual display from automatic alignment
  ‣ lyrics, segmentation and chords

‣ audio playback
  ‣ original audio
  ‣ auto-extracted bass and drum track

karaoke for guitarists!

Page 20

SongPrompter demo

Page 21

Songle Web Service

Page 23

Songle.jp

‣ web service [7]

‣ adding interaction

‣ engaging user experience

‣ insights through automatic annotations

‣ anyone can contribute – it’s social!

‣ use for MIR research

‣ crowd-sourcing more training data

‣ exposure to broader audience

LIVE AND ONLINE

Page 24

Page 25

Driver’s Seat

‣ Last.fm already has genre tags and similarity

‣ we want a complement: intuitively understandable audio features

‣ ‘harmonic creativity’ (structural change [8]); a rough sketch of the idea follows this list

‣ noisiness, energy, rhythmic regularity, ...

‣ Spotify app based on Last.fm audio API
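To give a flavour of what a structural-change style feature can look like in code (a toy sketch only, not the published multi-scale method of [8]): compare the chroma distributions of adjacent windows with the Jensen-Shannon divergence, at several window lengths, and average.

```python
import numpy as np

def _js_divergence(p, q, eps=1e-9):
    """Jensen-Shannon divergence between two (unnormalised) 12-bin histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def structural_change(chroma, window_sizes=(8, 16, 32)):
    """Toy 'harmonic change' score: average divergence between the chroma
    distributions of adjacent windows, over several window sizes.
    chroma: (n_frames, 12) array, e.g. beat-synchronous chroma."""
    scores = []
    for w in window_sizes:
        divs = []
        for start in range(0, chroma.shape[0] - 2 * w, w):
            left = chroma[start:start + w].sum(axis=0)
            right = chroma[start + w:start + 2 * w].sum(axis=0)
            divs.append(_js_divergence(left, right))
        if divs:
            scores.append(np.mean(divs))
    return float(np.mean(scores)) if scores else 0.0

# usage (random data, purely illustrative):
# score = structural_change(np.random.rand(256, 12))
```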

Page 26

Driver’s Seat

[Architecture diagram: Audio, Extraction, Audio Feature API, Spotify ID API, Spotify Apps API, Driver's Seat]

Page 27

Driver’s Seat

Page 28

DarwinTunes

✤ project by Bob MacCallum and Armand Leroi at Imperial College;paper: [9]

✤ use genetic algorithms to evolve short musical loops

1. selection process is web-based, “crowd-sourced” (>6000 unique voters)

2. evolutionary analysis based on fitness (votes) and phenotype (sound surface)

✤ sound surface: a scientific application of music informatics

[…] out a new experiment. We randomly sampled 2,000 of the 50,480 loops produced at any time during EP1's evolution and, via a Web interface, asked public consumers to rate them as before. Because consumers heard and rated loops sampled from the entire evolutionary trajectory in this experiment, their ratings can be used to estimate the mean absolute musical appeal, M, of the population at any time. This is analogous to bacterial experiments in which the fitness of an evolved strain is compared directly with that of its ancestor (10). Fig. 1B shows that M increased rapidly for the first 500–600 generations but then came to equilibrium. Thus, in our system, musical quality evolves, but it seems that it does not do so indefinitely.

What makes the loops of later generations so much more pleasing? The aesthetic value of a given piece of music depends on many different features, such as consonance, rhythm, and melody (23). In recent years, music information retrieval (MIR) technology has permitted the automatic detection of some of these features (24–26); reasoning that our raters listen to, and like, Western popular music, we measured the phenotypes of our loops using two MIR algorithms designed to detect features in this music. The first, Chordino, detects the presence of chords commonly used in the Western repertoire (27). The fit of a loop to Chordino's canonical chord models is given by a log-likelihood value CL and is an estimate of the clarity of the chordal structure. The second, Rhythm Patterns (28), extracts a rhythmic signature, from which we derive a complexity measure, R. To validate these algorithms, we tested them on a standardized test set of specifically generated loops (SI Appendix, A.3).

To examine the evolution of musical qualities in EP1, we measured CL and R for every loop. We found that, like musical appeal, these traits increased rapidly over the first 500–600 generations but then appear to fluctuate around a long-term mean (Fig. 2 A and B). Given these dynamics, and because CL and R are measured without error, we are able to model their evolution using a discrete version of the Ornstein–Uhlenbeck (O-U) process, according to which the change in the mean of a character from one generation to the next is anticorrelated to how far it is from a long-term mean:

Δz̄ = a(u − z̄) + ε,

where Δz̄ is the difference between the means of the offspring and parental generations, z̄_o − z̄_p; a is a constant such that a > 0; u is the long-term mean; and ε is a normally distributed random variable with mean 0. For both CL and R, the confidence limits on the long-term mean do not include the initial values (P = 1.0 × 10⁻⁶ and P = 2.0 × 10⁻⁷, respectively), confirming the visual impression that CL and R increased significantly over the course of the experiment (Fig. 2 A and B and SI Appendix, Table S6).

Because musical appeal and its components all increase, they are probably being selected. However, the trajectory of a DarwinTunes population, like that of any evolving population, depends not only on selection but on stochastic sampling, the analog of genetic drift. In experimental evolution, replicable responses are a signature of selection (10, 29). Hence, to determine whether the increases in chordality and rhythmicity are idiosyncratic to the preferences of the particular set of consumers contributing to EP1, or perhaps are attributable to chance correlations between these characters and other characters that were the true targets of selection, we repeated the experiment in a more controlled setting. To do this, we cloned additional populations from the same base population that EP1 started with and asked undergraduates to rate them. These populations, designated EP2 and EP3, were allowed to evolve independently for about 400 generations and received an average of 10,683 ratings. We found that CL and R also increased rapidly in these populations, again to a plateau (SI Appendix, B.1 and SI Appendix, Fig. S5). As controls, we generated 1,000 additional populations with the same origin as the experimental populations and subject to the same variational processes and demography for 400 generations but differing from them in that ratings were assigned randomly rather than by consumers. We found that mean CL and R of the selected populations were significantly higher than those of the unselected control populations by generation 100 (Fig. 2 C and D and SI Appendix, B.2). We also used the control populations to examine whether CL and R are intrinsically related to each other and found that they are weakly correlated, r = 0.26 (±0.016) [mean (±95% confidence interval)] (SI Appendix, B.3). Thus, although selection on […]
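A minimal simulation of the discrete O-U recursion quoted above makes the rise-then-plateau behaviour easy to reproduce; the parameter values below are arbitrary illustrations, not the fitted values from the paper.

```python
import numpy as np

def simulate_discrete_ou(z0, u, a, sigma, n_generations, seed=0):
    """Iterate z[t+1] = z[t] + a * (u - z[t]) + eps, with eps ~ N(0, sigma^2).
    With 0 < a < 1 the trait mean climbs towards the long-term mean u
    and then fluctuates around it, i.e. it plateaus."""
    rng = np.random.default_rng(seed)
    z = np.empty(n_generations)
    z[0] = z0
    for t in range(1, n_generations):
        z[t] = z[t - 1] + a * (u - z[t - 1]) + rng.normal(0.0, sigma)
    return z

# illustrative run (arbitrary numbers): starts at -4.0, equilibrates near u = -3.2
trajectory = simulate_discrete_ou(z0=-4.0, u=-3.2, a=0.02, sigma=0.02, n_generations=2700)
print(trajectory[:5], trajectory[-1])
```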

[Fig. 1: (A) Evolutionary processes in DarwinTunes (reproduction, recombination & mutation; phenotype production & rating; selection). Songs are represented as tree-like structures of code. Each generation starts with 100 songs; however, for clarity, we only follow one-fifth of them. Twenty songs are randomly presented to listeners for rating, and the remaining 80 survive until the next generation; thus, at any time, the population contains songs of varying age. Of the 20 rated songs, the 10 best reproduce and the 10 worst die. Reproductives are paired and produce four progeny to replace themselves and the dead in the next generation. The daughters' genomes are formed from their parents' genomes, subject to recombination and mutation. (B) Evolution of musical appeal: M (mean rating) plotted against generation (0 to 2700). During the evolution of our populations, listeners could only listen to, and rate, songs that belonged to one or, at best, consecutive generations. Here, they were asked to listen to, and rate, a random sample of all the songs that had previously evolved in the public population, EP1. Thus, these ratings can be used to estimate the mean absolute musical appeal, M, of the population at any time. To describe the evolution of M, we fitted an exponential function. Because the parameter that describes the rate of increase of M is significantly greater than zero, M increases over the course of the experiment (SI Appendix, B.1).]


Page 29

DarwinTunes

✤ Chordino log-likelihood and Rhythmic Complexity measures both indicate a drastic rise and subsequent stagnation

✤ plateau best explained by fragile features: despite the existence of better tunes, transmission imposes a limit

[…] one of these features may influence the evolution of the other, they are largely independent. We cannot, however, preclude the possibility that either feature is highly correlated with unmeasured traits that are more direct targets of selection.

Variation and Adaptation in DarwinTunes Populations. The increase in CL and R implies that selection is directional. Thus, why do our populations stop evolving? Remarkably, it is not merely that these traits cease to evolve: musical appeal itself does also. This pattern of fast-slow evolution or even stasis is often seen in biological populations, whether in the laboratory, wild, or fossil record. Stasis can result from several different population genetic forces; however, it has often been difficult to distinguish among them (10, 30–32). Because we know the complete histories of the DarwinTunes populations, we can study the forces driving their evolutionary dynamics in detail. We first considered the possibility that DarwinTunes populations have arrived at an adaptive peak, such that selection, which was previously directional, now stabilizes the population means. To investigate this, we estimated selective surfaces using multivariate cubic-spline regression (33) and plotted adaptive walks on them. Fig. 3A shows that EP1 has a single adaptive peak near high R and CL and that although it walks erratically up the slope toward the peak, it does not reach it. Very rhythmic loops (very high R) may be less fit than slightly less rhythmic ones; even so, it is clear that EP1 has stopped evolving at least 1 SD in each dimension away from its adaptive peak; thus, stasis is not attributable to an absence of selection. Interestingly, the topology of the EP1 adaptive landscape suggests that R and CL have a synergistic effect on fitness: high CL loops are especially fit when they have a high R as well; a model with a CL × R interaction explains significantly more of the variation than one without it. A similar interaction is found in EP2 but not EP3 (SI Appendix, B.4).

We next considered the possibility that the populations have simply run out of genetic variation and that they have become fixed for all beneficial variants. Fig. 3 B and C show the frequency distributions of CL and R over the evolution of EP1. The rapid progress of the population before generation 1,000 is associated with a decrease in frequency of loops with the lowest chordal clarity and rhythmic complexity, likely attributable to selection. However, as the population continues to evolve, new low CL and R loops are reintroduced by mutation or recombination, and throughout the evolution of the populations, many loops have higher CL and R values than the long-term O-U mean. Thus, the lack of progressive evolution after about generation 500 is not attributable to fixation of high CL and R variants and complete exhaustion of genetic variation. This is also true for EP2 and EP3 (SI Appendix, Fig. S8).

Applying the Price Equation. To probe the forces acting on these populations further, we made use of the Price equation (34–37). The Price equation, a general description of all evolutionary […]

[Fig. 2: Evolution of musical attributes. (A) Evolution of chordal clarity, CL, in EP1. (B) Evolution of rhythm, R, in the public population, EP1 (both plotted against generation, 0 to 2700). Both features were fitted with an O-U model that includes a stochastic component; in the fits shown, the stochastic parameter, s, was set to zero for the sake of clarity, although during model fitting s was included as a freely varying parameter. SI Appendix, Fig. S5 shows equivalent plots for the replicate populations. (C) Evolution of CL in three selected populations (EP1–EP3) and 1,000 unselected control populations over 400 generations. (D) Evolution of R in three selected populations (EP1–EP3) and 1,000 unselected control populations over 400 generations. Error bars represent 95% confidence intervals estimated by a linear mixed model. By generation 100, both CL and R are significantly elevated in the selected populations, compared with the control unselected populations.]


Page 30

Zukunftsmusik (future work)

✤ Drum transcription. Improve drum transcription by “language modelling” from a large corpus of symbolic drum patterns

✤ Singing research. Build a user interface, Tony, for the quick and simple annotation of pitches in monophonic audio (a rough pitch-tracking sketch follows this list).

✤ How do singers correct pitch errors?

✤ Do we have a background tuning process in our heads?

✤ Collaborate with ethnomusicologists, musicians, psychologists...
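For the singing-research item, here is a crude sketch of monophonic pitch estimation (Tony itself builds on far more robust methods; this plain autocorrelation version only shows the basic idea, and the peak threshold is an arbitrary choice).

```python
import numpy as np

def autocorrelation_pitch(frame, sample_rate, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one mono audio frame (in Hz)
    by picking the strongest autocorrelation peak within a plausible lag
    range. Returns 0.0 if no clear peak is found."""
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf /= acf[0] + 1e-12                       # normalise so lag 0 == 1
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(acf) - 1)
    if min_lag >= max_lag:
        return 0.0
    lag = min_lag + int(np.argmax(acf[min_lag:max_lag]))
    return sample_rate / lag if acf[lag] > 0.3 else 0.0

# usage: a 440 Hz sine should come out close to 440
sr = 44100
t = np.arange(2048) / sr
print(autocorrelation_pitch(np.sin(2 * np.pi * 440.0 * t), sr))
```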

Page 31

References

[1] Mauch, M., & Dixon, S. (2010). Simultaneous Estimation of Chords and Musical Context from Audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1280-1289.

[2] Mauch, M. (2010). Automatic Chord Transcription from Audio Using Computational Models of Musical Context. Queen Mary University of London.

[3] Ni, Y., McVicar, M., Santos-Rodriguez, R., & De Bie, T. (2012). An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, in print.

[4] Mauch, M., Noland, K. C., & Dixon, S. (2009). Using Musical Structure to Enhance Automatic Chord Transcription. Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009).

[5] Mauch, M., & Dixon, S. (2010). Approximate Note Transcription for the Improved Identification of Difficult Chords. Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010).

[6] Mauch, M., Fujihara, H. & Goto, M. (2012). Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.

[7] Goto, M., Yoshii, K., Fujihara, H., Mauch, M., & Nakano, T. (2011). Songle: A Web Service for Active Music Listening Improved by User Contributions. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011) (pp. 311-316).

[8] Mauch, M., & Levy, M. (2011). Structural Change on Multiple Time Scales as a Correlate of Musical Complexity. Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011) (pp. 489-494).

[9] MacCallum, B., Mauch, M., Burt, A., & Leroi, A. M. (2012). Evolution of Music by Public Choice. Proceedings of the National Academy of Sciences of the United States of America, 109(30), 12081-12086.