AUTOMATIC TRANSCRIPTION OF PIANO MUSIC - SARA CORFINI 1
LANGUAGE AND INTELLIGENCE
UNIVERSITY OF PISA
DEPARTMENT OF COMPUTER SCIENCE
Automatic Transcription of Piano Music
Sara Corfini
INTRODUCTION
Transcribing recordings of piano music into a MIDI representation
MIDI provides a compact representation of musical data
Score-following for computer-human interactive performance
“Signal-to-score” problem
A hidden Markov model approach to piano music transcription
A “state of nature” can be realized through a wide range of data configurations
Probabilistic data representation
Automatically learning this probabilistic relationship is more flexible than optimizing a particular model
Rules describing the musical structure can be more accurately represented as tendencies
THE MODEL
The acoustic signal is segmented into a sequence of frames (“snapshots” of sound)
For each frame a feature vector y1,…,yN is computed
Goal: to assign a label to each frame describing its content
A generative probabilistic framework (a hidden Markov model)
the hidden variables are the labels; the observed output is the sequence of feature vectors y1,…,yN
A Hidden Markov model is composed of two processes
X = X1,…,XN and Y = Y1,…,YN
X is the hidden (or label) process and describes the way a sequence of frame labels can evolve (a Markov chain)
We do not observe the X process directly, but rather the feature vector data
The likelihood of a given feature vector depends only on the corresponding label
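The two coupled processes can be sketched in a few lines of Python. This is a toy illustration of how a hidden label chain generates observable feature data; the labels, transition probabilities and Gaussian emission means are invented for the sketch, not taken from the paper's model:

```python
import random

# Toy sketch of the two processes: a hidden Markov chain X over frame
# labels, and observations Y where each Yn depends only on Xn.
# Labels, probabilities and emission means are invented for illustration.

labels = ["rest", "attack", "sustain"]
transition = {                      # p(x'|x); each row sums to 1
    "rest":    {"rest": 0.8, "attack": 0.2, "sustain": 0.0},
    "attack":  {"rest": 0.0, "attack": 0.3, "sustain": 0.7},
    "sustain": {"rest": 0.1, "attack": 0.1, "sustain": 0.8},
}
emission_mean = {"rest": 0.0, "attack": 5.0, "sustain": 2.0}  # e.g. energy

def sample_next(x, rng):
    r, acc = rng.random(), 0.0
    for x2, p in transition[x].items():
        acc += p
        if r < acc:
            return x2
    return x2                       # numerical safety net

def generate(n_frames, seed=0):
    rng = random.Random(seed)
    x, xs, ys = "rest", [], []
    for _ in range(n_frames):
        xs.append(x)
        ys.append(rng.gauss(emission_mean[x], 1.0))  # Yn depends only on Xn
        x = sample_next(x, rng)
    return xs, ys

xs, ys = generate(10)
```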
THE LABEL PROCESS
GOAL: to assign a label to each frame, where each label ∈ L
Components of the label
the pitch configuration (chord)
“attack”, “sustain”, “rest” portions of a chord
We define a random process (a Markov chain) X1,…,XN that takes values in the label set L
The probability of the process occupying a certain state (label) in a given frame depends only on the preceding state (label)
P(Xn+1 = x’ | X1^n = x1^n) = P(Xn+1 = x’ | Xn = x) = p(x’|x)
where p(x’|x) is the transition probability matrix and X1^n = (X1,…,Xn)
THE LABEL PROCESS
Markov model for a single chord
Markov model for recognition problem
the final state of each chord model is connected to the initial state of each chord model
a silence model is constructed for the recorded silence before and after the performance
THE OBSERVABLE PROCESS
Rather than observe the label process x1,…,xN, we observe feature vector data y1,…,yN (probabilistically related to the labels)
Assumption of the HMM: each visited state Xn produces a feature vector Yn from a distribution that is characteristic of that state
Hence, given Xn, Yn is conditionally independent of all other frame labels and all other feature vectors
THE OBSERVABLE PROCESS
We compute a vector of features for each frame: y = (y1,…,yK)
The components of this vector are conditionally independent given the state
The states are tied: different states share the same feature distributions, p(yk|x) = p(yk|Tk(x))
where the tying function Tk(x) is constructed by hand
Hence we have p(y|x) = ∏k p(yk|Tk(x))
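The tied factorization p(y|x) = ∏k p(yk|Tk(x)) can be sketched as follows; the tying map and Gaussian parameters are invented for the example (k = 0 mimics the energy feature, separating silence/rest from sounding states):

```python
import math

# Sketch of state tying: p(y|x) = prod_k p(yk|Tk(x)), so states with the
# same Tk value share the k-th feature distribution. The tying map and
# Gaussian parameters below are invented for illustration.

def gauss_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

tying = [  # Tk(x), built by hand: one map per feature k
    {"silence": 0, "rest": 0, "attack": 1, "sustain": 1},
]
params = {  # shared distribution parameters, indexed by (k, Tk(x))
    (0, 0): (0.0, 1.0),   # low energy for silence/rest
    (0, 1): (4.0, 2.0),   # high energy for sounding states
}

def likelihood(y, x):
    p = 1.0
    for k, yk in enumerate(y):
        mean, var = params[(k, tying[k][x])]
        p *= gauss_pdf(yk, mean, var)
    return p

# "rest" and "silence" are tied, so they assign identical likelihoods
assert likelihood([0.5], "rest") == likelihood([0.5], "silence")
```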
THE OBSERVABLE PROCESS
The tying functions Tk(x) can be clarified by describing the computed features
y1 measures the total energy in the signal (to distinguish between the times when the pianist plays and when there is silence)
T1(x) = 0 for the silence and rest states
T1(x) = 1 for the remaining states
Two probabilistic distributions:
p(y1|T1(x)=0) and p(y1|T1(x)=1)
Partition of the label set generated by T1(x) : {x ∈ L : T1(x)=0}, {x ∈ L : T1(x)=1}
THE OBSERVABLE PROCESS
y2 measures the local burstiness of the signal (to distinguish between note “attacks” and steady state behaviour)
y2 collects several measures of burstiness (it is itself a vector)
For this feature, the states can be partitioned into three groups
T2(x) = 0 states at the beginning of each note (high burstiness)
T2(x) = 1 states corresponding to steady state behaviour (relatively low burstiness)
T2(x) = 2 silence states
THE OBSERVABLE PROCESS
y3,…,yK concern the problem of distinguishing between the many possible pitch configurations
Each feature of y3,…,yK is computed from a small frequency interval (window) of the Fourier-transformed frame data
For each window we compute
the empirical mean location of the harmonic (when there is a single harmonic in the window)
the empirical variance to distinguish probabilistically when there is a single harmonic (low variance) and when there is not (high variance)
The states can be partitioned as
Tk(x) = 0: states in which no note has energy in the window
Tk(x) = 1: states having several harmonics in the window
Tk(x) = t: states having a single harmonic at approximately the same frequency in the window
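A sketch of how such a window feature might be computed: the empirical mean and variance of the spectral energy inside one frequency window. The DFT is done naively so the sketch stays self-contained; the sample rate, frame length and window bounds are hypothetical:

```python
import math

# Sketch of the pitch features y3..yK: for each small frequency window of
# the Fourier-transformed frame we compute the empirical mean location of
# the spectral energy and its empirical variance (low variance suggests a
# single harmonic in the window). All numeric choices are hypothetical.

def dft_magnitude(frame, k):
    # magnitude of the k-th DFT bin (naive DFT, fine for a sketch)
    n = len(frame)
    re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
    im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
    return math.hypot(re, im)

def window_features(frame, sample_rate, lo_hz, hi_hz):
    n = len(frame)
    bins = [k for k in range(n // 2 + 1) if lo_hz <= k * sample_rate / n < hi_hz]
    energy = [dft_magnitude(frame, k) ** 2 for k in bins]
    total = sum(energy)
    freqs = [k * sample_rate / n for k in bins]
    mean = sum(e * f for e, f in zip(energy, freqs)) / total
    var = sum(e * (f - mean) ** 2 for e, f in zip(energy, freqs)) / total
    return mean, var

# a pure 440 Hz tone: the 400-500 Hz window should see one sharp peak
sr, n = 8000, 2048
frame = [math.sin(2 * math.pi * 440 * t / sr) for t in range(n)]
mean, var = window_features(frame, sr, 400.0, 500.0)
```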
TRAINING THE MODEL
Thanks to the HMM formulation, the probability distributions can be trained in an unsupervised fashion
An iterative procedure (the Baum-Welch algorithm) allows the model to be trained automatically from signal-score pairs
When the score is known, we can build a model for the hidden process
The algorithm
Starts from a neutral starting place (we begin with uniformly distributed output distributions)
Iterates the process of finding a probabilistic correspondence between model states and data frames
Next, the probability distributions are retrained using this correspondence
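The iterate-and-retrain loop can be caricatured with a hard-assignment simplification. This is not the actual Baum-Welch forward-backward computation: it ignores the Markov chain and assigns frames to toy state means by squared distance, purely to show the alternation between finding a correspondence and re-estimating parameters:

```python
# Hard-assignment caricature of iterative training: alternate between
# (1) finding a correspondence between frames and model states under the
# current parameters and (2) re-estimating the parameters from it.
# Real Baum-Welch uses soft (forward-backward) correspondences instead.

def train(frames, states, means, n_iter=5):
    for _ in range(n_iter):
        # E-step: assign each frame to the state whose mean is closest
        assign = [min(states, key=lambda s: (y - means[s]) ** 2) for y in frames]
        # M-step: re-estimate each state's mean from its assigned frames
        for s in states:
            ys = [y for y, a in zip(frames, assign) if a == s]
            if ys:
                means[s] = sum(ys) / len(ys)
    return means

# toy data: two clusters of frame energies, deliberately poor start means
means = train([0.0, 0.2, 5.0, 5.2], ["low", "high"], {"low": 1.0, "high": 4.0})
```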
TRAINING THE MODEL
Output distributions on feature vectors are represented through decision trees
For each distribution p(yk|Tk(x)) we form a binary tree
Each non-terminal node corresponds to a question yk,v < c (where yk,v is the vth component of feature k)
An observation yk can be associated with a non terminal node by dropping the observation down the tree (evaluating the root question)
The process continues until it arrives at a terminal node, denoted by Qk(yk)
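Dropping an observation down such a tree can be sketched in a few lines; the tree below, its questions and its leaf ids are invented for the example:

```python
# Sketch of dropping an observation down a binary decision tree: each
# non-terminal node asks a question "y[v] < c ?" and the walk ends at a
# terminal node Qk(yk). The tree below is an invented example.

class Node:
    def __init__(self, v=None, c=None, left=None, right=None, leaf=None):
        self.v, self.c = v, c                # question: y[v] < c ?
        self.left, self.right = left, right
        self.leaf = leaf                     # terminal node id, else None

def drop(node, y):
    while node.leaf is None:
        node = node.left if y[node.v] < node.c else node.right
    return node.leaf

tree = Node(v=0, c=1.0,
            left=Node(leaf="Q0"),
            right=Node(v=1, c=0.5,
                       left=Node(leaf="Q1"), right=Node(leaf="Q2")))

assert drop(tree, [0.2, 0.9]) == "Q0"   # root question sends it left
```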
As the training procedure evolves, the trees are re-estimated at each iteration to produce more informative probability distributions
RECOGNITION
The traditional HMM approach to recognition seeks the most likely labeling of frames, given the data, through dynamic programming
This corresponds to finding the best path through the state graph, where the reward for going from state xn-1 to xn in the nth iteration is given by
p(xn|xn-1) p(yn|xn)
The Viterbi algorithm constructs the optimal paths of length n from the optimal paths of length n-1
The computational complexity grows with the square of the state-space size, which is completely intractable in this case
The state space is on the order of 10^8 states (even under restrictive assumptions on the possible collections of pitches and the number of notes in a chord)
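A minimal Viterbi sketch over a toy state space shows the recursion and why the per-frame cost, O(|S|^2) transitions, explodes for |S| on the order of 10^8. All probabilities below are invented for the example:

```python
import math

# Minimal Viterbi: extends the optimal paths of length n-1 to length n.
# Each frame costs O(|S|^2) transitions, which is why a state space on
# the order of 10^8 labels is intractable without pruning.

def viterbi(obs_loglik, log_trans, log_init):
    states = range(len(log_init))
    delta = [log_init[s] + obs_loglik[0][s] for s in states]  # best log-prob
    back = []                                                 # backpointers
    for n in range(1, len(obs_loglik)):
        back.append([max(states, key=lambda sp: delta[sp] + log_trans[sp][s])
                     for s in states])
        delta = [delta[back[-1][s]] + log_trans[back[-1][s]][s]
                 + obs_loglik[n][s] for s in states]
    path = [max(states, key=lambda s: delta[s])]
    for b in reversed(back):                                  # trace back
        path.append(b[path[-1]])
    return list(reversed(path))

log_trans = [[math.log(0.9), math.log(0.1)],   # sticky 2-state chain
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5)] * 2
obs_loglik = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0]]  # per-frame log p(y|x)
path = viterbi(obs_loglik, log_trans, log_init)
```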
RECOGNITION
We use the data model constructed in the training phase to produce a condensed version of the state graph
For each frame n we perform a greedy search that seeks a plausible collection of states x ∈ L for that frame
This is accomplished by searching for states x giving large values to p(yn|x). The search is performed by
Finding the most likely 1-note hypotheses
Then considering 2-note hypotheses, and so on
Each frame n is thus associated with a plausible collection of states An
The states are blended across neighbouring frames by letting Bn = An-1 ∪ An ∪ An+1
The graph is constructed by restricting the full graph to the sets Bn
Disadvantage: if the true state at frame n is not captured by Bn, it cannot be recovered during recognition
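The greedy search and the blending step might look roughly like this; the scoring function stands in for p(yn|x), and all sets, pitches and parameters are invented for the sketch:

```python
# Sketch of the pruning idea: for each frame, grow note hypotheses
# greedily (1-note, then 2-note, ...), keeping only hypotheses that score
# well under a stand-in for p(yn|x); then blend neighbouring frames'
# candidate sets An into Bn so that small misalignments are survivable.

def candidate_states(score, notes, max_notes=3, beam=2):
    hyps = [frozenset([n]) for n in notes]            # 1-note hypotheses
    keep = sorted(hyps, key=score, reverse=True)[:beam]
    result = set(keep)
    for _ in range(max_notes - 1):                    # extend to k+1 notes
        hyps = {h | {n} for h in keep for n in notes if n not in h}
        keep = sorted(hyps, key=score, reverse=True)[:beam]
        result |= set(keep)
    return result

def blend(A, n):
    # Bn = An-1 ∪ An ∪ An+1 (clamped at the ends of the recording)
    return A[max(0, n - 1)] | A[n] | A[min(len(A) - 1, n + 1)]

# toy score: hypotheses containing the "true" pitches 60 and 64 score high
cands = candidate_states(lambda h: sum(n in (60, 64) for n in h), [60, 64, 67])
```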
EXPERIMENTS
The hidden Markov model was trained using data taken from various Mozart piano sonatas
The results concern a performance of Sonata 18, K.570
Objective measure of performance: edit distance
Recognition error rates are reported as
Note error rate 39% (184 substitutions, 241 deletions, 108 insertions)
If two adjacent recognized chords have a pitch in common, it is assumed that the note is not rearticulated
Inability to distinguish between chord homonyms
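The edit-distance metric used above can be sketched directly; the note sequences and counts below are invented for the example, not the paper's data:

```python
# Levenshtein (edit) distance between the true note sequence and the
# recognized one: the minimal number of substitutions, deletions and
# insertions. The note error rate is distance / length of the truth.

def edit_distance(truth, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(truth) + 1)]
    for i in range(len(truth) + 1):
        d[i][0] = i                        # delete all remaining truth notes
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # insert all remaining hyp notes
    for i in range(1, len(truth) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (truth[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]

truth = [60, 64, 67, 72]                   # hypothetical MIDI pitches
hyp = [60, 65, 67]                         # one substitution, one deletion
rate = edit_distance(truth, hyp) / len(truth)
```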
CONCLUSION
Recognition results leave room for improvement
Results may be useful in a number of Music Information Retrieval applications tolerant of errorful representations
The current system works with no knowledge of the plausibility of various chord sequences
A probabilistic model of the likelihood of chord sequences could supply this musical knowledge
The current system makes almost no effort to model the acoustic characteristics of the highly informative note onsets
A more sophisticated “attack” model would help in recognizing the many repeated notes which the system currently misses
REFERENCES
Christopher Raphael. Automatic transcription of piano music. In Proceedings of the 3rd Annual International Symposium on Music Information Retrieval (ISMIR), Michael Fingerhut, Ed., pp. 15-19, IRCAM - Centre Pompidou, Paris, France, October 2002.