
An End-to-End Neural Network for Polyphonic Piano Music Transcription

Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon

Abstract—We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments. We investigate various neural network architectures for the acoustic models and also investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare the performance of the neural network based acoustic models with two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yield the best performance across all evaluation metrics. We also observe improved performance with the application of the music language models. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.

Index Terms—Automatic Music Transcription, Deep Learning, Recurrent Neural Networks, Music Language Models.

EDICS Category: AUD-MSP, AUD-MIR, MLR-DEEP

I. INTRODUCTION

AUTOMATIC Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). AMT aims to generate a symbolic, score-like transcription, given a polyphonic acoustic signal. Music transcription is considered to be a difficult problem even by human experts and current music transcription systems fail to match human performance [1]. Polyphonic AMT is a difficult problem because concurrently sounding notes from one or more instruments cause a complex interaction and overlap of harmonics in the acoustic signal. Variability in the input signal also depends on the specific type of instrument being used. Additionally, AMT systems with unconstrained polyphony have a combinatorially very large output space, which further complicates the modeling problem. Typically, variability in the input signal is captured by models that aim to learn the timbral properties of the instrument being transcribed [2], [3], while the issues relating to a large output space are dealt with by constraining the models to have a maximum polyphony [4], [5].

The authors are with the Centre for Digital Music, School of Electronic Engineering and Computer Science, Queen Mary University of London, E1 4NS, London, U.K.

EB is supported by a Royal Academy of Engineering Research Fellowship (grant no. RF/128).

Email: {s.s.sigtia,emmanouil.benetos,s.e.dixon}@qmul.ac.uk

The majority of current AMT systems are based on the principle of describing the input magnitude spectrogram as a weighted combination of basis spectra corresponding to pitches. The basis spectra can be estimated by various techniques such as non-negative matrix factorisation (NMF) and sparse decomposition. Unsupervised NMF approaches [6], [7] aim to learn a dictionary of pitch spectra from the training examples. However, purely unsupervised approaches can often lead to bases that do not correspond to musical pitches, therefore causing issues with interpreting the results at test time. These issues with unsupervised spectrogram factorisation methods are addressed by incorporating harmonic constraints in the training algorithm [8], [9]. Spectrogram factorisation based techniques were extended with the introduction of probabilistic latent component analysis (PLCA) [10]. PLCA aims to fit a latent variable probabilistic model to normalised spectrograms. PLCA based models are easy to train with the expectation-maximisation (EM) algorithm and have been extended and applied extensively to AMT problems [11], [3].

As an alternative to spectrogram factorisation techniques, there has been considerable interest in discriminative approaches to AMT. Discriminative approaches aim to directly classify features extracted from frames of audio to the output pitches. This approach has the advantage that instead of constructing instrument specific generative models, complex classifiers can be trained using large amounts of training data to capture the variability in the inputs. When using discriminative approaches, the performance of the classifiers is dependent on the features extracted from the signal. Recently, neural networks have been applied to raw data or low level representations to jointly learn the features and classifiers for a task [12]. Over the years there have been many experiments that evaluate discriminative approaches for AMT. Poliner and Ellis [13] use support vector machines (SVMs) to classify normalised magnitude spectra. Nam et al. [14] superimpose an SVM on top of a deep belief network (DBN) in order to learn the features for an AMT task. Similarly, a bi-directional recurrent neural network (RNN) is applied to magnitude spectrograms for polyphonic transcription in [15].

In large vocabulary speech recognition systems, the information contained in the acoustic signal alone is often not sufficient to resolve ambiguities between possible outputs. A language model is used to provide a prior probability of the current word given the previous words in a sentence. Statistical language models are essential for large vocabulary speech recognition [16].


Similarly to speech, musical sequences exhibit temporal structure. In addition to an accurate acoustic model, a model that captures the temporal structure of music, or a music language model (MLM), can potentially help improve the performance of AMT systems. Unlike speech, language models are not common in most AMT models due to the challenging problem of modelling the combinatorially large output space of polyphonic music. Typically, the outputs of the acoustic models are processed by pitch specific, two-state hidden Markov models (HMMs) that enforce smoothing and duration constraints on the output pitches [3], [13]. However, extending this to modelling the high-dimensional outputs of a polyphonic AMT system has proved to be challenging, although there are some studies that explore this idea. A dynamic Bayesian network is used in [17] to estimate prior probabilities of note combinations in an NMF based transcription framework. Similarly, in [18], a recurrent neural network (RNN) based MLM is used to estimate prior probabilities of note sequences, alongside a PLCA acoustic model. A sequence transduction framework is proposed in [19], where the acoustic and language models are combined in a single RNN.

The ideas presented in this paper are extensions of the preliminary experiments in [20]. We propose an end-to-end architecture for jointly training both the acoustic and the language models for an AMT task. We evaluate the performance of the proposed model on a dataset of polyphonic piano music. We train neural network acoustic models to identify the pitches in a frame of audio. The discriminative classifiers can in theory be trained on complex mixtures of instrument sources, without having to account for each instrument separately. The neural network classifiers can be directly applied to the time-frequency representation, eliminating the need for a separate feature extraction stage. In addition to the deep feed-forward neural network (DNN) and RNN architectures in [20], we explore using convolutional neural nets (ConvNets) as acoustic models. ConvNets were initially proposed as classifiers for object recognition in computer vision, but have found increasing application in speech recognition [21], [22]. Although ConvNets have been applied to some problems in MIR [23], [24], they remain unexplored for transcription tasks. We also include comparisons with two state-of-the-art spectrogram factorisation based acoustic models [3], [8] that are popular in the AMT literature. As mentioned before, the high dimensional outputs of the acoustic model pose a challenging problem for language modelling. We propose using RNNs as an alternative to state space models like factorial HMMs [25] and dynamic Bayesian networks [17] for modelling the temporal structure of notes in music. RNN based language models were first used alongside a PLCA acoustic model in [18]. However, in that setup, the language model is used to iteratively refine the predictions in a feedback loop, resulting in a non-causal and theoretically unsatisfactory model. In the hybrid framework, approximate inference over the output variables is performed using beam search. However, beam search can be computationally expensive when used to decode long temporal sequences. We apply the efficient hashed beam search algorithm proposed in [26] for inference. The new inference algorithm reduces decoding time by an order of magnitude and makes the proposed model suitable for real-time applications. Our results show that convolutional neural network acoustic models outperform the remaining acoustic models over a number of evaluation metrics. We also observe improved performance with the application of the music language models.

The rest of the paper is organised as follows: Section II describes the neural network models used in the experiments, Section III discusses the proposed model and the inference algorithm, and Section IV details the model evaluation and experimental results. Discussion, future work and conclusions are presented in Section V.

II. BACKGROUND

In this section we describe the neural network models used for the acoustic and language modelling. Although neural networks are an old concept, they have recently been applied to a wide range of machine learning problems with great success [12]. One of the primary reasons for their recent success has been the availability of large datasets and large-scale computing infrastructure [27], which makes it feasible to train networks with millions of parameters. The parameters of any neural network architecture are typically estimated with numerical optimisation techniques. Once a suitable cost function has been defined, the derivatives of the cost with respect to the model parameters are found using the backpropagation algorithm [28] and the parameters are updated using stochastic gradient descent (SGD) [29]. SGD has the useful property that the model parameters are iteratively updated using small batches of data. This allows the training algorithm to scale to very large datasets. The layered, hierarchical structure of neural nets makes end-to-end training possible, which implies that the network can be trained to predict outputs from low-level inputs without extracting features. This is in contrast to many other machine learning models whose performance is dependent on the features extracted from the data. Their ability to jointly learn feature transformations and classifiers makes neural networks particularly well suited to problems in MIR [30].

A. Acoustic Models

1) Deep Neural Networks: DNNs are powerful machine learning models that can be used for classification and regression tasks. DNNs are characterised by having one or more layers of non-linear transformations. Formally, one layer of a DNN performs the following transformation:

h_{l+1} = f(W_l h_l + b_l).    (1)

In Equation 1, W_l and b_l are the weight matrix and bias for layer l, 0 ≤ l ≤ L, and f is some non-linear function that is applied element-wise. For the first layer, h_0 = x, where x is the input. In all our experiments, we fix f to be the sigmoid function, f(x) = 1/(1 + e^{-x}). The output of the final layer h_L is transformed according to the given problem to yield a posterior probability distribution over the output variables P(y|x, θ).

Fig. 1. Neural network architectures for acoustic modelling: (a) DNN, (b) RNN, (c) ConvNet.

The parameters θ = {W_l, b_l}_0^L are numerically estimated with the backpropagation algorithm and SGD. Figure 1a shows a graphical representation of the DNN architecture; the dashed arrows represent intermediate hidden layers. For acoustic modelling, the input to the DNN is a frame of features, for example a magnitude spectrogram or the constant Q transform (CQT), and the DNN is trained to predict the probability of the pitches present in the frame, P(y_t|x_t), at some time t.
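For illustration, the minimal numpy sketch below computes the 88 pitch posteriors of a single frame according to Equation 1. It is not the implementation used in our experiments; the layer sizes and random weights are placeholders (the 252-dimensional input and 88 outputs match the representation described in Section IV).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, params):
    """Forward pass of Equation 1: h_{l+1} = f(W_l h_l + b_l).

    params is a list of (W, b) pairs; the final layer yields 88
    independent pitch probabilities P(y_t(i) = 1 | x_t).
    """
    h = x
    for W, b in params:
        h = sigmoid(W @ h + b)
    return h

# Toy dimensions: 252-bin CQT frame -> two hidden layers -> 88 pitches.
rng = np.random.default_rng(0)
sizes = [252, 125, 125, 88]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
frame = rng.standard_normal(252)          # one pre-processed CQT frame
pitch_probs = dnn_forward(frame, params)  # shape (88,)
```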

2) Recurrent Neural Networks: DNNs are good classifiers for stationary data, like images. However, they are not designed to account for sequential data. RNNs are natural extensions of DNNs, designed to handle sequential or temporal data. This makes them more suited for AMT tasks, since consecutive frames of audio exhibit both short-term and long-term temporal patterns [31]. RNNs are characterised by recursive connections between the hidden layer activations at some time t and the hidden layer activations at t − 1, as shown in Figure 1b. Formally, the hidden layer of an RNN at time t performs the following computation:

h_{l+1}^t = f(W_l^f h_l^t + W_l^r h_l^{t-1} + b_l).    (2)

In Equation 2, W_l^f is the weight matrix from the input to the hidden units, W_l^r is the weight matrix for the recurrent connection and b_l are the biases for layer l. From Equation 2, we can see that the recursive update of the hidden state at time t implies that h^t is implicitly a function of all the inputs up to time t, x_0^t. Similar to DNNs, RNNs are made up of one or more layers of hidden units. The outputs of the final layer are transformed with a suitable function to yield the desired distribution over the outputs. The RNN parameters θ = {W_l^f, W_l^r, b_l}_0^L are calculated using the backpropagation through time (BPTT) algorithm [32] and SGD. For acoustic modelling, the RNN acts on a sequence of input features to yield a probability distribution over the outputs P(y_t|x_0^t), where x_0^t = {x_0, x_1, ..., x_t}.

3) Convolutional Networks: ConvNets are neural nets with a unique structure. Convolutional layers are specifically designed to preserve the spatial structure of the inputs. In a convolutional layer, a set of weights act on a local region of the input. These weights are then repeatedly applied to the entire input to produce a feature map. Convolutional layers are characterised by the sharing of weights across the entire input. As shown in Figure 1c, ConvNets are comprised of alternating convolutional and pooling layers, followed by one or more fully connected layers (same as DNNs). Formally, the repeated application of the shared weights to the input signal constitutes a convolution operation:

h_{j,k} = f(\sum_r W_{r,j} x_{r+k-1} + b_j).    (3)

The input x is a vector of inputs from different channels, for example RGB channels for images. Formally, x = {x_0, x_1, ...}, where each input x_i represents an input channel. Each input band x_i has an associated weight matrix. All the weights of a convolutional layer are collectively represented as a four dimensional tensor. Given an m × n region from a feature map h, the max pooling function returns the maximum activation in the region. At any time t, the input to the ConvNet is a window of 2k + 1 feature frames x_{t-k}^{t+k}. The outputs of the final layer yield the posterior distribution P(y_t|x_{t-k}^{t+k}).

There are several motivations for using ConvNets for acoustic modelling. There are many experiments in MIR that suggest that rather than classifying a single frame of input, better prediction accuracies can be achieved by incorporating information over several frames of inputs [26], [33], [34]. Typically, this is achieved either by applying a context window around the input frame or by aggregating information over time by calculating statistical moments over a window of frames. Applying a context window around a frame of low level spectral features, like the short time Fourier transform (STFT), would lead to a very high dimensional input, which is impractical. Secondly, taking the mean, standard deviation or other statistical moments makes very simplistic assumptions about the distribution of data over time in neighbouring frames. ConvNets, due to their architecture [12], can be directly applied to several frames of inputs to learn features along both the time and the frequency axes. Additionally, when using an input representation like the CQT, ConvNets can learn pitch-invariant features, since inter-harmonic spacings in music signals are constant across log-frequency. Finally, the weight sharing and pooling architecture leads to a reduction in the number of ConvNet parameters, compared to a fully connected DNN. This is a useful property given that very large quantities of labelled data are difficult to obtain for most MIR problems, including AMT.
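The sketch below illustrates the two ingredients discussed above: assembling a zero-padded context window of 2k + 1 frames around each input frame, and applying one time-frequency filter with max pooling along the frequency axis only. It is a simplified illustration in numpy/scipy, not our implementation; a full ConvNet would stack several such layers and channels, and the filter values here are random placeholders.

```python
import numpy as np
from scipy.signal import correlate2d

def context_windows(X, k):
    """Zero-pad a (T, F) spectrogram in time and return one
    (2k+1, F) window per frame, matching the ConvNet input x_{t-k}^{t+k}."""
    T, F = X.shape
    Xp = np.pad(X, ((k, k), (0, 0)))
    return np.stack([Xp[t:t + 2 * k + 1] for t in range(T)])

def conv_freq_pool(window, filt, pool=3):
    """One convolutional feature map (tanh activation) followed by
    max pooling over the frequency axis only, as described above."""
    fmap = np.tanh(correlate2d(window, filt, mode="valid"))
    F = fmap.shape[1] - fmap.shape[1] % pool      # trim to a multiple of pool
    pooled = fmap[:, :F].reshape(fmap.shape[0], F // pool, pool).max(axis=2)
    return pooled

rng = np.random.default_rng(0)
cqt = rng.standard_normal((100, 252))        # 100 frames of a 252-bin CQT
windows = context_windows(cqt, k=3)          # shape (100, 7, 252)
filt = rng.standard_normal((5, 25)) * 0.1    # a (5, 25) time-frequency filter
features = conv_freq_pool(windows[0], filt)  # shape (3, 76)
```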

B. Music Language Models

Given a sequence y = y_0^t, we use the MLM to define a prior probability distribution P(y). y_t is a high-dimensional binary vector that represents the notes being played at t (one time-step of a piano-roll representation). The high dimensional nature of the output space makes modelling y_t a challenging problem. Most post-processing algorithms make the simplifying assumption that all the pitches are independent and model their temporal evolution with independent models [13]. However, for polyphonic music, the pitches that are active concurrently are highly correlated (harmonies, chords). In this section, we describe the RNN music language models first introduced in [35].

1) Generative RNN: The RNNs defined in the earlier sections were used to map a sequence of inputs x to a sequence of outputs y. At each time-step t, the RNN outputs the conditional distribution P(y_t|x_0^t). However, RNNs can be used to define a distribution over some sequence y by connecting the outputs of the RNN at t − 1 to the inputs of the RNN at t, resulting in a distribution of the form:

P(y) = P(y_0) \prod_{t>0} P(y_t|y_0^{t-1})    (4)

Although an RNN predicts y_t conditioned on the high dimensional inputs y_0^{t-1}, the individual pitch outputs y_t(i) are independent, where i is the pitch index (Section IV-C). As mentioned earlier, this is not true for polyphonic music. Boulanger-Lewandowski et al. [35] demonstrate that rather than predicting independent distributions, the parameters of a more complicated parametric output distribution can be conditioned on the RNN hidden state. In our experiments, we use the RNN to output the biases of a neural autoregressive distribution estimator (NADE) [35].

2) Neural Autoregressive Distribution Estimator: The NADE is a distribution estimator for high dimensional binary data [36]. The NADE was initially proposed as a tractable alternative to the restricted Boltzmann machine (RBM). The NADE estimates the joint distribution over high dimensional binary variables as follows:

P(x) = \prod_i P(x_i|x_0^{i-1}).

The NADE is similar to a fully visible sigmoid belief network [37], since the conditional probability of x_i is a non-linear function of x_0^{i-1}. The NADE computes the conditional distributions according to:

h_i = σ(W_{:,<i} x_0^{i-1} + b_h)    (5)

P(x_i|x_0^{i-1}) = σ(V_i h_i + b_v^i)    (6)

where W, V are weight matrices, W_{:,<i} is a submatrix of W that denotes the first i − 1 columns and b_h, b_v are the hidden and visible biases, respectively. The gradients of the likelihood function P(x) with respect to the model parameters θ = {W, V, b_h, b_v} can be found exactly, which is not possible with RBMs [36]. This property allows the NADE to be readily combined with other models and the models can be jointly trained with gradient based optimisers.
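A minimal numpy sketch of Equations 5 and 6 is given below; it evaluates log P(x) for a single binary vector using the standard running-sum trick so that each conditional costs O(H). The dimensions are illustrative (88 visible units, 150 hidden units) and the random parameters are placeholders, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_likelihood(x, W, V, b_h, b_v):
    """log P(x) = sum_i log P(x_i | x_0..x_{i-1}) via Equations 5-6.

    W: (H, D) weights into the hidden units, V: (D, H) output weights,
    b_h: (H,) hidden biases, b_v: (D,) visible biases.
    """
    D = x.shape[0]
    log_p = 0.0
    a = b_h.copy()                          # running sum W[:, :i] @ x[:i]
    for i in range(D):
        h_i = sigmoid(a)                    # Equation 5
        p_i = sigmoid(V[i] @ h_i + b_v[i])  # Equation 6
        log_p += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a += W[:, i] * x[i]                 # include x_i for the next conditional
    return log_p

rng = np.random.default_rng(0)
D, H = 88, 150
x = rng.integers(0, 2, size=D)              # one binary piano-roll frame
ll = nade_log_likelihood(x, rng.standard_normal((H, D)) * 0.01,
                         rng.standard_normal((D, H)) * 0.01,
                         np.zeros(H), np.zeros(D))
```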

3) RNN-NADE: In order to learn high dimensional, temporal distributions for the MLM, we combine the NADE and an RNN, as proposed in [35]. The resulting model yields a sequence of NADEs conditioned on an RNN, which describe a distribution over sequences of polyphonic music. The joint model is obtained by letting the parameters of the NADE at each time step be a function of the RNN hidden state, θ_NADE^t = f(h^t), where h^t is the hidden state of the final layer of the RNN (Equation 2) at time t. In order to limit the number of free parameters in the model, we only allow the NADE biases to be functions of the RNN hidden state, while the remaining parameters (W, V) are held constant over time. We compute the NADE biases as a linear transformation of the RNN hidden state plus an added bias term [35]:

b_v^t = b_v + W_1 h^t    (7)

b_h^t = b_h + W_2 h^t    (8)

W_1 and W_2 are weight matrices from the RNN hidden state to the visible and hidden biases, respectively. The gradients with respect to all the model parameters can be easily computed using the chain rule and the joint model is trained using the BPTT algorithm [35].
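Equations 7 and 8 amount to a single affine map per time step, as in the small sketch below (shapes are illustrative). The resulting time-dependent biases would then be passed, together with the shared W and V, to a NADE routine such as the one sketched above to evaluate P(y_t|y_0^{t-1}).

```python
import numpy as np

def nade_biases_at_t(h_t, b_v, b_h, W1, W2):
    """Equations 7-8: condition the NADE biases on the RNN hidden state.

    Only the biases vary with time; the NADE weights W, V are shared
    across all time steps.
    """
    b_v_t = b_v + W1 @ h_t   # visible biases at time t
    b_h_t = b_h + W2 @ h_t   # hidden biases at time t
    return b_v_t, b_h_t

rng = np.random.default_rng(0)
n_rnn, D, H = 200, 88, 150
b_v_t, b_h_t = nade_biases_at_t(
    rng.standard_normal(n_rnn),               # h^t from Equation 2
    np.zeros(D), np.zeros(H),
    rng.standard_normal((D, n_rnn)) * 0.01,   # W1
    rng.standard_normal((H, n_rnn)) * 0.01)   # W2
```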

III. PROPOSED MODEL

In this section we describe the proposed neural network model for polyphonic AMT. As mentioned earlier, the model comprises an acoustic model and a music language model. In addition to the acoustic models in [20], we propose the use of ConvNets for identifying the pitches present in the input audio signal and compare their performance to various other acoustic models (Section IV-F). The acoustic and language models are combined under a single training objective using a hybrid RNN architecture, yielding an end-to-end model for AMT with unconstrained polyphony. We first describe the hybrid RNN model, followed by a description of the proposed inference algorithm.

A. Hybrid RNN

The hybrid RNN is a graphical model that combines the predictions of any arbitrary frame-level acoustic model with an RNN-based language model. Let x = x_0^T be a sequence of inputs and let y = y_0^T be the corresponding transcriptions. The joint probability of y, x can be factorised as follows:

P(y, x) = P(y_0 ... y_T, x_0 ... x_T)
        = P(y_0) P(x_0|y_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(x_t|y_t).    (9)

Fig. 2. Graphical Model of the Hybrid Architecture

The factorisation in Equation 9 makes the following independence assumptions:

P(y_t|y_0^{t-1}, x_0^{t-1}) = P(y_t|y_0^{t-1})    (10)

P(x_t|y_0^t, x_0^{t-1}) = P(x_t|y_t)    (11)

These independence assumptions are similar to the assumptions made in HMMs [38]. Figure 2 is a graphical representation of the hybrid model. In Equation 9, P(x_t|y_t) is the emission probability of an input, given output y_t. Using Bayes' rule, the conditional distribution can be written as follows:

P(y|x) ∝ P(y_0|x_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(y_t|x_t),    (12)

where the marginals P(y_t) and the priors P(y_0), P(x_0) are assumed to be fixed w.r.t. the model parameters.

With this reformulation of the joint distribution, we observe that the conditional distribution P(y|x) is directly proportional to the product of two distributions. The prior distribution P(y_t|y_0^{t-1}) is obtained using a generative RNN (Section II-B1) and the posterior distribution over note-combinations P(y_t|x_t) can be modelled using any frame based classifier. The hybrid RNN graphical model is similar to an HMM, where the state transition probabilities of the HMM, P(y_t|y_{t-1}), have been generalised to include connections from all previous outputs, resulting in the P(y_t|y_0^{t-1}) terms in Equation 12.

For the problem of automatic music transcription, the input time-frequency representation forms the input sequence x, while the output piano-roll sequence y denotes the transcriptions. The priors P(y_t|y_0^{t-1}) are obtained from the RNN-NADE MLM, while the posterior distributions P(y_t|x_t) are obtained from the acoustic models. The models can then be trained by finding the derivatives of the acoustic and language model objectives with respect to the model parameters and training using gradient descent. The independent training of the acoustic and language models is a useful property, since datasets available for music transcription are considerably smaller in size compared to datasets in computer vision and speech. However, large corpora of MIDI music are relatively easy to find on the internet. Therefore, in theory, the MLMs can be trained on large corpora of MIDI music, analogous to language model training in speech.
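The decoder described in the next section accumulates, for each candidate frame, a score that combines the two models according to Equation 12, together with the subtraction of the log pitch marginals that appears in Algorithm 1. The sketch below is illustrative only: it treats every distribution as a product of independent Bernoullis, which holds for the acoustic model outputs but is a simplification for the NADE-based MLM, and the probability values are placeholders.

```python
import numpy as np

def frame_log_score(y, acoustic_probs, language_probs, pitch_marginals):
    """Score one candidate 88-dimensional pitch vector y:
    log P(y_t | x_t) + log P(y_t | y_0^{t-1}) - log P(y_t).
    """
    def bernoulli_ll(p):
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return (bernoulli_ll(acoustic_probs)     # acoustic model posterior
            + bernoulli_ll(language_probs)   # MLM prior given the history
            - bernoulli_ll(pitch_marginals)) # marginal correction (Algorithm 1)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=88)
score = frame_log_score(y,
                        rng.uniform(0.05, 0.95, 88),   # placeholder P(y_t|x_t)
                        rng.uniform(0.05, 0.95, 88),   # placeholder P(y_t|y_0^{t-1})
                        np.full(88, 0.1))              # placeholder marginals
```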

B. Inference

At test time, we would like to find the mode of the conditional output distribution:

y* = argmax_y P(y|x)    (13)

From Equation 12, we observe that the priors P(y_t|y_0^{t-1}) tie the predictions of the acoustic model P(y_t|x_t) to all the predictions made up to time t. This prior term encourages coherence between predictions over time and allows musicological structure learnt by the language models to influence successive predictions. However, this more general structure leads to a more complex inference (or decoding) procedure at test time. This is due to the fact that at time t, the history y_0^{t-1} has not been optimally determined. Therefore, the optimum choice of y_t depends on all the past model predictions. Proceeding greedily in a chronological manner by selecting the y_t that optimises P(y_t|x_t) does not necessarily yield good solutions. We are interested in solutions that globally optimise P(y|x). However, exhaustively searching for the best sequence is intractable, since the number of possible configurations of y_t is exponential in the number of output pitches (2^n for n pitches).

Beam search is a graph search algorithm that is commonly used to decode the conditional outputs of an RNN [39], [19], [26]. Beam search scales to arbitrarily long sequences and the computational cost versus accuracy trade-off can be controlled via the width of the beam. The inference algorithm comprises the following steps: at any time t, the algorithm maintains at most w partial solutions, where w is the beam width or the beam capacity. The solutions in the beam at t correspond to sub-sequences of length t. Next, all possible descendants of the w partial solutions in the beam are enumerated and then sorted in decreasing order of log-likelihood. From these candidate solutions, the top w solutions are retained as beam entries for further search. Beam search can be readily applied to problems where the number of candidate solutions at each step is limited, like speech recognition [40] and audio chord estimation [26]. However, using beam search for decoding sequences with a large output space is prohibitively inefficient.

When the space of candidate solutions is large, the algorithm can be constrained to consider only K new candidates for each partial solution in the beam, where K is known as the branching factor. The procedure for selecting the K candidates can be designed according to the given problem. For the hybrid architecture, from Equation 12 we note:

P(y_0^t|x_0^t) ∝ P(y_0^{t-1}|x_0^{t-1}) P(y_t|y_0^{t-1}) P(y_t|x_t)    (14)


At time t, the partial solutions in the beam correspond to configurations of y_0^{t-1}. Therefore, given P(y_0^{t-1}|x_0^{t-1}), the K configurations that maximise P(y_t|y_0^{t-1}) P(y_t|x_t) would be a suitable choice of candidates for y_t. However, for many families of distributions, it might not be possible to enumerate y_t in decreasing order of likelihood. In [19], the authors propose forming a pool of K candidates by drawing random samples from the conditional output distributions. However, random sampling can be inefficient and obtaining independent samples can be very expensive for many types of distributions. As an alternative, we propose to sample solutions from the posterior distribution of the acoustic model P(y_t|x_t) [20]. There are 2 main motivations for doing this. Firstly, the outputs of the acoustic model are independent class probabilities. Therefore, it is easy to enumerate samples in decreasing order of log-likelihood [19]. Secondly, we avoid the accumulation of errors in the RNN predictions over time [41]. The RNN models are trained to predict y_t, given the true outputs y_0^{t-1}. However, at test time, outputs sampled from the RNN are fed back as inputs at the next time step. This discrepancy between the training and test objectives can cause prediction errors to accumulate over time.

Algorithm 1 High Dimensional Beam Search
Find the most likely sequence y given x with a beam width w and branching factor K.

beam ← new beam object
beam.insert(0, {})
for t = 1 to T do
    new_beam ← new beam object
    for l, s, m_a, m_l in beam do
        for k = 1 to K do
            y′ = m_a.next_most_probable()
            l′ = log P_l(y′|s)P_a(y′|x_t) − log P(y′)
            m′_l ← m_l with y_t := y′
            m′_a ← m_a with x := x_{t+1}
            new_beam.insert(l + l′, {s, y′}, m′_a, m′_l)
    beam ← new_beam
return beam.pop()

Although generating candidates from the acoustic model yields good results, it requires the use of large beam widths. This makes the inference procedure computationally slow and unsuitable for real-time applications [20]. In this study, we propose using the hashed beam search algorithm proposed in [26]. Beam search is fundamentally limited when decoding long temporal sequences. This is due to the fact that solutions that differ at only a few time-steps can saturate the beam. This causes the algorithm to search a very limited space of possible solutions. This issue can be solved by efficient pruning. The hashed beam search algorithm improves efficiency by pruning solutions that are similar to solutions with a higher likelihood. The metric that determines the similarity of sequences can be chosen in a problem dependent manner and is encoded in the form of a locality sensitive hash function [26]. In Algorithm 1, we outline the beam search algorithm used for our experiments, while Algorithm 2 describes the hash table beam object. In Algorithms 1 and 2, s is a sequence y_0^t, l is the log-likelihood of s, m_a and m_l are the acoustic and language model objects, and f_h is the hash function.

There are two key differences between Algorithm 1 and the algorithm in [20]. First, the priority queue that stores the beam is replaced by a hash table beam object (see Algorithm 2). Secondly, for each entry in the beam we evaluate K candidate solutions. This is in contrast to the algorithm in [20], where once the beam is full, only w candidate solutions are evaluated per iteration. It might appear that the hashed beam search algorithm is more expensive, since it evaluates w * K candidates instead of w candidates. However, by efficiently pruning similar solutions, the algorithm yields better results for much smaller values of w, resulting in a significant increase in efficiency (Section IV-F, Figure 3).

Algorithm 2 Description of beam objects given w, f_h, k

Initialise beam object:
    beam.hashQ = defaultdict of priority queues*
    beam.queue = indexed priority queue of length w**

Insert (l, s) into beam:
    key = f_h(s)
    queue = beam.queue
    hashQ = beam.hashQ[key]
    fits_in_queue = not queue.full() or l ≥ queue.min()
    fits_in_hashQ = not hashQ.full() or l ≥ hashQ.min()
    if fits_in_queue and fits_in_hashQ then
        hashQ.insert(l, s)
        if hashQ.overfull() then
            item = hashQ.del_min()
            queue.remove(item)
        queue.insert(l, s)
        if queue.overfull() then
            item = queue.del_min()
            beam.hashQ[f_h(item.s)].remove(item)

* A priority queue of length k maintains the top k entries at all times.
** An indexed priority queue allows efficient random access and deletion.

Algorithm 2 describes the hash table beam object. The hashed beam search algorithm offers several advantages compared to the standard beam search algorithm. The notion of similarity of solutions can be encoded in the form of hash functions. For music transcription, we choose the similarity function to be the last n frames in a sequence s. n = 1 corresponds to a dynamic programming like decoding (similar to HMMs), where all sequences with the same final state y_t are considered to be equivalent and the sequence with the highest log-likelihood is retained. n = len(sequence) corresponds to regular beam search. Additionally, the hashed beam search algorithm can maintain ≥ 1 solution per hash key through a process called chaining [42].
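The hash function and the chaining behaviour can be sketched in a few lines of Python. This is an illustrative stand-in only: a dictionary of sorted lists replaces the indexed priority queues of Algorithm 2, and the 88-dimensional binary frames are represented as tuples so they can serve as dictionary keys.

```python
from collections import defaultdict

def f_h(sequence, n=1):
    """Hash a partial solution y_0^t by its last n frames.

    n=1 gives HMM-like (dynamic-programming) pruning;
    n=len(sequence) degenerates to regular beam search.
    """
    return tuple(tuple(frame) for frame in sequence[-n:])

def insert(table, log_lik, sequence, k=2, n=1):
    """Chaining: keep at most k solutions (by log-likelihood) per hash key."""
    bucket = table[f_h(sequence, n)]
    bucket.append((log_lik, sequence))
    bucket.sort(key=lambda entry: entry[0], reverse=True)
    del bucket[k:]

table = defaultdict(list)
insert(table, -12.3, [(0,) * 88, (1,) + (0,) * 87])   # a toy two-frame solution
```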

IV. EVALUATION

In this section we describe how the performance of the proposed model is evaluated for a polyphonic transcription task.


A. Dataset

We evaluate the proposed model on the MAPS dataset [43]. The dataset consists of audio and corresponding annotations for isolated sounds, chords and complete pieces of piano music. For our experiments, we use only the full musical pieces for training and testing the neural network acoustic models and MLMs. The dataset consists of 270 pieces of classical music and MIDI annotations. There are 9 categories of recordings corresponding to different piano types and recording conditions, with 30 recordings per category. 7 categories of audio are produced by software piano synthesisers, while 2 sets of recordings are obtained from a Yamaha Disklavier upright piano. Therefore the dataset consists of 210 synthesised recordings and 60 real recordings.

We perform 2 sets of investigations in this paper. The first set of experiments investigates the effect of the RNN MLMs on the predictions of the acoustic models. For this task, we divide the entire dataset into 4 disjoint train/test splits, so as to ensure that the folds are music piece-independent. Specifically, for some of the musical pieces in the dataset, audio for each piece is rendered using more than one piano. Therefore, while creating the splits, we ensure that the training and test data do not contain any overlapping pieces.1 For each split, we select 80% of the data for training (216 musical pieces) and the remaining for testing (54 pieces). From each training split, we hold out 26 tracks as a validation set for selecting the hyper-parameters for the training algorithm (Section IV-D). All the reported results are mean values of the evaluation metrics over the 4 splits. From now on, this evaluation configuration will be referred to as Configuration 1.

Although the above experimental setup is useful for investigating the effectiveness of the RNN MLMs, the training set contains examples from piano models which are used for testing. This is usually not true in practice, where the instrument models/sources at test time are unknown and usually do not coincide with the instruments used for training. A majority of experiments with the MAPS dataset train and test models on disjoint instrument types [3], [2], [44]. We thus perform a second set of experiments to compare the performance of the different neural network acoustic models in a more realistic setting. We train the acoustic models using the 210 tracks created using synthesised pianos (180 tracks for training and 30 tracks for validation) and we test the acoustic models on the 60 audio recordings obtained from the Yamaha Disklavier piano (models 'ENSTDkAm' and 'ENSTDkCl' in the MAPS database). In this experiment, we do not apply the language models, since the train and test sets contain overlapping musical pieces. In addition to the neural network acoustic models, we include comparisons with two state-of-the-art unsupervised acoustic models [3], [8] for both experiments. This instrument source-independent evaluation configuration will be referred to from now on as Configuration 2.

1 Details available at: http://www.eecs.qmul.ac.uk/~sss31/TASLP/info.html

B. Metrics

We use both frame and note based metrics to assess the performance of the proposed system [45]. Frame-based evaluations are made by comparing the transcribed binary output and the MIDI ground truth frame-by-frame. For note-based evaluation, the system returns a list of notes, along with the corresponding pitches, onset and offset times. We use the F-measure, precision, recall and accuracy for both frame and note based evaluation. Formally, the frame-based metrics are defined as:

P = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} (TP[t] + FP[t])}

R = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} (TP[t] + FN[t])}

A = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} (TP[t] + FP[t] + FN[t])}

F = \frac{2 \cdot P \cdot R}{P + R}

where TP[t] is the number of true positives for the events at t, FP[t] is the number of false positives and FN[t] is the number of false negatives. The summation over T is carried out over the entire test data. Similarly, analogous note-based metrics can be defined [45]. A note event is assumed to be correct if its predicted pitch onset is within a ±50 ms range of the ground truth onset.
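The frame-based metrics above reduce to a few numpy operations when the transcription and the ground truth are available as aligned binary piano rolls of shape (T, 88). The sketch below illustrates this with synthetic data; it is not the evaluation code used for the reported results.

```python
import numpy as np

def frame_metrics(pred, truth):
    """Frame-based P, R, A, F from aligned binary piano rolls of shape (T, 88)."""
    TP = np.sum((pred == 1) & (truth == 1))
    FP = np.sum((pred == 1) & (truth == 0))
    FN = np.sum((pred == 0) & (truth == 1))
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    A = TP / (TP + FP + FN)
    F = 2 * P * R / (P + R)
    return P, R, A, F

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=(3000, 88))
pred = truth.copy()
flip = rng.random(truth.shape) < 0.05      # corrupt 5% of entries
pred[flip] = 1 - pred[flip]
print(frame_metrics(pred, truth))
```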

C. Preprocessing

We transform the input audio to a time-frequency representation which is then input to the acoustic models. In [20], we used the magnitude short-time Fourier transform (STFT) as input to the acoustic models. However, here we experiment with the constant Q transform (CQT) as the input representation. There are two motivations for this. Firstly, the CQT is fundamentally better suited as a time-frequency representation for music signals, since the frequency axis is linear in pitch [46]. Another advantage of using the CQT is that the resulting representation is much lower dimensional than the STFT. Having a lower dimensional representation is useful when using neural network acoustic models, as it reduces the number of parameters in the model.

We downsample the audio to 16 kHz from 44.1 kHz. We then compute CQTs over 7 octaves with 36 bins per octave and a hop size of 512 samples, resulting in a 252 dimensional input vector of real values, with a frame rate of 31.25 frames per second. Additionally, we compute the mean and standard deviation of each dimension over the training set and transform the data by subtracting the mean and dividing by the standard deviation. These pre-processed vectors are used as inputs to the acoustic model. For the language model training, we sample the MIDI ground truth transcriptions of the training data at the same rate as the audio (32 ms). We obtain sequences of 88 dimensional binary vectors for training the RNN-NADE language models. The 88 outputs correspond to notes A0-C8 on a piano.
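The input pipeline described above can be sketched as follows. The choice of librosa is an assumption for illustration (the paper does not name a CQT implementation), and train_paths is a hypothetical placeholder; the sampling rate, hop size and 7 × 36 = 252 CQT bins match the parameters stated above.

```python
import numpy as np
import librosa

def cqt_features(path, mean=None, std=None):
    """Compute the 252-bin CQT input representation described above.

    mean/std are per-dimension statistics computed over the training
    set; at 16 kHz with hop_length=512 the frame rate is 31.25 fps.
    """
    y, sr = librosa.load(path, sr=16000)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                           n_bins=7 * 36, bins_per_octave=36)).T  # (frames, 252)
    if mean is not None:
        C = (C - mean) / std
    return C

# Training-set statistics would be accumulated over all training tracks, e.g.:
# X = np.concatenate([cqt_features(p) for p in train_paths])
# mean, std = X.mean(axis=0), X.std(axis=0)
```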

The test audio is sampled at a frame rate of 100 Hz, yielding 100 * 30 = 3000 frames per test file. For 54 test files over 4 splits, we obtain a total of 648,000 frames at test time.2

2 It should be noted that carrying out statistical significance tests on a track level is an over-simplification in the context of multi-pitch detection, as argued in [47].

D. Network Training

In this section we describe the details of the training procedure for the various acoustic model architectures and the RNN-NADE language model. All the acoustic models have 88 units in the output layer, corresponding to the 88 output pitches. The outputs of the final layer are transformed by a sigmoid function and yield independent pitch probabilities P(y_t(i) = 1|x). All the models are trained by maximising the log-likelihood over all the examples in the training set.

1) DNN Acoustic Models: For DNN training, we constrain all the hidden layers of the model to have the same number of units to simplify searching for good model architectures. We perform a grid search over the following parameters: number of layers L ∈ {1, 2, 3, 4}, number of hidden units H ∈ {25, 50, 100, 125, 150, 200, 250}, and hidden unit activations act ∈ {ReLU, sigmoid}, where ReLU is the rectified linear unit activation function [48]. We found Dropout [49] to be essential for improving generalisation performance. A Dropout rate of 0.3 was used for the input layer and all the hidden layers of the network. Rather than using learning rate and momentum update schedules, we use ADADELTA [50] to adapt the learning rate over iterations. In addition to Dropout, we use early stopping to minimise overfitting. Training was stopped if the cost over the validation set did not decrease for 20 epochs. We used mini-batches of size 100 for the SGD updates.

2) RNN Acoustic Models: For RNN training, we constrain all the hidden layers to have the same number of units. We perform a grid search over the following parameters: L ∈ {1, 2, 3}, H ∈ {25, 50, 100, 150, 200, 250}. We fix the hidden activations of the recurrent layers to be the hyperbolic tangent function. We found that ADADELTA was not particularly well suited for training RNNs. We use an initial learning rate of 0.001 and linearly decrease it to 0 over 1000 iterations. We use a constant momentum rate of 0.9. The training sequences are further divided into sub-sequences of length 100. The SGD updates are made one sub-sequence at a time, without any mini-batching. Similar to the DNNs, we use early stopping and stop training if validation cost does not decrease after 20 iterations. In order to prevent gradient explosion in the early stages of training, we use gradient clipping [51]. We clipped the gradients when the norm of the gradient was greater than 5.
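One common variant of the gradient clipping step is sketched below: when the global L2 norm of the gradient exceeds the threshold (5, as stated above), the gradient is rescaled so that its norm equals the threshold. The exact clipping scheme used in our experiments is not fully specified here, so this is an illustrative sketch rather than a description of the implementation.

```python
import numpy as np

def clip_gradients(grads, threshold=5.0):
    """Rescale the gradient list so its global L2 norm never exceeds threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

grads = [np.full((3, 3), 4.0), np.full(3, 4.0)]   # toy gradients with a large norm
clipped = clip_gradients(grads)                    # global norm now equals 5.0
```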

3) ConvNet Acoustic Models: The input to the ConvNet is a context window of frames and the target is the central frame in the window [26]. The frames at the beginning and end of the audio are zero padded so that a context window can be applied to each frame. Although pooling can be performed along both axes, we only perform pooling over the frequency axis. We performed a grid search over the following parameters: window size ws ∈ {3, 5, 7, 9}, number of convolutional layers Lc ∈ {1, 2, 3, 4}, number of filters per layer nl ∈ {10, 25, 50, 75, 100}, number of fully connected layers Lfc ∈ {1, 2, 3}, and number of hidden units in fully connected layers H ∈ {200, 500, 1000}. The convolutional activation functions were fixed to be the hyperbolic tangent function, while all the fully connected layer activations were set to the sigmoid function. The pooling size is fixed to P = (1, 3) for all convolutional layers. Dropout with rate 0.5 is applied to all convolutional layers. We tried a large permutation of window shapes for the convolutional layers and the following subset of window shapes yielded good results: w ∈ {(3, 3), (3, 5), (5, 5), (3, 25), (5, 25), (3, 75), (5, 75)}. We observed that classification performance deteriorated sharply for longer filters along the frequency axis. Dropout with rate 0.5 was also applied to all the fully connected layers. The model parameters were trained with SGD and a batch size of 256. An initial learning rate of 0.01 was linearly decreased to 0 over 1000 iterations. A constant momentum rate of 0.9 was used for all the updates. We stopped training if the validation error did not decrease after 20 iterations over the entire training set.

4) RNN-NADE Language Models: The RNN-NADE models were trained with SGD and with sequences of length 100. We performed a grid search over the following parameters: number of recurrent units H_RNN ∈ {50, 100, 150, 200, 250, 300} and number of hidden units for the NADE H_NADE ∈ {50, 100, 150, 200, 250, 300}. The model was trained with an initial learning rate of 0.001, which was linearly reduced to 0 over 1000 iterations. A constant momentum rate of 0.9 was applied throughout training.

We selected the model architectures by performing a grid search over the parameter values described earlier in the section. The various models were evaluated on one train/test split and the best performing architecture was then used for all other experiments.

E. Comparative Approaches

For comparative purposes, two state-of-the-art polyphonic music transcription methods were used for experiments [3], [8]. In both cases, the non-binary pitch activation output of the aforementioned methods was extracted, for performing an in-depth comparison with the proposed neural network models. The multi-pitch detection method of [8] is based on non-negative matrix factorization (NMF) and operates by decomposing an input time-frequency representation as a series of basis spectra (representing pitches) and component activations (indicating pitch activity across time). This method models each basis spectrum as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, enforcing harmonicity and spectral smoothness. As input time-frequency representation, an Equivalent Rectangular Bandwidth (ERB) filterbank is used. Since the method relies on a dictionary of (hand-crafted) narrowband harmonic spectra, system parameters remain the same for the two evaluation configurations.

The multiple-instrument transcription method of [3] is based on shift-invariant PLCA (a convolutive and probabilistic counterpart of NMF).


Post Processing:   Thresholding        HMM                 Hybrid Architecture
Acoustic Model     Frame     Note      Frame     Note      Frame     Note
Benetos [3]        64.20     65.22     64.84     66.05     65.10     66.48
Vincent [8]        58.95     68.5      60.37     68.87     59.78     69.00
DNN                67.54     60.02     68.32     62.26     67.92     63.18
RNN                68.38     63.84     68.09     64.50     69.25     65.24
ConvNet            73.57     65.35     73.75     66.20     74.45     67.05

TABLE I
F-MEASURES FOR MULTIPLE PITCH DETECTION ON THE MAPS DATASET, USING EVALUATION CONFIGURATION 1.

                   P                   R                   A
Acoustic Model     Frame     Note      Frame     Note      Frame     Note
Benetos [3]        59.54     73.51     69.51     60.67     48.47     49.03
Vincent [8]        52.71     79.93     69.04     60.69     43.04     52.92
DNN                65.66     62.62     70.34     63.75     51.76     45.33
RNN                67.89     64.64     70.66     65.85     54.38     48.18
ConvNet            72.45     67.75     76.56     66.36     58.87     50.07

TABLE II
PRECISION, RECALL AND ACCURACY FOR MULTIPLE PITCH DETECTION ON THE MAPS DATASET USING THE HYBRID ARCHITECTURE
(w = 10, K = 4, k = 2, f_h(y_0^t) = y_t), USING EVALUATION CONFIGURATION 1.

Acoustic Model       Benetos [3]   Vincent [8]   DNN     RNN     ConvNet
F-measure (Frame)    59.31         59.60         59.91   57.67   64.14
F-measure (Note)     54.29         59.12         49.43   49.20   54.89

TABLE III
F-MEASURES FOR ACOUSTIC MODELS TRAINED ON SYNTHESISED PIANOS AND TESTED ON REAL RECORDINGS (EVALUATION CONFIGURATION 2).

In this model, the input time-frequency representation is decomposed into a series of basis spectra per pitch and instrument source which are shifted across log-frequency, thus supporting tuning changes and frequency modulations. Outputs include the pitch activation distribution and the instrument source contribution per pitch. Contrary to the parametric model of [8], the basis spectra are pre-extracted from isolated musical instrument sounds. As in the proposed method, the input time-frequency representation of [3] is the CQT. For the investigations with MLMs (Configuration 1), the PLCA models are trained on isolated sound examples from all 9 piano models from the MAPS database (in order for the experiments to be comparable with the proposed method). For the second set of experiments, which investigate the generalisation capabilities of the models (Configuration 2), the PLCA acoustic model is trained on isolated sounds from the synthesised pianos and tested on recordings created using the Yamaha Disklavier piano.

F. Results

In this section we present results from the experiments on the MAPS dataset. As mentioned before, all results are the mean values of the various metrics computed over the 4 different train/test splits. The acoustic models yield a sequence of probabilities for the individual pitches being active (posteriograms). The post-processing methods are used to transform the posteriograms to a binary piano-roll representation.

Model       Architecture
DNN         L = 3, H = 125
RNN         L = 2, H = 200
ConvNet     ws = 7, Lc = 2, Lfc = 2, w1 = (5, 25), P1 = (1, 3),
            w2 = (3, 5), P2 = (1, 3), n1 = n2 = 50, h1 = 1000, h2 = 200
RNN-NADE    H_RNN = 200, H_NADE = 150

TABLE IV
MODEL CONFIGURATIONS FOR THE BEST PERFORMING ARCHITECTURES.

The various performance metrics (both frame and note based) are then computed by comparing the outputs of the systems to the ground truth.

We consider 3 kinds of post-processing methods. The simplest post-processing method is to apply a threshold to the output pitch probabilities obtained from the acoustic model. We select the threshold that maximises the F-measure over the entire training set and use this threshold for testing. Pitches with probabilities greater than the threshold are set to 1, while the remaining pitches are set to 0. The second post-processing method considered uses individual pitch HMMs for post-processing, similar to [13]. The HMM parameters (transition probabilities, pitch marginals) are obtained by counting the frequency of each event over the MIDI ground truth data. The binary pitch outputs are obtained using Viterbi decoding [38], where the scaled likelihoods are used as emission probabilities. Finally, we combine the acoustic model predictions with the RNN-NADE MLMs and obtain binary transcriptions using beam search.
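The simplest of these post-processing methods, global thresholding, is sketched below with synthetic data. The grid of candidate thresholds and the toy posteriogram are illustrative assumptions; in practice the threshold is selected on the training set and then applied unchanged to the test posteriograms.

```python
import numpy as np

def f_measure(pred, truth):
    TP = np.sum((pred == 1) & (truth == 1))
    FP = np.sum((pred == 1) & (truth == 0))
    FN = np.sum((pred == 0) & (truth == 1))
    return 2 * TP / (2 * TP + FP + FN)   # equivalent to 2PR/(P+R)

def pick_threshold(posteriogram, truth, candidates=np.linspace(0.05, 0.95, 19)):
    """Choose the global threshold that maximises frame-based F-measure
    on the training data."""
    scores = [f_measure((posteriogram >= c).astype(int), truth) for c in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=(1000, 88))
posteriogram = np.clip(truth * 0.7 + rng.uniform(0, 0.4, truth.shape), 0, 1)
thresh = pick_threshold(posteriogram, truth)
binary_roll = (posteriogram >= thresh).astype(int)    # binary piano roll
```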


Fig. 3. Effect of beam width (w) on F-measure using evaluation Configuration 1. k = 2, K = 4, f_h = y_t.

In Table I, we present F-scores (both frame and note based) for all the acoustic models and the 3 post-processing methods using Configuration 1. From the table, we note that all the neural network models outperform the PLCA and NMF models in terms of frame-based F-measure by 3%-9%. The DNN and RNN acoustic model performance is similar, while the ConvNet acoustic model clearly outperforms all the other models. The ConvNets yield an absolute improvement of ~5% over the other neural network models, while outperforming the spectrogram factorisation models by ~10% in frame-wise F-measure. For the note-based F-measure, the RNN and ConvNet models perform better than the DNN acoustic model. This is largely due to the fact that these models include context information in their inputs, which implicitly smooths the output predictions.

We compare the different post-processing methods for Configuration 1 by observing the rows of Table I. We note that the MLM leads to improved performance on both frame-based and note-based F-measure for all the acoustic models. The performance increase is larger on the note-based F-measure. The relative improvement in performance is largest for the DNN acoustic model, compared to the RNN and the ConvNet. This could be due to the fact that the independence assumption in Equation 11 is violated by the RNN and ConvNet, which include context information while making predictions. This leads to some factors being counted twice, and we observe a smaller performance improvement in this case. From Rows 1 and 2 of Table I we observe that the RNN-NADE MLM yields a performance increase for the PLCA and NMF acoustic models, though the relative improvement is smaller compared to the neural network acoustic models. This might be due to the fact that, unlike the neural network models, these models are not trained to maximise the conditional probability of output pitches given the acoustic inputs. Another contributing factor is the fact that the PLCA and NMF posteriograms represent the energy distribution over pitches rather than explicit pitch probabilities, which results in many activations being greater than 1. This discrepancy in the scale of the acoustic and language predictions leads to an unequal weighting of predictions when used in the hybrid RNN framework. In Table I we observe that the acoustic model in [8] outperforms all other acoustic models on the note-based F-measure, while its frame-based F-measure is significantly lower. This can be attributed to the use of an ERB filterbank input representation, which offers improved temporal resolution over the CQT for lower frequencies.

In Table II, we present additional metrics (precision, recall and accuracy) for all the acoustic models after decoding with an RNN MLM, using Configuration 1. We observe that the NMF and PLCA models have low frame-based precision and high recall, and the converse for the note-based precision and recall. For the neural network models, we observe smaller differences between both the frame-based and note-based precision and recall values. Amongst all the neural network models, we observe that the ConvNet outperforms all the other models on all the metrics.

In Table III, we present F-measures for experiments where the acoustic models are trained on synthesised data and tested on real data (Configuration 2). From the table we note that the frame-based F-measure for the DNN and RNN models is similar to that of the PLCA model and the model in [8]. The ConvNet outperforms all other models on the frame-based F-measure by ∼5%. On the note-based evaluations, both the RNN and DNN are outperformed by all the other models. The ConvNet performance is similar to the PLCA model, while the acoustic model from [8] again performs best on the note-based metrics.

We now discuss details of the inference algorithm. The hashed high-dimensional beam search algorithm has the following parameters: the beam width w, the branching factor K, the number of solutions stored per hash table entry k, and the similarity metric f_h (Algorithm 2). We observed that values K ≥ 4 produced good results; larger values of K do not yield a significant performance increase and result in much longer run times, so we set K = 4 for all experiments. Small values of k, 1 ≤ k ≤ 4, produced good results, while decoding accuracies deteriorate sharply for large values of k, as observed in [26]; we therefore set k = 2 for all experiments. We let the similarity metric be the last n emitted symbols, f_h(y_0^t) = y_{t-n}^t. We experimented with varying n and achieved good performance for small n, 1 ≤ n ≤ 5, with no improvement for larger n; for all experiments we therefore fix f_h(y_0^t) = y_t. Figure 3 plots the effect of beam width w on transcription performance; the results are decoding accuracies averaged over the 4 splits. We compare the hashed beam search with the high-dimensional beam search in [20]. From Figure 3 we observe that the hashed beam search achieves better performance with significantly smaller beam widths. For instance, the high-dimensional beam search takes 20 hours to decode the entire test set with w = 100, while the hashed beam search takes 22 minutes with w = 10 and achieves better decoding accuracy.
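To make the roles of w, k and f_h concrete, the following is a schematic sketch (ours, not the paper's Algorithm 2) of the hash-based pruning step: candidate solutions are bucketed by the last n emitted pitch vectors, each bucket keeps at most k solutions, and the w best survivors form the new beam. The branching factor K would control how many candidate pitch vectors are proposed per hypothesis before this pruning step.

```python
from collections import defaultdict

def hash_key(solution, n=1):
    """Similarity metric f_h: the last n emitted binary pitch vectors."""
    return tuple(map(tuple, solution[-n:]))

def prune_beam(candidates, w, k, n=1):
    """Keep at most k solutions per hash entry and at most w overall.

    candidates : list of (log_prob, solution) pairs, where solution is the
                 list of binary pitch vectors emitted so far.
    """
    buckets = defaultdict(list)
    for log_p, sol in sorted(candidates, key=lambda c: c[0], reverse=True):
        bucket = buckets[hash_key(sol, n)]
        if len(bucket) < k:               # at most k entries per hash key
            bucket.append((log_p, sol))
    survivors = [item for bucket in buckets.values() for item in bucket]
    survivors.sort(key=lambda c: c[0], reverse=True)
    return survivors[:w]                  # beam width w
```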


Fig. 4. (a) Pitch-activation (posteriogram) matrix for the first 30 seconds of track MAPS MUS-chpn op27 2 AkPnStgb produced by a ConvNet acoustic model. (b) Binary piano-roll transcription obtained from the posteriogram in (a) after post-processing with the RNN MLM and beam search. (c) Corresponding ground truth piano-roll representation.

Figure 4 is a graphical representation of the outputs of a ConvNet acoustic model. We observe that some of the longer notes are fragmented and that their offsets are estimated incorrectly. One reason for this is that the ground truth offsets do not necessarily correspond to the offsets in the acoustic signal (due to the effects of the sustain pedal), implying noisy offsets in the ground truth. We also observe that the model makes few harmonic errors in its predictions.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we present a hybrid RNN model for polyphonic AMT of piano music. The model comprises a neural network acoustic model and an RNN-based music language model. We propose using a ConvNet for acoustic modelling, which, to the best of the authors' knowledge, has not been attempted before for AMT. Our experiments on the MAPS dataset demonstrate that the neural network acoustic models, especially the ConvNet, outperform two popular acoustic models from the AMT literature. We also observe that the RNN MLMs consistently improve performance on all evaluation metrics. The proposed inference algorithm with the hashed beam search yields good decoding accuracies with significantly shorter run times, making the model suitable for real-time applications.

We now discuss some of the limitations of the proposed model. As discussed earlier, one of the main contributing factors to the success of deep neural networks has been the availability of very large datasets. However, datasets available for AMT research are considerably smaller than those available in speech, computer vision and natural language processing (NLP).


The applicability of deep neural networks to acoustic modelling is therefore limited to cases where large amounts of labelled data are available, which is uncommon in AMT (at least for non-piano music). Although the neural network acoustic models perform competitively, their performance could be further improved in several ways. Noise or deformations can be added to the training examples to encourage the classifiers to be invariant to commonly encountered input transformations. Additionally, the CQT input representation could be replaced by a representation with higher temporal resolution (such as the ERB filterbank or a variable-Q transform) to improve performance on the note-based metrics.
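As one simple instance of the augmentation suggested above (a sketch of our own; the paper does not report using augmentation), small Gaussian perturbations can be added to the log-magnitude CQT windows at training time:

```python
import numpy as np

def augment(cqt_windows, noise_std=0.05, rng=None):
    """Additive Gaussian noise on a batch of (log-magnitude) CQT windows.

    cqt_windows : (batch, context_frames, bins) array of acoustic inputs.
    noise_std   : perturbation scale; a tunable, dataset-dependent choice.
    """
    rng = rng or np.random.default_rng()
    return cqt_windows + rng.normal(0.0, noise_std, size=cqt_windows.shape)
```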

The abundance of musical score data and recent progress in NLP tasks with neural networks provide strong motivation for further investigation of MLMs for AMT. Although our results demonstrate some improvement in transcription performance with MLMs, several limitations and open questions remain. The MLMs are trained on binary vectors sampled from the MIDI ground truth and, depending on the sampling rate, most note events are repeated over many frames in this representation. The MLMs are trained to predict the next frame of notes given an input sequence of binary note combinations. When the same notes are repeated over many frames, the log-likelihood can be trivially maximised by repeating the previous input, which causes the MLM to perform a smoothing operation rather than imposing any kind of musical structure on the outputs. A potential solution would be to perform beat-aligned language modelling for the training and test data, rather than sampling the MIDI at some arbitrary rate. Additionally, RNNs could be extended to include duration models for each of their pitch outputs, similar to second-order HMMs; however, this is a challenging problem and currently remains unexplored. It would also be interesting to encourage RNNs to learn longer temporal note patterns by interfacing RNN controllers with external memory units [52], and to incorporate a notion of timing or metre in the input representation for the MLMs.
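To illustrate why the frame-sampled targets reward smoothing, consider how note events are rasterised into binary frames (a sketch with hypothetical parameter values): at a 32 ms hop, a one-second note becomes roughly 31 identical consecutive target frames, so a model that simply repeats its previous input already attains a high log-likelihood.

```python
import numpy as np

def notes_to_pianoroll(notes, hop_seconds, duration, n_pitches=88, lowest_midi=21):
    """Rasterise (onset, offset, midi_pitch) note events into binary frames.

    notes : iterable of (onset_s, offset_s, midi_pitch) tuples.
    Returns a (T, n_pitches) binary matrix sampled every hop_seconds.
    """
    n_frames = int(np.ceil(duration / hop_seconds))
    roll = np.zeros((n_frames, n_pitches), dtype=np.int8)
    for onset, offset, pitch in notes:
        start = int(round(onset / hop_seconds))
        end = int(round(offset / hop_seconds))
        roll[start:end, pitch - lowest_midi] = 1
    return roll

# A one-second note at a 32 ms hop occupies ~31 identical consecutive frames:
# notes_to_pianoroll([(0.0, 1.0, 60)], 0.032, 2.0)[:, 60 - 21].sum()  -> 31
```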

The effect of tonality on the performance of the MLMs should be investigated further. The MLMs should ideally be invariant to transpositions of a musical piece to different keys. The MIDI ground truth can easily be transposed to any tonality; MLMs can then be trained on transposed inputs, or individual MLMs can be trained for each key. Additionally, the fully connected input layer of the RNN MLM can be substituted with a convolutional layer, with convolutions along the pitch axis, to encourage the network to be invariant to pitch transpositions.
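A minimal sketch (ours, not an implemented model from the paper) of the proposed pitch-axis convolution at the MLM input: a small set of shared filters is slid along the 88-pitch axis, so transposing the input mostly shifts the resulting feature map rather than changing it.

```python
import numpy as np

def pitch_conv(frames, filters):
    """Correlate binary note frames with shared filters along the pitch axis.

    frames  : (T, 88) binary pitch vectors fed to the MLM.
    filters : (n_filters, width) shared weights, slid over the pitch axis.
    Returns : (T, n_filters, 88 - width + 1) feature maps; a transposition
              of the input shifts the feature map along the last axis.
    """
    T, P = frames.shape
    n_f, width = filters.shape
    out = np.empty((T, n_f, P - width + 1))
    for j in range(P - width + 1):
        patch = frames[:, j:j + width]        # (T, width)
        out[:, :, j] = patch @ filters.T      # (T, n_filters)
    return out
```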

Another limitation of the proposed hybrid model is that the conditional probability in Equation 11 is derived by assuming that the predictions at time t are a function only of the input at t, and independent of all other inputs and outputs. The violation of this assumption leads to certain factors being counted twice and therefore reduces the impact of the MLMs. The results clearly show that the improvements due to the MLM are largest when the acoustic model is frame-based, and comparatively smaller when its predictions are combined with those of an RNN or ConvNet acoustic model. This is problematic, since the ConvNet acoustic models yield the best performance.
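For reference, assuming Equation 11 takes the usual hybrid form (our reconstruction; the paper's notation may differ), the combination and the double-counting issue can be written as:

```latex
% Assumed form of the hybrid combination (our reconstruction of Eq. 11):
% predictions at time t depend on the acoustics only through x_t.
P(y_t \mid y_0^{t-1}, x_0^{t}) \;\propto\;
  \underbrace{P(y_t \mid x_t)}_{\text{acoustic model}}\;
  \underbrace{P(y_t \mid y_0^{t-1})}_{\text{music language model}}\; / \; P(y_t)
% If the acoustic model itself conditions on neighbouring frames (RNN state,
% ConvNet context windows), temporal structure enters both factors and is
% effectively counted twice, which dilutes the contribution of the MLM.
```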

REFERENCES

[1] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. Springer Science & Business Media, 2007.

[2] T. Berg-Kirkpatrick, J. Andreas, and D. Klein, "Unsupervised transcription of piano music," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 1538–1546.

[3] E. Benetos and S. Dixon, "A shift-invariant latent variable model for automatic music transcription," Computer Music Journal, vol. 36, no. 4, pp. 81–94, 2012.

[4] A. P. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 804–816, 2003.

[5] V. Emiya, R. Badeau, and B. David, "Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches," in 16th European Signal Processing Conference. IEEE, 2008, pp. 1–5.

[6] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 2003, pp. 177–180.

[7] S. A. Abdallah and M. D. Plumbley, "Polyphonic music transcription by non-negative sparse coding of power spectra," in Proceedings of the 5th International Society for Music Information Retrieval Conference (ISMIR), 2004, pp. 318–325.

[8] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.

[9] N. Bertin, R. Badeau, and E. Vincent, "Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 538–549, 2010.

[10] P. Smaragdis, B. Raj, and M. Shashanka, "A probabilistic latent variable model for acoustic modeling," Advances in Models for Acoustic Processing, NIPS, vol. 148, 2006.

[11] G. C. Grindlay and D. P. Ellis, "A probabilistic subspace model for multi-instrument polyphonic transcription," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), 2010, pp. 21–26.

[12] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[13] G. E. Poliner and D. P. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP Journal on Applied Signal Processing, vol. 2007, no. 1, pp. 154–154, 2007.

[14] J. Nam, J. Ngiam, H. Lee, and M. Slaney, "A classification-based polyphonic piano transcription approach using learned feature representations," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 175–180.

[15] S. Bock and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 121–124.

[16] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993.

[17] S. Raczynski, E. Vincent, and S. Sagayama, "Dynamic Bayesian networks for symbolic polyphonic pitch modeling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1830–1840, 2013.

[18] S. Sigtia, E. Benetos, S. Cherla, T. Weyde, A. S. d'Avila Garcez, and S. Dixon, "An RNN-based music language model for improving automatic music transcription," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.

[19] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "High-dimensional sequence transduction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 3178–3182.

[20] S. Sigtia, E. Benetos, N. Boulanger-Lewandowski, T. Weyde, A. S. d'Avila Garcez, and S. Dixon, "A hybrid recurrent neural network for music transcription," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 2061–2065.


[21] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4277–4280.

[22] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in INTERSPEECH, 2013, pp. 3366–3370.

[23] J. Schluter and S. Bock, "Improved musical onset detection with convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014, pp. 6979–6983.

[24] E. J. Humphrey and J. P. Bello, "Rethinking automatic chord recognition with convolutional neural networks," in 11th International Conference on Machine Learning and Applications (ICMLA), vol. 2. IEEE, 2012, pp. 357–362.

[25] E. Vincent and X. Rodet, "Music transcription with ISA and HMM," in Independent Component Analysis and Blind Signal Separation. Springer, 2004, pp. 1197–1204.

[26] S. Sigtia, N. Boulanger-Lewandowski, and S. Dixon, "Audio chord recognition with a hybrid neural network," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), 2015.

[27] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, and Q. V. Le, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1223–1231.

[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, p. 3, 1988.

[29] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller, "Efficient backprop," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 9–48.

[30] E. J. Humphrey, J. P. Bello, and Y. LeCun, "Feature learning and deep architectures: new directions for music informatics," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 461–481, 2013.

[31] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in Proceedings of the 2002 12th IEEE Workshop on Neural Networks for Signal Processing. IEEE, 2002, pp. 747–756.

[32] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.

[33] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Audio chord recognition with recurrent neural networks," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), 2013, pp. 335–340.

[34] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kegl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473–484, 2006.

[35] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 1159–1166.

[36] H. Larochelle and I. Murray, "The neural autoregressive distribution estimator," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 29–37.

[37] R. M. Neal, "Connectionist learning of belief networks," Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992.

[38] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[39] A. Graves, "Sequence transduction with recurrent neural networks," in Representation Learning Workshop, ICML, 2012.

[40] N. Boulanger-Lewandowski, J. Droppo, M. Seltzer, and D. Yu, "Phone sequence modeling with recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5417–5421.

[41] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1171–1179.

[42] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, Cambridge, 2001, vol. 2.

[43] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.

[44] K. O'Hanlon and M. D. Plumbley, "Polyphonic piano transcription using non-negative matrix factorisation with group sparsity," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3112–3116.

[45] M. Bay, A. F. Ehmann, and J. S. Downie, "Evaluation of multiple-F0 estimation and tracking systems," in Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 315–320.

[46] J. C. Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.

[47] E. Benetos, "Automatic transcription of polyphonic music exploiting temporal evolution," Ph.D. dissertation, Queen Mary University of London, Dec. 2012.

[48] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.

[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[50] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.

[51] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, "Advances in optimizing recurrent networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8624–8628.

[52] E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom, "Learning to transduce with unbounded memory," arXiv preprint arXiv:1506.02516, 2015.