Neural Models for Personalized Jazz Improvisations Nadav Bhonker Technion - Computer Science Department - M.Sc. Thesis MSC-2020-15 - 2020


  • Neural Models for Personalized Jazz Improvisations

    Research Thesis

    Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

    Nadav Bhonker

    Submitted to the Senate of the Technion — Israel Institute of Technology
    Cheshvan 5780, Haifa, November 2019


  • This research was carried out under the supervision of Prof. Ran El-Yaniv, in the Faculty of Computer Science.

    Acknowledgements

    I would like to thank my advisor, Ran, for always having an open door at his office, for spending countless hours of discussion and for making the path enjoyable. I would also like to thank my wife, Maayan, for her constant and invaluable support throughout my research.

    The generous financial help of the Technion is gratefully acknowledged.


  • Contents

    List of Figures

    List of Tables

    Abstract

    Abbreviations and Notations

    1 Introduction

    2 Related Work
    2.1 Deep Symbolic Jazz Improvisation
    2.2 Note representation
    2.3 Continuous Response Digital Interface (CRDI)

    3 Problem Statement

    4 Learning to Generate Personalized Jazz Improvisations with Neural Models
    4.1 Language Modeling
    4.2 User Preference Metric Learning
    4.3 Optimized Music Generation via Planning

    5 Implementation
    5.1 Network Architecture
    5.2 User Preference Elicitation
    5.2.1 Musical Preference Labeling System
    5.3 User Preference Metric Learning

    6 Experiments
    6.1 Harmony Coherency
    6.2 Analyzing Personalized Models
    6.3 Harmonic Guided Generation
    6.4 Plagiarism Analysis

    7 Conclusion and Future Directions
    7.1 Implications for Human Music Generation
    7.2 Future Directions

  • List of Figures

    1.1 Generated Musical Excerpt
    5.1 Note Representation
    5.2 Network Architecture
    5.3 Digital CRDI
    6.1 User Score vs. Beam Width
    6.2 Prediction Histogram and Risk-Coverage Curve
    6.3 Plagiarism by n-gram

  • List of Tables

    6.1 Harmony Coherency
    6.2 Cross-Artist Plagiarism

  • Abstract

    Learning to generate music is an ongoing AI challenge. A more difficult challenge is the generation of musical pieces that match human-specific preferences. In this work we focus on personalized, symbol-based, monophonic generation of harmony-constrained jazz improvisations. To tackle this objective, we introduce a pipeline consisting of the following steps: supervised learning using a corpus of solos (a language model), high-resolution user preference metric learning, and optimized generation using planning (beam search). Our corpus consists of hundreds of original jazz solos performed by saxophone giants such as Charlie Parker, Stan Getz, Sonny Stitt and Dexter Gordon. We present an extensive empirical study in which we apply this pipeline to extract individual models as implicitly defined by several human listeners. Our approach enables an objective examination of subjective personalized models whose performance is quantifiable. The results indicate that it is possible to model and optimize personalized jazz preferences. In addition, our subjective impression is that the resulting computer-generated solos are often musically appealing. We also perform a plagiarism analysis to ensure that the generated solos are genuine rather than a concatenation of phrases previously seen in the corpus. Numerous MP3 demonstrations of solos generated by our system are provided in the supplementary material.


  • Abbreviations and Notations

    x_t = (s_t, c_t) : an input consisting of a note and its corresponding context
    X_{1:τ} : a sequence of inputs
    D = {X^i_{1:τ_i}}_{i=1}^M : a training dataset
    f_θ(X_{1:t−1}, c_t) = Pr(x_t | X_{1:t−1}, c_t) : a context-dependent language model
    Y_{1:τ} = y_1 y_2 ··· y_τ : a sequence of labels associated with a sequence of inputs
    g_ϕ(X^i_{1:t}, c^i_{1:t}) = ŷ^i : a user-preference model
    ψ : a function used to sample from f_θ
    LSTM : Long Short-Term Memory network
    CRDI : Continuous Response Digital Interface

  • Chapter 1

    Introduction

    Ever since the invention of the computer, researchers and artists have been interested in utilizing computers for producing different forms of art, and notably for composing music [24]. The explosive growth of deep learning models over the past several years has expanded the possibilities for musical generation, leading to a line of work that pushed forward the state-of-the-art [31, 29, 21, 5]. Another recent trend is the development and offering of consumer services such as Spotify and Pandora, aiming to provide personalized streams of existing music content. Perhaps the crowning achievement of such personalized services would be for the content itself to be generated explicitly to match each individual user's taste. In this work we focus on the task of generating user-personalized, monophonic, symbolic jazz improvisations. To the best of our knowledge, this is the first work that aims at generating personalized jazz solos using deep learning techniques.

    Even though it has been nearly two decades since the first work on generating (symbolic) blues by Eck and Schmidhuber [12], research in this domain has been rather scarce. The common approach for generating music with neural networks is mostly the same as language modeling. Given a context of existing symbols (e.g., characters, words, music notes), the network is trained to predict the next symbol. Thus, once the network learns the distribution of sequences from the training set, it can generate novel sequences by sampling from the network output and feeding the result back into itself. The products of such models are sometimes evaluated using user studies (crowdsourcing). Such studies assess the quality of generated music by asking users their opinion and computing the mean opinion score (MOS). While such methods may measure the overall quality of the generated music, they tend to average out the personal preferences of the evaluators. Another, more quantitative but rigid approach for the evaluation of generated music is to compute a metric based on music theory principles. For example, Hild et al. [23] considered Bach-style harmonized chorale generation, and the generated pieces were evaluated by a senior music professor using the same criteria used to evaluate undergraduate students. While such metrics can, in principle, be defined for classical music, they are less suitable for jazz improvisation, which does not
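The generation-by-prediction loop described above can be sketched as follows. This is a minimal illustration, not the thesis code: `toy_model` is a hypothetical stand-in for a trained network, returning a distribution over next symbols.

```python
import random

def generate(model, seed, length, rng=random.Random(0)):
    """Sample the model's next-symbol distribution repeatedly,
    feeding each sampled symbol back in as new context."""
    seq = list(seed)
    for _ in range(length):
        dist = model(seq)                        # {symbol: probability}
        symbols, probs = zip(*dist.items())
        seq.append(rng.choices(symbols, weights=probs, k=1)[0])
    return seq

# Hypothetical stand-in for a trained network: favors repeating
# the last symbol, mimicking a model with local coherence.
def toy_model(seq):
    return {s: (0.7 if s == seq[-1] else 0.1) for s in ("C", "E", "G", "B")}

solo = generate(toy_model, seed=["C"], length=8)
```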


  • [Musical notation omitted: the excerpt's notated measures appear over a chord progression including Cm7, F7, B♭maj7, B♭6, B♭7, Fm7, E♭9, A♭9, Am7♭5 and D7.]

    Figure 1.1: A short excerpt generated by the language model.

    adhere to such strict rules. To generate personalized jazz improvisations, we propose a framework consisting of the following elements: (a) language model learning; (b) user preference learning; and (c) optimized music generation via planning. As many jazz teachers would recommend, the key to attaining great improvisation skills is to study and emulate great musicians. Following this advice, we train a harmony-conditioned language model that composes entire solos. To this end, we use a training dataset of hundreds of professionally transcribed jazz improvisations performed by saxophone giants such as Charlie Parker, Phil Woods and Cannonball Adderley (see details in Section 5 and in the supplementary material). In this dataset, each solo is a monophonic note sequence given in symbolic form (MusicXML) accompanied by a synchronized harmony sequence. The language model training stage consists of training a long short-term memory (LSTM) network [27] or transformer network [49] whose input consists of both chord context and note history. The resulting model after this stage alone is capable of generating high-fidelity improvisation phrases (this is a subjective impression of the authors). Figure 1.1 presents a short excerpt generated from this language model.

    Considering that different people have different musical tastes, our goal in this thesis is to go beyond straightforward generation by this model and optimize the generation toward personalized preferences. For this purpose, we determine a user's preference by measuring the level of his or her satisfaction throughout the solos using a continuous response interface. This is accomplished by playing, for the user, computer-generated solos (from the language model) and recording his or her good/bad feedback in real time throughout each solo. Once we have gathered sufficient data about the user's preference, consisting of two aligned sequences (for the solos and feedback), we train a recurrent regression model using this user's preferences. A key feature of our technique is that the resulting model can be objectively evaluated using hold-out user preference sequences (along with their corresponding solos). A big hurdle in accomplishing this step is that the signal elicited from the user is inevitably extremely noisy. To reduce this noise, we apply selective prediction techniques [16, 17] to distill cleaner predictions from the user's preference model. Thus, we allow this model to abstain whenever it is not sufficiently confident. The fact that it is possible to extract from humans a continuous response preference signal on music phrases and use it to train (and test) a model with non-trivial predictive capabilities is interesting in itself (and new, to the best of our knowledge).

    Equipped with a personalized user preference metric (via the trained model), in the last stage we employ a variant of beam search [39] to generate optimized jazz solos from


  • the initial language model. We numerically demonstrate that the resulting generated pieces indeed substantially increase performance in terms of the learned personalized metric. For each user, we individually apply the last three stages of this process, where the preference elicitation stage takes several hours of tagging per user. We applied the proposed pipeline on four users, all of whom are amateur jazz musicians. We present numerical analyses of the results showing that a personalized metric can be trained and then used to optimize solo generation.

    To summarize, our contributions include: (1) a useful monophonic neural model for general jazz improvisation within any desired harmonic context; (2) a viable methodology for eliciting and learning high-resolution human preferences for music; and (3) personalized optimization of jazz solo generation. Alongside the thesis, we provide numerous MP3 files of automatically generated solos of well-known jazz standards from and outside of our training corpus (see the supplementary material).


  • Chapter 2

    Related Work

    Algorithmic musical composition is the process of composing music automatically by a computer. This task is almost as old as the computer itself, dating back to the 1950s [24]. Many methods have been published over the years using many different approaches, some of which we review here. Most methods compute statistics of a corpus, use music theory knowledge, or both.

    Fernández and Vico [13] divide algorithmic musical composition into six different categories:

    • Grammars – Grammar methods define rules to expand high-level symbols into more detailed sequences of symbols, representing a formal language. Grammars are defined either manually based on music theory [33, 42] or by learning from a corpus [9, 18]. This approach suffers from the difficulty of generating a good set of rules for producing meaningful compositions.

    • Knowledge-Based Systems – These methods are rule-based systems that represent musical knowledge as structured symbols. Hiller Jr. and Isaacson [24] apply the classical rules for counterpoint. Löthe [35] uses rules for composing minuets that were extracted from a classical textbook. Similar to grammar methods, a limitation of these methods is that they require knowledge of the different musical domains.

    • Markov Chains – In these methods, the musical melody or harmony is considered a stochastic process. The modeling of an artist or a genre is then obtained by training a note-to-note predictor. To generate a melody at time t, these methods select the note with the highest likelihood given the note at time t−1. Methods utilizing such techniques date back to the 1950s [24]. Usually, such methods are extended to n-order Markov chains, considering the last n notes. The main shortcoming of these methods is a trade-off governed by the model's order: low-order models tend to capture local information such as nuances and style but lack the ability to capture structure, while high-order Markov chains tend to overfit, thus plagiarizing their training data.


    • Artificial Neural Networks (ANNs) – These methods include training an ANN on a large corpus as a next-note predictor, similarly to Markov chains. ANNs were first applied to this task in 1989 in the works of Todd [46] and Duff [11]. Another use of ANNs is as a fitness function for evolutionary methods. Our work falls into this category, on which we elaborate in Section 2.1.

    • Evolutionary methods – Inspired by theories of natural selection, these methods produce new composing algorithms by repeating the process of first generating many compositions and then selecting and reproducing them with variations in order to maximize a fitness function. The fitness function is usually derived from music theory or trained using an ANN. Of all the above methods, these were the most popular for music generation in the 1990s and 2000s.

    • Self-Similarity and Cellular Automata – These methods consist of the generation of an audio signal from a source of 1/f (or pink) noise. According to Voss and Clarke [51], the generated audio was reported to be musically pleasant. Such methods are mainly used for generating raw material for composers rather than as a form of music on its own.

    While the above categorization is convenient, in practice most algorithms consist of several elements taken from multiple categories.
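As an illustration of the Markov-chain category above, a minimal n-order note-to-note model can be sketched as follows; the toy corpus and all names are illustrative stand-ins for real transcriptions.

```python
import random
from collections import Counter, defaultdict

def train_markov(melodies, order=2):
    """Count note transitions conditioned on the previous `order` notes."""
    counts = defaultdict(Counter)
    for melody in melodies:
        for i in range(order, len(melody)):
            counts[tuple(melody[i - order:i])][melody[i]] += 1
    return counts

def generate_melody(counts, seed, length, rng=random.Random(0)):
    """Extend the seed by sampling next notes from the learned counts."""
    seq = list(seed)
    order = len(seed)
    for _ in range(length):
        options = counts.get(tuple(seq[-order:]))
        if not options:              # unseen context: stop early
            break
        notes, weights = zip(*options.items())
        seq.append(rng.choices(notes, weights=weights, k=1)[0])
    return seq

# Toy corpus standing in for real solo transcriptions.
corpus = [["C", "D", "E", "C", "D", "G", "C", "D", "E"]]
model = train_markov(corpus, order=2)
line = generate_melody(model, seed=["C", "D"], length=5)
```

A higher `order` makes the context tuples longer and the counts sparser, which is exactly the overfitting/plagiarism trade-off described in the bullet above.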

    The musical tasks that the algorithms presented above try to solve mostly fall into four classes:

    • Producing melodies – generation of a sequential set of musical notes following some style.

    • Harmonization – producing an appropriate harmony (chord accompaniment) for a given melody.

    • Jazz improvisation – producing melodies that adhere to a given harmony or othergiven musical context; such systems are often interactive.

    • Structural music – producing music following a strict set of rules that define a genre.

    2.1 Deep Symbolic Jazz Improvisation

    Here we confine the discussion to closely related works that mainly concern jazz improvisation using deep learning techniques over symbolic data. In this narrower context, most works follow a “generation by prediction” paradigm, whereby a model trained to predict the next symbol is used to greedily generate sequences. Eck and Schmidhuber [12] presented the first work on blues improvisation by straightforwardly applying LSTMs on a small training set. While the demonstration results they presented may


  • seem limited at a distance of nearly two decades,1 they were the first to demonstrate long-term structure captured by neural networks. One approach to improving naïve greedy generation from a language model is to use a mixture of experts. For example, Franklin [14] trained an ensemble of neural networks, one specialized for each melody, and then selected from among them at generation time using reinforcement learning (RL) with a handcrafted reward function. Johnson et al. [32] generated improvisations by training a network consisting of two experts, each focusing on a different note representation. The first representation specifies a note as an interval from its preceding note, and the second, as an interval to the root of the chord currently being played. The experts were combined using the product of experts technique [26].2

    Other, more remotely related non-jazz works have attempted to produce context-dependent melodies [31, 5, 53, 20, 37, 19]. Finally, following the introduction of the MAESTRO dataset [22], two works [29, 22] demonstrated that by using massive amounts of data, it is possible to generate melodies exhibiting long-term structure and human-like expressivity.

    2.2 Note representation

    There are several note representation methods. In piano roll, notes are encoded as a sequence of equally spaced time slices corresponding to musical events. The durations of all time slices are usually equal to the shortest note required. This representation has been used in several recent works such as [31, 5, 53]. One advantage of this method is its ability to conveniently represent polyphonic music (where several notes are encoded together in the same event). The encodings, however, have a high level of redundancy. An alternative, less common approach is to use note steps, a sequence of pitch-duration pairs [14, 38, 52, 54]. This encoding is more compact than the piano roll; however, it can only encode monophonic lines. For example, a note with a duration of four quarters is represented by one symbol in the note step representation, whereas in piano roll it may be represented using 16 symbols (if the minimal duration is a 16th note). There are other methods, such as textual representation, which are not widely used [45].
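The size difference between the two encodings can be illustrated with a short sketch. This is a simplified piano roll that records only the sounding pitch per slice (a real piano roll also marks note-on events, so held and repeated notes are distinguishable):

```python
def to_piano_roll(note_steps, slice_dur=1 / 16):
    """Expand (pitch, duration) note steps into equally spaced time
    slices at a 16th-note resolution. Simplified: no note-on marking,
    so a held note and repeated notes look identical."""
    roll = []
    for pitch, dur in note_steps:
        roll.extend([pitch] * round(dur / slice_dur))
    return roll

# A whole note followed by two sixteenth notes: 3 note-step symbols.
steps = [(60, 1.0), (62, 1 / 16), (64, 1 / 16)]
roll = to_piano_roll(steps)
# The same content occupies 16 + 1 + 1 = 18 piano-roll slices.
```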

    2.3 Continuous Response Digital Interface (CRDI)

    A common method for collecting continuous measurements from human subjects listening to music is the continuous response digital interface (CRDI), first reported by Robinson [43]. CRDI was invented in the late 1980s and has become the standard for measuring human perception of music. CRDI has been successful in measuring a variety of signals from humans, such as emotional response [44], tone quality and intonation [36], beauty in a vocal performance [25], preference for music of other cultures [4]

    1Listen to their generated pieces at www.iro.umontreal.ca/~eckdoug/blues/index.html.
    2Listen to the generated solos at www.cs.hmc.edu/~keller/jazz/improvisor/iccc2017/


  • and appreciation of the aesthetics of jazz music [7]. Using CRDI, listeners are required to rate different elements of the music by adjusting a dial (which looks similar to a volume control dial present in amplifiers).


  • Chapter 3

    Problem Statement

    In this chapter we state the problem in mathematical terms. We denote X_{1:τ} = x_1 x_2 ··· x_τ ∈ 𝒳^τ as a sequence of inputs, x_t = (s_t, c_t), consisting of a note s_t ∈ 𝒮 and its context c_t ∈ 𝒞. Each note, s_t, in turn, consists of a pitch and a duration at index t, and 𝒮 represents a predefined set of pitch-duration combinations (i.e., notes). The context, c_t, represents the chord that is played with note s_t, where 𝒞 is the set of all possible chords. The context may contain additional information such as the offset of the note within a measure (see details in Section 5). Let D = {X^i_{1:τ_i}}_{i=1}^M denote a training dataset consisting of M solos of arbitrary lengths. In our work, these are the aforementioned jazz improvisations. We define a context-dependent language model f_θ(X_{1:t−1}, c_t) = Pr(x_t | X_{1:t−1}, c_t) as the estimator of the probability of a symbol given the previous symbols and their contexts, where θ are the parameters of the model. Note that we provide the context of the element to be predicted. This is similar to a human jazz improviser who is informed of the chord over which his next note will be played. For any solo, X_{1:τ}, we also consider an associated sequence of annotations Y_{1:τ} = y_1 y_2 ··· y_τ, with the interpretation that y_t represents the quality of the prefix X_{1:t} by some metric. In our case, y_t may be a score measuring compliance with the harmony or a measure of preference as indicated by a user. We use D̃ to denote a new dataset of solos and their corresponding annotations. Given the dataset, D̃, we estimate g_ϕ(X^i_{1:t}, c^i_{1:t}) = ŷ^i, where g_ϕ is the user-preference model and ϕ are the learned parameters. We denote by ψ a function that is used to sample from f_θ. In our case, this will be a beam search variant.

    The objective in this work is to learn a model, f_θ, capable of composing original jazz solos; learn a model, g_ϕ, that captures the musical preference of a specific user; and use a function ψ to sample new solos from f_θ that satisfy the user's preference as estimated by g_ϕ.


  • Chapter 4

    Learning to Generate Personalized Jazz Improvisations with Neural Models

    In this chapter we describe the three steps of our pipeline.

    4.1 Language Modeling

    In the language learning step, we use supervised learning to train a language model from a given corpus of transcribed jazz solos. Specifically, we train a context-dependent function f_θ(X_{1:t}, c_{t+1}) = Pr(x_{t+1} | X_{1:t}, c_{t+1}). The language model can be implemented using many architectures, such as recurrent neural networks (RNNs), convolutional networks (CNNs) [47, 53, 5] or attention-based models [2]. In this work we implemented two different models: an LSTM network [27] and an attention network based on Transformer-XL [10] (see details in Section 5).

    4.2 User Preference Metric Learning

    In the preference elicitation stage, a user is exposed to solos sampled from f_θ as well as solos from the dataset, D. We use the standard CRDI to collect the user's preferences regarding the solos played. We again use supervised learning to train a metric function g_ϕ. This function should predict user preference scores for any solo, given its harmonic context. To elevate the accuracy of g_ϕ, we utilize selective prediction, whereby we ignore predictions whose confidence is too low. We use the prediction magnitude as a proxy for confidence. Given two confidence threshold parameters, β_low and β_high, we set g′_{ϕ,β_low,β_high}(X^i_{1:t}) to be g_ϕ(X^i_{1:t}) if g_ϕ(X^i_{1:t}) > β_high or g_ϕ(X^i_{1:t}) < β_low, and 0 otherwise. See more details in Section 6.2.


  • 4.3 Optimized Music Generation via Planning

    In our experiments we noticed that using vanilla beam search results in undesired behavior, such as generating the same two notes on repeat. This behavior was also observed in text generation [34, 50, 28].

    To overcome this issue, when optimizing the generation from f_θ, we apply a variant of beam search, ψ, whose objective scores are obtained from the non-rejected predictions of g_ϕ. Pseudocode of the ψ procedure is presented in Algorithm 1. We denote by X = [X^1_{1:t}, X^2_{1:t}, ..., X^b_{1:t}] a running batch (beam) of width b containing the most promising candidate sequences found so far by the algorithm. The sequences are all initialized with the starting sequence; in our case, this is the melody of the jazz standard. At every time step, t, we produce a probability distribution over the next note of every sequence in X by passing the b sequences through the network f_θ(X^i_{1:t}, c^i_{t+1}), conditioned on the previous elements and the context c^i_{t+1}. As opposed to typical applications of beam search, rather than choosing the notes with the highest probability from Pr(x_{t+1} | X^i_{1:t}, c^i_{t+1}), we instead independently and randomly sample them. We then calculate the scores of the extended candidates using the preference metric, g_ϕ. Thereafter, we choose the k sequences with the highest scores and discard the rest. Finally, we duplicate the leading k sequences b/k times to maintain a full beam of b sequences. We extend the standard beam algorithm to include a horizon parameter, h, so as to allow for longer-term predictions when extending candidate sequences in the beam. Thus, when h > 1, rather than sampling a single note x^i_{t+1} for each candidate i, we sample h possible suffixes X^i_{(t+1):(t+h+1)}. The use of larger horizons may lead to sub-optimal optimization but increases variability.

    In terms of time complexity, at every time step we forward the b sequences of length t through the two networks f_θ and g_ϕ. A naïve implementation would amount to O(b·t²) time complexity and O(b·t) space complexity. Using simple bookkeeping, passing the entire sequences through the networks can be avoided, thus reducing the time complexity to O(b·t) as well. We note that in modern frameworks it is natural to implement the multiple forward passes through the network efficiently, in parallel, as one would forward a batch.


  • Algorithm 1: Score-based beam search

    Input: f_θ, g_ϕ, b, k, X_{1:τ}
    Output: X_{1:(τ+T)}
    X = [X^1_{1:τ}, X^2_{1:τ}, ..., X^b_{1:τ}]
    for t = τ ... τ+T do
        for i = 1 ... b do
            Pr(x_{t+1} | X^i_{1:t}, c^i_{t+1}) = f_θ(X^i_{1:t}, c^i_{t+1})
            x^i_{t+1} ∼ Pr(x_{t+1} | X^i_{1:t}, c^i_{t+1})
            score(X^i_{1:t} x^i_{t+1}) = g_ϕ(X^i_{1:t} x^i_{t+1})
        end
        X_top = k-argmax over the extended candidates of score(X^i_{1:t} x^i_{t+1})
        X = [X^1_top, ..., X^k_top, X^1_top, ..., X^k_top, ...]   (each duplicated b/k times)
    end
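A minimal executable sketch of the sample-then-select loop, without the horizon extension; `sample_next` and `score` are toy stand-ins for f_θ and g_ϕ, not the thesis networks:

```python
import random

def score_beam_search(sample_next, score, seed, b, k, steps,
                      rng=random.Random(0)):
    """Extend each of b candidates with a sampled next symbol, keep the
    k highest-scoring extensions, and duplicate them to refill the beam."""
    beam = [list(seed) for _ in range(b)]
    for _ in range(steps):
        extended = [seq + [sample_next(seq, rng)] for seq in beam]
        extended.sort(key=score, reverse=True)
        top = extended[:k]
        beam = [list(top[i % k]) for i in range(b)]   # duplicate b/k times
    return max(beam, key=score)

# Toy stand-ins: sample_next plays the role of f_theta, score of g_phi.
def sample_next(seq, rng):
    return seq[-1] + rng.choice([-2, -1, 1, 2])       # step to a nearby pitch

def score(seq):
    return seq[-1] - seq[0]                           # prefer rising lines

best = score_beam_search(sample_next, score, seed=[60], b=8, k=2, steps=16)
```

Because the next symbols are sampled rather than taken by argmax, the beam explores varied continuations while the score function steers the selection, which is the point of the variant described above.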


  • Chapter 5

    Implementation

    In this chapter we provide implementation details of our personalized generation pipeline. Our corpus consists of 284 professionally transcribed solos of (mostly) bebop saxophone players of the mid-20th century: Charlie Parker, Sonny Stitt, Cannonball Adderley, Dexter Gordon, Sonny Rollins, Stan Getz, Phil Woods and Gene Ammons. We considered only solos that are in 4/4 meter and include chords in their transcription. The solos are provided in MusicXML format. As opposed to MIDI, this format allows the inclusion of chord symbols.1

    Note representation

    We represent notes using a representation that resembles notes on a music sheet. An illustration of our representation is presented in Figure 5.1.

    Pitch. The pitch is encoded as a one-hot vector ranging from 0 to 128. Indices 0-127 match the pitch range of the MIDI standard.2 Index 128 corresponds to the rest symbol.

    Duration. The duration of each note is encoded using a one-hot vector. The set of possible durations is built according to those present in the dataset, where durations smaller than 1/24 are removed.

    Offset. The offset of the note within the measure is quantized to 48 “ticks” per (four-beat) measure. This corresponds to a duration of 1/12 of a beat. This is similar to the learnt positional encoding used in translation [15].

    Chord. The chord is represented by a concatenation of the four pitches that comprise it. These are usually the 1st, 3rd, 5th, and 7th degrees of the root of the chord. This representation explicitly assumes that chords are played in their 7th form, as is common in jazz music. This chord representation allows the flexibility of representing rare chords

    1The solos were purchased from SaxSolos.com [1]; we are thus unable to publish them. However, in the supplementary material we provide a complete list of the solos used for training, which are, of course, available from the above vendor.

    2The notes appearing in the corpus all belong to a much smaller range; however, the MIDI range standard was maintained for simplicity.


  • Pitch:     62  128   74   71   72
    Duration:   4    2    4    2    4
    Offset:     0   12   18   30   36
    Chord:     67   67   67   67   67
               71   71   71   71   71
               74   74   74   74   74
               78   78   78   78   78

    Figure 5.1: An example of a measure in music notation and its vector representation. Integers are converted to a one-hot representation.

    such as sixth, diminished and augmented chords.3
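Under the representation above, a single input vector can be sketched as follows. Sizes follow the text (129 pitch indices including the rest symbol, 48 offset ticks); the duration vocabulary is dataset-dependent and omitted here, and the four chord pitches are left as raw MIDI numbers for brevity (the actual model embeds them instead):

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

REST = 128  # index 128 is the rest symbol; 0-127 are MIDI pitches

def encode_note(pitch, offset_tick, chord_pitches):
    """Concatenate a one-hot pitch (129), a one-hot measure offset
    (48 ticks) and the chord's four pitches."""
    return one_hot(pitch, 129) + one_hot(offset_tick, 48) + list(chord_pitches)

# Middle C on the downbeat over a Cmaj7 chord (C, E, G, B).
vec = encode_note(60, 0, (60, 64, 67, 71))
# len(vec) == 129 + 48 + 4 == 181
```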

    5.1 Network Architecture

    The architecture of the network used to estimate f_θ is as follows (see the illustration in Figure 5.2). The network's input is the note x_t and context c_t, which consists of a concatenation of a pitch, a duration, the offset and a chord. The pitch, duration and offset are each represented by learned embedding layers. The chord is encoded using the embeddings of the pitches comprising it. While notes at different octaves have different embeddings, the chord pitch embeddings are always taken from the octave in which most notes in the dataset reside. The input, x_t, is passed to a temporal network: either an LSTM or a Transformer-XL. The temporal network's output is then passed to two heads. Each head consists of two fully-connected layers with a sigmoid activation in between. The first layer transforms the output of the LSTM/transformer so that it is the same size as the embedding of the pitch (or duration), and the second transforms the output into the number of possible pitches (or durations). Following Inan et al. [30] and Press and Wolf [41], we tie the weights of the final fully-connected layers to those of the embeddings. Finally, the outputs of the two heads pass through a softmax layer and are trained to minimize the negative log-likelihood of the corpus. To enrich our dataset while encouraging harmonic context dependence, we augment it by transposing to all 12 keys.

    3 In sixth chords, the seventh note is, naturally, replaced by the sixth note.


    Figure 5.2: The network architecture for next-note prediction. Each note is represented by a pitch, a duration and the four vectors comprising the chord.

    Figure 5.3: Digital CRDI controlled by a user to provide continuous preference feedback.

    5.2 User Preference Elicitation

    Using the trained language model, we created a dataset to be labeled by users, consisting of 124 improvisations. These solos were divided into three groups of roughly the same size: solos from the original corpus, solos generated by the language model over music standards present in the training set, and solos generated over standards not present in it. The length of each solo is two choruses, i.e., twice the length of the melody. For each standard, we generated a backing track that includes a rhythm section and a harmonic instrument to play along with the generated or original improvisation, using Band-in-a-Box [40]. This dataset amounts to approximately five hours of played music.

    5.2.1 Musical Preference Labeling System

    We created a system inspired by the CRDI that is entirely digital, replacing the analog dial with strokes of a keyboard moving a digital dial. The digital dial appears in Figure 5.3. While the original CRDI had a range of 255 values, our initial experiments found that quantizing the values to five levels was easier for users. We recorded the location of the dial at every time step and aligned it to the note being played at the same moment.
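The alignment step above can be sketched as follows: for each note onset we take the most recent dial reading. The helper name and the timestamp-based format are hypothetical, not the system's actual recording format.

```python
import bisect

def align_feedback(note_onsets, dial_times, dial_values):
    """For each note onset time, return the most recent dial reading.

    note_onsets and dial_times are sorted lists of timestamps;
    dial_values[i] is the dial position recorded at dial_times[i].
    """
    aligned = []
    for t in note_onsets:
        # Index of the last dial sample at or before time t.
        i = bisect.bisect_right(dial_times, t) - 1
        aligned.append(dial_values[max(i, 0)])
    return aligned
```

For example, with dial samples at times 0.0 and 2.0, a note starting at 2.5 receives the reading taken at 2.0.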


    5.3 User Preference Metric Learning

    During training, for each sequence X^i_{1:τ_i} we estimate y^i_{τ_i}, corresponding to the label the user provided at the last note in the sequence. We choose the last label, rather than the mode or mean, because of delayed feedback: when a user decides to change the position of the dial, it is because he has just heard a sequence of notes that he considers to be more (or less) pleasing than those he heard previously.

    We estimate the function gϕ using a network similar to that used in the language modeling phase. After the sequence X_{1:τ} is processed by the LSTM/transformer, we apply scaled dot-product attention [48] over the τ time steps, followed by fully-connected and tanh layers. The input and LSTM/transformer layers of the network are initialized using the weights θ of the language model. Furthermore, the weights of the embedding layers are frozen during training. The labels are linearly scaled down to the range [−1, 1]. Since the data in D̃ is small and unbalanced, we use stratified sampling over solos to divide the dataset into train and validation sets. We then use bagging to create an ensemble of five models for the final estimate.


    Chapter 6

    Experiments

    We applied the proposed pipeline to generate personalized models for each of four users, all of whom are amateur jazz musicians. All users listened to the same training dataset of solos generated by the language model (see Chapter 5). Each user provided continuous feedback for each solo using the CRDI. In this section we analyze the results. To appreciate the diversity of the generated solos, listen to the seven solos generated for user-4 over the tune Recorda-Me in the supplementary material.

    6.1 Harmony Coherency

    We begin by evaluating the extent to which the language model was able to capture the context of chords, which we term harmony coherency. We define two harmony coherency metrics, using either "scale match" or "chord match". These metrics are defined as the percent of time within a measure during which notes match pitches of the scale or the chord being played, respectively. We rely on a standard definition of matching scales to chords using the chord-scale system [8]. While most notes in a solo should match the underlying chord, some "non-chord" notes are often incorporated. Common examples of their uses are chromatic lines, "approach notes" and enclosures (see [6] for details). Therefore, as we do not expect a perfect match to the scale or chord according to pure music rules, we take as a baseline the average matching statistics of these quantities for each of the jazz artists in our dataset. The harmony coherency statistics of our language model are computed over the dataset used for preference metric learning (generated by the language model), which includes chord progressions not seen during the language modeling stage. These baselines are reported in Table 6.1. It is evident that our model exhibits harmony coherency in the 'ballpark' of the jazz artists, even on chord progressions not previously seen. Additionally, we demonstrate harmony coherency by presenting an improvisation over a popular non-jazz tune called Despacito (included in the supplementary material).
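The "chord match" statistic above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name is hypothetical, pitch classes are compared modulo 12, and durations are in arbitrary time units.

```python
def chord_match(notes, chord_pitch_classes):
    """Fraction of a measure's total duration during which the sounding
    note's pitch class belongs to the current chord.

    notes: list of (midi_pitch, duration) pairs within the measure.
    chord_pitch_classes: set of pitch classes (0-11) of the chord.
    """
    total = sum(d for _, d in notes)
    matched = sum(d for p, d in notes if p % 12 in chord_pitch_classes)
    return matched / total if total else 0.0
```

The "scale match" variant is identical except that the set contains the pitch classes of the scale associated with the chord under the chord-scale system.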


    Table 6.1: Harmony coherency: the average chord and scale matches computed for selected artists in the dataset and for the language model. A higher number indicates a higher coherency level. Scores for the remaining artists are similar to those presented in the table.

    Name    Adderley  Gordon  Getz  Parker  Rollins  Stitt  Ours (Seen)  Ours (Unseen)
    Chord   0.50      0.54    0.53  0.52    0.52     0.53   0.53         0.52
    Scale   0.78      0.83    0.81  0.80    0.81     0.83   0.82         0.81

    6.2 Analyzing Personalized Models

    In this section we focus only on user-1 (results for all users are included in the supplementary material). We analyze the quality of our preference metric function gϕ by plotting a histogram of the network's predictions on a validation set. Consider Figure 6.2(a). Green histogram bars correspond to sequences labeled as positive by the user, yellow as neutral, and red as negative. We can crudely divide the histogram into three areas: the right-hand region corresponds to mostly positive sequences predicted with high accuracy; the center region corresponds to high confusion between positive and negative; and the left one, to mostly negative sequences predicted with high accuracy. While the overall error of the preference model is high (0.4 MSE, where the regression domain is [−1, 1]), it is still useful, since we are interested in its predictions in the high (green) part of the spectrum for the forthcoming optimization stage. Trading off coverage, we increase prediction accuracy using selective prediction (see [16, 17]), by allowing our classifier to abstain when it is not sufficiently confident. To this end, we ignore predictions whose magnitude lies between two rejection thresholds. Based on preliminary observations, we fix the rejection thresholds to maintain 50% coverage over the validation set. In Figure 6.2(b) we present a risk-coverage curve for user-1 (see definition in [16]). The curve is computed by moving two thresholds across the histogram in Figure 6.2(a) and, at each point, calculating over the data not between the thresholds the risk (error of classification into three categories: positive, neutral and negative) and the coverage (percent of data maintained). In Figure 6.1 we show the score of user-1 as a function of the beam search width. For this user, without beam search, the average score the preference model gϕ predicts for solos generated by the language model fθ is nearly zero. As we introduce beam search and increase the beam width, performance increases up to an optimal point, from which it decreases. This shows that beam search is beneficial in optimizing the preference metric.
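The rejection scheme and the risk-coverage computation can be sketched as follows. This is a minimal sketch: the class boundaries at ±0.33, used to map the regression output to the three categories, are an illustrative assumption, as is the function name.

```python
def risk_coverage(preds, labels, lo, hi):
    """Reject predictions strictly between the thresholds lo and hi, then
    compute (risk, coverage) on the retained data.

    preds: model outputs in [-1, 1].
    labels: ground-truth classes in {-1, 0, 1} (negative / neutral / positive).
    """
    def to_class(v):
        # Illustrative class boundaries for the three-way classification.
        return -1 if v < -0.33 else (1 if v > 0.33 else 0)

    kept = [(p, y) for p, y in zip(preds, labels) if not (lo < p < hi)]
    coverage = len(kept) / len(preds)
    risk = sum(to_class(p) != y for p, y in kept) / len(kept) if kept else 0.0
    return risk, coverage
```

Sweeping (lo, hi) over the prediction range traces out the risk-coverage curve of Figure 6.2(b).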


    Figure 6.1: User score vs. beam width: the x-axis shows the combinations of beam width and parameter k used (respectively); the y-axis shows the average score. The shaded area represents the 95% confidence interval.

    Figure 6.2: (a) Histogram of the preference model's predictions on sequences from a validation set. Green: sequences labeled with a positive score; yellow: neutral; red: negative. The blue vertical line indicates the threshold used for selective prediction. (b) Risk-coverage curve for the model (AUC: 0.503).


    6.3 Harmonic Guided Generation

    An alternative known approach to the one we propose is to maximize a predefined "reward function" based on music theory, similarly to [31, 14], rather than a user-specific metric. This approach may lead to undesired results, similar to the phenomenon of "reward hacking" common in RL [3]. To examine a simple baseline using this method, we generated samples from our language model that are optimized to play notes within the harmony. We took the harmony coherency metric (for scales), as discussed in Section 6.1, and applied beam search with it. The resulting optimized solos successfully maximized this harmony coherency metric; one can listen to such generated solos in the supplementary material. Perhaps unsurprisingly, the use of this handcrafted metric resulted in degraded performance, where the solos were biased toward long notes that match the chord.
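Metric-guided beam search of this kind can be sketched as follows. This is a simplified sketch, not the thesis implementation: the `lm_step` interface (returning candidate next notes with probabilities) and all parameter names are assumptions.

```python
def beam_search(lm_step, score_fn, start, beam_width=8, k=4, steps=16):
    """Generate a sequence that maximizes score_fn under the language model.

    At each step, each beam is extended with the top-k candidate notes
    proposed by lm_step(seq); the beam_width highest-scoring sequences
    (according to score_fn) are kept.
    """
    beams = [start]
    for _ in range(steps):
        candidates = []
        for seq in beams:
            for note, _prob in lm_step(seq)[:k]:
                candidates.append(seq + [note])
        candidates.sort(key=score_fn, reverse=True)  # rank by the metric
        beams = candidates[:beam_width]
    return beams[0]
```

Plugging in the scale-match coherency metric as `score_fn` gives the harmony-guided baseline; plugging in the learned preference model gϕ gives the personalized generation of the main pipeline.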

    6.4 Plagiarism Analysis

    One major concern is the extent to which our language model plagiarizes. To quantify plagiarism in a solo with respect to a set of "source" solos, we measure two metrics: (a) the longest sub-sequence in that solo that also appears in any solo in the source, and (b) for every n-gram of notes in the solo, the percent of n-grams present in solos by another artist1. In our calculations, two sequences that are identical up to transposition are considered the same. This statistic is also computed for every jazz artist in our dataset, to form a baseline for the "typical" amount of copying exhibited by humans. We considered the set of solos of each artist as a source set and computed the plagiarism level of each artist with respect to the rest. For our language model, we quantify the plagiarism level with respect to the entire corpus. We present the results for the first metric in Table 6.2. The average plagiarism level of our language model with respect to the set of harmonic progressions present in the training set is 10.3; this means that, averaged over all examined generated solos, the longest sub-sequence of consecutive notes copied from the corpus is of length 10.3. The plagiarism level with respect to unseen harmonic progressions is 8.6. Interestingly, both values lie within the human plagiarism range found in the dataset (see the table). The values of the second metric are presented in Figure 6.3. Here we can see that the solos of our model fall within the same range as those of the human performers for every n-gram. This fact indicates that our synthetic artist can be accused of plagiarism as much as some of the famous jazz giants.

    1 We measure all n-grams up to length 14, as at this level the percent was near-zero for all artists.
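The first metric can be sketched as follows. This is a quadratic-time sketch with a hypothetical helper name; the transposition invariance applied in the thesis (comparing sequences up to a constant pitch shift) is omitted for brevity.

```python
def longest_copied_run(solo, source):
    """Length of the longest run of consecutive notes in `solo` that also
    appears contiguously in `source` (exact match; no transposition)."""
    best = 0
    n, m = len(solo), len(source)
    for i in range(n):
        for j in range(m):
            k = 0
            while i + k < n and j + k < m and solo[i + k] == source[j + k]:
                k += 1
            best = max(best, k)
    return best
```

Taking the maximum of this quantity over all source solos, and averaging over the examined solos, yields the per-artist entries of Table 6.2. Transposition invariance could be added by comparing interval sequences (successive pitch differences) instead of absolute pitches.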


    Table 6.2: Plagiarism among 8 jazz saxophone giants. Element x_{a,b} in the table is the average largest sub-sequence in a solo of artist a (row names) found in any solo of artist b (column names).

    Name      Adderley  Gordon  Getz  Parker  Rollins  Stitt  Woods  Ammons  Mean
    Adderley  6.2       10.0    8.5   7.7     9.7      9.5    6.2    8.7     8.3
    Gordon    6.0       11.0    8.2   6.9     7.6      7.4    5.6    7.2     7.5
    Getz      5.1       7.9     8.9   6.3     6.7      7.1    5.5    7.0     6.8
    Parker    5.6       7.8     7.9   20.3    7.3      7.5    5.2    20.9    10.3
    Rollins   5.9       7.5     7.3   7.1     7.5      6.9    5.7    6.5     6.8
    Stitt     6.3       10.7    10.7  8.7     8.7      15.2   6.0    9.2     9.4
    Woods     5.6       8.7     8.7   8.0     8.0      8.0    7.1    8.3     7.8
    Ammons    4.9       8.1     8.6   6.9     6.7      7.4    5.4    8.6     7.1

    Figure 6.3: The percent of n-grams of an artist's solo appearing in solos of other artists. The frequency of common n-grams in our model's solos is of the same order as that in the dataset.


    Chapter 7

    Conclusion and Future Directions

    We presented a novel method to generate jazz solos over given harmonic progressions. The solos are optimized to match personal preferences that are learned from humans via supervised learning of their continuous preference responses. To distill the noisy human preference models, we used a selective prediction approach. We applied the proposed pipeline to four users and presented numerical evidence of the partial success of the process. Our work raises many questions and directions for future research.

    7.1 Implications for Human Music Generation

    As computer-generated music becomes better and better, we may ask whether we are reaching a point where human musicians are redundant. We break our answer to this question into two parts. The first is technical; the second concerns the nature of music. Musical composition and performance have many aspects; the composition and choice of notes, the focus of this work, is only one of them. Others include the construction of a unique musical style, the performance itself, tone generation and expressivity (timbre, velocity, vibrato, etc.), interaction with other musicians, and many more. While we may see improvements in these fields in the future, many of these problems are much more challenging. Such open questions are: How to synthesize musical tones that sound natural? How to design an algorithm that interacts with humans? How to incorporate musical knowledge into generated music?

    While we may teach a computer, statistically, how to generate solos (and music in general), the manner in which we do so is widely different from the one humans follow. A computer learns to generate patterns conditioned on a limited musical context, trying to imitate a given corpus. A (human) player learns how to play and create music by spending thousands of hours of practice trying to "generate" music himself, continuously improving. The musician assesses the quality of his music and corrects it according to some internal quality estimate. This estimate changes over time and depends on many of the characteristics of the musician himself, such as the amount of training he has, his culture, his emotional state, etc. One could argue that we may mitigate


    this gap by adjusting the training algorithm to mimic that of humans (perhaps using Reinforcement Learning). However, it seems that we will not be able to capture and mimic the individual intrinsic quality assessment of each human.

    7.2 Future Directions

    While our generated solos are locally coherent and often interesting and pleasing, they lack the qualities of professional jazz solos related to general structure, such as motif development and variations. A major open direction would be to overcome this challenging objective. Preliminary models we trained on smaller datasets were substantially weaker. Can a much larger dataset yield a substantially better model? To obtain such a large corpus, it might be necessary to abandon the symbolic approach used here and rely on audio recordings, which can be gathered in much larger quantities. However, this direction is a huge challenge at this time, possibly requiring different neural architectures and techniques to handle raw audio. Finally, our work emphasizes the need to develop effective methodologies and techniques to extract and distill the noisy human feedback that will be required for developing many personalized applications.


    Bibliography

    [1] Sax solos. https://saxsolos.com/. Accessed: 2019-05-16.

    [2] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones. Character-Level Language Modeling with Deeper Self-Attention. CoRR, abs/1808.04444, 2018. URL http://arxiv.org/abs/1808.04444.

    [3] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané. Concrete Problems in AI Safety. CoRR, abs/1606.06565, 2016. URL http://arxiv.org/abs/1606.06565.

    [4] R. V. Brittin. Listeners' Preference for Music of Other Cultures: Comparing Response Modes. Journal of Research in Music Education, 44(4):328–340, 1996.

    [5] K. Chen, W. Zhang, S. Dubnov, G. Xia, and W. Li. The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), pages 77–84. IEEE, 2019.

    [6] J. Coker. Elements of the Jazz Language for the Developing Improviser. Miami: CPP Belwin, 1991.

    [7] J. C. Coggiola. The Effect of Conceptual Advancement in Jazz Music Selections and Jazz Experience on Musicians' Aesthetic Response. Journal of Research in Music Education, 52(1):29–42, 2004.

    [8] M. Cooke, D. Horn, and J. Cross. The Cambridge Companion to Jazz. Cambridge University Press, 2002.

    [9] P. P. Cruz-Alcázar and E. Vidal-Ruiz. Learning Regular Grammars to Model Musical Style: Comparing Different Coding Schemes. In International Colloquium on Grammatical Inference, pages 211–222. Springer, 1998.

    [10] Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860, 2019.

    [11] M. Duff. Backpropagation and Bach's 5th Cello Suite (Sarabande). In Proceedings of the International Joint Conference on Neural Networks, page 575, 1989.

    [12] D. Eck and J. Schmidhuber. Learning the Long-Term Structure of the Blues. In International Conference on Artificial Neural Networks, pages 284–289. Springer, 2002.

    [13] J. D. Fernández and F. Vico. AI Methods in Algorithmic Composition: A Comprehensive Survey. Journal of Artificial Intelligence Research, 48:513–582, 2013.

    [14] J. A. Franklin. Jazz Melody Generation Using Recurrent Networks and Reinforcement Learning. International Journal on Artificial Intelligence Tools, 15(04):623–650, 2006.

    [15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pages 1243–1252, 2017. URL http://proceedings.mlr.press/v70/gehring17a.html.

    [16] Y. Geifman and R. El-Yaniv. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, pages 4878–4887, 2017.

    [17] Y. Geifman and R. El-Yaniv. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. In International Conference on Machine Learning (ICML), 2019.

    [18] J. Gillick, K. Tang, and R. M. Keller. Learning Jazz Grammars. Proceedings of the SMC, pages 125–130, 2009.

    [19] G. Hadjeres and F. Nielsen. Interactive Music Generation with Positional Constraints using Anticipation-RNNs. CoRR, abs/1709.06404, 2017. URL http://arxiv.org/abs/1709.06404.

    [20] G. Hadjeres, F. Pachet, and F. Nielsen. DeepBach: a Steerable Model for Bach Chorales Generation. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pages 1362–1371, 2017. URL http://proceedings.mlr.press/v70/hadjeres17a.html.

    [21] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. CoRR, abs/1810.12247, 2018. URL http://arxiv.org/abs/1810.12247.

    [22] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1lYRjC9F7.

    [23] H. Hild, J. Feulner, and W. Menzel. HARMONET: A Neural Net for Harmonizing Chorales in the Style of JS Bach. In NIPS, pages 267–274, 1991.

    [24] L. A. Hiller Jr. and L. M. Isaacson. Musical Composition with a High Speed Digital Computer. In Audio Engineering Society Convention 9. Audio Engineering Society, 1957.

    [25] E. Himonides. Mapping a Beautiful Voice: The Continuous Response Measurement Apparatus (CReMA). Journal of Music, Technology & Education, 4(1):5–25, 2011.

    [26] G. E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8):1771–1800, 2002.

    [27] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

    [28] A. Holtzman, J. Buys, M. Forbes, and Y. Choi. The Curious Case of Neural Text Degeneration. arXiv preprint arXiv:1904.09751, 2019.

    [29] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck. Music Transformer: Generating Music with Long-Term Structure. arXiv preprint arXiv:1809.04281, 2018.

    [30] H. Inan, K. Khosravi, and R. Socher. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=r1aPbsFle.

    [31] N. Jaques, S. Gu, R. E. Turner, and D. Eck. Tuning Recurrent Neural Networks with Reinforcement Learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Workshop Track Proceedings, 2017. URL https://openreview.net/forum?id=Syyv2e-Kx.

    [32] D. D. Johnson, R. M. Keller, and N. Weintraut. Learning to Create Jazz Melodies Using a Product of Experts. In Proceedings of the Eighth International Conference on Computational Creativity (ICCC'17), Atlanta, GA, 19, page 151, 2017.

    [33] R. M. Keller and D. R. Morrison. A Grammatical Approach to Automatic Improvisation. In Proceedings, Fourth Sound and Music Conference, Lefkada, Greece, July 2007.

    [34] W. Kool, H. Van Hoof, and M. Welling. Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement. arXiv preprint arXiv:1903.06059, 2019.

    [35] M. Löthe. Knowledge Based Automatic Composition and Variation of Melodies for Minuets in Early Classical Style. In Annual Conference on Artificial Intelligence, pages 159–170. Springer, 1999.

    [36] C. K. Madsen and J. M. Geringer. Comparison of Good Versus Bad Tone Quality/Intonation of Vocal and String Performances: Issues Concerning Measurement and Reliability of the Continuous Response Digital Interface. Bulletin of the Council for Research in Music Education, pages 86–92, 1999.

    [37] H. H. Mao, T. Shin, and G. W. Cottrell. DeepJ: Style-Specific Music Generation. In 12th IEEE International Conference on Semantic Computing, ICSC 2018, Laguna Hills, CA, USA, January 31 – February 2, 2018, pages 377–382, 2018. doi: 10.1109/ICSC.2018.00077. URL https://doi.org/10.1109/ICSC.2018.00077.

    [38] M. C. Mozer. Neural Network Music Composition by Prediction: Exploring the Benefits of Psychoacoustic Constraints and Multi-Scale Processing. Connection Science, 6(2-3):247–280, 1994.

    [39] P. Norvig. Paradigms of Artificial Intelligence Programming: Case Studies in Common LISP. Morgan Kaufmann, 1992.

    [40] PG Music Inc. Band-in-a-Box. URL https://www.pgmusic.com/.

    [41] O. Press and L. Wolf. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3–7, 2017, Volume 2: Short Papers, pages 157–163, 2017. URL https://aclanthology.info/papers/E17-2025/e17-2025.

    [42] D. Quick. Generating Music Using Concepts from Schenkerian Analysis and Chord Spaces. Technical report, Yale University, 2010.

    [43] C. R. Robinson. Differentiated Modes of Choral Performance Evaluation Using Traditional Procedures and a Continuous Response Digital Interface Device. PhD thesis, Florida State University, 1988.

    [44] E. Schubert. Continuous Measurement of Self-Report Emotional Response to Music. Music and Emotion: Theory and Research, pages 394–414, 2001.

    [45] B. Sturm, J. F. Santos, and I. Korshunova. Folk Music Style Modelling by Recurrent Neural Networks with Long Short-Term Memory Units. In 16th International Society for Music Information Retrieval Conference, 2015.

    [46] P. M. Todd. A Connectionist Approach to Algorithmic Composition. Computer Music Journal, 13(4):27–43, 1989.

    [47] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016, page 125, 2016. URL http://www.isca-speech.org/archive/SSW_2016/abstracts/ssw9_DS-4_van_den_Oord.html.

    [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is All You Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 6000–6010, 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.

    [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

    [50] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. arXiv preprint arXiv:1610.02424, 2016.

    [51] R. F. Voss and J. Clarke. "1/f Noise" in Music: Music from 1/f Noise. The Journal of the Acoustical Society of America, 63(1):258–263, 1978.

    [52] C. Walder. Modelling Symbolic Music: Beyond the Piano Roll. In Asian Conference on Machine Learning, pages 174–189, 2016.

    [53] L. Yang, S. Chou, and Y. Yang. MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23–27, 2017, pages 324–331, 2017. URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/226_Paper.pdf.

    [54] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pages 2852–2858, 2017. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14344.

  • חיזוי שמודל מכיוון מהתווים. אחד לכל המשתמש של ההנאה מידת את ומשערך השפה ממודל תווים

    מחיזויים להתעלם לנו המאפשר סלקטיבי חיזוי של בשיטות השתמשנו רועשים, חיזויים פולט ההנאה

    נמוכה. שלנו המודל של הביטחון מידת כאשר

    טכניקות באמצעות אישית מותאמים ג'אז אלתורי ליצור שמנסה הראשונה העבודה זו ידיעתנו, למיטב

    מספר על התהליך את מפעילים אנחנו בו מקיף אמפירי מחקר מציגים אנחנו עמוקה. למידה של

    סובייקטיבי. מדד למקסם מצליחים אנחנו האם אובייקטיבית בחינה מאפשר שלנו התהליך נבדקים.

  • Abstract

    Since the invention of the computer, researchers and artists have been interested in using computers to create art, and music in particular. In recent years, following the growing use of deep learning models, the possibilities and applications of computers in this field have expanded. Another ongoing trend is the growth of services that offer content personalized to the individual taste of their users, e.g., Spotify and Pandora. Perhaps the greatest challenge in this field is the explicit creation of content for a specific user.

    Although almost two decades have passed since the first work that used neural networks to generate jazz improvisations, research in this area remains quite sparse. The outputs of music-generation models are usually evaluated by playing them to human listeners, who rate the different pieces. While such methods can measure the average preference of a population of users, they neglect the personal taste of each individual user. Another approach to assessing model quality is through musical rules. This approach suits mainly musical styles governed by clear rules; jazz improvisation does not fall into this category, since it is hard to state clear rules for what makes a jazz improvisation good.

    To tackle this objective, we devised a process composed of the following stages:

    (a) Supervised learning over a corpus of solos. According to the advice of many jazz teachers, the key to developing good jazz improvisation skills is imitation of the great players. In this work we follow this advice. We compiled a corpus of hundreds of solos played by great saxophonists, among them Charlie Parker, Stan Getz, Dexter Gordon, and Sonny Stitt. Each solo is represented as a monophonic sequence of notes, together with the chord sequence representing the harmony. In this stage we trained an LSTM- or attention-based neural network to maximize the likelihood of the corpus. Training is done by predicting a single note given the preceding notes and the chord associated with each note. The product of this stage is a model capable of generating jazz improvisations with long-range temporal structure.

    (b) Learning a high-resolution, personalized quality metric. As noted, different people have different musical tastes. To learn the taste of each user, we created a new corpus of solos generated by the language model learned in stage (a). We then played these solos to several users, who rated, at every moment of each piece, how much they were enjoying it. That is, we created two sequences describing a sequence of musical notes and a given user's level of enjoyment of each of them. For each user, we trained a model that predicts their enjoyment from these two sequences.

    (c) Generating new solos according to the learned quality metric. In the final stage, we generated new solos from the language model that maximize the enjoyment of a specific user, as predicted by the model learned in stage (b). Generation is performed by a search algorithm that, at each generation step, samples a number of candidate continuations.

    The results indicate that it is possible to learn a listener's personal taste and to generate new solos according to that taste. Our subjective impression is that the outputs of this process are musically pleasing to the ear. In addition, we examined the extent to which our algorithm plagiarizes, i.e., how many of the solos it generates are unique and novel, and how many are copied from the solos it was exposed to during training. Our conclusion is that the algorithm's degree of copying does not differ from that of the players themselves.

    Technion - Computer Science Department - M.Sc. Thesis MSC-2020-15 - 2020
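    The data preparation behind stages (a) and (b) can be illustrated with a small, self-contained sketch: the language model is trained on (preceding notes, current chord) → next note pairs, while the preference model is trained on note contexts aligned with a user's moment-by-moment enjoyment ratings. All names, the dictionary layout, and the toy solo below are assumptions made for illustration only, not the thesis implementation.

```python
def make_lm_examples(notes, chords, context_len=4):
    """Stage (a): each target note is predicted from its preceding
    notes and the chord sounding under it."""
    examples = []
    for i in range(1, len(notes)):
        context = notes[max(0, i - context_len):i]
        examples.append({"context": context, "chord": chords[i], "target": notes[i]})
    return examples


def make_preference_examples(notes, ratings, context_len=4):
    """Stage (b): align the note sequence with one user's continuous
    enjoyment ratings to get (context -> enjoyment) pairs."""
    assert len(notes) == len(ratings)
    return [
        {"context": notes[max(0, i - context_len):i + 1], "enjoyment": ratings[i]}
        for i in range(len(notes))
    ]


# Toy solo: MIDI pitches, one chord symbol per note, one user's ratings.
notes = [60, 62, 64, 65, 67, 65, 64, 62]
chords = ["Cmaj7"] * 4 + ["G7"] * 4
ratings = [0.2, 0.3, 0.5, 0.6, 0.9, 0.8, 0.6, 0.5]

lm_data = make_lm_examples(notes, chords)
pref_data = make_preference_examples(notes, ratings)
print(len(lm_data), len(pref_data))  # 7 8
```

    The per-note chord conditioning is what lets the model learn harmony-aware continuations; the per-note enjoyment label is what gives the preference metric its high temporal resolution.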

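    The preference-guided generation of stage (c) can likewise be sketched. Here the trained language model and user-preference model are replaced by toy stand-ins (a random sampler over a small pitch set and a hand-written scorer that prefers small melodic steps); only the sample-then-rescore search loop reflects the approach described above.

```python
import random

PITCHES = [60, 62, 63, 65, 67, 70]  # small illustrative pitch set (assumption)


def lm_sample(context, k, rng):
    """Stand-in for the language model: propose k candidate next notes."""
    return [rng.choice(PITCHES) for _ in range(k)]


def preference_score(context, candidate):
    """Stand-in for a user's preference model: reward small steps."""
    if not context:
        return 0.0
    return -abs(candidate - context[-1])


def generate_solo(length, k=5, seed=0):
    """At each step, sample k candidates from the language model and
    keep the one the preference model scores highest."""
    rng = random.Random(seed)
    solo = [rng.choice(PITCHES)]
    while len(solo) < length:
        candidates = lm_sample(solo, k, rng)
        best = max(candidates, key=lambda c: preference_score(solo, c))
        solo.append(best)
    return solo


solo = generate_solo(8)
assert len(solo) == 8 and all(p in PITCHES for p in solo)
print(solo)
```

    Because candidates are drawn from the language model but ranked by the user's learned metric, the search steers generation toward that user's taste without retraining the language model itself.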

  • המחשב. למדעי בפקולטה אל-יניב, רן פרופסור של בהנחייתו בוצע המחקר

    תודות

    שיחה של שעות אינספור על פתוחה, תמיד למשרדו שהדלת כך על רן, שלי, למנחה להודות רוצה אני

    לכל המתמדת התמיכה על מעיין, לאשתי, גם להודות רוצה אני למהנה. הדרך את שהפך כך ועל

    שלי. המחקר תקופת אורך

    בהשתלמותי. הנדיבה הכספית התמיכה על לטכניון מודה אני


  • Neural Models for Personalized Jazz Improvisations

    Research Thesis

    Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

    Nadav Bhonker

    Submitted to the Senate of the Technion — Israel Institute of Technology

    Cheshvan 5780, Haifa, November 2019


  • Neural Models for Personalized Jazz Improvisations

    Nadav Bhonker


    List of Figures
    List of Tables
    Abstract
    Abbreviations and Notations
    1 Introduction
    2 Related Work
    2.1 Deep Symbolic Jazz Improvisation
    2.2 Note representation
    2.3 Continuous Response Measurement Interface (CRDI)
    3 Problem Statement
    4 Learning to Generate Personalized Jazz Improvisations with Neural Models
    4.1 Language Modeling
    4.2 User Preference Metric Learning
    4.3 Optimized Music Generation via Planning
    5 Implementation
    5.1 Network Architecture
    5.2 User Preference Elicitation
    5.2.1 Musical Preference Labeling System
    5.3 User Preference Metric Learning
    6 Experiments
    6.1 Harmony Coherency
    6.2 Analyzing Personalized Models
    6.3 Harmonic Guided Generation
    6.4 Plagiarism Analysis
    7 Conclusion and Future Directions
    7.1 Implications for Human Music Generation
    7.2 Future Directions
    Bibliography