Music Generation Using Generative Adversarial Networks

Diogo de Almeida Mousaco Pinho

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisor: Prof. Rodrigo Martins de Matos Ventura

Examination Committee

Chairperson: Prof. João Fernando Cardoso Silva SequeiraSupervisor: Prof. Rodrigo Martins de Matos Ventura

Member of the Committee: Prof. Pedro Manuel Quintas Aguiar

June 2018


Acknowledgments

First, I would like to thank my supervisor, Prof. Rodrigo Ventura, for the support and encouragement to

explore two topics of major personal interest, namely Machine Learning and Music.

I would also like to thank, in particular, Pedro Ferreira, Jose Corujeira, Lia Laporta and Raquel Miranda for taking the time to discuss my work at various points.

Finally, I would also like to thank my family and friends for the motivation and support provided during

this thesis.


Declaration

I declare that this document is an original work of my own authorship and that it fulfills all the require-

ments of the Code of Conduct and Good Practices of the Universidade de Lisboa.


Resumo

A ideia de uma máquina ser capaz de gerar música é, de certa forma, intrigante. O processo de composição musical implica a manipulação de sons de base ou notação para criar estruturas mais complexas. Nesta tese é proposto um sistema de geração baseado em formas de onda que representam compassos musicais, tirando partido de técnicas de Machine Learning. Um pré-processamento das amostras de áudio é executado, e consiste na transformação das formas de onda numa representação tempo-frequência, geralmente utilizada para lidar com sinais de música. Um modelo generativo do estado da arte foi implementado com o objectivo de criar trechos idênticos aos do dataset, o qual é composto por compassos com duração de 2 segundos. O modelo original é conhecido como Generative Adversarial Network (GAN), mas a variante implementada beneficia de camadas convolucionais na arquitetura das redes e é chamada Deep Convolutional Generative Adversarial Network. Várias abordagens com diferentes arquiteturas e hiperparâmetros são implementadas de forma a avaliar a capacidade do modelo de cumprir os objectivos propostos. Através de um user study conclui-se que os trechos de música gerados pelo sistema implementado não são ruído, e que são musicalmente agradáveis.

Palavras-chave: Geração de Música, Generative Adversarial Networks, Deep Learning, Deep Convolutional Generative Adversarial Networks, Aprendizagem Automática


Abstract

The idea of a machine being able to generate music is somewhat intriguing. The music composition process implies the manipulation of baseline sounds or notation to create more complex structures. In this thesis, a generation system based on raw waveforms representing musical bars is proposed, taking advantage of Machine Learning techniques. A preprocessing of the audio samples is performed, consisting of a transformation of the waveforms into a time-frequency representation commonly used to deal with music signals. A state-of-the-art generative model was implemented with the purpose of creating music segments similar to those in the dataset, which is composed of 2-second-long music bars. The original model is known as the Generative Adversarial Network (GAN), but the implemented variant benefits from convolutional layers in its networks' architectures and is called the Deep Convolutional Generative Adversarial Network. Several approaches with different architectures and hyperparameters were implemented in order to evaluate the model's capability of meeting the proposed objectives. By means of a user study, it is concluded that the music segments generated by the implemented system are not noise, and are in fact musically pleasing.

Keywords: Music Generation, Generative Adversarial Networks, Deep Learning, Deep Convolutional Generative Adversarial Networks, Machine Learning


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

List of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theoretical Background 5

2.1 Music Theoretical Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Rhythm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.2 Melody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.3 Harmony . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Constant-Q Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.2 CQT Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.2 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.5.2 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.5.3 Fully-connected Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5.4 CNN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.6 Deep Convolutional Generative Adversarial Network . . . . . . . . . . . . . . . . . . . . . 22

2.6.1 DCGAN approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.2 Detailed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Proposed Models 29

3.1 Model 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Model 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Model 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 Implementation and Results 35

4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3 Overall Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3.1 Validation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 User Study 45

5.1 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.3 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Conclusions 51

6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Bibliography 53

A Survey 57


List of Tables

2.1 Original DCGAN generator convolutional layers' specifications. . . . . . . . . . . . . . 26

2.2 Original DCGAN discriminator convolutional layers' specifications. . . . . . . . . . . . 27

3.1 Model 1 convolutional layers' specifications. . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2 Model 2 convolutional layers' specifications. . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3 Model 3 convolutional layers' specifications. . . . . . . . . . . . . . . . . . . . . . . . . 33

4.1 Table of CQT parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1 Friedman's test for Q1 and Q2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.2 Wilcoxon signed-rank tests for Q1 and Q2. . . . . . . . . . . . . . . . . . . . . . . . . . . 48


List of Figures

2.1 Log-spectrogram example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Artificial Neuron and Artificial Neural Networks design. . . . . . . . . . . . . . . . . . . . . 10

2.3 Activation functions plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4 Artificial Neural Networks (ANN) with notation . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5 Generative Adversarial Network (GAN) layout. . . . . . . . . . . . . . . . . . . . . . . . . 16

2.6 Informal explanation of minimax training algorithm. . . . . . . . . . . . . . . . . . . . . . . 17

2.7 Example of neuron disposition in the first convolutional layer. . . . . . . . . . . . . . . . . 19

2.8 Dimensioning the output of the convolutional layer. (1) . . . . . . . . . . . . . . . . . . . . 20

2.9 Example of Max Pooling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.10 Example of CNN architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.11 Dimensioning the output of the convolutional layer. (2) . . . . . . . . . . . . . . . . . . . . 23

2.12 Dimensioning the output of a fractional-strided convolutional layer. (1) . . . . . . . . . . . 24

2.13 Dimensioning the output of a fractional-strided convolutional layer. (2) . . . . . . . . . . . 24

2.14 Original DCGAN’s generator network architecture. . . . . . . . . . . . . . . . . . . . . . . 26

2.15 Original DCGAN’s discriminator network architecture. . . . . . . . . . . . . . . . . . . . . 27

3.1 Model 1 networks' architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2 Model 2 networks' architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.3 Model 3 networks' architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.1 Audio segment representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Implemented system high-level architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 CQT validation high-level architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4 CQT validation test results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.5 MNIST real and generated samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.6 Discriminator and generator losses trained on the MNIST dataset. . . . . . . . . . . . . . 40

4.7 Training the generative model with only one sample as input. . . . . . . . . . . . . . . . . 40

4.8 Generations of the one training sample validation test for all the proposed models. . . . . 41

4.9 Discriminator and generator losses regarding the validation test. . . . . . . . . . . . . . . 42

4.10 Generations of the trained generative model for all the proposed models. . . . . . . . . . 44


4.11 Discriminator and generator losses regarding the trained generative model’s best perfor-

mance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.1 Relative frequency per sample group regarding Q1. . . . . . . . . . . . . . . . . . . . . . 47

5.2 Boxplot of the average rank per sample group regarding Q2. . . . . . . . . . . . . . . . . 47


List of Acronyms

AN Artificial Neuron

ANN Artificial Neural Networks

BPM Beats Per Minute

CNN Convolutional Neural Network

CPU Central Processing Unit

CQT Constant-Q Transform

CUDA Compute Unified Device Architecture

D Discriminator

DCGAN Deep Convolutional GAN

DFT Discrete Fourier Transform

DNN Deep Neural Networks

G Generator

GAN Generative Adversarial Network

GD Gradient Descent

GPU Graphics Processing Unit

GRU Gated Recurrent Unit

IDE Integrated Development Environment

LReLU Leaky Rectified Linear Unit

LSTM Long Short-Term Memory

MIDI Musical Instrument Digital Interface

MSE Mean Squared Error

RAM Random Access Memory


ReLU Rectified Linear Unit

RNN Recurrent Neural Network

SGD Stochastic Gradient Descent

WGAN Wasserstein GAN


Chapter 1

Introduction

1.1 Motivation

Music can be interpreted as the art of combining sounds and silences in a way that produces beauty, is harmonic, and expresses emotions. Those who compose music intend to spark emotions in the listener. As a composer myself, I have dealt with the underlying struggle of expressing one's ideas as melodies, harmonies, or even rhythms. With this in mind, one may wonder whether it is possible to achieve the same purpose when the source is not a human being but an algorithm. This thought leads to deeper questions, such as whether a machine must have emotions in order to create music that provokes emotions in humans.

Algorithms' decision-making processes have traditionally been based on a priori conditions over predefined features of data structures. However, viable alternatives to this approach have emerged with the recent developments in artificial neural networks, which learn to make decisions over features they define themselves, based solely on a dataset and a classification. Deep learning algorithms are currently among the most prominent technologies and are believed to revolutionize the field in the near future.

The music generation problem has already been studied using these deep learning algorithms. Most of these works aim at generating musical notation, which concerns the composition perspective, i.e., how musical notes are sequenced. However, if instead of notation one considers creating an actual sound, the task becomes much more challenging.

1.2 Objectives

In this thesis, the task of generating a music segment waveform is addressed. A generation system based on raw waveforms is designed and implemented. The output is expected to represent music segments that follow certain constraints imposed by the input data, such as being 2 seconds long.

The success of this system will be evaluated based on sound fidelity, i.e., the generated segment should not be noise, and on how musically pleasing the sound is. Using a generic dataset including sequences of different patterns from every music genre, the system's outcome should be unconfined and as unbiased as possible, challenging human creativity. However, a more modest approach is adopted with the purpose of creating a valid audio sample.

Ultimately, this thesis should assess the suitability of the proposed state-of-the-art deep learning generative models for the task of generating audio segments.

1.3 Related work

Algorithmic composition dates back to 1959 [1]. However, the recent developments in Deep Neural Networks (DNN), which have shown astonishing results in learning from large datasets, have allowed the topic of music generation to be further developed. Over the past couple of years, many models addressing music generation have been published, all of them based on deep learning algorithms [2]–[10].

A large part of the neural network based models for music generation use the Recurrent Neural Network (RNN) and some of its variants, since the music generation process can be seen as creating dependent sequences [5]–[7]. RNNs are neural networks that possess a directed connection from the output of a unit in a certain layer to the input of a unit in a layer closer to the input one, just as in a closed-loop system. As an example of these variants, Nayebi and Vitelli [6] present the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) architectures to address music composition. According to their work, the LSTM was the only musically plausible one among those experimented with. The data used in these experiments was preprocessed into a time-frequency representation.

Another way of representing data is directly as time-domain audio waveforms, as used in WaveNet. Introduced by Oord, Dieleman, Zen, et al. [4], WaveNet is a fully probabilistic and autoregressive model, in which all previous audio samples condition the distribution of the next one. WaveNet is actually a Convolutional Neural Network (CNN), where dilated (or à trous) causal convolutions are present in all layers. This kind of convolution consists of each filter taking every n-th element of the input matrix (n depends on the layer), rather than all elements. The WaveNet model reached state-of-the-art performance when applied to text-to-speech and shows promising results when applied to music modeling, proving that CNNs are a valid option to generate music, alongside RNNs.

Yang, Chou, and Yang [2] proposed a model based on a CNN, but trained as a Generative Adversarial Network (GAN) [11], called MidiNet. This model works over a piano-roll representation of Musical Instrument Digital Interface (MIDI) data, an encoding protocol for musical features such as sound and silence, notes, and tempo, among others. MIDI is by far the most widely used method to represent music structures computationally, since its complexity can be as simple as a binary code indicating whether a note is played at a certain time step or not, disregarding the audio source. In the MidiNet model, the GAN uses an adversarial learning algorithm with two networks. One of them, called the generator, aims at converting a random noise sample into realistic artificial data. The other, called the discriminator, is simply a classifier that tells whether a sample comes from the real data or from the generator network. In this case, both of them contain convolutional layers in their architectures, as CNNs do; this variant of the original GAN model was proposed by Radford, Metz, and Chintala [12] and is known as the Deep Convolutional GAN (DCGAN). This GAN alone generates bars and does not consider temporal dependencies between them, although this issue is dealt with by a conditioner Convolutional Neural Network (CNN).

Another relevant model, very similar to the latter, is MuseGAN [3]. It also works over MIDI data, and uses a GAN variant that minimizes the Wasserstein distance, namely the Wasserstein GAN (WGAN). Temporal dependencies are dealt with here by combining time-dependent and time-independent random vectors to generate conditional bars.

1.4 Contributions

According to the proposed work, the main contributions are the following:

• An approach to music generation based on raw audio data, with machine learning techniques.

• An implementation of a generative model with two deep neural network architectures.

• An approach to convolutional strategies to exploit feature locality.

• A statistical analysis of a user study to validate the results.

1.5 Thesis Outline

This thesis is structured as follows:

Chapter 2 establishes the background over theoretical music concepts, a signal processing algo-

rithm, some neural network basic concepts, and also covers some more high-level structures of

neural networks.

Chapter 3 describes the design and implementation features of the proposed models.

Chapter 4 analyses the suitability of the proposed generative model for the task of generating music, and presents the parameters that provided the best trained model.

Chapter 5 describes the study developed to validate the results of this work.

Chapter 6 concludes the thesis and suggests avenues for future work.


Chapter 2

Theoretical Background

This chapter provides a theoretical introduction covering the basic structures and knowledge needed for a clear understanding of the remaining sections. In Section 2.1, a brief overview of music theory is given in order to evaluate some musical details in the data structures used. In Section 2.2, the Constant-Q Transform, a time-frequency representation, is analyzed in detail. In Section 2.3, neural networks are introduced and the basic principles of how they work are covered. Lastly, Sections 2.4 to 2.6 cover more specific Deep Neural Networks.

2.1 Music Theoretical Overview

In physics, sound can be defined as a pressure wave that propagates in a transmission medium (e.g. air or water). When this wave reaches the human ear it is processed by the brain, which defines the human hearing capability. In order to be classified as music, the waves that are heard should follow some structure.

The first question that arises is how to define the baseline of music, i.e., a musical note. Musical notes are a notation that maps specific frequencies. In western music [13], there are 12 notes assigned the following upper-case letters: {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}. All the keys in a piano are a direct representation of consecutive sets like this one. To distinguish different sets, a number is assigned, known as the octave representation. So, two consecutive sets with the corresponding octave are written as {C4, C♯4, D4, D♯4, E4, F4, F♯4, G4, G♯4, A4, A♯4, B4, C5, C♯5, D5, D♯5, E5, F5, F♯5, G5, G♯5, A5, A♯5, B5}. As a tuning standard, the A4 musical note maps to the 440 Hz frequency. The mathematical formulation to derive the frequency of the next note, whose distance from the previous one is called a half-step in music (in this case A♯4), is just a multiplication by a factor of $2^{1/12}$ (i.e., A♯4 = 440 × $2^{1/12}$ ≈ 466.16 Hz).
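To make this mapping concrete, the short sketch below computes the frequency of a note a given number of half-steps away from A4; the function name and the example notes are merely illustrative.

```python
def note_frequency(halfsteps_from_a4: int) -> float:
    """Frequency (Hz) of the note `halfsteps_from_a4` half-steps above A4 = 440 Hz."""
    return 440.0 * 2 ** (halfsteps_from_a4 / 12)

print(note_frequency(1))     # A#4 ~ 466.16 Hz
print(note_frequency(3))     # C5  ~ 523.25 Hz
print(note_frequency(-12))   # A3  = 220.00 Hz
```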

After having the frequency mapping of musical notes defined, music notation will be used when

useful in the remainder of this section.

This overview will focus on the three main elements of music theory: Rhythm, Melody and Harmony [14].


2.1.1 Rhythm

The temporal element in music is called rhythm. It may be defined as the placement, in time, of sounds

to create patterns. When one taps his foot to the music, which is known as ”keeping the beat”, he is

actually following the rhythmical pulse of the music. There are a few rhythmic terms that define some important features of rhythm:

• Duration: This is a straightforward concept, defined by how long a sound or silence lasts. Durations are named after the ratio of a note's duration to a bar's duration, i.e. one bar can comprise 1 whole note, 2 half notes, 4 quarter notes, and so on.

• Tempo: Tempo measures the speed of the beat, usually expressed as a frequency in the form of Beats Per Minute (BPM).

• Meter: It consists of organized accent patterns of beats. One example where this rhythmic characteristic can be noticed is a particular music genre, the waltz, where the tempo may vary but the meter is always based on a 3-beat bar in which the first beat is strongly accentuated.

2.1.2 Melody

A horizontal series of notes (over time) can be defined as a melody. The melody is what actually gives a musical sense to sound. One simple example of this is the difference between talking and singing. The singing process consists of saying words and producing notes at the same time. One can notice this difference by observing what happens when one tries to sing but does not know the lyrics.

The way notes are sequenced is not exactly random. It is usually based on predefined groups of five

or seven notes, called scales.

2.1.3 Harmony

Harmony is a vertical stacking of notes, and its role is to give context to the melody notes so that they feel pleasant. Stacking notes at the same time creates a chord, and a series of chords creates a chord progression. As with melody, the choice of notes to stack is not random. Usually these stacked notes match the ones in a scale and follow some chord structure, such as the three-note ones: major, minor, augmented and diminished.

When all the notes in a chord progression belong to a certain scale, it is called a diatonic chord

progression. This means that any scale note in the melody will sound harmonically pleasant over this

progression.

2.2 Constant-Q Transform

Sounds consisting of harmonic frequency components, when plotted against log-frequency, have this

property that the distance between these frequency components is the same independently of the


fundamental frequency [15]. The spacing between two consecutive harmonics follows the sequence $\log(\frac{2}{1}), \log(\frac{3}{2}), \ldots$ Thus, the absolute positions depend on the fundamental frequency, but the relative positions of the harmonics are constant. As a result, when represented in the frequency domain, these components create a pattern which depends only on the instrumental source of the sound.

The CQT representation was chosen for preprocessing the raw audio waveforms, and its mathematical model is described next.

2.2.1 Mathematical Model

As stated in [16], the CQT, $X^{CQ}(k, n)$, of a discrete time-domain signal $x(n)$ is defined by:

$$X^{CQ}(k, n) = \sum_{j = \lfloor n - N_k/2 \rfloor}^{\lfloor n + N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2) \qquad (2.1)$$

where $k = 1, 2, \ldots, K$ indexes the frequency bins of the CQT, $\lfloor \cdot \rfloor$ represents the floor operation, which rounds its argument down, and $a_k^*(n)$ denotes the complex conjugate of $a_k(n)$. The latter, $a_k(n)$, are complex-valued waveforms, here also called time-frequency atoms, given by:

$$a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left[-i 2\pi n \frac{f_k}{f_s}\right] \qquad (2.2)$$

where $f_k$ is the center frequency of bin $k$, $f_s$ is the sampling rate, and $w(\cdot)$ is a continuous window function, sampled at points determined by its argument. The window function is zero outside the range $[0, 1]$. The window lengths $N_k \in \mathbb{R}$ in 2.1 and 2.2 are real values and are inversely proportional to $f_k$, so that the Q-factor is the same for all bins $k$. This Q-factor can be interpreted as the ratio of center frequency to bandwidth.

In Schörkhuber and Klapuri [16], the center frequencies $f_k$ follow:

$$f_k = f_1\, 2^{(k-1)/B} \qquad (2.3)$$

where $f_1$ is the center frequency of the lowest-frequency bin and $B$ is the number of bins per octave. The parameter $B$ regulates the frequency resolution.

The Q-factor of any bin $k$ is given by:

$$Q = \frac{f_k}{\Delta f_k} = \frac{N_k f_k}{\Delta_w f_s} \qquad (2.4)$$

where $\Delta f_k$ is the $-3$ dB bandwidth of the frequency response of the atom $a_k(n)$, and $\Delta_w$ is the $-3$ dB bandwidth of the main lobe of the spectrum of the window function $w(\cdot)$.

In order to introduce the minimum frequency smearing, the bandwidth, ∆fk, should be as small as

possible, which can be obtained by having a large Q-factor. Still, the Q-factor cannot be arbitrarily set,

or else it would not be possible to analyze portions of the spectrum between bins. Thus, the best value

of Q that still allows signal reconstruction is given by:


$$Q = \frac{q}{\Delta_w (2^{1/B} - 1)} \qquad (2.5)$$

where q ∈ [0, 1] is a scaling factor, typically set to 1. Setting q < 1 will improve the time resolution but

decrease the frequency resolution.

Combining equations 2.5 and 2.4 and solving for $N_k$, the following equation arises:

$$N_k = \frac{q f_s}{f_k (2^{1/B} - 1)} \qquad (2.6)$$

which no longer depends on $\Delta_w$.

To increase the computational efficiency of the CQT while allowing signal reconstruction from the coefficients, the atoms should be placed $H_k$ samples apart. $H_k$ is referred to as the "hop size", and to achieve a reasonable reconstruction of the signal, its value should satisfy $0 < H_k \leq \frac{1}{2}N_k$.
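As a quick worked example of equation 2.6 (a sketch only; the parameter values are the ones adopted in the next subsection, and the choice of the 440 Hz bin is purely illustrative):

```python
# Worked example of Eq. (2.6) for a single bin, using the parameter values
# adopted later in Section 2.2.2 (fs = 44100 Hz, B = 85 bins per octave,
# q = 0.6); the bin centred on A4 = 440 Hz is an illustrative choice.
fs, B, q, fk = 44100, 85, 0.6, 440.0

Nk = q * fs / (fk * (2 ** (1 / B) - 1))   # window length in samples
print(round(Nk))                          # ~7344 samples (about 0.17 s)
print(0 < 256 <= Nk / 2)                  # True: a hop size of 256 satisfies Hk <= Nk/2
```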

2.2.2 CQT Application

Schörkhuber and Klapuri [16] proposed a method to efficiently compute the CQT, based on the algorithm proposed by Brown and Puckette [15]. This method [16] is not only more computationally efficient, but also allows the computation of the inverse CQT. Both the CQT and its inverse are implemented in a Python package called LibROSA [17].

The CQT is always computed with the same fixed parameters. The sampling frequency was set to $f_s = 44100$ Hz. The minimum and maximum frequencies were set to those of the notes C3 and C7, respectively (the letter C corresponds to a musical note and the number to an octave, as mentioned in section 2.1), as they comprise a range where all the data lies. Each octave is represented by 85 frequency bins, resulting in a total of 340 bins over the four octaves used. The hop size $H_k$ is set to 256 and the scaling factor to $q = 0.6$. For a 2-second-long audio signal this corresponds to 345 atoms, so the transformed data has dimensions $340 \times 345$ complex values. Still, in order to guarantee that later computations would not be affected, each complex value had to be separated into two real ones, corresponding to the real and imaginary parts. Therefore, the structure of the data after this transform is applied is a 3-dimensional array with dimensions $340 \times 345 \times 2$.

These values were chosen based on two criteria. Firstly, a decent reconstruction had to be possible so that further processing of the data would not be compromised. Secondly, since this data will be fed into a computationally heavy network, as will become clear in later sections, the dimensions should be kept as small as possible.
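The sketch below illustrates, under the stated parameters, how this preprocessing could be performed with LibROSA; the file path is a placeholder, librosa's filter_scale argument is taken to play the role of the scaling factor q, and the exact frame count (~345) may differ slightly between library versions.

```python
import numpy as np
import librosa

sr = 44100
y, _ = librosa.load("bar.wav", sr=sr, duration=2.0)   # 2-second music bar (placeholder path)

C = librosa.cqt(
    y, sr=sr,
    hop_length=256,                    # hop size Hk
    fmin=librosa.note_to_hz("C3"),     # lowest bin centred on C3
    n_bins=340,                        # 4 octaves x 85 bins, up to C7
    bins_per_octave=85,                # B
    filter_scale=0.6,                  # scaling factor q
)                                      # complex-valued array, shape (340, ~345)

# Split each complex coefficient into real and imaginary parts -> (340, ~345, 2)
X = np.stack([C.real, C.imag], axis=-1)
```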

As an example of this representation, figure 2.1 plots the power log-spectrogram of a 2-second music sample. Note that this log-spectrogram representation will be used quite often in later sections, but since both the x and y axes are fixed and the z axis (color bar) is just an amplitude scaling, they may not always be shown.


Figure 2.1: Log-spectrogram example.

2.3 Neural Networks

Artificial Neural Networks (ANN) are inspired by the biological neural structures found in the human brain. These networks contain organized layers of interconnected units, where each unit is called an Artificial Neuron (AN). ANNs are mainly used for tasks in which it is difficult to derive logical constraints explicitly, such as pattern recognition and predictive analysis [18].

The first computational model of an AN was developed in 1943 by McCulloch and Pitts [19], a neuro-

scientist and a logician, respectively. They proposed a binary threshold unit as a model for the artificial

neuron. The mathematical model of this unit is the following:

$$y = H\!\left(\sum_{j=1}^{n} w_j x_j - u\right) \qquad (2.7)$$

where $H(\cdot)$ is the activation function (in this case the Heaviside step function) with threshold $u$, $x_j$ is the input signal and $w_j$ the associated weight, with $j = 1, 2, \ldots, n$, where $n$ corresponds to the number of inputs. The unit's output is 1 when the sum is above the threshold $u$, and 0 otherwise. This model led Rosenblatt [20] to the development of a pioneering neural network known as the perceptron.

The current AN model consists of a weighted sum of the inputs and a bias, often referred to with the lower-case characters $x$ and $b$, respectively. This sum is then passed through a non-linear function to produce the output of the AN, as can be seen in figure 2.2(a).
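As a minimal illustration of this model, the sketch below computes the output of a single artificial neuron with a sigmoid non-linearity; the input, weight and bias values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a non-linearity."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron_output(x, w, b))
```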

As already mentioned, an ANN is a set of ANs organized in layers, which can be analyzed as a weighted directed graph. The ANs are the nodes, and the edges are connections between one neuron's output and another one's input. The layers in an ANN are: an input layer, a hidden layer, and an output layer. Note that when there is more than one hidden layer, the designation of the network changes to Deep Neural Network. In figure 2.2(b) one may verify the layer organization of interconnected ANs.

(a) Artificial Neuron (AN).

(b) Artificial Neural Networks (ANN).

Figure 2.2: Artificial Neuron and Artificial Neural Networks design (source: [21]).

In the remainder of this section, activation and loss functions, the gradient descent optimization method, and the backpropagation technique are analyzed in greater depth, in order to give a general idea of an ANN implementation.


2.3.1 Activation Functions

Activation functions (also called non-linearities) play a major role in the neurons' computations. Their purpose is to make the network non-linear; otherwise it would be just a simple linear combination of the weights of the interconnected neurons. Among others, three main activation functions [22] will be analyzed:

• Sigmoid: The sigmoid non-linearity is plotted in figure 2.3(a) and has the following mathematical

expression:

$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.8)$$

It takes a real value and confines it to the range $[0, 1]$. It looks just like a Heaviside step function with smooth edges. Inspecting figure 2.3(a), one can tell that for values of $x$ close to 0 the slope is very steep, which means this function pushes the input value towards one end of the curve. Furthermore, when $f(x)$ is in a region close to 0 or to 1, the gradient becomes very small and a problem called "vanishing gradient" arises. If a variation in the input causes only a small variation in the output, then the parameter will learn notably slower or may not be learned at all. Suffice it to say that this problem scales when layers of ANs are stacked on top of each other.

• Hyperbolic Tangent: The hyperbolic tangent (tanh) non-linearity is plotted in figure 2.3(b) and

has the following mathematical expression:

$$f(x) = \tanh(x) = 2\sigma(2x) - 1 \qquad (2.9)$$

This is actually a scaled sigmoid function. The output values are confined to the range $[-1, 1]$, and it presents the same vanishing gradient problem as the previous function. Apart from the range, the actual difference from the sigmoid is that tanh has a stronger gradient, since its derivatives are steeper.

• Rectified Linear Unit: The Rectified Linear Unit (ReLU) non-linearity is plotted in figure 2.3(c) and has the following mathematical expression:

$$f(x) = \max(0, x) \qquad (2.10)$$

This simply truncates negative values to 0. This activation function has become popular since it does not suffer from the vanishing gradient problem. Nonetheless, the ReLU may still not be suited to all architectures, since by removing all negative information the gradient becomes 0 and, consequently, the neuron becomes useless or "dead".

There are a few variations of the ReLU, such as the Leaky Rectified Linear Unit (LReLU), in which, for $x < 0$, instead of a horizontal line ($f(x) = 0$) the output follows a slightly inclined line (e.g. $f(x) = 0.2x$), whose slope is called the leak factor.


(a) Sigmoid. (b) Hyperbolic Tangent (tanh). (c) Rectified Linear Unit (ReLU).

Figure 2.3: Activation functions plot (source:[23]).
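For concreteness, a minimal NumPy sketch of these non-linearities, including the leaky variant with an illustrative leak factor of 0.2:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # Eq. (2.8), output in [0, 1]

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0    # Eq. (2.9), output in [-1, 1]

def relu(x):
    return np.maximum(0.0, x)              # Eq. (2.10), truncates negatives to 0

def leaky_relu(x, leak=0.2):
    return np.where(x > 0, x, leak * x)    # slightly inclined line for x < 0
```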

2.3.2 Loss functions

The way an ANN's performance is evaluated, as with other machine learning methods, is through a loss (or cost) function. This measures the disparity between the algorithm's prediction and the desired output. Among other existing loss functions [24], the relevant ones for this thesis are the following:

• Mean Squared Error (MSE): The MSE has the following mathematical expression:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|^2 \qquad (2.11)$$

This function computes the mean squared distance between each value ($x_i$) and the desired output ($y_i$).

• Cross Entropy: The Cross Entropy, also known as log loss, has the following mathematical ex-

pression for a binary classifier:

$$\mathrm{CE} = -\frac{1}{n}\sum_{i=1}^{n} \left( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right) \qquad (2.12)$$

where $p_i$ is the probability of $x_i$ belonging to class 1 and $y_i$ is the class it actually belongs to. Mathematically, $p_i = p(y_i = 1 | x_i)$ and $(1 - p_i) = p(y_i = 0 | x_i)$. This function measures the divergence between two probability distributions.
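A minimal NumPy sketch of both losses follows, assuming y_true holds the targets and, for the cross entropy, p holds the predicted class-1 probabilities; the epsilon clipping is only a numerical safeguard.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error, Eq. (2.11)."""
    return np.mean(np.abs(y_pred - y_true) ** 2)

def cross_entropy(p, y_true, eps=1e-12):
    """Binary cross entropy, Eq. (2.12); eps avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```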

2.3.3 Gradient Descent

In order to guarantee a learning process, it is mandatory that the network's parameters become more accurate at each iteration. Hence, training a network amounts to solving the following optimization problem:

$$\min_{w} E(w) \qquad (2.13)$$

where $E(w)$ is a loss function.

The Gradient Descent (GD) is the most commonly used optimization algorithm for training ANNs. This method consists of iteratively updating the weights (or parameters) according to:

$$w_{k+1} = w_k - \eta \nabla E \qquad (2.14)$$

where $w_k$ is the weight vector at the $k$-th iteration, $\eta$ is the learning rate and $\nabla E$ is the gradient of the cost function $E(w)$.

This form of gradient descent is commonly known as offline. When the training data is large this method may become inefficient, making the learning process slow. Another variant of gradient descent, called Stochastic Gradient Descent (SGD), is used to deal with this issue. SGD (also known as online), instead of using all the training samples to compute the gradient, uses only one sample or a subset of samples from the training set. In the case of a subset, it is often called mini-batch SGD.

In order to improve the process of minimizing the loss function, some acceleration techniques have been developed [25]. Only one of them, the momentum technique, will be addressed here. Rewriting the weight update equation (2.14) as:

$$w_{k+1} = w_k + \Delta w_k \qquad (2.15a)$$
$$\Delta w_k = -\eta \nabla E \qquad (2.15b)$$

the momentum technique adds a term to equation 2.15b, which becomes:

$$\Delta w_k = -\eta \nabla E + \alpha \Delta w_{k-1} \qquad (2.15c)$$

where the first term already appeared in equation 2.14, and the second term, called the momentum term, takes into account the previous iteration to evaluate the "continuity of the descent". This allows accelerating training in certain situations, with a weight given by the momentum parameter $\alpha \in [0, 1[$. Hence, the weight update equation for SGD with the momentum acceleration technique is:

$$w_{k+1} = w_k - \eta \nabla E + \alpha \Delta w_{k-1} \qquad (2.16)$$
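A minimal sketch of this update rule, where grad_E is an illustrative placeholder for a function returning the gradient of the loss with respect to the weights:

```python
import numpy as np

def sgd_momentum_step(w, delta_w_prev, grad_E, lr=0.01, alpha=0.9):
    """One update following Eq. (2.16); grad_E is a callable returning dE/dw."""
    delta_w = -lr * grad_E(w) + alpha * delta_w_prev   # Eq. (2.15c)
    return w + delta_w, delta_w                        # Eq. (2.15a)

# Illustrative use on E(w) = ||w||^2, whose gradient is 2w
w, dw = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, dw = sgd_momentum_step(w, dw, lambda w: 2 * w)
print(w)   # approaches the minimizer w = 0
```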

2.3.4 Backpropagation

Originally introduced in the 1970s, the backpropagation algorithm was only taken seriously in 1986, when Rumelhart, Hinton, and Williams [25] showed that backpropagation could actually outperform earlier approaches to learning; from then on, it has become a standard in the training of neural networks.

In order to fully comprehend the backpropagation algorithm, let us look at a feedforward network and establish its notation. Note that the following description of the algorithm is based on a more detailed one in [26].

The notation used throughout this subsection is defined as follows:

• $w^l_{jk}$ represents the weight of neuron $j$ in layer $l$, coming from neuron $k$ in the previous layer $(l-1)$;

• $b^l_j$ represents the bias of neuron $j$ in layer $l$;

• $z^l_j$ represents the sum of the weighted inputs of neuron $j$ in layer $l$;

• $a^l_j$ represents the activation function's output of neuron $j$ in layer $l$.

Figure 2.4 allows a graphic interpretation of the aforementioned notation.

Figure 2.4: Artificial Neural Networks (ANN) with notation (source: [26] adapted)

Using this notation, the way these variables relate is given by the following equations:

$$z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j \qquad (2.17a)$$
$$a^l_j = \sigma(z^l_j) \qquad (2.17b)$$

where $\sigma(\cdot)$ is an activation function as described in section 2.3.1. Therefore, combining the latter equations (2.17a and 2.17b) results in a direct relationship between the inputs and the output of a neuron:

$$a^l_j = \sigma\!\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right) \qquad (2.18)$$

Rewriting equation 2.18 in a compact matrix form provides a clearer, layer-wise point of view:

$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l) \qquad (2.19)$$

where $\mathbf{a}^l$ is just $a^l_j$ vectorized; the same principle applies to $\mathbf{b}^l$ and to $\mathbf{W}^l$. In this last case the dimensions are $j$ rows by $k$ columns. In the remainder of this algorithm description, bold letters correspond to matrix or vector forms.

To finalize the feedforward procedure, the only thing missing is the computation of the error. As already mentioned in section 2.3.2, a loss function is used to achieve that. However, it must be possible to write this loss as a function of the network's output, and as an average over the losses of individual training examples.

Considering $\sigma'(\cdot)$ as the first derivative of the activation function, $L$ as the output layer, $\nabla_a C$ as the vector of partial derivatives $\partial C / \partial a^L_j$ (representing the variation of the cost with the output), $\delta^l_j$ as the error of neuron $j$ in layer $l$, and $\odot$ as the Hadamard (elementwise) product, the main equations of the backpropagation algorithm can now be presented.

$$\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad (2.20a)$$
$$\delta^l = \left((\mathbf{W}^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \qquad (2.20b)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (2.20c)$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \qquad (2.20d)$$

Equation 2.20a defines the error in the output layer $L$. Equation 2.20b works in a similar way, but instead of depending on the network's output it depends on the error of the following layer, propagating it backwards. These two equations together allow the computation of the error for every neuron in every layer. Combining this with equations 2.20c and 2.20d, it is possible to obtain the gradient of the cost with respect to every weight and bias ($w^l_{jk}$ and $b^l_j$). All these equations are derived in [26].
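The following NumPy sketch traces equations 2.17 to 2.20 for a network with a single hidden layer, sigmoid activations and a quadratic cost; the shapes and random data are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, W1, b1, W2, b2):
    # Feedforward pass, Eqs. (2.17)/(2.19)
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Output error, Eq. (2.20a); for a quadratic cost, grad_a C = (a2 - y)
    delta2 = (a2 - y) * sigmoid_prime(z2)
    # Backpropagated error, Eq. (2.20b)
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

    # Gradients with respect to biases and weights, Eqs. (2.20c) and (2.20d)
    grad_b2, grad_W2 = delta2, np.outer(delta2, a1)
    grad_b1, grad_W1 = delta1, np.outer(delta1, x)
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(2)
W1, b1 = rng.random((3, 4)), rng.random(3)
W2, b2 = rng.random((2, 3)), rng.random(2)
grads = backprop(x, y, W1, b1, W2, b2)
```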

2.4 Generative Adversarial Networks

Having considered ANNs with one input layer, one hidden layer, and one output layer, one may now dive into more complex structures. When a network has more than one hidden layer it is called a Deep Neural Network (DNN). In this section, an unsupervised learning algorithm that consists of the interaction between DNNs will be covered.

Generative Adversarial Networks (GAN) were developed by Goodfellow, Pouget-Abadie, Mirza, et al. [11] in 2014. They proposed the use of an adversarial process to estimate generative models. Two Artificial Neural Networks are trained in parallel. One of them, called the generator network, generates samples based on a vector sampled from a latent-space distribution, and the other, called the discriminator network, learns to determine whether a sample comes from the training data or from the generator. The training procedure for the generator network is to maximize the probability of the discriminator misclassifying the generations. Meanwhile, the discriminator network is trained to distinguish between real data and generated data. This process corresponds to a two-player minimax game.

In order to provide a better understanding of this training idea, a comparison with a more practical problem is commonly made, namely the interaction between a counterfeiter and a bank. The bank classifies money as real or counterfeit based on the features that distinguish them. However, the counterfeiter gets feedback on those classifications and, in order to be successful, tries to mitigate the differences between real money and his counterfeits so they become as realistic as possible. If the counterfeiter is competent, he will eventually end up making indistinguishable money. This can be seen as a competition between the bank and the counterfeiter, just like the one between the discriminator and the generator networks.

The remainder of this section is based entirely on the work of Goodfellow, Pouget-Abadie, Mirza, et al. [11]; therefore the same notation will be used when addressing GANs, as follows:

• $x$ represents a real data structure, drawn from the $p_{data}$ distribution;

• $z$ represents a random vector, sampled from the $p_z$ distribution;

• $G(\cdot)$ represents the generator network as a function of some input;

• $D(\cdot)$ represents the discriminator network as a function of some input.

Considering the aforementioned notation, one can tell that $G(z)$ represents a generated sample, $D(x)$ represents the classification of real samples, and $D(G(z))$ represents the classification of generated samples. A visual inspection of figure 2.5 may make this clearer.

Figure 2.5: Generative Adversarial Network (GAN) high level architecture.

To train both networks, a loss function has to be defined. Intuitively, the discriminator loss evaluates

how well it did at letting real samples go through (i.e. comparing D(x) to 1) plus how often it was ”fooled”

by the generator (i.e. comparing D(G(z)) to 0), whilst the generator loss evaluates how often it failed at making realistic samples (i.e. comparing D(G(z)) to 1).

In a formal approach, this training method can actually be interpreted as a two-player minimax game

between D and G with a value function V (D,G), mathematically formulated by:


$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \qquad (2.21)$$

Other cost functions have been found to be useful when dealing with GANs [27]. Still, cost functions other than the minimax one (used in Goodfellow, Pouget-Abadie, Mirza, et al. [11]) were not tested in this work.
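A minimal sketch of the losses implied by equation 2.21, where D and G are illustrative placeholders for the discriminator and generator networks (callables returning, respectively, the probability of a sample being real and a batch of generated samples):

```python
import numpy as np

def gan_losses(D, G, x_real, z):
    d_real = D(x_real)        # D(x), compared to 1 by the discriminator
    d_fake = D(G(z))          # D(G(z)), compared to 0 by the discriminator

    # The discriminator ascends V(D, G), i.e. minimizes its negation
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))
    # The generator descends V(D, G), i.e. minimizes log(1 - D(G(z)))
    g_loss = np.mean(np.log(1.0 - d_fake))
    return d_loss, g_loss
```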

For a visual perception of how the variables actually change during training, one can analyze the different stages in figure 2.6.

Consider, in figure 2.6(a), the generator's distribution (green, solid line) and the true data distribution (black, dotted line) as an adversarial pair near convergence: the distributions are still distinguishable and the classifier (blue, dashed line) is only partially accurate. After training the discriminator, in figure 2.6(b), it eventually converges to $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$. After the generator is updated, the discriminator leads the generator's distribution (green, solid line) to get closer to the true data distribution (black, dotted line), as in figure 2.6(c). At last, after several steps of training, a point is reached where $p_g = p_{data}$. Thus, the classifier (blue, dashed line) will not be able to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$, as pictured in figure 2.6(d).


Figure 2.6: Informal explanation of minimax training algorithm. The generator’s distribution pg is repre-sented as a green solid line, the real data distribution pdata as a black dotted line, and the discriminator’sclassification as a blue dashed line (source:[11])

Training GANs has been found to be a very challenging task. The main issue is that the gradient descent algorithm works well when the goal is to minimize a loss function, which is not exactly the case here. The difficulty in GAN training is to find the Nash equilibrium of a non-convex game, a state in which neither player changes its strategy regardless of the opponent's decisions. Using gradient descent to change the parameters of the discriminator may have a positive impact on the discriminator's loss but a negative one on the generator's loss, or vice versa. Hence, instead of converging, the solution oscillates.

GANs are known to be quite difficult to train, since they lack performance measures. When the training of the discriminator and generator networks is not well balanced, the GAN may enter a failure mode such as mode collapse [28] or vanishing gradients.

Mode collapse occurs when the generator network generates a limited set of samples, or even a single sample, regardless of the random vector z. Since the discriminator network does not actually enforce diversity in the generator's outputs, all of them may converge to the same point that the discriminator network believes is realistic. This failure mode may be identified by inspecting the generator's generations: if they remain the same regardless of the random vector, then the generator has managed to "fool" the discriminator.

The vanishing gradient occurs when the loss drops to zero, leaving no gradient updates. This problem arises when the discriminator is too good, resulting in an extremely slow learning process. On the other hand, when the discriminator behaves poorly, the generator does not get accurate feedback, which keeps it from representing reality.

Radford, Metz, and Chintala [12] proposed a convolutional architecture for the GAN model that proved to be more stable to train, and named it the Deep Convolutional GAN (DCGAN). The DCGAN is a GAN in which both the discriminator and the generator use convolutional layers. Before approaching the DCGAN model, the Convolutional Neural Network should be detailed first.

2.5 Convolutional Neural Network

As covered in section 2.3, a neuron in a certain layer is connected to all the neurons in the previous

layer. This section will cover an alternative to this way of connecting neurons.

Just as with ANNs, the motivation for CNNs came from nature, specifically from the visual

cortex of animals. The main idea behind this is that the neurons in the visual cortex get different types

of information in different layers, depending on what they are focusing on.

The application of this process takes as input a 3-dimensional representation of data (e.g. an RGB image) and tries to establish a relationship with some target, for instance a classification. This relationship is encoded in weights, just as in a regular neural network. The main difference here is that each neuron is only connected to a small region of the previous layer, instead of to all of its neurons as in a fully-connected layer.

There are three main types of layers in a CNN: convolutional layers, pooling layers and fully-connected layers. In this section these layers will be clarified, and an example CNN architecture will be given.

Note that the remainder of this section regarding CNNs is based on [29], therefore, for the sake of

simplicity, the input data will be assumed to be an RGB image.

2.5.1 Convolutional Layer

A convolutional layer consists of a set of weights, also known as filters or kernels. The filters, which are the learnable variables of this layer, are small spatial extents with a depth equal to that of the input data. As the name suggests, this layer performs a mathematical operation called convolution, in this particular case a 2-dimensional convolution. This operation consists of the computation of dot products between the filter's entries and the input, at every position. This results in a spatial connection


between neurons, as represented in figure 2.7.

(a) Volume perspective (b) Neuron perspective

Figure 2.7: Example of neuron disposition in the first convolutional layer. In (a) the input (red shape) and output (blue shape) of the first convolutional layer are spatially represented. Each of the 5 neurons (circles) corresponds to the result of the convolution between a filter and a spatial region of the input (receptive field). In (b) the neuron has 3 inputs, according to the depth of the previous layer. This demonstrates that each neuron in a layer represents the receptive field in the whole depth of the previous layer. (source: [29] adapted)

There are four hyperparameters that directly affect the output of this layer: filter, depth, stride and

zero-padding. A description of each one of these is given, as well as a visual example in figure 2.8.

• Filter: This hyperparameter concerns the spatial extent of the filter. The height and width can be different from each other, whereas the depth of the filter is always the same as the depth of the input data (Fd = Id).

• Depth: This hyperparameter corresponds to the number of filters to be learned. Each one of the

filters should be looking for different features of the input data. A depth column will be used to

address a set of neurons that concern the same region. For instance, in the first convolutional

layer, the input data depth corresponds to the number of channels of the image (Id = 3 in the RGB

case).

• Stride: This hyperparameter defines the way the filter slides through the input data. If stride is set

to 1, the convolution is computed for every pixel, resulting in the same output dimensions as the

input ones. However, if it is set to 2, the convolution is only computed every 2 pixels, resulting in

reduced (half) output dimensions. The stride is valid along the height and the width axis. Still, the

notation will be simplified when height and width are equal, e.g. stride 2 × 2 will be referred to as

S = 2.

• Zero-padding: This hyperparameter fills the spatial borders of the input with zeros. It is generally used to preserve the input's width and height in the output shape. When set to "same" or "half", implying no changes in the dimensions, it means that P = (F − 1)/2.


Based on all these hyperparameters, it is possible to formulate an equation that determines the

output size of the convolutional layer, as follows:

Oh,w = (Ih,w − Fh,w + 2P)/S + 1 (2.22a)

Od = K (2.22b)

where Oh,w is the output’s height and width, Ih,w is the input’s height and width, and Fh,w is the filter’s

height and width. The variable P defines the amount of zero-padding on the border and the variable S

the stride. Od is the output’s depth, which is defined by the number of filters used, K.

Figure 2.8: Dimensioning the output of the convolutional layer. Convolving a 3 × 3 filter (shade) over a 5 × 5 input using half padding and unit stride (i.e. Ih,w = 5, Fh,w = 3, S = 1, P = 1). (source: [30])
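To make equation 2.22a concrete, the following minimal Python sketch (illustrative only, not code from this work) evaluates the output size for the configuration of figure 2.8:

def conv_output_size(i, f, s=1, p=0):
    # Output height/width of a convolutional layer, equation 2.22a.
    return (i - f + 2 * p) // s + 1

# Figure 2.8: 5x5 input, 3x3 filter, half padding, unit stride.
print(conv_output_size(5, 3, s=1, p=1))  # 5 -> the spatial dimensions are preserved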

2.5.2 Pooling Layer

The pooling layer is key to ensure that consecutive layers are able to identify larger-scale features. This

kind of layer is used to reduce the spatial size of its input, consequently reducing the number of

parameters and computations. Only the height and width are downsampled, leaving the depth with the

same dimensions. There are other pooling operations [31], but the most frequently used, due to better

results in practice, is the Max Pooling, exemplified in figure 2.9.

Just as in the convolutional layer (section 2.5.1), the pooling layer is also subject to the filter and stride hyperparameters. Here, the filter defines the spatial extent over which the maximum function is computed. The hyperparameters are usually set to Fh,w = 2 and S = 2 (as exemplified in figure 2.9), or to Fh,w = 3 and S = 2, which entails overlapping pooling.

The equation that determines the output size of the Pooling Layer is the following:

Oh,w = (Ih,w − Fh,w)/S + 1 (2.23a)

Od = Id (2.23b)

where Id denotes the input's depth, and the other variables are the same as in equations 2.22.


Figure 2.9: Example of Max Pooling with a 2 × 2 filter over a 4 × 4 input using 2 × 2 stride (i.e. Ih,w =4, Fh,w = 2, S = 2).
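As another illustrative sketch (again not from the thesis code), the max pooling of figure 2.9 and the output size of equation 2.23a can be reproduced with NumPy:

import numpy as np

def pool_output_size(i, f, s):
    # Output height/width of a pooling layer, equation 2.23a.
    return (i - f) // s + 1

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 1, 0],
              [3, 2, 9, 4]])
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max over each 2x2 block
print(pool_output_size(4, 2, 2), pooled)          # 2 and [[6 5] [8 9]]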

2.5.3 Fully-connected Layer

A fully-connected (or dense) layer, as the name suggests, is a layer whose neurons are connected to all of the previous layer's neurons. This is just a regular hidden layer as seen in section 2.3, used broadly at the end of the network to interpret more complex structures. Its input is, in this case, a rearrangement of the neurons from the 3-dimensional shape into a single dimension.

This same representation can be achieved through a convolutional layer. If the filter height and width match the input height and width, then the convolution will result in a 1 × 1 shape, which corresponds to one neuron per filter. This type of layer has a spatial shape of 1 × 1 × K, which is nothing more than a single-dimension vector, where K still denotes the number of filters.

2.5.4 CNN architecture

It has been seen that CNNs work over 3-dimensional input shapes. The main process in this architecture

is composed of a set of 3 layers, which usually go together in the following order:

Conv −→ ReLU −→ Pool

where the first layer is a convolutional layer (section 2.5.1), the second layer computes the Rectified

Linear Unit (ReLU) activation function (section 2.3.1) element-wise, and the third layer is a pooling layer

(section 2.5.2). This set of layers is placed just after the input layer, and is usually stacked to achieve

deeper networks. As the chain grows, the height and width of the neurons' spatial arrangement decrease while the number of kernels applied increases. Afterwards, the 3-dimensional arrangement of neurons that results from the output of the previous layer is unfolded into an array-like structure. Lastly, a fully-connected layer (section 2.5.3), followed by an activation function such as sigmoid or tanh, produces the desired output shape (e.g. 3 labeled nodes). Figure 2.10 shows a concrete example of a CNN architecture.


Figure 2.10: Example of CNN architecture. This includes 4 sets of convolutional, ReLU and pooling layers, referred to as "Conv + Maxpool", and 2 fully-connected layers, referred to as "FC". (source: [32])
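For illustration, a stack of this kind could be written with the TensorFlow 1.x layers API roughly as follows (layer sizes are arbitrary examples, not the ones used later in this work):

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 64, 64, 3])      # batch of RGB images
x = images
for filters in (32, 64):                                     # two Conv -> ReLU -> Pool blocks
    x = tf.layers.conv2d(x, filters, kernel_size=5, padding='same', activation=tf.nn.relu)
    x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)   # halves height and width
x = tf.layers.flatten(x)                                     # unfold to a 1-D vector per sample
logits = tf.layers.dense(x, 3)                               # e.g. 3 labeled output nodes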

2.6 Deep Convolutional Generative Adversarial Network

The Deep Convolutional GAN (DCGAN) model is no more than a GAN whose discriminator and generator networks comprise some convolutional layers, just as CNNs do. The work of Radford, Metz, and Chintala led to the adoption of some already proven modifications to the standard CNN architecture. These will be detailed in the following subsection, together with complementary theoretical aspects.

2.6.1 DCGAN approach

Strided Convolutions

The first modification regards the use of pooling layers to reduce spatial dimensions. Questioning the requirement of different layer types in the pipeline, "The all convolutional net", by Springenberg, Dosovitskiy, Brox, et al. [33], proposes dropping the pooling layer from the architecture, relying only on convolutions

with non-unitary stride to do that job. This approach is broadly known as strided convolutions. As seen

in equation 2.22a (section 2.5.1) the stride, S, has a scaling property on the output dimensions of the

convolutional layer. Hence, by changing the stride value, one can achieve the same downsampling as

with the pooling layer, as shown in figure 2.11. Furthermore, Springenberg, Dosovitskiy, Brox, et al. found that this proposal is not only valid without any loss of accuracy on recognition tasks, but actually gives state-of-the-art performance.

This approach will be used in the discriminator network, similarly to a CNN.


Figure 2.11: Dimensioning the output of the convolutional layer. Convolving a 3 × 3 filter (shade) over a 5 × 5 input with 1 × 1 padding using 2 × 2 stride (i.e. Ih,w = 5, Fh,w = 3, S = 2, P = 1). (source: [30])
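As a quick check of this downsampling (a hypothetical verification, reusing equation 2.22a with the numbers of figure 2.11):

# (I - F + 2P)/S + 1 with I=5, F=3, P=1, S=2
print((5 - 3 + 2 * 1) // 2 + 1)  # 3: the stride alone performs the downsampling, no pooling layer needed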

Fractional-Strided Convolutions

Regarding the generator network, the process requires an upsampling to create a shape with dimensions

equal to the input one, based on a single vector. This is called fractional-strided convolution, also known

as transposed convolution [30]. This concept is very similar to the previous one, but the transposed

convolution is computed instead.

The variables with an apostrophe (e.g. O′h,w) concern the fractional-strided convolution, and their meaning is the same as described in equation 2.22a (section 2.5.1). There is a relationship between the fractional-strided convolution variables and those of the equivalent direct convolution computed over the stretched input, as follows:

• Ĩ′h,w = I′h,w + (I′h,w − 1)(S − 1), where Ĩ′h,w represents the size of the stretched input, comprising the zeros added between the input units.

• F ′h,w = Fh,w

• S′ = 1

• P ′ = Fh,w − P − 1

The output of this fractional-strided convolution can be defined through the following mathematical

equations:

O′h,w = S(I ′h,w − 1) + Fh,w − 2P (2.24a)

O′d = K (2.24b)

This process is somewhat complex, therefore, to understand what these variables represent in practice, a visual example is provided in figures 2.12 and 2.13.
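The same kind of sanity check can be sketched for equation 2.24a (illustrative only), using the configurations of figures 2.12 and 2.13:

def transposed_conv_output_size(i, f, s, p):
    # Output height/width of a fractional-strided convolution, equation 2.24a.
    return s * (i - 1) + f - 2 * p

print(transposed_conv_output_size(2, 3, 2, 0))  # 5, as in figure 2.12
print(transposed_conv_output_size(3, 3, 2, 1))  # 5, as in figure 2.13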

Eliminating Fully-Connected Layers

The next modification adopted by Radford, Metz, and Chintala concerns the trend to eliminate fully-

connected layers upon convolutional features. One convincing example of this is global average-pooling,


Figure 2.12: Dimensioning the output of a fractional-strided convolutional layer. The transpose of convolving a 3 × 3 filter over a 5 × 5 input using 2 × 2 stride (i.e. Ih,w = 5, Fh,w = 3, S = 2, P = 0). It is equivalent to convolving a 3 × 3 filter (shade) over a 2 × 2 input (with 1 zero between the units), with 2 × 2 padding and unit stride (i.e. I′h,w = 2, Ĩ′h,w = 3, F′h,w = Fh,w, S′ = 1, P′ = 2). (source: [30])

Figure 2.13: Dimensioning the output of a fractional-strided convolutional layer. The transpose of convolving a 3 × 3 filter over a 5 × 5 input with 1 × 1 padding, using 2 × 2 stride (i.e. Ih,w = 5, Fh,w = 3, S = 2, P = 1). It is equivalent to convolving a 3 × 3 filter (shade) over a 3 × 3 input (with 1 zero between the units), with 1 × 1 padding and unit stride (i.e. I′h,w = 3, Ĩ′h,w = 5, F′h,w = Fh,w, S′ = 1, P′ = 1). (source: [30])

used in some state-of-the-art image classification models [34], [35]. This alternative, though, has its drawbacks, namely hurting convergence speed. Still, a middle ground, connecting convolutional layers directly with the inputs or outputs of the network, was proven to work well.

In the first layer of the generator network of a GAN, the random vector undergoes a transformation in order to become a 4-dimensional shape, before the convolutional computations. This could be called a fully-connected layer, since it just consists of a linear matrix multiplication. Regarding the discriminator network, on the other hand, at the end of the convolutional stack the neurons are flattened from their spatial arrangement and then fed into a single sigmoid output.

Batch Normalization

In order to accelerate training in deep neural network architectures, Ioffe and Szegedy [36] developed the batch normalization method. The change in the distribution of a layer's inputs caused by


the variations in the previous layer's parameters during training is defined as internal covariate shift. Batch normalization aims at reducing this issue. By normalizing the mini-batch input of a layer so that each unit has zero mean and unit variance, batch normalization has proven to be a useful tool to deal with training problems, for instance extremely high or low activation outputs, helping the gradient flow.

This method has become critical to prevent the generator network from collapsing its samples to a single point, which is a common issue in GANs known as mode collapse. Applying this method to all layers, however, resulted in model instability, so Radford, Metz, and Chintala circumvented this by not applying batch normalization to the generator's output layer nor to the discriminator's input layer.
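A minimal NumPy sketch of the normalization step (the learnable scale gamma and shift beta of [36] are included as plain arguments; this is illustrative and not the TensorFlow implementation used later):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each unit (column) of a mini-batch to zero mean and unit variance.
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

batch = np.random.randn(128, 16) * 10 + 3      # mini-batch with large mean and variance
out = batch_norm(batch)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # approximately 0 and 1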

2.6.2 Detailed Architecture

After these considerations on the original DCGAN model from Radford, Metz, and Chintala, all that is missing is a global view of the architecture. This section covers all the architectural aspects, addressing the generator and discriminator networks independently.

Generator Architecture

As covered in section 2.4, the generator network’s role is to create a realistic sample, based on a vector

sampled from a random distribution. This can be achieved in several different ways, but the original

DCGAN will be the one in scope. The architectural aspects of this process are now subject of further

analysis.

First, the generator's input, z, was randomly sampled from a normal distribution with zero mean and unitary variance, producing a 100 × 1 vector. In order to be processed by convolutional layers, this vector must become a 3-dimensional shape. Here is where the partially fully-connected layer takes place. The random vector's neurons are fed into the next layer's neurons, where they are subject to a matrix multiplication but not to any activation function, as would happen in a normal fully-connected layer. This projected the 100 × 1 vector into a 16384 × 1 vector, which was then reshaped into the desired 3-dimensional extent 4 × 4 × 1024.

The following steps consist entirely of fractional-strided convolutions (section 2.6.1), resulting in the

real data output shape, with dimensions 64 × 64 × 3. Figure 2.14 embodies the network layering, and

table 2.1 complements it with the dimensioning of the convolutional layers.

Regarding activation functions (section 2.3.1), every layer uses the ReLU non-linearity as the last

step, except the last one that uses tanh. The convolutional layers are subject of batch normalization

(section 2.6.1) in every layer except the last (”Conv 4” in the example of figure 2.14).

Discriminator Architecture

As covered in section 2.4, the discriminator network’s role is to classify the input in real or generated

data. Again, the original DCGAN will be in scope. The architectural aspects of this process are now

subject of further analysis.


Figure 2.14: Original DCGAN's generator network architecture. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. The dark blue rectangle, on the left, denotes the z random vector, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents an input-data-like structure, also referred to as "fake sample".

Layer Input Filter Stride #Filters Output

Conv 1 4× 4× 1024 5× 5× 1024 2× 2 512 8× 8× 512

Conv 2 8× 8× 512 5× 5× 512 2× 2 256 16× 16× 256

Conv 3 16× 16× 256 5× 5× 256 2× 2 128 32× 32× 128

Conv 4 32× 32× 128 5× 5× 128 2× 2 3 64× 64× 3

Table 2.1: DCGAN generator convolutional layer’s specifications.
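The layer sequence of figure 2.14 and table 2.1 could be expressed, in compressed form, with the TensorFlow 1.x layers API as sketched below (initializers, variable scopes and other details of a real implementation are omitted; this is illustrative, not the original code):

import tensorflow as tf

def dcgan_generator(z, training=True):
    # Project the 100-d random vector and reshape it to 4 x 4 x 1024.
    x = tf.layers.dense(z, 4 * 4 * 1024)
    x = tf.reshape(x, [-1, 4, 4, 1024])
    x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
    # Conv 1-3: fractional-strided convolutions with batch norm and ReLU.
    for filters in (512, 256, 128):
        x = tf.layers.conv2d_transpose(x, filters, kernel_size=5, strides=2, padding='same')
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
    # Conv 4: no batch normalization, tanh activation, 3 output channels.
    x = tf.layers.conv2d_transpose(x, 3, kernel_size=5, strides=2, padding='same')
    return tf.tanh(x)

z = tf.placeholder(tf.float32, [None, 100])
fake_sample = dcgan_generator(z)   # shape (batch, 64, 64, 3)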

First, the discriminator's input is a 3-dimensional shape with dimensions 64 × 64 × 3, which may be a real sample from the dataset, x, or a generation from a random vector z (resulting from the generator

network), G(z). The following steps consist entirely of strided convolutions (section 2.6.1), resulting in a

3-dimensional shape with much lower height and width values.

Figure 2.15 embodies the network layering, and table 2.2 complements it with the dimensioning of

the convolutional layers.

The subsequent step is to reshape the output of the last convolutional layer into a 1-dimensional vector, in this case reshaping 4 × 4 × 512 into 8192 × 1. These 8192 units are fully connected to a single neuron that is responsible for classification.


Figure 2.15: Original DCGAN's discriminator network architecture. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. The dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.

Layer Input Filter Stride #Filters Output

Conv 1 64× 64× 3 5× 5× 3 2× 2 64 32× 32× 64

Conv 2 32× 32× 64 5× 5× 64 2× 2 128 16× 16× 128

Conv 3 16× 16× 128 5× 5× 128 2× 2 256 8× 8× 256

Conv 4 8× 8× 256 5× 5× 256 2× 2 512 4× 4× 512

Table 2.2: Original DCGAN discriminator convolutional layer’s specifications.

Regarding activation functions (section 2.3.1), every layer uses the LReLU non-linearity as the last

step, except the last one that uses sigmoid. The convolutional layers are subject of batch normalization

(section 2.6.1) in every layer except the first one (”Conv 1” in the example of figure 2.15).
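Analogously, the discriminator of figure 2.15 and table 2.2 can be sketched as follows (same caveats as for the generator sketch above):

import tensorflow as tf

def dcgan_discriminator(x, training=True):
    # Conv 1: strided convolution with LReLU, no batch normalization.
    x = tf.nn.leaky_relu(tf.layers.conv2d(x, 64, kernel_size=5, strides=2, padding='same'), alpha=0.2)
    # Conv 2-4: strided convolutions with batch norm and LReLU.
    for filters in (128, 256, 512):
        x = tf.layers.conv2d(x, filters, kernel_size=5, strides=2, padding='same')
        x = tf.nn.leaky_relu(tf.layers.batch_normalization(x, training=training), alpha=0.2)
    # Flatten the 4 x 4 x 512 block and classify with a single sigmoid unit.
    x = tf.layers.flatten(x)
    logit = tf.layers.dense(x, 1)
    return tf.sigmoid(logit), logit

samples = tf.placeholder(tf.float32, [None, 64, 64, 3])   # x or G(z)
probability, logit = dcgan_discriminator(samples)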


Chapter 3

Proposed Models

In this chapter, all the proposed models will be described. These follow the nature of the DCGAN (section 2.6). The convolutional approach and the modifications proposed by Radford, Metz, and Chintala (in section 2.6.1) are adopted here, albeit with slight changes.

The main part of the algorithm is shared by all the models proposed in this work. The dynamics

between the discriminator and generator networks have already been covered in the previous section,

thus the focus now is their individual structure.

The process starts by taking 100 samples from a normal distribution N(0, 1) to create the random vector z. This vector is fed to the generator network and passes through a linear layer, consisting of linear operations, with the weight matrix randomly initialized from a normal distribution N(0, 0.2) and the bias initialized as 0. The output of this linear layer is reshaped into a 3-dimensional shape and batch-normalized, and the ReLU non-linearity is then applied. The following steps consist of convolutional layers, whose parameters will be specified later, since they differ according to each proposed model. Still, the input of each convolutional layer is batch-normalized and the output is subject to a non-linearity, namely the ReLU, with the exception of the last one, where the tanh non-linearity is used.

The discriminator network takes as input a 3-dimensional shape regarding real or generated data. The following steps consist of convolutional layers, whose parameters will be specified later, since they differ according to each proposed model. Still, after each convolutional layer a non-linearity is applied, namely the LReLU with a leak of 0.2, and, with the exception of the first, all convolutional layers' inputs are batch-normalized. After the convolutional procedures, the 3-dimensional output shape is flattened and fed into a linear layer, whose parameters are initialized in the same way as the linear layer used in the generator network, to produce a single output unit. To this last unit, a sigmoid non-linearity is applied.

The loss function used to measure the performance of each network, according to the output of the discriminator network as covered in section 2.4, was the cross-entropy covered in section 2.3.2. Note that the use of mini-batches is accounted for by averaging the loss over the whole batch.
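One common way to express these cross-entropy losses in TensorFlow 1.x is sketched below, under the assumption that the discriminator also exposes its pre-sigmoid logits (a generic formulation, not necessarily the exact code of this work):

import tensorflow as tf

def gan_losses(real_logits, fake_logits):
    # Cross-entropy losses for both networks, averaged over the mini-batch.
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    d_loss_real = tf.reduce_mean(bce(labels=tf.ones_like(real_logits), logits=real_logits))
    d_loss_fake = tf.reduce_mean(bce(labels=tf.zeros_like(fake_logits), logits=fake_logits))
    g_loss = tf.reduce_mean(bce(labels=tf.ones_like(fake_logits), logits=fake_logits))
    return d_loss_real + d_loss_fake, g_loss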

Let us now approach the convolutional procedures deferred above. Considering the input data dimensions, it is not possible to use the exact original DCGAN model. The original DCGAN is expected to work over data samples with power-of-2 dimensions, since the strided convolutions were always 2 × 2,


which does not happen in this particular case. Hence, the intuitive solution was to factorize both the height and the width of the data to dimension the strides, and arrange the factors properly by matching the number of multiplicative factors, as follows:

Ih = 340 = 17× 5× 4

Iw = 345 = 23× 5× 3

With these factorized dimensions, one can conclude that only 2 strided convolutions can be com-

puted.

Note that this dimensioning conflict is only verified in the generator network, since the zero-padding in the discriminator network allows it to deal with any input dimensions. However, in order to keep some balance between both networks' computations, the same number of convolutional layers with equivalent strides was used.

Regarding the filter dimensioning, in order to guarantee that each unit is accounted for in at least 2 receptive fields when striding in a certain direction, the filter size should follow:

Fh,w = Sh,w × 2 + 1

The number of filters chosen in the convolutional layers is a power of 2, just as in the original DCGAN. That quantity is relevant to detect details in the data, and therefore the numbers were chosen based on whether they decently represented the data in question when performing some validation tests.

Three models are proposed with different approaches to the convolutional steps.

3.1 Model 1

Model 1 is very similar to the original DCGAN model. Due to the different dimensions of the data, the depth of the networks had to be changed, having only 2 convolutional layers each. The dimensioning of the filter and the stride has also changed, but the filters are still relatively small spatial extents and the strides are in the same order of magnitude.

The Model 1 network architectures are presented in figure 3.1, and the corresponding convolutional layer specifications are described in table 3.1.

3.2 Model 2

In this model, striding along the width is proposed, which means that the receptive field covers a temporal interval at a time but considers the whole frequency content occurring in that interval. This ensures that the first convolutional layer is fully responsible for the analysis of the frequency component, i.e. which notes are being played. The following convolutional layers will, therefore, control the temporal relationship


(a) Generator network architecture. (b) Discriminator network architecture.

Figure 3.1: Model 1 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the z random vector, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.

Network Layer Input Filter Stride #Filters Output

Generator Conv 1 17× 23× 256 11× 11× 256 5× 5 128 85× 115× 128

Generator Conv 2 85× 115× 128 9× 7× 128 4× 3 2 340× 345× 2

Discriminator Conv 1 340× 345× 2 9× 7× 2 4× 3 128 85× 115× 128

Discriminator Conv 2 85× 115× 128 11× 11× 128 5× 5 256 17× 23× 256

Table 3.1: Model 1 convolutional layer’s specifications.
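These dimensions can be cross-checked with simple arithmetic: with "same"-style padding, a fractional-strided convolution multiplies each spatial dimension by its stride, while a strided convolution divides it (rounding up). The sketch below is only a sanity check of table 3.1, not part of the implementation:

import math

strides = [(5, 5), (4, 3)]                  # generator strides for Model 1 (table 3.1)
h, w = 17, 23                               # spatial size after reshaping the projected z vector
for sh, sw in strides:                      # fractional-strided convolutions upsample by the stride
    h, w = h * sh, w * sw
print(h, w)                                 # 340 345, the dataset sample dimensions

h, w = 340, 345
for sh, sw in reversed(strides):            # the discriminator downsamples in reverse order
    h, w = math.ceil(h / sh), math.ceil(w / sw)
print(h, w)                                 # 17 23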

between each time interval.

The Model 2 network architectures are presented in figure 3.2, and the corresponding convolutional layer specifications are described in table 3.2.

3.3 Model 3

Model 3 is essentially the opposite of Model 2. In this model, striding along the height is proposed, which means that the receptive field covers a frequency interval at a time but considers the whole time series. This ensures that the first convolutional layer is fully responsible for the analysis of the temporal


(a) Generator network architecture. (b) Discriminator network architecture.

Figure 3.2: Model 2 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the z random vector, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.

Network Layer Input Filter Stride #Filters Output

Generator Conv 1 1× 23× 512 1× 11× 512 1× 5 256 1× 115× 256

Generator Conv 2 1× 115× 256 340× 7× 256 340× 3 2 340× 345× 2

Discriminator Conv 1 340× 345× 2 340× 7× 2 340× 3 256 1× 115× 256

Discriminator Conv 2 1× 115× 256 1× 11× 256 1× 5 512 1× 23× 512

Table 3.2: Model 2 convolutional layer’s specifications.

component, i.e. in which time-steps a certain note is being played. The following convolutional layers

will, therefore, control the relationship between different notes.

The Model 3 network architectures are presented in figure 3.3, and the corresponding convolutional layer specifications are described in table 3.3.


(a) Generator network architecture. (b) Discriminator network architecture.

Figure 3.3: Model 3 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the z random vector, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.

Network Layer Input Filter Stride #Filters Output

Generator Conv 1 17× 1× 512 11× 1× 512 5× 1 256 85× 1× 256

Generator Conv 2 85× 1× 256 9× 345× 256 4× 345 2 340× 345× 2

Discriminator Conv 1 340× 345× 2 9× 345× 2 4× 345 256 85× 1× 256

Discriminator Conv 2 85× 1× 256 11× 1× 256 5× 1 512 17× 1× 512

Table 3.3: Model 3 convolutional layer’s specifications.


Chapter 4

Implementation and Results

4.1 Dataset

One dataset will be used to train the proposed generative model. This was built over a melody improvised by the author over a diatonic chord progression. Throughout its whole extent, some musical features (covered in section 2.1) should be accounted for:

• The tempo is fixed to 120 BPM;

• The meter is set to 4 beats per bar, and as consequence of the previous point 1 bar lasts 2

seconds;

• The minimum note value is a quarter note, i.e. each note lasts a minimum of 0.5 seconds.

• The melody notes are diatonic in the key of A minor.

In order to be treatable, the full melody wave file was split into 2-second (1 bar) segments. These were subject to a transform, as may be seen in figure 4.1, becoming a 3-dimensional shape with dimensions 340 × 345 × 2 (section 2.2).

(a) Time domain representation. (b) Time-frequency domain representation.

Figure 4.1: Audio segment representations.


From the original long audio wave files, 100 segments of 2 seconds were used to construct the

melody dataset. Thus, from a tensor point of view, the whole dataset is a 4-dimensional shape with dimensions 100 × 340 × 345 × 2.

4.2 Software

All the code developed was programmed using the PyCharm Community Edition IDE, in the program-

ming language Python 3.5.

The heavy computations, regarding the training of the deep neural network models proposed, were

performed in a computer provided by the Institute for Systems and Robotics, affiliated to Instituto Supe-

rior Tecnico. This machine is equipped with 4 NVIDIA GeForce GTX 1070 8 GB and 32 GB of RAM.

Two noteworthy libraries were used throughout this work: LibRosa and TensorFlow.

• LibROSA: LibROSA is an open source python library that serves the purpose of analyzing music

or audio data, developed by McFee, Raffel, Liang, et al. [37]. This was found to be convenient

to process the data before and after the generative network, especially regarding the Constant-Q

Transforms.

• TensorFlow: TensorFlow is an open source library for numerical computation based on data flow

graphs, originally developed by the researchers on the Google Brain Team [38]. Tensorflow in-

corporates a graphic tool called TensorBoard, useful to inspect the graph flow of the networks and

process performance measures. Another important characteristic is that its architecture allows it to

be executed both in CPUs and GPUs (through the CUDA interface [39]) providing a better compu-

tational performance. Due to its flexibility in creating architectures and its established community

over the years, it was found to be appropriate tool to address the deep learning domain.

The datasets were created from scratch to be used in this work. A musician firstly recorded a har-

mony and then improvised a melody over it. The software used to record the audio was Ableton Live 9

Lite [40].

The data acquired from the user study was subject to a statistical analysis. The IBM SPSS software was chosen to process this data.

4.3 Overall Implementation

The whole implemented system takes as input audio waveforms that represent data in the time domain. These time-domain signals are subject to a transform, namely the CQT, changing their representation into a time-frequency domain one. After that, the transformed signals are fed into a generative model. The generative model is trained to produce audio samples with a data structure equal to its input's. In order to evaluate its time-domain representation, i.e. to listen to the audio sample, the output of the generative model has to be transformed back to a time-domain representation.


This high-level description of the implemented system can be translated into the block diagram of

figure 4.2.

Figure 4.2: Implemented system high-level architecture.

Before going any further, let us break down the system and evaluate its components independently,

in order to validate the chain.

4.3.1 Validation Tests

In order to verify that the implemented system is properly developed and suits the purpose of generating

music, some validation tests were performed. Firstly, the CQT invertibility has to be guaranteed, so that

the output of the whole system is, like the input, a waveform. Then, the developed generative model should perform decently when fed with image data, since this has already been achieved in other works. Finally, the generator network has to prove able to generate a certain sample when the training set only comprises that specific sample.

CQT Validation

The CQT invertibility has to be guaranteed in order to assure that these computations do not affect the output of the generative model at all. In this test, the generative model block will be by-passed so that the

CQT and the CQT−1 algorithms can be validated, as shown in figure 4.3.

Figure 4.3: Constant-Q Transform (CQT) validation high-level architecture.

As approached in section 2.2.2, the parameters used to compute the CQT (table 4.1) were chosen


so that a decent inverse representation could be computed. As a second condition, the dimensions should be as reduced as possible, since the data will be subject to heavy computations regarding the deep neural network models.

Ideally, an error measure would be used to express the result of this test. However, the error between each sample of both waveforms does not clearly state whether the two waveforms sound similar. Therefore, the easiest way to compare the input and output data of this validation test is actually by visual analysis of the time-frequency representations, available in figure 4.4. Just like the audio evaluation, the comparison of these two spectrograms is subjective. Still, the author's subjective analysis of both representations converged to the same judgment, which is that the reconstruction was successful.

(a) Original sample. (b) Test output.

Figure 4.4: CQT validation test results.

Sample rate Hop size # Bins per octave # Bins Frequency range Scaling factor

44100 256 85 340 C4–C8 0.6

Table 4.1: Table of CQT parameters.
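For illustration, parameters like those of table 4.1 map roughly onto LibROSA's CQT as sketched below; the file name is hypothetical, the scaling factor is omitted, and splitting the complex result into two channels is only one plausible reading of the 340 × 345 × 2 shape, so this is not the exact code used in this work:

import numpy as np
import librosa

y, sr = librosa.load('bar.wav', sr=44100, duration=2.0)   # hypothetical 2-second (1 bar) segment
C = librosa.cqt(y, sr=sr, hop_length=256, fmin=librosa.note_to_hz('C4'),
                n_bins=340, bins_per_octave=85)
print(C.shape)                                             # roughly (340, 345): 340 bins, ~345 frames
channels = np.stack([np.abs(C), np.angle(C)], axis=-1)     # assumed 340 x 345 x 2 structure
print(channels.shape)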

Image Data Validation

In this test the generative model block was the one subject to validation. To perform this evaluation the benchmarking MNIST dataset was used to train the model. The MNIST dataset is composed of 55000 training samples with dimensions 64 × 64 × 1 regarding images of handwritten digits from 0 to 9. Despite working on image data, the application of this dataset to the developed deep neural network will assure that the procedures were set up properly.

The generator and discriminator network architectures are the ones present in the original DCGAN (section 2.6.2), i.e. the use of 4 convolutional layers with constant stride, instead of 2 convolutional layers with varying stride.

The training procedure considered all the samples from the dataset as input. The dimensioning of

the networks’ convolutional layers may be found in tables 2.2 and 2.1, where the original DCGAN was


(a) Dataset samples, x. (b) Generated samples, G(z), with 1 generator update per iteration. (c) Generated samples, G(z), with 2 generator updates per iteration.

Figure 4.5: In (a) the first 100 samples from the dataset are presented. In (b) 100 generated samples created by the same number of random vectors, with the generator network properly trained, are presented.

approached in detail. Accordingly, the whole model was trained with mini-batch SGD with a mini-batch size of 128, and, to accelerate training, the Adam optimizer was used with the parameters suggested in the

literature, i.e. the learning rate was set to 0.0002 and the momentum term, β, to 0.5. The size of the

random vector z was set to 100.

After training for 20 epochs, with no further tuning of the hyperparameters, the generator network

showed evidence of mode collapse, resulting in the generation of similar nonsensical data. Robinson [41] has already dealt with this matter and found an effective solution. In order to make sure that the discriminator

loss does not drop to zero, for each iteration the generator network is trained 2 times. After implementing

this solution, the generator network was able to generate realistic handwritten digits. The input data and

generated samples from both tries are shown in figure 4.5.
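The balancing trick itself reduces to a simple loop structure, sketched here with hypothetical callables standing in for the actual TensorFlow training ops:

def train_gan(d_step, g_step, iterations, g_updates_per_iteration=2):
    # One discriminator update followed by several generator updates per iteration,
    # keeping the discriminator loss from dropping to zero.
    for _ in range(iterations):
        d_step()
        for _ in range(g_updates_per_iteration):
            g_step()

# Dummy usage showing the call pattern only:
train_gan(lambda: None, lambda: None, iterations=10, g_updates_per_iteration=2)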

The networks' losses are computed with cross-entropy and their evolution throughout the training process is plotted in figure 4.6. In the case of the discriminator, the loss is the sum of the losses of the real and generated data classifications. By inspecting the plots, one may verify that both losses oscillate during the training process. Still, for the case where the generator network is only updated once per iteration, the loss of the discriminator eventually goes to zero and consequently the loss of the generator


starts growing. Comparing the losses' behavior with the generated samples, one can conclude that the oscillatory behavior is healthy for the learning process of both networks, since it guarantees the balance between them.

(a) Discriminator’s loss (b) Generator’s loss

Figure 4.6: Discriminator (a) and generator (b) losses trained on the MNIST dataset. The orange line represents the losses when the generator network is updated once per iteration, and the blue line represents the losses when the generator network is updated twice per iteration. The plotted data is smoothed by the Tensorboard interface in order to provide an easier analysis.

One Training Sample Validation

The last validation test aims at evaluating the suitability of the implemented generative model to generate

an output that can be transformed to an audio sample. The strategy adopted consists on training the

network with only 1 input sample, as represented in figure 4.7.

Figure 4.7: Training the generative model with only one sample as input.

The training hyperparameters were similar to the ones used in the previous test with the MNIST

dataset. However, as concluded in that test, the number of updates of the generator was not enough to


assure that the training between both networks is balanced. Therefore, it was found that updating the

generator network 5 times per iteration provided good results, i.e. a sample identical to the one used as

input.

After training for 10000 epochs, it was found that all the proposed models could achieve a decent out-

come, as can be seen in figure 4.8. Still, one may clearly verify that the generated sample from Model 1 is very accurate. Of the remaining models, Model 3 presents a visually "brighter background", implying lower background noise in the corresponding audio sample. Note that these samples still vary with the random vector, but with this intensive training over the same input sample, the dependence became very small.

(a) Input dataset sample (b) Generation with Model 1

(c) Generation with Model 2 (d) Generation with Model 3

Figure 4.8: Generations of the one training sample validation test for all the proposed models. In (a) the single training set sample is presented, and in (b), (c) and (d) the generations of Models 1, 2 and 3, respectively, are presented.

By inspecting figure 4.9, one may verify that the losses of both networks, after an initial phase, keep oscillating until the end of the training epochs, for all models. This, as already seen, is a sign of healthy training.


(a) Discriminator’s loss (b) Generator’s loss

Figure 4.9: Discriminator (a) and generator (b) losses regarding the validation test. The green, red and blue lines are relative to Models 1, 2 and 3, respectively. The plotted data is smoothed in order to provide an easier analysis.

4.3.2 Results

This section addresses the set of conditions that provided the trained network with the best results, as well as an evaluation of how those conditions affect performance.

Let us start by stating the considered parameters and describing their influence on the generative model's performance:

• Number of filters

The number of filters used in the convolutional layers of each of the 3 proposed models was set as stated in tables 3.1, 3.2 and 3.3, regarding models 1, 2 and 3, respectively. Different numbers of filters were tested following a power-of-2 basis, as mentioned before. The variation of that number was found to have an impact on the resolution of the CQT plot, resulting in weaker audio representations when the number was set too low. It was concluded that after a certain number of filters the improvements in resolution stopped, hence those numbers were the ones adopted.

Model 1 reached that point at 128 filters. However, Model 2 and Model 3 only reached it at

256. This might be caused by the large filter dimensions in the first layer, concerning a whole row

or a whole column.

• Batch size

The number of samples evaluated by each network at a time was found to be critical to the generative model's performance. As mentioned in section 2.6.2, mode collapse is a very common training failure when addressing GANs. As approached in [28] and verified in this case, the use of

batch discrimination avoids this failure.

• Network updates per iteration

The number of updates of each network per iteration was found to be critical to balance the training

of both networks. As referred when testing the implemented algorithm with the MNIST dataset,

a lack of balance in training leads the loss of the discriminator network to drop to zero, which


makes the gradients too small, ceasing the learning procedure (known as the vanishing gradient problem). Updating the generator network more than once per iteration was found to be a viable strategy to deal with this problem.

The balance found between each network's updates per iteration that provides well-behaved training, and consequently better results, was to update the discriminator network once and the generator network 5 times per iteration.

• Adam optimizer’s parameters

The Adam optimizer parameters are the learning rate and the momentum. Their influence on training concerns stability and speed of convergence, which in this case means reaching the "oscillatory stage". The values used in the original DCGAN model were found to be stable enough and tuning them did not significantly improve the networks' performance. Therefore, the learning rate

was kept at 0.0002 and the momentum term at 0.5.
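In TensorFlow 1.x these settings amount to something like the following (d_loss and the variable list are assumptions standing in for the model definition):

import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=0.0002, beta1=0.5)
# d_train_op = optimizer.minimize(d_loss, var_list=discriminator_variables)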

After training each of the proposed models on the whole melody dataset for 10000 epochs, with the hyperparameters stated above, it was found that Model 1 was by far the one that produced the best results, since Model 2 and Model 3 were hampered by their large kernels. However, training Model 1 took approximately 17 hours (more than 3 times longer than the other two models). The spectrograms of a few generations of the trained models are presented in figure 4.10, alongside a dataset sample. It can be visually verified that Model 1 is the only one that presents an output similar to a dataset sample.

The networks' training balance may be verified in figure 4.11 for the different proposed models trained

with the melody dataset. One may verify in the generator network loss plot that its losses keep oscillating

around the same value from epoch 2000 until the end, for all models. This implies that the discriminator

network’s loss does not drop to zero, which might not be clear in the leftmost plot for Model 2 and Model

3.

It may be concluded that the exploitation of different convolutional architectures, namely performing convolutions in only one direction and with a significantly bigger filter size (Model 2 and Model 3), does not add value to the networks' performance. Since the generations from those models are nothing like the input ones, the consequent transformation to audio samples will not be interesting.

A different set of parameters could perhaps have been set to train Model 2 and Model 3 to achieve

other results. However, after long days of parameter tuning, the results provided by training Model 1 in

these conditions were found to be undoubtedly the best.

A prior evaluation of the trained generative model was made to consider this as the best performance

within the whole experiment. Still, the evaluation of the output of the whole system, which is an audio sample, is very subjective. These results were considered not to have enough significance when judged

by only one individual. Therefore, a user study regarding the quality of the generated samples was found

to be essential, and is presented in the next chapter.


(a) Input dataset sample (b) Generation with Model 1

(c) Generation with Model 2 (d) Generation with Model 3

Figure 4.10: Generations of the trained generative model for all the proposed models. In (a) 4 different dataset samples are presented, and in (b), (c) and (d) generated samples from 4 different random vectors are presented, regarding Models 1, 2 and 3, respectively.

(a) Discriminator’s loss (b) Generator’s loss

Figure 4.11: Discriminator (a) and generator (b) losses regarding the trained generative model's best performance. The red, blue and orange lines are relative to Models 1, 2 and 3, respectively. The plotted data is smoothed in order to provide an easier analysis.


Chapter 5

User Study

As a method to evaluate the results of the implemented generative model, a user study was developed.

This work proposes the exploration of music generation based on audio. The generative model proposed

should be able to produce a musical audio sample with some constraints defined by the dataset. These

are the following:

• 2-second-long audio samples, which correspond to 1 bar with the tempo set to 120 BPM.

• The melody dataset implies the presence of only one note at a time, with a minimum subdivision

of a quarter note, i.e. 1 bar can only contain a maximum of 4 notes (each with a duration of 0.5

seconds).

• All the notes are in the key of A minor.

A musical sound definition is mandatory in order to classify the results. A distinction between a tone

and noise is made regarding the physical characteristics of sound. The main difference is that tone

is identified by certain characteristics such as controlled pitch and timbre, whereas noise is generally

identified by its source, e.g. waves breaking on shore or a plastic bottle being squashed [42].

This study should infer over the following hypothesis:

Hypothesis 1: The generated audio samples are not classified as a noise sound.

Hypothesis 2: The generated audio samples are ranked as musically pleasant.

A fully detailed description of the experimental set-up is presented in the following sections.

5.1 Participants

5 people (3 males, 2 females), aged between 22 and 25 years old, voluntarily participated in the study. All participants have at least basic theoretical music knowledge, and none of them suffered from any hearing disorder. No participant had prior contact with the study.


5.2 Design

The study involved 10 samples from each of the following groups:

• X, the dataset.

• Y, the trained generative model.

• Z, the untrained generative model.

In order to mitigate practice effects, the study design followed a block randomization, i.e. each

participant evaluated a block with the same samples but randomly sequenced.

Two questions were formulated, and should be answered for each sample in the block. These are

the following:

• Q1 — Do you classify the sound you heard as a noise sound or as a tone sound?

This should be answered with either tone or noise, i.e. a binary answer.

• Q2 — The sound you heard is musically pleasing.

This should be evaluated based on a Likert scale [43], which gives a quantitative value on a

subjective matter, generally based on the level of agreement/disagreement. The scale used comprises values from 1 to 5.

5.3 Procedure

A brief overview of the procedures was given to the participants before they were subjected to the experiment. They first had to fill in demographic information (gender, age, theoretical music level, hearing condition). Then, participants were asked to read some definitions in order to proceed to the following stage.

The experimenter played each sample in the randomly ordered block, guaranteeing an interval of 5 seconds between each 2-second sample, so that participants could answer the 2 questions. After listening to the 30-sample block, the experiment was concluded.

5.4 Results

Participants were asked to answer the questions Q1 and Q2. For each of the three sample groups X, Y

and Z, 50 answers were evaluated.

As Q1 expects a binary answer (Tone/Noise), the relative frequencies regarding each sample group

were computed and are plotted in figure 5.1. One may conclude that all the samples from the X group

were classified as tone, all the samples from the Z group were classified as noise, and 95% of the

samples from the Y group were classified as tone.

For Q2, the answer is a rank (Likert scale), therefore, the average ranks per sample group were

computed and are plotted in figure 5.2. One may conclude that, in terms of musical pleasantness, on average,


Figure 5.1: Relative frequency per sample group regarding Q1.

the samples from the group Z are not pleasant at all, the samples from the X group are very pleasant,

and the samples from the Y group are relatively pleasant.

Figure 5.2: Boxplot of the average rank per sample group regarding Q2.

The above descriptive statistics already show evidence that both hypotheses are verified. However, the survey results were tested for statistically significant differences between the different sample groups by means of a Friedman test.

The Friedman test is used to test for differences between groups when the dependent variable being

measured is ordinal. The results of this test are presented in table 5.1, proving that:

• There was a statistically significant difference in the noise/tone classification (Q1) depending on

which group the sample belongs to, χ2(2) = 94.360, p-value < 0.05.

• There was a statistically significant difference in how musically pleasing the sample was (Q2)

depending on which group the sample belongs to, χ2(2) = 89.805, p-value < 0.05.

The Friedman test proved the existence of statistically significant differences, but does not prove any-

thing else. In order to determine where these differences actually occur, it is necessary to perform post


Friedman Test Q1 Q2

N 50 50

Chi-Square 94.360 89.805

df 2 2

p-value 3.236 × 10^−21 3.156 × 10^−20

Table 5.1: Friedman test for Q1 and Q2.

hoc tests. The appropriate ones are Wilcoxon signed-rank tests on the different combinations of related

groups. Hence, the following combinations will be compared: X-Y, X-Z and Y-Z.

When making multiple comparisons with the Wilcoxon test, an adjustment of the significance level, known as the Bonferroni correction, has to be made. This simply consists of taking the initial significance level (0.05) and dividing it by the number of tests being performed (3 combinations). Therefore, the adjusted significance level will be 0.05/3 ≈ 0.017.

Question    Wilcoxon Signed Ranks Test    X-Y              X-Z              Y-Z
Q1          Z                             −1.732           −7.071           −6.856
            p-value                       0.083            1.537 × 10⁻¹²    7.099 × 10⁻¹²
Q2          Z                             −4.118           −6.372           −6.199
            p-value                       3.819 × 10⁻⁵     1.865 × 10⁻¹⁰    5.682 × 10⁻¹⁰

Table 5.2: Wilcoxon signed-rank tests for Q1 and Q2.

The Wilcoxon signed-rank test is used to compare two sets of scores that come from the same

participant. The results of this test are presented in table 5.2, showing that:

• There was a statistically significant difference in the noise/tone classification (Q1) between the

sample groups X-Z (Z(2) = −7.071, p-value < 0.017) and Y-Z (Z(2) = −6.856, p-value < 0.017),

but not between the sample groups X-Y (Z(2) = −1.732, p-value ≥ 0.017).

• There was a statistically significant difference in how musically pleasing the sample was (Q2)

between the sample groups X-Y (Z(2) = −4.118, p-value < 0.017), X-Z (Z(2) = −6.372, p-value <

0.017) and Y-Z (Z(2) = −6.199, p-value < 0.017).
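As an illustration, the pairwise comparisons with the Bonferroni-adjusted significance level could be carried out as in the following minimal sketch, which assumes the same hypothetical score matrix as in the Friedman sketch above; note that SciPy reports the W statistic rather than the Z value given in table 5.2.

```python
# Minimal sketch: pairwise Wilcoxon signed-rank tests with a Bonferroni-adjusted
# significance level. Assumes the hypothetical CSV layout used for the Friedman
# sketch (50 rows; columns X, Y, Z with numerically encoded answers).
from itertools import combinations

import numpy as np
from scipy import stats

responses = np.loadtxt("survey_scores.csv", delimiter=",", skiprows=1)
groups = {"X": responses[:, 0], "Y": responses[:, 1], "Z": responses[:, 2]}

alpha = 0.05 / 3  # Bonferroni adjustment for the three pairwise comparisons

for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    w, p = stats.wilcoxon(a, b)  # paired test: scores come from the same participant
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name_a}-{name_b}: W = {w:.1f}, p = {p:.3e} ({verdict})")
```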


Based on the performed post hoc tests, one may confidently state that samples generated by the

trained generative model (Y) are tone sounds. Moreover, the samples from the untrained generative

model (Z) are less musically pleasing than the samples from the trained generative model (Y), and the

samples from both Z and Y are less musically pleasing than the samples from the dataset (X).


Chapter 6

Conclusions

6.1 Achievements

In this master’s thesis, a music generation system is proposed. This system is composed of a transform

block that transforms waveform audio samples to a time-frequency domain, a generative model that

generates time-frequency domain samples, and an inverse transform that provides a waveform audio

output.

In an early stage, some basic validation tests were performed. The blocks regarding the Constant-

Q Transform were tested independently from the remaining system, achieving a considerable similarity

between the original audio sample and the processed one. The implemented generative model was

tested with the MNIST benchmarking dataset, achieving results close to the ones in the literature; the

lack of objective performance measures was the only reason not to consider them equally good. The suitability

of the proposed generative model to represent the data in question was confirmed.
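As an illustration of the Constant-Q round-trip check mentioned above, the following minimal sketch uses librosa's forward and inverse transforms; the file name and transform parameters are assumptions for illustration and not necessarily those used in this work.

```python
# Minimal sketch: Constant-Q Transform round trip on a single audio sample.
# The file name, hop length and bin counts are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("bar_sample.wav", sr=None)
C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
y_rec = librosa.icqt(C, sr=sr, hop_length=256, bins_per_octave=12)

# Compare the original and reconstructed waveforms over their common length.
n = min(len(y), len(y_rec))
rel_err = np.linalg.norm(y[:n] - y_rec[:n]) / np.linalg.norm(y[:n])
print(f"relative reconstruction error: {rel_err:.3f}")
```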

The proposed generative model comprised three different convolutional approaches, differing

mainly in striding directions. Horizontal and vertical, only horizontal, and only vertical were the different

options, named Model 1, Model 2 and Model 3, respectively. Model 1 was found to be

the only one that provided reasonable results. The exploration of these layer architectures was nonetheless

useful to break down the influence of some parameters on the learning process.
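To make the three striding options concrete, the following minimal sketch shows how such generator layers could differ, assuming Keras-style transposed convolutions over time-frequency maps with frequency on the vertical axis and time on the horizontal axis; the kernel size and filter count are illustrative assumptions, not necessarily those used in the thesis.

```python
# Minimal sketch: the three striding options as transposed-convolution layers.
# Axis convention assumed: (frequency, time) = (vertical, horizontal).
import tensorflow as tf

def up_block(strides):
    # One up-sampling layer of a DCGAN-style generator (illustrative parameters).
    return tf.keras.layers.Conv2DTranspose(filters=64, kernel_size=(5, 5),
                                           strides=strides, padding="same")

layer_model_1 = up_block(strides=(2, 2))  # Model 1: vertical and horizontal striding
layer_model_2 = up_block(strides=(1, 2))  # Model 2: horizontal (time) striding only
layer_model_3 = up_block(strides=(2, 1))  # Model 3: vertical (frequency) striding only
```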

The conducted user study tested the hypotheses that the generated audio samples are not noise,

i.e. that controlled pitch characteristics are present, and that they are musically pleasing. This study compared the

trained generative model’s output samples, Y, with samples from the dataset, X, and with samples from

the generative model before training, Z, i.e. noise. It was concluded that the Y samples are classified as

not noise, i.e. as tone sounds. It was also concluded that the samples from the trained generative model

group (Y) are more musically pleasing than the samples from the untrained generative model group (Z)

but still not as musically pleasing as the samples from the dataset group (X).

The implemented system was expected to generate better music samples, since some successful

models had already been developed using MIDI notation. However, since music is represented

at such a low level as a waveform, even generating a sample that is not noise was a challenging


task. Considering that and the results of the user study, it was found that the initial expectations were

too high, and that the results achieved are actually interesting.

6.2 Future Work

Despite the successful implementation of the proposed generative model to address music generation,

there is still a large margin for improvement.

The transform used to process the waveform audio samples increased the dimensionality of the data.

A method to reduce that dimensionality without jeopardizing the quality of the data might be critical when

dealing with larger datasets.

The generation of independent bars was achieved when training over independent bars. However,

adding a layer to the system that correlates bars would allow the generation of longer structures spanning

more than one bar, increasing the musical complexity.

Retrieving information from the latent space allows conditioning the generations. With that in mind,

one may focus on generating melodies over a prior chord or, the exact opposite, generating chords to

support some melody.

Less constrained datasets might be harder to train on but, once successful, the creativity of

the generations should improve considerably; for instance, a dataset with different instruments

and, consequently, different timbres.




Appendix A

Survey

The following document was given to participants when conducting the user study.


This survey is part of a user study within a master’s thesis. The following experiment will consist of listening to short audio segments and classifying them as requested.

Please start by filling in the following table with some personal information.

Age: ______
Gender:                                  Male o      Female o
Hearing disorder:                        Yes o       No o
Basic theoretical musical knowledge:     Yes o       No o

In order to proceed with the survey, the following concepts should be considered:

(1) A musical sound can be classified as a tone sound or a noise sound. The distinction between the two regards the physical characteristics of the sound. The main difference is that a tone is identified by certain characteristics such as controlled pitch and timbre, whereas noise is generally identified by its source, for example waves breaking on shore or a plastic bottle being squashed.

On the next page, a table is presented with the survey’s questions. For each sample heard, the participant should fill in only the corresponding line. Only one circle should be filled in per question.

The questions to be answered for all the samples heard are the following:

1. According to (1), is the sample you heard a noise sound or a tone sound?
2. How musically pleasing is the sound you heard?


Sample number    1 - Physical sound characteristics    2 - The audio sample is musically pleasing
                 Noise        Tone                     1 (Strongly disagree)   2   3   4   5 (Strongly agree)
 1               o            o                        o   o   o   o   o
 2               o            o                        o   o   o   o   o
 3               o            o                        o   o   o   o   o
 4               o            o                        o   o   o   o   o
 5               o            o                        o   o   o   o   o
 6               o            o                        o   o   o   o   o
 7               o            o                        o   o   o   o   o
 8               o            o                        o   o   o   o   o
 9               o            o                        o   o   o   o   o
10               o            o                        o   o   o   o   o
11               o            o                        o   o   o   o   o
12               o            o                        o   o   o   o   o
13               o            o                        o   o   o   o   o
14               o            o                        o   o   o   o   o
15               o            o                        o   o   o   o   o
16               o            o                        o   o   o   o   o
17               o            o                        o   o   o   o   o
18               o            o                        o   o   o   o   o
19               o            o                        o   o   o   o   o
20               o            o                        o   o   o   o   o
21               o            o                        o   o   o   o   o
22               o            o                        o   o   o   o   o
23               o            o                        o   o   o   o   o
24               o            o                        o   o   o   o   o
25               o            o                        o   o   o   o   o
26               o            o                        o   o   o   o   o
27               o            o                        o   o   o   o   o
28               o            o                        o   o   o   o   o
29               o            o                        o   o   o   o   o
30               o            o                        o   o   o   o   o
