Music Generation Using Generative Adversarial Networks
Diogo de Almeida Mousaco Pinho
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Prof. Rodrigo Martins de Matos Ventura
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Rodrigo Martins de Matos Ventura
Member of the Committee: Prof. Pedro Manuel Quintas Aguiar
June 2018
Acknowledgments
First, I would like to thank my supervisor Prof. Rodrigo Ventura for the support and encouragement to explore two topics of major personal interest, namely Machine Learning and Music.
I would like to thank in particular Pedro Ferreira, Jose Corujeira, Lia Laporta and Raquel Miranda for taking the time to discuss my work at various points.
Finally, I would also like to thank my family and friends for the motivation and support provided during
this thesis.
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the requirements of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Resumo
The idea of a machine being capable of generating music is, in a way, intriguing. The process of musical composition implies the manipulation of base sounds or notation to create more complex structures. This thesis proposes a generation system based on waveforms representing musical bars, taking advantage of Machine Learning techniques. A preprocessing of the audio samples is performed, consisting of the transformation of the waveforms into a time-frequency representation commonly used to deal with music signals. A state-of-the-art generative model was implemented with the aim of creating segments identical to those in the dataset, which is composed of bars 2 seconds in duration. The original model is known as the Generative Adversarial Network (GAN), but the implemented variant benefits from convolutional layers in the networks' architecture and is called the Deep Convolutional Generative Adversarial Network. Several approaches with different architectures and hyperparameters are implemented in order to evaluate the model's capability of meeting the proposed objectives. Through a user study it is concluded that the music segments generated by the implemented system are not noise, and that they are musically pleasing.
Keywords: Music Generation, Generative Adversarial Networks, Deep Learning, Deep Convolutional Generative Adversarial Networks, Machine Learning
Abstract
The idea of a machine being able to generate music is somewhat intriguing. The music composition process implies the manipulation of baseline sounds or notation to create more complex structures. In this thesis a generation system based on raw waveforms representing musical bars is proposed, taking advantage of Machine Learning techniques. A preprocessing of the audio samples is performed, consisting of a transformation of the waveforms into a time-frequency representation commonly used to deal with music signals. A state-of-the-art generative model was implemented with the purpose of creating music segments similar to those in the dataset, which is composed of 2-second-long music bars. The original model is known as the Generative Adversarial Network (GAN), but the implemented variant benefits from convolutional layers in its networks' architectures and is called the Deep Convolutional Generative Adversarial Network. Several approaches with different architectures and hyperparameters were tried in order to evaluate the model's capability of meeting the proposed objectives. By means of a user study it is concluded that the music segments generated by the implemented system are not noise, and are in fact musically pleasing.
Keywords: Music Generation, Generative Adversarial Networks, Deep Learning, Deep Convo-
lutional Generative Adversarial Networks, Machine Learning
Contents
Acknowledgments
Declaration
Resumo
Abstract
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Related work
  1.4 Contributions
  1.5 Thesis Outline
2 Theoretical Background
  2.1 Music Theoretical Overview
    2.1.1 Rhythm
    2.1.2 Melody
    2.1.3 Harmony
  2.2 Constant-Q Transform
    2.2.1 Mathematical Model
    2.2.2 CQT Application
  2.3 Neural Networks
    2.3.1 Activation Functions
    2.3.2 Loss functions
    2.3.3 Gradient Descent
    2.3.4 Backpropagation
  2.4 Generative Adversarial Networks
  2.5 Convolutional Neural Network
    2.5.1 Convolutional Layer
    2.5.2 Pooling Layer
    2.5.3 Fully-connected Layer
    2.5.4 CNN architecture
  2.6 Deep Convolutional Generative Adversarial Network
    2.6.1 DCGAN approach
    2.6.2 Detailed Architecture
3 Proposed Models
  3.1 Model 1
  3.2 Model 2
  3.3 Model 3
4 Implementation and Results
  4.1 Dataset
  4.2 Software
  4.3 Overall Implementation
    4.3.1 Validation Tests
    4.3.2 Results
5 User Study
  5.1 Participants
  5.2 Design
  5.3 Procedure
  5.4 Results
6 Conclusions
  6.1 Achievements
  6.2 Future Work
Bibliography
A Survey
List of Tables
2.1 Original DCGAN generator convolutional layer's specifications.
2.2 Original DCGAN discriminator convolutional layer's specifications.
3.1 Model 1 convolutional layers' specifications.
3.2 Model 2 convolutional layers' specifications.
3.3 Model 3 convolutional layers' specifications.
4.1 Table of CQT parameters.
5.1 Friedman's test for Q1 and Q2.
5.2 Wilcoxon signed-rank tests for Q1 and Q2.
List of Figures
2.1 Log-spectrogram example.
2.2 Artificial Neuron and Artificial Neural Networks design.
2.3 Activation functions plot.
2.4 Artificial Neural Networks (ANN) with notation.
2.5 Generative Adversarial Network (GAN) layout.
2.6 Informal explanation of minimax training algorithm.
2.7 Example of neuron disposition in the first convolutional layer.
2.8 Dimensioning the output of the convolutional layer. (1)
2.9 Example of Max Pooling.
2.10 Example of CNN architecture.
2.11 Dimensioning the output of the convolutional layer. (2)
2.12 Dimensioning the output of a fractional-strided convolutional layer. (1)
2.13 Dimensioning the output of a fractional-strided convolutional layer. (2)
2.14 Original DCGAN's generator network architecture.
2.15 Original DCGAN's discriminator network architecture.
3.1 Model 1 networks architecture.
3.2 Model 2 networks architecture.
3.3 Model 3 networks architecture.
4.1 Audio segment representations.
4.2 Implemented system high-level architecture.
4.3 CQT validation high-level architecture.
4.4 CQT validation test results.
4.5 MNIST real and generated samples.
4.6 Discriminator and generator losses trained on the MNIST dataset.
4.7 Training the generative model with only one sample as input.
4.8 Generations of the one training sample validation test for all the proposed models.
4.9 Discriminator and generator losses regarding the validation test.
4.10 Generations of the trained generative model for all the proposed models.
4.11 Discriminator and generator losses regarding the trained generative model's best performance.
5.1 Relative frequency per sample group regarding Q1.
5.2 Boxplot of the average rank per sample group regarding Q2.
List of Acronyms
AN Artificial Neuron
ANN Artificial Neural Networks
BPM Beats Per Minute
CNN Convolutional Neural Network
CPU Central Processing Unit
CQT Constant-Q Transform
CUDA Compute Unified Device Architecture
D Discriminator
DCGAN Deep Convolutional GAN
DFT Discrete Fourier Transform
DNN Deep Neural Networks
G Generator
GAN Generative Adversarial Network
GD Gradient Descent
GPU Graphics Processing Unit
GRU Gated Recurrent Unit
IDE Integrated Development Environment
LReLU Leaky Rectified Linear Unit
LSTM Long Short-Term Memory
MIDI Musical Instrument Digital Interface
MSE Mean Squared Error
RAM Random Access Memory
ReLU Rectified Linear Unit
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
WGAN Wasserstein GAN
Chapter 1
Introduction
1.1 Motivation
Music can be interpreted as the art of combining sounds and silences in a way that produces beauty, is harmonic, and expresses emotions. Those responsible for composing music intend to spark emotions in the listener. As a composer myself, I have dealt with the underlying struggles of expressing one's ideas as melodies, harmonies or even rhythms. With this in mind, one can wonder whether it is possible to achieve the same purpose when the source is not a human being but an algorithm. This thought may lead to deeper questions, such as whether a machine must have emotions in order to create music that provokes emotions in humans.
Traditionally, algorithms' decision-making processes have been based on a priori conditions over predefined features of data structures. However, viable alternatives to this approach have emerged with the recent developments in artificial neural networks, which learn to make decisions over features they define themselves, based solely on a dataset and a classification. Deep learning is currently one of the most prominent of these technologies and is believed to revolutionize the field in the near future.
Music generation has already been studied using these deep learning techniques. Most of these works aim at generating musical notation, which concerns the composition perspective, i.e. how musical notes are sequenced. However, if instead of notation one considers creating an actual sound, the task becomes much more challenging.
1.2 Objectives
In this thesis the task of generating a music segment waveform is addressed. A generation system based on raw waveforms is designed and implemented. The output is expected to represent music segments that follow certain constraints imposed by the input data, such as being a 2-second-long segment.
The success of this system will be evaluated based on sound fidelity, i.e. the generated segment should not be noise, and on how musically pleasing the sound is. Using a generic dataset including sequences of different patterns from every music genre, the system's outcome should be unconfined and as unbiased as possible, challenging human creativity. However, a more modest approach is taken with the purpose of creating a valid audio sample.
Ultimately, this thesis should assess the suitability of the proposed state-of-the-art deep learning generative models for the task of generating audio segments.
1.3 Related work
Algorithmic composition is a subject of work that dates back to 1959 [1]. However, the recent developments in Deep Neural Networks (DNN), which have shown astonishing results in learning from big datasets, have allowed the topic of music generation to be developed further. Over the past couple of years, numerous models addressing music generation have been published, all of them based on deep learning algorithms [2]–[10].
A large part of the neural-network-based models for music generation use the Recurrent Neural Network (RNN) and some of its variants, since the music generation process can be seen as creating dependent sequences [5]–[7]. RNNs are neural networks that possess a directed connection from the output of a unit in a certain layer to the input of another unit in a layer closer to the input one, just as in a closed-loop system. As an example of these variants, Nayebi and Vitelli [6] present the Gated Recurrent Unit (GRU) and the Long Short-Term Memory (LSTM) architectures to address music composition. According to their work, the LSTM was the only musically plausible architecture among the experimented ones. The data used in these experiments was preprocessed into a time-frequency representation.
Another way of representing data is directly from the time-domain audio waveforms, as used in WaveNet. Introduced by Oord, Dieleman, Zen, et al. [4], WaveNet is a fully probabilistic and autoregressive model, where all previous audio samples condition the distribution of the next one. WaveNet is actually a Convolutional Neural Network (CNN), where dilated (or à trous) causal convolutions are present in all layers. This kind of convolution consists of each filter taking every n-th element of the input matrix (n depends on the layer), rather than all of them. The WaveNet model reached state-of-the-art performance when applied to text-to-speech and foresees promising results when applied to music modeling, proving that CNNs are a valid option to generate music, alongside RNNs.
Yang, Chou, and Yang [2] proposed a model based on a CNN but trained as a Generative Adversarial Network (GAN) [11], called MIDINET. This model works over a piano-roll representation of Musical Instrument Digital Interface (MIDI) data, an encoding protocol for musical features such as sound and silence, notes, and tempo, among others. MIDI is by far the most used method to represent music structures computationally, since its complexity can be as simple as a binary code for whether a note is played at a certain time-step or not, disregarding the audio source variable. Back to the MIDINET model, the GAN uses an adversarial learning algorithm with two networks. One of them is called the generator, which aims at converting a random noise sample into realistic artificial data. The other is called the discriminator and is just a classifier that tells whether a sample comes from real data or from the generator network. In this case, both of them contain convolutional layers in their architectures, as CNNs do, which corresponds to the GAN variant proposed by Radford, Metz, and Chintala [12], known as the Deep Convolutional GAN (DCGAN). This GAN alone generates bars and does not consider temporal dependencies between them, although this issue is dealt with by a conditioner Convolutional Neural Network (CNN).
Another relevant model, very similar to the latter, is MuseGAN [3]. It also works over MIDI data, and uses a GAN variant that minimizes the Wasserstein distance, namely the Wasserstein GAN (WGAN). Temporal dependencies are dealt with here by combining time-dependent and time-independent random vectors to generate conditional bars.
1.4 Contributions
According to the proposed work, the main contributions are the following:
• An approach to music generation based on raw audio data, with machine learning techniques.
• An implementation of a generative model with two deep neural network architectures.
• An approach to convolutional strategies to exploit feature locality.
• A statistical analysis of a user study to validate the results.
1.5 Thesis Outline
This thesis is structured as follows:
Chapter 2 establishes the background on music theory concepts, a signal processing algorithm, and basic neural network concepts, and also covers some higher-level neural network structures.
Chapter 3 describes the design and implementation features of the proposed models.
Chapter 4 analyses the suitability of the proposed generative model for the task of generating music, and presents the parameters that provided the best trained model.
Chapter 5 describes the study developed to validate the results of this work.
Chapter 6 concludes the thesis and suggests avenues for future work.
Chapter 2
Theoretical Background
In this chapter a theoretical introduction is given, covering the basic structures and knowledge needed for a clear understanding of the remaining sections. In Section 2.1 a light overview of music theory is given in order to explain some musical details of the data structures used. In Section 2.2, the Constant-Q Transform, a time-frequency representation, is analyzed in detail. In Section 2.3, the subject of neural networks is introduced and the basic principles of how they work are covered. Lastly, Sections 2.4 to 2.6 cover more particular Deep Neural Networks.
2.1 Music Theoretical Overview
In physics, sound can be defined as a pressure wave that propagates through a transmission medium (e.g. air or water). When this wave reaches the human ear it is processed by the brain; this defines the human hearing capability. In order to be classified as music, the waves that are heard should follow some structure.
The first question that arises is how to define the baseline of music, i.e. a musical note. Musical notes are a notation that maps specific frequencies. In western music [13], there are 12 notes denoted by the following upper-case letters: {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}. All the keys on a piano are a direct representation of consecutive sets like this one. To distinguish different sets a number is assigned, known as the octave. So, two consecutive sets with the corresponding octave are written as {C4, C♯4, D4, D♯4, E4, F4, F♯4, G4, G♯4, A4, A♯4, B4, C5, C♯5, D5, D♯5, E5, F5, F♯5, G5, G♯5, A5, A♯5, B5}. As a standard of tuning, the A4 musical note maps to the 440 Hz frequency. The mathematical formulation to derive the frequency of the next note (in this case A♯4), whose distance from the previous one is called in music a half-step, is just a multiplication by a factor of $2^{1/12}$ (i.e. A♯4 = 440 × $2^{1/12}$ ≈ 466.16 Hz).
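As a quick sanity check of this mapping, the following minimal sketch (the example notes are arbitrary) derives frequencies from the A4 = 440 Hz reference using the half-step factor $2^{1/12}$:

```python
# Equal-temperament frequency mapping: each half-step multiplies
# the frequency by 2**(1/12).
A4 = 440.0  # Hz, standard tuning reference

def note_frequency(half_steps_from_a4):
    """Frequency of the note half_steps_from_a4 half-steps away from A4."""
    return A4 * 2 ** (half_steps_from_a4 / 12)

print(note_frequency(1))   # A#4 ~ 466.16 Hz
print(note_frequency(12))  # A5  =  880.00 Hz
print(note_frequency(-9))  # C4  ~  261.63 Hz (9 half-steps below A4)
```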
Having defined the frequency mapping of musical notes, music notation will be used where useful in the remainder of this section.
This overview will focus on the three main elements of music theory: Rhythm, Melody and Harmony [14].
2.1.1 Rhythm
The temporal element in music is called rhythm. It may be defined as the placement, in time, of sounds to create patterns. When one taps one's foot to the music, which is known as "keeping the beat", one is actually following the rhythmical pulse of the music. A few rhythmic terms define some important features of rhythm:
• Duration: A straightforward concept defined by how long a sound or silence lasts. Durations are named after the ratio of a note duration to a bar duration, i.e. one bar can contain 1 whole note, 2 half notes, 4 quarter notes, and so on.
• Tempo: Tempo measures the speed of the beat, usually expressed as a frequency in Beats Per Minute (BPM).
• Meter: Consists of organized accent patterns of beats. One example where this rhythmic characteristic can be noticed is the music genre called Waltz, where the tempo may vary but the meter is always based on a 3-beat bar whose first beat is strongly accentuated.
2.1.2 Melody
A horizontal series of notes (over time) can be defined as a melody. The melody is what actually gives a musical sense to sound. One simple example of this is the difference between talking and singing: singing consists of saying words and producing notes at the same time. One can tell this difference by inspecting what happens when someone tries to sing but does not know the lyrics.
The way notes are sequenced is not exactly random. It is usually based on predefined groups of five or seven notes, called scales.
2.1.3 Harmony
Harmony is, in a sense, a vertical stack of notes; its job is to give context to the melody notes so that they feel pleasant. Stacking notes at the same time creates a chord, and a series of chords creates a chord progression. As with melody, the choice of notes to stack is not random. Usually these stacked notes match the ones in a scale and follow some chord structure, such as the three-note ones: major, minor, augmented and diminished.
When all the notes in a chord progression belong to a certain scale, it is called a diatonic chord progression. This means that any scale note in the melody will sound harmonically pleasant over the progression, as the sketch below illustrates.
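To make the scale and chord concepts concrete, here is a minimal sketch that builds a major scale from the standard whole/half-step pattern and stacks alternate scale degrees into a diatonic triad; the interval pattern is textbook music theory rather than anything specific to this thesis:

```python
# Building a major scale from its half-step pattern (W-W-H-W-W-W-H)
# and stacking every other scale degree to form a diatonic triad.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # half-steps from the root

def major_scale(root):
    start = NOTES.index(root)
    return [NOTES[(start + s) % 12] for s in MAJOR_STEPS]

def diatonic_triad(scale, degree):
    # Stack the 1st, 3rd and 5th scale notes counting from the given degree.
    return [scale[(degree + i) % 7] for i in (0, 2, 4)]

scale = major_scale("C")
print(scale)                     # ['C', 'D', 'E', 'F', 'G', 'A', 'B']
print(diatonic_triad(scale, 0))  # ['C', 'E', 'G'] - the C major chord
```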
2.2 Constant-Q Transform
Sounds consisting of harmonic frequency components, when plotted against log-frequency, have the property that the distance between these frequency components is the same regardless of the fundamental frequency [15]. The spacing between two consecutive harmonics follows the sequence $\log(\frac{2}{1}), \log(\frac{3}{2}), \ldots$. Thus, the absolute position depends on the fundamental frequency, but the relative positions of the harmonics are constant. As a result, when represented in the frequency domain, these components create a pattern which depends only on the instrumental source of the sound.
The CQT was the representation chosen for preprocessing raw audio waveforms; its mathematical model is described next.
2.2.1 Mathematical Model
As stated in [16], the CQT, $X^{CQ}(k, n)$, of a discrete time-domain signal $x(n)$ is defined by:

$$X^{CQ}(k, n) = \sum_{j=\lfloor n - N_k/2 \rfloor}^{\lfloor n + N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2) \qquad (2.1)$$
where $k = 1, 2, \ldots, K$ indexes the frequency bins of the CQT, $\lfloor \cdot \rfloor$ represents the floor operation, which rounds down its argument, and $a_k^*(n)$ denotes the complex conjugate of $a_k(n)$. The latter, $a_k(n)$, are complex-valued waveforms, here also called time-frequency atoms, given by:

$$a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left[-i\, 2\pi n \frac{f_k}{f_s}\right] \qquad (2.2)$$
where $f_k$ is the center frequency of bin $k$, $f_s$ is the sampling rate, and $w(\cdot)$ is a continuous window function, sampled at points determined by the argument. The window function is zero outside the range $[0, 1]$. The window lengths $N_k \in \mathbb{R}$ in 2.1 and 2.2 are real values and are inversely proportional to $f_k$, so that the Q-factor is the same for all bins $k$. This Q-factor can be interpreted as the ratio of center frequency to bandwidth.
In Schorkhuber and Klapuri [16], the center frequencies $f_k$ follow:

$$f_k = f_1\, 2^{(k-1)/B} \qquad (2.3)$$

where $f_1$ is the center frequency of the lowest-frequency bin and $B$ is the number of bins per octave. The parameter $B$ regulates the frequency resolution.
The Q-factor of any bin $k$ is given by:

$$Q = \frac{f_k}{\Delta f_k} = \frac{N_k f_k}{\Delta_w f_s} \qquad (2.4)$$

where $\Delta f_k$ is the $-3$ dB bandwidth of the frequency response of the atom $a_k(n)$, and $\Delta_w$ is the $-3$ dB bandwidth of the mainlobe of the spectrum of the window function $w(\cdot)$.
In order to introduce minimum frequency smearing, the bandwidth $\Delta f_k$ should be as small as possible, which can be obtained with a large Q-factor. Still, the Q-factor cannot be made arbitrarily large, or else it would not be possible to analyze portions of the spectrum between bins. Thus, the best value of $Q$ that still allows signal reconstruction is given by:

$$Q = \frac{q}{\Delta_w (2^{1/B} - 1)} \qquad (2.5)$$

where $q \in [0, 1]$ is a scaling factor, typically set to 1. Setting $q < 1$ improves the time resolution but decreases the frequency resolution.
Combining equations 2.5 and 2.4 and solving for $N_k$, the following expression arises:

$$N_k = \frac{q f_s}{f_k (2^{1/B} - 1)} \qquad (2.6)$$

which no longer depends on $\Delta_w$.
To increase the computational efficiency of the CQT while allowing signal reconstruction from the coefficients, the atoms should be placed $H_k$ samples apart. $H_k$ is referred to as the "hop size", and to achieve a reasonable reconstruction of the signal its value should satisfy $0 < H_k < \frac{1}{2} N_k$.
2.2.2 CQT Application
Schorkhuber and Klapuri [16] proposed a method to efficiently compute the CQT, based on the algorithm proposed by Brown and Puckette [15]. This method [16] is not only more computationally efficient, but also allows the computation of the inverse CQT. Both the CQT and its inverse are implemented in a Python package called LibROSA [17].
The CQT will always be computed with the same fixed parameters. The sampling frequency was set to $f_s = 44100$ Hz. The minimum and maximum frequencies were set to the frequencies of the notes C3 and C7 (the letter C corresponds to a musical note, and the numbers correspond to different octaves, as mentioned in section 2.1), respectively, as they span a range where all the data lies. Each octave is represented by 85 frequency bins, which results in a total of 340 bins for the four octaves used. The "hop size" $H_k$ is set to 256 and the scaling factor to $q = 0.6$. For a 2-second-long audio signal this corresponds to 345 atoms, so the transformed data has dimensions $340 \times 345$ complex values. Still, in order to guarantee that later computations would not be affected, each complex value had to be separated into two real ones, corresponding to the real and imaginary parts. Therefore, the structure of the data after this transform is applied is 3-dimensional, with dimensions $340 \times 345 \times 2$.
These values were chosen based on two criteria. Firstly, a decent reconstruction had to be possible, so that further processing of the data would not be compromised. Secondly, since this data will be fed into a computationally heavy network, as later sections make clear, the dimensions should be as reduced as possible.
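A minimal sketch of this preprocessing step, assuming LibROSA's CQT interface [17] (the file name is hypothetical, and `filter_scale` is taken here to be the library's name for the scaling factor $q$):

```python
import numpy as np
import librosa

fs = 44100
# Load one 2-second bar (hypothetical file).
y, _ = librosa.load("bar.wav", sr=fs, duration=2.0)

C = librosa.cqt(
    y,
    sr=fs,
    hop_length=256,                  # hop size H_k = 256
    fmin=librosa.note_to_hz("C3"),   # lowest bin centered on C3
    n_bins=340,                      # 4 octaves (C3 to C7)
    bins_per_octave=85,              # B = 85
    filter_scale=0.6,                # scaling factor q = 0.6
)
# C is complex-valued with shape (340, ~345); splitting real and
# imaginary parts yields the 340 x 345 x 2 structure described above.
X = np.stack([C.real, C.imag], axis=-1)
print(X.shape)
```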
As an example of this representation, figure 2.1 plots the power log-spectrogram of a 2-second music sample. Note that this log-spectrogram representation will be used quite often in later sections, but since both the x and y axes are fixed and the z axis (color bar) is just an amplitude scaling, the axes might not always be shown.
Figure 2.1: Log-spectrogram example.
2.3 Neural Networks
Artificial Neural Networks (ANN) are inspired by the biological neural structures found in the human brain. These networks contain organized layers of interconnected units, where each unit is called an Artificial Neuron (AN). ANNs are mainly used for tasks in which it is difficult to derive logical constraints explicitly, such as pattern recognition and predictive analysis [18].
The first computational model of an AN was developed in 1943 by McCulloch and Pitts [19], a neuroscientist and a logician, respectively. They proposed a binary threshold unit as a model for the artificial neuron. The mathematical model of this unit is the following:
$$y = H\!\left(\sum_{j=1}^{n} w_j x_j - u\right) \qquad (2.7)$$

where $H(\cdot)$ is the activation function (in this case the Heaviside step function) with threshold $u$, $x_j$ is the input signal and $w_j$ the associated weight, with $j = 1, 2, \ldots, n$, where $n$ corresponds to the number of inputs. This unit's output is 1 when the sum is above the threshold $u$, and 0 otherwise. This model led Rosenblatt [20] to the development of a pioneering neural network known as the perceptron.
The current AN model consists of a weighted sum of the inputs and a bias, often referred to with the lower-case characters x and b, respectively. This sum is then passed through a non-linear function to produce the output of the AN, as can be seen in figure 2.2(a).
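A minimal sketch of this modern neuron model, using a sigmoid as the non-linearity (the weight, bias and input values are arbitrary):

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of the inputs plus the bias...
    z = np.dot(w, x) + b
    # ...passed through a non-linear activation (here a sigmoid).
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.6])   # weights (arbitrary for illustration)
b = 0.2                          # bias
print(neuron(x, w, b))
```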
As already mentioned, an ANN is a set of ANs organized in layers, which can be analyzed as a weighted directed graph. The ANs are the nodes, and the edges are connections between one neuron's output and another one's input. The layers in an ANN are: an input layer, a hidden layer, and an output layer. Note that when there is more than one hidden layer, the designation of the network changes to Deep Neural Network. In figure 2.2(b) one may verify the layer organization of interconnected ANs.
(a) Artificial Neuron (AN).
(b) Artificial Neural Networks (ANN).
Figure 2.2: Artificial Neuron and Artificial Neural Networks design (source: [21]).
In the remainder of this section, activation and loss functions, the gradient descent optimization method, and the backpropagation technique are analyzed in more depth, in order to give a general idea of an ANN implementation.
2.3.1 Activation Functions
Activation functions (also called non-linearities) play a major role in the neurons' computations. Their purpose is to make the network non-linear; otherwise it would be just a simple linear combination of the weights of the interconnected neurons. Among others, 3 main activation functions [22] will be analyzed:
• Sigmoid: The sigmoid non-linearity is plotted in figure 2.3(a) and has the following mathematical
expression:
$$f(x) = \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.8)$$
It takes a real value and confines it to the range $[0, 1]$. It looks just like a Heaviside step function with smooth edges. Inspecting figure 2.3(a), one can tell that for values of $x$ close to 0 the slope is very steep, which means this function pushes the input value towards one end of the curve. Furthermore, when $f(x)$ is in a region close to 0 or to 1, the gradient becomes very small and a problem called "vanishing gradient" arises: if a variation in the input causes only a small variation in the output, then the parameter will learn notably slower or may even not be learned at all. Suffice it to say that this problem scales when layers of ANs are stacked on top of each other.
• Hyperbolic Tangent: The hyperbolic tangent (tanh) non-linearity is plotted in figure 2.3(b) and
has the following mathematical expression:
$$f(x) = \tanh(x) = 2\sigma(2x) - 1 \qquad (2.9)$$
This is actually a scaled sigmoid function. The output values are confined to the range $[-1, 1]$, and it presents the same vanishing gradient problem as the former function. Besides the range, the actual difference from the sigmoid is that tanh has a stronger gradient, since its derivatives are steeper.
• Rectified Linear Unit: The Rectified Linear Unit (ReLU) non-linearity is plotted in figure 2.3(c) and has the following mathematical expression:

$$f(x) = \max(0, x) \qquad (2.10)$$
This simply truncates the lowest possible value to 0. This activation function has become popular since it does not suffer from the vanishing gradient problem. Nonetheless, the ReLU may still not be suited for all architectures: by removing all negative information the gradient becomes 0 and, consequently, the neuron becomes useless or "dead".
There are a few variations of the ReLU, such as the Leaky Rectified Linear Unit (LReLU), where for $x < 0$, instead of a horizontal line ($f(x) = 0$), the output follows a slightly inclined line (e.g. $f(x) = 0.2x$) whose slope is called the leak factor.
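The non-linearities above are straightforward to express in code; a minimal sketch, using the 0.2 leak factor from the LReLU example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # range (0, 1), eq. 2.8

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # range (-1, 1), eq. 2.9

def relu(x):
    return np.maximum(0.0, x)            # eq. 2.10

def leaky_relu(x, leak=0.2):
    # Slightly inclined line for x < 0 instead of a hard zero.
    return np.where(x > 0, x, leak * x)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(leaky_relu(x))
```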
(a) Sigmoid. (b) Hyperbolic Tangent (tanh). (c) Rectified Linear Unit (ReLU).
Figure 2.3: Activation functions plot. (source: [23])
2.3.2 Loss functions
An ANN's performance, as with other machine learning methods, is evaluated through a loss (or cost) function, which measures the disparity between the algorithm's prediction and the desired output. Among other existing loss functions [24], the relevant ones for this thesis are the following:
• Mean Squared Error (MSE): The MSE has the following mathematical expression:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|^2 \qquad (2.11)$$

This function computes the mean squared distance between each input value ($x_i$) and the desired output ($y_i$).
• Cross Entropy: The Cross Entropy, also known as log loss, has the following mathematical expression for a binary classifier:

$$\mathrm{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \qquad (2.12)$$

where $p_i$ is the probability of $x_i$ belonging to class 1 and $y_i$ is the class it actually belongs to. Mathematically, $p_i = p(y_i = 1 \mid x_i)$ and $(1 - p_i) = p(y_i = 0 \mid x_i)$. This function measures the divergence between two probability distributions.
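Both losses follow directly from equations 2.11 and 2.12; a minimal sketch (the clipping of $p_i$ is a common numerical safeguard added here, not part of the equations):

```python
import numpy as np

def mse(x, y):
    # Mean squared error between predictions x and targets y (eq. 2.11).
    return np.mean(np.abs(x - y) ** 2)

def cross_entropy(p, y, eps=1e-12):
    # Binary cross entropy (eq. 2.12): p is the predicted probability of
    # class 1, y is the true class in {0, 1}; eps avoids log(0).
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(mse(p, y), cross_entropy(p, y))
```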
2.3.3 Gradient Descent
In order to guarantee a learning process, it is mandatory that the network's parameters become more accurate over each iteration. Hence, training a network means solving the following optimization problem:

$$\min_{w} E(w) \qquad (2.13)$$

where $E(w)$ is a loss function.
The Gradient Descent (GD) is the most used optimization algorithm to train ANNs. This method
consists of iteratively updating the weights (or parameters) according to:

$$w_{k+1} = w_k - \eta \nabla E \qquad (2.14)$$

where $w_k$ is the weight vector at the $k$-th iteration, $\eta$ is the learning rate and $\nabla E$ is the gradient of the cost function $E(w)$.
This form of gradient descent is commonly known as offline. When the training data is large this method may become inefficient, making the learning process slow. Another variant of gradient descent, called Stochastic Gradient Descent (SGD), is used to deal with this issue. The SGD (also known as online), instead of using all the training samples to compute the gradient, uses only one sample or a subset of samples from the training set. In the case of a subset, it is often called mini-batch SGD.
In order to improve the process of minimizing the loss function, some acceleration techniques were developed [25]. Only one of them will be addressed here: the momentum technique. Rewriting the weight update equation (2.14) as:

$$w_{k+1} = w_k + \Delta w_k \qquad (2.15a)$$
$$\Delta w_k = -\eta \nabla E \qquad (2.15b)$$

the momentum technique adds a term to equation 2.15b, which becomes:

$$\Delta w_k = -\eta \nabla E + \alpha \Delta w_{k-1} \qquad (2.15c)$$

where the first term already appeared in equation 2.14, and the second term, called the momentum term, takes the previous iteration into account to evaluate the "continuity of the descent". This accelerates training in certain situations, with an influence determined by the momentum parameter $\alpha \in [0, 1[$. Hence, the weight update equation for SGD with the momentum acceleration technique is:

$$w_{k+1} = w_k - \eta \nabla E + \alpha \Delta w_{k-1} \qquad (2.16)$$
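A minimal sketch of the update rule in equation 2.16, applied to a toy quadratic loss (the loss function and hyperparameter values are illustrative only):

```python
import numpy as np

# Toy loss E(w) = ||w||^2 / 2, whose gradient is simply w.
def grad_E(w):
    return w

w = np.array([4.0, -2.0])    # initial weights
delta_w = np.zeros_like(w)   # previous update, for the momentum term
eta, alpha = 0.1, 0.9        # learning rate and momentum parameter

for k in range(100):
    delta_w = -eta * grad_E(w) + alpha * delta_w  # eq. 2.15c
    w = w + delta_w                               # eq. 2.15a

print(w)  # approaches the minimum at the origin
```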
2.3.4 Backpropagation
Originally introduced in the 1970s, the backpropagation algorithm only began to be taken seriously in 1986, when Rumelhart, Hinton, and Williams [25] showed that it could outperform the former approaches to learning; from then on, it has become a standard in the training of neural networks.
In order to fully comprehend the backpropagation algorithm, let us first look at a feedforward network and its notation. Note that the following description of the algorithm is based on a more detailed one in [26].
The notation used throughout this subsection is defined as:
• $w^l_{jk}$ represents the weight of neuron $j$ in layer $l$, coming from neuron $k$ in the previous layer $(l-1)$;
• $b^l_j$ represents the bias of neuron $j$ in layer $l$;
• $z^l_j$ represents the summation of the weighted inputs of neuron $j$ in layer $l$;
• $a^l_j$ represents the activation function's output of neuron $j$ in layer $l$.
Figure 2.4 allows a graphic interpretation of the aforementioned notation.
Figure 2.4: Artificial Neural Networks (ANN) with notation (source: [26] adapted)
Using this notation, the way these variables relate is given by the following equations:

$$z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j \qquad (2.17a)$$
$$a^l_j = \sigma(z^l_j) \qquad (2.17b)$$

where $\sigma(\cdot)$ is an activation function, as described in section 2.3.1. Therefore, joining the latter equations (2.17a and 2.17b) results in a direct relationship between the inputs and the output of a neuron:

$$a^l_j = \sigma\!\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right) \qquad (2.18)$$
Rewriting equation 2.18 in compact matrix form provides a clearer way of thinking from a layer point of view:

$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l) \qquad (2.19)$$

where $\mathbf{a}^l$ is just $a^l_j$ vectorized; the same principle applies to $\mathbf{b}^l$ and to $\mathbf{W}^l$. In this last case the dimensions are $j$ rows by $k$ columns. In the remainder of this algorithm description, bold letters correspond to matrix forms.
To finalize the feedforward procedure, only one thing is missing: the computation of the error. As already mentioned in section 2.3.2, a loss function is used to achieve that. However, it must be possible to write this loss as a function of the network's output, and as an average over the losses of individual training examples.
Considering $\sigma'(\cdot)$ as the first derivative of the activation function, $L$ as the output layer, $\nabla_a C$ as the vector of partial derivatives $\partial C / \partial a^L_j$ representing the variation of the cost with the output, $\delta^l_j$ as the error of neuron $j$ in layer $l$, and $\odot$ as the Hadamard (elementwise) product, the main equations of the backpropagation algorithm can now be approached:

$$\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad (2.20a)$$
$$\delta^l = \left((\mathbf{W}^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) \qquad (2.20b)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (2.20c)$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j \qquad (2.20d)$$
Equation 2.20a defines the error in the output layer $L$. Equation 2.20b works in a similar way, but instead of depending on the output of the network it depends on the error of the following layer $(l+1)$. These two equations together allow the computation of the error for every neuron in every layer. Combining this with equations 2.20c and 2.20d, it is possible to obtain the gradients with respect to the weights and biases of every neuron ($w^l_{jk}$ and $b^l_j$). All these equations are derived in [26].
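A minimal sketch of equations 2.20a–2.20d for a single training example in a small two-layer network, assuming a quadratic cost so that $\nabla_a C = \mathbf{a}^L - \mathbf{y}$ (the layer sizes and data are arbitrary):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))   # input
y = np.array([[1.0]])         # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Feedforward pass (eq. 2.19, layer by layer).
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)

# Output error (2.20a), then propagate it backwards (2.20b).
delta2 = (a2 - y) * sigma_prime(z2)
delta1 = (W2.T @ delta2) * sigma_prime(z1)

# Gradients w.r.t. biases (2.20c) and weights (2.20d).
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T, delta1
print(dW1.shape, dW2.shape)  # (4, 3) (1, 4)
```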
2.4 Generative Adversarial Networks
Having considered ANNs with one input layer, one hidden layer, and one output layer, one may now dive into more complex structures. When a network has more than one hidden layer it is called a Deep Neural Network (DNN). In this section, an unsupervised learning algorithm that consists of the interaction between DNNs will be covered.
Generative Adversarial Networks (GAN) were developed by Goodfellow, Pouget-Abadie, Mirza, et al. [11] in 2014. They proposed the use of an adversarial process to estimate generative models. Two Artificial Neural Networks are trained in parallel. One of them is called the generator network, which generates samples based on a vector sampled from a latent-space distribution, and the other is called the discriminator network, which learns to determine whether a sample comes from the training data or the generator. The training procedure for the generator network is to maximize the probability of the discriminator misclassifying the generations. Meanwhile, the discriminator network is trained to distinguish between real data and generated data. This process corresponds to a minimax two-player game.
In order to provide a better understanding of this training idea, a comparison with a more practical problem is commonly made, namely the interaction between a counterfeiter and a bank. The bank classifies money as real or counterfeit based on the features that distinguish them. The counterfeiter, however, gets feedback on those classifications and, in order to be successful, tries to mitigate the differences between real money and his counterfeits so that they become as realistic as possible. If the counterfeiter is competent, he will eventually end up making indistinguishable money. This can be seen as a competition between the bank and the counterfeiter, just like the one between the discriminator and the generator networks.
The remainder of this section is fully based on the work of Goodfellow, Pouget-Abadie, Mirza, et al. [11], and therefore the same notation will be used when addressing GANs:
• $x$ represents a real data structure, from the $p_{data}$ distribution;
• $z$ represents a random vector, sampled from the $p_z$ distribution;
• $G(\cdot)$ represents the generator network as a function of some input;
• $D(\cdot)$ represents the discriminator network as a function of some input.
Considering the aforementioned notation, one can tell that $G(z)$ represents a generated sample, $D(x)$ represents the classification of real samples, and $D(G(z))$ represents the classification of generated samples. A visual inspection of figure 2.5 may make this clearer.
Figure 2.5: Generative Adversarial Network (GAN) high level architecture.
To train both networks, a loss function has to be defined. Intuitively, the discriminator loss evaluates how well it did at letting real samples go through (i.e. comparing $D(x)$ to 1) plus how often it was "fooled" by the generator (i.e. comparing $D(G(z))$ to 0), whilst the generator loss evaluates how often it failed at making realistic samples (i.e. comparing $D(G(z))$ to 1).
In a formal approach, this training method can be interpreted as a two-player minimax game between D and G with a value function $V(D, G)$, mathematically formulated as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \qquad (2.21)$$
Other cost functions have been found useful when dealing with GANs [27]. Still, cost functions other than the minimax one (used in Goodfellow, Pouget-Abadie, Mirza, et al. [11]) were not tested.
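A minimal sketch of the alternating training scheme, written here in PyTorch for brevity; the tiny architectures, toy data and hyperparameters are illustrative only, and the generator step uses the common non-saturating form of the objective rather than minimizing $\log(1 - D(G(z)))$ directly:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.LeakyReLU(0.2),
                  nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    x = torch.randn(64, 2) * 0.5 + 3.0  # "real" samples from p_data
    z = torch.randn(64, 8)              # latent vectors from p_z

    # Discriminator step: push D(x) towards 1 and D(G(z)) towards 0.
    opt_d.zero_grad()
    loss_d = (bce(D(x), torch.ones(64, 1))
              + bce(D(G(z).detach()), torch.zeros(64, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: push D(G(z)) towards 1, i.e. fool the discriminator.
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```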
For a visual perception of how the variables actually change during training, one can analyze the different stages of figure 2.6.
Considering in figure 2.6(a) the generator's distribution (green, solid line) and the true data distribution (black, dotted line) as an adversarial pair near convergence, the distributions are still distinguishable and the classifier (blue, dashed line) is only partially accurate. After training the discriminator, in figure 2.6(b), it eventually converges to $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$. After the generator is updated, the discriminator's gradient leads the generator's distribution (green, solid line) closer to the true data distribution (black, dotted line), as in figure 2.6(c). At last, after several steps of training, a point is reached where $p_g = p_{data}$. Thus, the classifier (blue, dashed line) will not be able to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$, as pictured in figure 2.6(d).
(a) (b) (c) (d)
Figure 2.6: Informal explanation of the minimax training algorithm. The generator's distribution $p_g$ is represented as a green solid line, the real data distribution $p_{data}$ as a black dotted line, and the discriminator's classification as a blue dashed line. (source: [11])
Training GANs was found to be a very challenging task. The main issue is that the gradient descent algorithm works well when the goal is to minimize a loss function, which is not exactly the case here. The big deal in GAN training is to find the Nash equilibrium of a non-convex game, which defines a state where neither player changes its strategy regardless of the opponent's decisions. Using gradient descent to change the parameters of the discriminator may have a positive impact on the discriminator's loss but a negative one on the generator's loss, or vice versa. Hence, instead of converging, the solution oscillates.
GANs have been known to be quite difficult to train, since they lack performance measures. When the training of the discriminator and generator networks is not well balanced, the GAN may enter a failure mode such as mode collapse [28] or vanishing gradient.
Mode collapse occurs when the generator network generates a limited set of samples, or even a single sample, regardless of the random vector z. Since the discriminator network does not actually enforce diversity in the generator's outputs, these may all converge to the same point that the discriminator network believes is realistic. This failure mode may be identified by inspecting the generator network's generations: if the generations remain the same regardless of the random vector, then the generator has managed to "fool" the discriminator.
Vanishing gradient occurs when the loss drops to zero, ending up with no gradient updates. This problem arises when the discriminator is too good, resulting in a very slow learning process. On the other hand, when the discriminator behaves poorly, the generator does not get accurate feedback, keeping it from representing reality.
Radford, Metz, and Chintala [12] proposed a convolutional architecture of the GAN model that proved to be more stable to train, and named it the Deep Convolutional GAN (DCGAN). The DCGAN is a GAN where both the discriminator and the generator use convolutional layers. In order to approach the DCGAN model, the Convolutional Neural Network should be fully detailed first.
2.5 Convolutional Neural Network
As covered in section 2.3, a neuron in a certain layer is connected to all the neurons in the previous layer. This section will cover an alternative way of connecting neurons.
Just as with ANNs, the motivation for CNNs came from nature, specifically from the visual cortex of animals. The main idea is that the neurons in the visual cortex get different types of information in different layers, depending on what they are focusing on.
The application of this process requires as input a 3-dimensional representation of data (e.g. an RGB image), and it tries to establish a relationship with some other data, for instance a classification. As in a regular neural network, this relationship is encoded in weights. The main difference here is that neurons are only connected to a small region of the previous layer, instead of to all of its neurons in a fully-connected way.
There are three main types of layers in a CNN: Convolutional layers, Pooling layers and Fully-connected layers. In this section these layers will be clarified, and an example CNN architecture will be given.
Note that the remainder of this section regarding CNNs is based on [29]; therefore, for the sake of simplicity, the input data will be assumed to be an RGB image.
2.5.1 Convolutional Layer
A convolutional layer consists of a set of weights, also known as filters or kernels. The filters, which are the learnable variables of this layer, are small spatial extents with a depth equal to that of the input data. As the name may suggest, this layer performs a mathematical operation called convolution, in this particular case a 2-dimensional convolution. This operation consists of the computation of dot products between the filter's entries and the input, at every position. This results in a spatial connection between neurons, as represented in figure 2.7.
(a) Volume perspective (b) Neuron perspective
Figure 2.7: Example of neuron disposition in the first convolutional layer. In (a) the input (red shape) and output (blue shape) of the first convolutional layer are spatially represented. Each of the 5 neurons (circles) corresponds to the result of the convolution between a filter and a spatial region of the input (receptive field). In (b) the neuron has 3 inputs, according to the depth of the previous layer. This demonstrates that each neuron in a layer represents the receptive field across the whole depth of the previous layer. (source: [29] adapted)
There are four hyperparameters that directly affect the output of this layer: filter, depth, stride and
zero-padding. A description of each one of these is given, as well as a visual example in figure 2.8.
• Filter: This hyperparameter concerns the spatial extent of the filter. The height and width can differ from each other, whereas the depth of the filter is always the same as the depth of the input data ($F_d = I_d$).
• Depth: This hyperparameter corresponds to the number of filters to be learned. Each of the filters should be looking for different features of the input data. A depth column will be used to refer to a set of neurons that concern the same region. For instance, in the first convolutional layer, the input data depth corresponds to the number of channels of the image ($I_d = 3$ in the RGB case).
• Stride: This hyperparameter defines the way the filter slides through the input data. If the stride is set to 1, the convolution is computed for every pixel, resulting in the same output dimensions as the input ones. However, if it is set to 2, the convolution is only computed every 2 pixels, resulting in reduced (half) output dimensions. The stride applies along both the height and the width axes. Still, the notation will be simplified when height and width are equal, e.g. a 2 × 2 stride will be referred to as $S = 2$.
• Zero-padding: This hyperparameter pads the spatial borders of the input with zeros. It is generally used to preserve the input's width and height in the output shape. When set to "same" or "half", implying no changes in the dimensions, it means that $P = \frac{F - 1}{2}$.
Based on all these hyperparameters, it is possible to formulate an equation that determines the output size of the convolutional layer, as follows:

$$O_{h,w} = \frac{I_{h,w} - F_{h,w} + 2P}{S} + 1 \qquad (2.22a)$$
$$O_d = K \qquad (2.22b)$$

where $O_{h,w}$ is the output's height and width, $I_{h,w}$ is the input's height and width, and $F_{h,w}$ is the filter's height and width. The variable $P$ defines the amount of zero-padding on the border and the variable $S$ the stride. $O_d$ is the output's depth, which is defined by the number of filters used, $K$.
Figure 2.8: Dimensioning the output of the convolutional layer. Convolving a 3 × 3 filter (shade) over a 5 × 5 input using half padding and unit stride (i.e. $I_{h,w} = 5$, $F_{h,w} = 3$, $S = 1$, $P = 1$). (source: [30])
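Equation 2.22a is easy to verify in code; a minimal sketch reproducing the settings of figures 2.8 and 2.11 (the integer division assumes hyperparameters chosen so the result is exact):

```python
def conv_output_size(i, f, s, p):
    # Spatial output size of a convolutional layer (eq. 2.22a).
    return (i - f + 2 * p) // s + 1

# Figure 2.8: 5x5 input, 3x3 filter, unit stride, half padding.
print(conv_output_size(i=5, f=3, s=1, p=1))  # 5 (dimensions preserved)
# Figure 2.11 (section 2.6.1): same setup with stride 2.
print(conv_output_size(i=5, f=3, s=2, p=1))  # 3 (downsampled)
```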
2.5.2 Pooling Layer
The pooling layer is key to ensuring that consecutive layers are able to identify larger-scale features. This kind of layer is used to reduce the spatial size of its input, consequently reducing the number of parameters and computations. Only the height and width are downsampled, leaving the depth with the same dimensions. There are other pooling operations [31], but the most frequently used, due to better results in practice, is Max Pooling, exemplified in figure 2.9.
Just like the convolutional layer (section 2.5.1), the pooling layer is also subject to the filter and stride hyperparameters. Here, the filter defines the spatial extent over which the maximum function is computed. The hyperparameters are usually set to $F_{h,w} = 2$ and $S = 2$ (as exemplified in figure 2.9), or to $F_{h,w} = 3$ and $S = 2$, which entails overlapping pooling.
The equation that determines the output size of the pooling layer is the following:

$$O_{h,w} = \frac{I_{h,w} - F_{h,w}}{S} + 1 \qquad (2.23a)$$
$$O_d = I_d \qquad (2.23b)$$

where $I_d$ denotes the input's depth, and the other variables are the same as in equations 2.22.
Figure 2.9: Example of Max Pooling with a 2 × 2 filter over a 4 × 4 input using 2 × 2 stride (i.e. $I_{h,w} = 4$, $F_{h,w} = 2$, $S = 2$).
2.5.3 Fully-connected Layer
A fully-connected (or dense) layer, as the name may suggest, is a layer whose neurons are connected to all of the previous layer's neurons. This is just the regular hidden layer seen in section 2.3, used broadly at the end of the network to interpret more complex structures. Its input is, in this case, a rearrangement of the neurons organized in the 3-dimensional shape into a single dimension.
The same representation can be achieved through a convolutional layer. If the filter height and width match the input height and width, then the convolution will result in a $1 \times 1$ shape, which corresponds to one neuron per filter. This type of layer has a spatial shape of $1 \times 1 \times K$, which is no more than a single-dimension vector, where $K$ still denotes the number of filters.
2.5.4 CNN architecture
It has been seen that CNNs work over 3-dimensional input shapes. The main process in this architecture is composed of a set of 3 layers, which usually go together in the following order:

Conv −→ ReLU −→ Pool

where the first layer is a convolutional layer (section 2.5.1), the second layer computes the Rectified Linear Unit (ReLU) activation function (section 2.3.1) element-wise, and the third layer is a pooling layer (section 2.5.2). This set of layers is placed just after the input layer, and is usually stacked to achieve deeper networks. As this chain grows, the height and width of the neurons' spatial placement are reduced while the number of applied kernels increases. Afterwards, the 3-dimensional shape of neurons that results from the output of the previous layer is unfolded into an array-like structure. Lastly, a fully-connected layer (section 2.5.3), followed by an activation function such as sigmoid or tanh, produces the desired output shape (e.g. 3 labeled nodes). Figure 2.10 shows a concrete example of a CNN architecture.
Figure 2.10: Example of a CNN architecture. This includes 4 sets of convolutional, ReLU and pooling layers, referred to as "Conv + Maxpool", and 2 fully-connected layers, referred to as "FC". (source: [32])
2.6 Deep Convolutional Generative Adversarial Network
The Deep Convolutional GAN (DCGAN) model is no more than a GAN whose discriminator and generator networks comprise convolutional layers, as CNNs do. The work of Radford, Metz, and Chintala led to the adoption of some already proven modifications to the standard CNN architecture. These will be detailed in the following subsection, along with complementary theoretical aspects not yet covered.
2.6.1 DCGAN approach
Strided Convolutions
The first modification regards the use of pooling layers to reduce spatial dimensions. Questioning the need for different layer types in the pipeline, "The all convolutional net", by Springenberg, Dosovitskiy, Brox, et al. [33], proposes dropping the pooling layer from the architecture, relying only on convolutions with non-unitary stride to perform that job. This approach is broadly known as strided convolutions. As seen in equation 2.22a (section 2.5.1), the stride, S, has a scaling effect on the output dimensions of the convolutional layer. Hence, by changing the stride value, one can achieve the same downsampling as with the pooling layer, as shown in figure 2.11. Furthermore, Springenberg, Dosovitskiy, Brox, et al. found that this proposal is not only valid, with no loss of accuracy on recognition tasks, but actually yields state-of-the-art performance.

This will be used in the discriminator network, similarly to a CNN.
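As an illustration of this equivalence, the following sketch downsamples a 5 × 5 input to 3 × 3 with a single stride-2 convolution, the same spatial reduction a 2 × 2 pooling layer would provide. It is written in present-day TensorFlow 2.x eager style, which differs slightly from the thesis-era API:

```python
import tensorflow as tf

# A 5x5 single-channel input and a 3x3 filter, as in figure 2.11.
x = tf.random.normal([1, 5, 5, 1])   # [batch, height, width, channels]
w = tf.random.normal([3, 3, 1, 1])   # [F_h, F_w, in_channels, out_channels]

# A stride-2 convolution with zero-padding downsamples 5x5 -> 3x3,
# replacing a separate pooling layer.
y = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
print(y.shape)  # (1, 3, 3, 1)
```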
Figure 2.11: Dimensioning the output of the convolutional layer. Convolving a 3 × 3 filter (shaded) over a 5 × 5 input with 1 × 1 padding using 2 × 2 stride (i.e. I_{h,w} = 5, F_{h,w} = 3, S = 2, P = 1). (source: [30])
Fractional-Strided Convolutions
Regarding the generator network, the process requires upsampling: creating, from a single vector, a shape with dimensions equal to those of the real data. This is achieved with the fractional-strided convolution, also known as transposed convolution [30]. This concept is very similar to the previous one, but the transposed convolution is computed instead.
The variables with an apostrophe (e.g. O′_{h,w}) concern the fractional-strided convolution, and their meaning is the same as described for equation 2.22a (section 2.5.1). They relate to the variables of the convolution being transposed as follows:

• Ĩ′_{h,w} = I′_{h,w} + (S − 1)(I′_{h,w} − 1), where Ĩ′_{h,w} represents the size of the stretched input, comprehending the zeros added between the input units.

• F′_{h,w} = F_{h,w}

• S′ = 1

• P′ = F_{h,w} − P − 1
The output of this fractional-strided convolution can be defined through the following mathematical
equations:
O′_{h,w} = S(I′_{h,w} − 1) + F_{h,w} − 2P (2.24a)

O′_d = K (2.24b)
This process is somewhat complex; therefore, to understand what these variables represent in practice, a visual example is provided in figures 2.12 and 2.13.
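These relations can also be verified numerically. The following helper (a simple check, not part of the implemented system) reproduces the output dimensions of figures 2.12 and 2.13 directly from equation 2.24a:

```python
def transposed_conv_output_size(i_prime, f, s, p):
    """Output height/width of a fractional-strided convolution
    (equation 2.24a): O' = S(I' - 1) + F - 2P."""
    return s * (i_prime - 1) + f - 2 * p

# Figure 2.12: I' = 2, F = 3, S = 2, P = 0 -> 5x5 output.
assert transposed_conv_output_size(2, 3, 2, 0) == 5
# Figure 2.13: I' = 3, F = 3, S = 2, P = 1 -> 5x5 output.
assert transposed_conv_output_size(3, 3, 2, 1) == 5
```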
Eliminating Fully-Connected Layers
The next modification adopted by Radford, Metz, and Chintala concerns the trend to eliminate fully-connected layers on top of convolutional features. One convincing example of this is global average-pooling,
Figure 2.12: Dimensioning the output of a fractional-strided convolutional layer. The transpose of convolving a 3 × 3 filter over a 5 × 5 input using 2 × 2 stride (i.e. I_{h,w} = 5, F_{h,w} = 3, S = 2, P = 0). It is equivalent to convolving a 3 × 3 filter (shaded) over a 2 × 2 input (with 1 zero between the units), with 2 × 2 padding and unit stride (i.e. I′_{h,w} = 2, Ĩ′_{h,w} = 3, F′_{h,w} = F_{h,w}, S′ = 1, P′ = 2). (source: [30])
Figure 2.13: Dimensioning the output of a fractional-strided convolutional layer. The transpose of convolving a 3 × 3 filter over a 5 × 5 input with 1 × 1 padding, using 2 × 2 stride (i.e. I_{h,w} = 5, F_{h,w} = 3, S = 2, P = 1). It is equivalent to convolving a 3 × 3 filter (shaded) over a 3 × 3 input (with 1 zero between the units), with 1 × 1 padding and unit stride (i.e. I′_{h,w} = 3, Ĩ′_{h,w} = 5, F′_{h,w} = F_{h,w}, S′ = 1, P′ = 1). (source: [30])
used in some state-of-the-art image classification models [34], [35]. This alternative, however, has its drawbacks, namely hurting convergence speed. Still, a middle ground of connecting the convolutional features directly to the inputs or outputs of the network was proven to work well.
In the first layer of the generator network of a GAN, the random vector undergoes a transformation to become a 4-dimensional shape before the convolutional computations. This could be called a fully-connected layer, since it consists only of a linear matrix multiplication. Regarding the discriminator network, on the other hand, at the end of the convolutional stack the neurons are flattened from their spatial arrangement and then fed into a single sigmoid output.
Batch Normalization
In order to accelerate training in deep neural network architectures, Ioffe and Szegedy [36] developed the batch normalization method. The change in the distribution of a layer's inputs caused by variations in the previous layer's parameters during training is defined as internal covariate shift; batch normalization aims at reducing this issue. By normalizing the mini-batch input of a layer so that each unit has zero mean and unit variance, batch normalization has been proven a useful tool to deal with training problems, for instance saturated activations (really high or really low outputs), helping the gradient flow.
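As a minimal sketch of the training-time computation (ignoring the running statistics kept for inference and the updates of the learned parameters), batch normalization can be written in NumPy as follows:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each unit to zero mean and unit variance over the
    mini-batch axis, then scale and shift with the learned gamma, beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A mini-batch of 128 badly scaled activations with 64 units each.
x = 10.0 * np.random.randn(128, 64) + 5.0
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(), y.std())  # approximately 0 and 1
```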
This method has become critical to prevent the generator network from collapsing its samples to a single point, a common issue in GANs known as mode collapse. Applying it to all layers, however, resulted in model instability, so Radford, Metz, and Chintala circumvented this by not applying batch normalization to the generator's output layer nor to the discriminator's input layer.
2.6.2 Detailed Architecture
After these considerations on the original DCGAN model by Radford, Metz, and Chintala, only a global view of the architecture is missing. This section covers all the architectural aspects, addressing the generator and discriminator networks independently.
Generator Architecture
As covered in section 2.4, the generator network's role is to create a realistic sample based on a vector sampled from a random distribution. This can be achieved in several different ways, but the original DCGAN will be the one in scope. The architectural aspects of this process are now analyzed in detail.
First, the generator's input, z, is randomly sampled from a normal distribution with zero mean and unit variance, producing a 100 × 1 vector. In order to proceed through convolutional layers, this vector must become a 3-dimensional shape. Here is where the partial fully-connected layer takes place: the random vector's neurons are fed into those of the next layer, where they are subject to a matrix multiplication but not to any activation function, as would happen in a normal fully-connected layer. This projects the 100 × 1 vector into a 16384 × 1 vector, which is then reshaped into the desired 3-dimensional extent of 4 × 4 × 1024.
The following steps consist entirely of fractional-strided convolutions (section 2.6.1), resulting in the
real data output shape, with dimensions 64 × 64 × 3. Figure 2.14 embodies the network layering, and
table 2.1 complements it with the dimensioning of the convolutional layers.
Regarding activation functions (section 2.3.1), every layer uses the ReLU non-linearity as the last step, except the last one, which uses tanh. The convolutional layers are subject to batch normalization (section 2.6.1) in every layer except the last ("Conv 4" in the example of figure 2.14).
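Putting figure 2.14 and table 2.1 together, the generator can be condensed into the following illustrative Keras-style sketch. This is not the code used in this work, and weight initializations are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dcgan_generator():
    """Sketch of the original DCGAN generator (figure 2.14 / table 2.1)."""
    z = layers.Input(shape=(100,))                   # random vector z
    h = layers.Dense(4 * 4 * 1024)(z)                # linear projection, no activation
    h = layers.Reshape((4, 4, 1024))(h)
    h = layers.ReLU()(layers.BatchNormalization()(h))
    for filters in (512, 256, 128):                  # Conv 1..3 of table 2.1
        h = layers.Conv2DTranspose(filters, 5, strides=2, padding='same')(h)
        h = layers.ReLU()(layers.BatchNormalization()(h))
    # Conv 4: no batch normalization, tanh output -> 64x64x3 fake sample
    out = layers.Conv2DTranspose(3, 5, strides=2, padding='same',
                                 activation='tanh')(h)
    return tf.keras.Model(z, out)

print(build_dcgan_generator().output_shape)  # (None, 64, 64, 3)
```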
Discriminator Architecture
As covered in section 2.4, the discriminator network’s role is to classify the input in real or generated
data. Again, the original DCGAN will be in scope. The architectural aspects of this process are now
subject of further analysis.
Figure 2.14: Original DCGAN's generator network architecture. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. The dark blue rectangle, on the left, denotes the random vector z, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents an input-data-like structure, also referred to as "fake sample".
Layer Input Filter Stride #Filters Output
Conv 1 4× 4× 1024 5× 5× 1024 2× 2 512 8× 8× 512
Conv 2 8× 8× 512 5× 5× 512 2× 2 256 16× 16× 256
Conv 3 16× 16× 256 5× 5× 256 2× 2 128 32× 32× 128
Conv 4 32× 32× 128 5× 5× 128 2× 2 3 64× 64× 3
Table 2.1: Original DCGAN generator convolutional layers' specifications.
First, the discriminator's input is a 3-dimensional shape with dimensions 64 × 64 × 3, which may be a real sample from the dataset, x, or a sample generated by the generator network from a random vector z, G(z). The following steps consist entirely of strided convolutions (section 2.6.1), resulting in a 3-dimensional shape with much lower height and width values.
Figure 2.15 embodies the network layering, and table 2.2 complements it with the dimensioning of
the convolutional layers.
The subsequent step is to reshape the output of the last convolutional layer into a 1-dimensional vector, in this case reshaping 4 × 4 × 512 into 8192 × 1. These 8192 units are fully connected to a single neuron that is responsible for classification.
Figure 2.15: Original DCGAN's discriminator network architecture. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. The dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.
Layer Input Filter Stride #Filters Output
Conv 1 64× 64× 3 5× 5× 3 2× 2 64 32× 32× 64
Conv 2 32× 32× 64 5× 5× 64 2× 2 128 16× 16× 128
Conv 3 16× 16× 128 5× 5× 128 2× 2 256 8× 8× 256
Conv 4 8× 8× 256 5× 5× 256 2× 2 512 4× 4× 512
Table 2.2: Original DCGAN discriminator convolutional layers' specifications.
Regarding activation functions (section 2.3.1), every layer uses the LReLU non-linearity as the last step, except the last one, which uses sigmoid. The convolutional layers are subject to batch normalization (section 2.6.1) in every layer except the first one ("Conv 1" in the example of figure 2.15).
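Analogously, figure 2.15 and table 2.2 can be condensed into the following illustrative Keras-style sketch (again, not the code used in this work; the leak value of 0.2 follows the setting adopted in chapter 3):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dcgan_discriminator():
    """Sketch of the original DCGAN discriminator (figure 2.15 / table 2.2)."""
    x = layers.Input(shape=(64, 64, 3))              # real sample x or fake G(z)
    h = x
    for i, filters in enumerate((64, 128, 256, 512)):  # Conv 1..4 of table 2.2
        h = layers.Conv2D(filters, 5, strides=2, padding='same')(h)
        if i > 0:                                    # no batch norm on Conv 1
            h = layers.BatchNormalization()(h)
        h = layers.LeakyReLU(0.2)(h)                 # LReLU with leak 0.2
    h = layers.Flatten()(h)                          # 4*4*512 = 8192 units
    out = layers.Dense(1, activation='sigmoid')(h)   # single classification neuron
    return tf.keras.Model(x, out)

print(build_dcgan_discriminator().output_shape)  # (None, 1)
```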
Chapter 3
Proposed Models
In this chapter, all the proposed models will be described. These follow the nature of the DCGAN (section 2.6). The convolutional approach and the modifications proposed by Radford, Metz, and Chintala (section 2.6.1) are adopted here, albeit with slight changes.
The main part of the algorithm is shared by all the models proposed in this work. The dynamics between the discriminator and generator networks have already been covered in the previous chapter, so the focus now is on their individual structure.
The process starts by taking 100 samples from a normal distribution N(0, 1) to create the random vector z. This vector is fed to the generator network and passes through a linear layer, consisting of linear operations, with the weight matrix randomly initialized from a normal distribution N(0, 0.2) and the bias initialized to 0. The output of this linear layer is reshaped into a 3-dimensional shape and batch-normalized, and the ReLU non-linearity is then applied. The following steps consist of convolutional layers, whose parameters will be specified later, since they differ according to each proposed model. Still, the input of each convolutional layer is batch-normalized and the output subject to a non-linearity, namely the ReLU, with the exception of the last one, where the tanh non-linearity is used.
The discriminator network takes as input a 3-dimensional shape holding real or generated data. The following steps consist of convolutional layers, whose parameters will be specified later, since they differ according to each proposed model. Still, after each convolutional layer a non-linearity is applied, namely the LReLU with a leak of 0.2, and, with the exception of the first, the inputs of all convolutional layers are batch-normalized. After the convolutional procedures, the 3-dimensional output shape is flattened and fed to a linear layer, whose parameters are initialized in the same way as those of the linear layer used in the generator network, to produce a single output unit. To this last unit, a sigmoid non-linearity is applied.
The loss function used to measure the performance of each network, according to the output of the discriminator network as covered in section 2.4, was the cross-entropy covered in section 2.3.2. Note that the use of mini-batches is accounted for by averaging the loss over the whole batch.
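For illustration, one way to write these mini-batch-averaged cross-entropy losses in present-day Keras style is the following, where d_real and d_fake denote the discriminator's sigmoid outputs for real and generated batches:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()  # cross-entropy averaged over the batch

def discriminator_loss(d_real, d_fake):
    # D should classify real samples as 1 and generated samples as 0;
    # the total loss is the sum of both classification losses.
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    # G is trained to make D classify its samples as real (label 1).
    return bce(tf.ones_like(d_fake), d_fake)
```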
Let us now approach the convolutional procedures deferred above. Considering the input data dimensions, it is not possible to use the exact original DCGAN model. The original DCGAN is expected to work over data samples with power-of-2 dimensions, since the strided convolutions are always 2 × 2,
which does not happen in this particular case. Hence, the intuitive solution was to factorize both the height and width of the data to dimension the strides, arranging the factors so that both dimensions have a matching number of multiplicative factors, as follows:
Ih = 340 = 17× 5× 4
Iw = 345 = 23× 5× 3
With these factorized dimensions, one can conclude that only 2 strided convolutions can be computed: apart from the base 17 × 23 shape, each dimension decomposes into only 2 further factors (5 and 4 for the height, 5 and 3 for the width), which pair up to define the strides.
Note that this dimensioning conflict only arises in the generator network, since the zero-padding in the discriminator network allows it to deal with any input dimensions. However, in order to keep some balance between both networks' computations, the same number of convolutional layers with equivalent strides was used.
Regarding the filter dimensioning, in order to guarantee that each unit is accounted for in at least 2 receptive fields when striding in a certain direction, the filter size should follow:

F_{h,w} = 2 S_{h,w} + 1
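These dimensioning rules can be checked with a few lines of Python: the chosen stride factorization expands the 17 × 23 base shape back to 340 × 345, and the rule above yields the filter sizes later listed in table 3.1 for Model 1:

```python
def filter_size(stride):
    # F = 2S + 1: each unit falls in at least 2 receptive fields
    # along the striding direction.
    return 2 * stride + 1

shape = (17, 23)                    # base shape of the Model 1 generator
for stride in ((5, 5), (4, 3)):     # the 2 strided convolutions
    shape = tuple(d * s for d, s in zip(shape, stride))
    print('stride', stride, 'filter', tuple(filter_size(s) for s in stride),
          '-> output', shape)
# stride (5, 5) filter (11, 11) -> output (85, 115)
# stride (4, 3) filter (9, 7)   -> output (340, 345)
```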
The number of filters chosen in the convolutional layers is a power of 2, just as in the original DCGAN. That quantity is relevant to detecting details in the data; the numbers were therefore chosen based on whether they represented the data in question decently, when performing some validation tests.
Three models are proposed with different approaches to the convolutional steps.
3.1 Model 1
Model 1 is very similar to the original DCGAN model. Due to the different dimensions of the data, the depth of the networks had to be changed, with only 2 convolutional layers each. The dimensioning of the filters and strides also changed, but the filters still have relatively small spatial extents and the strides are of the same order of magnitude.
The Model 1 network architectures are presented in figure 3.1, and the corresponding convolutional layer specifications are described in table 3.1.
3.2 Model 2
In this model, striding along the width only is proposed, which means that the receptive field covers one temporal interval at a time but considers the whole action occurring in that interval. This ensures that the first convolutional layer is fully responsible for the analysis of the frequency component, i.e. which notes are being played. The following convolutional layers will, therefore, control the temporal relationship between each time interval.
(a) Generator network architecture. (b) Discriminator network architecture.
Figure 3.1: Model 1 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the random vector z, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.
Network Layer Input Filter Stride #Filters Output
Generator Conv 1 17× 23× 256 11× 11× 256 5× 5 128 85× 115× 128
Generator Conv 2 85× 115× 128 9× 7× 128 4× 3 2 340× 345× 2
Discriminator Conv 1 340× 345× 2 9× 7× 2 4× 3 128 85× 115× 128
Discriminator Conv 2 85× 115× 128 11× 11× 128 5× 5 256 17× 23× 256
Table 3.1: Model 1 convolutional layers' specifications.
The Model 2 network architectures are presented in figure 3.2, and the corresponding convolutional layer specifications are described in table 3.2.
3.3 Model 3
Model 3 is essentially the opposite of Model 2. In this model, striding along the height only is proposed, which means that the receptive field covers one frequency interval at a time but considers the whole time series. This ensures that the first convolutional layer is fully responsible for the analysis of the temporal component, i.e. in which time-steps a certain note is being played. The following convolutional layers will, therefore, control the relationship between different notes.
(a) Generator network architecture. (b) Discriminator network architecture.
Figure 3.2: Model 2 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the random vector z, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.
Network Layer Input Filter Stride #Filters Output
Generator Conv 1 1× 23× 512 1× 11× 512 1× 5 256 1× 115× 256
Generator Conv 2 1× 115× 256 340× 7× 256 340× 3 2 340× 345× 2
Discriminator Conv 1 340× 345× 2 340× 7× 2 340× 3 256 1× 115× 256
Discriminator Conv 2 1× 115× 256 1× 11× 256 1× 5 512 1× 23× 512
Table 3.2: Model 2 convolutional layers' specifications.
The Model 3 network architectures are presented in figure 3.3, and the corresponding convolutional layer specifications are described in table 3.3.
(a) Generator network architecture. (b) Discriminator network architecture.
Figure 3.3: Model 3 network architectures. Layers are represented by arrows and the corresponding input/output data structures by colored shapes. In (a), the dark blue rectangle, on the left, denotes the random vector z, the light blue 3-dimensional shapes denote the convolutional layers' inputs, and the rightmost 3-dimensional shape, in dark blue, represents a dataset-like structure, also referred to as "fake sample". In (b), the dark blue 3-dimensional shape, on the left, denotes the input of the network, which can be x or G(z), the light blue 3-dimensional shapes denote the convolutional layers' outputs, and the dark blue square, on the right, represents the neuron used for classification.
Network Layer Input Filter Stride #Filters Output
Generator Conv 1 17× 1× 512 11× 1× 512 5× 1 256 85× 1× 256
Generator Conv 2 85× 1× 256 9× 345× 256 4× 345 2 340× 345× 2
Discriminator Conv 1 340× 345× 2 9× 345× 2 4× 345 256 85× 1× 256
Discriminator Conv 2 85× 1× 256 11× 1× 256 5× 1 512 17× 1× 512
Table 3.3: Model 3 convolutional layers' specifications.
Chapter 4
Implementation and Results
4.1 Dataset
One dataset will be used to train the proposed generative model. It was built over a melody improvised by the author over a diatonic chord progression. Throughout its whole extent, some musical features (covered in section 2.1) are accounted for:
• The tempo is fixed to 120 BPM;
• The meter is set to 4 beats per bar; as a consequence of the previous point, 1 bar lasts 2 seconds;

• The minimum note value is a quarter note, i.e. each note lasts a minimum of 0.5 seconds.
• The melody notes are diatonic in the key of A minor.
In order to be treatable, the full melody wave file was split into 2-second (1 bar) segments. These were subject to a transform, as may be seen in figure 4.1, becoming a 3-dimensional shape with dimensions 340 × 345 × 2 (section 2.2).
(a) Time domain representation. (b) Time-frequency domain representation.
Figure 4.1: Audio segment representations.
From the original long audio wave files, 100 segments of 2 seconds were used to construct the melody dataset. Thus, from a tensor point of view, the whole dataset is a 4-dimensional shape with dimensions 100 × 340 × 345 × 2.
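The construction of this tensor can be outlined in a few lines. This is an illustrative sketch: the filename is hypothetical, and the CQT step is only indicated in a comment (its parameters are detailed in section 4.3.1):

```python
import librosa

# Hypothetical filename for the recorded melody (section 4.2).
y, sr = librosa.load('melody.wav', sr=44100, mono=True)

bar_len = 2 * sr                      # 2 s = 1 bar at 120 BPM in 4/4 meter
bars = [y[i * bar_len:(i + 1) * bar_len] for i in range(len(y) // bar_len)]
print(len(bars), 'bars of', bar_len, 'samples each')
# Each bar is then CQT-transformed into a 340x345x2 shape (section 2.2),
# and the first 100 bars are stacked into the 100x340x345x2 tensor.
```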
4.2 Software
All the code developed was written in the Python 3.5 programming language, using the PyCharm Community Edition IDE.

The heavy computations, regarding the training of the proposed deep neural network models, were performed on a computer provided by the Institute for Systems and Robotics, affiliated with Instituto Superior Tecnico. This machine is equipped with 4 NVIDIA GeForce GTX 1070 GPUs (8 GB each) and 32 GB of RAM.
Two noteworthy libraries were used throughout this work: LibROSA and TensorFlow.
• LibROSA: LibROSA is an open-source Python library for analyzing music and audio data, developed by McFee, Raffel, Liang, et al. [37]. It was found convenient for processing the data before and after the generative network, especially regarding the Constant-Q Transforms.
• TensorFlow: TensorFlow is an open-source library for numerical computation based on data flow graphs, originally developed by the researchers of the Google Brain Team [38]. TensorFlow incorporates a graphical tool called TensorBoard, useful to inspect the graph flow of the networks and to process performance measures. Another important characteristic is that its architecture allows execution both on CPUs and on GPUs (through the CUDA interface [39]), providing better computational performance. Due to its flexibility in creating architectures and its community established over the years, it was found to be an appropriate tool to address the deep learning domain.
The datasets were created from scratch to be used in this work. A musician first recorded a harmony and then improvised a melody over it. The software used to record the audio was Ableton Live 9 Lite [40].
The data acquired from the user study was subject to a statistical analysis. The IBM SPSS software was chosen to process the respective data.
4.3 Overall Implementation
The whole implemented system takes as input audio waveforms, which represent data in the time domain. These time-domain signals are subject to a transform, namely the CQT, changing their representation into a time-frequency one. After that, the transformed signals are fed into a generative model, which is trained to produce audio samples with a data structure equal to that of its input. In order to evaluate the output in the time domain, i.e. to listen to the audio sample, the output of the generative model has to be transformed back to a time-domain representation.
This high-level description of the implemented system can be translated into the block diagram of figure 4.2.
Figure 4.2: Implemented system high-level architecture.
Before going any further, let us break down the system and evaluate its components independently,
in order to validate the chain.
4.3.1 Validation Tests
In order to verify that the implemented system is properly developed and suits the purpose of generating music, some validation tests were performed. First, the CQT invertibility has to be guaranteed, so that the output of the whole system is, like the input, a waveform. Then, the developed generative model should perform decently when fed with image data, since this has already been achieved in other works. Finally, the generator network has to prove able to generate a certain sample when the training set contains only that specific sample.
CQT Validation
The CQT invertibility has to be guaranteed in order to assure that these computations do not affect the output of the generative model at all. In this test, the generative model block is bypassed so that the CQT and CQT−1 algorithms can be validated, as shown in figure 4.3.
Figure 4.3: Constant-Q Transform (CQT) validation high-level architecture.
As approached in section 2.2.2, the parameters used to compute the CQT (table 4.1) were chosen so that a decent inverse representation could be computed. As a second condition, the dimensions should be as reduced as possible, since the data will be subject to heavy computations in the deep neural network models.
Ideally, an error measure would be used to express the result of this test. However, the sample-wise error between both waveforms does not clearly state whether the two waveforms sound similar. Therefore, the easiest way to compare the input and output data of this validation test is actually by visual analysis of the time-frequency representations, available in figure 4.4. Just like the audio evaluation, the comparison of these two spectrograms is subjective. Still, the author's subjective analysis of both representations converged to the same judgment: the reconstruction was successful.
(a) Original sample. (b) Test output.
Figure 4.4: CQT validation test results.
Sample rate (Hz) Hop size # Bins per octave # Bins Frequency range Scaling factor
44100 256 85 340 C4–C8 0.6
Table 4.1: CQT parameters.
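These parameters map naturally onto LibROSA's CQT interface. The sketch below performs the round trip of figure 4.3 on one 2-second bar y_bar (e.g. obtained as in section 4.1); it assumes a LibROSA version that provides icqt, that the "Scaling factor" corresponds to the filter_scale argument, and that the two data channels hold the real and imaginary parts:

```python
import numpy as np
import librosa

sr, hop, fmin = 44100, 256, librosa.note_to_hz('C4')

# Forward transform: 340 bins x 345 frames of complex coefficients.
C = librosa.cqt(y_bar, sr=sr, hop_length=hop, fmin=fmin,
                n_bins=340, bins_per_octave=85, filter_scale=0.6)
X = np.stack([C.real, C.imag], axis=-1)   # the 340x345x2 network input
print(X.shape)                            # (340, 345, 2)

# Inverse transform, to listen to the (approximately) reconstructed bar.
y_rec = librosa.icqt(C, sr=sr, hop_length=hop, fmin=fmin,
                     bins_per_octave=85, filter_scale=0.6)
```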
Image Data Validation
In this test, the generative model block was the one subject to validation. To perform this evaluation, the benchmark MNIST dataset was used to train the model. The MNIST dataset is composed of 55000 training samples with dimensions 64 × 64 × 1, regarding images of handwritten digits from 0 to 9. Despite working on image data, the application of this dataset to the developed deep neural network assures that the procedures were set up properly.
The generator and discriminator network architectures are the ones present in the original DCGAN (section 2.6.2), i.e. using 4 convolutional layers with constant stride instead of 2 convolutional layers with varying strides.
The training procedure considered all the samples from the dataset as input. The dimensioning of the networks' convolutional layers may be found in tables 2.1 and 2.2, where the original DCGAN was approached in detail.
(a) Dataset samples, x
(b) Generated samples, G(z), with 1 generator update per iteration
(c) Generated samples, G(z), with 2 generator updates per iteration
Figure 4.5: In (a) the first 100 samples from the dataset are presented. In (b) and (c), 100 samples generated from the same number of random vectors are presented, with one and two generator updates per iteration, respectively.
Accordingly, the whole model was trained with mini-batch SGD with a mini-batch size of 128 and, to accelerate training, the Adam optimizer was used with the parameters suggested in the literature, i.e. the learning rate was set to 0.0002 and the momentum term, β, to 0.5. The size of the random vector z was set to 100.
After training for 20 epochs, with no further tuning of the hyperparameters, the generator network showed evidence of mode collapse, generating similar nonsense data. Robinson [41] has already dealt with this matter and found an effective solution: to make sure that the discriminator loss does not drop to zero, the generator network is trained 2 times per iteration. After implementing this solution, the generator network was able to generate realistic handwritten digits. The input data and generated samples from both tries are shown in figure 4.5.
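A minimal training step implementing this fix might look as follows. This is a TF 2.x-style sketch, reusing the generator, discriminator and loss functions sketched in previous chapters, and is not the thesis-era TF 1.x code:

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)   # literature-suggested values
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

def train_step(x_real, g_updates=2):
    """One iteration: a single discriminator update followed by
    g_updates generator updates, keeping D's loss away from zero."""
    z = tf.random.normal([tf.shape(x_real)[0], 100])
    with tf.GradientTape() as tape:
        d_loss = discriminator_loss(discriminator(x_real),
                                    discriminator(generator(z)))
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    for _ in range(g_updates):                       # extra generator updates
        z = tf.random.normal([tf.shape(x_real)[0], 100])
        with tf.GradientTape() as tape:
            g_loss = generator_loss(discriminator(generator(z)))
        grads = tape.gradient(g_loss, generator.trainable_variables)
        g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss
```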
The networks' losses are computed with cross-entropy, and their evolution throughout the training process is plotted in figure 4.6. In the case of the discriminator, the loss is the sum of the losses of the real and generated data classifications. By inspecting the plots, one may verify that both losses oscillate during the training process. Still, for the case where the generator network is only updated once per iteration, the loss of the discriminator eventually goes to zero and consequently the loss of the generator starts growing.
Comparing the losses' behavior with the generated samples, one can conclude that the oscillatory behavior is healthy for the learning process of both networks, since it guarantees the balance between them.
(a) Discriminator’s loss (b) Generator’s loss
Figure 4.6: Discriminator (a) and generator (b) losses trained on the MNIST dataset. The orange line represents the losses when the generator network is updated once per iteration, and the blue line represents the losses when the generator network is updated twice per iteration. The plotted data is smoothed by the TensorBoard interface in order to provide an easier analysis.
One Training Sample Validation
The last validation test aims at evaluating the suitability of the implemented generative model to generate an output that can be transformed into an audio sample. The strategy adopted consists of training the network with only 1 input sample, as represented in figure 4.7.
Figure 4.7: Training the generative model with only one sample as input.
The training hyperparameters were similar to the ones used in the previous test with the MNIST dataset. However, as concluded in that test, a single update of the generator was not enough to ensure balanced training between the networks. It was found that updating the generator network 5 times per iteration provided good results, i.e. a sample identical to the one used as input.
After training for 10000 epochs, it was found that all the proposed models could achieve a decent outcome, as can be seen in figure 4.8. Still, one may clearly verify that the generated sample from Model 1 is very accurate. Of the remaining models, Model 3 presents a visually "brighter background", implying a lower background noise in the corresponding audio sample. Note that these samples still vary with the random vector, but with this intensive training over the same input sample, the dependence became very small.
(a) Input dataset sample (b) Generation with Model 1
(c) Generation with Model 2 (d) Generation with Model 3
Figure 4.8: Generations of the one-training-sample validation test for all the proposed models. In (a) the single training set sample is presented, and in (b), (c) and (d) the generations of Models 1, 2 and 3, respectively, are presented.
By inspecting figure 4.9, one may verify that the losses of both networks, after an initial phase, keep oscillating until the end of the training epochs, for all models. This, as already seen, is a sign of healthy training.
(a) Discriminator’s loss (b) Generator’s loss
Figure 4.9: Discriminator (a) and generator (b) losses regarding the validation test. The green, red and blue lines are relative to Models 1, 2 and 3, respectively. The plotted data is smoothed in order to provide an easier analysis.
4.3.2 Results
This section addresses the set of conditions that provided the trained network with the best results, as well as an evaluation of how those conditions affect performance.

Let us start by stating the considered parameters and describing their influence on the generative model's performance:
• Number of filters
The number of filters used in the convolutional layers of each of the 3 proposed models was set as stated in tables 3.1, 3.2 and 3.3, regarding models 1, 2 and 3, respectively. Different numbers of filters were tested on a power-of-2 basis, as mentioned before. The variation of that number was found to have an impact on the resolution of the CQT plot, resulting in weaker audio representations when the number was set too low. It was concluded that beyond a certain number of filters the resolution improvements stopped; those were the numbers adopted.

Model 1 reached that point at 128 filters. However, Model 2 and Model 3 only reached it at 256. This might be caused by the large filter dimensions in the first layer, covering a whole row or a whole column.
• Batch size
The number of samples evaluated by each network at a time was found to be critical to the generative model's performance. As mentioned in section 2.6.2, mode collapse is a very common training failure when addressing GANs. As approached in [28] and verified in this case, the use of minibatch discrimination avoids this failure.
• Network updates per iteration
The number of updates of each network per iteration was found to be critical to balance the training of both networks. As noted when testing the implemented algorithm with the MNIST dataset, a lack of balance in training leads the loss of the discriminator network to drop to zero, which makes the gradients too small and ceases the learning procedure (known as the vanishing gradient problem). Updating the generator network more than once per iteration was found to be a viable strategy to deal with this problem.

The balance between the networks' updates per iteration that provides well-behaved training, and consequently better results, was to update the discriminator network once and the generator network 5 times per iteration.
• Adam optimizer’s parameters
The Adam optimizer parameters are the learning rate and the momentum. Their influence on training concerns stability and speed of convergence, which in this case means reaching the "oscillatory stage". The values used in the original DCGAN model were found to be stable enough, and tuning them did not significantly improve the networks' performance. Therefore, the learning rate was kept at 0.0002 and the momentum term at 0.5.
After training each of the proposed models on the whole melody dataset for 10000 epochs, with the hyperparameters stated above, it was found that Model 1 was by far the one that produced the best results, since Model 2 and Model 3 were held back by their large kernels. However, training Model 1 took approximately 17 hours (more than 3 times longer than the other two models). The spectrograms of a few generations of the trained models are presented in figure 4.10, alongside a dataset sample. It can be visually verified that Model 1 is the only one that presents an output similar to a dataset sample.
The networks' training balance may be verified in figure 4.11 for the different proposed models trained with the melody dataset. One may verify in the generator network loss plot that the losses keep oscillating around the same value from epoch 2000 until the end, for all models. This implies that the discriminator network's loss does not drop to zero, which might not be clear in the leftmost plot for Model 2 and Model 3.
It may be concluded that the exploitation of different convolutional architectures, namely striding in only one direction with a significantly bigger filter size (Model 2 and Model 3), does not add value to the networks' performance. Since the generations from those models are nothing like the input ones, the consequent transformation to audio samples will not be interesting.
A different set of parameters could perhaps have been set to train Model 2 and Model 3 to achieve
other results. However, after long days of parameter tuning, the results provided by training Model 1 in
these conditions were found to be undoubtedly the best.
A prior evaluation of the trained generative model was made to consider this the best performance within the whole experiment. Still, the evaluation of the output of the whole system, which is an audio sample, is very subjective. These results were considered not to have enough significance when judged by only one individual. Therefore, a user study regarding the quality of the generated samples was found to be essential, and is presented in the next chapter.
(a) Input dataset sample (b) Generation with Model 1
(c) Generation with Model 2 (d) Generation with Model 3
Figure 4.10: Generations of the trained generative model for all the proposed models. In (a) 4 different dataset samples are presented, and in (b), (c) and (d) generated samples from 4 different random vectors are presented, regarding Models 1, 2 and 3, respectively.
(a) Discriminator’s loss (b) Generator’s loss
Figure 4.11: Discriminator (a) and generator (b) losses regarding the trained generative model's best performance. The red, blue and orange lines are relative to Models 1, 2 and 3, respectively. The plotted data is smoothed in order to provide an easier analysis.
Chapter 5
User Study
As a method to evaluate the results of the implemented generative model, a user study was developed. This work proposes the exploration of music generation based on audio. The proposed generative model should be able to produce a musical audio sample under some constraints defined by the dataset. These are the following:
• 2-second-long audio samples, which correspond to 1 bar at a tempo of 120 BPM.

• The melody dataset implies the presence of only one note at a time, with a minimum subdivision of a quarter note, i.e. 1 bar can contain a maximum of 4 notes (each with a duration of 0.5 seconds).
• All the notes are in the key of A minor.
A definition of musical sound is mandatory in order to classify the results. A distinction between tone and noise is made regarding the physical characteristics of sound. The main difference is that a tone is identified by certain characteristics such as controlled pitch and timbre, whereas noise is generally identified by its source, e.g. waves breaking on shore or a plastic bottle being squashed [42].
This study should test the following hypotheses:
Hypothesis 1: The generated audio samples are not classified as a noise sound.
Hypothesis 2: The generated audio samples are ranked as musically pleasant.
A fully detailed description of the experimental set-up is presented in the following sections.
5.1 Participants
5 people (3 males, 2 females), aged between 22 and 25 years old, voluntarily participated in the study. All participants have at least basic theoretical music knowledge, and none of them suffered from any hearing disorder. No participant had prior contact with the study.
5.2 Design
The study involved 10 samples from each of the following groups:
• X, the dataset.
• Y, the trained generative model.
• Z, the untrained generative model.
In order to mitigate practice effects, the study design followed a block randomization, i.e. each
participant evaluated a block with the same samples but randomly sequenced.
Two questions were formulated, and should be answered for each sample in the block. These are
the following:
• Q1 — Do you classify the sound you heard as a noise sound or as a tone sound?
This should be answered with either tone or noise, i.e. a binary answer.
• Q2 — The sound you heard is musically pleasing.
This should be evaluated on a Likert scale [43], which gives a quantitative value on a subjective matter, generally based on the level of agreement or disagreement. The scale used comprehends values from 1 to 5.
5.3 Procedure
A brief overview of the procedures was given to participants before they were subjected to the experiment. They first had to fill in demographic information (gender, age, theoretical music level, hearing condition). Then, participants were asked to read some definitions in order to proceed to the following stage.

The experimenter played each sample of the randomly ordered block, guaranteeing an interval of 5 seconds between each 2-second sample, so that participants could answer the 2 questions. After listening to the 30-sample block, the experiment was concluded.
5.4 Results
Participants were asked to answer the questions Q1 and Q2. For each of the three sample groups X, Y
and Z, 50 answers were evaluated.
As Q1 expects a binary answer (Tone/Noise), the relative frequencies regarding each sample group
were computed and are plotted in figure 5.1. One may conclude that all the samples from the X group
were classified as tone, all the samples from the Z group were classified as noise, and 95% of the
samples from the Y group were classified as tone.
For Q2, the answer is a rank (Likert scale); therefore, the average ranks per sample group were computed and are plotted in figure 5.2. One may conclude that, in terms of musical pleasantness, on average,
Figure 5.1: Relative frequency per sample group regarding Q1.
the samples from the group Z are not pleasant at all, the samples from the X group are very pleasant,
and the samples from the Y group are relatively pleasant.
Figure 5.2: Boxplot of the average rank per sample group regarding Q2.
The above descriptive statistics already show evidence that both hypotheses are verified. However, the survey results were also tested for statistically significant differences between the sample groups by means of a Friedman test.
The Friedman test is used to test for differences between groups when the dependent variable being
measured is ordinal. The results of this test are presented in table 5.1, proving that:
• There was a statistically significant difference in the noise/tone classification (Q1) depending on
which group the sample belongs to, χ2(2) = 94.360, p-value < 0.05.
• There was a statistically significant difference in how musically pleasing the sample was (Q2)
depending on which group the sample belongs to, χ2(2) = 89.805, p-value < 0.05.
The Friedman test proved the existence of statistically significant differences, but does not prove anything else. In order to determine where these differences actually occur, it is necessary to perform post hoc tests.
Friedman Test Q1 Q2
N 50 50
Chi-Square 94.360 89.805
df 2 2
p-value 3.236× 10−21 3.156× 10−20
Table 5.1: Friedman test for Q1 and Q2.
The appropriate ones are Wilcoxon signed-rank tests on the different combinations of related groups. Hence, the following combinations will be compared: X-Y, X-Z and Y-Z.
When making multiple comparisons with the Wilcoxon test, an adjustment of the significance level, called the Bonferroni correction, has to be made. This simply consists of taking the initial significance level (0.05) and dividing it by the number of tests being performed (3 combinations). Therefore, the adjusted significance level is 0.05/3 ≈ 0.017.
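Although the analysis was carried out in SPSS, an equivalent computation can be scripted with SciPy. In the sketch below, q2_x, q2_y and q2_z are hypothetical arrays holding each group's 50 answers to Q2:

```python
from scipy import stats

# Friedman test across the three related groups.
chi2, p = stats.friedmanchisquare(q2_x, q2_y, q2_z)
print('Friedman:', chi2, p)

alpha = 0.05 / 3                      # Bonferroni-adjusted significance level
pairs = {'X-Y': (q2_x, q2_y), 'X-Z': (q2_x, q2_z), 'Y-Z': (q2_y, q2_z)}
for name, (a, b) in pairs.items():
    stat, p = stats.wilcoxon(a, b)    # paired: same participants rated all groups
    print(name, 'significant' if p < alpha else 'not significant', round(p, 4))
```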
Wilcoxon Signed-Rank Test X-Y X-Z Y-Z
Q1 Z −1.732 −7.071 −6.856
Q1 p-value 0.083 1.537× 10−12 7.099× 10−12
Q2 Z −4.118 −6.372 −6.199
Q2 p-value 3.819× 10−5 1.865× 10−10 5.682× 10−10
Table 5.2: Wilcoxon signed-rank tests for Q1 and Q2.
The Wilcoxon signed-rank test is used to compare two sets of scores that come from the same
participant. The results of this test are presented in table 5.2, proving that:
• There was a statistically significant difference in the noise/tone classification (Q1) between the
sample groups X-Z (Z(2) = −7.071, p-value < 0.017) and Y-Z (Z(2) = −6.856, p-value < 0.017),
but not between the sample groups X-Y (Z(2) = −1.732, p-value ≥ 0.017).
• There was a statistically significant difference in how musically pleasing the sample was (Q2)
between the sample groups X-Y (Z(2) = −4.118, p-value < 0.017), X-Z (Z(2) = −6.372, p-value <
0.017) and Y-Z (Z(2) = −6.199, p-value < 0.017).
Based on the performed post hoc tests, one may confidently state that the samples generated by the trained generative model (Y) are tone sounds. Moreover, the samples from the untrained generative model (Z) are less musically pleasing than the samples from the trained generative model (Y), and the samples from both Z and Y are less musically pleasing than the samples from the dataset (X).
Chapter 6
Conclusions
6.1 Achievements
In this master's thesis a music generation system is proposed. This system is composed of a transform block that converts waveform audio samples to a time-frequency domain, a generative model that generates time-frequency-domain samples, and an inverse transform that provides a waveform audio output.
In an early stage, some basic validation tests were performed. The blocks regarding the Constant-Q Transform were tested independently from the remaining system, achieving a considerable similarity between the original audio sample and the processed one. The implemented generative model was tested with the MNIST benchmark dataset, achieving results close to the ones in the literature, with the lack of performance measures being the only reason not to consider them equally good. The suitability of the proposed generative model to represent the data in question was confirmed.
The proposed generative model comprehended three different convolutional approaches, differing mainly in striding directions. Horizontal and vertical, only horizontal, and only vertical were the different options, named Model 1, Model 2 and Model 3, respectively. Model 1 was found to be the only one that provided reasonable results. The exploration of these layer architectures was still useful to break down some of the parameters' influence on the learning process.
The user study conducted tested the hypotheses that the generated audio samples are not noise, i.e. controlled pitch characteristics were found, and that they are musically pleasing. This study compared the trained generative model's output samples, Y, with samples from the dataset, X, and with samples from the generative model before training, Z, i.e. noise. It was concluded that the Y samples are classified as not noise, i.e. as a tone sound. It was also concluded that the samples from the trained generative model group (Y) are more musically pleasing than the samples from the untrained generative model group (Z), but still not as musically pleasing as the samples from the dataset group (X).
The implemented system was expected to generate better music samples, since some successful models had already been developed using the MIDI notation. However, when music is represented at such a low level as a waveform, even generating a sample that is not noise is a challenging task. Considering that and the results of the user study, it was found that the initial expectations were too high, and that the results achieved are actually interesting.
6.2 Future Work
Despite the successful implementation of the proposed generative model to address music generation, there is still a large margin for improvement.
The transform used to process the waveform audio samples did increase the dimensions of the data.
A method to reduce those dimensions without jeopardizing the quality of the data might be critical when
dealing with larger datasets.
The generation of independent bars was achieved when training over independent bars. However, adding a layer to the system that correlates bars would allow the generation of longer structures with more than one bar, increasing the musical complexity.
Retrieving information from the latent space allows conditioning the generations. With that in mind,
one may focus on generating melodies over a prior chord or, the exact opposite, generating chords to
support some melody.
Less constrained datasets might be harder to train on but, once successful, the creativity of the generations should definitely improve. One example is a dataset with different instruments, and consequently different timbres.
Bibliography
[1] L. A. Hiller and L. M. Isaacson, Experimental music: composition with an electronic computer.
McGraw-Hill, 1959.
[2] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, Midinet: A convolutional generative adversarial network
for symbolic-domain music generation, 2017. eprint: arXiv:1703.10847.
[3] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, Musegan: Multi-track sequential generative
adversarial networks for symbolic music generation and accompaniment, 2017. eprint: arXiv:
1709.06298.
[4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A.
Senior, and K. Kavukcuoglu, Wavenet: A generative model for raw audio, 2016. eprint: arXiv:
1609.03499.
[5] V. Kalingeri and S. Grandhe, Music generation with deep learning, 2016. eprint: arXiv:1612.
04928.
[6] A. Nayebi and M. Vitelli, “Gruv: Algorithmic music generation using recurrent neural networks”,
Course CS224D: Deep Learning for Natural Language Processing (Stanford), 2015.
[7] A. Eigenfeldt and P. Pasquier, “Realtime generation of harmonic progressions using controlled
markov selection”, in Proceedings of ICCC-X-Computational Creativity Conference, 2010, pp. 16–
25.
[8] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, Sam-
plernn: An unconditional end-to-end neural audio generation model, 2016. eprint: arXiv:1612.
07837.
[9] O. Mogren, C-rnn-gan: Continuous recurrent neural networks with adversarial training, 2016.
eprint: arXiv:1611.09904.
[10] T. L. Paine, P. Khorrami, S. Chang, Y. Zhang, P. Ramachandran, M. A. Hasegawa-Johnson, and
T. S. Huang, Fast wavenet generation algorithm, 2016. eprint: arXiv:1611.09482.
[11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio, Generative adversarial networks, 2014. eprint: arXiv:1406.2661.
[12] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolu-
tional generative adversarial networks”, arXiv preprint arXiv:1511.06434, 2015.
[13] J. C. Graue, Scale music. [Online]. Available: https://www.britannica.com/art/scale-music.
[14] W. M. University, The elements of music. [Online]. Available: http://wmich.edu/mus-gened/
mus170/RockElements.pdf.
[15] J. C. Brown and M. S. Puckette, “An efficient algorithm for the calculation of a constant q trans-
form”, The Journal of the Acoustical Society of America, vol. 92, no. 5, pp. 2698–2701, 1992.
[16] C. Schorkhuber and A. Klapuri, “Constant-q transform toolbox for music processing”, in 7th Sound
and Music Computing Conference, Barcelona, Spain, 2010, pp. 3–64.
[17] Librosa. [Online]. Available: https://librosa.github.io/librosa/index.html.
[18] B. Yegnanarayana, Artificial neural networks. PHI Learning Pvt. Ltd., 2009.
[19] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity”, The
bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[20] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in
the brain.”, Psychological review, vol. 65, no. 6, p. 386, 1958.
[21] [Online]. Available: https://tex.stackexchange.com/questions/132444/diagram-of-an-
artificial-neural-network.
[22] [Online]. Available: https : / / medium . com / the - theory - of - everything / understanding -
activation-functions-in-neural-networks-9491262884e0.
[23] [Online]. Available: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/.
[24] [Online]. Available: http://rohanvarma.me/Loss-Functions/.
[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating
errors”, nature, vol. 323, no. 6088, p. 533, 1986.
[26] S. Perry, Create an artificial neural network using the neuroph java framework. [Online]. Available:
https://www.ibm.com/developerworks/library/cc-artificial-neural-networks-neuroph-
machine-learning/index.html.
[27] I. Goodfellow, Nips 2016 tutorial: Generative adversarial networks, 2016. eprint: arXiv:1701.
00160.
[28] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved tech-
niques for training gans”, in Advances in Neural Information Processing Systems, 2016, pp. 2234–
2242.
[29] A. Karpathy, Cs231n convolutional neural networks for visual recognition. [Online]. Available: http:
//cs231n.github.io/convolutional-networks/#conv.
[30] V. Dumoulin and F. Visin, A guide to convolution arithmetic for deep learning, 2016. eprint: arXiv:
1603.07285.
[31] C.-Y. Lee, P. W. Gallagher, and Z. Tu, Generalizing pooling functions in convolutional neural net-
works: Mixed, gated, and tree, 2015. eprint: arXiv:1509.08985.
[32] A. Dertat, Applied deep learning - part 4: Convolutional neural networks. [Online]. Available:
https : / / towardsdatascience . com / applied - deep - learning - part - 4 - convolutional -
neural-networks-584bc134c1e2.
[33] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: The all con-
volutional net, 2014. eprint: arXiv:1412.6806.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, Going deeper with convolutions, 2014. eprint: arXiv:1409.4842.
[35] M. Lin, Q. Chen, and S. Yan, Network in network, 2013. eprint: arXiv:1312.4400.
[36] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing
internal covariate shift, 2015. eprint: arXiv:1502.03167.
[37] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “Librosa: Audio
and music signal analysis in python”, in Proceedings of the 14th python in science conference,
2015, pp. 18–25.
[38] Tensorflow. [Online]. Available: https://www.tensorflow.org/.
[39] Cuda. [Online]. Available: https://developer.nvidia.com/cuda-gpus.
[40] Ableton live 9 lite. [Online]. Available: https://www.ableton.com/en/products/live-lite/
features/.
[41] R. Robinson, Ml notebook. [Online]. Available: https://mlnotebook.github.io/post/GAN4/
#train.
[42] W. E. Thomson, Musical sound. [Online]. Available: https://www.britannica.com/science/
musical-sound.
[43] S. Jamieson, Likert scale. [Online]. Available: https://www.britannica.com/topic/Likert-
Scale.
Appendix A
Survey
The following document was the one given to participants when conducting the user study.
This survey is part of a user study within a master's thesis. The following experiment consists of listening to short audio segments and classifying them as requested.
Please start by filling the following table with some personal information.
Age Gender Hearing disorder Basic theoretical musical knowledge
Male Female Yes No Yes No
o o o o o o
In order to proceed with the survey, the following concepts should be considered:
(1) A musical sound can be classified as a tone sound or a noise sound. A distinction between these regards the physical characteristics of sound. The main difference is that tone is identified by certain characteristics such as controlled pitch and timbre, whereas noise is generally identified by its source, for example waves breaking on shore or a plastic bottle being squashed.
In the next page a table is presented with the survey’s questions. For each sample heard, the participant should only fill the corresponding line. Only one circle should be filled per question.
The questions to be answered for all the samples heard are the following:
1. According to (1), is the sample you heard a noise sound or a tone sound?
2. How musically pleasing is the sound you heard?
Sample number
1 - Physical sound characteristics 2 - The audio sample is musically pleasing
Noise Tone 1 2 3 4 5
Strongly disagree
Strongly agree
1 o o o o o o o 2 o o o o o o o 3 o o o o o o o 4 o o o o o o o 5 o o o o o o o 6 o o o o o o o 7 o o o o o o o 8 o o o o o o o 9 o o o o o o o
10 o o o o o o o 11 o o o o o o o 12 o o o o o o o 13 o o o o o o o 14 o o o o o o o 15 o o o o o o o 16 o o o o o o o 17 o o o o o o o 18 o o o o o o o 19 o o o o o o o 20 o o o o o o o 21 o o o o o o o 22 o o o o o o o 23 o o o o o o o 24 o o o o o o o 25 o o o o o o o 26 o o o o o o o 27 o o o o o o o 28 o o o o o o o 29 o o o o o o o 30 o o o o o o o