
Treball final de grau

GRAU D'INFORMÀTICA

Facultat de Matemàtiques i Informàtica
Universitat de Barcelona

Wavelet pooling for convolutional neural networks

Autora: Aina Ferrà Marcús

Directora: Dra. Petia Radeva

Co-director: Eduardo Aguilar

Realitzat a: Departament de Matemàtiques i Informàtica

Barcelona, June 27, 2018


Abstract

Wavelets are mathematical functions that are currently used in many computer vision problems, such as image denoising or image compression. In this work, we first study the basic theory of wavelets, in order to understand them and build the background needed to develop a new application. For that purpose, we propose two pooling methods based on wavelets: one based on a single wavelet basis and one that combines two bases working in parallel. We test them and show that they can be used at the same level of performance as max and average pooling.

Resum

Wavelets are mathematical functions that are used nowadays in many computer vision problems, such as image denoising or image compression. In this work, we first study all the basic theory that allows us to understand these functions and build a solid background with which to construct a new application. For that purpose, we propose two pooling methods based on wavelets: one method uses a single wavelet basis and the other combines two bases working in parallel. We test both methods to show that they can be used to achieve the same results as max and average pooling.


Acknowledgements

First of all, I would like to thank my professor, Dr. Petia Radeva, for this opportunity. I would also like to thank Eduardo Aguilar for his valuable help in understanding the many dimensions of TensorFlow and Keras.

I would like to thank the authors of the paper that served as inspiration for this work, Dr. Travis Williams and Dr. Robert Li. Special thanks to Dr. Williams for responding to my e-mails on such short notice and being so specific about small details.

I would like to thank Dr. Lluís Garrido for his corrections of this work, among much other valuable advice.

Thanks to my two best friends, Jose and Ignasi, for accompanying me on this journey. You guys are the best.

And of course, thanks to my family and Nico. You have always been there for me.


Contents

1 Introduction

2 State of the art

3 Methodology
  3.1 Fourier analysis
  3.2 Wavelets
  3.3 Continuous wavelet transform
  3.4 Discrete wavelet transform
  3.5 Multiresolution analysis
  3.6 Wavelet matrix
  3.7 2D transform explained
  3.8 Example: the Haar wavelet
  3.9 Visualization and effects of some wavelets
  3.10 Introduction and basic concepts of neural networks
  3.11 Convolutional neural networks
  3.12 Convolutional layers
  3.13 Pooling layers
    3.13.1 Max pooling
    3.13.2 Average pooling
    3.13.3 Probabilistic pooling methods

4 Wavelet pooling
  4.1 Multiwavelet pooling

5 Validation
  5.1 Dataset
  5.2 Validation methodology
  5.3 Settings
    5.3.1 MNIST
    5.3.2 CIFAR-10
    5.3.3 SVHN
  5.4 Experiments
    5.4.1 MNIST
    5.4.2 CIFAR-10
    5.4.3 SVHN
  5.5 Comparison

6 Conclusions and future lines


1 Introduction

Wavelets are mathematical functions that have been heavily studied during the 20th century and used in many applications. The first wavelet development is attributed to Alfréd Haar in 1909. The development of the continuous wavelet transform would come much later (1975), while studying the reaction of the ear to sound, and can be attributed to George Zweig. In 1982, Pierre Goupillaud, Alex Grossmann and Jean Morlet established the formulation of what is now known as the continuous transform, thus "rediscovering" it. They were interested in analyzing data from seismic surveys, which are used in oil and mineral exploration to get images of layering in subsurface rock [1]. Then would come Jan-Olov Strömberg's work on discrete wavelets (1983), Ingrid Daubechies' orthogonal wavelets with compact support (1988), Stéphane Mallat's multiresolution framework (1989), Ali Akansu's binomial QMF (1990), Nathalie Delprat's time-frequency interpretation of the continuous transform (1991), Newland's harmonic wavelet transform (1993) and many more since then.

We can see that the origin of wavelets comes from different real-world applications, and in the span of 20 years many researchers have shown interest in them. That tendency has continued: wavelets are currently used in data and image compression [2], partial differential equation solving [3], transient detection [4], pattern recognition [5, 6], texture analysis [7], noise reduction [8], financial data analysis [9] and many more fields. Perhaps an interesting and surprising example is the FBI algorithm for fingerprint compression [10].

On another front, one of the biggest breakthroughs of recent computer science has been the development of deep learning. This technique imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning belongs to the field of artificial intelligence. The term "Artificial Intelligence" was first used at a workshop held on the campus of Dartmouth College during the summer of 1956. The term "deep learning", however, would come much later, used for the first time by Igor Aizenberg and colleagues around 2000. One may recall the historical events of 1996, when Deep Blue won its first game against the world chess champion Garry Kasparov, or of 2016, when AlphaGo defeated the 9-dan professional Lee Sedol at Go. Those are just some examples of what artificial intelligence, and specifically deep learning for the latter, can achieve.

Neural networks, the main tool of deep learning, mark a before-and-after in the history of computer science. One special kind, convolutional neural networks, is commonly associated with computer vision. Their historical roots can be traced back to the 1980s, when Kunihiko Fukushima proposed a neural network inspired by the feline visual processing system [13, 14]. Yann LeCun helped establish how we use convolutional neural networks nowadays: as multiple layers of neurons that process more complex features at deeper layers of the network. Their properties differ from those of other traditional networks; for example, they use what best represents images: height, width and depth (number of channels). Instead of a set of neurons that accept a single numerical vector of values, convolutional neural networks work with the whole image, adding layers of information in the depth dimension. In summary, convolutional neural networks consistently classify images, objects and videos at a higher accuracy rate than vector-based deep learning techniques.


Pooling methods are designed to compact information, i.e., to reduce data dimensions and parameters, thus increasing computational efficiency. Since convolutional neural networks work with the whole image, the number of neurons increases and so does the computational cost. For this reason, some kind of control over the size of our data and parameters is needed. However, this is not the only reason to use pooling methods, as they are also very important for performing a multi-level analysis. This means that rather than the exact pixel where the activation happened, we look for the region where it is located. Pooling methods vary from simple deterministic ones, such as max pooling, to more sophisticated probabilistic ones, like stochastic pooling. All of these methods have in common that they use a neighborhood approach that, although fast, introduces edge halos, blurring and aliasing.

Max pooling is a basic technique that usually works, but it is perhaps too simple and tends to overfit. On the other hand, average pooling is more resistant to overfitting, but it can create blurring effects on certain datasets. Choosing the right pooling method is key to obtaining good results. The main goal of this work is to prove that wavelets can be used as a valid pooling method. We claim that wavelets can be a good choice for the majority of datasets and a safe choice when in doubt. In order to do that, we study the basic theory of wavelets as abstract mathematical functions, and we prepare and test a wavelet pooling layer against three different datasets. We also state the basic concepts of convolutional neural networks and study the common pooling techniques. Then, we explain the pooling algorithm taken from the work of Dr. Travis Williams and Dr. Robert Li [20], test it in different contexts, and explore how different choices of wavelet basis affect the performance of the network. Finally, we propose a new pooling method: multiwavelet pooling. Since wavelets work similarly to filters, we show that combining multiple wavelet bases in the same layer contributes to achieving a higher accuracy.

This work is structured as follows: Section 2 contains the state of the art; Section 3 contains the methodology, including basic wavelet theory and basic convolutional neural network knowledge; Section 4 formalizes our proposed pooling methods; Section 5 contains the validation and results; and Section 6 contains the conclusions and future work.


2 State of the art

In this work, we focus mainly on the effect and performance of different pooling techniques. The purpose of pooling layers is to achieve spatial invariance by reducing the resolution of the feature maps. Here, we present the most used pooling techniques.

Perhaps the best-known and most used pooling method is max pooling: it reduces data size by taking the highest value within a fixed region. Max pooling is a simple and computationally efficient technique, since it requires only comparisons and no further computation; it is often used by default when in doubt. However, due to its simplicity, it tends to overfit the data [24].

The second most used method is average pooling: it computes the average of all activations in a fixed region. This method is also computationally efficient but, by taking a neighborhood of values, it can produce halos and blurring [26].

To combat the previously mentioned problems, researchers focused on probabilistic methods, i.e., introducing randomness in the hope of helping performance. From that idea, mixed pooling was born: with a set probability λ, max pooling is performed and, with probability 1 − λ, average pooling is. There is no single way to perform this method, as it can be applied to all features within a layer, mixed between features within a layer, or mixed between regions for different features within a layer [25]. Additionally, gated pooling takes this idea and generalizes it a bit further: the probability based on which we choose between average and max pooling becomes a trainable parameter [25]. One step further brings us to what is referred to as tree pooling, in which both the pooling filters and the way to responsively combine those learned filters are learned [25].

Continuing with the same line of thought, stochastic pooling was introduced: assign a probability to each pixel corresponding to its activation value; the higher the value, the higher the probability of being chosen. This method works in the same fashion as max pooling, but introduces some randomness that can help improve accuracy [26].

Wavelet pooling is designed to resize the image with almost no loss of information [20]. With this property, it could be a safe choice when one is doubtful between max pooling and average pooling: wavelet pooling will not create any halos and, because of its structure, it seems it could resist overfitting better. We propose to generalize this technique a bit further by combining several wavelet pooling layers that act like a bank of pooling filters.


3 Methodology

In this section, we explore mainly two subjects: wavelet theory and convolutional neural networks. For the first one, we introduce all its basic concepts, giving formal definitions and building up to the matrix that we will use in the proposed algorithm. For the latter, we review some concepts of deep learning and the what and why of this kind of network.

3.1 Fourier analysis

One may recall that in 1807, Joseph Fourier asserted that any 2π-periodic function f(x) could be written as

f(x) = a_0 + \sum_{k=1}^{\infty} \left( a_k \cos(kx) + b_k \sin(kx) \right),

now known as a Fourier series. The coefficients a_0, a_k, b_k can be calculated by

a_0 = \frac{1}{2\pi} \int_0^{2\pi} f(x)\,dx, \qquad a_k = \frac{1}{\pi} \int_0^{2\pi} f(x)\cos(kx)\,dx, \qquad b_k = \frac{1}{\pi} \int_0^{2\pi} f(x)\sin(kx)\,dx.

Later on, that would give rise to a new mathematical field: frequency analysis. Let us assume that we have a smooth signal f(t). The standard Fourier transform

\mathcal{F}f(w) = \frac{1}{\sqrt{2\pi}} \int e^{-iwt} f(t)\,dt

gives a representation of the frequency content of f. In many applications, one is interested in its frequency content locally in time, but the Fourier transform fails to encode high-frequency bursts. Time localization can be achieved by windowing the signal f, so as to cut off only a well-localized slice of f, and then taking its Fourier transform:

(\mathcal{F}f)_g(w, t) = \int f(s)\, g(s - t)\, e^{-iws}\,ds.

This is called the windowed Fourier transform, where g represents our window. It is common in signal analysis to work with its discrete frequency version, where t and w are assigned regularly spaced values t = n t_0, w = m w_0, with n, m \in \mathbb{Z} and w_0, t_0 > 0 fixed. Then we have

(\mathcal{F}f)_g(m, n) = \int f(s)\, g(s - n t_0)\, e^{-i m w_0 s}\,ds.

Finally, to treat each frequency equally, the width of the window should be allowed to vary inversely with the magnitude of the frequency. Rearranging the formula accordingly, the first wavelet transform appears:

(\mathcal{F}f)_g(w, t) = \frac{1}{\sqrt{\alpha}} \int_{-\infty}^{\infty} f(x)\, g\!\left(\frac{x - t}{\alpha}\right) e^{-2\pi i w \frac{x - t}{\alpha}}\,dx,

where α is a weighting coefficient: as it increases, the frequency decreases and the window function expands. This rearrangement can also be done with the discrete counterpart.
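As a small numerical illustration of the windowed Fourier transform described above, the following sketch evaluates (Ff)_g(w, t) for a toy signal using a Gaussian window; the signal, window width and sampling grid are all made up for illustration.

```python
import numpy as np

# Toy signal: a 5 Hz tone that switches to 20 Hz halfway through.
s = np.linspace(0.0, 1.0, 2000, endpoint=False)
f = np.where(s < 0.5, np.sin(2 * np.pi * 5 * s), np.sin(2 * np.pi * 20 * s))

def windowed_ft(f, s, w, t, sigma=0.05):
    """Numerically approximate (Ff)_g(w, t) = ∫ f(s) g(s - t) e^{-iws} ds
    with a Gaussian window g of width sigma centred at t."""
    g = np.exp(-((s - t) ** 2) / (2 * sigma ** 2))
    return np.trapz(f * g * np.exp(-1j * w * s), s)

# The 20 Hz component only shows up when the window sits on the second half.
print(abs(windowed_ft(f, s, w=2 * np.pi * 20, t=0.25)))  # small
print(abs(windowed_ft(f, s, w=2 * np.pi * 20, t=0.75)))  # large
```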


3.2 Wavelets

It is common in the literature to assume that one already knows what a wavelet is, and so a precise definition is hard to find. Furthermore, there are some mixed concepts in wavelet analysis, such as the words wavelet, mother wavelet, father wavelet and wavelet family. We will present a proper definition for each of those concepts but, in time, the word wavelet will take the lead.

Definition 3.1. A mother wavelet is a function ψ : \mathbb{C} \to \mathbb{C} satisfying the following conditions:

1. \int_{-\infty}^{\infty} |ψ(t)|^2\,dt < \infty

2. \int_{-\infty}^{\infty} \frac{|Ψ(w)|^2}{|w|}\,dw < \infty

where Ψ is the Fourier transform of ψ.

The first thing we should notice is that the definition states that ψ is defined over \mathbb{C}, meaning that we should be careful when performing some operations. This will not bother us later on, since our object of study (images) can be viewed as a function over \mathbb{R} and everything will work just fine.

Now, about the interesting mathematical concepts. The first condition implies that the function has finite energy, an important concept in signal processing. The second one is called the admissibility condition and implies that if Ψ is smooth then Ψ(0) = 0. It can be shown that square-integrable functions satisfying the admissibility condition can be used to first analyze and then reconstruct a signal without loss of information. That statement gives us a hint why researchers have decided to use wavelets as a tool in image processing: they allow us to pool the image, then analyze it and reconstruct the same exact image.

Further on, since Ψ(0) = 0, the average value of the wavelet must be zero,

\int_{-\infty}^{\infty} ψ(t)\,dt = 0,

and therefore it must be oscillatory; that is, ψ(t) must be a wave.

Roughly speaking, a mother wavelet is a wave-like oscillation with an amplitude that begins at zero, increases and then decreases back to zero. The main difference between cosines, sines and mother wavelets is that, while the former extend over the whole time domain, the latter are only "alive" for a short period of time. We will look at some examples of what mother wavelets look like in Section 3.9.

Now, we present what it means to form a family of wavelets.

Definition 3.2. Given a mother wavelet ψ, a family of wavelets is a basis defined by

ψ_{(s,l)}(x) = \frac{1}{\sqrt{s}}\, ψ\!\left(\frac{x - l}{s}\right),

where s is an integer called the scale index and l is an integer called the location index.


It is clear now why the phrasing becomes lazy and ends up using wavelet for everything. A family of wavelets is just a set of functions generated by translating and scaling the mother wavelet. So, essentially, a family of wavelets is just a bunch of modifications of the same mother wavelet.

3.3 Continuous wavelet transform

It is obvious that we will not work with continuous data or functions, but studying this case is fundamental to understand its discrete counterpart and the problems that come with it. The wavelet transform provides a time-frequency description, similar to the one in the Fourier transform. The location and scale parameters s, l vary continuously over \mathbb{R}. The continuous wavelet transform (CWT) of a smooth function f(t) is given by

CWT_{(s,l)}f = s^{-1/2} \int f(t)\, ψ\!\left(\frac{t - l}{s}\right) dt.

The resemblance with the Fourier transform is striking: we are just replacing the exponential e^{iwt} with a wavelet ψ(·). From Euler's formula, e^{iy} = \cos y + i \sin y, we are just changing sines and cosines for another type of wave.
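In practice, the CWT of a sampled signal can be computed with the PyWavelets library; a minimal sketch, with a made-up test signal and a scale range chosen only for illustration:

```python
import numpy as np
import pywt

t = np.linspace(0.0, 1.0, 1024)
signal = np.sin(2 * np.pi * 16 * t)               # toy 16 Hz sine

scales = np.arange(1, 65)                         # scales s to evaluate
coeffs, freqs = pywt.cwt(signal, scales, 'morl')  # Morlet mother wavelet

# One row of coefficients per scale, one column per time location.
print(coeffs.shape)   # (64, 1024)
```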

Another important thing is that a function can be reconstructed from its wavelet transform by means of

f(t) = \frac{1}{C_Ψ} \int_0^{\infty} \int_{-\infty}^{\infty} CWT_{(s,l)}f\; ψ_{(s,l)}(t)\, \frac{dl\,ds}{s^2},

where C_Ψ comes from the admissibility condition:

C_Ψ = \int_0^{\infty} \frac{|Ψ(w)|^2}{w}\,dw < \infty.

The inverse formula can be viewed in two different ways: as a way of reconstructing f once its wavelet transform is known, or as a way of writing f as a superposition of wavelets ψ_{(s,l)}; the coefficients in this superposition are exactly given by the wavelet transform.

We could stop studying the continuous transform here, since we have everything we need to understand the further process. However, there are some properties that are worth studying, since they give a better vision of wavelets and help our understanding. Let us talk about moments.

An important concept for wavelets is the notion of vanishing moments. Wavelets decay over time and eventually die out, but how they do that and how smooth they are is encoded in the vanishing moments.

Definition 3.3. The p-th vanishing moment of a wavelet ψ_{(s,l)}(t) is defined by

M_p = \int_{-\infty}^{\infty} t^p\, ψ_{(s,l)}(t)\,dt.

Vanishing moments can be defined for any function and measure its "smoothness": the larger they are, the smoother the function. If we expand the Taylor series of our CWT at t = 0, taking l = 0 as well for simplicity:

CWT_{(s,0)}f = \sum_{p=0}^{n} f^{(p)}(0) \int \frac{t^p}{p!}\, ψ_{(s,0)}(t)\,dt + O(n+1),


now, rewriting the expression with vanishing moments:

CWT_{(s,0)}f = f(0) M_0 s + f^{(1)}(0) M_1 s^2 + \frac{f^{(2)}(0)}{2!} M_2 s^3 + \ldots + \frac{f^{(n)}(0)}{n!} M_n s^{n+1} + O(s^{n+2}).

From the admissibility condition, we know that M_0 = 0. If the other moments are zero as well, then the wavelet coefficients will decay as fast as s^{n+2} for a smooth signal f(t). If a wavelet has N vanishing moments, so will the continuous wavelet transform. Vanishing moments will be further explained in Section 3.9.

3.4 Discrete wavelet transform

It is time to move on to the core part of our study. In this case, the location and scale parameters l, s take only discrete values. For s we choose the integer powers of one fixed scale parameter s_0 > 1, that is, s = a s_0^m with a \in \mathbb{R} and m \in \mathbb{Z}. It follows that the discretization of the location parameter l should depend on m: narrow (high frequency) wavelets are translated by small steps in order to cover the whole time range, while wider (lower frequency) wavelets are translated by larger steps. We then choose l = n l_0 a s_0^m, where l_0 > 0 is fixed and n \in \mathbb{Z}. The formula looks like

DWT^{m,n}_{(s_0,l_0)}f = s_0^{-m/2} \int f(t)\, ψ(s_0^{-m} t - n l_0)\,dt.

Since this is the counterpart of a continuous algorithm, some questions arise: is it possible to characterize f completely knowing its discrete wavelet transform? Is it possible to reconstruct f in a numerically stable way? Considering the dual problem that we proposed in the continuous form, one may ask the following questions: is it possible to write f as a superposition of ψ_{(s,l)}? Is there a numerically stable algorithm to find such coefficients?

In the discrete case there is, in general, no analogue of the continuous inversion formula to recover f from its wavelet coefficients. Reconstruction, if at all possible, must therefore be done by some other means. Besides, discrete frames often carry redundant information. This redundancy can be exploited or eliminated down to its bare essentials, which takes us to the next concept.
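In code, a single level of the discrete wavelet transform splits a signal into approximation and detail coefficients, each roughly half the original length; a minimal PyWavelets sketch with a made-up signal:

```python
import numpy as np
import pywt

signal = np.arange(8, dtype=float)         # toy 1D signal of length 8

cA, cD = pywt.dwt(signal, 'haar')          # approximation and detail coefficients
print(cA.shape, cD.shape)                  # (4,) (4,)

reconstructed = pywt.idwt(cA, cD, 'haar')  # perfect reconstruction from both parts
print(np.allclose(reconstructed, signal))  # True
```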

3.5 Multiresolution analysis

First of all, we will start with a general definition.

Definition 3.4. Let V_j, j = \ldots, -2, -1, 0, 1, 2, \ldots be a sequence of subspaces of functions in L^2(\mathbb{R}). The collection of spaces \{V_j, j \in \mathbb{Z}\} is called a multiresolution analysis with scaling function φ if the following conditions hold:

1. Nested: V_j \subset V_{j+1}

2. Density: \overline{\bigcup V_j} = L^2(\mathbb{R}), where \overline{X} denotes the topological closure of a space X.

3. Separation: \bigcap V_j = \{0\}

4. Scaling: f(x) \in V_0 \iff f(2^j x) \in V_j

5. Orthonormal basis: φ \in V_0 and the set \{φ(x - k), k \in \mathbb{Z}\} is an orthonormal basis (using the L^2 inner product) for V_0.


Let us stop for a moment. We have introduced some new notation without explaining it. What is an L^2 space?

Definition 3.5. The L^2 space is the space of square-integrable functions, i.e., those functions such that

\int_{-\infty}^{\infty} |f(x)|^2\,dx < \infty.

Alternatively, it can also be defined as the set of functions whose square is Lebesgue integrable. The inner product of the space is given by

\langle f, g \rangle = \int_A f(x)\, g(x)\,dx.

This is a simplified definition, but it serves our purpose. In our field of research, all functions are square integrable, since we are working with finite sets of values (images), so this is purely notation.

The V_j spaces are called approximation spaces. There may be several choices of φ corresponding to a system of approximation spaces, and different choices of φ may yield different multiresolution analyses. So it is not a one-to-one relationship.

Remark 3.6. Although the definition requires the translates of φ to be orthonormal, a simple basis of translates of φ will suffice. We can then use φ to obtain a new scaling function for which {φ(x − k), k ∈ ℤ} is orthonormal. This is a common procedure in linear algebra and it will have consequences further on.

This definition is often presented together with the central equation in multiresolution analysis, the scaling relation.

Theorem 3.7. Suppose {V_j, j ∈ ℤ} is a multiresolution analysis with scaling function φ. Then the following scaling relation holds:

φ(x) = \sum_{k \in \mathbb{Z}} h_k\, φ(2x - k), \qquad \text{where } h_k = 2 \int_{-\infty}^{\infty} φ(x)\, φ(2x - k)\,dx. \tag{3.1}

Since this is an introductory work and the proofs require extensive mathematical knowledge on the matter, we will not state them here; the interested reader can find the proof in [1]. Now, at this point, the reader may be wondering: what does any of this have to do with wavelets? Let us check out the next revealing theorem.

Theorem 3.8. Suppose {V_j, j ∈ ℤ} is a multiresolution analysis with scaling function φ. Let W_j be the span of {ψ(2^j x − k) : k ∈ ℤ}, where

ψ(x) = \sum_{k \in \mathbb{Z}} g_k\, φ(2x - k), \qquad \text{where } g_k = (-1)^k h_{1-k}. \tag{3.2}

Then W_j \subset V_{j+1} is the orthogonal complement of V_j in V_{j+1}, i.e., V_j \oplus W_j = V_{j+1}. Furthermore, {ψ_{j,k}(x) := 2^{j/2} ψ(2^j x − k), k ∈ ℤ} is an orthonormal basis for W_j.

This may be a lot of information to process, but what is really important is that, for a fixed j_0 ∈ ℤ, our space now looks like


L^2(\mathbb{R}) = V_{j_0} \oplus W_{j_0} \oplus W_{j_0+1} \oplus W_{j_0+2} \oplus \ldots

which means that φ and ψ together form a basis of our space. Moreover, the definition of ψ given by the theorem is the definition of a family of wavelets, just changing j = −m, and everything checks out. This is a very strong property: we have a basis formed by our scaling function and our wavelet. This is the reason why, sometimes, the scaling function is also called the father wavelet.

3.6 Wavelet matrix

In our case study we are dealing with images, which have a stronger property than just being discrete: they are finite. This simplifies a lot of our work. For this purpose, let us consider ℤ_N, N ∈ ℤ, and the space of N-periodic sequences (which means that we have a series of values a_0, a_1, a_2, \ldots, a_{N-1}, with a_N = a_0, a_{N+1} = a_1, \ldots). This property is also written as (R_N a)(m) = a_{m-N}. In this scenario, data can be thought of as vectors of length N with the described property. Then we can consider the space ℓ^2(ℤ_N), which can be thought of as the space of finite signals. This is where our wavelets live now.

For example, consider a signal z ∈ ℓ^2(ℤ_N), which is represented by a vector of values. Then we can perform the Fourier transform by simply computing

\hat{z}(m) = \sum_{n=0}^{N-1} z_n\, e^{-2\pi i m n / N}.

Our first concern is to know whether we can find an orthonormal basis or not and, in case we can, how to do it. Recall that, from Theorem 3.8, our wavelets form an orthonormal basis in the discrete case, so it is only natural to think that this property would be preserved.

Lemma 3.9. Let w ∈ ℓ^2(ℤ_N). Then \{R_k w\}_{k=0}^{N-1} is an orthonormal basis for ℓ^2(ℤ_N) if and only if its discrete finite Fourier transform satisfies |\hat{w}(n)| = 1 for all n ∈ ℤ_N.

We can find a proof of this in [15]. Now we have a way to find an orthonormal basis in terms of the Fourier transform, but we would like something that requires less computation. For this purpose, let us now define our wavelet basis.

Definition 3.10. Suppose N is an even integer, N = 2M for some M ∈ ℕ. An orthonormal basis for ℓ^2(ℤ_N) of the form

\{R_{2k} u\}_{k=0}^{M-1} \cup \{R_{2k} v\}_{k=0}^{M-1}

for some u, v ∈ ℓ^2(ℤ_N) is called a first stage wavelet basis for ℓ^2(ℤ_N). We call u and v the generators of the first stage wavelet basis. We sometimes also call u the father wavelet and v the mother wavelet.

This is basically saying that we have to find two vectors u, v such that the system of vectors

(u_0, u_1, \ldots, u_{N-1})
(u_{N-2}, u_{N-1}, \ldots, u_{N-3})
\vdots
(u_2, u_3, \ldots, u_1)
(v_0, v_1, \ldots, v_{N-1})
(v_{N-2}, v_{N-1}, \ldots, v_{N-3})
\vdots
(v_2, v_3, \ldots, v_1)

is an orthonormal basis for ℓ^2(ℤ_N). Considering that we have to apply a Fourier transform to prove that they are an orthonormal basis, we expect something simpler for the finite case, which is given by the following theorem.

Theorem 3.11. Suppose M ∈ ℕ and N = 2M. Let u, v ∈ ℓ^2(ℤ_N). Then

B = \{R_{2k} v\}_{k=0}^{M-1} \cup \{R_{2k} u\}_{k=0}^{M-1}

is an orthonormal basis for ℓ^2(ℤ_N) if and only if the system matrix W(n) of u and v is unitary for n = 0, \ldots, M - 1.

The proof of this theorem can also be found in [15]. So now we have completely defined our wavelet matrix in terms of an orthonormal basis and a unitary matrix. This relates to the discrete case because such vectors can be found following equations 3.1 and 3.2. Therefore, our wavelet matrix can be expressed as

W = \begin{pmatrix}
h_0 & h_1 & h_2 & \ldots & h_{N-1} \\
h_{N-2} & h_{N-1} & h_0 & \ldots & h_{N-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_0 & g_1 & g_2 & \ldots & g_{N-1} \\
g_{N-2} & g_{N-1} & g_0 & \ldots & g_{N-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots
\end{pmatrix}.

This theory has been introduced to deal with one-dimensional array data, but it can be easily extended to two-dimensional data, which is our case of interest and our next goal.

3.7 2D transform explained

Until this point, we have been working with 1D vector functions. The main point to focus on is that, in order to build a wavelet matrix, we need to find two vectors that form an orthonormal basis. What may seem problematic is that with the wavelet transform we go from 1D to 2D. What happens with images, then? Do we need to work with 4D matrices? The answer is no. Let us look at the process.

First of all, some notation. Since our matrix is completely characterized by the elements h_i and g_i coming from the two different subspaces, let us write

W = \begin{pmatrix}
h_0 & h_1 & h_2 & \ldots & h_{N-1} \\
h_{N-2} & h_{N-1} & h_0 & \ldots & h_{N-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_0 & g_1 & g_2 & \ldots & g_{N-1} \\
g_{N-2} & g_{N-1} & g_0 & \ldots & g_{N-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots
\end{pmatrix}
= \begin{pmatrix} H \\ G \end{pmatrix}.

Let us assume that A is an n × n matrix to transform, say an image, for example. How do we transform A? If we compute WA, we are simply applying the transform to each column. For an image this does not seem so bad, but it is the result of transforming only the columns and assembling a new image, and it does not match the theory that we have studied.

What should we do to process the rows of A as well? We just need to compute WAW^T. The reader may take a minute to check that the result is a square matrix. Even if W were not square, we would have (m × n) · (n × n) · (n × m) = m × m, which is mathematically relevant because the result of the process is always a square matrix.

This process, when looked at in block format, is revealing:

WAW^T = \begin{pmatrix} H \\ G \end{pmatrix} A \begin{pmatrix} H \\ G \end{pmatrix}^T
= \begin{pmatrix} HA \\ GA \end{pmatrix} \begin{pmatrix} H^T & G^T \end{pmatrix}
= \begin{pmatrix} HAH^T & HAG^T \\ GAH^T & GAG^T \end{pmatrix}
= \begin{pmatrix} LL & HL \\ LH & HH \end{pmatrix},

where the last matrix uses the usual subband notation. The original image A is transformed into 4 matrices called subbands. The matrices H and G are designed so that the LL subband is the low-resolution residual, consisting of low-frequency components, which means that it is an approximation of our original image; the subbands HL, LH and HH give horizontal, vertical and diagonal details, respectively.

3.8 Example: the Haar wavelet

The Haar wavelet is defined by its mother wavelet function ψ and its scaling function φ:

ψ(t) = \begin{cases} 1 & 0 \le t < 1/2, \\ -1 & 1/2 \le t < 1, \\ 0 & \text{otherwise} \end{cases}
\qquad
φ(t) = \begin{cases} 1 & 0 \le t < 1, \\ 0 & \text{otherwise} \end{cases}

Let us take a look at the graphs of those functions, see Figure 1.

Now we can compute the values h_i and g_i mentioned before:


Figure 1: The Haar filter is defined by (a) the Haar scaling function and (b) the Haar mother wavelet.

• h_0 = 2 \int_{-\infty}^{\infty} φ(t) φ(2t)\,dt = 2 \int_0^1 φ(t) φ(2t)\,dt = 2 \int_0^{1/2} 1\,dt = 2t \big|_0^{1/2} = 1.

• h_1 = 2 \int_{-\infty}^{\infty} φ(t) φ(2t - 1)\,dt = 2 \int_0^1 φ(t) φ(2t - 1)\,dt = 2 \int_{1/2}^1 1\,dt = 2t \big|_{1/2}^1 = 1.

• h_2 = 2 \int_{-\infty}^{\infty} φ(t) φ(2t - 2)\,dt = 2 \int_0^1 φ(t) φ(2t - 2)\,dt = 2 \int_0^1 1 \cdot 0\,dt = 0.

• It is easy to see that h_k = 0, ∀k > 1.

And for the other values it follows that:

1. g_0 = (-1)^0 h_{1-0} = 1 \cdot 1 = 1.

2. g_1 = (-1)^1 h_{1-1} = -h_0 = -1.

3. And g_k = 0, ∀k > 1.

Therefore, the final matrices are

H = \begin{pmatrix} 1 & 1 \end{pmatrix}, \qquad
G = \begin{pmatrix} 1 & -1 \end{pmatrix}, \qquad
W = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.

But remember Remark 3.6: what happens with this matrix? It is not orthonormal as we expected, since

\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}^T
= \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
= \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}
\neq \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

That is fine because, as said in the remark, we can always transform the basis into an orthonormal basis. In this case, we just have to multiply by \frac{\sqrt{2}}{2}, and then

\begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix}
\begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix}^T
= \begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix}
\begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.


Now, for a moment, let us look in a very practical way at what this matrix is doing. Let us assume, then, that we have an n × n matrix A.

• When multiplying by the H block, at every step we are computing \sqrt{2}\,\frac{x_i + x_{i+1}}{2}. Basically, we are averaging the columns of A.

• When multiplying by the G block, at every step we are computing \sqrt{2}\,\frac{x_i - x_{i+1}}{2}. So, in this case, we are computing the differences for every column.

Remark 3.12. Notice that \sqrt{2}\,\frac{x_i + x_{i+1}}{2} + \sqrt{2}\,\frac{x_i - x_{i+1}}{2} = \sqrt{2}\, x_i, which explains why we can invert the process. In some books and talks, one may be introduced to the Haar filter as the "averaging" filter, and all the theory may be developed focusing on this interesting property. Although we have decided to present the abstract theory here, one may find some references about it in this wavelet workshop [16].

So, for our block process, we had:

WAW^T = \begin{pmatrix} H \\ G \end{pmatrix} A \begin{pmatrix} H \\ G \end{pmatrix}^T
= \begin{pmatrix} HA \\ GA \end{pmatrix} \begin{pmatrix} H^T & G^T \end{pmatrix}
= \begin{pmatrix} HAH^T & HAG^T \\ GAH^T & GAG^T \end{pmatrix}.

Now, we can explicitly explain every subband:

• HAH^T averages along the columns of A and then along the rows of HA. It is clear that this will produce an approximation (blur) of A, losing the details due to the averaging.

• HAG^T averages along the columns of A and then differences along the rows of HA. This will produce the vertical differences between A and the blur of A.

• GAH^T differences along the columns of A and then averages along the rows of GA. In this case, this will produce the horizontal differences between A and the blur of A.

• And finally, GAG^T differences along columns and rows. This will produce the diagonal differences between A and its blur.

This simple example clarifies all the strong theory that is involved in the discrete wavelet transform. In the next section we will explore further, with some visual examples, the effect of various wavelets on our field of research: images.
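The whole construction above can be checked numerically in a few lines. The sketch below builds the orthonormalized Haar wavelet matrix W for N = 4 (even circular shifts of the filters h = (√2/2, √2/2) and g = (√2/2, −√2/2)), verifies that WW^T = I, and splits a toy 4 × 4 matrix into the four subbands; the matrix A is made up for illustration.

```python
import numpy as np

s = np.sqrt(2) / 2
h = np.array([s, s])     # scaling (averaging) filter
g = np.array([s, -s])    # wavelet (differencing) filter

N = 4
H = np.zeros((N // 2, N))
G = np.zeros((N // 2, N))
for k in range(N // 2):
    # Even circular shifts R_{2k}; for Haar only positions 2k, 2k+1 are nonzero.
    H[k, 2 * k:2 * k + 2] = h
    G[k, 2 * k:2 * k + 2] = g
W = np.vstack([H, G])

print(np.allclose(W @ W.T, np.eye(N)))   # True: the rows form an orthonormal basis

A = np.arange(16, dtype=float).reshape(N, N)   # toy "image"
T = W @ A @ W.T
LL, HL = T[:N // 2, :N // 2], T[:N // 2, N // 2:]
LH, HH = T[N // 2:, :N // 2], T[N // 2:, N // 2:]
print(LL)   # 2x2 approximation subband of A
```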

3.9 Visualization and effects of some wavelets

We will now prepare a bank of examples of wavelets and their different effects on images. If the reader is interested in using wavelets as a programming tool, we suggest using MATLAB, which has them integrated, or Python [17], which has multiple libraries for them. In order to compare the effects of different wavelets, we will be using the same example image: Lena, see Figure 2.
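As a small sketch of how such experiments can be reproduced in Python, the following uses PyWavelets to compute the four subbands of the 2D transform for several wavelet bases; a random array stands in for the grayscale image, and the wavelet names follow PyWavelets' own naming.

```python
import numpy as np
import pywt

img = np.random.rand(64, 64)   # stand-in for a grayscale image such as Lena

for name in ['haar', 'db2', 'coif3']:
    # Approximation plus (horizontal, vertical, diagonal) detail subbands.
    cA, (cH, cV, cD) = pywt.dwt2(img, name)
    print(name, cA.shape)
# With longer filters the subbands are slightly larger than half the image,
# due to the default boundary handling ('symmetric' mode).
```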


Figure 2: Lena, the "hello world" of computer vision.

In Figure 1, we have already seen the shape of the Haar functions. Let us check the impact on an image, which is what we are interested in.

Figure 3: (a) Haar transform result and (b) inverted Haar result. The inverted result is given just for a better appreciation of the details.

Another well-known wavelet, with extended use, is the Daubechies wavelet. Although Alfréd Haar was the first to describe the Haar wavelets, this family of wavelets is a generalization of his definition. They are characterized by a maximal number of vanishing moments for a given support, i.e., a given subset of the domain outside of which the function is zero. There are lots of papers and articles describing and analyzing them. In [1], we can find a whole chapter of properties, and in [18] we can read the discoverer herself. Again, in the workshop [16], there is a different approach to this wavelet. First, we will present the pair of functions that gives life to this wavelet.

Remark 3.13. When reading more about wavelets, one will classically see this wavelet labeled as "Daubechies D4". This number refers to its number of vanishing moments, a concept explained in Definition 3.3. Vanishing moments affect smoothness and wave length and therefore produce different effects. In our experiments, we have used Daubechies D4, which will now be presented. Another interesting fact is that Haar wavelets are also known as Daubechies D1 wavelets, reflecting the fact that the Daubechies family generalizes the Haar wavelets.
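For readers who want to inspect these functions themselves, PyWavelets can approximate the scaling function and mother wavelet of any of its built-in bases; a minimal sketch (following the convention of Remark 3.13, where the number counts vanishing moments, the wavelet used here is assumed to correspond to PyWavelets' 'db4'):

```python
import pywt

wav = pywt.Wavelet('db4')            # Daubechies wavelet with 4 vanishing moments
phi, psi, x = wav.wavefun(level=8)   # sampled scaling function, mother wavelet and grid
print(len(wav.dec_lo))               # length of the decomposition (low-pass) filter
# phi and psi can be plotted against x to reproduce figures like Figures 4 and 5.
```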

Let us check the effect of these vanishing moments in a more visual way.


Figure 4: The classic Daubechies D4 filter: (a) scaling function and (b) mother wavelet.

Figure 5: The Daubechies D6 filter: (a) scaling function and (b) mother wavelet.

Figure 4 illustrates the common Daubechies D4 filter, while Figure 5 illustrates the Daubechies D6 wavelet. Notice how, in the second figure, the function appears to be much smoother. Also notice how, for the D6 wavelet, the domain of both the scaling function and the mother wavelet is larger than in the case of the D4. We show the effect of the Daubechies D4 filter applied to Lena in Figure 6.

Figure 6: (a) Daubechies D4 transform result and (b) inverted Daubechies D4 transform result.

Our last example is the Coiflet wavelet, also designed by Ingrid Daubechies when asked for a wavelet whose scaling function also had vanishing moments. These wavelets have been used in many applications involving Calderón-Zygmund operators, in a special field of harmonic analysis. More on this can be found in [18] and [19].


Figure 7: The Coiflet D3 filter: (a) scaling function and (b) mother wavelet.

The effect on an image can be seen in the following figure.

Figure 8: (a) Coiflet D3 transform result and (b) inverted Coiflet D3 transform result.

3.10 Introduction and basic concepts of neural networks

For now, the abstract mathematical theory is completely defined, so it is time to change the subject to computer science. We will assume a basic knowledge of deep learning but, first, we will briefly review the main concepts.

Definition 3.14. Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

Definition 3.15. Deep learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.

This takes us to the core of our problem.

Definition 3.16. A neural network is a system that accepts an input signal which is processed by a set of interconnected components – called neurons – to produce an output.


Figure 9: High-level picture of how a neural network is structured: an input layer (Inputs 1 to 5), a hidden layer and an output layer producing a single output.

Those definitions establish the base on which we build the so-called convolutional neural networks. There are plenty of papers working on this matter; a perfect introductory one could be the Stanford class [21]. Figure 9 shows, in general terms, how a neural network works. The computations, even though they can be tough and obscure, come from a very simple concept: a mix of linear algebra and an optimization problem.
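That "mix of linear algebra and an optimization problem" can be made concrete with a forward pass through a tiny fully connected network like the one in Figure 9; the weights below are random placeholders, and the optimization part (training) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(5)                                    # five inputs, as in Figure 9
W1, b1 = rng.standard_normal((3, 5)), np.zeros(3)    # hidden layer with 3 neurons
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)    # single output neuron

hidden = np.maximum(0.0, W1 @ x + b1)   # linear map followed by a ReLU activation
output = W2 @ hidden + b2               # the network's single output
print(output)
```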

3.11 Convolutional neural networks

Convolutional neural networks are very similar to regular neural networks: they have an input layer, an output layer, and are made up of neurons that have learnable weights and biases. Then, why change? Neural networks have proved themselves very useful; why should we use anything different? The answer, as with many programming problems, is computational cost. Convolutional neural networks are not the solution to all problems: they just happen to outperform other network architectures in computer vision problems.

Figure 10: Comparison between neural network architectures. Left: common neural network. Right: convolutional neural network.


Take the MNIST dataset: its images are of size 28 × 28 × 1. A single fully-connected neuron in the first hidden layer of a regular neural network would have 28 · 28 · 1 = 784 weights. This does not look so bad, but MNIST is oversimplified. Take an image of CIFAR-10, which is of size 32 × 32 × 3: then the neuron would have 32 · 32 · 3 = 3072 weights. This may still seem manageable, but an image of size 200 × 200 × 3 would require 120,000 weights. Combined with the other neurons, this produces an absurdly large number of weights. Besides, the larger the number of parameters, the easier it is for the neural network to overfit.

So, the key in all of this is that neurons in a convolutional neural network are arranged in 3 dimensions: width, height and depth. At the end of the network, the full image will be transformed into a single vector of class scores.

A typical example of a general convolutional neural network architecture would involve the following steps or layers (a minimal sketch putting them together follows the list):

• Image pre-processing. Convolutional neural networks are specially designed to work on a specific kind of image. It is not the same problem to work with grayscale images, grayscale encoded with RGB or full RGB images. Additionally, one may want to resize all images to the same size, or even apply transformations so that the objects of study (say, for example, in a classification problem) are all image-centered.

• Convolutional layer. The key to success. We will explain this layer extensively later on.

• Pooling layer. A simple concept: a layer whose function is to reduce the size of our data and so reduce the computational cost. We will study this layer later, as it is the focus of our research.

• Fully connected layer. Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular neural networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. While the two previous layers are good at finding low-level features, i.e., identifying edges and such, we still lack a method to use them in order to classify the image. Here is where the fully connected layer is useful: it takes this information and, based on the response of the activation function, it is able to classify the image.
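The sketch below stacks these layer types with tf.keras into a small classifier for 28 × 28 grayscale inputs; it is only an illustrative arrangement of the layers listed above, not the exact architectures used in the experiments of Section 5.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),   # pooling layer (Section 3.13)
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),  # fully connected classifier
])
model.summary()
```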

3.12 Convolutional layers

The convolutional layer's parameters consist of a set of learnable filters. Each of them takes a small span along width and height and extends through the full depth. During the forward pass, each filter slides over the input, computing a dot product, which produces a 2D activation map. Convolutional layers have the property of local connectivity: a neuron is only connected to a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called the receptive field (or filter size).

There are three hyper-parameters that control the size of the output volume and hence the number of neurons. We need to know how these work together and how to control them. Although there are lots of libraries that encapsulate this part, one needs to compute the size of inputs and outputs in order to join layers and form an architecture. Such hyper-parameters are:

• Depth: the number of filters to apply.

• Stride: the number of pixels by which the filter slides.

• Size of zero padding: sometimes it is necessary to pad the input with zeros to avoid either shrinking the input or misplacing the filter. This happens when the input size does not work well together with the filter size and the stride.

Remark 3.17. If W is the input volume size, F is the filter size, S is the stride and P is the amount of zero-padding, the output volume size (and number of neurons) can be calculated as (W − F + 2P)/S + 1. This means that there are restrictions that must be respected. For example, if we have an input of size 7 × 7 and a filter of size 5 × 5 with no padding, then we cannot use a stride of 3, because (7 − 5)/3 + 1 = 1.66, which is not an integer. Instead, we could use, for example, S = 2, which gives (7 − 5)/2 + 1 = 2, and in general we can play around with this formula to obtain valid values.
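A small helper that applies this formula and rejects invalid configurations (the function name is ours, chosen for illustration):

```python
def conv_output_size(W, F, S, P):
    """Output size (W - F + 2P)/S + 1; raises if the configuration is invalid."""
    size = (W - F + 2 * P) / S + 1
    if size != int(size):
        raise ValueError("filter size, stride and padding are incompatible")
    return int(size)

print(conv_output_size(7, 5, 2, 0))     # 2, a valid configuration
print(conv_output_size(227, 11, 4, 0))  # 55, the AlexNet example discussed below
# conv_output_size(7, 5, 3, 0) would raise: (7 - 5)/3 + 1 = 1.66
```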

Convolutional layers use something called parameter sharing to control the number of parameters. For example, let us take the architecture from [23], as seen in Figure 11: it accepts images of size 227 × 227 × 3. On the first convolutional layer, it used neurons with F = 11, S = 4 and P = 0. This gives a spatial size of (227 − 11)/4 + 1 = 55. Together with a depth of K = 96, the output volume was 55 × 55 × 96. Each neuron was connected to a region of size 11 × 11 × 3, i.e., 11 · 11 · 3 = 363 weights plus one bias. In total, 55 · 55 · 96 neurons · 364 parameters = 105,705,600 parameters.

Figure 11: AlexNet neural network structure, image taken from [23].

The number of parameters can be dramatically reduced by making a reasonable assumption: if one feature is useful to compute at some spatial position (x, y), it will also be useful to compute at a different position (x_2, y_2). This means treating each depth plane as a whole. It makes sense: considering an edge, for example, the whole edge will be displayed over the same plane, and the same plane will show the same notable features. Those planes are called depth slices, and all the neurons in the same depth slice use the same weights and biases. So, with that example, we would now have 96 × 11 × 11 × 3 = 34,848 weights, which is much more manageable.
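The arithmetic of the two cases above, written out explicitly:

```python
# Without parameter sharing: every output neuron has its own weights and bias.
neurons = 55 * 55 * 96                 # 290,400 neurons in the output volume
per_neuron = 11 * 11 * 3 + 1           # 363 weights + 1 bias
print(neurons * per_neuron)            # 105,705,600 parameters

# With parameter sharing: one filter per depth slice (96 slices).
print(96 * 11 * 11 * 3)                # 34,848 shared weights (plus 96 biases)
```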

Remark 3.18. If all neurons in a single depth slice use the same weight vector, then the forward pass can be computed as a convolution of the neuron's weights with the input volume (hence the name convolutional layer). However, sometimes the parameter sharing assumption does not make sense, for example if the images have some specific structure that requires the network to learn different features on each side of the image. For instance, one may have a dataset composed of image-centered faces. In this context, one could expect different eye-specific or hair-specific features to be learned in different spatial locations. In this case, we talk about locally connected layers [27].

3.13 Pooling layers

Pooling layers are designed to progressively reduce the spatial size of the representation, thus reducing the amount of parameters and computation and preventing overfitting. They are also important to implement multi-scale analysis: reducing the information helps focus on the source of the activation, thus identifying the important features. Pooling methods bring tolerance to the identification of features: by working on a neighborhood, they become independent of the exact pixel where the activation happened, focusing on the area instead. Pooling is another name for subsampling. Following the paper [20], we review here some of the principal pooling methods.

3.13.1 Max pooling

Perhaps the best-known pooling method, max pooling is expressed as

a_{i,j,k} = \max_{(p,q) \in R_{i,j}} a_{p,q,k},

where R_{i,j} is a region of our input volume and a_{i,j,k} is the output activation of the k-th feature map at (i, j). So, basically, what we are doing here is defining a window of some size and taking the maximum value within this window; Figure 12 shows this effect. For a max pooling with window size 2 and stride 2, we halve the size and discard 75% of the activation map.

Figure 12: Example of performing max pooling with window size 2 and stride 2 on a 4 × 4 input, producing a 2 × 2 output.

It is a simple and easy way to perform pooling, although it has its flaws. As explained in [24], if the main features have less intensity than the details, then max pooling will erase important information from an image. In the same paper, we can find some proof that this method commonly overfits training data.
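A minimal NumPy sketch of the 2 × 2, stride-2 max pooling described above, for a single feature map with even height and width (replacing .max with .mean gives the average pooling of Section 3.13.2):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2, stride-2 max pooling of a single 2D feature map (H and W assumed even)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 activation map
print(max_pool_2x2(fm))                         # 2x2 output, keeps 25% of the values
```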


3.13.2 Average pooling

The second best-known pooling method is average pooling, which is given by

a_{i,j,k} = \frac{1}{|R_{i,j}|} \sum_{(p,q) \in R_{i,j}} a_{p,q,k},

using the same notation as before; additionally, |R_{i,j}| is the size of the pooling region. Figure 13 shows the effect of average pooling.

Figure 13: Example of performing average pooling with window size 2 and stride 2 on a 4 × 4 input, producing a 2 × 2 output.

In the previously cited paper [24], it is also explained that average pooling can dilute details of an image. This happens when the insignificant details have much lower values than our main feature.

We have seen the problems of both famous pooling methods. In order to combat these issues, researchers have created probabilistic pooling methods.

3.13.3 Probabilistic pooling methods

Let us start with mixed pooling, which is given by

a_{i,j,k} = \lambda \max_{(p,q) \in R_{i,j}} a_{p,q,k} + (1 - \lambda) \frac{1}{|R_{i,j}|} \sum_{(p,q) \in R_{i,j}} a_{p,q,k},

using the same notation as before; λ is a random value, 0 or 1, indicating which pooling method should be used.

There is not a unique way to perform this pooling method. It can be applied in different ways: to all features within a layer, mixed between features within a layer, or mixed between regions for different features within a layer. More on this can be found in [24] and [25].

Another probabilistic pooling method is stochastic pooling, which improves max pooling by randomly sampling from neighborhood regions based on the probability values of each activation [26]. These probabilities p for each region are calculated by normalizing the activations within the region:

p_{p,q} = \frac{a_{p,q}}{\sum_{(p,q) \in R_{i,j}} a_{p,q}}.


The pooled activation is sampled from a multinomial distribution based on p to pick a location l within the region, as seen in the following equation [26]:

a_{i,j,k} = a_l \quad \text{where } l \sim P(p_1, \ldots, p_{|R_{i,j}|}).

This process basically states that the higher the value, the higher the probability that it will be the chosen value (thus approximating max pooling), but any value can be chosen, since we are working with probabilities.

Probabilistic pooling methods avoid the shortcomings of max and average pooling while enjoying some of the advantages of both, but they require further operations and processing.
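A minimal NumPy sketch of stochastic pooling over 2 × 2 regions, following the description above; activations are assumed non-negative (e.g. after a ReLU), and the uniform fallback for all-zero regions is our own simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_pool_2x2(x):
    """2x2, stride-2 stochastic pooling of a single non-negative 2D feature map."""
    H, W = x.shape
    out = np.empty((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            region = x[2 * i:2 * i + 2, 2 * j:2 * j + 2].ravel()
            total = region.sum()
            p = region / total if total > 0 else np.full(4, 0.25)
            out[i, j] = rng.choice(region, p=p)   # higher activations are more likely
    return out

fm = rng.random((4, 4))   # toy non-negative activation map
print(stochastic_pool_2x2(fm))
```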

4 Wavelet pooling

Now we have reached the main point of this work: exploring wavelet pooling. In the previous sections we have studied all the theory needed to justify the use of this pooling method. Max pooling is an effective pooling method, but perhaps too simple; average pooling can produce blurring; and probabilistic methods do not completely solve the two previously presented problems. Wavelets are a strong tool in lots of image applications, such as denoising and compression. They have interesting properties, such as an image-independent transform matrix, which means that we can compute it offline and then apply the pooling method without performing further operations. We think that wavelet pooling can become a strong and solid choice as a pooling technique, since it focuses on the inherent image properties to produce a reduced output.

The algorithm used for wavelet pooling is:

1. Compute the wavelet matrix W of your chosen wavelet basis. In the reference paper [20], Haar wavelets are used.

2. Present the feature image F and perform the discrete wavelet transform WFW^T. Remember that this yields a matrix

\begin{pmatrix} LL & HL \\ LH & HH \end{pmatrix}.

3. Discard HL, LH and HH, taking into account only the approximated image LL.

4. Pass on to the next layer and repeat this process where necessary (a minimal sketch of this forward pass is given below).
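A minimal sketch of this forward pass on a single 2D feature map, using PyWavelets instead of an explicit wavelet matrix; the 'periodization' mode is our choice so that the LL subband is exactly half the input size, as a stride-2 pooling layer would produce, and framework integration and backpropagation are omitted.

```python
import numpy as np
import pywt

def wavelet_pool(feature_map, wavelet='haar'):
    """One-level 2D DWT; keep the LL (approximation) subband, discard the details."""
    cA, _details = pywt.dwt2(feature_map, wavelet, mode='periodization')
    return cA   # in a network, this is applied channel by channel

fm = np.random.rand(8, 8)          # toy feature map
print(wavelet_pool(fm).shape)      # (4, 4)
```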

Some heavy testing is necessary here, and the advantages and disadvantages of this method will be discussed once the experiments are presented in the next sections.

4.1 Multiwavelet pooling

Upon reviewing this pooling method, we thought it would be interesting to try an idea similar to convolutions. Since every wavelet basis identifies some particular detail and performs differently, it was interesting to see how we could combine different bases in the same pooling. In a sense, we were creating a "bank" of pooling techniques in the same way that we create banks of filters with different convolutions. This could help our network identify features based on the different effects that wavelets have on them while, at the same time, still performing the pooling operation. In this case, the algorithm is as follows.

1. Choose two different wavelet bases and compute their associated matrices W1 and W2.

2. Present the feature image F and perform, in parallel, the two associated discrete wavelet transforms \(W_1 F W_1^T\) and \(W_2 F W_2^T\).

3. Discard HL, LH, HH from each transform, only taking into account the approximated images \(LL_1\) and \(LL_2\) given by the two different bases.

4. Concatenate the two results and pass them on to the next layer (a sketch is given below).
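Below is a minimal sketch of the multiwavelet idea using the PyWavelets library [17]. The choice of 'db2' as a stand-in for the Daubechie basis, the periodization padding mode, the channel-wise loop and the concatenation along the channel axis are illustrative assumptions, not the exact implementation used in our experiments.

\begin{verbatim}
import numpy as np
import pywt  # PyWavelets [17]

def multiwavelet_pool(F, bases=("haar", "db2")):
    """Multiwavelet pooling of a feature map F of shape (H, W, C)."""
    pooled = []
    for name in bases:
        # 2D DWT per channel; keep only the approximation (LL) coefficients.
        LL = np.stack(
            [pywt.dwt2(F[:, :, c], name, mode="periodization")[0]
             for c in range(F.shape[-1])],
            axis=-1,
        )
        pooled.append(LL)
    # Concatenating the two pooled maps doubles the number of channels.
    return np.concatenate(pooled, axis=-1)

# Example: a 32x32 map with 16 channels becomes a 16x16 map with 32 channels.
F = np.random.rand(32, 32, 16)
print(multiwavelet_pool(F).shape)  # (16, 16, 32)
\end{verbatim}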


5 Validation

In this section, we discuss the three datasets used for validation, the validation measures, the tests and the comparison of the performance of the different pooling layers.

5.1 Dataset

The first dataset used was MNIST, as seen in Figure 14.

Figure 14: Example of MNIST dataset.

This dataset is composed of handwritten digit images, all grayscale and of size 28 × 28. We use the full training set of 60,000 images and the full testing set of 10,000 images.

Our second dataset was CIFAR-10, as seen in Figure 15.

Figure 15: Example of CIFAR-10

This dataset is composed of RGB images of size 32 × 32 from ten different categories: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. The full training set of 50,000 images is used, as well as the full testing set of 10,000 images.

Our third and last dataset is Street View House Numbers, or SVHN for short, as seen in Figure 16.


Figure 16: Example of SVHN

For this one, we used the "cropped digits" version, all RGB images of size 32 × 32. We use the full training set of 55,000 images and the full testing set of 26,032 images.
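For reference, the first two datasets can be loaded directly through the Keras dataset helpers; the sketch below assumes tf.keras, and the SVHN cropped-digits files, which are distributed as .mat files, have to be loaded separately.

\begin{verbatim}
from tensorflow.keras.datasets import mnist, cifar10

# MNIST: 60,000 training and 10,000 test grayscale images of size 28x28.
(x_train_m, y_train_m), (x_test_m, y_test_m) = mnist.load_data()

# CIFAR-10: 50,000 training and 10,000 test RGB images of size 32x32.
(x_train_c, y_train_c), (x_test_c, y_test_c) = cifar10.load_data()

print(x_train_m.shape, x_test_m.shape)  # (60000, 28, 28) (10000, 28, 28)
print(x_train_c.shape, x_test_c.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
\end{verbatim}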

5.2 Validation methodology

To evaluate the effectiveness of each pooling method, we will use accuracy as the metric. To compare the convergence of each pooling method, we will use categorical cross-entropy as the loss function.
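In a Keras-style setup, this corresponds to compiling the model with accuracy as the metric and categorical cross-entropy as the loss. The tiny placeholder model and the plain SGD optimizer below are assumptions made only for illustration; the real architectures are the ones shown in Figures 17, 20 and 25.

\begin{verbatim}
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import SGD

# Placeholder model standing in for the CNNs of Figures 17, 20 and 25.
model = Sequential([Flatten(input_shape=(28, 28, 1)),
                    Dense(10, activation="softmax")])

model.compile(
    optimizer=SGD(learning_rate=0.01),  # MNIST learning rate (Section 5.3.1)
    loss="categorical_crossentropy",    # loss used to compare convergence
    metrics=["accuracy"],               # accuracy is the reported metric
)
\end{verbatim}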

5.3 Settings

We will now present the technical details about every neural network model.

5.3.1 MNIST

For the MNIST experiments, we used a batch size of 600, performed 20 epochs and used a learning rate of 0.01.

5.3.2 CIFAR-10

For the CIFAR-10 experiments, we used a batch size of 500 and a dynamic learning rate.

For the case without dropout, we performed 45 epochs and the learning rate changes as follows: 0.05 for epochs between 1 and 30; 0.005 for epochs between 30 and 40; and 0.0005 for epochs between 40 and 45.

For the case with dropout, we used two dropout layers with rate 0.01, performed 75 epochs, and the learning rate changed as follows: 0.05 for epochs between 1 and 45; 0.005 for epochs between 40 and 60; 0.0005 for epochs between 60 and 70; and finally 0.00005 for epochs between 70 and 75.
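One way to implement such a piecewise-constant schedule is with a Keras LearningRateScheduler callback. The sketch below encodes the schedule of the CIFAR-10 case without dropout; the callback-based implementation is an assumption for illustration, not necessarily how our experiments were run.

\begin{verbatim}
from tensorflow.keras.callbacks import LearningRateScheduler

def cifar10_schedule(epoch, lr):
    """Piecewise-constant learning rate for CIFAR-10 without dropout.

    Keras epochs are 0-indexed, so epoch 0 corresponds to epoch 1 in the text.
    """
    if epoch < 30:
        return 0.05
    if epoch < 40:
        return 0.005
    return 0.0005

lr_callback = LearningRateScheduler(cifar10_schedule, verbose=1)
# model.fit(x_train, y_train, batch_size=500, epochs=45, callbacks=[lr_callback])
\end{verbatim}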


5.3.3 SVHN

For the SVHN experiments, we used a batch size of 700, performed 45 epochs without dropout, and used the same dynamic learning rate schedule as in the CIFAR-10 case without dropout.

5.4 Experiments

5.4.1 MNIST

We conducted a set of experiments on MNIST, following the architecture described in [20], which can be seen in Figure 17. We tested all the described pooling methods (max, average, wavelet and multiwavelet).

Figure 17: CNN block diagram structure for MNIST, image from [20].

               Haar    Average   Max
Accuracy (%)   98.15   98.36     98.93

Table 1: Comparison of Haar, Average and Max pooling methods.

               Haar    Daubechie   Coiflet   Biorthogonal
Accuracy (%)   98.15   98.29       98.08     97.96

Table 2: Comparison between the choice of different wavelet bases.

In this case, Tables 1 and 2 show that although our proposed method only achieves the third highest accuracy, it is very consistent regardless of the choice of wavelet basis. What is more, the difference between the results is very small, so our method could be a safe choice for this dataset. In Figure 18, we can observe that average and max pooling converge faster than our proposed algorithm; among the wavelet bases, Haar and Daubechie are the ones that converge fastest. No further experiments were conducted with this dataset because of its simplicity. In Figure 19, we present a selection of the predictions for each pooling method. For every result, the first row shows correct predictions while the second one shows wrong predictions.


Figure 18: MNIST architecture loss function.

Figure 19: Predictions for each pooling method: (a) Haar, (b) Daubechie, (c) Coiflet, (d) Biorthogonal, (e) Average, (f) Max.


5.4.2 CIFAR-10

We conducted several experiments on CIFAR-10. We tried all pooling methods (max, average, wavelet and multiwavelet). We used two different architectures: one with more epochs and with dropout, and one with fewer epochs and without dropout. Additionally, we used a dynamic learning rate in both cases. In Figure 20, we can see the CNN architecture.

Figure 20: CNN block diagram structure for CIFAR-10, image from [20].

For the case without dropout, Tables 3, 4 and 5 show that our proposed method outperforms max pooling and is consistently the second best method. Although average pooling is the one with the highest accuracy, all wavelet variants (different bases or combinations of wavelets) come a close second. Additionally, the Daubechie basis performs almost as well as the average method.

               Haar    Average   Max
Accuracy (%)   74.49   76.40     73.35

Table 3: Comparison for CIFAR-10 without dropout.

               Haar    Daubechie   Coiflet   Biorthogonal
Accuracy (%)   74.49   76.12       74.94     74.82

Table 4: Different choice of wavelets for CIFAR-10 without dropout.

               Haar + Daubechie   Haar + Coiflet   Daubechie + Coiflet
Accuracy (%)   75.39              75.43            75.13

Table 5: Combinations of multiwavelet for CIFAR-10 without dropout.

In Figure 22, we can see a selection of images from the three best-performing configurations: average pooling, Daubechie wavelet pooling and multiwavelet Haar and Coiflet pooling. For every image, the first row shows correct predictions and the second row, wrong predictions.

In Figure 21, it becomes clear that every wavelet pooling method converges much faster than max pooling or average pooling. Additionally, the wavelet techniques present a smoother descent than the other two pooling methods.


Figure 21: CIFAR-10 without dropout loss function.

Figure 22: Predictions for each pooling method: (a) Average, (b) Daubechie, (c) Multiwavelet Haar and Coiflet.

For the case with dropout, Tables 6, 7 and 8 show that the multiwavelet method outperforms both max pooling and average pooling. In this case, the Daubechie basis again comes a close second to the method with the highest accuracy, outperforming the average one. We can see that all of the variants of our method outperform max pooling, making it a solid and safe choice for this dataset as well.


               Haar    Average   Max
Accuracy (%)   77.89   78.49     70.5

Table 6: Comparison for CIFAR-10 with dropout.

               Haar    Daubechie   Coiflet   Biorthogonal
Accuracy (%)   77.89   78.95       77.36     77.44

Table 7: Different choice of wavelets for CIFAR-10 with dropout.

               Haar + Daubechie   Haar + Coiflet   Daubechie + Coiflet
Accuracy (%)   79.33              78.8             79.19

Table 8: Combinations of multiwavelet for CIFAR-10 with dropout.

In Figure 23, we can observe that the multiwavelet pooling techniques converge much faster than average and max pooling. Haar had been, so far, the fastest-converging basis, but in this case it performs similarly to average pooling. Daubechie, which yields the third best result with this architecture, converges just as fast as the other multiwavelet layers. Notice how, at epoch 40, the change of learning rate is visible for every method except max pooling. At epoch 60, there is another change of learning rate, but it is not as noticeable as the previous one, since all methods have almost completely converged.

Figure 23: CIFAR-10 with dropout loss function.

In Figure 24, we can see a selection of predicted images for the three best performances: multiwavelet Haar and Daubechie, multiwavelet Daubechie and Coiflet, and the Daubechie wavelet. For the three images, the first row shows correct predictions and the second row, wrong predictions.


Figure 24: Predictions for each pooling method: (a) Multiwavelet Haar and Daubechie, (b) Multiwavelet Daubechie and Coiflet, (c) Daubechie.

Results from the case with dropout are much more interesting since, nowadays, seeing an architecture without dropout is very rare. We therefore consider the case without dropout as an academic exercise and as a means to test each method's tolerance to overfitting. It is the case with dropout that hints at the true potential of every method: and, in this case, wavelet and multiwavelet pooling outperform both max and average pooling.

5.4.3 SVHN

We have conducted the same experiments on SVHN as we did on CIFAR-10, but focusing only on the case without dropout due to limited computational resources. In Figure 25 we can see the CNN architecture.

Figure 25: CNN block diagram structure for SVHN, image from [20].

               Haar    Average   Max
Accuracy (%)   89.23   88.27     88.51

Table 9: Comparison for SVHN without dropout.


               Haar    Daubechie   Coiflet   Biorthogonal
Accuracy (%)   89.23   88.27       89.20     89.42

Table 10: Different choice of wavelets for SVHN without dropout.

               Haar + Daubechie   Haar + Coiflet   Daubechie + Coiflet
Accuracy (%)   91.01              90.61            90.88

Table 11: Combinations of multiwavelet for SVHN without dropout.

In this case, Tables 9, 10 and 11 show that our proposed method, in any of its variants, outperforms both max and average pooling. Again, the multiwavelet method using the Haar and Daubechie bases is the one with the highest accuracy, and the other multiwavelet choices come a close second. All wavelet choices outperform max and average pooling, and in this case Haar proves to be more consistent.

In Figure 26, we can see that max pooling performs really well on this dataset, being one of the fastest-converging methods. However, again, the three multiwavelet combinations converge just as fast as max pooling. Any other choice of a single wavelet basis converges as fast as average pooling.

Figure 26: SVHN architecture loss function.

In Figure 27 we can see a selection of images predicted by the two highest-accuracy networks. For every set of images, the first row shows correct predictions and the second row, wrong predictions. The result in the first case is quite good: for the multiwavelet with Haar and Daubechie bases, the first three mispredicted photos could actually be what the neural network says they should be. The fifth, sixth and seventh ones could also be interpreted the way the neural network does, although it is not as clear as for the first pictures. The last set of images shows that the second network is "obsessed" with the number nine.


Figure 27: Predictions for different pooling methods: (a) Multiwavelet Haar and Daubechie, (b) Multiwavelet Daubechie and Coiflet.

5.5 Comparison

In Table 12 we can see the ranking of pooling methods, ordered by highest accuracy achieved. In general, all of our pooling methods prove to be safe choices for these datasets. The lower performance on MNIST can be explained by the simplicity of the dataset. In all other cases, our methods either outperform max and average pooling or come second to one of them by a small margin. The Haar basis gives very consistent results, while the other wavelet basis choices (Daubechie, Coiflet, Biorthogonal) are more sensitive to changes. Multiwavelet combinations require much more computational time but, in general, yield better results and outperform the other methods.

         MNIST       CIFAR-10 no dropout   CIFAR-10 dropout      SVHN
First    Max         Average               Haar + Daubechie      Haar + Daubechie
Second   Average     Daubechie             Daubechie + Coiflet   Daubechie + Coiflet
Third    Daubechie   Haar + Coiflet        Daubechie             Haar + Coiflet

Table 12: Ranking of pooling methods.

In terms of convergence, we have seen that the multiwavelet combinations are the fastest-converging methods, outperforming both max and average pooling. The Haar basis proves to be very consistent as well: it is not the best method in terms of convergence, but it is never the worst. The Daubechie method also converges very fast, sometimes at the same level as the multiwavelet combinations.

In summary, any of our pooling methods can be used as a standard choice at the same level as max and average pooling are used nowadays.


6 Conclusions and future lines

Wavelet pooling has proven to be a solid pooling method. The different choices of basis offer a wide variety of possibilities, which means that this method is very versatile and can be adapted to many datasets. When trying new things, one may choose max pooling or average pooling because they are standard methods that work decently in most cases; however, all of our variants of wavelet pooling have proven to be strong candidates for new architectures. They offer a safe choice: not always outperforming, but consistent across different datasets and architectures. What is more, in some cases, they are the best option.

Wavelet pooling takes longer to perform because it is based on matrix products. For average pooling with window size 2 × 2, we compute 4 operations per window (additions and a division); for wavelet pooling, we have to perform on the order of n² operations, where n is the size of our feature image. This means that the running time of max pooling and average pooling depends exclusively on the window size, whereas for wavelet pooling it depends on the image size. What is more, with max and average pooling we can freely change the window size (together with the stride) to better control the output dimension. In the wavelet case, we can only perform reductions by factors of 2 (this means that first we reduce to a half, then to a quarter, and so on). At every step, we need to perform a matrix multiplication, and so the cost rises. For multiwavelet pooling, this cost is doubled, as we are performing two wavelet pooling operations in parallel and then concatenating them. However, every wavelet method has proved to converge at least as fast as, and often faster than, average and max pooling. This means that the same result could be achieved with fewer iterations, thus saving computational time.

In conclusion, wavelet pooling and multiwavelet pooling have proven to be pooling methods capable of competing with their counterparts, max and average pooling. Due to the good results of this work, we will try to publish it at the Women in Computer Vision workshop at ECCV.

Finally, some insights into what could come next:

1. Due to limited computational resources, the range of datasets is reduced. It would be interesting to further test the consistency of our pooling methods on bigger datasets, such as CIFAR-100 and CALTECH-101/256.

2. The natural continuation of wavelets is shearlets. Shearlets are mathematical functions that were first introduced in 2006 for the analysis and sparse approximation of functions [28]. Their implementation is very similar to that of wavelets, but they are based on parabolic scaling matrices and shear matrices. This makes them a useful tool for studying anisotropic features. It would be interesting to study the possibility of introducing shearlets as a pooling method, combining the pooling capacity with feature detection.

3. This work has been very focused on the importance of pooling and dimension reduction, but always working with powers of 2. T. Lindeberg introduced the theory of scale space in 1994 [29]. The notion of scale selection refers to methods for estimating characteristic scales in images and automatically determining locally appropriate scales, so as to adapt subsequent processing to the local image structure and compute scale-invariant image features and descriptors [30]. It would be interesting to preprocess our datasets to obtain a statistic that decides which reduction size would perform better for our data.


References

[1] Albert Boggess and Francis J. Narcowich. A First Course in Wavelets with Fourier Analysis, 2nd ed. Wiley, 2009, Hoboken, NJ.

[2] James S. Walker. Wavelet-based Image Compression, University of Wisconsin-Eau Claire.

[3] Wavelets for Partial Differential Equations. In: Numerical Solutions of Partial Differential Equations. Advanced Courses in Mathematics - CRM Barcelona. Birkhäuser Basel (2009).

[4] Melani I. Plett. Transient Detection With Cross Wavelet Transforms and Wavelet Coherence, IEEE Transactions on Signal Processing, Vol. 55, Issue 5, May 2007.

[5] Tang, Y. Front. Comput. Sci. China (2008) 2: 268. https://doi.org/10.1007/s11704-008-0012-0

[6] Guan-Chen Pan. A Tutorial of Wavelet for Pattern Recognition, National Taiwan University, Taipei, Taiwan, ROC.

[7] S. Livens, P. Scheunders, G. van de Wouwer, D. Van Dyck. Wavelets for texture analysis, an overview, 1997 Sixth International Conference on Image Processing and Its Applications.

[8] Manojit Roy, V. Ravi Kumar, B. D. Kulkarni, John Sanderson, Martin Rhodes, Michel vander Stappen. Simple denoising algorithm using wavelet transform, November 8, 2013.

[9] Mikko Ranta. Wavelet Multiresolution Analysis of Financial Time Series, Vaasan Yliopisto, April 2010.

[10] Jonathan N. Bradley, Christopher M. Brislaw, Tom Hopper. The FBI Wavelet/Scalar Quantization Standard for grayscale fingerprint image compression.

[11] Raghuram Rangarajan, Ramji Venkataramanan, Siddharth Shah. Image Denoising Using Wavelets.

[12] Sachin D. Ruikar, Dharmpal D. Doye. Wavelet Based Image Denoising Technique. (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 3, March 2011.

[13] David H. Hubel, Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1359523/pdf/jphysiol01247-0121.pdf

[14] David H. Hubel, Torsten N. Wiesel. Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. https://www.physiology.org/doi/pdf/10.1152/jn.1965.28.2.229

[15] Michael W. Frazier. An Introduction to Wavelets Through Linear Algebra [electronic resource]. New York, Springer, 1999.


[16] Patrick J. Van Fleet. Wavelet Workshop. University of St. Thomas, June 7-10, 2006. http://personal.stthomas.edu/pjvanfleet/wavelets/ust2006workshop.html

[17] Python PyWavelets library. https://pywavelets.readthedocs.io/en/latest/

[18] Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992, Philadelphia, PA.

[19] G. Beylkin, R. Coifman, and V. Rokhlin (1991), Fast wavelet transforms and numerical algorithms, Comm. Pure Appl. Math.

[20] Travis Williams, Robert Li. Wavelet pooling for convolutional neural networks, ICLR 2018. https://openreview.net/pdf?id=rkhlb8lCZ

[21] Stanford CS231n class, Module 1: Neural Networks, http://cs231n.github.io/neural-networks-1/

[22] Stanford CS231n class, Module 2: Convolutional Neural Networks, http://cs231n.github.io/convolutional-networks/

[23] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

[24] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[25] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 464-472, 2016.

[26] Matthew Zeiler and Robert Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.

[27] Yu Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, C. Parada. Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition, INTERSPEECH 2015.

[28] Guo, Kanghui, Gitta Kutyniok, and Demetrio Labate. "Sparse multidimensional representations using anisotropic dilation and shear operators." Wavelets and Splines (Athens, GA, 2005), G. Chen and M. J. Lai, eds., Nashboro Press, Nashville, TN (2006): 189-201.

[29] T. Lindeberg. "Scale-space theory: A basic tool for analysing structures at different scales", Journal of Applied Statistics 21(2): 224-270, (1994).

[30] T. Lindeberg. "Scale selection", Computer Vision: A Reference Guide, (K. Ikeuchi, ed.), pages 701-713, (2014).
