Convolutional Neural Networks
Master's Computer Vision by Neuromation
Sergey Nikolenko, Alex Davydow
Harbour Space University, Barcelona, Spain
May 27, 2019
Random facts:
• on May 27, 1703, Peter the Great founded Saint Petersburg, soon to be capital of the Russian empire and still a wonderful city
• on May 27, 1931, Auguste Piccard and Paul Kipfer took off on a balloon from Augsburg and became the first human beings to enter the stratosphere, reaching a record altitude of 15,781 m
• on May 27, 1933, Walt Disney released the cartoon Three Little Pigs, with its hit song Who's Afraid of the Big Bad Wolf?
• on May 27, 1960, a military coup removed the Turkish President Celâl Bayar and the rest of the democratic government from office
• on May 27, 1977, Virgin released a Sex Pistols single God Save the Queen; the song was immediately banned on British radio but still reached #1 on the charts
Modern CNN architectures
ResNet
• Residual learning: let's train the differences (residues) between one layer and the next.
• Then the gradients will be able to flow with no obstacle.
• A function implemented by a residual unit looks like
y(𝑘) = 𝐹(x(𝑘)) + x(𝑘),
where x(𝑘) is the input vector of layer 𝑘, 𝐹(x) is the function computed by the layer, and y(𝑘) is the output of the residual layer that will then become x(𝑘+1) for the next layer.
• Now the gradient can pass through and does not vanish when 𝐹 becomes saturated:

𝜕y(𝑘)/𝜕x(𝑘) = 1 + 𝜕𝐹(x(𝑘))/𝜕x(𝑘).
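A residual unit of this form can be sketched in PyTorch; this is a minimal illustration assuming a basic two-convolution 𝐹 with batch normalization, not the exact block from the paper:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual unit: y = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm, as in the basic ResNet block
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the identity shortcut carries the gradient past F unchanged
        return self.relu(self.f(x) + x)

x = torch.randn(2, 64, 32, 32)
y = ResidualBlock(64)(x)
print(y.shape)  # same channels and spatial size as the input
```

Because 𝐹 preserves the shape of x, the addition needs no projection; deeper ResNets add a 1 × 1 convolution on the shortcut when the shapes differ.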
• This has allowed for very deep networks.
• Another similar approach – highway networks by Jürgen Schmidhuber.
• We again represent y(𝑘), the output of layer 𝑘, as a linear combination of x(𝑘) and 𝐹(x(𝑘)), but in a different way:

y(𝑘) = 𝐶(x(𝑘))x(𝑘) + 𝑇(x(𝑘))𝐹(x(𝑘)),

where 𝐶 is the carry gate and 𝑇 is the transform gate; usually it's a convex combination, 𝐶 = 1 − 𝑇.
• Practice shows that the residual connections should be as "straight" as possible.
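A highway layer with 𝐶 = 1 − 𝑇 can be sketched as follows; this is a minimal fully-connected illustration, and the negative gate bias is an assumption in the spirit of the original paper, not something taken from these slides:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Sketch of a highway layer: y = T(x) * F(x) + (1 - T(x)) * x."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)  # the transform F(x)
        self.t = nn.Linear(dim, dim)  # the transform gate T(x)
        # bias the gate towards carrying the input through early in training
        nn.init.constant_(self.t.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.t(x))  # gate values in (0, 1)
        # carry gate C = 1 - T, so this is a convex combination of F(x) and x
        return t * torch.relu(self.f(x)) + (1.0 - t) * x

h = HighwayLayer(16)
print(h(torch.randn(4, 16)).shape)  # shape is preserved, like the input
```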
ResNet: variations
Revolution of Depth (Kaiming He)
ResNeXt
• ResNeXt (Xie et al., 2016): let's replace ResNet units with "split-transform-merge" units, similar to Inception.
• The input is divided into blocks w.r.t. channels, and every block gets its own convolutions.
• The idea is similar to group convolutions, used already in AlexNet for parallelization:
• They do yield a kind of specialization in the results:
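The parameter savings from grouping can be checked directly with the `groups` argument of PyTorch's `Conv2d`; the channel count of 256 and cardinality of 32 are illustrative choices matching a typical ResNeXt configuration:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# An ordinary 3x3 convolution: every output channel sees all 256 input channels.
dense = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

# The same convolution split into 32 groups: each group of 8 output channels
# only sees its own 8 input channels, as in ResNeXt with cardinality 32.
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=32, bias=False)

print(n_params(dense))    # 256*256*3*3 = 589824
print(n_params(grouped))  # 32 * (8*8*3*3) = 18432, i.e. 32x fewer
```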
Inception v4 and Inception ResNet
• Another classic paper (Szegedy et al., 2016) introduced Inception v4 and Inception ResNet.
• Inception v4 – let's standardize everything and simplify the units. First, the "stem":
• Second, we now have three basic blocks A, B, and C:
• And special reduction blocks to reduce the dimensions:
• Inception ResNet adds residual connections to these blocks:
• There is no pooling now but there are still reduction blocks:
• As a result, the architecture has become even simpler; Inception v4 (top), Inception ResNet (bottom):
• Inception ResNet v2:
• And it works quite well:
SqueezeNet
• SqueezeNet (Iandola et al., 2017) – how to reduce the number of parameters:
  • replace 3 × 3 filters with 1 × 1;
  • reduce the number of inputs for 3 × 3 convolutions;
  • delay downsampling as late as possible to increase the size of activation maps.
• Fire module:
  • squeeze convolutional layer (1 × 1 only);
  • expand layer (1 × 1 and 3 × 3).
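A Fire module can be sketched in PyTorch as follows; the channel sizes are illustrative, borrowed from a typical early SqueezeNet stage, and this is a sketch rather than the reference implementation:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Sketch of a SqueezeNet Fire module: squeeze to s1x1 channels,
    then expand with parallel 1x1 and 3x3 convolutions."""
    def __init__(self, in_ch, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, s1x1, kernel_size=1)  # 1x1 only
        self.expand1 = nn.Conv2d(s1x1, e1x1, kernel_size=1)
        self.expand3 = nn.Conv2d(s1x1, e3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1(x)),
                          self.relu(self.expand3(x))], dim=1)

y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(y.shape)  # torch.Size([1, 128, 55, 55])
```

The squeeze layer keeps the number of inputs to the 3 × 3 convolutions small, which is exactly the second strategy above.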
• General SqueezeNet architecture:
• We get 50x fewer parameters than AlexNet. But:
MobileNet
• MobileNet (Howard et al., 2017): networks for mobile devices.
• Depthwise separable convolutions: let's decompose a convolution into a depthwise convolution (one filter for each channel) and a 1 × 1 convolution.
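This decomposition and its parameter savings can be illustrated in PyTorch; the channel counts here are arbitrary illustrative choices:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

in_ch, out_ch = 128, 256

# Standard 3x3 convolution: every output channel mixes all input channels.
standard = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)

# Depthwise separable: one 3x3 filter per input channel (groups=in_ch),
# then a 1x1 "pointwise" convolution to mix the channels.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
    nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
)

print(n_params(standard))   # 128*256*3*3 = 294912
print(n_params(separable))  # 128*3*3 + 128*256 = 33920, roughly 8.7x fewer
```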
• Then the structure of a layer will be more complex (but with fewer weights), and the overall architecture is not so deep:
• We see that we can save a lot of parameters at the price of a small decrease in quality:
Adversarial examples
• Interesting feature of neural networks: you can fool any network with a picture that is completely indistinguishable from the original to the naked eye.
• But how? Any ideas?..
• Let's do gradient descent not along the weights 𝜃 but along the input x!
• We only need to control that the new example x̂ remains similar to the original x, e.g., ‖x̂ − x‖∞ ≤ 𝜖 (or some other condition). How?
• Moreover, we can try to make x̂ stable to transformations such as rotation.
• How would we do that?
• Intriguing properties of neural networks (Szegedy et al., 2013). A very intriguing paper indeed...
• For instance, the authors analyzed the activations of neurons.
• I.e., supposedly, if we analyze the last-layer neurons, they will form a nice basis in the latent space where it is easy to find the semantics.
• Right?..
• ...not really:
• I.e., regular CNNs don't have any reasonable disentanglement; the latent space is good, but the basis is as good as random.
• The same paper introduced adversarial attacks; for AlexNet, everything on the right is an ostrich:
• Further in (Goodfellow, Shlens, Szegedy, 2014); all highlighted pictures are airplanes:
• Conclusions (Goodfellow, Shlens, Szegedy, 2014):
  • for a linear classifier it's clear what to do: for x̂ = x + z we want to shift w⊤x̂ = w⊤x + w⊤z, i.e., we take z = sign(w) and apply constraints on the norm of x̂;
  • the same can be done in any network; by taking the gradient we do a linear approximation in a neighborhood:

z = 𝜖 sign(∇x𝐿(𝜃, x, 𝑦));

  • i.e., this is not because our models are very nonlinear, it's because they are too linear;
  • the shift direction is important, not any specific point; i.e., we can even generalize an adversarial shift to different examples;
  • and we can try to regularize against it by adding the adversarial shift to the objective function:

𝐿′(𝜃, x, 𝑦) = 𝛼𝐿(𝜃, x, 𝑦) + (1 − 𝛼)𝐿(𝜃, x + 𝜖 sign(∇x𝐿(𝜃, x, 𝑦)), 𝑦).
• But that's not the end of the story either...
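The gradient-sign attack above can be sketched in PyTorch; this is a minimal illustration on a toy linear classifier, and the function name, model, and sizes are made up for the example:

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, eps):
    """One-step gradient-sign attack: x_adv = x + eps * sign(grad_x L(theta, x, y))."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()  # populates x.grad with the gradient along the input
    return (x + eps * x.grad.sign()).detach()

# Toy usage with a hypothetical linear "image" classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
y = torch.tensor([3])
x_adv = fgsm(model, x, y, eps=0.1)
print((x_adv - x).abs().max())  # perturbation bounded by eps in the infinity norm
```

By construction the perturbation satisfies ‖x̂ − x‖∞ ≤ 𝜖, which is exactly the similarity condition above.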
• Lots of different attacks:
  • DeepFool attack (Moosavi-Dezfooli et al., 2016): shift the example to the hyperplane that divides the classes, z = (𝑓(x0)/‖w‖₂²) w for a linear classifier and z𝑖 = (𝑓(x𝑖)/‖∇𝑓(x𝑖)‖₂²) ∇𝑓(x𝑖) for an arbitrary function;
  • (Carlini, Wagner, 2016): find minimal changes based on the 𝐿0, 𝐿2, and 𝐿∞ norms, still some of the best attacks;
  • one can also look not for a direction but for specific features;
  • (Papernot et al., 2016): find out which pixels are the most important and shift them;
• and a lot more, hundreds of papers already...
• There are different approaches to defense too:
  • (Bastani et al., 2016): formalized the notion of robustness to adversarial attacks and proposed methods for evaluating it;
  • (Lyu et al., 2015; Roth et al., 2018): other variations on gradient regularization;
  • (Shaham et al., 2015; Madry et al., 2017): let's train on "adversarial" examples, choosing the worst example in a neighborhood;
  • (Brendel, Bethge, 2017): the more nonzero (small) gradients we have, the worse for attacks, so we can use simple numerical instability as a regularizer;
  • DeepCloak defense (Gao et al., 2017): let's remove features that are not needed for classification;
• (Kurakin, Goodfellow, Bengio, 2016): attacks in the real world! Moreover, black-box attacks: we attack one model and test on another.
• There is an app that changes a photo adversarially:
• Even better – you can print out an adversarial example, and it still works!
• It's still unclear how realistic all this is, but quite possibly an important direction for AI security in the future.
Thank you!
Thank you for your attention!