

Page 1: Visualizing and Understanding Stochastic Depth … · Visualizing and Understanding Stochastic Depth Networks Russell Kaplan, Raphael Palefsky-Smith, Liu Jiang Stanford University

Visualizing and Understanding Stochastic Depth Networks

Russell Kaplan, Raphael Palefsky-Smith, Liu Jiang

Stanford University

450 Serra Mall, Stanford, CA 94305

{rjkaplan, rpalefsk, liujiang}@stanford.edu

Abstract

In this paper, we understand, analyze, and visualize Stochastic Depth Networks, an architecture introduced in March of 2016. Stochastic Depth Networks have enjoyed interest as a result of their significant reduction in training time while beating the then state of the art in accuracy. However, while Stochastic Depth Networks have delivered exceptional results, no academic paper has sought to understand the source of their performance or their limitations. In providing an analysis of Stochastic Depth Networks’ representations, error types, and strengths and weaknesses, we conduct seven experiments: t-SNE on layer activations, weight activations, maximally activated images, guided backpropagation, dead neuron counting, robustness to input noise, and linear classifier probes. By specifically comparing and contrasting Stochastic Depth Networks with Fixed Depth Networks (standard residual networks), we discover that Stochastic Depth Networks have a faster training time, a lower test error, similar clustering of data, and more strongly differentiated weight activations.

1. Introduction

Stochastic Depth Networks have demonstrated an impressive ability to train extremely deep neural networks. Inspired by Dropout, Stochastic Depth Networks are essentially ResNets with one small tweak: they randomly drop some of the layers at training time and replace them with the identity function [6]. Stochastic Depth Networks have been shown to reduce training time and lower generalization error, and they can train extremely deep networks. The dropping of layers also helps with gradient flow and serves as a regularizer by effectively training a random ensemble of networks that are then averaged at test time. Previous experiments support the regularization hypothesis, but many questions remain about why Stochastic Nets perform so well. We explore the inner workings of Stochastic Depth Networks through a series of seven experiments.

Figure 1. Layer Dropout: the third and fifth “blocks” are replaced with an identity function. (Huang et al.)

2. Related Work

The crux of our work involves analyzing deep networks with stochastic depth, the architecture of which is introduced by Huang et al. [6]. To address vanishing gradients and diminished forward flow, both of which are problems associated with training deep convolutional networks with hundreds of layers, Huang et al. propose a training procedure called stochastic depth that enables the seemingly contradictory setup of training short networks and using deep networks at test time [6]. Huang et al. begin with deep networks but then, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. The identity connections are preserved when a layer is dropped, so that the output of the previous layer is fed directly into the next layer in the stack. This approach is complementary to the recent success of residual networks and reduces training time while improving the test error.
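
The per-mini-batch layer dropping described above can be sketched as follows. This is a minimal illustration, not the authors' code: `residual_block_forward` and `residual_fn` are hypothetical names, and the test-time scaling of the residual branch by the survival probability follows the description in Huang et al. rather than anything stated in this section.

```python
import numpy as np

def residual_block_forward(x, residual_fn, survival_prob, training, rng):
    """One residual block with stochastic depth (illustrative sketch).

    x             -- input activations (assumed non-negative, i.e. post-ReLU)
    residual_fn   -- hypothetical callable computing the block's residual branch
    survival_prob -- probability that the block is kept for this mini-batch
    """
    if training:
        # One draw per mini-batch: the whole block is either active or bypassed.
        if rng.random() < survival_prob:
            return np.maximum(x + residual_fn(x), 0.0)
        return x  # block dropped: only the identity connection remains
    # Test time: every block is active, with its residual branch scaled by
    # the survival probability (assumption taken from Huang et al.).
    return np.maximum(x + survival_prob * residual_fn(x), 0.0)
```

Because a dropped block returns its input unchanged, neither its forward nor its backward computation has to run for that mini-batch, which is where the training-time saving comes from.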

There are many unexplored facets of Stochastic Depth Networks. Huang et al. only experiment with architectures that use residual connections, in order to make benchmarking against prior work easy and to isolate the benefit obtained from stochastic depth [5][6]. This is useful for demonstrating improved performance, but running experiments that implement stochastic depth on networks without residual connections would be more informative. As a direct follow-up to Huang et al., our work analyzes why their procedure works so well. While Huang et al. analyze their Stochastic Depth Network architecture with standard performance techniques, they do not verify that their hypotheses for the drivers behind its high performance are actually true. For example, despite the fact that their regularizer hypothesis seems legitimate, the closest step they take towards verifying their hypothesis is showing that there is less over-fitting.

While Stochastic Depth Nets have not been analyzed in great depth, other types of networks, particularly recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have been explored and visualized [7][9][15]. For example, in a spirit similar to ours, Karpathy et al. use character-level language models as an interpretable testbed to provide an analysis of RNNs’ representations, predictions, and error types [7]. Their experiments reveal the existence of interpretable cells that keep track of long-range dependencies. Their comparative analysis with finite-horizon n-gram models shows that the source of LSTM improvements is long-range structural dependencies.

Because CNNs have demonstrated impressive classification performance on the ImageNet benchmark, there have been a few pieces of related work on visualizing and understanding CNNs and diagnosing additional possible improvements to their performance. Zeiler et al. [15] introduce a visualization technique that gives insight into the function of CNNs’ intermediate feature layers and the operation of the classifier. Similarly, Simonyan et al. [9] consider two visualization techniques: one that generates an image that maximizes the class score and thus visualizes the notion of the class, and a second that computes a class saliency map specific to a given image and class.

Previous work has also proposed either new frameworks for simplifying the training of deep neural networks [6] or new methods for regularizing networks such as RNNs [8]. For example, He et al. reformulated the layers as learning residual functions with reference to the layer inputs, as opposed to learning unreferenced functions [4]. Based on evaluation of residual nets of up to 152 layers on the ImageNet dataset, He et al. provide evidence that their residual networks are easier to optimize and can gain accuracy as depth increases [4]. Another paper by He et al. analyzes the propagation formulations behind residual building blocks and proposes a new residual unit that makes training easier and improves generalization [5]. A series of experiments support the importance of identity mappings, which can be used as the skip connections and after-addition activation [5]. Interestingly enough, Huang et al.’s architecture is essentially an exact copy of He et al.’s ResNet architecture, just with some stochastic layer dropping [5][6]. In that sense, our work primarily builds off of He et al.’s and Huang et al.’s papers.

As another example, Krueger et al. propose a new method for regularizing RNNs known as zoneout [8]. Zoneout is a per-unit version of stochastic depth [5]. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout improves generalization by using random noise. However, by preserving rather than dropping hidden units, gradient and state information is more easily propagated through time. Krueger et al.’s empirical investigation of various RNN regularizers shows that zoneout yields significant performance improvements across tasks [8]. It is important to note that Krueger et al.’s work extends the stochastic depth method to RNNs and networks with hidden state [8].

Some more recent work abstracts away from specific neural network types and attempts to avoid the overarching issue of training a new model for every individual problem. For example, Zamir et al. train a model to learn fundamental vision tasks [15]. They employ a method to learn a generic 3D representation that generalizes to unseen 3D tasks, with human-level performance on the supervised task and without any need for fine-tuning [15]. The learned representation shows traits of abstraction abilities [15]. Zamir et al. developed independent semantic and 3D representations, but integrating them is a future direction of research that we similarly hope to undertake.

3. Model Training

Rather than using a pre-trained model, we opted to train our own models. We used the official code from the Huang et al. paper [4]. We train two separate 110-layer ResNets: one with stochasticity of 0.5 on a linear decay schedule, and one with no stochasticity. The difference between the two networks lies in the layer death rate. The first network (Fixed Depth) is a conventional ResNet: every layer is trained, and there is no layer death. The second network, the Stochastic Depth Network [4], has a final death rate of 0.5 under a linear decay schedule: the survival probability decays linearly across ResBlock layers, so the later layers have a higher probability of being dropped for any particular mini-batch. See Huang et al. for details.
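
As a rough illustration of the linear decay schedule, the sketch below computes a survival probability for each residual block. The function name is hypothetical, and the block count of 54 is an assumption (the usual count for a 110-layer CIFAR ResNet); this section does not state it explicitly.

```python
def survival_probabilities(num_blocks, final_death_rate=0.5):
    """Survival probability of each residual block under linear decay.

    The first block almost always survives; the last block survives with
    probability 1 - final_death_rate.
    """
    return [1.0 - (l / num_blocks) * final_death_rate
            for l in range(1, num_blocks + 1)]

# Assumed: 54 residual blocks in the 110-layer network.
probs = survival_probabilities(54)
expected_active_blocks = sum(probs)  # about 40.25 of 54 blocks per mini-batch
```

Under these assumptions roughly a quarter of the blocks are skipped in an average mini-batch, which is consistent with the shorter epoch times reported for the Stochastic Depth Network.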

The networks are trained on CIFAR-10 with standard data augmentation. There are 45,000 training images, a 5,000-image validation set, and a 10,000-image test set. The mini-batch size is 128 with 18 residual modules, and we train on an Amazon GX2-large EC2 instance with an NVIDIA K80 GPU.

We train for 500 epochs with stochastic gradient descent, using a learning rate of 0.1 that decays to 0.01 and then 0.001 after 250 and 375 epochs respectively, a weight decay of 1e-4, and Nesterov momentum. As expected, the Stochastic Depth epochs were quicker because we stochastically skip the forward and backward computations for some of the ResBlocks. The Stochastic Depth Network had an average epoch time of 210 seconds, while the Fixed Depth Network had an average epoch time of 258 seconds.
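
The step schedule and the reported epoch times can be written out as a small sketch. The function name is hypothetical, and treating epochs as 0-indexed (so the first drop takes effect at epoch 250) is an assumption.

```python
def learning_rate(epoch, base_lr=0.1):
    """Step schedule: 0.1, then 0.01 from epoch 250, then 0.001 from epoch 375."""
    if epoch < 250:
        return base_lr
    if epoch < 375:
        return base_lr * 0.1
    return base_lr * 0.01

# Average epoch times reported above, in seconds.
FIXED_EPOCH_S = 258
STOCHASTIC_EPOCH_S = 210

# Relative per-epoch saving from skipping dropped blocks: 48 / 258, about 18.6%.
relative_saving = 1 - STOCHASTIC_EPOCH_S / FIXED_EPOCH_S
```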

Figure 2. Test Error vs Number of Epochs Trained

4. Methods and Technical Solution

4.1. t-SNE on Layer Activations

For our first experiment, we used t-SNE to visualize 64-dimensional CNN codes of 4096 different image inputs [8]. t-SNE is a visualization technique that embeds high-dimensional vectors into a low-dimensional space, while trying to preserve in the low-dimensional projection the relative distances between points in the high-dimensional space. In our case, we embed the 64-dimensional codes into a 2-dimensional space and plot each input image according to the projection of its corresponding code. We use a perplexity of 30 to produce t-SNE plots that show embeddings of image codes from the Stochastic and Fixed Depth Networks respectively.

We employed t-SNE to visualize the activations of the following layers: layer 36 (after the first third of the network), layer 72 (after the second third of the network), and layer 108 (after the final spatial average pooling layer that is just before the final fully-connected layer). We take the outputs of all the neurons in each of those three layers and use them as feature vectors. In the case of layer 108, the output is a 64-length vector. In the case of layers 36 and 72, which are much earlier in the net, the output is large (greater than 16,000 values). Because t-SNE cannot run efficiently on such high-dimensional vectors, we used Spatial Average Pooling, a 2D average-pooling operation over the spatial dimensions of its input, to reduce the outputs of layer 72 and layer 36 to 64 values each.
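
The pooling step that reduces a large activation map to a 64-dimensional code might look like the sketch below. The helper name, the 16 x 32 x 32 activation shape, and the 2 x 2 output grid are assumptions chosen so that the result has 64 values; the exact pooling configuration is not given in this section.

```python
import numpy as np

def spatial_average_pool(activations, out_h, out_w):
    """Average-pool a (C, H, W) activation map down to (C, out_h, out_w)."""
    c, h, w = activations.shape
    assert h % out_h == 0 and w % out_w == 0, "grid must divide the map evenly"
    # Split each spatial axis into (grid cell, offset within cell) and
    # average over the within-cell offsets.
    blocks = activations.reshape(c, out_h, h // out_h, out_w, w // out_w)
    return blocks.mean(axis=(2, 4))

# Hypothetical intermediate output: 16 channels of 32 x 32 (16,384 values).
acts = np.arange(16 * 32 * 32, dtype=np.float64).reshape(16, 32, 32)
code = spatial_average_pool(acts, 2, 2).reshape(-1)  # 16 * 2 * 2 = 64 values
```

The resulting 64-dimensional codes for all 4096 images could then be passed to any t-SNE implementation with a perplexity of 30, for example scikit-learn's `TSNE(n_components=2, perplexity=30)`.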


4.2. Weight Activations

We plot the activations of weights at different layers for a given input image. At a high level, visualizing the activations allows us to identify which neurons at different layers in the network get excited about or respond to which specific inputs.

4.3. Maximally Activated Images

We ran the entire test set through the network, examined a single neuron, and recorded its activations for each image. We then sorted the images by this activation and selected the top five. We completed this process for two different neurons across six different layers (layers 1, 21, 41, 61, 81, and 101) on both networks.
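
The sort-and-select step above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the neuron's response to each image has already been reduced to a single scalar.

```python
import numpy as np

def top_k_images(activations, k=5):
    """Indices of the k images with the largest activation, highest first.

    activations -- 1-D array with one scalar activation per test image
    """
    order = np.argsort(activations)[::-1]  # descending by activation
    return order[:k]

# Toy example with seven "images".
toy_activations = np.array([0.1, 3.0, 2.0, 5.0, 0.0, 4.0, 1.0])
top3 = top_k_images(toy_activations, k=3)  # indices 3, 5, 1
```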

4.4. Guided Backpropagation

To visualize what input is maximally exciting to specific neurons, we employ guided backpropagation, a method described by Springenberg et al. [8]. Rather than masking out values corresponding to negative entries of only the bottom data (backpropagation) or only the top gradient (deconvolutional network approach), guided backpropagation masks out the values for which at least one of these is negative [8]. In contrast to traditional backpropagation, guided backpropagation adds an additional guidance signal, thereby preventing the backward flow of negative gradients [8]. Unlike the deconvolutional network approach, guided backpropagation works well without switches and allows for visualization of intermediate as well as last layers of networks.

First, we found which input images result in the highest activations for several specific neurons early and late in the network. Next, we change the backward pass of our network so that the gradient of the layer whose neuron we wish to visualize is set to all 0s, except for the specific neuron we visualize, which is set to 1. We then modify the backward pass of the preceding ReLU layers so that they apply the guided backpropagation rule.
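
A minimal sketch of the two modifications described above, with hypothetical function names: the one-hot gradient seed for the neuron being visualized, and the guided backward rule for a ReLU, which passes a gradient only where both the forward input and the incoming gradient are positive.

```python
import numpy as np

def one_hot_seed(num_neurons, neuron_index):
    """Gradient seed: all 0s except a 1 at the neuron being visualized."""
    seed = np.zeros(num_neurons)
    seed[neuron_index] = 1.0
    return seed

def relu_backward_guided(bottom_input, top_grad):
    """Guided backpropagation through a ReLU.

    bottom_input -- the ReLU's input from the forward pass
    top_grad     -- the gradient arriving from the layer above
    """
    keep = (bottom_input > 0) & (top_grad > 0)
    return np.where(keep, top_grad, 0.0)
```

Plain backpropagation would use only the `bottom_input > 0` mask, and the deconvolutional approach only the `top_grad > 0` mask; the guided rule combines the two.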

4.5. Dead Neurons

We ran the entire test set through the network. At each layer, we noted neurons that have zero activation. Because this is done after the ReLU, zero actually means that the pre-activation was less than or equal to 0. After running the entire test set through, we made note of which neurons were zero-activated for every single image. We then tallied up the number of such neurons per layer and compared the Stochastic and Fixed Depth Networks. Note that we chose to make a cumulative plot because the raw graph fluctuates between 5 and 0 at almost every layer and is unreadable. By cumulative, we mean the total number of dead neurons up to and including a given layer.
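
The counting procedure can be sketched as follows, assuming (hypothetically) that each layer's post-ReLU activations over the test set are available as an array with one row per image and one column per neuron.

```python
import numpy as np

def dead_neuron_counts(layer_activations):
    """Count neurons that never activate, per layer and cumulatively.

    layer_activations -- list of arrays shaped (num_images, num_neurons),
                         one per layer, taken after the ReLU
    """
    # A neuron is dead if its activation is <= 0 for every image.
    per_layer = [int(np.all(a <= 0, axis=0).sum()) for a in layer_activations]
    cumulative = np.cumsum(per_layer).tolist()
    return per_layer, cumulative

# Toy example: two layers, two images.
toy_layers = [
    np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]),  # neurons 0 and 2 are dead
    np.array([[0.0, 0.0], [3.0, 0.0]]),            # neuron 1 is dead
]
per_layer, cumulative = dead_neuron_counts(toy_layers)
```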

4.6. Robustness to Input Noise

Huang et al. hypothesize that stochastic depth acts as a regularizer. They cite the higher training loss but lower test error of the Stochastic Depth Network after convergence as evidence of a regularizing effect. The authors also draw a comparison to Dropout, whose regularizing benefits have been well studied, for example by Wager et al. [11][13]. We test this hypothesis by adding different types and levels of noise to image inputs and comparing the networks’ performance.

For this experiment, we add noise to images and examine how much the noise affects error [3]. If the Stochastic Depth Network has a better test error than the Fixed Depth Network when the same amount of noise is added, the regularization hypothesis would be supported. The test error is calculated across the entire test set. To understand the overall effect of noise, we ran every image through the network, giving each image a different but equally strong sample of noise, and then recorded the overall accuracy on the entire test set. Each of the noise functions had a noise parameter that we varied (i.e. the x-axis), and as the parameter increased, the image became noisier. At this point, the images have already been mean-subtracted and have a standard deviation of 1. Thus, most of the pixel values have a magnitude between 0 and 1, so adding noise of 0.5 is quite significant.

For the Gaussian (normally distributed) noise, we add to each pixel a random value with zero mean and a standard deviation equal to the x-axis value. For the uniformly distributed noise, we add to each pixel a random value drawn from the range from 0 to the x-axis value. For the Gaussian blur, we blurred the image with a Gaussian kernel using “same” convolution semantics, thereby keeping the image at its initial size. The noise parameter for the Gaussian blur is σ, which controls the size of the filter. As an example, a filter with σ = 5 will combine more pixels than a filter with σ = 1.
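
The three perturbations described above could be implemented along these lines. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the blur operates on a single-channel (H, W) image, the kernel is truncated at about 3σ, and borders are zero-padded, none of which is specified in this section.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    return image + rng.normal(0.0, sigma, size=image.shape)

def add_uniform_noise(image, high, rng):
    """Add noise drawn uniformly from [0, high) to each pixel."""
    return image + rng.uniform(0.0, high, size=image.shape)

def gaussian_blur(image, sigma):
    """Blur an (H, W) image with a separable Gaussian kernel (sigma > 0).

    Uses 'same' convolution so the output keeps the input size. The kernel
    radius is capped so the kernel is never longer than an image side.
    """
    radius = max(1, min(int(3 * sigma), (min(image.shape) - 1) // 2))
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()

    def blur_1d(v):
        return np.convolve(v, kernel, mode="same")

    rows = np.apply_along_axis(blur_1d, 1, image)  # blur along each row
    return np.apply_along_axis(blur_1d, 0, rows)   # then along each column
```

Each function takes the noise parameter that forms the x-axis of the corresponding plot; a fresh noise sample is drawn for every image by reusing the same `rng`.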

4.7. Linear Classifier Probes

Because neural network models have a reputation for being black boxes, we employ methods to better visualize and understand what is being done at each layer of a Stochastic Depth Network. One way to do this is with a technique called “linear classifier probes” [1], which essentially measures how linearly separable the activations of a particular layer are into final class labels.

These probes can only use the hidden units of a specific intermediate layer as discriminating features, and they do not affect the training phase of our models because we add them after training.
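
A linear probe of this kind amounts to a softmax classifier fitted on frozen activations. The sketch below is a toy illustration with hypothetical function names; the learning rate and step count are placeholder values for the toy data, not the settings used in our experiments.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a softmax classifier on frozen layer activations; return (W, b).

    features -- array (num_examples, num_features) of one layer's activations
    labels   -- integer class labels, shape (num_examples,)
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_error(W, b, features, labels):
    """Fraction of examples the probe misclassifies."""
    preds = np.argmax(features @ W + b, axis=1)
    return float(np.mean(preds != labels))
```

Only `W` and `b` are updated; the network that produced `features` is never touched, which is why the probes cannot influence the trained models. In the experiments, the error would be measured on held-out test activations rather than on the data used to fit the probe.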

Intermediate layers are particularly interesting. The first layers of a convolutional network for image recognition contain relatively general filters, in that they would likely continue to perform well even on a different image dataset, whereas the last layers are often specific to a dataset and have to be retrained for a different one. Thus, intermediate layers are highly relevant for pinpointing when this transition occurs and whether it is progressive or sudden.

5. Experimental Results

5.1. t-SNE on Layer Activations

As seen in Figure 3, both the Stochastic Depth Network and the Fixed Depth Network learned to cluster the data. Notice the clean separability of the final layer, and the different color distributions of the plots for layer 72. The cluster patterns in the t-SNE of layer 72, two thirds deep into the network, show that the fixed network activations are close by when low-level features like background color are close by in the input space. (On a zoomed-in version of this plot, one can clearly observe birds and planes intermixed in the same regions when their backgrounds are both a blue sky, for example, and similarly for deer and horses with grassy backgrounds.) This contrasts with the layer 72 t-SNE for the Stochastic Depth Network, where the background colors are relatively jumbled but there are more examples of images in the same class congregating closer together when they have different background patterns and colors.

One explanation of the differences in layer 72, supported by later experiments, is that the Stochastic Depth Network cares less about mastering low-level image feature extraction; it devotes more representational capacity to learning and separating higher-level features. We can begin to observe this in this t-SNE plot: by layer 72, the Stochastic Depth Network no longer cares to cluster by background color, but rather it has begun to cluster by higher-level semantic significance. (Note that this does not mean the data are more linearly separable into classes with layer 72 activations in the stochastic variant; as our later linear probe experiments show, the opposite is actually the case. But we can see that the high-level features are given more representational weight at this layer, even if those representations aren’t yet class-separating.)

Figure 3. t-SNE plots for the activations at layers 36, 72, and 113 (the final spatial average pooling layer) of 4096 test set images.


5.2. Weight Activations

We observe that across all inputs we tried visualizing, late-layer weight activations are more strongly differentiated between neurons in the Stochastic Depth Network than in the Fixed Depth Network. In other words, in the same late-stage layer, the activations of the Fixed Depth Network are more diffuse across filters (i.e. no one filter is activated as strongly, and more filters activate weakly) than in the corresponding layer of the Stochastic Depth Network. For a clarifying illustration of this result, see Figure 4.

The distribution and strength of weight activations might indicate that the Stochastic Depth Network discriminates more confidently between different classes of the input image. Another observation is that immediately after each time we double the number of filters (otherwise known as neurons or tiles), nearly half of them are often all black. Our dead neuron distribution experiment confirms this observation: there is a spike in neuron death each time we double the number of filters.

Figure 4. Weight activations at various depths of the two different networks, for the same input image. Note that the actual input image was in color.


5.3. Maximally Activated Images

By the end of the network, the neurons have learned higher-order features. Our results validate the hierarchical assumption of ConvNets. As shown in Figure 5, at Layer 1 we see very basic responses, which makes sense because it is early in the network. Specifically, Fixed Neuron 2 at Layer 1 likes red objects regardless of what they are, and Stochastic Neuron 6 at Layer 1 likes green objects regardless of background. Fixed Neuron 2 at Layer 101 is an emu neuron, Fixed Neuron 6 at Layer 101 is a car neuron, and both Stochastic Neurons at Layer 101 are horse neurons.

Figure 5. This chart displays the top-5 maximally activating images for various ReLU neurons within the Fixed-Depth and Stochastic-Depth networks. At each layer, we plot the images that maximally activate Neurons 2 and 6 (randomly selected).


5.4. Guided Backpropagation

In both the Stochastic and Fixed Depth Networks, neurons perform as expected. The early-layer neurons react strongly to color and texture, whereas the late-layer neurons react to more semantically meaningful units (e.g. the wheels and headlights on cars, the heads of birds, sticks that the birds often sit on, and so forth). These results were consistent throughout the various neurons we visualized at different layers of each network. Overall, we observed no strong difference between the Stochastic and Fixed Depth Networks through this method.

Figure 6. Guided backpropagation visualizations of the excitations of neurons in the first and last ReLU layers of both the Fixed and Stochastic depth network. Within each block, each row represents a different neuron in the layer. The 6 tiles to the right are the top 6 maximally activating images for that neuron, and the tiles to the left are the guided backpropagation visualizations of the neuron corresponding to each of those 6 image inputs. The all-black tiles for the last row in the first ReLU of the fixed network show a dead neuron: it has an activation of 0 (and thus no gradient signal) for all images in the dataset.

5.5. Dead Neurons

The plot in Figure 7 shows that stochasticity does not help with the “dead neurons” problem; in fact, the problem is actually more pronounced in the early layers. Nonetheless, the Stochastic Depth Network has relatively fewer dead neurons in later layers. One intuition for this second point is that the later layers are dropped with higher probability due to the linear decay schedule, in which the probability of survival decays linearly as we go deeper. Because the later layers in the Stochastic Depth Network are dropped frequently, it is more important for them to keep more neurons alive, since any given layer is less likely to be present.


Figure 7. This plot shows the accumulation of dead neurons in each network: i.e., how many neurons up through the layer marked on the x-axis do not activate for any input image? We note that the stochastic depth network accumulates more dead neurons earlier, but the fixed depth network gains more later. They end up with a roughly equal total number of dead neurons.

5.6. Robustness to Input Noise

As shown in Figure 8, the Stochastic Depth Network is less robust to image noise than the Fixed Depth Network for both Gaussian (normally distributed) noise and uniformly distributed noise. The Stochastic Depth Network performs slightly better for Gaussian blur perturbations, although it is questionable how meaningful these results are for σ > 3, given how much of the image is destroyed for larger σ. For examples, please view the images below the graphs of Figure 8.

The regularization hypothesis may therefore not be universally true. This is especially apparent for low-level perturbations like image noise. The Stochastic Depth Network has nearly twice the number of dead neurons as the Fixed Depth Network in the earliest layers, and those layers are responsible for the pixel-level pattern matching that image noise is most likely to interfere with. The noise results, in conjunction with the dead neurons experiment described earlier, suggest that the early layers of a Stochastic Depth Network are actually less robust than those in a Fixed Depth Network.

The fact that test-time performance is still generally better for Stochastic Depth suggests that perhaps having the most robust early layers is not that important. The main sources of remaining error on datasets like CIFAR may lie not in problems with early-layer feature activations but in later ones. This supports the general observation made by Deng et al. in [2] that CNNs can often vastly outperform humans on fine-grained pattern recognition tasks in images (e.g. distinguishing between many close breeds of dogs) but be inferior in classification when high-level features of the image are very skewed (e.g. extreme occlusions).


Figure 8. These plots display the effect of noise on test error for both Fixed-Depth and Stochastic-Depth Networks. The x-axis is the amount of noise applied to images in the test set, and the y-axis is the corresponding error on the noise-corrupted test sets. Each plot is a different type of noise (Uniformly Distributed, Normally Distributed, and Gaussian Blur), and the images below provide an example of each noise applied at various amounts to an image.

5.7. Linear Classifier Probes

In Figure 9, we plot the results of our linear probe experiments. Interestingly, the fixed network’s intermediate layer activations are generally more linearly separable into the class labels than those of the Stochastic Depth Network. The only exceptions are the earliest layer we probed, layer 18, and the last non-fully-connected layer of the network (the output of the 8 × 8 average spatial pooling layer). The activations at the last layer are expected to be more linearly separable for the Stochastic Depth Network, as this is the network that ultimately had lower test error. However, it is interesting that essentially all of its intermediary layers produce activations that are less separable into classes.

Recall that it is not the real job of intermediate layers to produce linearly separable class activations. That is only the job of the last layer of the network; the remaining layers are simply supposed to produce the most useful possible feature activations for further processing by the next layer. Here we see that in the process of doing a better job, Stochastic Depth Networks produce less separating intermediary activations. Why does that happen, and what does it suggest? One interpretation can be made by recalling what stochastic depth actually does: by randomly dropping the activations of some layers, and only letting activations flow through the skip connection when that happens, stochasticity essentially asks “more” from each of the intermediate layers: be useful to the next layer, but also be useful to the layer after it when the next layer is not present. We suspect this results in a kind of “representational hedging”: because the task demanded of each intermediate layer changes from mini-batch to mini-batch, depending on which layers are dropped, the layers do worse on any one individual task, like linear separability. Their outputs can be thought of as “blurry” representations that need to work well in multiple different contexts.

Figure 9. Test errors of linear probes trained independently at different layers. Probes are trained with no pooling (which means early-layer probes have many tens of thousands of parameters) and a learning rate of 0.0001 until convergence.


6. Conclusion

We conducted seven experiments: t-SNE on layer activations, weight activations, maximally activated images, guided backpropagation, dead neurons, robustness to input noise, and linear classifier probes. One of our overarching conclusions, which is supported by the overall test error and our dead neuron, t-SNE, and linear probe experiments, is that Stochastic Depth Nets are less tuned for low-level feature extraction but more tuned for higher-level feature differentiation. This is supported by their higher susceptibility to error after low-level noise is introduced, and by the intermediate t-SNE plots that show higher-level features being “paid attention to” earlier in the network, whereas the background color is the primary clustering factor for the fixed depth network.

The difference in robustness to noise also adds nuance to the analysis of Huang et al.'s suggestion that stochasticity acts as a regularizer. Increased regularization would normally be expected to provide greater invariance to input noise. Our interpretation is that while stochasticity still likely has a regularizing effect (as test error is lower but training loss is higher after convergence), the effect regularizes across "higher-level" features in the image, as opposed to low-level perturbations.

Overall, it seems that the representations learned by these networks are still rather similar. The performance is different, but not drastically; maximally activating neurons and guided backpropagation visualizations do not reveal major contrasts. There is nonetheless a hint that the distribution of representational power is slightly different for each network. Stochastic Depth Networks are a fascinating architectural idea, and we look forward to continued research on their utility.

7. Future Work

We see many promising avenues for future work and plan to conduct the following additional experiments, among others:

1. Performing analyses on datasets beyond CIFAR-10, including MNIST and (a subset of) ImageNet. This way, we can collect quantitative results independent of the specific dataset, thereby ensuring that our findings do not depend on the properties of CIFAR-10 in particular.

2. Evaluating more architectures, including fully-connected nets and nets without any residual connections.

3. Determining how well the representations learned with Stochastic Depth Networks can be used for transfer learning with new tasks.

4. Finally, as more techniques for neural network visualization and understanding are developed, we would like to apply these generalized techniques to Stochastic Depth Networks in particular, perhaps uncovering relationships that our analyses missed.

References

[1] G. Alain and Y. Bengio. Understanding Intermediate Layers Using Linear Classifier Probes. 2016.

[2] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

[3] S. Dodge and L. Karam. Understanding How Image Quality Affects Deep Neural Networks. 2016.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In ECCV, 2016.

[6] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep Networks with Stochastic Depth. In NIPS, 2016.

[7] A. Karpathy, J. Johnson, and F. Li. Visualizing and Understanding Recurrent Networks. In ICLR, 2016.

[8] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. Courville, and C. Pal. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. 2016.

[9] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps. In ICLR, 2014.

[10] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for Simplicity: The All Convolutional Net. In ICLR, 2015.

[11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(1): 1929-1958, 2014.

[12] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov): 2579-2605, 2008.

[13] S. Wager, S. Wang, and P. Liang. Dropout Training as Adaptive Regularization. In NIPS, 2013.


[14] A. Zamir, P. Agrawal, T. Wekel, J. Malik, and S. Savarese. Generic 3D Representations via Pose Estimation and Mapping. In ECCV, 2016.

[15] M. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV, 2014.
