Multilayer Networks, Regularization, and Representations
Yoav Goldberg
u.cs.biu.ac.il/~89-687/lec3.pdf


Page 1:

Multilayer Networks, Regularization, and Representations
Yoav Goldberg

Page 2:

Last Time

• Linear models cannot do XOR. Linear-Max-Linear can.

• Loss functions.

• Gradient-based training, and the (stochastic) gradient-descent algorithm.

Page 3:

Today

• From linear functions to MLPs.

• Overfitting and Regularization.

• Input Representations.

Page 4:

Linear Classifier

f(x) = W · x + b

Binary:

f(x) = w · x + b

predict with sign(w · x + b), or score with σ(w · x + b)

Multi-class:

f(x) = W · x + b

predict with argmax_i softmax(W · x + b)[i]
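As a concrete sketch of these two prediction rules (the softmax helper and the weight shapes are assumptions for the example, not part of the slide):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_binary(w, b, x):
    # score w.x + b; sign(.) for a hard decision, a sigmoid for a probability
    return np.sign(w @ x + b)

def predict_multiclass(W, b, x):
    # W holds one row of weights per class
    return int(np.argmax(softmax(W @ x + b)))
```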

Page 5:

Non-Linear Classifier

f_θ(x) = w · g(W′ · x + b′) + b

Page 6:

Non-Linear Classifier

f_θ(x) = a Multi-layer Perceptron (MLP):

NN_MLP2(x) = y
h1 = g1(x W1 + b1)
h2 = g2(h1 W2 + b2)
y = h2 W3

When applying dropout training to MLP2, we randomly set some of the values of h1 and h2 to 0 at each training round:

NN_MLP2(x) = y
h1 = g1(x W1 + b1)
m1 ∼ Bernoulli(r1)
h̃1 = m1 ⊙ h1
h2 = g2(h̃1 W2 + b2)
m2 ∼ Bernoulli(r2)
h̃2 = m2 ⊙ h2
y = h̃2 W3    (4.7)

Here, m1 and m2 are random masking vectors with the dimensions of h1 and h2 respectively, and ⊙ is the element-wise multiplication operation. The values of the elements in the masking vectors are either 0 or 1, and are drawn from a Bernoulli distribution with parameter r (usually r = 0.5). The values corresponding to zeros in the masking vectors are zeroed out, replacing the hidden layers h with the masked versions h̃ before passing them on to the next layer.

Work by Wager et al. [2013] establishes a strong connection between the dropout method and L2 regularization. Another view links dropout to model averaging and ensemble techniques.

The dropout technique is one of the key factors contributing to the very strong results of neural-network methods on image-classification tasks [Krizhevsky et al., 2012], especially when combined with ReLU activation units [Dahl et al., 2013]. The dropout technique is also effective in NLP applications of neural networks.

4.7 EMBEDDING LAYERS

As will be further discussed in chapter 8, when the input to the neural network contains symbolic categorical features (e.g. features that take one of k distinct symbols, such as
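To make eq. (4.7) concrete, here is a minimal numpy sketch of the MLP2 forward pass with dropout masks; the layer sizes, the choice of tanh for g1 and g2, and the initialization scale are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed sizes: input 100, hidden layers 50 and 30, output 10
W1, b1 = rng.normal(size=(100, 50)) * 0.1, np.zeros(50)
W2, b2 = rng.normal(size=(50, 30)) * 0.1, np.zeros(30)
W3 = rng.normal(size=(30, 10)) * 0.1

def mlp2(x, train=False, r=0.5):
    h1 = np.tanh(x @ W1 + b1)                     # h1 = g1(x W1 + b1)
    if train:
        m1 = rng.binomial(1, r, size=h1.shape)    # m1 ~ Bernoulli(r)
        h1 = m1 * h1                              # element-wise masking (eq. 4.7)
    h2 = np.tanh(h1 @ W2 + b2)                    # h2 = g2(h1 W2 + b2)
    if train:
        m2 = rng.binomial(1, r, size=h2.shape)    # m2 ~ Bernoulli(r)
        h2 = m2 * h2
    return h2 @ W3                                # y = h2 W3

y = mlp2(rng.normal(size=100), train=True)
```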


Page 8:

Common Non-linearities: Sigmoid

Sigmoid  The sigmoid activation function σ(x) = 1/(1 + e^(−x)), also called the logistic function, is an S-shaped function, transforming each value x into the range [0, 1]. The sigmoid was the canonical non-linearity for neural networks since their inception, but it is currently considered deprecated for use in the internal layers of neural networks, as the choices listed below prove to work much better empirically.

Hyperbolic tangent (tanh)  The hyperbolic tangent tanh(x) = (e^(2x) − 1)/(e^(2x) + 1) activation function is an S-shaped function, transforming the values x into the range [−1, 1].

Hard tanh  The hard-tanh activation function is an approximation of the tanh function which is faster to compute and take derivatives of:

hardtanh(x) = −1 if x < −1,  1 if x > 1,  x otherwise    (4.5)

Rectifier (ReLU)  The rectifier activation function [Glorot et al., 2011], also known as the rectified linear unit, is a very simple activation function that is easy to work with and was shown many times to produce excellent results.⁶ The ReLU unit clips each value x < 0 at 0. Despite its simplicity, it performs well for many tasks, especially when combined with the dropout regularization technique (see Section 4.6).

ReLU(x) = max(0, x) = 0 if x < 0,  x otherwise    (4.6)

As a rule of thumb, ReLU units work better than tanh, and tanh works better than sigmoid.

Figure 4.3 shows the shapes of the different activation functions, together with the shapes of their derivatives. [Figure 4.3: Activation functions (top) and their derivatives (bottom).]

4.5 LOSS FUNCTIONS

When training a neural network (more on training in chapter 5 below), much like when training a linear classifier, one defines a loss function L(ŷ, y), stating the loss of predicting ŷ when the true output is y. The training objective is then to minimize the loss across the different training examples. The loss L(ŷ, y) assigns a numerical score (a scalar) to the network's output ŷ given the true expected output y. The loss functions discussed for linear models in 2.7.1 are relevant and widely used also for neural networks. For further discussion on loss functions in the context of neural networks see [Bengio et al., 2016, LeCun and Huang, 2005, LeCun et al., 2006].

4.6 REGULARIZATION AND DROPOUT

Multi-layer networks can be large and have many parameters, making them especially prone to overfitting. Model regularization is just as important in deep neural networks as it is in linear models, and perhaps even more so. The regularizers discussed in 2.7.2, namely L2, L1, and the elastic-net, are also relevant for neural networks. In particular, L2 regularization, also called weight decay, is essential for achieving good generalization performance in many cases, and tuning the regularization strength λ is advisable.

Another effective technique for preventing neural networks from overfitting the training data is dropout training [Hinton, 2014, Hinton et al., 2012]. The dropout method is designed to prevent the network from learning to rely on specific weights. It works by randomly dropping (setting to 0) half of the neurons in the network (or in a specific layer) for each training example during stochastic-gradient training. For example, consider the multilayer perceptron with two hidden layers (MLP2):

⁶ The technical advantage of the ReLU over the sigmoid and tanh activation functions is that it does not involve expensive-to-compute operations and, more importantly, that it does not saturate. The sigmoid and tanh activations are capped at 1, and the gradients in that region of the functions are near zero, driving the entire gradient near zero. The ReLU activation does not have this problem, making it especially suitable for networks with multiple layers, which are susceptible to the vanishing-gradients problem when trained with saturating units.
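A minimal numpy sketch of the four activations above, transcribed directly from the formulas:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # range (0, 1)

def tanh(x):
    return np.tanh(x)                    # equivalently (np.exp(2*x)-1)/(np.exp(2*x)+1)

def hardtanh(x):
    return np.clip(x, -1.0, 1.0)         # -1 for x < -1, 1 for x > 1, x otherwise (eq. 4.5)

def relu(x):
    return np.maximum(0.0, x)            # max(0, x) (eq. 4.6)
```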

Page 9:

Common Non-linearities: tanh


Page 10:

Common Non-linearities: hard tanh


Page 11:

Common Non-linearities: ReLU (rectifier, rectified linear unit)


Page 12:

Which non-linearity to use?

• No good rules.

• Use sigmoid when you want an output between 0 and 1; otherwise, prefer not to use it.

• tanh and ReLU work well.

• There are also fancier ones (e.g. ELU).

Page 13:

Representation Power

• For every Borel-measurable function, there is a multi-layer perceptron with a single hidden layer that can approximate it to any desired epsilon.

Page 14:

How many layers to use? And how wide should they be?

• No hard and fast rules.

• In vision, we see that "deeper is better".

• Not always the case in text / sequences (or we do not know how to do it properly yet).

• Can think of each layer as transforming the previous layer (remember the XOR example).

• Narrower layers "compress" the information in the previous layer. Wider layers introduce redundancies.

Page 15:

Training a neural network

Page 16:

Training a neural network

2.7 TRAINING AS OPTIMIZATION

Recall that the input to a supervised learning algorithm is a training set of n training examples x1:n = x1, x2, ..., xn together with corresponding labels y1:n = y1, y2, ..., yn. Without loss of generality, we assume that the desired inputs and outputs are vectors: x1:n, y1:n.⁸

The goal of the algorithm is to return a function f() that accurately maps input examples to their desired labels, i.e. a function f() such that the predictions ŷ = f(x) over the training set are accurate. To make this more precise, we introduce the notion of a loss function, quantifying the loss suffered when predicting ŷ while the true label is y. Formally, a loss function L(ŷ, y) assigns a numerical score (a scalar) to a predicted output ŷ given the true expected output y. The loss function should be bounded from below, with the minimum attained only for cases where the prediction is correct.

The parameters of the learned function (the matrix W and the bias vector b) are then set in order to minimize the loss L over the training examples (usually, it is the sum of the losses over the different training examples that is being minimized).

Concretely, given a labeled training set (x1:n, y1:n), a per-instance loss function L, and a parameterized function f(x; Θ), we define the corpus-wide loss with respect to the parameters Θ as the average loss over all training examples:

L(Θ) = (1/n) Σ_{i=1..n} L(f(x_i; Θ), y_i)    (2.13)

In this view, the training examples are fixed, and the values of the parameters determine the loss. The goal of the training algorithm is then to set the values of the parameters Θ such that the value of L is minimized:

Θ̂ = argmin_Θ L(Θ) = argmin_Θ (1/n) Σ_{i=1..n} L(f(x_i; Θ), y_i)    (2.14)

Equation (2.14) attempts to minimize the loss at all costs, which may result in overfitting the training data. To counter that, we often pose soft restrictions on the form of the solution. This is done using a function R(Θ) taking as input the parameters and returning a scalar that reflects their "complexity", which we want to keep low. By adding R to the objective, the optimization problem needs to balance between low loss and low complexity:

Θ̂ = argmin_Θ [ (1/n) Σ_{i=1..n} L(f(x_i; Θ), y_i) + λ R(Θ) ]    (2.15)

where the first term is the loss and the second term is the regularization.

⁸ In many cases it is natural to think of the expected output as a scalar (class assignment) rather than a vector. In such cases, y is simply the corresponding one-hot vector, and argmax_i y[i] is the corresponding class assignment.
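A small sketch of eq. (2.15) as code, under an assumed squared-error loss, an assumed L2 regularizer, and a simple linear model for f (all three choices are illustrative, not prescribed by the text):

```python
import numpy as np

def squared_loss(y_hat, y):
    return np.sum((y_hat - y) ** 2)      # one illustrative choice of L

def r_l2(W):
    return np.sum(W ** 2)                # one illustrative choice of R

def objective(W, b, xs, ys, lam=0.01):
    # eq. (2.15): average per-example loss plus lambda * R(Theta)
    f = lambda x: W @ x + b              # here f is a simple linear model
    data_loss = np.mean([squared_loss(f(x), y) for x, y in zip(xs, ys)])
    return data_loss + lam * r_l2(W)
```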

Page 17:

Training a neural network

• Initialize parameters to random values.

• while not done:

• select training example (vector, label)

• calculate loss w.r.t label using current parameters

• calculate gradients of parameters for loss

• update parameters in the opposite direction of the gradient

• return parameters

Page 18:

Training a neural network

• Initialize parameters to random values.

• while not done:

• select training example (vector, label)

• calculate loss w.r.t label using current parameters

• calculate gradients of parameters for loss

• update parameters in the opposite direction of the gradient

• return parameters

<-- when do we stop training?

Page 19:

Training loss / iterations.

Page 20:
Page 21:

that the cumulative loss of f on the training examples is small. The algorithm works as follows:

Algorithm 2: Online Stochastic Gradient Descent Training

Input:
- Function f(x; Θ) parameterized with parameters Θ.
- Training set of inputs x1, ..., xn and desired outputs y1, ..., yn.
- Loss function L.

1: while stopping criteria not met do
2:   Sample a training example xi, yi
3:   Compute the loss L(f(xi; Θ), yi)
4:   g ← gradients of L(f(xi; Θ), yi) w.r.t. Θ
5:   Θ ← Θ − ηt · g
6: return Θ

The goal of the algorithm is to set the parameters Θ so as to minimize the total loss L(Θ) = Σ_{i=1..n} L(f(xi; Θ), yi) over the training set. It works by repeatedly sampling a training example and computing the gradient of the error on the example with respect to the parameters Θ (line 4); the input and expected output are assumed to be fixed, and the loss is treated as a function of the parameters Θ. The parameters Θ are then updated in the opposite direction of the gradient, scaled by a learning rate ηt (line 5). The learning rate can either be fixed throughout the training process, or decay as a function of the time step t.¹⁰ For further discussion on setting the learning rate, see Section 5.2.

Note that the error calculated in line 3 is based on a single training example, and is thus just a rough estimate of the corpus-wide loss L that we are aiming to minimize. The noise in the loss computation may result in inaccurate gradients. A common way of reducing this noise is to estimate the error and the gradients based on a sample of m examples. This gives rise to the minibatch SGD algorithm:

In lines 3-6 the algorithm estimates the gradient of the corpus loss based on the minibatch. After the loop, g contains the gradient estimate, and the parameters Θ are updated toward g. The minibatch size can vary from m = 1 to m = n. Higher values provide better estimates of the corpus-wide gradients, while smaller values allow more updates and in turn faster convergence. Besides the improved accuracy of the gradient estimates, the minibatch algorithm provides opportunities for improved training efficiency. For modest sizes of m, some computing architectures (e.g. GPUs) allow an efficient parallel implementation of the computation in lines 3-6. With a properly decreasing learning rate,

¹⁰ Learning rate decay is required in order to prove convergence of SGD.
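A sketch of Algorithm 2 in Python, assuming a 1-D parameter vector; the finite-difference gradient helper is a stand-in for line 4 (a real system would use backpropagation), and the fixed step count is a stand-in for the stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

def numerical_grad(loss_fn, theta, eps=1e-6):
    # finite-difference stand-in for line 4
    g = np.zeros_like(theta)
    for j in range(theta.size):
        d = np.zeros_like(theta)
        d[j] = eps
        g[j] = (loss_fn(theta + d) - loss_fn(theta - d)) / (2 * eps)
    return g

def sgd(loss_of, xs, ys, theta, eta=0.1, steps=1000):
    for t in range(steps):                              # line 1: stopping criterion
        i = rng.integers(len(xs))                       # line 2: sample an example
        loss_fn = lambda th: loss_of(th, xs[i], ys[i])  # line 3: per-example loss
        g = numerical_grad(loss_fn, theta)              # line 4: gradients w.r.t. theta
        theta = theta - eta * g                         # line 5: step against the gradient
    return theta                                        # line 6
```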


Page 23:
Page 24:

(training curve + test curve)

Page 25:

(training curve + test curve)

Page 26:

Overfitting

Page 27:

Dealing with Overfitting

• Early stopping.

• Regularization.

• Dropout.

• All of the above.

Page 28:

Dealing with Overfitting

• Early stopping

• stop training when performance on dev drops. (should we check loss, or actual task accuracy?)
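A sketch of this early-stopping loop; train_one_epoch and dev_accuracy are hypothetical helpers, and the patience counter is one common refinement, not part of the slide:

```python
def train_with_early_stopping(params, patience=3):
    # train_one_epoch and dev_accuracy are hypothetical helpers
    best_acc, best_params, bad_epochs = -1.0, params, 0
    while bad_epochs < patience:
        params = train_one_epoch(params)
        acc = dev_accuracy(params)       # task accuracy, not loss (see question above)
        if acc > best_acc:
            best_acc, best_params, bad_epochs = acc, params, 0
        else:
            bad_epochs += 1              # dev performance dropped
    return best_params                   # parameters from the best dev epoch
```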

Page 29:

Dealing with Overfitting

• Regularization

• Add an objective term that prefers "good" parameters.

• In practice: don't let any one parameter become too big.

Page 30:

Regularization



Page 32:

Regularization


L2 regularization

R_L2(W) = ||W||₂² = Σ_{i,j} (W[i,j])²    (2.25)

The L2 regularizer is also called a Gaussian prior or weight decay. Note that L2-regularized models are severely punished for high parameter weights, but once a value is close enough to zero, its effect becomes negligible. The model will prefer to decrease the value of one parameter with high weight by 1 than to decrease the value of ten parameters that already have relatively low weights by 0.1 each.

L1 regularization  In L1 regularization, R takes the form of the L1 norm of the parameters, trying to keep the sum of the absolute values of the parameters low:

R_L1(W) = ||W||₁ = Σ_{i,j} |W[i,j]|    (2.26)

In contrast to L2, the L1 regularizer penalizes low and high values uniformly, and has an incentive to decrease all the non-zero parameter values towards zero. It thus encourages sparse solutions: models with many parameters whose value is zero. The L1 regularizer is also called a sparse prior or lasso [Tibshirani, 1994].

Elastic-Net  The elastic-net regularization [Zou and Hastie, 2005] combines both L1 and L2 regularization:

R_elastic-net(W) = λ1 · R_L1(W) + λ2 · R_L2(W)    (2.27)

2.8 GRADIENT BASED OPTIMIZATION

In order to train the model, we need to solve the optimization problem in equation 2.24. A common solution is to use a gradient-based method. Roughly speaking, gradient-based methods work by repeatedly computing an estimate of the loss L over the training set, computing the gradients of the parameters Θ with respect to the loss estimate, and moving the parameters in the opposite direction of the gradient. The different optimization methods differ in how the error estimate is computed, and how "moving in the opposite direction of the gradient" is defined. We describe the basic algorithm, stochastic gradient descent (SGD), and then briefly mention the other approaches with pointers for further reading.

Motivating Gradient-Based Optimization  Consider the task of finding the scalar value x that minimizes a function y = f(x). The canonical approach is computing the derivative f′(x) of the function and solving for f′(x) = 0 to get the extrema
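The three regularizers of eqs. (2.25)-(2.27) in numpy; the λ values in the elastic-net helper are illustrative defaults:

```python
import numpy as np

def r_l2(W):
    return np.sum(W ** 2)                    # ||W||_2^2, eq. (2.25)

def r_l1(W):
    return np.sum(np.abs(W))                 # ||W||_1, eq. (2.26)

def r_elastic_net(W, lam1=0.5, lam2=0.5):
    return lam1 * r_l1(W) + lam2 * r_l2(W)   # eq. (2.27)
```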

Page 33:

Regularization


L2 regularization

L1 regularization


Page 34:

Regularization



Page 35:

Regularization

• L2 Regularization
  • Also called "weight decay".
  • Punishes large values.
• L1 Regularization
  • Prefers sparse solutions. Why? (Because the gradient of |w| has constant magnitude, L1 keeps pushing even small weights all the way to zero.)

Page 36:

Dropout

• At each iteration, select a random subset of "neurons" and "drop" them.
• Like training 2^n different networks.
• Prevents co-adaptation of neurons (prevents neurons from depending on each other).

Page 37:

Dropout

$$\begin{aligned}
\mathrm{NN}_{\mathrm{MLP2}}(\mathbf{x}) &= \mathbf{y} \\
\mathbf{h}_1 &= g_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1) \\
\mathbf{h}_2 &= g_2(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2) \\
\mathbf{y} &= \mathbf{h}_2\mathbf{W}_3
\end{aligned}$$

When applying dropout training to MLP2, we randomly set some of the values of $\mathbf{h}_1$ and $\mathbf{h}_2$ to 0 at each training round:

$$\begin{aligned}
\mathrm{NN}_{\mathrm{MLP2}}(\mathbf{x}) &= \mathbf{y} \\
\mathbf{h}_1 &= g_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1) \\
\mathbf{m}_1 &\sim \mathrm{Bernoulli}(r_1) \\
\tilde{\mathbf{h}}_1 &= \mathbf{m}_1 \odot \mathbf{h}_1 \\
\mathbf{h}_2 &= g_2(\tilde{\mathbf{h}}_1\mathbf{W}_2 + \mathbf{b}_2) \\
\mathbf{m}_2 &\sim \mathrm{Bernoulli}(r_2) \\
\tilde{\mathbf{h}}_2 &= \mathbf{m}_2 \odot \mathbf{h}_2 \\
\mathbf{y} &= \tilde{\mathbf{h}}_2\mathbf{W}_3
\end{aligned} \qquad (4.7)$$

Here, $\mathbf{m}_1$ and $\mathbf{m}_2$ are random masking vectors with the dimensions of $\mathbf{h}_1$ and $\mathbf{h}_2$ respectively, and $\odot$ is the element-wise multiplication operation. The values of the elements in the masking vectors are either 0 or 1, and are drawn from a Bernoulli distribution with parameter $r$ (usually $r = 0.5$). The values corresponding to zeros in the masking vectors are then zeroed out, replacing the hidden layers $\mathbf{h}$ with $\tilde{\mathbf{h}}$ before passing them on to the next layer.

Work by Wager et al. [2013] establishes a strong connection between the dropout method and L2 regularization. Another view links dropout to model averaging and ensemble techniques.

The dropout technique is one of the key factors contributing to the very strong results of neural-network methods on image classification tasks [Krizhevsky et al., 2012], especially when combined with ReLU activation units [Dahl et al., 2013]. The dropout technique is also effective in NLP applications of neural networks.
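As a concrete illustration of equation (4.7), here is a minimal NumPy sketch of dropout masking applied to the two hidden layers of MLP2. The function names are ours; tanh stands in for the unspecified non-linearities $g_1, g_2$, and no rescaling is applied, matching the plain formulation above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, r=0.5):
    # m ~ Bernoulli(r); entries of h aligned with zeros in m are dropped (eq. 4.7).
    m = rng.binomial(1, r, size=h.shape)
    return m * h

def mlp2_dropout(x, W1, b1, W2, b2, W3, g=np.tanh, r=0.5):
    # Forward pass of MLP2 with dropout applied to both hidden layers.
    h1 = dropout(g(x @ W1 + b1), r)
    h2 = dropout(g(h1 @ W2 + b2), r)
    return h2 @ W3
```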

4.7 EMBEDDING LAYERS

As will be further discussed in chapter 8, when the input to the neural network contains symbolic categorical features (e.g. features that take one of $k$ distinct symbols, such as …)


Page 38:

Initialization

• With a (log-)linear model, initialization doesn't matter much.
• With MLPs or more complex networks, initialization is crucial for achieving good performance.

Page 39:

Initialization

Xavier Glorot et al.'s suggestion:

$$\mathbf{W} \in \mathbb{R}^{n \times m} \sim \mathrm{Uniform}\left[-\epsilon, +\epsilon\right], \qquad \epsilon = \frac{\sqrt{6}}{\sqrt{m+n}}$$
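A minimal NumPy sketch of this initialization scheme (the function name is ours, not from the slide):

```python
import numpy as np

def glorot_uniform(n, m, seed=0):
    # W in R^{n x m} with entries ~ Uniform[-eps, +eps], eps = sqrt(6)/sqrt(m + n).
    eps = np.sqrt(6.0) / np.sqrt(m + n)
    return np.random.default_rng(seed).uniform(-eps, eps, size=(n, m))
```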

Page 40:

Gradient Checks

• How do we know if our gradients are correct?
• We can compute the numerical gradients (central differences):

$$\frac{\partial f(x,y,z)}{\partial x} \approx \frac{f(x+h,\,y,\,z) - f(x-h,\,y,\,z)}{2h}$$

$$\frac{\partial f(x,y,z)}{\partial y} \approx \frac{f(x,\,y+h,\,z) - f(x,\,y-h,\,z)}{2h}$$

$$\frac{\partial f(x,y,z)}{\partial z} \approx \frac{f(x,\,y,\,z+h) - f(x,\,y,\,z-h)}{2h}$$
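A minimal sketch of such a gradient check, estimating each partial derivative by central differences and comparing against a known analytic gradient on a toy function (all names here are ours):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    # Central-difference estimate of df/dx[i] for each coordinate of x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += h
        xm.flat[i] -= h
        grad.flat[i] = (f(xp) - f(xm)) / (2 * h)
    return grad

# Usage: compare against the analytic gradient with np.allclose.
f = lambda v: np.sum(v ** 2)            # toy function; analytic gradient is 2v
v = np.array([1.0, -2.0, 3.0])
assert np.allclose(numerical_grad(f, v), 2 * v, atol=1e-6)
```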


Page 42:

Representations

Page 43:

Representations


$$\mathbf{y} = f(\mathbf{x}) = \mathbf{x} \cdot \mathbf{W} + \mathbf{b}$$
$$\hat{y} = \operatorname*{argmax}_i\ \mathbf{y}_{[i]} \qquad (2.7)$$

Here $\mathbf{y} \in \mathbb{R}^6$ is a vector of the scores assigned by the model to each language, and we again determine the predicted language by taking the argmax over the entries of $\mathbf{y}$.

2.4 REPRESENTATIONS

Consider the vector $\mathbf{y}$ resulting from applying equation (2.7) of a trained model to a document. The vector can be considered a representation of the document, capturing the properties of the document that are important to us, namely the scores of the different languages. The representation $\mathbf{y}$ contains strictly more information than the prediction $\hat{y} = \operatorname{argmax}_i \mathbf{y}_{[i]}$: for example, $\mathbf{y}$ can be used to distinguish documents in which the main language is German, but which also contain a sizeable amount of French words. By clustering documents based on their vector representations as assigned by the model, we could perhaps discover documents written in regional dialects, or by multilingual authors.

The vectors $\mathbf{x}$ containing the normalized letter-bigram counts for the documents are also representations of the documents, arguably containing a similar kind of information to the vectors $\mathbf{y}$. However, the representation in $\mathbf{y}$ is more compact (6 entries instead of 784) and more specialized for the language-prediction objective (clustering by the vectors $\mathbf{x}$ would likely reveal document similarities that are not due to a particular mix of languages, but perhaps due to the document's topic or writing style).

The trained matrix $\mathbf{W} \in \mathbb{R}^{784 \times 6}$ can also be considered as containing learned representations. Each of the 6 columns of the matrix corresponds to a particular language, and can be taken to be a 784-dimensional vector representation of this language in terms of its characteristic letter-bigram patterns. We can then cluster the 6 language vectors according to their similarity. Similarly, each of the 784 rows of $\mathbf{W}$ corresponds to a particular letter-bigram, and provides a 6-dimensional vector representation of that bigram in terms of the languages it prompts.

Representations are central to deep learning. In fact, one could argue that the main power of deep learning is the ability to learn good representations. In the linear case, the representations are interpretable, in the sense that we can assign a meaningful interpretation to each dimension in the representation vector (e.g., each dimension corresponds to a particular language or letter-bigram). This is in general not the case – deep learning models often learn a cascade of representations of the input that build on top of each other, in order to best model the problem at hand, and these representations are often not interpretable – we do not know which properties of the input they capture. However, they are still very useful for making predictions. Moreover, at the boundaries of the model, i.e. at the input …


We usually have many more than two features. Moving to a language setup, consider the task of distinguishing documents written in English from documents written in German. It turns out that letter frequencies make for quite good predictors (features) for this task. Even more informative are counts of letter bigrams, i.e. pairs of consecutive letters. Assuming we have an alphabet of 28 letters (a-z, space, and a special symbol for all other characters including digits, punctuation, etc.) we represent a document as a $28 \times 28$-dimensional vector $\mathbf{x} \in \mathbb{R}^{784}$, where each entry $\mathbf{x}_{[i]}$ represents the count of a particular letter combination in the document, normalized by the document's length. For example, denoting by $\mathbf{x}_{ab}$ the entry of $\mathbf{x}$ corresponding to the letter-bigram ab:

$$\mathbf{x}_{ab} = \frac{\#_{ab}}{|D|} \qquad (2.3)$$

where $\#_{ab}$ is the number of times the bigram ab appears in the document, and $|D|$ is the total number of bigrams in the document (the document's length).

Figure 2.2: Character-bigram histograms for documents in English (left, blue) and German (right, green). Underscores denote spaces.

Figure 2.2 shows such bigram histograms for several German and English texts. For readability, we only show the top frequent character-bigrams and not the entire feature …

Language Identification for 6 languages based on letter bigram counts.
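A minimal sketch of this feature extraction (equation 2.3). The function name and the choice of '#' as the catch-all symbol are ours; the text only says "a special symbol" for non-letter characters. The sketch returns a sparse dict of non-zero entries rather than the dense 784-dimensional vector:

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def bigram_features(doc):
    # Normalized letter-bigram counts (equation 2.3): x_ab = #ab / |D|.
    # 28 symbols: a-z, space, and '#' as the catch-all for everything else.
    chars = [c if (c in LETTERS or c == " ") else "#" for c in doc.lower()]
    bigrams = list(zip(chars, chars[1:]))
    total = max(len(bigrams), 1)
    return {"".join(bg): n / total for bg, n in Counter(bigrams).items()}
```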

Page 44:

Representations


assume 28 letters (including space).

the vector x is a 784-dimensional vector.

each entry is the count for a particular letter pair.

Page 45:

Representations


consider the values of y

Page 46:

Representations


consider the 6 columns of W

Page 47:

Representations


consider the 784 rows of W

Page 48:

Representations


think of x as a (count-weighted) sum of one-hot vectors.

what is xW? (each one-hot vector selects a row of W, so xW is the count-weighted sum of the corresponding bigram rows; see the sketch below)
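A quick NumPy sanity check of the prompt above, with a toy random W: multiplying a count-weighted sum of one-hot vectors by W yields the same weighted sum of the corresponding rows of W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 6))   # bigram-by-language weight matrix

x = np.zeros(784)
x[3] += 1.0                     # one occurrence of bigram #3
x[17] += 2.0                    # two occurrences of bigram #17

# xW is exactly the count-weighted sum of the selected rows of W.
assert np.allclose(x @ W, W[3] + 2 * W[17])
```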

Page 49:

Representations

what happens if we add layers?

$$\mathbf{y} = g(\mathbf{x}\mathbf{W})\mathbf{U}, \qquad \mathbf{W} \in \mathbb{R}^{784 \times 30},\ \mathbf{U} \in \mathbb{R}^{30 \times 6}$$
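A minimal NumPy sketch of this two-layer model; tanh stands in for the unspecified non-linearity g, and the random weights are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 30))   # bigram counts -> 30-dim hidden representation
U = rng.normal(size=(30, 6))     # hidden representation -> 6 language scores

def predict(x, g=np.tanh):
    # y = g(xW)U: the hidden layer g(xW) is a learned representation of the document.
    return g(x @ W) @ U

scores = predict(rng.random(784))
print(scores.shape)              # (6,)
```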