
Artificial Intelligence

AI & Machine Learning & Neural Nets

Marc Toussaint
University of Stuttgart
Winter 2019/20


Motivation: Neural networks became a central topic for Machine Learning and AI. But in principle, they're just parameterized functions that can be fit to data. They lack many appealing aspects that were the focus of ML research in '90-'10. So why are they so successful now? This lecture introduces the basics and tries to discuss the success of NNs.

AI & Machine Learning & Neural Nets – – 1/43


What is AI?

AI & Machine Learning & Neural Nets – What is AI? – 2/43


What is AI?

• AI is a research field

• AI research is about systems that take decisions
– AI formalizes decision processes: interactive (decisions change world state), or passive
– AI distinguishes between agent(s) and the external world: decision variables

• ... systems that take optimal/desirable decisions
– AI formalizes decision objectives
– AI aims for systems to exhibit functionally desirable behavior

• ... systems that take optimal decisions on the basis of all available information
– AI is about inference
– AI is about learning

AI & Machine Learning & Neural Nets – What is AI? – 3/43


What is Machine Learning?

• In large parts, ML is (let's call this ML0): fitting a function $f: x \mapsto y$ to given data $D = \{(x_i, y_i)\}_{i=1}^n$

• And what does that have to do with AI?

• Taken literally, not much:
– The decision made by an ML0 method is only a single decision: decide on the function $f \in H$ in the hypothesis space $H$
– This "single-decision process" is not interactive; ML0 does not formalize/model/consider how the choice of $f$ changes the world
– The objective $L(f, D)$ in ML0 depends only on the given static data $D$ and the decision $f$, not on how $f$ might change the world
– "Learning" in ML0 is not an interactive process, but some method to pick $f$ on the basis of the static $D$. The typically iterative optimization process is not the decision process that ML0 focuses on.

AI & Machine Learning & Neural Nets – What is AI? – 4/43



But function approximation can be used to help solve AI (interactive decision process) problems:

• Building a trained/fixed $f$ into an interacting system based on human expert knowledge
E.g., an engineer trains a NN $f$ to recognize street signs based on a large fixed data set $D$. S/he builds $f$ into a car to drive autonomously. That car certainly solves an AI problem; $f$ continuously makes decisions that change the state of the world. But $f$ was never optimized literally for the task of interacting with the world; it was only optimized to minimize a loss on the static data $D$. It is not a priori clear that minimizing the loss of $f$ on $D$ is related to maximizing rewards in car driving. That fact is the expertise of the engineer; it is not some AI algorithm that discovered that fact. At no point was there an AI method that "learns to drive a car". There was only an optimization method to minimize the loss of $f$ on $D$.

• Using ML in interactive decision processes
E.g., to approximate a Q-function, state evaluation function, reward function, the system dynamics, etc. Then, other AI methods can use these approximate models to actually take decisions in the interactive context.

AI & Machine Learning & Neural Nets – What is AI? – 5/43


[Machine Learning beyond ML0]

• Many areas of ML go beyond plain function approximation:

• Fitting more structured models to data, which includes
– Time series, recurrent processes
– Graphical Models
– Unsupervised learning (semi-supervised learning)
...but in all these cases, the scenario is still not interactive, the data $D$ is static, the decision is about picking a single model $f$ from a hypothesis space, and the objective is a loss based on $f$ and $D$ only.

• Active Learning, where the "ML agent" makes decisions about what data label to query next

• Bandits

• Reinforcement Learning (roughly, marriage of ML and decision theory)

AI & Machine Learning & Neural Nets – What is AI? – 6/43


Machine Learning primer

• Basic ML0 (function approximation) is all about specifying the
– Model: How is the function $f: x \mapsto y$ represented/parameterized?
– Objective: Given a data set $D$, what loss should $f$ minimize?
– Solver: What algorithm do we use to minimize the loss?

• Given data $D = \{(x_i, y_i)\}_{i=1}^n$, the standard objective is to minimize the "error" on the data
$$ f^* = \operatorname*{argmin}_{f \in H} \sum_{i=1}^n \ell(f(x_i), y_i) \,, $$
where $\ell(\hat y, y) > 0$ penalizes a discrepancy between a model output $\hat y$ and the data $y$:
– Squared error $\ell(\hat y, y) = (\hat y - y)^2$
– Classification error $\ell(\hat y, y) = [\hat y \neq y]$
– Neg-log-likelihood $\ell(\hat y, y) = -\log p(y \mid \hat y)$
– etc.

AI & Machine Learning & Neural Nets – Machine Learning primer – 7/43
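To make the ML0 objective concrete, here is a minimal Python/NumPy sketch (not from the lecture; the toy hypothesis $f$ and the data are illustrative) that evaluates the empirical loss $\sum_i \ell(f(x_i), y_i)$ with the losses listed above.

```python
import numpy as np

def squared_error(y_pred, y):
    # l(y_pred, y) = (y_pred - y)^2
    return (y_pred - y) ** 2

def classification_error(y_pred, y):
    # l(y_pred, y) = [y_pred != y]  (0/1 loss)
    return float(y_pred != y)

def neg_log_likelihood(p_pred, y):
    # l = -log p(y | x), where p_pred is the model's probability vector over labels
    return -np.log(p_pred[y])

def empirical_loss(f, X, Y, loss):
    # sum_i l(f(x_i), y_i) over the static data set D = {(x_i, y_i)}
    return sum(loss(f(x), y) for x, y in zip(X, Y))

# Usage: a fixed (hand-picked) linear hypothesis f on toy 1D regression data
f = lambda x: 2.0 * x + 1.0
X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.1, 2.9, 5.2])
print(empirical_loss(f, X, Y, squared_error))
```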



Example: Polynomial Ridge Regression

• Model: Given input $x \in \mathbb{R}^d$ and output $y \in \mathbb{R}$, we represent functions $f_\beta$ as polynomials over $x$,
$$ f(x) = \sum_{j=1}^k \phi_j(x)\, \beta_j = \phi(x)^\top \beta $$
where $\phi(x)$ is the feature vector of all monomials, e.g. quadratic: $\phi(x) = (1, x_1, .., x_d, x_1^2, x_1 x_2, x_1 x_3, .., x_d^2)$

• Objective: Given data $D = \{(x_i, y_i)\}_{i=1}^n$ we define the loss as
$$ L^{\text{ridge}}(\beta) = \sum_{i=1}^n (y_i - \phi(x_i)^\top \beta)^2 + \lambda \sum_{j=2}^k \beta_j^2 $$

• Solver: We can solve this analytically:
$$ \beta^{\text{ridge}} = \operatorname*{argmin}_\beta L^{\text{ridge}}(\beta) = (X^\top X + \lambda I)^{-1} X^\top y $$
where $y = (y_1, .., y_n)$ is the vector of outputs, and $X_{i\cdot} = \phi(x_i)$ contains the feature vectors in rows.

AI & Machine Learning & Neural Nets – Machine Learning primer – 8/43
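A minimal NumPy sketch of the Model/Objective/Solver split above for polynomial ridge regression (illustrative: quadratic features only, and the constant feature $\beta_1$ is left unregularized, matching the sum starting at $j = 2$).

```python
import numpy as np

def phi_quadratic(x):
    # feature vector of all monomials up to degree 2: (1, x_1..x_d, x_i*x_j for i<=j)
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

def ridge_fit(X_raw, y, lam):
    # beta = (X^T X + lam*I)^{-1} X^T y, with the constant feature unregularized
    X = np.array([phi_quadratic(x) for x in X_raw])   # rows X_i. = phi(x_i)
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                                      # do not penalize beta_1
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

def ridge_predict(beta, x):
    return phi_quadratic(x) @ beta

# Usage on toy data: y is (roughly) quadratic in a 2D input
rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(50, 2))
y = 1 + X_raw[:, 0] - 2 * X_raw[:, 1] ** 2 + 0.1 * rng.normal(size=50)
beta = ridge_fit(X_raw, y, lam=1e-2)
print(ridge_predict(beta, np.array([0.3, -0.5])))
```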



Neural Networks

AI & Machine Learning & Neural Nets – Neural Networks – 9/43


What is a Neural Network?

• A NN is a parameterized function $f_\beta: \mathbb{R}^d \to \mathbb{R}^M$
– $\beta$ are called weights
– Given a data set $D = \{(x_i, y_i)\}_{i=1}^n$, we minimize some loss
$$ \beta^* = \operatorname*{argmin}_\beta \sum_{i=1}^n \ell(f_\beta(x_i), y_i) + \text{regularization} $$

• In that sense, they just replace our previous model assumption $f(x) = \phi(x)^\top \beta$; the rest is "in principle" the same

AI & Machine Learning & Neural Nets – Neural Networks – 10/43


Neural Network models

• A (feed-forward) NN $\mathbb{R}^{h_0} \to \mathbb{R}^{h_L}$ with $L$ layers, each $h_l$-dimensional, defines the function
1-layer: $f_\beta(x) = W_1 x + b_1$, $\quad W_1 \in \mathbb{R}^{h_1 \times h_0},\ b_1 \in \mathbb{R}^{h_1}$
2-layer: $f_\beta(x) = W_2\, \sigma(W_1 x + b_1) + b_2$, $\quad W_l \in \mathbb{R}^{h_l \times h_{l-1}},\ b_l \in \mathbb{R}^{h_l}$
L-layer: $f_\beta(x) = W_L\, \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_L$

• The parameter $\beta = (W_{1:L}, b_{1:L})$ is the collection of all weights $W_l \in \mathbb{R}^{h_l \times h_{l-1}}$ and biases $b_l \in \mathbb{R}^{h_l}$

AI & Machine Learning & Neural Nets – Neural Networks – 11/43


feature-based regression

AI & Machine Learning & Neural Nets – Neural Networks – 12/43


feature-based classification (same features for all outputs)

AI & Machine Learning & Neural Nets – Neural Networks – 12/43


neural network

AI & Machine Learning & Neural Nets – Neural Networks – 12/43


Neural Network models

• To describe the mapping as an iteration, we introduce notation for the intermediate values:
– the input to layer $l$ is $z_l = W_l x_{l-1} + b_l \in \mathbb{R}^{h_l}$
– the activation of layer $l$ is $x_l = \sigma(z_l) \in \mathbb{R}^{h_l}$

• Then the L-layer NN model can be computed using the forward propagation:
$$ \forall_{l=1,..,L\text{-}1}:\quad z_l = W_l x_{l-1} + b_l\,,\quad x_l = \sigma(z_l) $$
where $x_0 \equiv x$ is the input, and $f_\beta(x) \equiv z_L$ the output
– $x_{l-1} \mapsto z_l$ is a linear transformation, highly parameterized with $W_l, b_l$
– $z_l \mapsto x_l$ is a non-linear transformation, element-wise and without parameters

• L-layer means $L-1$ hidden layers plus 1 output layer. (The input $x_0$ is not counted.)

AI & Machine Learning & Neural Nets – Neural Networks – 13/43
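A minimal NumPy sketch of this forward propagation (illustrative layer sizes; $\sigma$ is taken to be the ReLU from the next slide, and no $\sigma$ is applied to the output layer, so the function returns $f_\beta(x) \equiv z_L$).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W, b):
    # W, b: lists of length L; W[l] has shape (h_{l+1}, h_l) in 0-based indexing.
    # Returns f_beta(x) = z_L plus all intermediates (x_l, z_l) for later use in backprop.
    L = len(W)
    xs, zs = [x], []
    for l in range(L):
        z = W[l] @ xs[-1] + b[l]                 # z_l = W_l x_{l-1} + b_l
        zs.append(z)
        xs.append(relu(z) if l < L - 1 else z)   # no sigma on the output layer
    return zs[-1], xs, zs

# Usage: a 3-layer NN R^4 -> R^2 with hidden sizes (8, 8)
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                             # h_0, h_1, h_2, h_3
W = [rng.normal(0, 1/np.sqrt(sizes[l]), (sizes[l+1], sizes[l])) for l in range(3)]
b = [np.zeros(sizes[l+1]) for l in range(3)]
out, xs, zs = forward(rng.normal(size=4), W, b)
print(out)
```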


Neural Network models

• The activation function $\sigma(z)$ is applied element-wise:
– rectified linear unit (ReLU): $\sigma(z) = [z]_+ = \max\{0, z\} = z\,[z \ge 0]$
– leaky ReLU: $\sigma(z) = \max\{0.01 z, z\} = \begin{cases} 0.01 z & z < 0 \\ z & z \ge 0 \end{cases}$
– sigmoid, logistic: $\sigma(z) = 1/(1 + e^{-z})$
– tanh: $\sigma(z) = \tanh(z)$

AI & Machine Learning & Neural Nets – Neural Networks – 14/43
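The same activation functions as element-wise NumPy functions (a small illustrative sketch).

```python
import numpy as np

def relu(z):        # [z]_+ = max{0, z}
    return np.maximum(0.0, z)

def leaky_relu(z):  # 0.01 z for z < 0, z for z >= 0
    return np.where(z >= 0, z, 0.01 * z)

def sigmoid(z):     # logistic 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, sigmoid, tanh):
    print(f.__name__, f(z))
```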


Neural Network models

• We can think of the second-to-last layer $x_{L-1}$ as a feature vector
$$ \phi_\beta(x) = x_{L-1} $$

• This aligns NN models with what we discussed so far. But the crucial difference is: in NNs, the features $\phi_\beta(x)$ are also parameterized and trained! While in previous lectures, we had to fix $\phi(x)$ by hand, NNs allow us to learn features and intermediate representations

• Note: It is a common approach to train NNs as usual, but after training fix the trained features $\phi(x)$ ("remove the head (=output layer) and fix the remaining body of the NN") and use these trained features for similar problems or other kinds of ML on top.

AI & Machine Learning & Neural Nets – Neural Networks – 15/43


NNs as universal function approximators

• A 1-layer NN is linear in the input

• Already a 2-layer NN with $h_1 \to \infty$ hidden neurons is a universal function approximator
– Corresponds to $k \to \infty$ features $\phi(x) \in \mathbb{R}^k$ that are well tuned

AI & Machine Learning & Neural Nets – Neural Networks – 16/43


Objectives to train NNs

AI & Machine Learning & Neural Nets – Neural Networks – 17/43


Loss functions as usual

• Squared error regression, for $h_L$-dimensional output:
– for a single data point $(x, y^*)$, $\ell(f(x), y^*) = (f(x) - y^*)^2$
– the loss gradient is $\frac{\partial \ell}{\partial f} = 2(f - y^*)^\top$

• For multi-class classification we have $h_L = M$ outputs, and $f_\beta(x) \in \mathbb{R}^M$ represents the discriminative function, which means that the predicted class is the $\operatorname*{argmax}_{y \in \{1,..,M\}} [f_\beta(x)]_y$.

• Neg-log-likelihood or cross-entropy loss:
– for a single data point $(x, y^*)$, $\ell(f(x), y^*) = -\log p(y^* | x)$
– the loss gradient at output $y$ is $\frac{\partial \ell}{\partial f_y} = p(y|x) - [y = y^*]$

• One-vs-all hinge loss:
– for a single data point $(x, y^*)$, $\ell(f(x), y^*) = \sum_{y \neq y^*} [1 - (f_{y^*} - f_y)]_+$
– the loss gradient at non-target outputs $y \neq y^*$ is $\frac{\partial \ell}{\partial f_y} = [f_{y^*} < f_y + 1]$
– the loss gradient at the target output $y^*$ is $\frac{\partial \ell}{\partial f_{y^*}} = -\sum_{y \neq y^*} [f_{y^*} < f_y + 1]$
– this is also called the Perceptron Algorithm

AI & Machine Learning & Neural Nets – Neural Networks – 18/43
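A sketch of the classification losses and their gradients as stated above (illustrative; for the neg-log-likelihood it assumes the common choice $p(y|x) = \mathrm{softmax}(f(x))_y$ over the discriminative values, which yields exactly the gradient $p(y|x) - [y = y^*]$).

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def cross_entropy(f, y_star):
    # l = -log p(y*|x), assuming p(y|x) = softmax(f)_y
    p = softmax(f)
    loss = -np.log(p[y_star])
    grad = p.copy()
    grad[y_star] -= 1.0                    # dl/df_y = p(y|x) - [y = y*]
    return loss, grad

def hinge_one_vs_all(f, y_star):
    # l = sum_{y != y*} [1 - (f_{y*} - f_y)]_+
    margins = 1.0 - (f[y_star] - f)
    margins[y_star] = 0.0
    loss = np.maximum(0.0, margins).sum()
    grad = (margins > 0).astype(float)     # dl/df_y = [f_{y*} < f_y + 1] for y != y*
    grad[y_star] = 0.0
    grad[y_star] = -grad.sum()             # dl/df_{y*} = -sum of the above
    return loss, grad

f = np.array([1.5, 2.0, -0.5])   # discriminative values for M = 3 classes
print(cross_entropy(f, y_star=1))
print(hinge_one_vs_all(f, y_star=1))
```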


New types of regularization

• Conventionally: add an L2 or L1 regularization.
– adds a penalty $\lambda W_{l,ij}^2$ (Ridge) or $\lambda |W_{l,ij}|$ (Lasso) for every weight
– In practice, compute the unregularized gradient as usual, then add $\lambda W_{l,ij}$ (for L2), or $\lambda\, \mathrm{sign}\, W_{l,ij}$ (for L1) to the gradient
– Historically, this is called weight decay, as the additional gradient (executed after the unregularized weight update) just decays weights

• Core modern approach: Dropout
– (see next slide)

• Others:
– Data Augmentation
– Training ensembles, bagging (averaging bootstrapped models)
– Additional embedding objectives (e.g. semi-supervised embedding)
– Early stopping

AI & Machine Learning & Neural Nets – Neural Networks – 19/43


Dropout

• What is dropout?
– During both forward passing and gradient computation, pretend that randomly selected "neurons" would not exist
– For each entry $i$ in each layer $x_{l,i}$, decide with probability $p$ ($\approx .5$) whether it drops out: set the respective activation $x_{l,i}$, and its gradient, equal to zero
– At prediction time, "average" dropouts, i.e., multiply each activation $x_{l,i}$ with $p$

• Theoretical foundation?
– Each dropout pattern defines another network. We have a random ensemble of networks, sharing parameters, each of which is trained on the same task. Eventually, we want to average over this ensemble.
– Srivastava et al: Dropout: a simple way to prevent neural networks from overfitting, JMLR 2014.
– "a way of approximately combining exponentially many different neural network architectures efficiently"
– "p can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks"

AI & Machine Learning & Neural Nets – Neural Networks – 20/43
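A minimal sketch of dropout on one layer's activations (illustrative). Following Srivastava et al., `keep` denotes the retention probability; with `keep = 0.5` this coincides with the slide's $p \approx .5$, and at prediction time the activations are scaled by the same factor to "average" over dropout patterns.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x_l, keep=0.5):
    # Pretend randomly selected neurons do not exist: zero them in the forward pass.
    # The same mask must also be applied to the backpropagated gradient of this layer.
    mask = (rng.random(x_l.shape) < keep).astype(x_l.dtype)
    return x_l * mask, mask

def dropout_predict(x_l, keep=0.5):
    # "Average" over dropout patterns by scaling every activation.
    return x_l * keep

x_l = rng.normal(size=8)
x_dropped, mask = dropout_train(x_l)
print(mask)
print(dropout_predict(x_l))
```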


Optimization

AI & Machine Learning & Neural Nets – Neural Networks – 21/43


Computing the gradient

• Recall forward propagation in an L-layer NN:
$$ \forall_{l=1,..,L\text{-}1}:\quad z_l = W_l x_{l-1} + b_l\,,\quad x_l = \sigma(z_l) $$

• For a single data point $(x, y^*)$, assume we have a loss $\ell(f(x), y^*)$
We define $\delta_L \overset{\Delta}{=} \frac{d\ell}{df} = \frac{d\ell}{dz_L} \in \mathbb{R}^{1 \times M}$ as the gradient (as row vector) w.r.t. the output values $z_L$.

• Backpropagation: We can recursively compute the gradient $\frac{d\ell}{dz_l} \in \mathbb{R}^{1 \times h_l}$ w.r.t. all other layers $z_l$ as:
$$ \forall_{l=L\text{-}1,..,1}:\quad \delta_l \overset{\Delta}{=} \frac{d\ell}{dz_l} = \frac{d\ell}{dz_{l+1}} \frac{\partial z_{l+1}}{\partial x_l} \frac{\partial x_l}{\partial z_l} = [\delta_{l+1} W_{l+1}] \circ [\sigma'(z_l)]^\top $$
where $\circ$ is an element-wise product. The gradient w.r.t. parameters:
$$ \frac{d\ell}{dW_{l,ij}} = \frac{d\ell}{dz_{l,i}} \frac{\partial z_{l,i}}{\partial W_{l,ij}} = \delta_{l,i}\, x_{l-1,j} \quad\text{or}\quad \frac{d\ell}{dW_l} = \delta_l^\top x_{l-1}^\top\,,\qquad \frac{d\ell}{db_l} = \delta_l^\top $$

AI & Machine Learning & Neural Nets – Neural Networks – 22/43
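A self-contained NumPy sketch of these backpropagation equations for the squared-error loss (illustrative; ReLU as $\sigma$, so $\sigma'(z) = [z \ge 0]$), including a finite-difference check of one weight gradient.

```python
import numpy as np

def relu(z):  return np.maximum(0.0, z)
def drelu(z): return (z >= 0).astype(float)     # sigma'(z) for ReLU

def forward(x, W, b):
    xs, zs = [x], []
    for l in range(len(W)):
        zs.append(W[l] @ xs[-1] + b[l])
        xs.append(relu(zs[-1]) if l < len(W) - 1 else zs[-1])
    return xs, zs

def backprop(x, y_star, W, b):
    # Returns dl/dW_l and dl/db_l for all layers, for l(f(x), y*) = (f(x) - y*)^2.
    xs, zs = forward(x, W, b)
    L = len(W)
    delta = 2.0 * (zs[-1] - y_star)             # delta_L = dl/dz_L
    dW, db = [None] * L, [None] * L
    for l in reversed(range(L)):
        dW[l] = np.outer(delta, xs[l])          # dl/dW_l = delta_l^T x_{l-1}^T
        db[l] = delta.copy()                    # dl/db_l = delta_l^T
        if l > 0:
            # delta_{l-1} = [delta_l W_l] o sigma'(z_{l-1})
            delta = (delta @ W[l]) * drelu(zs[l - 1])
    return dW, db

# Usage with random weights, plus a finite-difference check of one entry
rng = np.random.default_rng(0)
sizes = [3, 5, 2]
W = [rng.normal(0, 1/np.sqrt(sizes[l]), (sizes[l+1], sizes[l])) for l in range(2)]
b = [np.zeros(sizes[l+1]) for l in range(2)]
x, y_star = rng.normal(size=3), rng.normal(size=2)
dW, db = backprop(x, y_star, W, b)
eps = 1e-6
W[0][0, 0] += eps
loss_p = np.sum((forward(x, W, b)[1][-1] - y_star) ** 2)
W[0][0, 0] -= 2 * eps
loss_m = np.sum((forward(x, W, b)[1][-1] - y_star) ** 2)
print(dW[0][0, 0], (loss_p - loss_m) / (2 * eps))   # should agree
```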


• These forward and backward computations are done for each data point $(x_i, y_i)$.

• Since the total loss is the sum $L(\beta) = \sum_i \ell(f_\beta(x_i), y_i)$, the total gradient is the sum of gradients per data point.

• Efficient implementations send multiple data points (tensors) simultaneously through the network (fwd and bwd), which speeds up computations.

AI & Machine Learning & Neural Nets – Neural Networks – 23/43


Optimization

• For small data size: We can compute the loss and its gradient $\sum_{i=1}^n \nabla_\beta \ell(f_\beta(x_i), y_i)$.
– Use classical gradient-based optimization methods
– default: L-BFGS, oldish but efficient: Rprop
– Called batch learning (in contrast to online learning)

• For large data size: The $\sum_{i=1}^n$ is highly inefficient!
– Adapt weights based on much smaller data subsets, mini batches

AI & Machine Learning & Neural Nets – Neural Networks – 24/43


Stochastic Gradient Descent

• Compute the loss and gradient for a mini batch $\hat D \subset D$ of fixed size $k$:
$$ L(\beta, \hat D) = \sum_{i \in \hat D} \ell(f_\beta(x_i), y_i)\,,\qquad \nabla_\beta L(\beta, \hat D) = \sum_{i \in \hat D} \nabla_\beta \ell(f_\beta(x_i), y_i) $$

• Naive Stochastic Gradient Descent, iterate
$$ \beta \leftarrow \beta - \eta\, \nabla_\beta L(\beta, \hat D) $$
– Choice of learning rate $\eta$ is crucial for convergence!
– Exponential cooling: $\eta = \eta_0^t$

Yurii Nesterov (1983): A method for solving the convex programming problem with convergence rate $O(1/k^2)$

• Modern SGD versions (Adam and Nadam) make this more efficient. See appendix.

AI & Machine Learning & Neural Nets – Neural Networks – 25/43
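A sketch of naive mini-batch SGD as written above (illustrative; `loss_grad(beta, batch)` stands for $\nabla_\beta L(\beta, \hat D)$, and the least-squares toy problem only serves to make the loop runnable).

```python
import numpy as np

def sgd(beta, loss_grad, data, k=32, eta=0.1, epochs=10, rng=None):
    # Naive SGD: repeatedly draw a mini batch D_hat of size k and step
    # beta <- beta - eta * grad L(beta, D_hat).
    rng = rng or np.random.default_rng(0)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // k)):
            beta = beta - eta * loss_grad(beta, data[idx])
    return beta

# Toy usage: least squares y = X beta, with the gradient averaged over the batch
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=1000)
data = np.arange(1000)   # batches are index sets into (X, y)

def loss_grad(beta, batch):
    Xb, yb = X[batch], y[batch]
    return 2.0 * Xb.T @ (Xb @ beta - yb) / len(batch)

print(sgd(np.zeros(3), loss_grad, data, k=32, eta=0.05, epochs=20))
```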


Behavior of Gradient Propagation

• Propagating $\delta_l$ back through many layers can lead to problems

• For the classical sigmoid $\sigma(z)$, $\sigma'(z)$ is always $< 1$ $\Rightarrow$ vanishing gradient
→ Modern activation functions (ReLU) reduce this problem

• The initialization of weights is very important!

AI & Machine Learning & Neural Nets – Neural Networks – 26/43


Initialization

• Choose random weights that don't grow or vanish the gradient:
– E.g., initialize weight vectors $W_{l,i\cdot}$ with standard deviation 1, i.e., each entry with sdv $\frac{1}{\sqrt{h_{l-1}}}$
– Roughly: If each element of $z_l$ has standard deviation $\epsilon$, the same should be true for $z_{l+1}$.

• Choose each weight vector $W_{l,i\cdot}$ to point in a uniform random direction → same as above

• Choose biases $b_{l,i}$ randomly so that the ReLU hinges cover the input well (think of distributing hinge features for continuous piece-wise linear regression)

AI & Machine Learning & Neural Nets – Neural Networks – 27/43
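A sketch of this initialization (illustrative; `bias_range` is an assumption standing in for "cover the input well"): each entry of $W_l$ gets standard deviation $1/\sqrt{h_{l-1}}$, so each weight vector $W_{l,i\cdot}$ has norm $\approx 1$ and $z_l$ keeps roughly the scale of $x_{l-1}$.

```python
import numpy as np

def init_layer(h_in, h_out, rng, bias_range=1.0):
    # Entries ~ N(0, 1/h_in): each row W_{l,i.} then has norm ~ 1 (a random direction),
    # so the standard deviation of z_l roughly matches that of x_{l-1}.
    W = rng.normal(0.0, 1.0 / np.sqrt(h_in), size=(h_out, h_in))
    # Spread biases so that the ReLU hinges cover the input (range is an assumption).
    b = rng.uniform(-bias_range, bias_range, size=h_out)
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(256, 128, rng)
x0 = rng.normal(size=256)
print(np.std(x0), np.std(W1 @ x0 + b1))   # similar scales
```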


Discussion: Why are NNs so successful now?

AI & Machine Learning & Neural Nets – Neural Networks – 28/43


Historical discussion
(This is completely subjective.)

• Early (from the 1940s):
– McCulloch & Pitts, Hebbian learning, Rosenblatt, Werbos (backpropagation)

• 1980s:
– Start of connectionism, NIPS
– ML wants to distinguish itself from pure statistics ("machines", "agents")

• '90-'10:
– More theory, better grounded, Statistical Learning theory
– Good ML is pure statistics (again) (Frequentists, SVM)
– ...or pure Bayesian (Graphical Models, Bayesian X)
– sample-efficiency, great generalization, guarantees, theory
– Great successes, in applications across disciplines; supervised, unsupervised, structured

• '10-:
– Big Data. NNs. Size matters. GPUs.
– Disproportionate focus on images
– Software engineering becomes central

AI & Machine Learning & Neural Nets – Neural Networks – 29/43


• NNs did not become "better" than they were 20 years ago. What changed are the metrics by which they are evaluated:

• Old:
– Sample efficiency & generalization; get the most from little data
– Guarantees (both w.r.t. generalization and optimization)
– generalize much better than nearest neighbor

• New:
– Ability to cope with billions of samples
→ no batch processing, but stochastic optimization (Adam) without monotone convergence
→ nearest neighbor methods infeasible, compress data into high-capacity NNs

AI & Machine Learning & Neural Nets – Neural Networks – 30/43



NNs vs. nearest neighbor

• Imagine an autonomous car. Instead of carrying a neural net, it carries 1 Petabyte of data (500 hard drives, several billion pictures). In every split second it records an image from a camera and wants to query the database to return the 100 most similar pictures. Perhaps with a non-trivial similarity metric. That's not reasonable!

• In that sense, NNs are much better than nearest neighbor. They store/compress/memorize huge amounts of data. Sample efficiency and the precise generalization behavior become less relevant.

• That's how the metrics changed from '90-'10 to nowadays.

AI & Machine Learning & Neural Nets – Neural Networks – 31/43


Images & Time Series

AI & Machine Learning & Neural Nets – Images & Time Series – 32/43


Images & Time Series

• My guess: 90% of the recent success of NNs is in the areas of images or time series

• For images, convolutional NNs (CNNs) impose a very sensible prior; the representations that emerge in CNNs are in fact similar to representations in the visual area of our brain.

• For time series, long short-term memory (LSTM) networks represent long-term dependencies in a way that is well trainable – something that is hard to do with other model structures.

• Both these structural priors, combined with huge data and capacity, make these methods very strong.

AI & Machine Learning & Neural Nets – Images & Time Series – 33/43


Convolutional NNs

• Standard fully connected layer: the full matrix $W_i$ has $h_i h_{i+1}$ parameters

• Convolutional: Each neuron (entry of $z_{i+1}$) receives input from a square receptive field, with $k \times k$ parameters. All neurons share these parameters → translation invariance. The whole layer only has $k^2$ parameters.

• There are often multiple neurons with the same receptive field (channels), to represent different "filters". Stride leads to downsampling. Padding at borders.

• Pooling applies a predefined operation on the receptive field (no parameters): max or average. Typically for downsampling.

AI & Machine Learning & Neural Nets – Images & Time Series – 34/43
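A naive sketch of a single convolutional layer plus max-pooling, to make the weight sharing concrete (illustrative: one input channel, several output channels, no deep-learning library). Each output neuron sees a $k \times k$ receptive field, and all neurons of one channel share the same $k^2$ weights.

```python
import numpy as np

def conv2d(image, filters, bias, stride=1, pad=0):
    # image: (H, W); filters: (n_filters, k, k); bias: (n_filters,)
    # The whole layer has only n_filters * k*k (+ n_filters) parameters,
    # regardless of the image size -> translation invariance by weight sharing.
    img = np.pad(image, pad)
    n_f, k, _ = filters.shape
    H = (img.shape[0] - k) // stride + 1
    W = (img.shape[1] - k) // stride + 1
    out = np.zeros((n_f, H, W))
    for f in range(n_f):
        for i in range(H):
            for j in range(W):
                patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
                out[f, i, j] = np.sum(patch * filters[f]) + bias[f]
    return out

def max_pool(x, size=2):
    # Predefined max over each receptive field, no parameters; downsamples by `size`.
    C, H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:, :H2*size, :W2*size].reshape(C, H2, size, W2, size).max(axis=(2, 4))

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28))
filters = rng.normal(0, 0.1, size=(4, 3, 3))   # 4 channels, 3x3 receptive fields
z = conv2d(image, filters, bias=np.zeros(4), stride=1, pad=1)
print(z.shape, max_pool(np.maximum(z, 0)).shape)   # (4, 28, 28) (4, 14, 14)
```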


Learning to read these diagrams...

AlexNet

AI & Machine Learning & Neural Nets – Images & Time Series – 35/43


ResNet

AI & Machine Learning & Neural Nets – Images & Time Series – 36/43


ResNeXt

AI & Machine Learning & Neural Nets – Images & Time Series – 37/43


Pretrained networks

• ImageNet5k, AlexNet, VGG, ResNet, ResNeXt

AI & Machine Learning & Neural Nets – Images & Time Series – 38/43


LSTMs

AI & Machine Learning & Neural Nets – Images & Time Series – 39/43


LSTM

• $c$ is a memory signal that is multiplied with a sigmoid signal $\Gamma_f$. If that is saturated ($\Gamma_f \approx 1$), the memory is preserved, and backpropagation copies gradients back

• If $\Gamma_i$ is close to 1, a new signal $\tilde c$ is written into memory

• If $\Gamma_o$ is close to 1, the memory contributes to the normal neural activations $a$

AI & Machine Learning & Neural Nets – Images & Time Series – 40/43
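For reference, the standard LSTM cell equations written in the gate notation of this slide (this is the common textbook formulation and may differ in details from the exact variant drawn on the slide; $a$ denotes the layer's activations and $[x_t, a_{t-1}]$ the concatenated input):
$$ \Gamma_f = \sigma(W_f[x_t, a_{t-1}] + b_f), \quad \Gamma_i = \sigma(W_i[x_t, a_{t-1}] + b_i), \quad \Gamma_o = \sigma(W_o[x_t, a_{t-1}] + b_o), $$
$$ \tilde c_t = \tanh(W_c[x_t, a_{t-1}] + b_c), \qquad c_t = \Gamma_f \circ c_{t-1} + \Gamma_i \circ \tilde c_t, \qquad a_t = \Gamma_o \circ \tanh(c_t), $$
so with $\Gamma_f \approx 1$ the memory $c$ is preserved, $\Gamma_i$ gates writing $\tilde c$ into memory, and $\Gamma_o$ gates how the memory contributes to the activations $a$.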


Gated Recurrent Units

• Cleaner and more modern: Gated Recurrent Units, but perhaps just similar performance

• Gated Feedback RNNs

AI & Machine Learning & Neural Nets – Images & Time Series – 41/43


Deep RL

• Value Network

• Advantage Network

• Action Network

• Experience Replay

• Fixed Q-targets

• etc, etc

AI & Machine Learning & Neural Nets – Images & Time Series – 42/43


Conclusions

• Conventional feed-forward neural networks are by no means magic. They're a parameterized function, which is fit to data.

• Convolutional NNs do make strong and good assumptions about how information processing on images should be structured. The results are great and related to some degree to human visual representations. A large part of the success of deep learning is on images.
Also LSTMs make good assumptions about how memory signals help represent time series.
The flexibility of "clicking together" network structures and general differentiable computation graphs is great.
All these are innovations w.r.t. formulating structured models for ML.

• The major strength of NNs is in their capacity and that, using massively parallelized computation, they can be trained on tons of data. Maybe they don't even need to be better than nearest neighbor lookup, but they can be queried much faster.

AI & Machine Learning & Neural Nets – Images & Time Series – 43/43


Appendix

AI & Machine Learning & Neural Nets – Appendix – 44/43


Computation Graphs

– A great collateral benefit of NN research!
– Perhaps a new paradigm to design large scale systems, beyond what software engineering teaches classically
– [see section 3.2 in the "Maths" lecture]

AI & Machine Learning & Neural Nets – Appendix – 45/43


Example

• Three real-valued quantities $x$, $g$ and $f$ which depend on each other:
$$ f(x, g) = 3x + 2g \quad\text{and}\quad g(x) = 2x\,. $$
What is $\frac{\partial}{\partial x} f(x, g)$ and what is $\frac{d}{dx} f(x, g)$?

• The partial derivative only considers a single function $f(a, b, c, ..)$ and asks how the output of this single function varies with one of its arguments. (Not caring that the arguments might be functions of yet something else.)

• The total derivative considers full networks of dependencies between quantities and asks how one quantity varies with some other.

AI & Machine Learning & Neural Nets – Appendix – 46/43
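For concreteness (the answer follows directly from the two definitions above):
$$ \frac{\partial}{\partial x} f(x, g) = 3, \qquad \frac{d}{dx} f(x, g) = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial g}\frac{dg}{dx} = 3 + 2 \cdot 2 = 7\,. $$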



Computation Graphs

• A function network or computation graph is a DAG of $n$ quantities $x_i$ where each quantity is a deterministic function of a set of parents $\pi(i) \subset \{1,..,n\}$, that is
$$ x_i = f_i(x_{\pi(i)}) $$
where $x_{\pi(i)} = (x_j)_{j \in \pi(i)}$ is the tuple of parent values

• (This could also be called a deterministic Bayes net.)

• Total derivative: Given a variation $dx$ of some quantity, how would all child quantities (down the DAG) vary?

AI & Machine Learning & Neural Nets – Appendix – 47/43


Chain rules

• Forward-version: (I use this in robotics)
$$ \frac{df}{dx} = \sum_{g \in \pi(f)} \frac{\partial f}{\partial g} \frac{dg}{dx} $$
Why "forward"? You've computed $\frac{dg}{dx}$ already, now you move forward to $\frac{df}{dx}$.
Note: If $x \in \pi(f)$ is also a direct argument to $f$, the sum includes the term $\frac{\partial f}{\partial x}\frac{dx}{dx} \equiv \frac{\partial f}{\partial x}$. To emphasize this, one could also write $\frac{df}{dx} = \frac{\partial f}{\partial x} + \sum_{g \in \pi(f),\, g \neq x} \frac{\partial f}{\partial g}\frac{dg}{dx}$.

• Backward-version: (used in NNs!)
$$ \frac{df}{dx} = \sum_{g:\, x \in \pi(g)} \frac{df}{dg} \frac{\partial g}{\partial x} $$
Why "backward"? You've computed $\frac{df}{dg}$ already, now you move backward to $\frac{df}{dx}$.
Note: If $x \in \pi(f)$, i.e. the sum includes $g = f$, it includes the term $\frac{df}{df}\frac{\partial f}{\partial x} \equiv \frac{\partial f}{\partial x}$. We could also write $\frac{df}{dx} = \frac{\partial f}{\partial x} + \sum_{g:\, x \in \pi(g),\, g \neq f} \frac{df}{dg}\frac{\partial g}{\partial x}$.

AI & Machine Learning & Neural Nets – Appendix – 48/43


Stochastic Gradient Descent

• SGD with momentum:
$$ \Delta\beta \leftarrow \alpha\, \Delta\beta - \eta\, \nabla_\beta L(\beta, \hat D)\,,\qquad \beta \leftarrow \beta + \Delta\beta $$

• Nesterov Accelerated Gradient ("Nesterov Momentum"):
$$ \Delta\beta \leftarrow \alpha\, \Delta\beta - \eta\, \nabla_\beta L(\beta + \Delta\beta, \hat D)\,,\qquad \beta \leftarrow \beta + \Delta\beta $$

Yurii Nesterov (1983): A method for solving the convex programming problem with convergence rate $O(1/k^2)$

AI & Machine Learning & Neural Nets – Appendix – 49/43
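A sketch of these two updates (illustrative; `grad(beta)` stands for $\nabla_\beta L(\beta, \hat D)$ on the current mini batch, and the quadratic toy objective only makes the loop runnable).

```python
import numpy as np

def momentum_step(beta, delta, grad, alpha=0.9, eta=0.01):
    # SGD with momentum: delta <- alpha*delta - eta*grad(beta); beta <- beta + delta
    delta = alpha * delta - eta * grad(beta)
    return beta + delta, delta

def nesterov_step(beta, delta, grad, alpha=0.9, eta=0.01):
    # Nesterov: evaluate the gradient at the look-ahead point beta + delta
    delta = alpha * delta - eta * grad(beta + delta)
    return beta + delta, delta

# Usage on a toy quadratic L(beta) = 0.5 * ||beta||^2, so grad(beta) = beta
grad = lambda beta: beta
beta, delta = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    beta, delta = nesterov_step(beta, delta, grad)
print(beta)   # approaches the minimum at 0
```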


Adam

arXiv:1412.6980

(all operations interpreted element-wise)

AI & Machine Learning & Neural Nets – Appendix – 50/43
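The algorithm referenced here (its figure is not reproduced in this transcript) is, roughly, the following sketch of the published Adam update from arXiv:1412.6980 with the paper's default hyperparameters; `grad` is a placeholder for the mini-batch gradient $g_t$.

```python
import numpy as np

def adam(theta, grad, steps=1000, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(theta)   # 1st moment estimate, m_t ~ <g>
    v = np.zeros_like(theta)   # 2nd moment estimate, v_t ~ <g^2>
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Usage on the toy quadratic L(theta) = 0.5 * ||theta||^2, grad(theta) = theta
print(adam(np.array([1.0, -2.0]), grad=lambda th: th, steps=1000, alpha=1e-2))
```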


Adam & Nadam

• Adam interpretations (everything element-wise!):
– $m_t \approx \langle g \rangle$, the mean gradient in the recent iterations
– $v_t \approx \langle g^2 \rangle$, the mean gradient-square in the recent iterations
– $m_t, v_t$ are bias corrected (check: in the first iteration, $t = 1$, we have $m_t = g_t$, unbiased, as desired)
– $\Delta\theta \approx -\alpha \frac{\langle g \rangle}{\langle g^2 \rangle}$ would be a Newton step if $\langle g^2 \rangle$ were the Hessian...

• Incorporate Nesterov into Adam: Replace parameter update by
$$ \theta_t \leftarrow \theta_{t-1} - \frac{\alpha}{\sqrt{v_t} + \epsilon} \cdot \Big( \beta_1 m_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \Big) $$

Dozat: Incorporating Nesterov Momentum into Adam, ICLR’16

AI & Machine Learning & Neural Nets – Appendix – 51/43