Introduction to Machine Learning
Johannes Fürnkranz, TU Darmstadt
Knowledge Engineering Group, Hochschulstrasse 10, D-64289 Darmstadt
06151/166238
juffi@ke.tu-darmstadt.de
Neural Networks
Neurological Foundations · Perceptrons · Multilayer Perceptrons · Backpropagation · Deep Learning
Material from Russell & Norvig, chapters 18.1, 18.2, 20.5 and 21.
Slides based on slides by Russell/Norvig, Ronald Williams, and Torsten Reil.
Machine Learning Problem Definition
Definition (Mitchell 1997)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Given: a task T, a performance measure P, and some experience E with the task.
Goal: generalize the experience in a way that allows you to improve your performance on the task.
Learning to Play Backgammon
Task: play backgammon
Performance Measure: percentage of games won
Experience: previous games played
TD-Gammon: learned a neural network for evaluating backgammon boards by playing millions of games against itself; it successively improved to world-champion strength. http://www.research.ibm.com/massive/tdl.html
GNU Backgammon: http://www.gnu.org/software/gnubg/
Recognizing Spam-Mail
Task: sort E-mails into categories (e.g., Regular / Spam)
Performance Measure: weighted sum of mistakes (letting spam through is not as bad as misclassifying regular e-mail as spam)
Experience: hand-sorted e-mail messages in your folders
In Practice: many spam filters (e.g., Mozilla) use Bayesian learning for recognizing spam mails
Handwritten Character Recognition
Task: Recognize a handwritten character
Performance Measure: Recognition rate
Experience: MNIST handwritten digit database
http://yann.lecun.com/exdb/mnist/
Learning a function from examples
One of the simplest forms of inductive learning
f is the (unknown) target function
An example is a pair (x, f(x))
Problem: find a hypothesis h from a predefined hypothesis space given a training set of examples such that h ≈ f on all examples
i.e. the hypothesis must generalize from the training examples
This is a highly simplified model of real learning: it ignores prior knowledge and assumes that the examples are given.
Performance Measurement
How do we know that h ≈ f? Use theorems of computational/statistical learning theory, or try h on a new test set of examples where f is known
(use same distribution over example space as training set)
Learning curve = % correct on test set over training set size
Pigeons as Art Experts
Famous experiment (Watanabe et al. 1995, 2001)
Pigeon in a Skinner box; present paintings of two different artists (e.g., Chagall / Van Gogh); reward for pecking when presented with a particular artist
Results
Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy when presented with pictures they had been trained on
Discrimination still 85% successful for previously unseen paintings of the artists
Pigeons do not simply memorise the pictures: they can extract and recognise patterns (the 'style') and generalise from what they have already seen to make predictions.
This is what neural networks (biological and artificial) are good at (unlike conventional computers).
The Three Lives of Neural Network Learning
1. Modelling of Neurophysiological Networks (1950s-1960s): individual neurons (perceptrons), simple networks, a basic learning rule; interest was killed by the book by Minsky and Papert that showed the limitations of such networks
2. Parallel Distributed Processing (1990s): the book by Rumelhart & McClelland (1987) rejuvenated interest; intensive research and many successes; popularity vanished with the success of alternative statistical learning algorithms (e.g., support vector machines)
3. Deep Learning (2010s): the discovery that many known algorithms perform much better when trained on sufficiently big databases (which was previously not possible)
A Biological Neuron
Neurons are connected to each other via synapses. If a neuron is activated, it spreads its activation to all connected neurons.
An Artificial Neuron
Neurons correspond to nodes or units
A link from unit j to unit i propagates the activation $a_j$ from j to i.
The weight $W_{j,i}$ of the link determines the strength and sign of the connection.
The total input activation is the weighted sum of the input activations; the output activation is determined by the activation function g.
(McCulloch & Pitts, 1943)

$a_i = g(\mathrm{in}_i) = g\!\left(\sum_{j=0}^{n} W_{j,i}\, a_j\right)$
Perceptron
A single node connecting n input signals $a_j$ with one output signal $a$; typically, the signals are −1 or +1.
Activation function: a simple threshold function.
Thus it implements a linear separator, i.e., a hyperplane that divides n-dimensional space into a region with output −1 and a region with output +1:

$a = \begin{cases} -1 & \text{if } \sum_{j=0}^{n} W_j\, a_j \le 0 \\ +1 & \text{if } \sum_{j=0}^{n} W_j\, a_j > 0 \end{cases}$
(Rosenblatt 1957, 1960)
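To make the definition concrete, here is a minimal NumPy sketch of such a thresholded unit; following the convention of the worked learning example later in these slides, $a_0 = -1$ serves as a fixed bias input, and the weight values are only illustrative.

```python
# Minimal sketch of a perceptron as a thresholded linear separator (NumPy).
import numpy as np

def perceptron_output(weights, inputs):
    """Return +1 if the weighted input sum is positive, otherwise -1."""
    total_input = np.dot(weights, inputs)      # in = sum_j W_j * a_j
    return 1 if total_input > 0 else -1

weights = np.array([0.2, 0.5, 0.8])            # W_0, W_1, W_2
inputs = np.array([-1.0, 1.0, 1.0])            # a_0 = -1 is the fixed bias input
print(perceptron_output(weights, inputs))      # in = 1.1 > 0, so the output is +1
```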
Perceptrons and Boolean Functions
A perceptron can implement all elementary logical functions.
More complex functions like XOR cannot be modeled: no linear separation is possible.
(McCulloch & Pitts, 1943)
(Minsky & Papert, 1969)
[Figure: example weight settings for perceptrons implementing elementary logical functions (e.g., $W_0 = 0$, $W_1 = -1$; $W_0 = -0.5$), and an illustration that the four points of the XOR truth table cannot be separated by a single line.]
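As a small sketch of the point above (the weight values are assumptions for illustration): a single thresholded unit with a suitable weight vector computes AND on −1/+1 signals, but a brute-force search over a grid of weight vectors finds none that computes XOR, because the four XOR points are not linearly separable.

```python
# AND is linearly separable, XOR is not: check with a single thresholded unit (NumPy).
import itertools
import numpy as np

def output(w, x1, x2):
    return 1 if np.dot(w, [-1.0, x1, x2]) > 0 else -1      # a_0 = -1 bias input

points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
AND = {p: (1 if p == (1, 1) else -1) for p in points}
XOR = {p: (1 if p[0] != p[1] else -1) for p in points}

w_and = np.array([1.5, 1.0, 1.0])                          # -1*1.5 + x1 + x2 > 0 only for (1, 1)
print(all(output(w_and, *p) == AND[p] for p in points))    # True: AND is realised

grid = np.linspace(-2.0, 2.0, 17)
found = any(all(output(np.array(w), *p) == XOR[p] for p in points)
            for w in itertools.product(grid, repeat=3))
print(found)                                                # False: no weights realise XOR
```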
Perceptron Learning
The task of perceptron learning is to find the weights w such that, given an input x, the perceptron outputs the right prediction f(x).
This should hold not only for a single data point <x, f(x)>, and not only for all training data points <x_i, f(x_i)>, but for all possible data points (as well as possible).
Key idea of the learning algorithm: start with a random assignment of the weights, compare the output to the desired output, and adjust the weights so that the result is closer to the desired output.

$f(x) = \mathbf{w} \cdot \mathbf{x}$
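A minimal sketch of this learning procedure in NumPy, assuming training examples are given as pairs of an input vector (already containing the bias input) and a −1/+1 target:

```python
# Perceptron learning rule: W_j <- W_j + alpha * (f(x) - h(x)) * x_j  (NumPy sketch).
import numpy as np

def train_perceptron(examples, alpha=0.5, epochs=10):
    """examples: list of (x, target) pairs; x already contains the bias input."""
    w = np.random.uniform(-1.0, 1.0, size=len(examples[0][0]))   # random initial weights
    for _ in range(epochs):
        for x, target in examples:
            h = 1 if np.dot(w, x) > 0 else -1                    # current output h(x)
            w = w + alpha * (target - h) * x                     # move towards the target
    return w
```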
Perceptron Learning
Perceptron Learning Rule for Supervised Learning
Training Example (1,1) → -1
$W_j \leftarrow W_j + \alpha \cdot (f(x) - h(x)) \cdot x_j$   ($\alpha$ is the learning rate, $f(x) - h(x)$ is the error)

Network before the update: weights $W_0 = 0.2$, $W_1 = 0.5$, $W_2 = 0.8$; input signals $a_0 = -1$ (bias), $a_1 = 1$, $a_2 = 1$.

Computation of the output signal h(x): $\mathrm{in}(x) = -1 \cdot 0.2 + 1 \cdot 0.5 + 1 \cdot 0.8 = 1.1$, so $h(x) = 1$ because $\mathrm{in}(x) > 0$ (activation function).

Target value $f(x) = -1$ (and $\alpha = 0.5$):
$W_0 \leftarrow 0.2 + 0.5 \cdot (-1 - 1) \cdot (-1) = 0.2 + 1 = 1.2$
$W_1 \leftarrow 0.5 + 0.5 \cdot (-1 - 1) \cdot 1 = 0.5 - 1 = -0.5$
$W_2 \leftarrow 0.8 + 0.5 \cdot (-1 - 1) \cdot 1 = 0.8 - 1 = -0.2$
Measuring the Error of a Network
The error for one training example x can be measured by the squared error: the squared difference between the output value h(x) and the desired target value f(x).
For evaluating the performance of a network, we can try the network on a set of data points and aggregate the values (= sum of squared errors).
$E(x) = \frac{1}{2}\,\mathrm{Err}^2 = \frac{1}{2}\,(f(x) - h(x))^2 = \frac{1}{2}\left(f(x) - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right)^2$

$E(\mathrm{Network}) = \sum_{i=1}^{N} E(x_i)$
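A short sketch of these two error measures, assuming `network` is a function that maps an input x to the output h(x):

```python
# Squared error of one example and summed error over a data set (sketch).
def example_error(network, x, target):
    return 0.5 * (target - network(x)) ** 2                       # E(x) = 1/2 (f(x) - h(x))^2

def network_error(network, dataset):
    return sum(example_error(network, x, t) for x, t in dataset)  # E(Network) = sum_i E(x_i)
```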
Error Landscape
The error function for one training example may be considered as a function in a multi-dimensional weight space
The best weight setting for one example is where the error measure for this example is minimal
$E(W) = \frac{1}{2}\left(f(x) - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right)^2$
Error Minimization via Gradient Descent
In order to find the point with the minimal error: go downhill in the direction where it is steepest
… but make small steps, or you might overshoot the target
$E(W) = \frac{1}{2}\left(f(x) - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right)^2$
Error Minimization
It is easy to derive a perceptron training algorithm that minimizes the squared error
Change weights into the direction of the steepest descent of the error function
To compute this, we need a continuous and differentiable activation function g!
Weight update with learning rate α: positive error → increase network output
increase weights of nodes with positive input
decrease weights of nodes with negative input
$E = \frac{1}{2}\,\mathrm{Err}^2 = \frac{1}{2}\,(f(x) - h(x))^2 = \frac{1}{2}\left(f(x) - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right)^2$

$\frac{\partial E}{\partial W_j} = \mathrm{Err} \cdot \frac{\partial\,\mathrm{Err}}{\partial W_j} = \mathrm{Err} \cdot \frac{\partial}{\partial W_j}\left(f(x) - g\!\left(\sum_{k=0}^{n} W_k\, x_k\right)\right) = -\mathrm{Err} \cdot g'(\mathrm{in}) \cdot x_j$

$W_j \leftarrow W_j + \alpha \cdot \mathrm{Err} \cdot g'(\mathrm{in}) \cdot x_j$
Threshold Activation Function
The regular threshold activation function is problematic
g'(x) = 0, therefore
→ no weight changes!
$g(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases} \qquad g'(x) = 0$

$\frac{\partial E}{\partial W_{j,i}} = -\mathrm{Err} \cdot g'(\mathrm{in}_i) \cdot x_j = 0$
Sigmoid Activation Function
A commonly used activation function is the sigmoid function: similar to the threshold function, easy to differentiate, and non-linear.
$g(x) = \frac{1}{1 + e^{-x}} \qquad g'(x) = g(x)\,(1 - g(x))$

(in contrast to the threshold function, where $g'(x) = 0$)
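A small sketch combining the sigmoid with the gradient-descent weight update $W_j \leftarrow W_j + \alpha \cdot \mathrm{Err} \cdot g'(\mathrm{in}) \cdot x_j$ derived two slides earlier (NumPy; the setup is an assumption):

```python
# Sigmoid activation, its derivative, and one gradient-descent weight update (NumPy).
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))             # sigmoid

def g_prime(x):
    return g(x) * (1.0 - g(x))                  # derivative: g(x) * (1 - g(x))

def gradient_step(w, x, target, alpha=0.1):
    in_x = np.dot(w, x)                         # total input in
    err = target - g(in_x)                      # Err = f(x) - h(x)
    return w + alpha * err * g_prime(in_x) * x  # W_j <- W_j + alpha * Err * g'(in) * x_j
```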
Multilayer Perceptrons
Perceptrons may have multiple output nodes; such a network may be viewed as multiple parallel perceptrons.
The output nodes may be combined with another perceptron which may also have multiple output nodes
The size of this hidden layer is determined manually
Multilayer Perceptrons
• Information flow is unidirectional
• Data is presented to Input layer
• Passed on to Hidden Layer
• Passed on to Output layer
• Information is distributed
• Information processing is parallel
Expressiveness of MLPs
Every continuous function can be modeled with three layers, i.e., with one hidden layer.
Every function can be modeled with four layers, i.e., with two hidden layers.
However, these results are more of theoretical interest: the hidden layer is assumed to be arbitrarily large.
Learning with more layers may be much more practical, e.g., with fewer weights in total.
Learning in Multilayer Perceptron
The nodes in the output layer can be trained as usual: compare the actual output with the desired output and adjust the weights so that the two get closer to each other.
$\Delta_i$ is the error of output node i multiplied by the derivative of the activation function at its input:

$W_{j,i} \leftarrow W_{j,i} + \alpha \cdot \mathrm{Err}_i \cdot g'(\mathrm{in}_i) \cdot x_j = W_{j,i} + \alpha \cdot \Delta_i \cdot x_j$
Learning in Multilayer Perceptrons
But how to train interior nodes? There is no target value to compare to, and there might be multiple successor nodes for which a target value is available.
Key idea: Backpropagation
The error term $\Delta_i$ of the output layer is propagated back to the hidden layer.
The training signal of hidden-layer node j is the weighted sum of the errors of the output nodes:
$\Delta_j = \left(\sum_i W_{j,i} \cdot \Delta_i\right) \cdot g'(\mathrm{in}_j)$

$W_{k,j} \leftarrow W_{k,j} + \alpha \cdot \Delta_j \cdot x_k$
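A compact sketch of one backpropagation step for a network with a single hidden layer of sigmoid units, using the Δ terms defined above (NumPy; the matrix names and shapes are assumptions):

```python
# One backpropagation step for an input layer, one hidden layer, and an output layer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_in_hidden, W_hidden_out, x, target, alpha=0.1):
    # forward pass
    a_hidden = sigmoid(W_in_hidden @ x)                    # hidden activations
    a_out = sigmoid(W_hidden_out @ a_hidden)               # network output

    # Delta of the output layer, then propagated back to the hidden layer
    delta_out = (target - a_out) * a_out * (1.0 - a_out)                  # Err_i * g'(in_i)
    delta_hidden = (W_hidden_out.T @ delta_out) * a_hidden * (1.0 - a_hidden)

    # weight updates
    W_hidden_out = W_hidden_out + alpha * np.outer(delta_out, a_hidden)   # W_ji + alpha*Delta_i*a_j
    W_in_hidden = W_in_hidden + alpha * np.outer(delta_hidden, x)         # W_kj + alpha*Delta_j*x_k
    return W_in_hidden, W_hidden_out
```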
Minimizing the Network Error
The error landscape for the entire network may be thought of as the sum of the error functions of all examples; it will have many local minima → hard to find the global minimum.
Minimizing the error for one training example may destroy what has been learned for other examples: a good location in weight space for one example may be a bad location for other examples.
Training procedure: try all examples in turn, making small adjustments for each example; repeat until convergence.
One epoch = one iteration through all examples.
Overfitting
Given a fairly general model class with enough degrees of freedom, you can always find a model that perfectly fits the data, even if the data contains errors (noise in the data).
Such concepts do not generalize well! → Overfitting Avoidance / Pruning / Regularization
Overfitting - Illustration
[Figure: the same data points fitted with a polynomial of degree 1 (a linear function) and a polynomial of degree 4 (a polynomial of degree n−1 can always fit n points exactly). Asked for a prediction at a new value of x, the two models give very different answers.]
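The effect can be reproduced with NumPy's polynomial fitting; the data values below are made up for the illustration:

```python
# A degree-4 polynomial fits 5 noisy points exactly, a line does not,
# and the two extrapolate very differently.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.2, 1.9, 3.2, 3.8])       # roughly linear data with some noise

linear = np.polyfit(x, y, 1)                  # degree 1: small errors on the training points
quartic = np.polyfit(x, y, 4)                 # degree 4: passes exactly through all 5 points

x_new = 6.0                                   # prediction outside the training range
print(np.polyval(linear, x_new))              # follows the linear trend
print(np.polyval(quartic, x_new))             # can be far off: the model has overfitted
```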
Overfitting Avoidance
A perfect fit to the data is not always a good idea: the data could be imprecise (e.g., random noise), or the hypothesis space may be inadequate, so that a perfect fit might not even be possible, or it may be possible but with bad generalization properties (e.g., generating one rule for each training example).
Thus it is often a good idea to avoid a perfect fit of the data: fitting polynomials so that not all points lie exactly on the curve, or learning concepts so that not all positive examples have to be covered by the theory and some negative examples may be covered by the theory.
Overfitting
Training set error continues to decrease with an increasing number of training examples / number of epochs (an epoch is a complete pass through all training examples).
Test set error will start to increase because of overfitting.
Simple training protocol: keep a separate validation set to watch the performance (the validation set is different from the training and test sets!) and stop training when the error on the validation set starts to go up.
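A minimal sketch of this protocol; `train_one_epoch` and `error_on` are assumed, user-supplied functions:

```python
# Early stopping: keep training while the validation error improves.
def early_stopping(weights, train_one_epoch, error_on,
                   training_set, validation_set, patience=5):
    best_error, best_weights, bad_epochs = float("inf"), weights, 0
    while bad_epochs < patience:
        weights = train_one_epoch(weights, training_set)    # one pass over all examples
        val_error = error_on(weights, validation_set)
        if val_error < best_error:                          # validation error still improving
            best_error, best_weights, bad_epochs = val_error, weights, 0
        else:                                                # validation error starts to rise
            bad_epochs += 1
    return best_weights                                      # evaluate these on the test set
```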
Regularization
In order to avoid overfitting, add a regularization term to the empirical error
typical regularization terms penalize complexity of the concept
There are many other techniques for fighting overfitting, e.g., Dropout, which randomly "drops" a subset of the hidden units during each training iteration (one example).
$R_L(x) = L(f, h) + \lambda \cdot R(h) \qquad \text{e.g. } R(h) = \lVert \mathbf{w} \rVert^2$
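A short sketch of this regularized error for a squared-error loss and the penalty $R(h) = \lVert \mathbf{w} \rVert^2$ (NumPy; the function and parameter names are assumptions):

```python
# Regularized error: empirical loss plus lambda * ||w||^2.
import numpy as np

def regularized_loss(predictions, targets, weights, lam=0.01):
    empirical = 0.5 * np.sum((targets - predictions) ** 2)   # L(f, h): sum of squared errors
    penalty = lam * np.sum(weights ** 2)                      # lambda * R(h), R(h) = ||w||^2
    return empirical + penalty
```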
Deep Learning
In recent years, great success has been observed with training "deep" neural networks. Deep networks are networks with multiple layers.
Successes in particular in image classification: the idea is that the layers sequentially extract information from the image (1st layer → edges, 2nd layer → corners, etc.).
Key ingredients: a lot of training data are needed and available (big data); fast processing and a few new tricks made fast training on big data possible; unsupervised pre-training of layers (autoencoders use the previous layer as input and output for the next layer).
Convolutional Neural Networks
Convolution: for each pixel of an image, a new feature is computed using a weighted combination of its n×n neighborhood.
[Figure: a 3×3 convolution runs over all possible 3×3 sub-images of a 5×5 image; only one pixel of the resulting image is shown.]
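A plain-NumPy sketch of this operation: each output pixel is a weighted combination of the corresponding 3×3 neighbourhood of the input image (the averaging kernel is an arbitrary example):

```python
# "Valid" 2-D convolution: run a 3x3 kernel over all possible 3x3 sub-images.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)   # weighted neighbourhood
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 "image", as in the slide
kernel = np.ones((3, 3)) / 9.0                     # example kernel: 3x3 averaging
print(convolve2d(image, kernel).shape)             # (3, 3): one value per 3x3 sub-image
```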
Convolution - Edge detection
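One classic use of convolution kernels is edge detection. As a sketch (re-using the `convolve2d` function from the previous block, with a made-up test image), a Sobel-like kernel responds strongly where pixel intensities change from one row to the next:

```python
# Edge detection with a horizontal Sobel-like kernel.
import numpy as np

sobel_y = np.array([[-1.0, -2.0, -1.0],
                    [ 0.0,  0.0,  0.0],
                    [ 1.0,  2.0,  1.0]])

image = np.zeros((5, 5))
image[3:, :] = 1.0                        # dark upper part, bright lower part
print(convolve2d(image, sobel_y))         # strong response where rows change from dark to bright
```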
Image Processing Networks
Convolutions can be encoded as network layers: all possible 3×3 pixel regions of the input image are connected to the corresponding pixel in the next layer.
Convolutional layers are at the heart of image recognition; several are stacked on top of each other and placed in parallel to each other.
Example: LeNet (LeCun et al. 1989)
GoogLeNet is a modern variant of this architecture
Adversarial Examples and Networks
Deep learning networks are often very brittle (Goodfellow et al.): one can construct imperceptible modifications of the input that change the output.
Generative Adversarial Networks try to stabilize them by integrating a "judge" that discriminates between real and generated images.
[Figure: a panda image plus an imperceptible perturbation is classified as a gibbon.]
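One well-known construction of such perturbations is the fast gradient sign method of Goodfellow et al.: move every pixel a tiny step in the direction that increases the network's loss. A hedged sketch, where `grad_loss_wrt_input` is an assumed, model-specific function:

```python
# Fast gradient sign method: an imperceptible perturbation that can change the output.
import numpy as np

def fgsm_perturbation(image, grad_loss_wrt_input, epsilon=0.01):
    gradient = grad_loss_wrt_input(image)               # dLoss/dPixel for the true label
    adversarial = image + epsilon * np.sign(gradient)   # tiny step that increases the loss
    return np.clip(adversarial, 0.0, 1.0)               # keep pixel values in a valid range
```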
Neural Artistic Style Transfer (Gatys et al., 2016)
[Figure: style image + content image → synthesized image]
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) allow processing of sequential data by feeding the output of the network back into the next input.
Long Short-Term Memory (LSTM) adds "forgetting" to RNNs; good for mapping sequential input data into sequential output data, e.g., text to text, or time series to time series.
Deep Learning often allows "end-to-end learning", e.g., learning a network that does the complete translation of text in one language into another language; previously, learning often concentrated on individual components (e.g., word sense disambiguation).
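A minimal sketch of the recurrence behind such networks (a plain vanilla RNN cell in NumPy; the weight matrix names and the tanh activation are assumptions): the hidden state computed for one element of the sequence is fed back in together with the next element.

```python
# Vanilla RNN forward pass over a sequence: the hidden state is fed back at every step.
import numpy as np

def rnn_forward(x_sequence, W_xh, W_hh, b_h):
    h = np.zeros(W_hh.shape[0])                   # initial hidden state
    outputs = []
    for x_t in x_sequence:                        # process the sequence step by step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # depends on the input AND the old state
        outputs.append(h)
    return outputs
```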
Wide Variety of Applications
Speech recognition, autonomous driving, handwritten digit recognition, credit approval, backgammon, etc.
Good for problems where the final output depends on combinations of many input features; rule learning is better when only a few features are relevant.
Bad if explicit representations of the learned concept are needed: it takes some effort to interpret the concepts that form in the hidden layers.
Neural Networks – Summary
Neural networks were modeled after human neurophysiology.
Learning implements error minimization in weight space
Backpropagation propagates the error on the output back to the interior nodes in the hidden layer
Different types of network architectures for various tasks
convolutional networks for images
recurrent networks and LSTM for sequences