Instructor: Dr. Benjamin Thompson Lecture 17: 17 March 2009


Page 1:

Instructor: Dr. Benjamin Thompson
Lecture 17: 17 March 2009

Page 2:

Announcements

• Project Proposals due one week from today (oh my, how time flies!)
• Midterm one week from Thursday
  • We'll review next Tuesday
  • This is the only exam in this course
  • It is worth 30% of your grade
  • So you'd better start studying now.

Page 3:

Matters of Prebacchanalia

• Neural Network Training as an Optimization Problem
• Drawbacks of Gradient Descent
• Other Optimization Approaches
  • Random Search
  • Particle Swarm Optimization
    • Now with minty-fresh MATLAB Demo!
  • Genetic Algorithms

Page 4:

Spring Broke

• Genetic Algorithms – slightly more detail
• Simulated Annealing

Chapter 5: Radial Basis Function Networks

• Linear inseparability versus nonlinear separability
• Basis functions and interpolation
• Radial Basis Function (Neural) Networks

Page 5:

I ran a genetic algorithm on my previous lecture to optimize the presentation of the material in terms of ease-of-understanding and clarity of implementation. The result was a recipe for Pasta Bolognese, which, while quite tasty, wasn’t very useful for learning purposes. Instead, I give you the following.

Page 6:

• Recall, the genetic algorithm works on the principle of encoding the desired parameters-to-be-optimized into a sequence of genes on which we perform an evolutionary strategy involving two components:
  • Mutation, in which individual genes randomly change their values with some small probability
  • Crossover, in which two parent solution candidates merge their information to create a child solution candidate

Page 7:

The Genome

0 1 0 0 1   1 1 0 0 1   …   1 0 1 1 0

Each of these cells is a single gene.

Each block, then, may be thought of as a chromosome.

Page 8:

Free MATLAB Code To A Good Home!

function y = bindecode(x, minval, maxval)
% Decode a binary string back into a real value on [minval, maxval]
if (~ischar(x))
    error('Input must be a string sequence of "1" and "0" characters')
else
    for k = 1:length(x)
        if (~(strcmp(x(k),'1') || strcmp(x(k),'0')))
            error('Input must be a string sequence of "1" and "0" characters');
        end
    end
    y = bin2dec(x) * 2^(-length(x)) * (maxval - minval) + minval;
end
end

function y = bincode(x, numbits, minval, maxval)
% Encode a scalar on [minval, maxval] as a numbits-character binary string
if (numel(x) > 1)
    error('This function only works for scalar inputs');
elseif (x > maxval || x < minval)
    error('Input to be encoded exceeds the bounds that you have supplied');
else
    xnorm = round((x - minval) / (maxval - minval) * 2^numbits);
    % Clamp so the integer code still fits in numbits bits
    if (xnorm > 2^numbits - 1)
        y = dec2bin(2^numbits - 1, numbits);
    elseif (xnorm < 0)
        y = dec2bin(0, numbits);
    else
        y = dec2bin(xnorm, numbits);
    end
end
end
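As a quick usage sketch (the numbers here are just illustrative, and follow from the encoding above): encoding 0.3 on [-1, 1] with 8 bits and decoding it again should return the original value up to the quantization implied by numbits.

g = bincode(0.3, 8, -1, 1);   % integer code round(166.4) = 166, i.e. the string '10100110'
x = bindecode(g, -1, 1);      % 166/256*2 - 1 = 0.296875, i.e. 0.3 up to quantization error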

Page 9:

Review of the GA In Greater Detail

• 1) For some parameter vector to be optimized, w, choose a maximum value vector wmax and a minimum value vector wmin
• 2) Choose algorithm parameters Np, Nbits, Nepochs, emin, and µ.
• 3) Create the first generation of solution "guesses" by creating Np random vectors on [wmin, wmax]
• 4) For each epoch up to Nepochs, do:
  • (next slide)

Page 10:

• 4.1) For each parent "guess", indexed by p, do:
  • 4.1.1) Evaluate the fitness of that parameter vector
  • 4.1.2) Save the fitness value for this guess, F(p)
• 4.2) If min(F) < emin:
  • 4.2.1) Save the solution corresponding to this error as your final answer
  • 4.2.2) STOP THE ALGORITHM! I WANT TO GET OFF!
• 4.3) Normalize all the fitness values such that the minimum fitness is zero and the maximum fitness is 1.0
• 4.4) While you have fewer than 2Np breeding parents, do (see the MATLAB sketch below):
  • 4.4.1) Select a member from the population at random.
  • 4.4.2) Generate a U[0,1] random number q
  • 4.4.3) If q > F(p), add this member to the breeding population
• 4.5) For each two consecutive parents, do:
  • (next slide)

Flip these inequalities for maximization problems!
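To make steps 4.3–4.4 concrete, here is a minimal MATLAB sketch of the selection loop, assuming F is a 1-by-Np vector of fitness (error) values already computed in step 4.1; the variable names (F, Np, parents) are just illustrative, not part of any required interface.

% Step 4.3: normalize fitness so the minimum is 0 and the maximum is 1
% (assumes the fitness values are not all identical)
Fn = (F - min(F)) / (max(F) - min(F));

% Step 4.4: fill the breeding population (stored as indices into the population)
parents = [];
while (numel(parents) < 2*Np)
    p = randi(Np);            % 4.4.1) pick a member at random
    q = rand;                 % 4.4.2) U[0,1] random number
    if (q > Fn(p))            % 4.4.3) low-error members are more likely to be added
        parents(end+1) = p;   %#ok<AGROW>
    end
end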

Page 11:

• 4.5.1) For each parent vector, do:
  • 4.5.1.1) Convert each element of the vector into a bit string
  • 4.5.1.2) Append each of these bit strings into one big long huge nasty bit string (see the MATLAB sketch after this slide)
• 4.5.2) Perform crossover as described on the next slide to create a child vector
• 4.5.3) For each gene (bit) in the child vector, do:
  • 4.5.3.1) Generate a U[0,1] random number, m
  • 4.5.3.2) If m < µ, flip that bit
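A minimal MATLAB sketch of steps 4.5.1 and 4.5.3, reusing the bincode function from the earlier slide; here w is one parent's weight vector, Nbits/wmin/wmax/mu are the algorithm parameters from steps 1–2, and child is the bit string produced by the crossover on the next slide. The per-element loop and variable names are just one way to do it.

% 4.5.1) Encode a parent vector w into one long genome bit string
genome = '';
for k = 1:length(w)
    genome = [genome, bincode(w(k), Nbits, wmin(k), wmax(k))]; %#ok<AGROW>
end

% 4.5.3) Mutate the child genome: flip each bit with probability mu
for k = 1:length(child)
    if (rand < mu)              % 4.5.3.1 and 4.5.3.2
        if (child(k) == '1')
            child(k) = '0';
        else
            child(k) = '1';
        end
    end
end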

Page 12:

Crossing Over With Dr. Thompson

• For a given mating pair, each chromosome is randomly (fair coin flip) selected from one or the other parent to form a new offspring (see the MATLAB sketch below):

Ma:   0 1 0 0 1   1 1 0 0 1   …   1 0 1 1 0
Pa:   1 1 1 0 0   0 1 1 0 1   …   1 1 1 0 1
Baby: 1 1 1 0 0   1 1 0 0 1   …   1 1 1 0 1
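Here is a minimal MATLAB sketch of this chromosome-level crossover, assuming ma and pa are equal-length genome strings built as on the previous slide and Nbits is the length of one chromosome (gene block); the reshape-based layout is just one convenient way to pick whole chromosomes.

% Split each genome into chromosomes of Nbits bits (one column per chromosome)
nChrom = length(ma) / Nbits;
maC = reshape(ma, Nbits, nChrom);
paC = reshape(pa, Nbits, nChrom);

% Fair coin flip per chromosome: take it from Ma or from Pa
childC = maC;
fromPa = rand(1, nChrom) < 0.5;
childC(:, fromPa) = paC(:, fromPa);

% Flatten back into one long bit string
child = reshape(childC, 1, []);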

Page 13:

GAs for Neural Network Training

• The parameter vector is simply all the weights.
• The fitness function evaluation step (a MATLAB sketch follows below):
  • Run all the training patterns through the neural network as defined by that particular set of weights
  • Add up the error from each pattern in the usual manner:

    E = \sum_{j=1}^{P} \sum_{k=1}^{M} (d_{j,k} - o_{j,k})^2

    where d_{j,k} and o_{j,k} are the desired and actual outputs for output k of pattern j
  • E is the value you save as F(p) in the appropriate step of the algorithm
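A minimal sketch of this fitness evaluation in MATLAB, assuming a hypothetical helper nnforward(w, x) that runs the network defined by weight vector w on one input pattern x; that helper, and the X/D arrays holding the P training inputs and M-element desired outputs, are assumptions for illustration, not code from the lecture.

function E = gafitness(w, X, D)
% X: P-by-Nin matrix of training inputs, D: P-by-M matrix of desired outputs
E = 0;
P = size(X, 1);
for j = 1:P
    o = nnforward(w, X(j, :));        % hypothetical forward pass for this weight vector
    E = E + sum((D(j, :) - o).^2);    % accumulate squared error over the M outputs
end
end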

Page 14:

Some Notes on GA

• The "parent selection" algorithm I presented is one of many (many!) possible approaches
  • Generally, the goal is to tie fitness to the chances for that parent to reproduce
• Note: Population size does not have to remain stable!
• For neural nets, the parameter vector itself can be large, which in turn means each "parent" genome, in terms of bits, will be, in technical terms, frickin' huge.
• Convergence of the GA is typically quite slow
• Performs a quantized search based on the Nbits parameter, so it may not find the optimal solution

Page 15:

Much better than unsimulated annealing when it comes to optimization problems, which can cause burns and heat rash. But you get a cool sword at the end, so that’s cool.

Page 16:

Simulated Annealing

• Based on the metallurgical principle of annealing, wherein metal is slowly heated and cooled to bring the atoms in the metal to their lowest energy state, for improved hardness and strength
• Given a particular solution, new solutions are searched for near that solution
• Better solutions (lower error) are always accepted; worse solutions are accepted with some probability based on an annealing schedule and the overall error difference

Page 17:

Algorithm Details

• Unlike PSO or GA, SA is a single-solution approach:
  • At each iteration, we only make a single guess at the solution, and modify this guess in a (semi-)orderly fashion
  • We adjust this guess randomly on each iteration
• Because there are fewer guesses, convergence is (obviously) much slower than the "parallelized" approaches of PSO and GA

Page 18:

High-Level Algorithm

• Initialize a guess w(0) randomly in the search space
• Evaluate the fitness of w(0), e(0)
• Initialize the best guess wbest as w(0)
• Initialize ebest as e(0)
• For each iteration k, do (a MATLAB sketch follows below):
  • Create a new guess w* = w(k-1) + σ(k)φ, where φ is a random (typically N(0,1)) vector equal in size to w
  • Evaluate the fitness of w*, e(k)
  • If e(k) < emin, quit – w* is your solution.
  • Else if e(k) < ebest, set w(k) = w*, wbest = w*, and ebest = e(k)
  • Else:
    • Generate a U[0,1] random number t
    • If t < T(k), set w(k) = w*
    • Else, set w(k) = w(k-1)
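A minimal MATLAB sketch of this loop, reusing the wmin/wmax bounds, Nepochs, and emin notation from the GA slides and assuming a hypothetical error function safitness(w) (a stand-in for whatever error you are minimizing, e.g. the neural-network error above), a fixed neighborhood size sigma, and an exponentially decaying temperature; worse moves are accepted with probability T(k), matching the notes on the next slide.

w = wmin + (wmax - wmin) .* rand(size(wmin));   % initialize w(0) in the search space
e = safitness(w);                               % hypothetical error function
wbest = w;  ebest = e;

T0 = 0.9;  alpha = 0.99;                        % one possible annealing schedule, T(k) = T0*alpha^k
for k = 1:Nepochs
    wstar = w + sigma * randn(size(w));         % new guess near the current one
    estar = safitness(wstar);
    if (estar < emin)
        wbest = wstar;  break;                  % good enough -- stop
    elseif (estar < ebest)
        w = wstar;  wbest = wstar;  ebest = estar;
    elseif (rand < T0 * alpha^k)                % accept a worse move with probability T(k)
        w = wstar;
    end
    % otherwise keep w(k) = w(k-1), i.e. leave w unchanged
end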

Page 19:

Some Notes on SA

• Two major parameter choices:
  • σ(k) is the neighborhood parameter
    • This defines how far away, on each iteration, we search for possible better solutions
    • In "classical" SA, σ(k) = σ, some fixed value
    • This may be adjusted heuristically to improve performance
  • T(k) is the temperature parameter, a.k.a. the annealing schedule
    • Typically a sequence of numbers gradually decreasing from close-to-unity to zero
    • Once it reaches zero, the algorithm obviously terminates
    • Exponential-decay curves (a^(-bk)) work well.

Page 20:

More Notes on SA

• Given a sufficiently drawn-out annealing schedule, it has been proven that SA will converge to the global optimum
  • "Sufficiently drawn-out" really means "infinitely long", which is really just random search, so that's not exactly what we'd call a "helpful result".
  • I'm just sayin'.

Page 21:

Yes, Virginia, there is another chapter beyond chapter 4.

Page 22:

You say toe-may-toe, I say Solanum lycopersicum

Page 23:

Motivation: Linear Separability

• Recall that the grand failure of Rosenblatt-type perceptrons (and other classification algorithms such as Bayes, MAP, and ML) was that the classes must be linearly separable for them to work optimally
• Wouldn't it be great to be able to somehow transform a linearly inseparable problem into a linearly separable one?
• The key to accomplishing this lies in higher dimensions!

Page 24:

Cover's Theorem

• Thomas M. Cover (1965) discovered that "a complex pattern classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated"
• Several key points to recognize here:
  • "pattern classification problem"
  • "linearly separable"
  • "high-dimensional space"
  • "nonlinearly"

Page 25:

Point by Point

• "Pattern classification problem"
  • For the sake of argument, assume a 2-class problem
  • We are simply trying to determine to which of two classes a given data point (vector) belongs
• "Linearly separable"
  • That is, we can find some set of parameters w for some input feature vector y for which w^T y > b implies one class, and w^T y < b implies the other class (a tiny MATLAB illustration follows below)
  • w defines the hyperplane of separation in a linear sense
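A tiny MATLAB illustration of that decision rule; the separator parameters and feature vector here are purely hypothetical values chosen for the example.

w = [1; -2];  b = 0.5;     % illustrative separator parameters
y = [0.3; -0.4];           % a feature vector to classify
if (w.' * y > b)
    disp('Class 1');       % w^T y > b: one side of the hyperplane
else
    disp('Class 2');       % w^T y < b: the other side
end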

Page 26:

…by Point by Point

• "High-dimensional space"
  • That is, we project the "low-dimensional" input vector x into a higher dimension
  • For linear problems, this would be like drawing a line across the two-dimensional plane, or imagining a plane (two dimensions) projected as a slice cutting through a three-dimensional space
    • The former case is one particular projection of ℜ^1 into ℜ^2, while the latter is one particular projection of ℜ^2 into ℜ^3.
• "Nonlinearly"
  • Rather than the linear projections described above, we take the input vector x and project it into some higher dimension using a set of nonlinear functions {ϕk(x)}, where k is greater than the number of elements in x
  • It should be obvious from the above discussion that linear projections won't impact your linear separability by even a small amount

Page 27:

Simple Example

Page 28:

Example (cont.)

• These two classes are obviously linearly inseparable
• What if we project the coordinate pair (x1, x2) into three dimensions, using the following transformation?

  [x_1, x_2]^T \mapsto [x_1, x_2, x_1^2 + x_2^2]^T

• Results from this in the MATLAB demo! (a sketch is given below)
• The moral: this transformation (projection) into higher dimensions made a linearly inseparable pair linearly separable!
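The lecture's actual MATLAB demo isn't reproduced in this transcript, but here is a minimal sketch of the idea, assuming the classic case of one class lying inside a circle and the other outside it; the radii, sample count, and plotting choices are just illustrative.

N = 200;                                   % points per class (illustrative)
theta = 2*pi*rand(2*N, 1);
r = [0.5*rand(N,1); 1 + 0.5*rand(N,1)];    % class 1 inside radius 0.5, class 2 outside radius 1
x1 = r .* cos(theta);
x2 = r .* sin(theta);

% Nonlinear projection into three dimensions: (x1, x2) -> (x1, x2, x1^2 + x2^2)
z = x1.^2 + x2.^2;

% In the (x1, x2) plane the classes are not linearly separable,
% but in (x1, x2, z) a horizontal plane (e.g. z = 0.5) separates them.
figure;
subplot(1,2,1); plot(x1(1:N), x2(1:N), 'bo', x1(N+1:end), x2(N+1:end), 'rx');
title('Original 2-D data'); xlabel('x_1'); ylabel('x_2');
subplot(1,2,2); plot3(x1(1:N), x2(1:N), z(1:N), 'bo', x1(N+1:end), x2(N+1:end), z(N+1:end), 'rx');
title('Projected 3-D data'); xlabel('x_1'); ylabel('x_2'); zlabel('x_1^2 + x_2^2');
grid on;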

Page 29:

The Math In The Book

• Haykin goes into depth about Cover's Theorem using some (fairly straightforward) probability, but let it suffice to say:
  • The higher the dimension we project our input data into, the more likely it becomes that we will be able to linearly separate these transformed data sets into two classes
  • Turns out this "more likely" is expressed as a binomial distribution, which is good to know.
• We're going to jump around now, but we'll come back to this notion of projections into higher dimensions