genetic algorithms (evolutionary computing) genetic algorithms are used to try to “evolve” the...

Genetic Algorithms (Evolutionary Computing)

Genetic algorithms are used to try to “evolve” the solution to a problem
Generate prototype solutions called chromosomes (individuals)
Backpack problem as example:
  http://home.ksp.or.jp/csd/english/ga/gatrial/Ch9_A2_4.html
All individuals form the population
Generate new individuals by reproduction
Use a fitness function to evaluate individuals
Survival of the fittest: population has a fixed size
Individuals with higher fitness are more likely to reproduce
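The loop described above can be sketched roughly as follows. This is a minimal illustration, not the method from the linked page: the item list, the capacity, the population size, the `+ 1` in the selection weights and the truncation step are all assumptions made for the example.

```python
import random

# Hypothetical backpack instance: (weight, value) per item, plus a weight limit.
ITEMS = [(12, 4), (2, 2), (1, 1), (4, 10), (1, 2)]
CAPACITY = 15

def fitness(chromosome):
    """Total value of the packed items, or 0 if the pack is over the limit."""
    weight = sum(w for (w, _), gene in zip(ITEMS, chromosome) if gene)
    value = sum(v for (_, v), gene in zip(ITEMS, chromosome) if gene)
    return value if weight <= CAPACITY else 0

def select_parent(population, rng):
    """Pick one individual with probability proportional to fitness + 1."""
    weights = [fitness(c) + 1 for c in population]
    return rng.choices(population, weights=weights, k=1)[0]

def next_generation(population, rng):
    """Add one child per individual, then cut back to the original size."""
    children = []
    for _ in range(len(population)):
        parent = select_parent(population, rng)
        gene = rng.randrange(len(parent))
        # The child is a copy of the parent with one gene flipped.
        child = parent[:gene] + [1 - parent[gene]] + parent[gene + 1:]
        children.append(child)
    ranked = sorted(population + children, key=fitness, reverse=True)
    return ranked[:len(population)]

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in ITEMS] for _ in range(8)]
for _ in range(20):
    population = next_generation(population, rng)
best = max(population, key=fitness)
print(best, fitness(best))
```

Each chromosome here is a list of 0/1 genes, one per item. Keeping only the top-ranked individuals is one simple way to hold the population size fixed; other survival rules would fit the slide equally well.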

Reproduction Methods
Mutation
  Alter a single gene in the chromosome randomly to create a new chromosome
Cross-over
  Pick a random location within the chromosome
  New chromosome receives first set of genes from parent 1, second set from parent 2
Inversion
  Reverse the chromosome
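A small sketch of the three reproduction methods, assuming chromosomes are lists of 0/1 genes (the slides do not fix a representation, so this encoding and the function names are illustrative only):

```python
import random

def mutate(chromosome, rng):
    """Return a copy with one randomly chosen gene flipped."""
    i = rng.randrange(len(chromosome))
    return chromosome[:i] + [1 - chromosome[i]] + chromosome[i + 1:]

def crossover(parent1, parent2, rng):
    """First genes from parent 1, remaining genes from parent 2."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]

def invert(chromosome):
    """Return the chromosome reversed."""
    return chromosome[::-1]

rng = random.Random(1)
p1 = [1, 1, 1, 1, 1, 1]
p2 = [0, 0, 0, 0, 0, 0]
print(mutate(p1, rng))
print(crossover(p1, p2, rng))
print(invert([1, 1, 0, 1, 0, 0]))
```

With an all-ones and an all-zeros parent, the crossover child makes the cut point easy to see: it is a run of ones followed by a run of zeros.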

Interpretation
Genetic algorithms try to solve a hill climbing problem
Method is parallelizable
The trick is in how you represent the chromosome
Tries to avoid local maxima by keeping many chromosomes at a time

Another Example: Traveling Sales Rep Problem
How to represent a chromosome?
What effects does this have on crossover and mutation?

TSP
Chromosome: ordering of city numbers
  (1 9 2 4 6 5 7 8 3)
What can go wrong with crossover?
To fix, use order crossover technique
Take two chromosomes, and take two random locations to cut
  p1 = (1 9 2 | 4 6 5 7 | 8 3)
  p2 = (4 5 9 | 1 8 7 6 | 2 3)
Goal: preserve as much as possible of the orderings in the chromosomes

Order Crossover
  p1 = (1 9 2 | 4 6 5 7 | 8 3)
  p2 = (4 5 9 | 1 8 7 6 | 2 3)
The child built from p1 keeps p1's middle section:
  c1 = (x x x | 4 6 5 7 | x x)
To fill in c1, first produce an ordered list of cities from p2, starting after the second cut and wrapping around, eliminating cities already in c1
  2 3 9 1 8
Drop them into the open slots of c1 in order, left to right
  c1 = (2 3 9 4 6 5 7 1 8)
Do similarly in reverse to obtain
  c2 = (3 9 2 1 8 7 6 4 5)
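The order crossover steps can be written as a short function. This sketch follows the variant used in the slide example, where the leftover cities are dropped into the open slots from left to right; some presentations of order crossover start filling after the second cut instead. The function name and the explicit cut arguments are assumptions made for illustration.

```python
def order_crossover(keep, other, cut1, cut2):
    """Child keeps keep[cut1:cut2]; the other slots are filled from `other`."""
    middle = keep[cut1:cut2]
    kept = set(middle)
    # Read `other` starting after the second cut, wrapping around to the front.
    rotated = other[cut2:] + other[:cut2]
    # Eliminate cities that the child already has in its middle section.
    fill = [city for city in rotated if city not in kept]
    # Drop the remaining cities into the open slots from left to right.
    return fill[:cut1] + middle + fill[cut1:]

p1 = [1, 9, 2, 4, 6, 5, 7, 8, 3]
p2 = [4, 5, 9, 1, 8, 7, 6, 2, 3]
c1 = order_crossover(p1, p2, 3, 7)
c2 = order_crossover(p2, p1, 3, 7)
print(c1)  # [2, 3, 9, 4, 6, 5, 7, 1, 8]
print(c2)  # [3, 9, 2, 1, 8, 7, 6, 4, 5]
```

Both children contain every city exactly once, which is the property plain crossover fails to guarantee for orderings.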

Mutation & Inversion
What can go wrong with mutation?
What is wrong with inversion?

Mutation & Inversion
Redefine mutation as picking two random spots in the path and swapping them
  p1 = (1 9 2 4 6 5 7 8 3)
  c1 = (1 9 8 4 6 5 7 2 3)
Redefine inversion as picking a random middle section and reversing it:
  p1 = (1 9 2 | 4 6 5 7 8 | 3)
  c1 = (1 9 2 | 8 7 5 6 4 | 3)
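A sketch of the redefined operators for paths. The positions are passed in explicitly so that the slide examples can be reproduced; `random_swap` is a hypothetical helper showing how the spots might be chosen at random.

```python
import random

def swap_mutation(path, i, j):
    """Return a copy of the path with the cities at positions i and j swapped."""
    child = list(path)
    child[i], child[j] = child[j], child[i]
    return child

def invert_section(path, start, end):
    """Return a copy of the path with path[start:end] reversed."""
    return path[:start] + path[start:end][::-1] + path[end:]

def random_swap(path, rng):
    """Swap two distinct, randomly chosen positions."""
    i, j = rng.sample(range(len(path)), 2)
    return swap_mutation(path, i, j)

p1 = [1, 9, 2, 4, 6, 5, 7, 8, 3]
print(swap_mutation(p1, 2, 7))   # [1, 9, 8, 4, 6, 5, 7, 2, 3]
print(invert_section(p1, 3, 8))  # [1, 9, 2, 8, 7, 5, 6, 4, 3]
print(random_swap(p1, random.Random(0)))
```

Both operators only rearrange cities, so the child is always a valid ordering.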

Another example: http://home.online.no/~bergar/mazega.htm

Reinforcement Learning
Game playing: So far, we have told the agent the value of a given board position.
How can an agent learn which board positions are important?
Play a whole bunch of games, and receive reward at end (+ or -)
How do you determine utility of states that aren’t ending states?

The setup:
Possible game states
Terminal states have reward
Mission: Estimate utility of all possible game states

Passive Learning
Agent learns by “watching”
Fixed probability of moving from one state to another

Sample Results

Technique #1: Naive Updating
Also known as Least Mean Squares (LMS) approach
Starting at home, obtain sequence of states to terminal state
Utility of terminal state = reward
Loop back over all other states
  utility for state i = running average of all rewards seen for state i
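Naive updating can be sketched as a running average per state. The state names and the three games below are made up for illustration, and this version counts every visit to a state within a game.

```python
from collections import defaultdict

def naive_update(utilities, visits, sequence, reward):
    """Fold one finished game into the running average of every state visited."""
    for state in sequence:
        visits[state] += 1
        # Incremental form of "average of all rewards seen for this state".
        utilities[state] += (reward - utilities[state]) / visits[state]

utilities = defaultdict(float)
visits = defaultdict(int)

# Hypothetical observed games: (sequence of states ending in a terminal state, reward).
games = [
    (["home", "a", "b", "win"], +1),
    (["home", "a", "c", "lose"], -1),
    (["home", "a", "b", "win"], +1),
]
for sequence, reward in games:
    naive_update(utilities, visits, sequence, reward)
print(dict(utilities))
```

Note that the update for a state never looks at the utilities of the states that follow it; only the final reward of the game is used.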

Naive Updating Analysis
Minimizes mean square error with respect to seen data
Works, but converges slowly
  Must play lots of games
Ignores that utility of a state should depend on successor

Technique #2: Adaptive Dynamic Programming
Utility of a state depends entirely on the successor state
If a state has one successor, utility should be the same
If a state has multiple successors, utility should be expected value of successors

$$U(\text{terminal}) = \text{reward}(\text{terminal})$$
$$U(i) = \sum_{j \in \text{successors}(i)} M_{ij}\, U(j)$$
$$M_{ij} = P(\text{transition from } i \text{ to } j)$$

Finding the utilities
To find all utilities, just solve equations
This is done via dynamic programming
“Gold standard”: this gets you the right values instantly, no convergence or iteration
Completely intractable for large problems:
  For a real game, it means finding actual utilities of all states
Assumes that you know $M_{ij}$

$$U(i) = \sum_{j \in \text{successors}(i)} M_{ij}\, U(j)$$
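For a tiny problem, "just solve equations" can be done literally, because the utility equations are linear in the unknown utilities. The sketch below uses plain Gauss-Jordan elimination as the solver; the three-state model is hypothetical, and the code assumes every nonterminal state can eventually reach a terminal state (otherwise the system has no unique solution).

```python
def solve_utilities(transitions, rewards):
    """Solve U(i) = sum_j M_ij * U(j) for nonterminal i, with U(terminal) = reward."""
    states = list(transitions)                  # nonterminal states
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    # Augmented matrix: row i encodes
    #   U(i) - sum over nonterminal j of M_ij U(j) = sum over terminal j of M_ij reward(j)
    rows = [[0.0] * (n + 1) for _ in range(n)]
    for s, successors in transitions.items():
        r = index[s]
        rows[r][r] += 1.0
        for t, prob in successors.items():
            if t in rewards:
                rows[r][n] += prob * rewards[t]
            else:
                rows[r][index[t]] -= prob
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    utilities = dict(rewards)
    utilities.update({s: rows[index[s]][n] for s in states})
    return utilities

# Hypothetical model: M[i][j] = probability of moving from i to j.
M = {
    "home": {"a": 0.5, "b": 0.5},
    "a": {"win": 0.8, "home": 0.2},
    "b": {"lose": 0.5, "a": 0.5},
}
R = {"win": 1.0, "lose": -1.0}
U = solve_utilities(M, R)
print(U)
```

The cost of elimination grows roughly with the cube of the number of states, which is one way to see why this is out of reach for a real game.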

Technique #3: Temporal Difference Learning

Want utility to depend on successors, but want to solve iteratively

Whenever you observe a transition from i to j:

$$U(\text{terminal}) = \text{reward}(\text{terminal})$$
$$U(i) \leftarrow U(i) + \alpha\,\bigl(U(j) - U(i)\bigr)$$

$\alpha$ = learning rate
difference between successive states = temporal difference
Converges faster than naive updating
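The temporal difference update is a one-liner in code. The states, the learning rate of 0.5 and the repeated transition list are illustrative assumptions.

```python
def td_update(U, i, j, alpha=0.5):
    """Move U[i] a fraction alpha of the way toward U[j] after seeing i -> j."""
    U[i] += alpha * (U[j] - U[i])

# Terminal utility is fixed to its reward; other states start at 0.
U = {"home": 0.0, "a": 0.0, "win": 1.0}

# Hypothetical observed transitions, replayed a few times.
for _ in range(3):
    for i, j in [("home", "a"), ("a", "win")]:
        td_update(U, i, j)
print(U)
```

Each update touches only the state just left, so the reward information spreads backward one transition at a time as more games are observed.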

Passive Learning in Unknown Environment

Unknown environment = transition probabilities unknown

Only affects technique 2, Adaptive Dynamic Programming

Iteratively:
  Estimate transition probabilities based on what you’ve seen
  Solve dynamic programming problem with best estimates so far
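One simple way to estimate the transition probabilities is by counting observed transitions. The sketch below assumes the observations are stored as lists of state names; the four sequences are made up for illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Estimate M_ij as (times i was followed by j) / (times i was left)."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for i, j in zip(sequence, sequence[1:]):
            counts[i][j] += 1
    return {
        i: {j: c / sum(followers.values()) for j, c in followers.items()}
        for i, followers in counts.items()
    }

# Hypothetical observed games.
observed = [
    ["home", "a", "win"],
    ["home", "b", "lose"],
    ["home", "a", "lose"],
    ["home", "a", "win"],
]
M = estimate_transitions(observed)
print(M)
```

The estimates are rerun as more games come in, and the dynamic programming step is then solved again with the updated table.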

Active Learning in an Unknown Environment

Probability of going from one state to another now depends on action

ADP equations are now:

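$$U(\text{terminal}) = \text{reward}(\text{terminal})$$
$$U(i) = \max_{a} \sum_{j \in \text{successors}(i)} M^{a}_{ij}\, U(j)$$
$$M^{a}_{ij} = P(\text{transition from } i \text{ to } j \text{ under action } a)$$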
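With actions in the picture, the right-hand side of the utility equation takes a maximum over actions. The sketch below applies that equation repeatedly until the values settle, which is an iterative stand-in for solving the equations exactly. The one-state model, the action names and the number of rounds are assumptions made for the example.

```python
def backup(U, model, state):
    """Return (value, action) for the action with the highest expected utility."""
    return max(
        (sum(prob * U[j] for j, prob in successors.items()), action)
        for action, successors in model[state].items()
    )

def sweep(U, model, rounds=100):
    """Repeatedly apply the max-over-actions equation to every nonterminal state."""
    for _ in range(rounds):
        for state in model:
            U[state] = backup(U, model, state)[0]
    return U

# Hypothetical model: model[state][action] = {successor: probability}.
model = {
    "home": {
        "left": {"win": 0.6, "lose": 0.4},
        "right": {"win": 0.9, "home": 0.1},
    }
}
U = {"home": 0.0, "win": 1.0, "lose": -1.0}  # terminal utilities are the rewards
sweep(U, model)
print(U["home"], backup(U, model, "home")[1])
```

Here "right" either wins or returns to "home" for another try, so its expected utility ends up higher than that of "left".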

Exploration: where should agent go to learn utilities?

Suppose you’re trying to learn optimal blackjack strategies
  Do you follow best utility, in order to win?
  Do you move around at random, hoping to learn more (and losing lots in the process)?
Following best utility all the time can get you stuck at an imperfect solution
Following random moves can lose a lot

Where should agent go to learn utilities?

f(u,n) = exploration function
  depends on utility of move, and number of times that agent has tried it
One possibility:
  Try a move a bunch of times, then eventually settle

$$f(u, n) = \begin{cases} \text{big number} & \text{if } n < N \\ u & \text{otherwise} \end{cases}$$
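The exploration function translates directly into code. The values chosen for the big number and for N, and the blackjack move names, are assumptions made for illustration.

```python
BIG_NUMBER = 1000.0  # assumed optimistic value for under-explored moves
N = 5                # assumed number of tries before trusting the estimate

def exploration_value(u, n, big_number=BIG_NUMBER, threshold=N):
    """f(u, n): big number while the move has been tried fewer than N times, else u."""
    return big_number if n < threshold else u

def choose_move(utilities, tries):
    """Pick the move with the highest exploration value."""
    return max(utilities, key=lambda m: exploration_value(utilities[m], tries[m]))

utilities = {"hit": 0.4, "stand": 0.1}
print(choose_move(utilities, {"hit": 9, "stand": 1}))  # stand: still under-explored
print(choose_move(utilities, {"hit": 9, "stand": 9}))  # hit: best estimated utility
```

Once every move has been tried N times, the agent settles into following its utility estimates.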

Generalization in Reinforcement Learning

Maintaining utilities for all seen states in a real game is intractable.

Instead, treat it as a supervised learning problem

Training set consists of (state, utility) pairs
  Learn to predict utility from state

This is a regression problem, not a classification problem

Can use neural network with multiple outputs
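To make the regression view concrete, here is a deliberately small sketch that fits a linear model instead of a neural network, trained with per-example gradient steps. The state features (a constant term, own piece count, opponent piece count), the training pairs, the learning rate and the epoch count are all hypothetical.

```python
def predict(weights, features):
    """Predicted utility: weighted sum of the state's features."""
    return sum(w * x for w, x in zip(weights, features))

def fit(training_set, rate=0.05, epochs=2000):
    """Fit weights by nudging them to reduce the error on each training pair."""
    weights = [0.0] * len(training_set[0][0])
    for _ in range(epochs):
        for features, utility in training_set:
            error = utility - predict(weights, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
    return weights

# Hypothetical (state features, utility) pairs.
# Features: (constant 1, my pieces, opponent pieces).
training_set = [
    ((1.0, 3.0, 1.0), 0.5),
    ((1.0, 1.0, 3.0), -0.5),
    ((1.0, 2.0, 2.0), 0.0),
    ((1.0, 4.0, 0.0), 1.0),
    ((1.0, 2.0, 0.0), 0.5),
    ((1.0, 0.0, 1.0), -0.25),
]
weights = fit(training_set)
print(weights)
print(predict(weights, (1.0, 3.0, 0.0)))  # a state not in the training set
```

The point of the exercise is the last line: the learned function gives a utility estimate for a state that was never seen, which a lookup table cannot do.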

Other applications
Applies to any situation where something is to learn from reinforcement
Possible examples:
  Toy robot dogs
  Petz
  That darn paperclip
  “The only winning move is not to play”
