
Evolutionary Search

Artificial Intelligence

CMSC 25000

January 25, 2007

Agenda

• Motivation:
  – Evolving a solution
• Genetic Algorithms
  – Modelling search as evolution
    • Mutation
    • Crossover
    • Survival of the fittest
    • Survival of the most diverse

• Conclusions

Motivation: Evolution

• Evolution through natural selection
  – Individuals pass on traits to offspring
  – Individuals have different traits
  – Fittest individuals survive to produce more offspring
  – Over time, variation can accumulate
    • Leading to new species

Simulated Evolution

• Evolving a solution
• Begin with population of individuals
  – Individuals = candidate solutions ~ chromosomes
• Produce offspring with variation
  – Mutation: change features
  – Crossover: exchange features between individuals
• Apply natural selection
  – Select “best” individuals to go on to next generation
• Continue until satisfied with solution
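The loop outlined above can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names (`quality`, `mutate`, `select_best`, `evolve`), the single-peak toy quality function, and the "keep the best few" selection rule are all hypothetical stand-ins. Crossover is left out to keep the sketch short.

```python
import random

def quality(chrom):
    # Toy single-peak quality (an assumption): best value 9 at (5, 5).
    a, b = chrom
    return 9 - abs(a - 5) - abs(b - 5)

def mutate(chrom, rng):
    # Change one randomly chosen feature by +/-1, staying inside 1..9.
    genes = list(chrom)
    i = rng.randrange(len(genes))
    genes[i] = min(9, max(1, genes[i] + rng.choice((-1, 1))))
    return tuple(genes)

def select_best(candidates, max_pop):
    # Stand-in for natural selection: keep the highest-quality individuals.
    return sorted(candidates, key=quality, reverse=True)[:max_pop]

def evolve(initial, max_pop=4, target=9, max_gens=10_000, seed=0):
    rng = random.Random(seed)
    population = [initial]
    for generation in range(max_gens):
        # Continue until satisfied with the solution.
        if max(quality(c) for c in population) >= target:
            return generation, population
        # Produce offspring with variation: one mutated child per parent.
        offspring = [mutate(c, rng) for c in population]
        # Merge parents and children, dropping duplicates, then select.
        candidates = list(dict.fromkeys(population + offspring))
        population = select_best(candidates, max_pop)
    return max_gens, population

generations, final_population = evolve((1, 1))
print(generations, max(final_population, key=quality))
```

Because each mutation changes quality by at most 1, this sketch needs at least 8 generations to climb from quality 1 to quality 9; the exact count depends on the random seed.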

Genetic Algorithms Applications

• Search parameter space for optimal assignment
  – Not guaranteed to find optimal, but can approach

• Classic optimization problems:
  – E.g. Travelling Salesman Problem

• Program design (“Genetic Programming”)

• Aircraft carrier landings

Genetic Algorithm Example

• Cookie recipes (Winston, AI, 1993)
• As evolving populations
• Individual = batch of cookies
• Quality: 0-9
  – Chromosome = 2 genes; 1 chromosome per individual
    • Flour Quantity, Sugar Quantity: each 1-9
• Mutation:
  – Randomly select Flour or Sugar: +/- 1, staying within [1-9]
• Crossover:
  – Split 2 chromosomes & rejoin; keeping both
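The crossover step can be sketched as follows, assuming a chromosome is a (flour, sugar) tuple. The function name `crossover` and the `cut` parameter are illustrative choices; with two genes the only useful cut point is between them.

```python
def crossover(parent_a, parent_b, cut=1):
    """Split both chromosomes at `cut` and rejoin the pieces, keeping both children."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

# Flour from one parent is paired with sugar from the other.
print(crossover((1, 4), (3, 1)))  # ((1, 1), (3, 4))
```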

Fitness

• Natural selection: Most fit survive

• Fitness = Probability of survival to next gen

• Question: How do we measure fitness?
  – “Standard method”: Relate fitness to quality
    • $f_i$: 0-1; $q_i$: 1-9:

$$f_i = \frac{q_i}{\sum_j q_j}$$

Chromosome   Quality   Fitness
1 4          4         0.4
3 1          3         0.3
1 2          2         0.2
1 1          1         0.1
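As a quick check of the standard method, the sketch below divides each quality by the total quality. The function name `standard_fitness` is an assumption for illustration; the qualities are the ones from the table.

```python
def standard_fitness(qualities):
    """Fitness of each candidate = its quality divided by the sum of all qualities."""
    total = sum(qualities)
    if total == 0:
        raise ValueError("all qualities are zero; standard fitness is undefined")
    return [q / total for q in qualities]

fitness = standard_fitness([4, 3, 2, 1])
print(fitness)  # [0.4, 0.3, 0.2, 0.1]
```

A candidate with quality 0 gets fitness 0 under this rule, so it can never be selected.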

GA Design Issues

• Genetic design:
  – Identify sets of features = genes; Constraints?
• Population: How many chromosomes?
  – Too few => inbreeding; Too many => too slow
• Mutation: How frequent?
  – Too few => slow change; Too many => wild
• Crossover: Allowed? How selected?
• Duplicates?

GA Design: Basic Cookie GA

• Genetic design:
  – Identify sets of features: 2 genes: flour + sugar; 1-9
• Population: How many chromosomes?
  – 1 initial, 4 max
• Mutation: How frequent?
  – 1 gene randomly selected, randomly mutated
• Crossover: Allowed? No
• Duplicates? No
• Survival: Standard method

Basic Cookie GA Results

• Results are for 1000 random trials
  – Initial state: one 1-1 chromosome, quality 1
• On average, reaches max quality (9) in 16 generations
• Best: max quality in 8 generations
• Conclusion:
  – Low dimensionality search
    • Successful even without crossover

Basic Cookie GA+Crossover Results

• Results are for 1000 random trials
  – Initial state: one 1-1 chromosome, quality 1
• On average, reaches max quality (9) in 14 generations
• Conclusion:
  – Faster with crossover: combine good in each gene
  – Key: Global max achievable by maximizing each dimension independently - reduce dimensionality

Solving the Moat Problem

• Problem:
  – No single step mutation can reach optimal values using standard fitness (quality=0 => probability=0)
• Solution A:
  – Crossover can combine fit parents in EACH gene
• However, still slow: 155 generations on average

Quality for each combination of the two genes (9 x 9 grid; the ring of 0s is the moat):

1 2 3 4 5 4 3 2 1
2 0 0 0 0 0 0 0 2
3 0 0 0 0 0 0 0 3
4 0 0 7 8 7 0 0 4
5 0 0 8 9 8 0 0 5
4 0 0 7 8 7 0 0 4
3 0 0 0 0 0 0 0 3
2 0 0 0 0 0 0 0 2
1 2 3 4 5 4 3 2 1

Questions

• How can we avoid the 0 quality problem?

• How can we avoid local maxima?

Rethinking Fitness

• Goal: Explicit bias to best
  – Remove implicit biases based on quality scale
• Solution: Rank method
  – Ignore actual quality values except for ranking
    • Step 1: Rank candidates by quality
    • Step 2: Probability of selecting the ith candidate, given that the first i-1 candidates were not selected, is a constant p
      – Step 2b: Last candidate is selected if no other has been
    • Step 3: Select candidates using the probabilities

Rank Method

Chromosome   Quality   Rank   Std. Fitness   Rank Fitness
1 4          4         1      0.4            0.667
1 3          3         2      0.3            0.222
1 2          2         3      0.2            0.074
5 2          1         4      0.1            0.025
7 5          0         5      0.0            0.012

Results: Average over 1000 random runs on Moat problem: 75 generations (vs 155 for standard method)

No 0 probability entries: Based on rank, not absolute quality
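The rank fitness column can be reproduced with a short sketch. The value p = 2/3 is an assumption inferred from the first table entry (0.667), and `rank_fitness` is an illustrative name.

```python
def rank_fitness(qualities, p=2 / 3):
    """Selection probability of each candidate under the rank method."""
    # Step 1: rank candidates by quality, best first.
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    fitness = [0.0] * len(qualities)
    remaining = 1.0  # probability that no better-ranked candidate was selected
    for rank, i in enumerate(order):
        is_last = rank == len(order) - 1
        # Step 2: select with constant probability p; Step 2b: the last takes what is left.
        fitness[i] = remaining if is_last else remaining * p
        remaining -= fitness[i]
    return fitness

print([round(f, 3) for f in rank_fitness([4, 3, 2, 1, 0])])
# [0.667, 0.222, 0.074, 0.025, 0.012]
```

Only the ordering of the qualities matters here, so the quality-0 candidate still receives a nonzero probability.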

Diversity

• Diversity:
  – Degree to which chromosomes exhibit different genes
  – Rank & Standard methods look only at quality
  – Need diversity: escape local maxima, variety for crossover
  – “As good to be different as to be fit”

Rank-Space Method

• Combines diversity and quality in fitness
• Diversity measure:
  – Sum of inverse squared distances in genes: $\sum_i \frac{1}{d_i^2}$
• Diversity rank: Avoids inadvertent bias
• Rank-space:
  – Sort on sum of diversity AND quality ranks
  – Best: lower left: high diversity & quality

Rank-Space Method

Chromosome   Q   D       D Rank   Q Rank   Comb Rank   R-S Fitness
1 4          4   0.04    1        1        1           0.667
3 1          3   0.25    5        2        4           0.025
1 2          2   0.059   3        3        2           0.222
1 1          1   0.062   4        4        5           0.012
7 5          0   0.05    2        5        3           0.074

D: inverse squared distance w.r.t. highest ranked chromosome, 5-1
Diversity rank breaks ties
After selecting others, sum distances to both
Results: Average (Moat) 15 generations
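The combined ranking can be sketched as below, using the five chromosomes from the table with 5-1 as the already selected chromosome. The names `ranks` and `rank_space_order` are assumptions, and the sketch assumes no candidate coincides with a selected chromosome (the distance would be zero).

```python
def ranks(scores):
    """Rank 1 goes to the smallest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def rank_space_order(candidates, qualities, selected):
    """Order candidates by the sum of diversity rank and quality rank."""
    def inverse_squared(c):
        # Sum of 1/d^2 to every already selected chromosome; small = diverse.
        return sum(1 / sum((a - b) ** 2 for a, b in zip(c, s)) for s in selected)

    d_rank = ranks([inverse_squared(c) for c in candidates])
    q_rank = ranks([-q for q in qualities])  # highest quality gets rank 1
    # Ties on the rank sum are broken by diversity rank.
    order = sorted(range(len(candidates)),
                   key=lambda i: (d_rank[i] + q_rank[i], d_rank[i]))
    return [candidates[i] for i in order], d_rank

candidates = [(1, 4), (3, 1), (1, 2), (1, 1), (7, 5)]
ordered, d_rank = rank_space_order(candidates, [4, 3, 2, 1, 0], selected=[(5, 1)])
print(d_rank)   # [1, 5, 3, 4, 2]
print(ordered)  # [(1, 4), (1, 2), (7, 5), (3, 1), (1, 1)]
```

Selection probabilities then follow from this order in the same way as in the rank method.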

GA’s and Local Maxima

• Quality metrics only:
  – Susceptible to local max problems
• Quality + Diversity:
  – Can populate all local maxima
    • Including global max
  – Key: Population must be large enough

GA Discussion

• Similar to stochastic local beam search
  – Beam: Population size
  – Stochastic: selection & mutation
  – Local: Each generation from single previous
  – Key difference: Crossover: 2 sources!
• Why crossover?
  – Schema: Partial local subsolutions
    • E.g. 2 halves of TSP tour

Question

• Traveling Salesman Problem
  – CSP-style Iterative refinement
  – Genetic Algorithm
• N-Queens
  – CSP-style Iterative refinement
  – Genetic Algorithm

Iterative Improvement Example

• TSP
  – Start with some valid tour
    • E.g. find greedy solution
  – Make incremental change to tour
    • E.g. hill-climbing - take change that produces greatest improvement
  – Problem: Local minima
  – Solution: Randomize to search other parts of space
    • Other methods: Simulated annealing, Genetic algorithms
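A minimal sketch of this iterative improvement scheme for TSP follows. The slide does not say which incremental change to use; reversing a segment of the tour (a 2-opt style move) is an assumption made here, as are the function names and the sample city coordinates.

```python
import math

def tour_length(tour, cities):
    # Closed tour: the last city connects back to the first.
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def greedy_tour(cities):
    # Start at city 0 and always move to the nearest unvisited city.
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        nearest = min(unvisited, key=lambda c: math.dist(cities[tour[-1]], cities[c]))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour

def hill_climb(tour, cities):
    best = tour_length(tour, cities)
    while True:
        best_candidate = None
        # Try every segment reversal and remember the greatest improvement.
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                length = tour_length(candidate, cities)
                if length < best - 1e-12:
                    best, best_candidate = length, candidate
        if best_candidate is None:
            return tour  # no improving change: a local minimum
        tour = best_candidate

cities = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1), (2, 3), (5, 2)]
start = greedy_tour(cities)
improved = hill_climb(start, cities)
print(tour_length(start, cities), tour_length(improved, cities))
```

The result is only locally optimal with respect to the chosen move, which is the local minima problem noted above.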

Machine Learning: Nearest Neighbor & Information Retrieval Search

Artificial Intelligence

CMSC 25000

January 25, 2007

Agenda

• Machine learning: Introduction
• Nearest neighbor techniques
  – Applications:
    • Credit rating
    • Text Classification
  – K-nn
  – Issues:
    • Distance, dimensions, & irrelevant attributes
    • Efficiency:
      – k-d trees, parallelism

Machine Learning

• Learning: Acquiring a function, based on past inputs and values, from new inputs to values.

• Learn concepts, classifications, values
  – Identify regularities in data

Machine Learning Examples

• Pronunciation:
  – Spelling of word => sounds
• Speech recognition:
  – Acoustic signals => sentences
• Robot arm manipulation:
  – Target => torques
• Credit rating:
  – Financial data => loan qualification

Complexity & Generalization

• Goal: Predict values accurately on new inputs
• Problem:
  – Train on sample data
  – Can make arbitrarily complex model to fit
  – BUT, will probably perform badly on NEW data
• Strategy:
  – Limit complexity of model (e.g. degree of equation)
  – Split training and validation sets
    • Hold out data to check for overfitting

Nearest Neighbor

• Memory- or case- based learning

• Supervised method: Training
  – Record labeled instances and feature-value vectors
• For each new, unlabeled instance
  – Identify “nearest” labeled instance
  – Assign same label
• Consistency heuristic: Assume that a property is the same as that of the nearest reference case.

Nearest Neighbor Example

• Credit Rating:
  – Classifier: Good / Poor
  – Features:
    • L = # late payments/yr
    • R = Income/Expenses

Name   L    R      G/P
A      0    1.2    G
B      25   0.4    P
C      5    0.7    G
D      20   0.8    P
E      30   0.85   P
F      11   1.2    G
G      7    1.15   G
H      15   0.8    P


[Figure: training instances A-H plotted with L on the horizontal axis (ticks at 10, 20, 30) and R on the vertical axis (tick at 1)]

Name   L    R      G/P
I      6    1.15   G
J      22   0.45   P
K      15   1.2    ??

Distance Measure (scaled distance):

$$d = \sqrt{(L_1 - L_2)^2 + \left[\sqrt{10}\,(R_1 - R_2)\right]^2}$$
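The scaled distance and the single nearest neighbor rule can be sketched on the credit data as follows. The names `TRAINING`, `scaled_distance`, and `nearest_label` are illustrative assumptions; the data and the scaling factor are taken from the slides.

```python
import math

# name: (L = late payments/yr, R = income/expenses, label)
TRAINING = {
    "A": (0, 1.2, "G"), "B": (25, 0.4, "P"), "C": (5, 0.7, "G"), "D": (20, 0.8, "P"),
    "E": (30, 0.85, "P"), "F": (11, 1.2, "G"), "G": (7, 1.15, "G"), "H": (15, 0.8, "P"),
}

def scaled_distance(a, b):
    """sqrt((L1-L2)^2 + (sqrt(10)*(R1-R2))^2) for points given as (L, R)."""
    (l1, r1), (l2, r2) = a, b
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def nearest_label(query):
    """Return the name and label of the closest training instance."""
    name = min(TRAINING, key=lambda n: scaled_distance(query, TRAINING[n][:2]))
    return name, TRAINING[name][2]

for name, query in {"I": (6, 1.15), "J": (22, 0.45), "K": (15, 1.2)}.items():
    print(name, nearest_label(query))
# I ('G', 'G')
# J ('D', 'P')
# K ('H', 'P')
```

With this particular scaling K's nearest neighbor is H (Poor), but F (Good) is the closest instance with the same R value. The answer for K is therefore sensitive to how R is weighted against L.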

Nearest Neighbor Analysis

• Problem:
  – Ambiguous labeling, Training Noise
• Solution:
  – K-nearest neighbors
    • Not just single nearest instance
    • Compare to K nearest neighbors
      – Label according to majority of K
• What should K be?
  – Often 3, can train as well
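A majority vote over the K nearest neighbors might look like the sketch below. The function name `knn_label` is an assumption, and the default metric here is plain unscaled Euclidean distance (`math.dist`) rather than the scaled distance from the credit example.

```python
import math
from collections import Counter

def knn_label(query, training, k=3, distance=math.dist):
    """Label `query` by majority vote among its k nearest training instances."""
    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# ((L, R), label) pairs from the credit rating example.
training = [
    ((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P"),
    ((30, 0.85), "P"), ((11, 1.2), "G"), ((7, 1.15), "G"), ((15, 0.8), "P"),
]
print(knn_label((15, 1.2), training, k=3))  # P
```

For instance K = (15, 1.2) the three nearest neighbors are H (Poor), F (Good), and D (Poor), so the vote is 2 to 1. An odd K avoids ties when there are two classes.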

Text Classification

Matching Topics and Documents

• Two main perspectives:
  – Pre-defined, fixed, finite topics:
    • “Text Classification”
  – Arbitrary topics, typically defined by statement of information need (aka query)
    • “Information Retrieval”

Vector Space Information Retrieval

• Task:
  – Document collection
  – Query specifies information need: free text
  – Relevance judgments: 0/1 for all docs
• Word evidence: Bag of words
  – No ordering information

Vector Space Model

[Figure: term space with axes Computer, Tv, Program]

Two documents: "computer program", "tv program"
Query "computer program": matches 1st doc exactly: distance = 2 vs 0
Query "educational program": matches both equally: distance = 1

Vector Space Model

• Represent documents and queries as
  – Vectors of term-based features
    • Features: tied to occurrence of terms in collection
  – E.g.

$$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j}); \quad q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$$

• Solution 1: Binary features: t = 1 if present, 0 otherwise
  – Similarity: number of terms in common
    • Dot product

$$sim(q_k, d_j) = \sum_{i=1}^{N} t_{i,k}\, t_{i,j}$$
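The binary-feature similarity can be sketched with the two example documents from the earlier slide. The helper names `binary_vector` and `sim` and the whitespace tokenization are assumptions for illustration.

```python
def binary_vector(text, vocabulary):
    """t = 1 if the term occurs in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def sim(q, d):
    """Dot product of two binary vectors = number of terms in common."""
    return sum(tq * td for tq, td in zip(q, d))

vocabulary = ["computer", "tv", "program"]
docs = [binary_vector(text, vocabulary) for text in ("computer program", "tv program")]

for query in ("computer program", "educational program"):
    q = binary_vector(query, vocabulary)
    print(query, [sim(q, d) for d in docs])
# computer program [2, 1]
# educational program [1, 1]
```

The term "educational" is not in the vocabulary, so that query matches both documents equally through "program" alone.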

Vector Space Model II

• Problem: Not all terms equally interesting
  – E.g. the vs dog vs Levow
• Solution: Replace binary term features with weights
  – Document collection: term-by-document matrix
  – View as vector in multidimensional space
    • Nearby vectors are related
  – Normalize for vector length

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}); \quad q_k = (w_{1,k}, w_{2,k}, \ldots, w_{N,k})$$

Vector Similarity Computation

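• Similarity = Dot product

$$sim(q_k, d_j) = q_k \cdot d_j = \sum_{i=1}^{N} w_{i,k}\, w_{i,j}$$

• Normalization:
  – Normalize weights in advance
  – Normalize post-hoc

$$sim(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^2}\,\sqrt{\sum_{i=1}^{N} w_{i,j}^2}}$$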
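The post-hoc normalized similarity (the cosine of the angle between the two vectors) can be sketched as below. The name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions.

```python
import math

def cosine_similarity(q, d):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d))
    return dot / norm if norm else 0.0

print(cosine_similarity([1, 0, 1], [0, 1, 1]))  # approximately 0.5
print(cosine_similarity([2, 0, 2], [1, 0, 1]))  # approximately 1.0
```

The second call shows the effect of length normalization: doubling every weight in a vector leaves its similarity to other vectors unchanged.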

Term Weighting

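• “Aboutness”
  – To what degree is this term what document is about?
  – Within document measure
  – Term frequency (tf): # occurrences of t in doc j
• “Specificity”
  – How surprised are you to see this term?
  – Collection frequency
  – Inverse document frequency (idf), where N is the number of documents in the collection and $n_i$ is the number of documents containing term i:

$$idf_i = \log\left(\frac{N}{n_i}\right)$$

$$w_{i,j} = tf_{i,j} \times idf_i$$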
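A small sketch of the tf-idf weighting follows. The slide does not give the base of the logarithm, so natural log is assumed here; the function name `tfidf_vectors`, the whitespace tokenization, and the three sample documents are also illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return the sorted vocabulary and one tf-idf weight vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # n_i: number of documents that contain term i at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # occurrences of each term in this document
        vectors.append([tf[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = tfidf_vectors(["the dog barks", "the cat", "the dog"])
print(vocabulary)  # ['barks', 'cat', 'dog', 'the']
print(vectors[0])
```

A term that appears in every document, such as "the" here, gets idf = log(1) = 0 and so contributes nothing to any vector.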

Term Selection & Formation

• Selection:
  – Some terms are truly useless
    • Too frequent, no content
      – E.g. the, a, and,…
  – Stop words: ignore such terms altogether
• Creation:
  – Too many surface forms for same concepts
    • E.g. inflections of words: verb conjugations, plural
  – Stem terms: treat all forms as same underlying

Efficient Implementations

• Classification cost:
  – Find nearest neighbor: O(n)
    • Compute distance between unknown and all instances
    • Compare distances
  – Problematic for large data sets
• Alternative:
  – Use binary search to reduce to O(log n)

Efficient Implementation: K-D Trees

• Divide instances into sets based on features
  – Binary branching: E.g. > value
  – 2^d leaves with d-split paths = n
    • d = O(log n)
  – To split cases into sets,
    • If there is one element in the set, stop
    • Otherwise pick a feature to split on
      – Find average position of two middle objects on that dimension
        » Split remaining objects based on average position
        » Recursively split subsets

K-D Trees: Classification

R > 0.825?
  No: L > 17.5?
    No: R > 0.75?
      No: Good
      Yes: Poor
    Yes: R > 0.6?
      No: Poor
      Yes: Poor
  Yes: L > 9?
    No: R > 1.175?
      No: Good
      Yes: Good
    Yes: R > 1.025?
      No: Poor
      Yes: Good
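The splitting procedure and the resulting classification can be sketched on the credit data. Alternating the split feature between R and L, starting with R, is an assumption that matches the thresholds in the tree above; the names `build` and `classify` are illustrative.

```python
def build(points, depth=0, features=("R", "L")):
    """Recursively split until each set holds one instance; leaves are labels."""
    if len(points) == 1:
        return points[0]["label"]
    feature = features[depth % len(features)]
    ordered = sorted(points, key=lambda p: p[feature])
    mid = len(ordered) // 2
    # Threshold = average position of the two middle objects on this feature.
    threshold = (ordered[mid - 1][feature] + ordered[mid][feature]) / 2
    return {
        "feature": feature,
        "threshold": threshold,
        "no": build(ordered[:mid], depth + 1, features),
        "yes": build(ordered[mid:], depth + 1, features),
    }

def classify(tree, query):
    """Follow the yes/no tests down to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "yes" if query[tree["feature"]] > tree["threshold"] else "no"
        tree = tree[branch]
    return tree

rows = [("A", 0, 1.2, "G"), ("B", 25, 0.4, "P"), ("C", 5, 0.7, "G"), ("D", 20, 0.8, "P"),
        ("E", 30, 0.85, "P"), ("F", 11, 1.2, "G"), ("G", 7, 1.15, "G"), ("H", 15, 0.8, "P")]
points = [{"name": n, "L": l, "R": r, "label": g} for n, l, r, g in rows]
tree = build(points)

for name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(name, classify(tree, {"L": l, "R": r}))
```

Descending the tree takes one comparison per level, but the leaf it reaches is not guaranteed to be the true nearest neighbor. For K the descent ends at F (Good), while an exhaustive search with the scaled distance finds H (Poor).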

Efficient Implementation:Parallel Hardware

• Classification cost:
  – # distance computations
    • Const time if O(n) processors
  – Cost of finding closest
    • Compute pairwise minimum, successively
    • O(log n) time
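The successive pairwise minimum can be simulated sequentially to count the parallel time steps. This is only a sketch: `tournament_min` is a hypothetical name, the input is assumed to be non-empty, and real parallel hardware would perform each round's comparisons simultaneously.

```python
def tournament_min(distances):
    """Return (minimum, number of pairwise-comparison rounds)."""
    values = list(distances)
    rounds = 0
    while len(values) > 1:
        # Each pair could be handled by its own processor in one time step.
        values = [min(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

print(tournament_min([5, 3, 8, 1, 9, 2, 7, 4]))  # (1, 3)
```

Eight distances shrink to four, two, then one, so the minimum is found in 3 rounds, in line with the O(log n) bound.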

Nearest Neighbor: Issues

• Prediction can be expensive if many features

• Affected by classification, feature noise
  – One entry can change prediction
• Definition of distance metric
  – How to combine different features
    • Different types, ranges of values

• Sensitive to feature selection

Nearest Neighbor: Analysis

• Issue:
  – What is a good distance metric?
  – How should features be combined?
• Strategy:
  – (Typically weighted) Euclidean distance
  – Feature scaling: Normalization
    • Good starting point:
      – (Feature - Feature_mean)/Feature_standard_deviation
      – Rescales all values - Centered on 0 with std_dev 1
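The suggested rescaling can be sketched for a single feature column. Using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption, as is the name `standardize`; the input values are the L column of the credit data.

```python
import statistics

def standardize(values):
    """(value - mean) / standard deviation for every value of one feature."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the feature is not constant
    return [(v - mean) / std for v in values]

late_payments = [0, 25, 5, 20, 30, 11, 7, 15]
scaled = standardize(late_payments)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

After rescaling, the column has mean 0 and standard deviation 1 up to floating point error. Applying the same step to every feature keeps a wide-ranging feature such as L from dominating the distance.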

Nearest Neighbor: Analysis

• Issue:
  – What features should we use?
    • E.g. Credit rating: Many possible features
      – Tax bracket, debt burden, retirement savings, etc.
  – Nearest neighbor uses ALL
  – Irrelevant feature(s) could mislead
    • Fundamental problem with nearest neighbor

Nearest Neighbor: Advantages

• Fast training:
  – Just record feature vector - output value set
• Can model wide variety of functions
  – Complex decision boundaries
  – Weak inductive bias

• Very generally applicable

Summary

• Machine learning:
  – Acquire function from input features to value
    • Based on prior training instances
  – Supervised vs Unsupervised learning
    • Classification and Regression
  – Inductive bias:
    • Representation of function to learn
    • Complexity, Generalization, & Validation