Evolutionary Search
Artificial Intelligence
CMSC 25000
January 25, 2007
Agenda
• Motivation:
  – Evolving a solution
• Genetic Algorithms
  – Modelling search as evolution
    • Mutation
    • Crossover
    • Survival of the fittest
    • Survival of the most diverse
• Conclusions
Motivation: Evolution
• Evolution through natural selection
  – Individuals pass on traits to offspring
  – Individuals have different traits
  – Fittest individuals survive to produce more offspring
  – Over time, variation can accumulate
    • Leading to new species
Simulated Evolution
• Evolving a solution
• Begin with population of individuals
  – Individuals = candidate solutions ~ chromosomes
• Produce offspring with variation
  – Mutation: change features
  – Crossover: exchange features between individuals
• Apply natural selection
  – Select “best” individuals to go on to next generation
• Continue until satisfied with solution
Genetic Algorithms Applications
• Search parameter space for optimal assignment
  – Not guaranteed to find optimal, but can approach
• Classic optimization problems:
  – E.g. Travelling Salesman Problem
• Program design (“Genetic Programming”)
• Aircraft carrier landings
Genetic Algorithm Example
• Cookie recipes (Winston, AI, 1993)
• As evolving populations
• Individual = batch of cookies
• Quality: 0-9
  – Each individual has 1 chromosome with 2 genes
    • Flour Quantity, Sugar Quantity: 1-9
• Mutation:
  – Randomly select Flour/Sugar: +/- 1, staying within [1-9]
• Crossover:
  – Split 2 chromosomes & rejoin; keeping both
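The two operators above can be sketched in a few lines. This is a minimal illustration, not Winston's code: the function names are made up here, a chromosome is assumed to be a `(flour, sugar)` tuple, and the handling of a gene already at 1 or 9 (only the in-range step is offered) is an assumption.

```python
import random

GENE_MIN, GENE_MAX = 1, 9  # flour and sugar quantities range over 1-9


def mutate(chromosome, rng=random):
    """Change one randomly chosen gene by +1 or -1, staying within 1-9."""
    genes = list(chromosome)
    i = rng.randrange(len(genes))
    # Assumption: at a boundary, only the step that stays in range is allowed.
    steps = [s for s in (-1, 1) if GENE_MIN <= genes[i] + s <= GENE_MAX]
    genes[i] += rng.choice(steps)
    return tuple(genes)


def crossover(a, b):
    """Split both chromosomes between the genes and rejoin; keep both children."""
    return (a[0], b[1]), (b[0], a[1])


print(mutate((1, 1)))
print(crossover((1, 4), (3, 1)))  # ((1, 1), (3, 4))
```

With only two genes there is a single possible split point, so crossover simply swaps the sugar gene between the two parents.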
Fitness
• Natural selection: Most fit survive
• Fitness = Probability of survival to next gen
• Question: How do we measure fitness?
  – “Standard method”: Relate fitness to quality
    • $f_i$: 0-1; $q_i$: 1-9:
      $f_i = \frac{q_i}{\sum_j q_j}$

Chromosome  Quality  Fitness
1 4         4        0.4
3 1         3        0.3
1 2         2        0.2
1 1         1        0.1
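As a quick check of the standard method, the sketch below divides each quality by the population total. The function name is illustrative, and raising an error when every quality is 0 is an assumption (the slides do not say what should happen then).

```python
def standard_fitness(qualities):
    """Standard method: fitness is quality divided by total quality."""
    total = sum(qualities)
    if total == 0:
        # Assumption: with all qualities 0 the method is undefined.
        raise ValueError("standard fitness is undefined when all qualities are 0")
    return [q / total for q in qualities]


fitness = standard_fitness([4, 3, 2, 1])
print(fitness)  # [0.4, 0.3, 0.2, 0.1]
```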
GA Design Issues
• Genetic design:
  – Identify sets of features = genes; Constraints?
• Population: How many chromosomes?
  – Too few => inbreeding; Too many => too slow
• Mutation: How frequent?
  – Too few => slow change; Too many => wild
• Crossover: Allowed? How selected?
• Duplicates?
GA Design: Basic Cookie GA
• Genetic design:
  – Identify sets of features: 2 genes: flour + sugar; 1-9
• Population: How many chromosomes?
  – 1 initial, 4 max
• Mutation: How frequent?
  – 1 gene randomly selected, randomly mutated
• Crossover: Allowed? No
• Duplicates? No
• Survival: Standard method
Basic Cookie GA Results
• Results are for 1000 random trials
  – Initial state: 1 chromosome, 1-1, with quality 1
• On average, reaches max quality (9) in 16 generations
• Best: max quality in 8 generations
• Conclusion:
  – Low dimensionality search
    • Successful even without crossover
Basic Cookie GA+Crossover Results
• Results are for 1000 random trials
  – Initial state: 1 chromosome, 1-1, with quality 1
• On average, reaches max quality (9) in 14 generations
• Conclusion:
  – Faster with crossover: combines the good value found for each gene
  – Key: Global max achievable by maximizing each dimension independently - reduces dimensionality
Solving the Moat Problem
• Problem:
  – No single step mutation can reach optimal values using standard fitness (quality=0 => probability=0)
• Solution A:
  – Crossover can combine fit parents in EACH gene
• However, still slow: 155 generations on average

Quality for each pair of gene values (each gene 1-9):
1 2 3 4 5 4 3 2 1
2 0 0 0 0 0 0 0 2
3 0 0 0 0 0 0 0 3
4 0 0 7 8 7 0 0 4
5 0 0 8 9 8 0 0 5
4 0 0 7 8 7 0 0 4
3 0 0 0 0 0 0 0 3
2 0 0 0 0 0 0 0 2
1 2 3 4 5 4 3 2 1
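To see why single-step mutation gets stuck, the moat quality grid from this slide can be encoded as a lookup table. This is a sketch under assumptions: `MOAT`, `quality`, and `neighbors` are names made up here, and which gene indexes the rows is not recoverable from the slide (the grid is symmetric, so the choice does not change the result).

```python
# Moat quality grid: an outer ring of 1-5, a moat of zeros, a central peak of 9.
MOAT = [
    [1, 2, 3, 4, 5, 4, 3, 2, 1],
    [2, 0, 0, 0, 0, 0, 0, 0, 2],
    [3, 0, 0, 0, 0, 0, 0, 0, 3],
    [4, 0, 0, 7, 8, 7, 0, 0, 4],
    [5, 0, 0, 8, 9, 8, 0, 0, 5],
    [4, 0, 0, 7, 8, 7, 0, 0, 4],
    [3, 0, 0, 0, 0, 0, 0, 0, 3],
    [2, 0, 0, 0, 0, 0, 0, 0, 2],
    [1, 2, 3, 4, 5, 4, 3, 2, 1],
]


def quality(a, b):
    """Quality for gene values a and b, each in 1-9."""
    return MOAT[a - 1][b - 1]


def neighbors(a, b):
    """All chromosomes reachable by one +/-1 mutation, within 1-9."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(a + da, b + db) for da, db in steps
            if 1 <= a + da <= 9 and 1 <= b + db <= 9]


# From the best outer-ring cell (5, 1), the only step toward the centre has quality 0.
print([(n, quality(*n)) for n in neighbors(5, 1)])
# [((4, 1), 4), ((6, 1), 4), ((5, 2), 0)]
```

Under the standard method a quality-0 offspring has survival probability 0, so a population on the outer ring cannot cross the moat by mutation alone.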
Questions
• How can we avoid the 0 quality problem?
• How can we avoid local maxima?
Rethinking Fitness
• Goal: Explicit bias to best
  – Remove implicit biases based on quality scale
• Solution: Rank method
  – Ignore actual quality values except for ranking
    • Step 1: Rank candidates by quality
    • Step 2: Probability of selecting the ith candidate, given that candidates 1 through i-1 were not selected, is a constant p.
      – Step 2b: Last candidate is selected if no other has been
    • Step 3: Select candidates using the probabilities
Rank Method
Chromosome  Quality  Rank  Std. Fitness  Rank Fitness
1 4         4        1     0.4           0.667
1 3         3        2     0.3           0.222
1 2         2        3     0.2           0.074
5 2         1        4     0.1           0.025
7 5         0        5     0.0           0.012

Results: Average over 1000 random runs on Moat problem
  - 75 Generations (vs 155 for standard method)
No 0 probability entries: Based on rank, not absolute quality
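The rank fitness column can be reproduced from steps 2 and 2b. The sketch below assumes p = 2/3, which is inferred from the first table entry (0.667) rather than stated on the slide; the function name is illustrative.

```python
def rank_fitness(n, p):
    """Selection probabilities for n candidates ranked best-first (n >= 1)."""
    probs = []
    remaining = 1.0  # probability that no better-ranked candidate was selected
    for _ in range(n - 1):
        probs.append(remaining * p)
        remaining *= 1 - p
    probs.append(remaining)  # step 2b: the last candidate takes whatever is left
    return probs


print([round(f, 3) for f in rank_fitness(5, 2 / 3)])
# [0.667, 0.222, 0.074, 0.025, 0.012]
```

The probabilities depend only on rank, so the quality-0 chromosome still has a nonzero chance of survival.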
Diversity
• Diversity:
  – Degree to which chromosomes exhibit different genes
  – Rank & Standard methods look only at quality
  – Need diversity: escape local maxima, variety for crossover
  – “As good to be different as to be fit”
Rank-Space Method
• Combines diversity and quality in fitness
• Diversity measure:
  – Sum of inverse squared distances in genes: $\sum_i \frac{1}{d_i^2}$
• Diversity rank: Avoids inadvertent bias
• Rank-space:
  – Sort on sum of diversity AND quality ranks
  – Best: lower left: high diversity & quality
Rank-Space Method
Chromosome  Q  D      D Rank  Q Rank  Comb Rank  R-S Fitness
1 4         4  0.04   1       1       1          0.667
3 1         3  0.25   5       2       4          0.025
1 2         2  0.059  3       3       2          0.222
1 1         1  0.062  4       4       5          0.012
7 5         0  0.05   2       5       3          0.074

D is measured w.r.t. the highest ranked chromosome, 5-1
Diversity rank breaks ties
After selecting others, sum distances to both
Results: Average (Moat) 15 generations
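The ranks in this table can be recomputed from the chromosomes and qualities. This is a sketch, not the lecture's code: the function name and return shape are made up here, and it assumes no candidate equals an already selected chromosome (a duplicate would give a zero distance).

```python
def rank_space(candidates, qualities, selected):
    """Return (diversity, d_rank, q_rank, combined_rank); rank 1 is best."""
    n = len(candidates)

    def inverse_squared_distance_sum(c):
        # Sum of 1/d^2 to every already selected chromosome (smaller = more diverse).
        return sum(1 / sum((x - y) ** 2 for x, y in zip(c, s)) for s in selected)

    def ranks(keys):
        order = sorted(range(n), key=lambda i: keys[i])
        result = [0] * n
        for position, i in enumerate(order, start=1):
            result[i] = position
        return result

    diversity = [inverse_squared_distance_sum(c) for c in candidates]
    d_rank = ranks(diversity)
    q_rank = ranks([-q for q in qualities])  # higher quality ranks first
    # Sort on the sum of both ranks; the diversity rank breaks ties.
    combined = ranks([(q_rank[i] + d_rank[i], d_rank[i]) for i in range(n)])
    return diversity, d_rank, q_rank, combined


candidates = [(1, 4), (3, 1), (1, 2), (1, 1), (7, 5)]
diversity, d_rank, q_rank, combined = rank_space(candidates, [4, 3, 2, 1, 0], [(5, 1)])
print([round(d, 3) for d in diversity])  # [0.04, 0.25, 0.059, 0.062, 0.05]
print(d_rank, q_rank, combined)
# [1, 5, 3, 4, 2] [1, 2, 3, 4, 5] [1, 4, 2, 5, 3]
```

Feeding the combined ranks into the rank method with p = 2/3 gives the R-S fitness column (0.667 for combined rank 1, 0.222 for rank 2, and so on).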
GA’s and Local Maxima
• Quality metrics only:
  – Susceptible to local max problems
• Quality + Diversity:
  – Can populate all local maxima
    • Including global max
  – Key: Population must be large enough
GA Discussion
• Similar to stochastic local beam search
  – Beam: Population size
  – Stochastic: selection & mutation
  – Local: Each generation from single previous
  – Key difference: Crossover: 2 sources!
• Why crossover?
  – Schema: Partial local subsolutions
    • E.g. 2 halves of TSP tour
Question
• Traveling Salesman Problem
  – CSP-style Iterative refinement
  – Genetic Algorithm
• N-Queens
  – CSP-style Iterative refinement
  – Genetic Algorithm
Iterative Improvement Example
• TSP
  – Start with some valid tour
    • E.g. find greedy solution
  – Make incremental change to tour
    • E.g. hill-climbing - take change that produces greatest improvement
  – Problem: Local minima
  – Solution: Randomize to search other parts of space
    • Other methods: Simulated annealing, Genetic alg’s
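One way to make the hill-climbing step concrete is sketched below. The slide does not say what counts as an "incremental change"; reversing a segment of the tour (a 2-opt style move) is an assumption made here, as are the function names and the four-city example.

```python
import math


def tour_length(tour, points):
    """Length of the closed tour that visits points in the given order."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def hill_climb(tour, points):
    """Repeatedly apply the single segment reversal that shortens the tour most."""
    tour = list(tour)
    while True:
        best, best_length = None, tour_length(tour, points)
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                length = tour_length(candidate, points)
                if length < best_length - 1e-12:
                    best, best_length = candidate, length
        if best is None:
            return tour  # local minimum: no single reversal improves the tour
        tour = best


# Hypothetical example: corners of a unit square, visited in a crossing order.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
result = hill_climb([0, 1, 2, 3], points)
print(result, tour_length(result, points))
```

On larger instances this stops at whatever local minimum is nearest the starting tour, which is the problem the randomized methods on the slide address.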
Machine Learning: Nearest Neighbor & Information Retrieval Search
Artificial Intelligence
CMSC 25000
January 25, 2007
Agenda
• Machine learning: Introduction
• Nearest neighbor techniques
  – Applications:
    • Credit rating
    • Text Classification
  – K-nn
  – Issues:
    • Distance, dimensions, & irrelevant attributes
    • Efficiency:
      – k-d trees, parallelism
Machine Learning
• Learning: Acquiring a function, based on past inputs and values, from new inputs to values.
• Learn concepts, classifications, values
  – Identify regularities in data
Machine Learning Examples
• Pronunciation:
  – Spelling of word => sounds
• Speech recognition:
  – Acoustic signals => sentences
• Robot arm manipulation:
  – Target => torques
• Credit rating:
  – Financial data => loan qualification
Complexity & Generalization
• Goal: Predict values accurately on new inputs
• Problem:
  – Train on sample data
  – Can make arbitrarily complex model to fit
  – BUT, will probably perform badly on NEW data
• Strategy:
  – Limit complexity of model (e.g. degree of equ’n)
  – Split training and validation sets
    • Hold out data to check for overfitting
Nearest Neighbor
• Memory- or case- based learning
• Supervised method: Training
  – Record labeled instances and feature-value vectors
• For each new, unlabeled instance
  – Identify “nearest” labeled instance
  – Assign same label
• Consistency heuristic: Assume that a property is the same as that of the nearest reference case.
Nearest Neighbor Example
• Credit Rating:
  – Classifier: Good / Poor
  – Features:
    • L = # late payments/yr
    • R = Income/Expenses

Name  L   R     G/P
A     0   1.2   G
B     25  0.4   P
C     5   0.7   G
D     20  0.8   P
E     30  0.85  P
F     11  1.2   G
G     7   1.15  G
H     15  0.8   P
Nearest Neighbor Example
[Figure: scatter plot of instances A-H from the table above; axes L (ticks at 10, 20, 30) and R (tick at 1)]
Nearest Neighbor Example
[Figure: the same scatter plot with new instances I, J, and K added]

Name  L   R     G/P
I     6   1.15  G
J     22  0.45  P
K     15  1.2   ??

Distance Measure:
sqrt((L1-L2)^2 + [sqrt(10)*(R1-R2)]^2)
  - Scaled distance
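The labels for I and J can be checked with a small nearest-neighbor sketch over the credit data, using the scaled distance from this slide. The function names and the `k` parameter are illustrative; with `k=1` this is plain nearest neighbor, and a larger `k` takes a majority vote.

```python
import math
from collections import Counter

# Training instances from the credit-rating table: name -> (L, R, label).
TRAINING = {
    "A": (0, 1.2, "G"), "B": (25, 0.4, "P"), "C": (5, 0.7, "G"),
    "D": (20, 0.8, "P"), "E": (30, 0.85, "P"), "F": (11, 1.2, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.8, "P"),
}


def scaled_distance(l1, r1, l2, r2):
    """Euclidean distance with the R difference scaled by sqrt(10)."""
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)


def classify(l, r, k=1):
    """Label a new instance by majority vote among its k nearest neighbors."""
    nearest = sorted(TRAINING.values(),
                     key=lambda t: scaled_distance(l, r, t[0], t[1]))[:k]
    return Counter(label for _, _, label in nearest).most_common(1)[0][0]


print(classify(6, 1.15), classify(22, 0.45))  # G P
```

Instance K, which the slide leaves as "??", gets whatever label the chosen distance and `k` produce; its neighborhood mixes Good and Poor cases, which is the kind of ambiguity that motivates using more than one neighbor.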
Nearest Neighbor Analysis
• Problem:
  – Ambiguous labeling, Training Noise
• Solution:
  – K-nearest neighbors
    • Not just single nearest instance
    • Compare to K nearest neighbors
      – Label according to majority of K
• What should K be?
  – Often 3; K can also be chosen by training
Text Classification
Matching Topics and Documents
• Two main perspectives:
  – Pre-defined, fixed, finite topics:
    • “Text Classification”
  – Arbitrary topics, typically defined by statement of information need (aka query)
    • “Information Retrieval”
Vector Space Information Retrieval
• Task:
  – Document collection
  – Query specifies information need: free text
  – Relevance judgments: 0/1 for all docs
• Word evidence: Bag of words
  – No ordering information
Vector Space Model
[Figure: three-dimensional term space with axes Computer, Tv, and Program]

Two documents: computer program, tv program
Query: computer program: matches 1st doc exactly (distance = 0, vs 2 for the 2nd doc)
Query: educational program: matches both equally (distance = 1)
Vector Space Model
• Represent documents and queries as
  – Vectors of term-based features
    • Features: tied to occurrence of terms in collection
  – E.g. $d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j});\ q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$
• Solution 1: Binary features: t=1 if present, 0 otherwise
  – Similarity: number of terms in common
    • Dot product: $sim(q_k, d_j) = \sum_{i=1}^{N} t_{i,k}\, t_{i,j}$
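The binary dot product can be tried on the three-term example (computer, tv, program) from the earlier slide. This is a minimal sketch; the vocabulary list and function names are illustrative, and tokenization is assumed to be whitespace splitting.

```python
VOCABULARY = ["computer", "tv", "program"]


def binary_vector(text):
    """1 for each vocabulary term present in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in VOCABULARY]


def sim(q, d):
    """Dot product of binary vectors = number of terms in common."""
    return sum(tq * td for tq, td in zip(q, d))


docs = ["computer program", "tv program"]
for query in ["computer program", "educational program"]:
    print(query, [sim(binary_vector(query), binary_vector(d)) for d in docs])
# computer program [2, 1]
# educational program [1, 1]
```

"educational" is not in the vocabulary, so that query matches both documents only on "program".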
Vector Space Model II
• Problem: Not all terms equally interesting
  – E.g. the vs dog vs Levow
• Solution: Replace binary term features with weights
  – Document collection: term-by-document matrix
  – View as vector in multidimensional space
    • Nearby vectors are related
  – Normalize for vector length
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j});\ q_k = (w_{1,k}, w_{2,k}, \ldots, w_{N,k})$
Vector Similarity Computation
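• Similarity = Dot product
  $sim(q_k, d_j) = q_k \cdot d_j = \sum_{i=1}^{N} w_{i,k}\, w_{i,j}$
• Normalization:
  – Normalize weights in advance
  – Normalize post-hoc
  $sim(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k}\, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^2}\ \sqrt{\sum_{i=1}^{N} w_{i,j}^2}}$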
Term Weighting
• “Aboutness”
  – To what degree is this term what document is about?
  – Within document measure
  – Term frequency (tf): # occurrences of t in doc j
• “Specificity”
  – How surprised are you to see this term?
  – Collection frequency
  – Inverse document frequency (idf): $idf_i = \log\left(\frac{N}{n_i}\right)$
$w_{i,j} = tf_{i,j} \times idf_i$
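A small sketch of tf-idf weighting, combined with the length-normalized (cosine) similarity. It assumes that in the idf formula N is the number of documents and n_i is the number of documents containing term i, and it uses the natural logarithm (the slide does not give a base). The toy documents and function names are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one weight vector per doc)."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    doc_freq = Counter(t for doc in docs for t in set(doc))  # n_i for each term
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Dot product normalized by both vector lengths."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


docs = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(cosine(vectors[0], vectors[2]), cosine(vectors[0], vectors[1]))
```

Because "the" occurs in every document its idf is log(1) = 0, so it contributes nothing to any similarity; the first and third documents are related only through "dog".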
Term Selection & Formation
• Selection:
  – Some terms are truly useless
    • Too frequent, no content
      – E.g. the, a, and,…
  – Stop words: ignore such terms altogether
• Creation:
  – Too many surface forms for same concepts
    • E.g. inflections of words: verb conjugations, plural
  – Stem terms: treat all forms as same underlying
Efficient Implementations
• Classification cost:
  – Find nearest neighbor: O(n)
    • Compute distance between unknown and all instances
    • Compare distances
  – Problematic for large data sets
• Alternative:
  – Use binary search to reduce to O(log n)
Efficient Implementation: K-D Trees
• Divide instances into sets based on features
  – Binary branching: E.g. > value
  – A tree with d splits on each path has 2^d leaves; with 2^d = n,
    • d = O(log n)
  – To split cases into sets,
    • If there is one element in the set, stop
    • Otherwise pick a feature to split on
      – Find average position of two middle objects on that dimension
        » Split remaining objects based on average position
        » Recursively split subsets
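The splitting procedure above can be sketched on the credit-rating data. Several choices here are assumptions rather than anything the slides state: features are cycled in the order R, L, R, ..., the tree is a nested dict, and a set whose values cannot be separated becomes a leaf. The names `build` and `classify` are illustrative.

```python
def build(cases, features, depth=0):
    """cases: list of (feature_dict, label). Returns a nested-dict decision tree."""
    if len(cases) == 1:
        return {"label": cases[0][1]}
    feature = features[depth % len(features)]  # assumption: cycle through features
    ordered = sorted(cases, key=lambda c: c[0][feature])
    mid = len(ordered) // 2
    # Average position of the two middle objects on this dimension.
    threshold = (ordered[mid - 1][0][feature] + ordered[mid][0][feature]) / 2
    no = [c for c in ordered if c[0][feature] <= threshold]
    yes = [c for c in ordered if c[0][feature] > threshold]
    if not no or not yes:
        # Assumption: identical values cannot be split, so stop with a leaf.
        return {"label": ordered[0][1]}
    return {"feature": feature, "threshold": threshold,
            "no": build(no, features, depth + 1),
            "yes": build(yes, features, depth + 1)}


def classify(tree, x):
    """Follow the threshold tests down to a leaf and return its label."""
    while "label" not in tree:
        tree = tree["yes"] if x[tree["feature"]] > tree["threshold"] else tree["no"]
    return tree["label"]


DATA = [({"L": 0, "R": 1.2}, "G"), ({"L": 25, "R": 0.4}, "P"),
        ({"L": 5, "R": 0.7}, "G"), ({"L": 20, "R": 0.8}, "P"),
        ({"L": 30, "R": 0.85}, "P"), ({"L": 11, "R": 1.2}, "G"),
        ({"L": 7, "R": 1.15}, "G"), ({"L": 15, "R": 0.8}, "P")]

tree = build(DATA, ["R", "L"])
print(tree["feature"], round(tree["threshold"], 3))        # R 0.825
print(tree["no"]["threshold"], tree["yes"]["threshold"])   # 17.5 9.0
print(classify(tree, {"L": 6, "R": 1.15}))                 # G
```

Classification follows one root-to-leaf path, so it costs O(log n) comparisons; the leaf reached is a nearby stored case but is not guaranteed to be the exact nearest neighbor.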
K-D Trees: Classification
[Figure: k-d tree for the credit-rating data. Root test: R > 0.825?; second-level tests: L > 17.5? and L > 9?; third-level tests: R > 0.6?, R > 0.75?, R > 1.025?, R > 1.175?; each No/Yes branch ends in a leaf labeled Good or Poor]
Efficient Implementation: Parallel Hardware
• Classification cost:
  – # distance computations
    • Const time if O(n) processors
  – Cost of finding closest
    • Compute pairwise minimum, successively
    • O(log n) time
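The successive pairwise-minimum step can be simulated sequentially to count the rounds. This is only an illustration of the reduction pattern: on parallel hardware every comparison within a round would run at once. The function name and the sample distances are hypothetical, and the input is assumed to be non-empty.

```python
def pairwise_min(values):
    """Reduce by pairwise minima; returns (minimum, number of rounds)."""
    rounds = 0
    while len(values) > 1:
        # One round: each adjacent pair is compared independently of the others.
        values = [min(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds


print(pairwise_min([15.0, 10.3, 10.1, 5.2, 15.0, 4.0, 8.0, 1.3]))  # (1.3, 3)
```

Eight distances need three rounds, matching the O(log n) bound.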
Nearest Neighbor: Issues
• Prediction can be expensive if many features
• Affected by classification, feature noise
  – One entry can change prediction
• Definition of distance metric
  – How to combine different features
    • Different types, ranges of values
• Sensitive to feature selection
Nearest Neighbor: Analysis
• Issue:
  – What is a good distance metric?
  – How should features be combined?
• Strategy:
  – (Typically weighted) Euclidean distance
  – Feature scaling: Normalization
    • Good starting point:
      – (Feature - Feature_mean)/Feature_standard_deviation
      – Rescales all values - Centered on 0 with std_dev 1
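The suggested starting point can be sketched with the two credit-rating features. Using the population standard deviation (`statistics.pstdev`) is an assumption; the slide does not say whether the population or sample form is meant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)  # assumption: population std dev
    return [(v - m) / s for v in values]


late_payments = [0, 25, 5, 20, 30, 11, 7, 15]            # feature L
income_ratio = [1.2, 0.4, 0.7, 0.8, 0.85, 1.2, 1.15, 0.8]  # feature R

z_l = standardize(late_payments)
z_r = standardize(income_ratio)
print([round(z, 2) for z in z_l])
print([round(z, 2) for z in z_r])
```

After rescaling, a one-unit difference means one standard deviation on either feature, so L no longer dominates the Euclidean distance simply because its raw range is larger.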
Nearest Neighbor: Analysis
• Issue:
  – What features should we use?
    • E.g. Credit rating: Many possible features
      – Tax bracket, debt burden, retirement savings, etc.
  – Nearest neighbor uses ALL
  – Irrelevant feature(s) could mislead
    • Fundamental problem with nearest neighbor
Nearest Neighbor: Advantages
• Fast training:
  – Just record feature vector - output value set
• Can model wide variety of functions
  – Complex decision boundaries
  – Weak inductive bias
• Very generally applicable
Summary
• Machine learning:
  – Acquire function from input features to value
    • Based on prior training instances
  – Supervised vs Unsupervised learning
    • Classification and Regression
  – Inductive bias:
    • Representation of function to learn
    • Complexity, Generalization, & Validation