Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization
July 25, 2013
Or Levi
Introduction
• Better Results
• Higher Accuracy
• Knowledge
• Insights
Big Data
Machine Learning on eBay
Data Mining Optimization
Genetic Algorithm
Agenda
1. What Is a Genetic Algorithm?
2. How Can GA Help Improve Cluster Analysis?
3. Where Might It Be Useful? An eBay Use Case
4. Questions and Answers
Genetic Algorithm
A Search Heuristic Inspired by the Natural Evolution
Genetic Algorithm
Solution Representation: Chromosome (7 Genes)
Adi: 0 0 1 0 0 1 0
Fitness Value: Neck Length
Adi: 5’1
Natural Selection Mechanism: Environment
Tall Trees, Competition
Genetic Algorithm
Initial Population
Name  Chromosome
Joe   1 0 0 1 1 0 0
Zoe   0 1 0 1 0 1 1
Ron   1 1 0 1 1 1 0
Tom   1 0 1 0 1 0 1
Adi   0 0 1 0 0 1 0
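A minimal sketch of how an initial population like the one above could be generated, assuming each individual is a list of 7 random bits. The function name `random_population` and the seed are illustrative and do not come from the slides.

```python
import random

def random_population(size, genes, seed=None):
    """Return `size` chromosomes, each a list of `genes` random bits."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(genes)] for _ in range(size)]

# Five individuals with seven genes each, mirroring the slide's population shape.
population = random_population(size=5, genes=7, seed=0)
for chromosome in population:
    print(" ".join(str(gene) for gene in chromosome))
```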
Genetic Algorithm
Fitness Function
Name  Chromosome      Neck Length
Joe   1 0 0 1 1 0 0   5’6
Zoe   0 1 0 1 0 1 1   4’2
Ron   1 1 0 1 1 1 0   5’8
Tom   1 0 1 0 1 0 1   4’9
Adi   0 0 1 0 0 1 0   5’1
Genetic Algorithm
Selection
Name  Chromosome      Neck Length
Joe   1 0 0 1 1 0 0   5’6
Zoe   0 1 0 1 0 1 1   4’2
Ron   1 1 0 1 1 1 0   5’8
Tom   1 0 1 0 1 0 1   4’9
Adi   0 0 1 0 0 1 0   5’1
Elitism
Genetic Algorithm
Selection
Fitness proportionate selection
[Figure: selection wheel with one slice per individual: Ron, Joe, Adi, Tom, Zoe]
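Fitness proportionate selection can be sketched as a roulette wheel in which each individual's slice is proportional to its fitness. The sketch below uses the neck lengths from the slides converted to inches (an assumption made for illustration); the function names and the seed are hypothetical.

```python
import random

# Neck lengths from the slides, converted to inches (5'6 -> 66, and so on).
fitness = {"Joe": 66, "Zoe": 50, "Ron": 68, "Tom": 57, "Adi": 61}

def selection_probabilities(fitness):
    """Each individual's share of the wheel is fitness / total fitness."""
    total = sum(fitness.values())
    return {name: value / total for name, value in fitness.items()}

def roulette_select(fitness, rng):
    """Spin the wheel once and return the name whose slice contains the spin."""
    spin = rng.uniform(0, sum(fitness.values()))
    running = 0.0
    for name, value in fitness.items():
        running += value
        if spin <= running:
            return name
    return name  # guards against floating-point round-off on the last slice

probs = selection_probabilities(fitness)
rng = random.Random(42)
parents = [roulette_select(fitness, rng) for _ in range(2)]
print(probs)
print(parents)
```

Fitter individuals are more likely to be picked, but weaker ones still have a chance, which keeps some diversity in the population.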
Genetic Algorithm
Crossover
Name        Chromosome      Neck Length
Ron         1 1 0 1 1 1 0   5’8
Zoe         0 1 0 1 0 1 1   4’2
Ron Junior  1 1 0 1 0 1 1   6’0
Zoe Junior  0 1 0 1 1 1 0   5’3
Crossover Probability
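The offspring shown on this slide are consistent with a single-point crossover that cuts both parents after the fourth gene. The slides do not name the operator, so the cut point and the function name below are inferred from the chromosomes rather than stated by the author.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes after index `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

ron = [1, 1, 0, 1, 1, 1, 0]
zoe = [0, 1, 0, 1, 0, 1, 1]

# Cutting after the fourth gene reproduces the offspring on the slide.
ron_junior, zoe_junior = single_point_crossover(ron, zoe, point=4)
print(ron_junior)
print(zoe_junior)
```

In a full run, the operator would be applied to a selected pair only with the configured crossover probability; otherwise the parents pass through unchanged.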
Genetic Algorithm
Mutation
Name        Chromosome      Fitness
Ron Junior  1 1 0 1 0 1 1   6’0
Zoe Junior  0 1 0 1 1 1 0   5’3
No Mutation
Mutation Probability: 0.1
Genetic Algorithm
Crossover
Name  Chromosome      Neck Length
Joe   1 0 0 1 1 0 0   5’6
Adi   0 0 1 0 0 1 0   5’1
No Crossover
Crossover Probability
Genetic Algorithm
Mutation
Name        Chromosome      Neck Length
Joe         1 0 0 1 1 0 0   5’6
Adi         0 0 1 0 0 1 0   5’1
Adi Junior  0 0 1 0 1 1 0   5’5
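Adi Junior differs from Adi in exactly one gene (the fifth), which is what a bit-flip mutation produces. The sketch below assumes the mutation probability is applied independently to each gene; the slides do not say whether it is per gene or per chromosome, and the function names are illustrative.

```python
import random

def mutate(chromosome, probability, rng):
    """Flip each gene independently with the given probability."""
    return [1 - gene if rng.random() < probability else gene
            for gene in chromosome]

def flip(chromosome, index):
    """Flip one specific gene; used here to replay the slide's example."""
    child = list(chromosome)
    child[index] = 1 - child[index]
    return child

adi = [0, 0, 1, 0, 0, 1, 0]
adi_junior = flip(adi, 4)  # fifth gene flipped, as on the slide
random_child = mutate(adi, probability=0.1, rng=random.Random(7))
print(adi_junior)
print(random_child)
```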
Genetic Algorithm
New Generation
Name        Chromosome      Neck Length
Ron         1 1 0 1 1 1 0   5’8
Ron Junior  1 1 0 1 0 1 1   6’0
Joe         1 0 0 1 1 0 0   5’6
Adi Junior  0 0 1 0 1 1 0   5’5
Zoe Junior  0 1 0 1 1 1 0   5’3
Neck Length: Previous 5’1, New 5’5
Genetic Algorithm
Results
Name             Chromosome      Neck Length
Adi Junior VIII  1 0 1 1 0 1 0   7’0
Overview – Genetic Algorithm
eBay Structured Data
What inventory is on our shelves?
Structured Data
Taxonomy
Products
Item Finders
Attributes
Structured Data
Data Vendors
eBay Sellers
eBay Items
Products
Everywhere
Products
ISBN
UPI
Choose Aggregation Set
Brand
Model
Color
Creating Products from Items
Product Features
Network: 4G
Camera: 8.0MP
Screen Size: 4 in.
Used iOS
16GB New
Unlocked
$525.00, 17 Bids
$649.99, Buy It Now
$579.99 or Best Offer
Storage
Carrier
Apple iPhone 5 – Black
Smartphones
Product Type eBay View Items Product
Apple iPhone 5 Black
Apple iPhone 5 Black
Black Apple iPhone 5
Other Features
Bluetooth: Yes
GPS: Yes
Dimensions:
Height: 4.87 in.
Depth: 0.30 in.
Width: 2.31 in.
Choose Aggregation Set Extract Relevant Attributes Aggregate Similar Items
Creating Products from Items
Structured Aggregation
Top-Down
Unstructured Clustering
Bottom-Up
Items
Products
Aggregation Set
Aggregation Set
Overview – eBay Use Case
Use Case Example
Cluster Analysis
Discovering groups and structures that are in some way similar
K-Means Cluster Analysis
Model
Objective: minimize the Total Within Cluster Variance
$$\min \sum_{i=1}^{K} SSE_i, \qquad SSE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$
where $x_j$ is an Observation, $C_i$ is a Cluster, and $\mu_i$ is the Center of cluster $C_i$.
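As a small worked instance of the objective, the sketch below computes the total within-cluster variance for a hand-made clustering. The points and helper names are made up for illustration, and each cluster center is taken to be the mean of its points.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def total_within_cluster_variance(clusters):
    """Sum over clusters of squared distances from each point to its center."""
    total = 0.0
    for points in clusters:
        center = mean(points)
        total += sum(squared_distance(x, center) for x in points)
    return total

# Two toy clusters; each contributes 1 + 1 = 2 to the total.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
score = total_within_cluster_variance(clusters)
print(score)
```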
Standard K-Means Algorithm
1. Choose K random points as the initial centers.
2. Assign points to clusters according to the cluster centers.
3. Recalculate the cluster means.
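The three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: points are assigned to the nearest center by squared Euclidean distance, the loop stops when the centers no longer change, and the toy data, seed, and function names are invented.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, rng, max_iterations=100):
    centers = rng.sample(points, k)              # step 1: K random points
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 2: assign to clusters
            clusters[nearest(p, centers)].append(p)
        new_centers = [                          # step 3: recalculate means
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]           # keep an empty cluster's center
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:               # converged
            break
        centers = new_centers
    score = sum(squared_distance(p, centers[nearest(p, centers)])
                for p in points)
    return centers, score

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, score = kmeans(points, k=2, rng=random.Random(1))
print(centers, score)
```

On well-separated toy data like this, every start reaches the same answer. On harder data the result depends on the initial centers, which is the local-optimum issue the following slides illustrate.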
Standard K-Means Algorithm
Solution Score (Total Within Cluster Variance) of the solution in each iteration:
Iteration 1: 33.7
Iteration 2: 26.8
Iteration 3: 23.6
Iteration 4: 21.6
Iteration 5: 19.5
Iteration 6: 18.8
Iteration 7: 18.7 (Local Optimum)
Standard K-Means Algorithm
Initial Cluster Centers
Initial Center
Local Optimum
Overview – Standard K-Means
Use Case
Standard K-Means
Local Optimum
Genetic K-Means Algorithm
Applying genetic algorithm to the standard K-Means heuristic
Genetic K-Means Algorithm
Solution Representation: Chromosome (7 Genes)
Adi: 0 0 1 0 0 1 0
Genetic K-Means Algorithm
Solution Fitness
Neck Length
Genetic K-Means Algorithm
Total Within Cluster Variance:
$$\sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$
Solution Fitness
Genetic K-Means Algorithm
K-Means Iteration
Solution Fitness
Genetic K-Means Algorithm
Crossover: Arithmetic Crossover
Solution 1 (Ron)
Solution 2 (Zoe)
Genetic K-Means Algorithm
Crossover: Arithmetic Crossover
Offspring 1 (Ron Junior)
Offspring 2 (Zoe Junior)
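Arithmetic crossover builds each offspring as a weighted average of the two parents. The sketch below assumes a solution is encoded as a list of K cluster centers (the slides show the representation only as images), and the `weight` parameter and example centers are illustrative.

```python
def arithmetic_crossover(solution_1, solution_2, weight=0.5):
    """Blend two solutions center by center.

    Offspring 1 = weight * s1 + (1 - weight) * s2
    Offspring 2 = (1 - weight) * s1 + weight * s2
    """
    offspring_1 = [
        tuple(weight * a + (1 - weight) * b for a, b in zip(c1, c2))
        for c1, c2 in zip(solution_1, solution_2)
    ]
    offspring_2 = [
        tuple((1 - weight) * a + weight * b for a, b in zip(c1, c2))
        for c1, c2 in zip(solution_1, solution_2)
    ]
    return offspring_1, offspring_2

# Two parent solutions, each holding K = 2 two-dimensional centers.
ron = [(0.0, 0.0), (4.0, 4.0)]
zoe = [(2.0, 2.0), (8.0, 0.0)]
ron_junior, zoe_junior = arithmetic_crossover(ron, zoe, weight=0.75)
print(ron_junior)
print(zoe_junior)
```

Because the offspring are convex combinations of the parents, every new center lies on the line segment between the two parent centers it was blended from.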
Genetic K-Means Algorithm
Mutation
Adi
Genetic K-Means Algorithm
Mutation
Adi Junior
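The slides do not specify how a solution made of cluster centers is mutated. One common choice, shown here purely as an assumption, is to nudge a center with small Gaussian noise. The function name, `sigma`, and the example centers are hypothetical.

```python
import random

def mutate_centers(solution, probability, sigma, rng):
    """With the given probability, shift a center by Gaussian noise.

    Assumption: Gaussian perturbation stands in for whatever mutation
    operator the talk actually used.
    """
    mutated = []
    for center in solution:
        if rng.random() < probability:
            center = tuple(c + rng.gauss(0.0, sigma) for c in center)
        mutated.append(center)
    return mutated

adi = [(1.0, 1.0), (6.0, 5.0)]
adi_junior = mutate_centers(adi, probability=1.0, sigma=0.5,
                            rng=random.Random(3))
print(adi_junior)
```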
Overview – Genetic K-Means Algorithm
Use Case
Apply GA to K-Means
Standard K-Means
Local Optimum
Demo
Genetic Algorithm VS Standard K-Means
Demo
Solution Fitness (Total Within Cluster Variance) of the best solution in each generation:
Generation 1: 32.5
Generation 2: 21.9
Generation 3: 14.7
Generation 4: 12.3
Generation 5: 9.7
Generation 6: 9.1
Generation 7: 8.9
Genetic Algorithm VS Standard K-Means
[Chart: Total Within Cluster Variance Per Generation. Axes: Generations, Total Within Cluster Variance. Series: Genetic Algorithm, K-Means.]
Genetic Algorithm VS Standard K-Means
[Chart: Total Within-Cluster Variance on Different Runs. Series: K-Means, Multiple K-Means, Genetic Algorithm. Annotations: Local Optimum, High Volatility, Global Optimum.]
Average Improvement Across 20 Different Runs: 51% VS Standard K-Means, 32% VS Multiple K-Means
Overview – GA VS Standard K-Means
Use Case
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
eBay Use Case
Extract Structured Data from groups of similar items
eBay Use Case
Lumia 920 Red 32GB
Lumia 520 Yellow 8GB
Lumia 620 Green 8GB
Lumia 800 Blue 16GB
eBay Use Case
50 Random Items (520, 620, 800, 920)
Original Title: Nokia Lumia 800 Blue 8GB phone
Clean Up: NOKIA LUMIA 800 BLUE 8GB PHONE
{Stop Words}: brand new
Terms: NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE
Text Dictionary: All Titles
Number of Unique Terms in All Titles
97
9
25
TF-IDF Weights (Importance of a term to a title): 0.05 0.12 0 0.31 0 0.20 0.12 0 0.14
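A rough sketch of the title-to-vector step: clean up each title, build a dictionary of all unique terms, and weight each term by TF-IDF. This uses the plain variant tf × log(N / df), which is an assumption; the talk does not state its exact formula, so the weights below will not match the numbers on the slide. The sample titles and the stop-word set are made up around the slide's example.

```python
import math
from collections import Counter

STOP_WORDS = {"BRAND", "NEW"}  # assumed from the slide's "brand new" example

def clean(title):
    """Upper-case the title, split on whitespace, and drop stop words."""
    return [term for term in title.upper().split() if term not in STOP_WORDS]

def tf_idf(titles):
    """Return the sorted vocabulary and one weight vector per title."""
    docs = [clean(title) for title in titles]
    vocabulary = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

titles = [
    "Brand new Nokia Lumia 800 Blue 8GB phone",
    "Nokia Lumia 620 Green 8GB",
    "Nokia Lumia 920 Red 32GB phone",
]
vocabulary, vectors = tf_idf(titles)
print(vocabulary)
print([round(w, 2) for w in vectors[0]])
```

Terms that appear in every title (here NOKIA and LUMIA) get weight 0, while distinguishing terms such as the model number or color get the largest weights.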
eBay Use Case
Aggregation Set: Model, Color, Storage
Cluster Center (Average Weight): 0.08 0.11 0 0.13 0 0.06 0.05 0 0.03
8GB 620 GREEN 5MP CAMERA PHONE
1 Item: NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE
Accurate Item Classifications
Average Improvement Across 20 Different Runs: 46% VS Standard K-Means, 23% VS Multiple K-Means
Overview – Example
Use Case Example
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
Questions & Answers
Open Discussion
?
Conclusion
Summing it all up
Conclusion
Use Case Example
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
+50% Accuracy
Thank You!
Or Levi
Data Analyst
Catalog & Classification
eBay Structured Data
Linked
Appendix – Genetic Algorithm Parameters
[Chart: Normalized Score (0 to 100) of Total Within Cluster Variance, Average of 5 Runs, by Crossover Probability (65% to 90%) and Mutation Probability (5% to 10%).]
Population Size: 10
Number of Generations: 10
Crossover Probability: 75%
Mutation Probability: 9%