survival of the fittest: using genetic algorithm for data mining optimization

Survival of the Fittest - Using Genetic Algorithm for Data Mining Optimization July 25, 2013 Or Levi

Upload: or-levi

Post on 17-Jul-2015


Page 1: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Survival of the Fittest - Using Genetic Algorithm for Data Mining Optimization

July 25, 2013

Or Levi

Page 2: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Introduction

•Better Results

•Higher Accuracy

•Knowledge

•Insights

Big Data

Machine Learning on eBay

Data Mining Optimization

Genetic Algorithm

Page 3: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Agenda

1. What is a Genetic Algorithm?

2. How can GA help improve cluster analysis?

3. Where might it be useful? An eBay use case

4. Questions and Answers

Page 4: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

A Search Heuristic Inspired by Natural Evolution

Page 5: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Chromosome (solution representation): 0 0 1 0 0 1 0 (7 genes), Adi

Fitness value: neck length, 5’1

Environment (natural selection mechanism): tall trees, competition

Page 6: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Initial Population

Joe: 1 0 0 1 1 0 0

Zoe: 0 1 0 1 0 1 1

Ron: 1 1 0 1 1 1 0

Tom: 1 0 1 0 1 0 1

Adi: 0 0 1 0 0 1 0

Page 7: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Fitness Function

Fitness: neck length

Joe: 1 0 0 1 1 0 0, 5’6

Zoe: 0 1 0 1 0 1 1, 4’2

Ron: 1 1 0 1 1 1 0, 5’8

Tom: 1 0 1 0 1 0 1, 4’9

Adi: 0 0 1 0 0 1 0, 5’1

Page 8: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Selection

Fitness: neck length

Joe: 1 0 0 1 1 0 0, 5’6

Zoe: 0 1 0 1 0 1 1, 4’2

Ron: 1 1 0 1 1 1 0, 5’8

Tom: 1 0 1 0 1 0 1, 4’9

Adi: 0 0 1 0 0 1 0, 5’1

Elitism

Page 9: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Selection

Fitness proportionate selection

Candidates: Ron, Joe, Adi, Tom, Zoe
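A minimal sketch of fitness proportionate (roulette wheel) selection, assuming the neck lengths on the Fitness Function slide are feet and inches and can be converted to inches to serve as fitness values. The function names and the fixed random seed are illustrative, not taken from the talk.

```python
import random

# Neck lengths from the Fitness Function slide, converted to inches
# (assumption: 5'6 means 5 feet 6 inches).
fitness = {"Joe": 66, "Zoe": 50, "Ron": 68, "Tom": 57, "Adi": 61}


def selection_probabilities(fitness):
    """Each individual's chance of being picked is its share of total fitness."""
    total = sum(fitness.values())
    return {name: value / total for name, value in fitness.items()}


def roulette_select(fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    names = list(fitness)
    weights = [fitness[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


probabilities = selection_probabilities(fitness)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
parents = [roulette_select(fitness, rng) for _ in range(2)]
```

Fitter individuals are more likely to be chosen, but weaker ones still have a chance, which keeps some diversity in the population.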

Page 10: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Crossover

Parents:

Ron: 1 1 0 1 1 1 0, 5’8

Zoe: 0 1 0 1 0 1 1, 4’2

Offspring:

Ron Junior: 1 1 0 1 0 1 1, 6’0

Zoe Junior: 0 1 0 1 1 1 0, 5’3

Crossover Probability
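The offspring on this slide are consistent with a single-point crossover. The sketch below assumes a cut after the fourth gene, which is one of the cut points that reproduces Ron Junior and Zoe Junior from Ron and Zoe; the slide itself does not state where the cut is.

```python
def single_point_crossover(parent_a, parent_b, cut):
    """Swap the tails of two chromosomes after position `cut`."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


ron = [1, 1, 0, 1, 1, 1, 0]
zoe = [0, 1, 0, 1, 0, 1, 1]

# cut=4 is an assumption chosen to match the slide's offspring.
ron_junior, zoe_junior = single_point_crossover(ron, zoe, cut=4)
```

In a full run, the crossover probability decides whether a selected pair is recombined at all or copied unchanged.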

Page 11: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Mutation

Ron Junior: 1 1 0 1 0 1 1 (chromosome), 6’0 (fitness)

Zoe Junior: 0 1 0 1 1 1 0 (chromosome), 5’3 (fitness)

No Mutation

Mutation Probability: 0.1

Page 12: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Crossover

Joe: 1 0 0 1 1 0 0, 5’6

Adi: 0 0 1 0 0 1 0, 5’1

No Crossover

Crossover Probability

Page 13: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Mutation

Joe: 1 0 0 1 1 0 0, 5’6

Adi: 0 0 1 0 0 1 0, 5’1

Adi Junior: 0 0 1 0 1 1 0, 5’5
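Adi Junior differs from Adi only in the fifth gene, which is what a bit-flip mutation would produce. The sketch below shows that single flip, plus a per-gene probabilistic version; applying the mutation probability to each gene independently is an assumption, since the slides do not say whether it applies per gene or per chromosome.

```python
import random


def flip_gene(chromosome, index):
    """Return a copy of the chromosome with one binary gene inverted."""
    mutated = list(chromosome)
    mutated[index] = 1 - mutated[index]
    return mutated


def mutate(chromosome, probability, rng):
    """Flip each gene independently with the given probability (assumed scheme)."""
    return [1 - gene if rng.random() < probability else gene for gene in chromosome]


adi = [0, 0, 1, 0, 0, 1, 0]
adi_junior = flip_gene(adi, 4)  # index 4 is the fifth gene
maybe_mutated = mutate(adi, 0.1, random.Random(0))
```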

Page 14: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

New Generation

Ron: 1 1 0 1 1 1 0, 5’8

Ron Junior: 1 1 0 1 0 1 1, 6’0

Joe: 1 0 0 1 1 0 0, 5’6

Adi Junior: 0 0 1 0 1 1 0, 5’5

Zoe Junior: 0 1 0 1 1 1 0, 5’3

Neck Length: Previous 5’1, New 5’5

Page 15: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm

Results

Adi Junior VIII: 1 0 1 1 0 1 0, 7’0

Page 16: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – Genetic Algorithm


Page 17: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

eBay Structured Data

What inventory is on our shelves?

Page 18: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Structured Data

Taxonomy

Products

Item Finders

Attributes

Page 19: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Structured Data


Page 20: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Structured Data


Page 21: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Structured Data


Data Vendors

eBay Sellers

eBay Items

Products

Everywhere

Products

ISBN

UPI

Page 22: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Choose Aggregation Set

Brand

Model

Color

Creating Products from Items


Product Features

Network: 4G

Camera: 8.0MP

Screen Size: 4 in.

Used iOS

16GB New

Unlocked

$525.00 17 Bids

$649.99 Buy It Now

$579.99 or Best Offer

Storage

Carrier

Apple iPhone 5 – Black

Smartphones

Product Type eBay View Items Product

Apple iPhone 5 Black

Apple iPhone 5 Black

Black Apple iPhone 5

Other Features

Bluetooth: Yes

GPS: Yes

Dimensions:

Height: 4.87 in.

Depth: 0.30 in.

Width: 2.31 in.

Choose Aggregation Set Extract Relevant Attributes Aggregate Similar Items


Page 23: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Creating Products from Items


Structured Aggregation

Top-Down

Unstructured Clustering

Bottom-Up

Items

Products

Aggregation Set

Aggregation Set

Page 24: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – eBay Use Case


Use Case Example

Page 25: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Cluster Analysis

Discovering groups and structures that are in some way similar

Page 26: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

K-Means Cluster Analysis

Objective: minimize the total within-cluster variance

$$\min \sum_{i=1}^{K} SSE_i$$

Model:

$$SSE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$

where $x_j$ is an observation, $\mu_i$ is the center of cluster $i$, and $C_i$ is cluster $i$.
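The objective can be computed directly from its definition. This is a small sketch on made-up 2-D points, with each center taken as the mean of its cluster; the function names and the sample data are illustrative only.

```python
def centroid(cluster):
    """Mean of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, center)) for point in cluster
    )


def total_within_cluster_variance(clusters):
    """Sum of SSE_i over all K clusters, using each cluster's own mean."""
    return sum(sse(cluster, centroid(cluster)) for cluster in clusters)


# Hypothetical data: two clusters of two points each.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(5.0, 5.0), (5.0, 7.0)],
]
total = total_within_cluster_variance(clusters)
```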

Page 27: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Choose K Random Points

Initial Center

Page 28: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Assign Points to Clusters

Cluster Center

Page 29: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Recalculate the Cluster Means
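The three steps (choose K random points, assign points to clusters, recalculate the cluster means) can be sketched as a short loop. This is a plain-Python illustration under assumptions of its own: squared Euclidean distance, an empty cluster keeps its previous center, and the loop stops when the centers no longer move. The sample points are made up.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Step 2: put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for point in points:
        nearest = min(
            range(len(centers)), key=lambda i: squared_distance(point, centers[i])
        )
        clusters[nearest].append(point)
    return clusters


def recalculate(clusters, old_centers):
    """Step 3: move each center to the mean of its cluster."""
    centers = []
    for cluster, old in zip(clusters, old_centers):
        if cluster:
            centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
        else:
            centers.append(old)  # assumption: an empty cluster keeps its center
    return centers


def score(clusters, centers):
    """Total within-cluster variance of the current solution."""
    return sum(
        squared_distance(point, center)
        for cluster, center in zip(clusters, centers)
        for point in cluster
    )


def k_means(points, k, rng, max_iterations=100):
    centers = rng.sample(points, k)  # Step 1: K random points as initial centers
    history = []
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        new_centers = recalculate(clusters, centers)
        history.append(score(clusters, new_centers))
        if new_centers == centers:  # converged, possibly to a local optimum
            break
        centers = new_centers
    return centers, history


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers, history = k_means(points, 2, random.Random(0))
```

The recorded score never increases from one iteration to the next, but where it ends up depends on the initial centers.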

Page 30: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 1

Solution score (total within-cluster variance): 33.7

Recalculate the Cluster Means

Page 31: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 2

Solution score (total within-cluster variance): 26.8

Page 32: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 3

Solution score (total within-cluster variance): 23.6

Page 33: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 4

Solution score (total within-cluster variance): 21.6

Page 34: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 5

Solution score (total within-cluster variance): 19.5

Page 35: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 6

Solution score (total within-cluster variance): 18.8

Page 36: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Solution in iteration 7

Solution score (total within-cluster variance): 18.7

Local Optimum

Page 37: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Standard K-Means Algorithm

Initial Cluster Centers

Initial Center

Local Optimum

Page 38: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – Standard K-Means


Use Case

Standard K-Means

Local Optimum

Page 39: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Applying genetic algorithm to the standard K-Means heuristic

Page 40: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Solution Representation

Chromosome: 0 0 1 0 0 1 0 (7 genes), Adi

Page 41: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Solution Representation


Page 42: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Solution Representation


Page 43: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm


Solution Fitness

Neck Length

Page 44: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Solution fitness: total within-cluster variance

$$\sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$

Page 45: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

K-Means Iteration

Solution Fitness

Page 46: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Arithmetic Crossover

Solution 1 (Ron)

Solution 2 (Zoe)

Page 47: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm

Arithmetic Crossover

Offspring 1 (Ron Junior)

Offspring 2 (Zoe Junior)
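Arithmetic crossover in its usual form blends two real-valued parents with a weight alpha and its complement. The sketch below assumes each solution is a flat list of cluster-center coordinates and uses a made-up alpha of 0.25; the slides show neither the encoding nor the weight.

```python
def arithmetic_crossover(parent_1, parent_2, alpha):
    """Blend two real-valued chromosomes gene by gene.

    offspring_1 = alpha * parent_1 + (1 - alpha) * parent_2
    offspring_2 = (1 - alpha) * parent_1 + alpha * parent_2
    """
    offspring_1 = [alpha * a + (1 - alpha) * b for a, b in zip(parent_1, parent_2)]
    offspring_2 = [(1 - alpha) * a + alpha * b for a, b in zip(parent_1, parent_2)]
    return offspring_1, offspring_2


# Hypothetical solutions: one cluster center each, as (x, y) coordinates.
solution_1 = [0.0, 4.0]
solution_2 = [8.0, 0.0]
offspring_1, offspring_2 = arithmetic_crossover(solution_1, solution_2, alpha=0.25)
```

Unlike the bit-string crossover earlier in the deck, the offspring here lie on the line segment between the two parents, so new centers fall between existing ones.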

Page 48: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm


Mutation

Adi

Page 49: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic K-Means Algorithm


Mutation

Adi Junior

Page 50: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – Genetic K-Means Algorithm

Use Case

Apply GA to K-Means

Standard K-Means

Local Optimum

Page 51: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Genetic Algorithm VS Standard K-Means

Page 52: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 1

Solution fitness (total within-cluster variance): 32.5

Cluster Center

Page 53: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 2

Solution fitness (total within-cluster variance): 21.9

Page 54: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 3

Solution fitness (total within-cluster variance): 14.7

Page 55: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 4

Solution fitness (total within-cluster variance): 12.3

Page 56: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 5

Solution fitness (total within-cluster variance): 9.7

Page 57: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 6

Solution fitness (total within-cluster variance): 9.1

Page 58: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Demo

Best solution in generation 7

Solution fitness (total within-cluster variance): 8.9

Page 59: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm VS Standard K-Means

[Chart: Total Within Cluster Variance Per Generation. x-axis: Generations; y-axis: Total Within Cluster Variance; series: Genetic Algorithm, K-Means]

Page 60: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Genetic Algorithm VS Standard K-Means

[Chart: Total Within-Cluster Variance on Different Runs. y-axis: Total Within Cluster Variance; series: K-Means, Multiple K-Means, Genetic Algorithm; annotations: Local Optimum, High Volatility, Global Optimum]

Average improvement across 20 different runs: 51% vs. Standard K-Means, 32% vs. Multiple K-Means

Page 61: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – GA VS Standard K-Means


Use Case

Apply GA to K-Means

Global Optimum

Standard K-Means

Local Optimum

Page 62: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

eBay Use Case

Extract Structured Data from groups of similar items

Page 63: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

eBay Use Case

Lumia 920 Red 32GB

Lumia 520 Yellow 8GB

Lumia 620 Green 8GB

Lumia 800 Blue 16GB

Page 64: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

eBay Use Case

50 Random Items (models 520, 620, 800, 920)

Original Title: Nokia Lumia 800 Blue 8GB phone

Clean Up: NOKIA LUMIA 800 BLUE 8GB PHONE

{Stop Words}: brand new

Terms: NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE

Text Dictionary: All Titles

Number of Unique Terms in All Titles

97

9

25

TF-IDF Weights (importance of a term to a title): 0.05 0.12 0 0.31 0 0.20 0.12 0 0.14
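A minimal sketch of the title pipeline on this slide: upper-case and tokenize each title, drop stop words, build a dictionary of all unique terms, and weight each term by TF-IDF. The exact weighting formula used in the talk is not given; this sketch assumes term frequency divided by title length, multiplied by log(N / document frequency). The titles are taken from the slides, with "Brand new" prepended to one of them as a hypothetical example of stop words.

```python
import math


def tokenize(title, stop_words):
    """Clean up a title: upper-case, split on whitespace, drop stop words."""
    return [term for term in title.upper().split() if term not in stop_words]


def tf_idf_vectors(titles, stop_words=frozenset()):
    """Return the text dictionary and one TF-IDF vector per title."""
    documents = [tokenize(title, stop_words) for title in titles]
    vocabulary = sorted({term for doc in documents for term in doc})
    n = len(documents)
    document_frequency = {
        term: sum(term in doc for doc in documents) for term in vocabulary
    }
    vectors = []
    for doc in documents:
        vector = []
        for term in vocabulary:
            tf = doc.count(term) / len(doc) if doc else 0.0
            idf = math.log(n / document_frequency[term])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


titles = [
    "Brand new Nokia Lumia 800 Blue 8GB phone",  # "Brand new" is hypothetical
    "Lumia 920 Red 32GB",
    "Lumia 520 Yellow 8GB",
    "Lumia 620 Green 8GB",
]
vocabulary, vectors = tf_idf_vectors(titles, stop_words={"BRAND", "NEW"})
```

A term that appears in every title, such as LUMIA here, gets weight 0, while a term that distinguishes one title from the rest, such as the model number, gets a positive weight. Each vector has one entry per dictionary term, so most entries are 0.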

Page 65: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

eBay Use Case

Aggregation Set: Model, Color, Storage

Cluster Center (average weight): 0.08 0.11 0 0.13 0 0.06 0.05 0 0.03

8GB 620 GREEN 5MP CAMERA PHONE

1 Item: NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE

Accurate Item Classifications

Average improvement across 20 different runs: 46% vs. Standard K-Means, 23% vs. Multiple K-Means

Page 66: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Overview – Example


Use Case Example

Apply GA to K-Means

Global Optimum

Standard K-Means

Local Optimum

Page 67: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Questions & Answers

Open Discussion

?

Page 68: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Conclusion

Summing it all up

Page 69: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Conclusion


Use Case Example

Apply GA to K-Means

Global Optimum

Standard K-Means

Local Optimum

+50% Accuracy

Page 70: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Thank You!

Or Levi

Data Analyst

Catalog & Classification

eBay Structured Data

[email protected]

Linked

Page 71: Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization

Appendix – Genetic Algorithm Parameters

[Chart: normalized score (0 to 100) of total within-cluster variance, average of 5 runs; axes: crossover probability (65% to 90%) and mutation probability (5% to 10%)]

Population Size: 10

Number of Generations: 10

Crossover Probability: 75%

Mutation Probability: 9%
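To show how these parameters fit together, here is a hedged sketch of a genetic K-Means loop using the building blocks named in the slides: a K-Means iteration, fitness as total within-cluster variance, elitism, fitness proportionate selection, arithmetic crossover, and mutation. Several details are assumptions rather than the talk's method: a solution is encoded as a list of K centers, selection weights are the inverse of the variance (because lower is better), crossover pairs centers by position, and mutation replaces a center with a random data point. The sample points are made up.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fitness(centers, points):
    """Total within-cluster variance, charging each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def k_means_step(centers, points):
    """One K-Means iteration: assign points, then move centers to cluster means."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return [tuple(sum(v) / len(c) for v in zip(*c)) if c else old
            for c, old in zip(clusters, centers)]


def arithmetic_crossover(a, b, alpha):
    """Blend two center lists position by position (assumed pairing)."""
    child_1 = [tuple(alpha * x + (1 - alpha) * y for x, y in zip(ca, cb))
               for ca, cb in zip(a, b)]
    child_2 = [tuple((1 - alpha) * x + alpha * y for x, y in zip(ca, cb))
               for ca, cb in zip(a, b)]
    return child_1, child_2


def mutate(centers, points, probability, rng):
    """Assumed mutation: replace a center with a random data point."""
    return [rng.choice(points) if rng.random() < probability else c
            for c in centers]


def genetic_k_means(points, k, rng, population_size=10, generations=10,
                    crossover_probability=0.75, mutation_probability=0.09):
    population = [rng.sample(points, k) for _ in range(population_size)]
    best, best_per_generation = None, []
    for _ in range(generations):
        population = [k_means_step(ind, points) for ind in population]
        scores = [fitness(ind, points) for ind in population]
        best = population[scores.index(min(scores))]
        best_per_generation.append(min(scores))
        weights = [1.0 / (s + 1e-12) for s in scores]  # lower variance, higher weight
        next_population = [best]  # elitism: the best solution always survives
        while len(next_population) < population_size:
            a, b = rng.choices(population, weights=weights, k=2)
            if rng.random() < crossover_probability:
                a, b = arithmetic_crossover(a, b, rng.random())
            next_population.append(mutate(a, points, mutation_probability, rng))
            if len(next_population) < population_size:
                next_population.append(mutate(b, points, mutation_probability, rng))
        population = next_population
    return best, best_per_generation


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
best, best_per_generation = genetic_k_means(points, 3, random.Random(0))
```

Because the best solution is carried over unchanged and a K-Means iteration cannot raise its variance, the best fitness per generation never gets worse in this sketch.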