[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

go.indeed.com/IndeedEngTalks

Upload: indeedeng

Post on 21-Nov-2014


DESCRIPTION

Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE

Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed. This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.

TRANSCRIPT

Page 2: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Machine Learning at Indeed

Scaling Decision Trees

Page 3: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Andrew Hudson, CTO

Page 4: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

I help people get jobs.

Page 5: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Indeed is a Search Engine for Jobs

Page 6: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees
Page 7: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Which jobs to show?

Page 8: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees
Page 9: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

18,749 jobs

Page 10: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Which jobs to show?

Maximize job seeker’s chance to get the job

Page 11: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Which jobs to show?

Maximize job seeker’s chance to get the job

● Will job seeker click on the job?
● Is the job still available?
● Will job seeker apply to the job?
● Is job seeker qualified for the job?

Page 13: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

How?

Log job seeker behavior

Analyze logs: what best explains why they clicked on some jobs and not on others?

May help predict future behavior

Page 14: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

How?

Log job seeker behavior

Analyze logs: what best explains why they clicked on some jobs and not on others?

May help predict future behavior

Supervised learning

Page 15: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Supervised Learning Approaches

Neural networks
Bayesian methods
Decision trees
Genetic programming
Logistic model tree
Nearest neighbor
Support Vector Machines
Random forests
Boosting
Bagging
Regression
Ensemble methods

Page 17: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Supervised Learning Approaches

Decision trees
Genetic programming
Logistic model tree
Random forests
Boosting
Bagging
Ensemble methods

Page 18: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision Trees

Page 19: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

What is a Decision Tree?

A tree-like structure that presents a sequence of relevant questions, which determine a path and ultimately some outcome or prediction

Page 20: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

I’m Thinking About Buying a Laptop

Page 21: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

I’m Thinking About Buying a Laptop

Is quality important?

Page 27: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

I’m Thinking About Buying a Laptop

Is quality important?
  NO → ASUS -or whatever woot has
  YES → Want to run linux?
    NO → MACBOOK
    YES → LENOVO
    HELL YES → SYSTEM76
    IDGAF → DELL

Page 28: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Benefits of Decision Trees

The algorithm is relatively simple to understand and implement

The model produced is also human-understandable

Page 29: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision Tree Learning

Programmatic creation of decision trees

Page 30: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision Tree Learning

Given a set of documents, split it into two or more subsets that optimize some criteria

Repeat this process until a set can no longer be split

Page 31: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Titanic Example

1309 passengers
500 survivors
38.2% survival rate

What best explains who survived?

Page 32: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

class: class of ticket; first, second or third

fsize: family size; number of family members onboard

gender: male or female

What best explains who survived?

Page 38: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

1309 passengers, 500 survivors, 38.2% survival

class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival

Score = ?

Page 39: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Score

conditional entropy

Page 40: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Conditional Entropy as Score

Lower conditional entropy
↓
Less uncertainty about the prediction, given the term
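To make the score concrete, here is a minimal sketch of conditional entropy for a two-way split. The function names are made up for illustration, and the use of the natural logarithm is an assumption chosen because it reproduces the class and fsize scores shown on the slides (0.6267, 0.6244, 0.6448) to within rounding. Applied to the gender split (466 passengers, 339 survivors), the same formula gives roughly 0.52, a little below the 0.5525 printed on the slides, so the talk's exact scoring may differ in some detail.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a yes/no outcome that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def split_score(count, vsum, total_count, total_value):
    """Conditional entropy of the outcome given a two-way split.

    count / vsum: documents and positive outcomes on the matching side.
    total_count / total_value: the same totals for the whole node.
    """
    sides = ((count, vsum), (total_count - count, total_value - vsum))
    score = 0.0
    for n, v in sides:
        if n > 0:
            # Weight each side's entropy by the share of documents it holds.
            score += (n / total_count) * binary_entropy(v / n)
    return score

# Titanic root node: 1309 passengers, 500 survivors.
class1 = split_score(323, 200, 1309, 500)      # split on class = 1
class_le2 = split_score(600, 319, 1309, 500)   # split on class <= 2
fsize0 = split_score(790, 239, 1309, 500)      # split on fsize = 0
print(round(class1, 4), round(class_le2, 4), round(fsize0, 4))
```

A lower value means the split leaves less uncertainty about survival, which is why class ≤ 2 (0.6244) displaces class = 1 (0.6267) as the best split so far.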

Page 42: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

1309 passengers, 500 survivors, 38.2% survival

class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival

Score = 0.6267

Best Score: 0.6267, class = 1

Page 47: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.6267, class = 1

1309 passengers, 500 survivors, 38.2% survival

class ≤ 2: 600 passengers, 319 survivors, 53.2% survival
class > 2: 709 passengers, 181 survivors, 25.5% survival

Score = 0.6244

Page 48: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.6244, class ≤ 2

1309 passengers, 500 survivors, 38.2% survival

class ≤ 2: 600 passengers, 319 survivors, 53.2% survival
class > 2: 709 passengers, 181 survivors, 25.5% survival

Score = 0.6244

Page 49: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.6244, class ≤ 2

1309 passengers, 500 survivors, 38.2% survival

class ≠ 3: 600 passengers, 319 survivors, 53.2% survival
class = 3: 709 passengers, 181 survivors, 25.5% survival

Score = 0.6244

Page 53: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.6244, class ≤ 2

1309 passengers, 500 survivors, 38.2% survival

gender = female: 466 passengers, 339 survivors, 72.7% survival
gender ≠ female: 843 passengers, 161 survivors, 19.1% survival

Score = 0.5525

Page 54: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.5525, gender=f

1309 passengers, 500 survivors, 38.2% survival

gender = female: 466 passengers, 339 survivors, 72.7% survival
gender ≠ female: 843 passengers, 161 survivors, 19.1% survival

Score = 0.5525

Page 55: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.5525, gender=f

1309 passengers, 500 survivors, 38.2% survival

fsize = 0: 790 passengers, 239 survivors, 30.3% survival
fsize ≠ 0: 519 passengers, 261 survivors, 50.3% survival

Score = 0.6448

Page 56: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Best Score: 0.5525, gender=f

Page 66: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

19.1% survival

72.7% survival

Page 76: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

gender=male: 843 passengers, 161 survivors, 19.1% survival

class = 1: 179 passengers, 61 survivors, 34.1% survival
class ≠ 1: 664 passengers, 100 survivors, 15.1% survival

Score = 0.4700

Page 77: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

class = 1 class ≠ 1

Page 84: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

15.1% survival

34.1% survival

Page 90: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

38.2% (all passengers)
  MALE: 19.1%
    CLASS=1: 34.1%
    CLASS≠1: 15.1%
      FSIZE=2: 33.9%
      FSIZE≠2: 13.1%
  FEMALE: 72.7%
    CLASS<=2: 93.2%
    CLASS>2: 49.1%
      FSIZE<=2: 54.9%
      FSIZE>2: 24.4%

Page 91: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Predicting Click Probabilities

Passenger → Job Impression
Survived → Clicked on Job

For each candidate job, follow its path through the tree, then take the click-through rate of the terminal node

Page 92: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

sales?
  YES → representative?
    YES → outside?
      YES → 5.1%
      NO → service?
        YES → 2.9%
        NO → inside?
          YES → 4.4%
          NO → 4.6%
    NO → manager?
      YES → 2.9%
      NO → associate?
        YES → 1.8%
        NO → 2.6%
  NO → account?
    YES → 3.8%
    NO → manager?
      YES → 2.1%
      NO → 1.9%

Simplified Decision Tree for query="sales"

Page 93: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “sales representative”

sales YES → representative YES → outside NO → service NO → inside NO → 4.6%

Page 94: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “account executive”

sales NO → account YES → 3.8%

Page 95: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “outside sales representative”

sales YES → representative YES → outside YES → 5.1%

Page 96: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “sales associate”

sales YES → representative NO → manager NO → associate YES → 1.8%

Page 97: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “inside sales representative”

sales YES → representative YES → outside NO → service NO → inside YES → 4.4%

Page 98: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “sales manager”

sales YES → representative NO → manager YES → 2.9%

Page 99: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “sales consultant”

sales YES → representative NO → manager NO → associate NO → 2.6%

Page 100: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “store manager”

sales NO → account NO → manager YES → 2.1%

Page 101: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “service sales representative”

sales YES → representative YES → outside NO → service YES → 2.9%

Page 102: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

job title = “customer service representative”

sales NO → account NO → manager NO → 1.9%

Page 103: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Final CTR Predictions

5.1% outside sales representative
4.6% sales representative
4.4% inside sales representative
3.8% account executive
2.9% sales manager
2.9% service sales representative
2.6% sales consultant
2.1% store manager
1.9% customer service representative
1.8% sales associate
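The lookup behind these predictions can be sketched in a few lines. This is only an illustration: the nested-tuple encoding and the `predict_ctr` name are hypothetical, and it assumes each node simply asks whether a word appears in the job title. The tree is the simplified query="sales" tree from the slides.

```python
# Internal node: (word, subtree_if_word_present, subtree_if_absent).
# Leaf: click-through rate in percent.
SALES_TREE = (
    "sales",
    ("representative",
        ("outside", 5.1,
            ("service", 2.9,
                ("inside", 4.4, 4.6))),
        ("manager", 2.9,
            ("associate", 1.8, 2.6))),
    ("account", 3.8,
        ("manager", 2.1, 1.9)),
)

def predict_ctr(tree, job_title):
    """Follow the YES/NO path for a job title and return the leaf's CTR."""
    words = set(job_title.lower().split())
    node = tree
    while isinstance(node, tuple):
        word, yes_branch, no_branch = node
        node = yes_branch if word in words else no_branch
    return node

titles = [
    "sales associate", "store manager", "outside sales representative",
    "account executive", "customer service representative",
    "sales representative", "inside sales representative",
    "sales manager", "service sales representative", "sales consultant",
]
# Rank candidate jobs by predicted click-through rate, highest first.
ranked = sorted(titles, key=lambda t: predict_ctr(SALES_TREE, t), reverse=True)
for title in ranked:
    print(f"{predict_ctr(SALES_TREE, title)}% {title}")
```

Note that "customer service representative" never reaches the "service" or "representative" questions, because it takes the NO branch at the root.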

Page 104: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Single Machine Implementation

Page 105: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Overview

Page 106: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Tree Building Strategies

One node at a time
- depth first
- breadth first

Page 114: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Depth First

Nodes numbered in the order they are created:
1 → 2, 3
2 → 4, 5
4 → 6, 7
3 → 8, 9

Page 122: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Breadth First

Nodes numbered in the order they are created:
1 → 2, 3
2 → 4, 5
3 → 6, 7
4 → 8, 9
5 → 10, 11
6 → 12, 13

Page 123: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Tree Building Strategies

One node at a time
- depth first
- breadth first

One layer at a time, all nodes simultaneously
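The two one-node-at-a-time orders differ only in which pending node is split next. The sketch below is a toy, not the talk's implementation: `split_order` is a hypothetical helper that pretends every node above `max_depth` can be split and numbers children in creation order, as the slides do.

```python
from collections import deque

def split_order(max_depth, strategy):
    """Return (parent, left, right) ids in the order nodes get split."""
    next_id = 2
    frontier = deque([(1, 1)])  # (node id, depth); the root is node 1
    splits = []
    while frontier:
        # Depth first takes the newest pending node, breadth first the oldest.
        node, depth = frontier.pop() if strategy == "depth" else frontier.popleft()
        if depth >= max_depth:
            continue  # treated as a leaf in this toy
        left, right = next_id, next_id + 1
        next_id += 2
        splits.append((node, left, right))
        if strategy == "depth":
            # Push right first so the left child is the next one popped.
            frontier.append((right, depth + 1))
            frontier.append((left, depth + 1))
        else:
            frontier.append((left, depth + 1))
            frontier.append((right, depth + 1))
    return splits

print(split_order(4, "depth")[:3])
print(split_order(4, "breadth")[:6])
```

Building one layer at a time visits nodes in the same order as breadth first, but evaluates every node of a layer in a single pass over the data rather than one pass per node.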

Page 132: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

iter #1: 1
iter #2: 2 3
iter #3: 4 5 6 7
iter #4: 8 9 0 10 11 12 13

Page 134: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Data Format

id  class  fsize  gender  survived
0   1      0      f       1
1   1      3      m       1
2   1      3      f       0
3   1      3      m       0
4   1      3      f       0
5   1      0      m       1
6   1      1      f       1
7   1      0      m       0
8   1      2      f       1
9   1      0      m       0
10  1      1      m       0
11  1      1      f       1
12  1      0      f       1
13  1      0      f       1
14  1      0      m       1
15  1      0      m       0
16  1      1      m       0
17  1      1      f       1
18  1      0      f       1
19  1      0      m       0
….

Page 135: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Data Format

Create an inverted index

Key to efficiently building one layer at a time

Page 136: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

Maps terms to the list of documents that contain that term

Terms and docs stored in sorted order
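As a rough illustration, the inverted index can be built from the row-oriented table in a few lines. The sketch uses the first 20 rows of the Titanic table shown in the talk; `build_inverted_index` and the dict-of-dicts layout are illustrative choices, not the Lucene or Flamdex format.

```python
from collections import defaultdict

# First 20 rows of the Titanic table: (id, class, fsize, gender, survived).
ROWS = [
    (0, 1, 0, "f", 1), (1, 1, 3, "m", 1), (2, 1, 3, "f", 0), (3, 1, 3, "m", 0),
    (4, 1, 3, "f", 0), (5, 1, 0, "m", 1), (6, 1, 1, "f", 1), (7, 1, 0, "m", 0),
    (8, 1, 2, "f", 1), (9, 1, 0, "m", 0), (10, 1, 1, "m", 0), (11, 1, 1, "f", 1),
    (12, 1, 0, "f", 1), (13, 1, 0, "f", 1), (14, 1, 0, "m", 1), (15, 1, 0, "m", 0),
    (16, 1, 1, "m", 0), (17, 1, 1, "f", 1), (18, 1, 0, "f", 1), (19, 1, 0, "m", 0),
]
FIELDS = ("class", "fsize", "gender", "survived")

def build_inverted_index(rows):
    """Map field -> term -> sorted list of ids of the docs containing that term."""
    index = {field: defaultdict(list) for field in FIELDS}
    # Visiting rows in id order keeps every doc list sorted.
    for doc_id, *values in sorted(rows):
        for field, term in zip(FIELDS, values):
            index[field][term].append(doc_id)
    # Store the terms of each field in sorted order as well.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}

index = build_inverted_index(ROWS)
print(index["fsize"][0])
print(index["gender"]["f"])
```

Because the doc lists are sorted, each term can be scanned sequentially, which is what makes a per-term pass over all tree nodes cheap.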

Page 142: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13….
class=2 → 323,324,325,326,327,328,329….
class=3 → 600,601,602,603,604,605,606….

Parts highlighted in turn on the slides: Field, Term, Docs

Page 143: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

fsize=0 → 0,5,7,9,12,13,14,15,18,19,22….
fsize=1 → 6,10,11,16,17,26,27,36,49,50….
fsize=2 → 8,20,21,42,76,77,78,79,81,82….
fsize=3 → 1,2,3,4,54,55,56,57,90,339….
fsize=4 → 249,250,251,252,253,449,806….
….

Page 144: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

gender=f → 0,2,4,6,8,11,12,13,17,18,21….
gender=m → 1,3,5,7,9,10,14,15,16,19,20….

Page 145: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

survived=0 → 2,3,4,7,9,10,15,16,19,25….
survived=1 → 0,1,5,6,8,11,12,13,14,17….

Page 146: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index Implementations

Lucene

Flamdex

Page 147: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Primary Lookup Tables

groups[doc]
Where in the tree each doc is
Initialized to all ones, all docs start in root

values[doc]
Value to be classified, for each doc
In this case it’s 1 if survived, 0 otherwise

Page 148: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Primary Lookup Tables

values[doc]

Constructed from an inverted index of the values

Invert the field of interest (e.g. survived)
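A small sketch of how the two lookup tables might be set up, assuming the `survived` index is available as a plain dict from term to sorted doc ids (a hypothetical layout; `build_lookup_tables` is an illustrative name). Only the first nine docs from the slides are used.

```python
def build_lookup_tables(num_docs, target_index):
    """Build groups[doc] and values[doc] from an inverted index of the target field."""
    groups = [1] * num_docs   # every doc starts in the root, group 1
    values = [0] * num_docs
    # "Invert the inverted index": walk each term's doc list and
    # write the term back into a per-doc array.
    for term, docs in target_index.items():
        for doc in docs:
            values[doc] = term  # here the term itself (0 or 1) is the value
    return groups, values

# survived=0 and survived=1 doc lists for docs 0..8, as on the slides.
survived_index = {0: [2, 3, 4, 7], 1: [0, 1, 5, 6, 8]}
groups, values = build_lookup_tables(9, survived_index)
print(groups)
print(values)
```

Flat arrays indexed by doc id keep the inner loop to two array reads per document.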

Page 149: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop Overview

foreach field
    foreach term
        get group stats
        evaluate splits
apply best splits
repeat n times or until no more splits found

Page 152: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop - First Iteration

foreach field (class, fsize, gender)
    foreach term (class=1,class=2,class=3...)
        get group stats

Page 153: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

count[grp]
Count of how many documents within that group contain current term, initialized to zeros

vsum[grp]
Summation of the value to be classified from the documents within that group that contain current term, initialized to zeros

Page 158: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term
    foreach doc
        grp = grps[doc]
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc]
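The pseudocode translates almost line for line into Python. In this sketch the function signature and the `num_groups` parameter are assumptions, and group 0 is taken to mean "no longer being split", which the slides imply but do not spell out.

```python
def get_group_stats(docs, grps, vals, num_groups):
    """Per-group count and value sum over the docs that contain the current term."""
    count = [0] * (num_groups + 1)  # index 0 is unused: group 0 is skipped
    vsum = [0] * (num_groups + 1)
    for doc in docs:
        grp = grps[doc]
        if grp == 0:
            continue  # doc is not in any node still being split
        count[grp] += 1
        vsum[grp] += vals[doc]
    return count, vsum

# First six docs of class=1; all docs are still in the root (group 1).
docs = [0, 1, 2, 3, 4, 5]
grps = [1, 1, 1, 1, 1, 1]
vals = [1, 1, 0, 0, 0, 1]
count, vsum = get_group_stats(docs, grps, vals, num_groups=1)
print(count[1], vsum[1])
```

One pass over a term's doc list updates the statistics for every node in the current layer at once, since each doc carries its own group number.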

Page 171: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=1)
    foreach doc (0,1,2,3,4,5,6,7,8...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)

Running totals as each doc is visited:
count[1] = 0, vsum[1] = 0
count[1] = 1, vsum[1] = 1
count[1] = 2, vsum[1] = 2
count[1] = 3, vsum[1] = 2
count[1] = 4, vsum[1] = 2
count[1] = 5, vsum[1] = 2
count[1] = 6, vsum[1] = 3
…
count[1] = 323, vsum[1] = 200

Page 176: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Group 1: 1309 passengers, 500 survivors, 38.2% survival

class = 1: 323 passengers (count[1]), 200 survivors (vsum[1])

Page 177: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=2)
    foreach doc (323,324,325,326,327,328,329...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…)

…
count[1] = 277, vsum[1] = 119

Page 178: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=3)
    foreach doc (600,601,602,603,604,605,606...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…)

…
count[1] = 709, vsum[1] = 181

Page 179: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop - First Iteration

foreach field (class, fsize, gender)
  foreach term (class=1,class=2,class=3...)
    get group stats
    evaluate splits

Page 180: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Evaluate Splits

Consider current field/term as a potential split for each group

1) check if split is admissible
   balance check, significance check

2) score the split
   conditional entropy or some other heuristic

3) keep best scoring split

Page 181: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Evaluate Splits

totalcount[group] / totalvalue[group]
Total number of documents and total values for each group, i.e. # passengers / # survivors

bestsplit[group] / bestscore[group]
Current best split and score for each group, initially nulls

Page 182: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

foreach field/term (class=1)
  get group stats (count[1]=323,vsum[1]=200)
  foreach group
    if not admissible( … ) skip
    score = calcscore(cnt[grp], vsum[grp],
                      totcnt[grp], totval[grp])
    if score < bestscore[grp]
      bestscore[grp] = score
      bestsplit[grp] = field/term
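
A minimal Python sketch of the split evaluation above, assuming conditional entropy as the score (so lower is better, matching `score < bestscore[grp]`). The `admissible` check and its `min_docs` threshold are hypothetical stand-ins for the balance and significance checks, which the slides do not spell out. The stats are the first-iteration numbers from the slides.

```python
import math

def entropy(pos, n):
    """Binary entropy in bits of a set with `pos` positives out of `n` docs."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def calcscore(cnt, vsum, totcnt, totval):
    """Conditional entropy of the outcome given the split; lower is better."""
    rest_cnt, rest_val = totcnt - cnt, totval - vsum
    return (cnt * entropy(vsum, cnt) + rest_cnt * entropy(rest_val, rest_cnt)) / totcnt

def admissible(cnt, totcnt, min_docs=10):
    """Hypothetical balance check: both sides need at least `min_docs` docs."""
    return cnt >= min_docs and totcnt - cnt >= min_docs

def evaluate_split(term, count, vsum, totcnt, totval, bestscore, bestsplit):
    """Consider `term` as a potential split for every group it touches."""
    for grp, cnt in count.items():
        if not admissible(cnt, totcnt[grp]):
            continue
        score = calcscore(cnt, vsum[grp], totcnt[grp], totval[grp])
        if score < bestscore.get(grp, math.inf):
            bestscore[grp] = score
            bestsplit[grp] = term

# Group 1 holds all 1309 passengers, 500 of them survivors.
totcnt, totval = {1: 1309}, {1: 500}
bestscore, bestsplit = {}, {}
stats = [("class=1", 323, 200), ("class=2", 277, 119),
         ("class=3", 709, 181), ("gender=f", 466, 339)]
for term, cnt, vs in stats:
    evaluate_split(term, {1: cnt}, {1: vs}, totcnt, totval, bestscore, bestsplit)
print(bestsplit, bestscore)
```

With these numbers the sketch picks gender=f for group 1, the same split the slides apply next.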

Page 184: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop - First Iteration

foreach field (class, fsize, gender)
  foreach term (class=1,class=2,class=3...)
    get group stats
    evaluate splits
apply best splits (bestsplit[1]=“gender=f”)

Page 185: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

Page 186: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

target group: 1

[Tree diagram: group 1]

Page 187: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

target group: 1
condition: gender=female

[Tree diagram: group 1]

Page 188: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

target group: 1
condition: gender=female
positive group: 3

[Tree diagram: group 1 → group 3 (positive)]

Page 189: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group

target group: 1
condition: gender=female
positive group: 3
negative group: 2

[Tree diagram: group 1 → group 2 (negative), group 3 (positive)]

Page 191: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

Using inverted index, iterate over docs that match split condition

If current document is in targeted group, move it to the positive group

At the end, move anything left in target group to negative group
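
The three steps above can be sketched as follows. This is a minimal single-machine sketch; the function name `apply_split` and its parameter names are assumptions, and the posting list is the gender=f list from the slides, truncated to the first 21 documents.

```python
def apply_split(grps, postings, target, positive, negative):
    """Regroup docs in `target`: matches go to `positive`, the rest to `negative`."""
    # Pass 1: walk the inverted-index postings for the split condition.
    for doc in postings:
        if grps[doc] == target:
            grps[doc] = positive
    # Pass 2: whatever is still in the target group did not match.
    for doc, grp in enumerate(grps):
        if grp == target:
            grps[doc] = negative
    return grps

grps = [1] * 21                               # docs 0..20 all start in group 1
female = [0, 2, 4, 6, 8, 11, 12, 13, 17, 18]  # gender=f postings below doc 21
apply_split(grps, female, target=1, positive=3, negative=2)
print(grps)
```

Docs outside the target group are never touched, so splits for different groups can be applied one after another on the same array.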

Page 192: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 1   group[7] = 1    group[14] = 1
group[1] = 1   group[8] = 1    group[15] = 1
group[2] = 1   group[9] = 1    group[16] = 1
group[3] = 1   group[10] = 1   group[17] = 1
group[4] = 1   group[11] = 1   group[18] = 1
group[5] = 1   group[12] = 1   group[19] = 1
group[6] = 1   group[13] = 1   group[20] = 1

Page 196: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 3   group[7] = 1    group[14] = 1
group[1] = 1   group[8] = 1    group[15] = 1
group[2] = 1   group[9] = 1    group[16] = 1
group[3] = 1   group[10] = 1   group[17] = 1
group[4] = 1   group[11] = 1   group[18] = 1
group[5] = 1   group[12] = 1   group[19] = 1
group[6] = 1   group[13] = 1   group[20] = 1

Page 197: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 3   group[7] = 1    group[14] = 1
group[1] = 1   group[8] = 1    group[15] = 1
group[2] = 3   group[9] = 1    group[16] = 1
group[3] = 1   group[10] = 1   group[17] = 1
group[4] = 1   group[11] = 1   group[18] = 1
group[5] = 1   group[12] = 1   group[19] = 1
group[6] = 1   group[13] = 1   group[20] = 1

Page 198: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 3   group[7] = 1    group[14] = 1
group[1] = 1   group[8] = 1    group[15] = 1
group[2] = 3   group[9] = 1    group[16] = 1
group[3] = 1   group[10] = 1   group[17] = 1
group[4] = 3   group[11] = 1   group[18] = 1
group[5] = 1   group[12] = 1   group[19] = 1
group[6] = 1   group[13] = 1   group[20] = 1

Page 200: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 3   group[7] = 1    group[14] = 1
group[1] = 1   group[8] = 3    group[15] = 1
group[2] = 3   group[9] = 1    group[16] = 1
group[3] = 1   group[10] = 1   group[17] = 3
group[4] = 3   group[11] = 3   group[18] = 3
group[5] = 1   group[12] = 3   group[19] = 1
group[6] = 3   group[13] = 3   group[20] = 1

Page 203: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Apply Best Splits

gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….

group[0] = 3   group[7] = 2    group[14] = 2
group[1] = 2   group[8] = 3    group[15] = 2
group[2] = 3   group[9] = 2    group[16] = 2
group[3] = 2   group[10] = 2   group[17] = 3
group[4] = 3   group[11] = 3   group[18] = 3
group[5] = 2   group[12] = 3   group[19] = 2
group[6] = 3   group[13] = 3   group[20] = 2

Page 204: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits
repeat n times or until no more splits found
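
Putting the pieces together, the whole main loop can be sketched in Python on a toy inverted index. This is a sketch under stated assumptions, not the production code: `build_tree`, `min_docs`, the group-numbering scheme, and the eight-document data set are all hypothetical, and conditional entropy stands in for whatever scoring heuristic is actually used.

```python
import math
from collections import defaultdict

def entropy(pos, n):
    """Binary entropy in bits of `pos` positives out of `n` docs."""
    if n == 0 or pos in (0, n):
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def build_tree(index, vals, max_iters=3, min_docs=1):
    """index: {field: {term: doc ids}}; vals[doc] is the 0/1 outcome.

    Returns the final grps array and the list of applied splits as
    (target group, field, term, positive group, negative group).
    """
    grps = [1] * len(vals)                  # every doc starts in group 1
    next_group, splits = 2, []
    for _ in range(max_iters):              # repeat n times ...
        totcnt, totval = defaultdict(int), defaultdict(int)
        for doc, grp in enumerate(grps):
            totcnt[grp] += 1
            totval[grp] += vals[doc]
        best = {}                           # grp -> (score, field, term)
        for field, terms in index.items():
            for term, postings in terms.items():
                count, vsum = defaultdict(int), defaultdict(int)
                for doc in postings:                    # get group stats
                    count[grps[doc]] += 1
                    vsum[grps[doc]] += vals[doc]
                for grp, cnt in count.items():          # evaluate splits
                    rest = totcnt[grp] - cnt
                    if cnt < min_docs or rest < min_docs:
                        continue                        # not admissible
                    score = (cnt * entropy(vsum[grp], cnt)
                             + rest * entropy(totval[grp] - vsum[grp], rest)) / totcnt[grp]
                    # A split must beat the group's own entropy to count.
                    current = best.get(grp, (entropy(totval[grp], totcnt[grp]),))[0]
                    if score < current - 1e-12:
                        best[grp] = (score, field, term)
        if not best:
            break                           # ... or until no more splits found
        for grp, (_, field, term) in best.items():      # apply best splits
            neg, pos = next_group, next_group + 1
            next_group += 2
            for doc in index[field][term]:
                if grps[doc] == grp:
                    grps[doc] = pos
            for doc in range(len(grps)):
                if grps[doc] == grp:
                    grps[doc] = neg
            splits.append((grp, field, term, pos, neg))
    return grps, splits

# Toy data: 8 docs, even docs are female, docs 0-3 are class 1.
index = {"class": {"1": [0, 1, 2, 3], "3": [4, 5, 6, 7]},
         "gender": {"f": [0, 2, 4, 6], "m": [1, 3, 5, 7]}}
vals = [1, 0, 1, 0, 1, 0, 0, 0]
grps, splits = build_tree(index, vals)
print(grps, splits)
```

Each iteration scans every posting list once and grows the tree by one level for all groups at the same time.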

Page 205: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

1

Page 206: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

iter #1

1

Page 207: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

iter #1

gender = female

1

Page 208: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

iter #1

[Tree diagram: group 1 splits on gender; gender = female → group 3, gender ≠ female → group 2]

Page 209: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

[Tree diagram: iter #1 split group 1 into groups 2 and 3; iter #2 operates on groups 2 and 3]

Page 210: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop - Second Iteration

foreach field (class, fsize, gender)
  foreach term (class=1,class=2,class=3...)
    get group stats

Page 211: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=1)
  foreach doc (0,1,2,3,4,5,6,7,8...)
    grp = grps[doc]          (3,2,3,2,3,2,3,2,3…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (1,1,0,0,0,1,1,0,1…)

Page 212: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=1)
  foreach doc (0,1,2,3,4,5,6,7,8...)
    grp = grps[doc]          (3,2,3,2,3,2,3,2,3…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (1,1,0,0,0,1,1,0,1…)

…
count[2] = 179, vsum[2] = 61
count[3] = 144, vsum[3] = 139

Page 213: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=2)
  foreach doc (323,324,325,326,327,328,329...)
    grp = grps[doc]          (2,3,2,2,2,2,3,2,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (0,1,0,0,0,0,1,0,1…)

…
count[2] = 171, vsum[2] = 25
count[3] = 106, vsum[3] = 94

Page 214: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (class=3)
  foreach doc (600,601,602,603,604,605,606...)
    grp = grps[doc]          (2,2,2,3,3,2,2,3,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (0,0,0,1,1,1,1,1,0…)

…
count[2] = 493, vsum[2] = 75
count[3] = 216, vsum[3] = 106

Page 215: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (gender=female)
  foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….)
    grp = grps[doc]          (3,3,3,3,3,3,3,3,3,3,3,3…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (1,0,0,1,1,1,1,1,1…)

…
count[2] = 0, vsum[2] = 0
count[3] = 466, vsum[3] = 339

Page 216: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Get Group Stats

for current field/term (gender=male)
  foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...)
    grp = grps[doc]          (2,2,2,2,2,2,2,2,2,2,2…)
    if grp == 0 skip
    count[grp]++
    vsum[grp] += vals[doc]   (1,0,1,0,0,0,1,0,0…)

…
count[2] = 843, vsum[2] = 161
count[3] = 0, vsum[3] = 0

Page 217: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

What About Inequality Splits?

e.g. class ≤ 2

Page 218: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop + Inequality Splits

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

Page 219: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop + Inequality Splits

foreach field
  reset inequality stats
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

Page 220: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop + Inequality Splits

foreach field
  reset inequality stats
  foreach term
    get group stats
    update inequality stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

Page 221: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop + Inequality Splits

foreach field
  reset inequality stats
  foreach term
    get group stats
    update inequality stats
    evaluate splits
    evaluate inequality splits
apply best splits for each group
repeat n times or until no more splits found
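
The slides do not say what the inequality stats contain. One plausible reading, sketched below as an assumption, is a running sum: if a field's terms are visited in ascending order, adding each term's count and vsum to an accumulator gives the stats for `field ≤ term` with no extra pass over the documents. The function name `inequality_stats` is hypothetical; the input numbers are the class stats for group 1 from the first iteration.

```python
def inequality_stats(term_stats):
    """Turn per-term stats into cumulative stats for `field <= term`.

    term_stats: (term, count, vsum) tuples for one field and one group,
    in ascending term order.
    """
    cum_count = cum_vsum = 0        # "reset inequality stats" for a new field
    for term, count, vsum in term_stats:
        cum_count += count          # "update inequality stats"
        cum_vsum += vsum
        yield term, cum_count, cum_vsum

class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]
for term, cnt, vs in inequality_stats(class_stats):
    print(f"class <= {term}: count={cnt}, vsum={vs}")
```

Under this reading, class ≤ 2 covers 600 passengers with 319 survivors, and the cumulative pair can be scored exactly like an equality split.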

Page 222: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Scalability

Performs quite well on a single machine

Worked well for a while, but started to hit limits

Ultimately needed to distribute to multiple machines

Page 223: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Multiple Machine Implementation

Page 224: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Hadoop?

Page 225: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Hadoop

Experimented with using Hadoop

Each level took five sequential map reduce jobs

Much slower than single machine; repeatedly writes intermediate data and lots of shuffling

Page 226: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Hadoop

Experimented with using Hadoop

Each level took five sequential map reduce jobs

Much slower than single machine; repeatedly writes intermediate data and lots of shuffling

Hadoop not great for iterative algorithms

Page 227: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Partition Data

Page 228: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees
Page 229: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

Page 230: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

Page 231: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

Page 232: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Inverted Index

Shard 1 Shard 2

Page 233: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

Machine 1 Machine 2

Page 234: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

Page 236: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

FTGS

Page 240: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream - Single Machine

class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsize=2|1|159|90
fsize=3|1|43|30
fsize=4|1|22|6
fsize=5|1|25|5
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|466|339
gender=m|1|843|161
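
Each record in the stream above reads as field=term|group|count|vsum, so FTGS presumably abbreviates field, term, group, stats. A minimal sketch of producing such a stream on one machine follows; the generator name `ftgs_stream` and the five-document index are assumptions made for illustration.

```python
def ftgs_stream(index, grps, vals):
    """Yield (field, term, group, count, vsum), sorted by field, then term."""
    for field in sorted(index):
        for term in sorted(index[field]):
            count, vsum = {}, {}
            for doc in index[field][term]:
                grp = grps[doc]
                if grp == 0:
                    continue                     # doc is filtered out
                count[grp] = count.get(grp, 0) + 1
                vsum[grp] = vsum.get(grp, 0) + vals[doc]
            for grp in sorted(count):
                yield field, term, grp, count[grp], vsum[grp]

# Hypothetical five-document index; all docs start in group 1.
index = {"gender": {"f": [0, 2, 4], "m": [1, 3]},
         "class": {1: [0, 1], 3: [2, 3, 4]}}
grps = [1] * 5
vals = [1, 0, 1, 0, 0]
lines = [f"{f}={t}|{g}|{c}|{v}" for f, t, g, c, v in ftgs_stream(index, grps, vals)]
print("\n".join(lines))
```

Emitting records in sorted order is what later allows streams from several machines to be merged without buffering them.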

Page 244: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsize=2|1|159|90
fsize=3|1|43|30
fsize=4|1|22|6
fsize=5|1|25|5
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|466|339
gender=m|1|843|161

Sorted

FTGS Stream - Single Machine

Page 259: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream

How to distribute?

Page 260: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

Machine 1 Machine 2

Page 261: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

FTGS 1

Machine 2

Page 262: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

FTGS 1 FTGS 2

Page 263: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

FTGS 1 FTGS 2

Machine 3

Page 264: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2

FTGS 1 FTGS 2Merge

Machine 3

Page 265: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream Merge

class=1|1|198|111
class=2|1|277|119
class=3|1|511|129
fsize=0|1|790|239
fsize=1|1|94|53
fsize=2|1|75|48
fsize=3|1|21|17
fsize=4|1|3|1
fsize=5|1|3|1
gender=f|1|308|237
gender=m|1|678|122

Machine 1

Page 266: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream Merge

class=1|1|125|89
class=3|1|198|52
fsize=1|1|141|73
fsize=2|1|84|42
fsize=3|1|22|13
fsize=4|1|19|5
fsize=5|1|22|4
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|158|102
gender=m|1|165|39

Machine 2

Page 268: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream Merge

Machine 1: class=1|1|198|111
Machine 2: class=1|1|125|89

Merged (+): class=1|1|323|200

Page 276: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS Stream Merge

Machine 1: fsize=1|1|94|53
Machine 2: fsize=1|1|141|73

Merged (+) so far:
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126

Page 277: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Shard 1 Shard 2 Shard 3 Shard 4 Shard 5 Shard 6

Page 278: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6

Page 279: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6

k-way merge
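
A k-way merge of sorted FTGS streams can be sketched with Python's standard `heapq.merge`: records with the same (field, term, group) key are adjacent after merging, so their counts and sums are simply added. The function name `merge_ftgs` and the tuple representation are assumptions; the sample records are the heads of the Machine 1 and Machine 2 streams from the slides.

```python
import heapq
from itertools import groupby

def merge_ftgs(*streams):
    """Merge sorted FTGS streams, summing stats for equal (field, term, group)."""
    merged = heapq.merge(*streams, key=lambda rec: rec[:3])
    for key, recs in groupby(merged, key=lambda rec: rec[:3]):
        count = vsum = 0
        for _, _, _, c, v in recs:
            count += c
            vsum += v
        yield (*key, count, vsum)

machine1 = [("class", 1, 1, 198, 111), ("class", 2, 1, 277, 119),
            ("class", 3, 1, 511, 129), ("fsize", 0, 1, 790, 239),
            ("fsize", 1, 1, 94, 53)]
machine2 = [("class", 1, 1, 125, 89), ("class", 3, 1, 198, 52),
            ("fsize", 1, 1, 141, 73)]
merged = list(merge_ftgs(machine1, machine2))
for rec in merged:
    print(rec)
```

Because the merge is lazy, only one pending record per input stream is held in memory, and a merged stream can itself feed another merge further up the tree.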

Page 280: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1-6

FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6

Page 281: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1-6 FTGS 7-12 FTGS 13-18

Page 282: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1-6 FTGS 7-12 FTGS 13-18

FTGS 1-18

Page 283: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1-18 FTGS 19-36

FTGS 1-36

Page 284: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

FTGS

Page 287: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Main Loop

foreach field
  foreach term
    get group stats
    evaluate splits
apply best splits for each group
repeat n times or until no more splits found

Regroup

Page 288: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

FTGS 1-6 FTGS 7-12 FTGS 13-18

FTGS

Page 289: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Regroup 1-6 Regroup 7-12 Regroup 13-18

Regroup

Page 292: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep

Page 293: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep

Distributed System that does efficient FTGS and Regroup operations on inverted indexes

Page 294: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep

32 machines

2 CPUs x 6 cores, Xeon Westmere E5649
128GB RAM
10 x 1TB 7200 RPM SATA

Total: 384 cores, 4TB RAM, 320TB disk
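The totals follow directly from the per-machine figures. A quick check, assuming 1TB = 1024GB for RAM (variable names are illustrative only):

```python
machines = 32

cores_per_machine = 2 * 6      # 2 CPUs x 6 cores
ram_gb_per_machine = 128
disk_tb_per_machine = 10 * 1   # 10 disks x 1TB

total_cores = machines * cores_per_machine          # 384
total_ram_tb = machines * ram_gb_per_machine / 1024  # 4.0
total_disk_tb = machines * disk_tb_per_machine       # 320
```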

Page 295: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents

Imhotep

Page 296: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

Imhotep

Page 297: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

First FTGS: 314 seconds
First Regroup: 9.6 seconds

Imhotep

Page 298: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

First FTGS: 314 seconds (36.3 million terms)
First Regroup: 9.6 seconds

Imhotep

Page 299: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

First FTGS: 314 seconds (36.3 million terms)
First Regroup: 9.6 seconds (7 groups)

Imhotep

Page 300: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Decision tree on 13 billion documents
330GB → ~25 bytes per doc

First FTGS: 314 seconds (36.3 million terms)
First Regroup: 9.6 seconds (7 groups)

Second FTGS: 57 seconds
Second Regroup: 23 seconds (217 groups)
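For a sense of scale, the figures on this slide can be turned into rough per-document and per-second rates. This is back-of-the-envelope arithmetic only: it assumes "GB" means 10^9 bytes and that 330GB is the total data size for this build, neither of which the slide states. The variable names are illustrative.

```python
docs = 13e9            # 13 billion documents
data_bytes = 330e9     # assumption: decimal gigabytes

bytes_per_doc = data_bytes / docs            # about 25.4

first_ftgs_terms = 36.3e6
first_ftgs_seconds = 314
terms_per_second = first_ftgs_terms / first_ftgs_seconds  # roughly 116 thousand

first_regroup_seconds = 9.6
docs_per_second_regroup = docs / first_regroup_seconds    # roughly 1.35 billion
```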

Imhotep

Page 301: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep

Distributed System that does efficient FTGS and Regroup operations

Powers our internal analytical tools

Page 302: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep

Distributed System that does efficient FTGS and Regroup operations

Powers our internal analytical tools

… and more

Page 303: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Imhotep - Next @IndeedEng Talk

Sharding and shard management
Session / FTGS network protocol
Memory management
Inverted Indexes
FTGS Merge
Regroup operations
Fault Tolerance

Page 304: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Conclusion

Now scales to larger and larger data sets by adding more machines

Increased freshness and frequency of builds

Decision trees have lots of tunable components; we regularly get 1% wins via A/B tests

Page 305: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Continuous Improvement

Sponsored Job Click-through Rate (CTR)

Page 306: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Thanks.

Page 307: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Q & A

Page 308: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

More Questions?

Jason David James Jeff

Page 309: [@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Next @IndeedEng Talk

Imhotep: Large Scale Analytics and Machine Learning at Indeed

Jeff Plaisance, Engineering Manager
March 26, 2014

http://engineering.indeed.com/talks