[@indeedeng] machine learning at indeed: scaling decision trees
DESCRIPTION
Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE
Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed. This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.
TRANSCRIPT
go.indeed.com/IndeedEngTalks
Machine Learning at Indeed
Scaling Decision Trees
Andrew Hudson, CTO
I help people get jobs.
Indeed is a Search Engine for Jobs
Which jobs to show?
18,749 jobs
Maximize job seeker’s chance to get the job
● Will job seeker click on the job?
● Is the job still available?
● Will job seeker apply to the job?
● Is job seeker qualified for the job?
How?
Log job seeker behavior
Analyze logs: what best explains why they clicked on some jobs and not on others?
May help predict future behavior
Supervised learning
Supervised Learning Approaches
Neural networks Bayesian methods Decision trees
Genetic programming
Logistic model tree Nearest neighbor
Support Vector Machines
Random forests Boosting
Bagging Regression Ensemble methods
Supervised Learning Approaches
Decision trees
Genetic programming
Logistic model tree
Random forests Boosting
Bagging Ensemble methods
Decision Trees
What is a Decision Tree?
A tree-like structure that presents a relevant sequence of questions, which determine a path and ultimately some outcome or prediction
I’m Thinking About Buying a Laptop
Is quality important?
    NO: ASUS - or whatever woot has
    YES: Want to run linux?
        NO: MACBOOK
        YES: LENOVO
        HELLYES: SYSTEM76
        IDGAF: DELL
Benefits of Decision Trees
Algorithm is relatively simple to understand and implement
Model produced is also human-understandable
Decision Tree Learning
Programmatic creation of decision trees
Decision Tree Learning
Given a set of documents, split it into two or more subsets that optimize some criteria
Repeat this process until a set can no longer be split
Titanic Example
1309 passengers, 500 survivors, 38.2% survival rate
What best explains who survived?
class: class of ticket; first, second or third
fsize: family size; number of family members onboard
gender: male or female
What best explains who survived?
All passengers: 1309 passengers, 500 survivors, 38.2% survival
Candidate split: class = 1
class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival
Score = ?
Score
conditional entropy
Conditional Entropy as Score
lower conditional entropy↓
less uncertainty about prediction based on term
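A minimal sketch of conditional entropy as a split score, using the Titanic counts from the slides. The function names and the choice of natural logarithm are assumptions, not the talk's implementation; with natural log the class = 1 and class ≤ 2 splits land within about three decimal places of the scores shown on the slides.

```python
import math

def binary_entropy(positives, total):
    """Entropy in nats of a yes/no outcome seen `positives` times out of `total`."""
    if total == 0:
        return 0.0
    p = positives / total
    # Skip zero probabilities, since 0 * log(0) is taken as 0.
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def conditional_entropy(count, vsum, total_count, total_value):
    """Entropy of each side of a split, weighted by that side's share of documents.

    count, vsum               -> docs and positive outcomes matching the condition
    total_count, total_value  -> docs and positive outcomes in the whole node
    """
    rest_count = total_count - count
    rest_vsum = total_value - vsum
    return (count * binary_entropy(vsum, count)
            + rest_count * binary_entropy(rest_vsum, rest_count)) / total_count

# Root node from the slides: 1309 passengers, 500 survivors.
class_eq_1 = conditional_entropy(323, 200, 1309, 500)
class_le_2 = conditional_entropy(600, 319, 1309, 500)
gender_f = conditional_entropy(466, 339, 1309, 500)

for name, score in [("class = 1", class_eq_1),
                    ("class <= 2", class_le_2),
                    ("gender = female", gender_f)]:
    print(f"{name}: {score:.4f}")
```

Lower is better, so of these three candidates the gender split wins, which agrees with the ordering on the slides.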
All passengers: 1309 passengers, 500 survivors, 38.2% survival
Candidate split: class = 1
class = 1: 323 passengers, 200 survivors, 61.9% survival
class ≠ 1: 986 passengers, 300 survivors, 30.4% survival
Score = 0.6267
Best Score: 0.6267, class = 1
Candidate split: class ≤ 2 (same split as class ≠ 3)
class ≤ 2: 600 passengers, 319 survivors, 53.2% survival
class > 2: 709 passengers, 181 survivors, 25.5% survival
Score = 0.6244
Best Score: 0.6244, class ≤ 2
Candidate split: gender = female
gender = female: 466 passengers, 339 survivors, 72.7% survival
gender ≠ female: 843 passengers, 161 survivors, 19.1% survival
Score = 0.5525
Best Score: 0.5525, gender=f
Candidate split: fsize = 0
fsize = 0: 790 passengers, 239 survivors, 30.3% survival
fsize ≠ 0: 519 passengers, 261 survivors, 50.3% survival
Score = 0.6448
Best Score: 0.5525, gender=f
After the first split: gender=male 19.1% survival, gender=female 72.7% survival
gender=male: 843 passengers, 161 survivors, 19.1% survival
Candidate split: class = 1
class = 1: 179 passengers, 61 survivors, 34.1% survival
class ≠ 1: 664 passengers, 100 survivors, 15.1% survival
Score = 0.4700
All passengers: 38.2%
    MALE: 19.1%
        CLASS=1: 34.1%
        CLASS≠1: 15.1%
            FSIZE=2: 33.9%
            FSIZE≠2: 13.1%
    FEMALE: 72.7%
        CLASS<=2: 93.2%
        CLASS>2: 49.1%
            FSIZE<=2: 54.9%
            FSIZE>2: 24.4%
Predicting Click Probabilities
Passenger → Job Impression
Survived → Clicked on Job
For each candidate job, follow path through tree then take click through rate of terminal node
Simplified Decision Tree for query="sales"
sales?
    YES: representative?
        YES: outside?
            YES: 5.1%
            NO: service?
                YES: 2.9%
                NO: inside?
                    YES: 4.4%
                    NO: 4.6%
        NO: manager?
            YES: 2.9%
            NO: associate?
                YES: 1.8%
                NO: 2.6%
    NO: account?
        YES: 3.8%
        NO: manager?
            YES: 2.1%
            NO: 1.9%
[Figure: the path through the tree is highlighted in turn for each candidate job title]
Final CTR Predictions
5.1% outside sales representative
4.6% sales representative
4.4% inside sales representative
3.8% account executive
2.9% sales manager
2.9% service sales representative
2.6% sales consultant
2.1% store manager
1.9% customer service representative
1.8% sales associate
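Following a path through the tree and reading off the terminal node's click-through rate can be sketched as below. The nested-tuple encoding and the names `SALES_TREE` and `predict_ctr` are hypothetical, and the tree shape is a reconstruction from the slides that is consistent with the final predictions listed above; treat it as illustrative rather than as the production model.

```python
# Hypothetical encoding: (word, yes_branch, no_branch); a float is a leaf CTR.
SALES_TREE = (
    "sales",
    ("representative",
        ("outside", 0.051,
            ("service", 0.029,
                ("inside", 0.044, 0.046))),
        ("manager", 0.029,
            ("associate", 0.018, 0.026))),
    ("account", 0.038,
        ("manager", 0.021, 0.019)),
)

def predict_ctr(tree, job_title):
    """Walk the tree, branching on whether each word appears in the title."""
    words = set(job_title.lower().split())
    node = tree
    while not isinstance(node, float):
        word, yes_branch, no_branch = node
        node = yes_branch if word in words else no_branch
    return node

titles = [
    "sales representative", "account executive", "outside sales representative",
    "sales associate", "inside sales representative", "sales manager",
    "sales consultant", "store manager", "service sales representative",
    "customer service representative",
]

# Rank candidate jobs by predicted click-through rate, highest first.
for title in sorted(titles, key=lambda t: predict_ctr(SALES_TREE, t), reverse=True):
    print(f"{predict_ctr(SALES_TREE, title):.1%} {title}")
```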
Single Machine Implementation
Overview
Tree Building Strategies
One node at a time
- depth first
- breadth first
[Figure: Depth First. Animated binary tree, nodes numbered 1 to 9 in the order they are created]
[Figure: Breadth First. Animated binary tree, nodes numbered 1 to 13 in the order they are created]
One layer at a time, all nodes simultaneous
[Figure: animated binary tree built one full layer per iteration, labeled iter #1, iter #2, iter #3]
Data Format
id  class  fsize  gender  survived
0   1      0      f       1
1   1      3      m       1
2   1      3      f       0
3   1      3      m       0
4   1      3      f       0
5   1      0      m       1
6   1      1      f       1
7   1      0      m       0
8   1      2      f       1
9   1      0      m       0
10  1      1      m       0
11  1      1      f       1
12  1      0      f       1
13  1      0      f       1
14  1      0      m       1
15  1      0      m       0
16  1      1      m       0
17  1      1      f       1
18  1      0      f       1
19  1      0      m       0
….
Data Format
Create an inverted index
Key to efficiently building one layer at a time
Inverted Index
Maps terms to the list of documents that contain that term
Terms and docs stored in sorted order
Each entry reads field=term → docs (Field: class, Term: 1, Docs: 0,1,2,3….)
class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13….
class=2 → 323,324,325,326,327,328,329….
class=3 → 600,601,602,603,604,605,606….
fsize=0 → 0,5,7,9,12,13,14,15,18,19,22….
fsize=1 → 6,10,11,16,17,26,27,36,49,50….
fsize=2 → 8,20,21,42,76,77,78,79,81,82….
fsize=3 → 1,2,3,4,54,55,56,57,90,339….
fsize=4 → 249,250,251,252,253,449,806….
….
gender=f → 0,2,4,6,8,11,12,13,17,18,21….
gender=m → 1,3,5,7,9,10,14,15,16,19,20….
survived=0 → 2,3,4,7,9,10,15,16,19,25….
survived=1 → 0,1,5,6,8,11,12,13,14,17….
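As a rough illustration of the idea (not Lucene or Flamdex), an inverted index over the first twenty rows of the data format slide can be built with plain dictionaries. The names `ROWS`, `FIELDS` and `build_inverted_index` are made up for this sketch.

```python
from collections import defaultdict

FIELDS = ("class", "fsize", "gender", "survived")

# (id, class, fsize, gender, survived): the first twenty rows from the slides.
ROWS = [
    (0, 1, 0, "f", 1), (1, 1, 3, "m", 1), (2, 1, 3, "f", 0), (3, 1, 3, "m", 0),
    (4, 1, 3, "f", 0), (5, 1, 0, "m", 1), (6, 1, 1, "f", 1), (7, 1, 0, "m", 0),
    (8, 1, 2, "f", 1), (9, 1, 0, "m", 0), (10, 1, 1, "m", 0), (11, 1, 1, "f", 1),
    (12, 1, 0, "f", 1), (13, 1, 0, "f", 1), (14, 1, 0, "m", 1), (15, 1, 0, "m", 0),
    (16, 1, 1, "m", 0), (17, 1, 1, "f", 1), (18, 1, 0, "f", 1), (19, 1, 0, "m", 0),
]

def build_inverted_index(rows):
    """Map field -> term -> sorted list of doc ids containing that term."""
    index = {field: defaultdict(list) for field in FIELDS}
    # Visiting docs in id order keeps every doc list sorted.
    for doc_id, *values in sorted(rows, key=lambda row: row[0]):
        for field, term in zip(FIELDS, values):
            index[field][term].append(doc_id)
    # Store the terms of each field in sorted order as well.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}

index = build_inverted_index(ROWS)
for field, terms in index.items():
    for term, docs in terms.items():
        print(f"{field}={term} -> {','.join(map(str, docs))}")
```

Because both terms and doc lists are sorted, a trainer can stream through one field at a time without random access to the original rows.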
Inverted Index Implementations
Lucene
Flamdex
Primary Lookup Tables
groups[doc]
Where in the tree each doc is
Initialized to all ones, all docs start in root
values[doc]
Value to be classified, for each doc
In this case it’s 1 if survived, 0 otherwise
Primary Lookup Tables
values[doc]
Constructed from an inverted index of the values
Invert the field of interest (e.g. survived)
Main Loop Overview
foreach field
    foreach term
        get group stats
        evaluate splits
apply best splits
repeat n times or until no more splits found
Main Loop - First Iteration
foreach field (class, fsize, gender)
    foreach term (class=1,class=2,class=3...)
        get group stats
Get Group Stats
count[grp]
Count of how many documents within that group contain current term, initialized to zeros
vsum[grp]
Summation of the value to be classified from the documents within that group that contain current term, initialized to zeros
Get Group Stats
for current field/term
    foreach doc
        grp = grps[doc]
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc]
Get Group Stats
for current field/term (class=1)
    foreach doc (0,1,2,3,4,5,6,7,8...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
count[1] = 0, vsum[1] = 0
count[1] = 1, vsum[1] = 1
count[1] = 2, vsum[1] = 2
count[1] = 3, vsum[1] = 2
count[1] = 4, vsum[1] = 2
count[1] = 5, vsum[1] = 2
count[1] = 6, vsum[1] = 3
…
count[1] = 323, vsum[1] = 200
Group 1: 1309 passengers, 500 survivors, 38.2% survival
class = 1: 323 passengers (count[1]), 200 survivors (vsum[1])
Get Group Stats
for current field/term (class=2)
    foreach doc (323,324,325,326,327,328,329...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…)
…
count[1] = 277, vsum[1] = 119
Get Group Stats
for current field/term (class=3)
    foreach doc (600,601,602,603,604,605,606...)
        grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…)
…
count[1] = 709, vsum[1] = 181
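The get-group-stats step translates almost line for line into Python. This is a sketch with a made-up function signature; the `vals` array holds the survived values for docs 0 to 8 as shown on the slides, and the second call uses the group assignments from the second-iteration slide.

```python
def get_group_stats(docs, grps, vals, num_groups):
    """Count docs and sum values per group for one field/term's doc list."""
    count = [0] * (num_groups + 1)   # index 0 is unused: group 0 is skipped
    vsum = [0] * (num_groups + 1)
    for doc in docs:
        grp = grps[doc]
        if grp == 0:
            continue
        count[grp] += 1
        vsum[grp] += vals[doc]
    return count, vsum

vals = [1, 1, 0, 0, 0, 1, 1, 0, 1]        # survived, docs 0..8
class_1_docs = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # start of the class=1 doc list

# First iteration: every doc is still in the root, group 1.
root_grps = [1] * len(vals)
count, vsum = get_group_stats(class_1_docs[:6], root_grps, vals, num_groups=1)
print(count[1], vsum[1])

# Second iteration: docs have moved to group 2 (male) or group 3 (female).
split_grps = [3, 2, 3, 2, 3, 2, 3, 2, 3]
count2, vsum2 = get_group_stats(class_1_docs, split_grps, vals, num_groups=3)
print(count2[2], vsum2[2], count2[3], vsum2[3])
```

One pass over a term's doc list updates the stats of every group at once, which is what makes building a whole layer per iteration cheap.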
Main Loop - First Iteration
foreach field (class, fsize, gender)
    foreach term (class=1,class=2,class=3...)
        get group stats
        evaluate splits
Evaluate Splits
Consider current field/term as a potential split for each group
1) check if split is admissible
balance check, significance check
2) score the split
conditional entropy or some other heuristic
3) keep best scoring split
Evaluate Splits
totalcount[group] / totalvalue[group]
Total number of documents and total values for each group, i.e. # passengers / # survivors
bestsplit[group] / bestscore[group]
Current best split and score for each group, initially nulls
foreach field/term (class=1)
    get group stats (count[1]=323,vsum[1]=200)
    foreach group
        if not admissible( … ) skip
        score = calcscore(count[grp], vsum[grp], totalcount[grp], totalvalue[grp])
        if score < bestscore[grp]
            bestscore[grp] = score
            bestsplit[grp] = field/term
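A sketch of the evaluate-splits loop for the first iteration. The `admissible` check here is a hypothetical balance check only (the significance check is omitted), `calcscore` is assumed to be conditional entropy with natural log, and the per-term stats are the root-node counts from the slides.

```python
import math

def entropy(positives, total):
    """Entropy in nats of a binary outcome (natural log is an assumption)."""
    if total == 0 or positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def calcscore(count, vsum, totalcount, totalvalue):
    """Conditional entropy of the outcome given whether a doc has the term."""
    rest, rest_vsum = totalcount - count, totalvalue - vsum
    return (count * entropy(vsum, count) + rest * entropy(rest_vsum, rest)) / totalcount

def admissible(count, totalcount, min_side=10):
    """Hypothetical balance check: each side of the split needs min_side docs."""
    return min_side <= count <= totalcount - min_side

def evaluate_splits(stats, totalcount, totalvalue):
    """stats yields (field_term, count, vsum); all lists are indexed by group."""
    num_groups = len(totalcount)
    bestscore = [math.inf] * num_groups
    bestsplit = [None] * num_groups
    for field_term, count, vsum in stats:
        for grp in range(1, num_groups):   # group 0 is skipped, as in the slides
            if not admissible(count[grp], totalcount[grp]):
                continue
            score = calcscore(count[grp], vsum[grp], totalcount[grp], totalvalue[grp])
            if score < bestscore[grp]:     # lower conditional entropy is better
                bestscore[grp] = score
                bestsplit[grp] = field_term
    return bestsplit, bestscore

# First iteration: a single group (index 1) with 1309 passengers, 500 survivors.
stats = [
    ("class=1", [0, 323], [0, 200]),
    ("class=2", [0, 277], [0, 119]),
    ("class=3", [0, 709], [0, 181]),
    ("fsize=0", [0, 790], [0, 239]),
    ("gender=f", [0, 466], [0, 339]),
]
bestsplit, bestscore = evaluate_splits(stats, [0, 1309], [0, 500])
print(bestsplit[1])
```

Using `math.inf` as the initial best score stands in for the "initially nulls" state, so the first admissible split always becomes the current best.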
Main Loop - First Iteration
foreach field (class, fsize, gender)
    foreach term (class=1,class=2,class=3...)
        get group stats
        evaluate splits
apply best splits (bestsplit[1]=“gender=f”)
Apply Best Splits
Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group
target group: 1
condition: gender=female
positive group: 3
negative group: 2
[Figure: node 1 with child nodes 2 and 3]
Apply Best Splits
Using inverted index, iterate over docs that match split condition
If current document is in targeted group, move it to the positive group
At the end, move anything left in target group to negative group
Apply Best Splits
gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….
Start: group[0] through group[20] = 1
Docs matching the condition move from target group 1 to positive group 3, one at a time: group[0] = 3, group[2] = 3, group[4] = 3, …
Anything left in target group 1 then moves to negative group 2:
group[0] = 3    group[7] = 2     group[14] = 2
group[1] = 2    group[8] = 3     group[15] = 2
group[2] = 3    group[9] = 2     group[16] = 2
group[3] = 2    group[10] = 2    group[17] = 3
group[4] = 3    group[11] = 3    group[18] = 3
group[5] = 2    group[12] = 3    group[19] = 2
group[6] = 3    group[13] = 3    group[20] = 2
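The two passes described above (move matching docs to the positive group, then sweep leftovers into the negative group) might look like this. `apply_split` is an illustrative name, and the doc list is the gender=f list from the slides truncated to the 21 docs shown.

```python
def apply_split(groups, matching_docs, target, positive, negative):
    """Regroup docs in place: matches in `target` go to `positive`, the rest to `negative`."""
    # Pass 1: walk the inverted-index doc list for the split condition.
    for doc in matching_docs:
        if groups[doc] == target:
            groups[doc] = positive
    # Pass 2: anything still in the target group did not match the condition.
    for doc, grp in enumerate(groups):
        if grp == target:
            groups[doc] = negative

groups = [1] * 21                                   # all docs start in the root
gender_f_docs = [0, 2, 4, 6, 8, 11, 12, 13, 17, 18]  # gender=f docs below 21

apply_split(groups, gender_f_docs, target=1, positive=3, negative=2)
print(groups)
```

Docs outside the target group are never touched, so several splits for different groups can be applied in the same iteration without interfering with each other.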
Main Loop
foreach field
    foreach term
        get group stats
        evaluate splits
apply best splits
repeat n times or until no more splits found
[Figure: iter #1 splits node 1 into node 2 (gender ≠ female) and node 3 (gender = female); iter #2 continues from nodes 2 and 3]
Main Loop - Second Iteration
foreach field (class, fsize, gender)
    foreach term (class=1,class=2,class=3...)
        get group stats
Get Group Stats
for current field/term (class=1)
    foreach doc (0,1,2,3,4,5,6,7,8...)
        grp = grps[doc] (3,2,3,2,3,2,3,2,3…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
…
count[2] = 179, vsum[2] = 61
count[3] = 144, vsum[3] = 139
Get Group Stats
for current field/term (class=2)
    foreach doc (323,324,325,326,327,328,329...)
        grp = grps[doc] (2,3,2,2,2,2,3,2,2…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…)
…
count[2] = 171, vsum[2] = 25
count[3] = 106, vsum[3] = 94
Get Group Stats
for current field/term (class=3)
    foreach doc (600,601,602,603,604,605,606...)
        grp = grps[doc] (2,2,2,3,3,2,2,3,2…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…)
…
count[2] = 493, vsum[2] = 75
count[3] = 216, vsum[3] = 106
Get Group Stats
for current field/term (gender=female)
    foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….)
        grp = grps[doc] (3,3,3,3,3,3,3,3,3,3,3,3…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (1,0,0,1,1,1,1,1,1…)
…
count[2] = 0, vsum[2] = 0
count[3] = 467, vsum[3] = 339
Get Group Stats
for current field/term (gender=male)
    foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...)
        grp = grps[doc] (2,2,2,2,2,2,2,2,2,2,2…)
        if grp == 0 skip
        count[grp]++
        vsum[grp] += vals[doc] (1,0,1,0,0,0,1,0,0…)
…
count[2] = 844, vsum[2] = 161
count[3] = 0, vsum[3] = 0
What About Inequality Splits?
e.g. class ≤ 2
Main Loop + Inequality Splits
foreach field
    reset inequality stats
    foreach term
        get group stats
        update inequality stats
        evaluate splits
        evaluate inequality splits
apply best splits for each group
repeat n times or until no more splits found
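One plausible reading of "inequality stats" is a running total over a field's terms in sorted order: after processing term t, the totals describe the split "field ≤ t". The sketch below assumes that reading and one value per doc per field; the function name is made up, and the input is the root-group class stats from the slides.

```python
def inequality_stats(term_stats):
    """Turn per-term (term, count, vsum) rows, sorted by term, into 'field <= term' totals."""
    cum_count = 0   # reset inequality stats at the start of each field
    cum_vsum = 0
    totals = []
    for term, count, vsum in term_stats:
        cum_count += count   # update inequality stats with this term's group stats
        cum_vsum += vsum
        totals.append((term, cum_count, cum_vsum))
    return totals

# Group 1 stats for the class field: (term, count, vsum).
class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]

for term, count, vsum in inequality_stats(class_stats):
    print(f"class <= {term}: {count} passengers, {vsum} survivors")
```

The last row always covers every doc that has the field, so an admissibility check would be expected to reject it as a split.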
Scalability
Performs quite well on a single machine
Worked well for a while, but started to hit limits
Ultimately needed to distribute to multiple machines
Multiple Machine Implementation
Hadoop?
Hadoop
Experimented with using Hadoop
Each level took five sequential map reduce jobs
Much slower than single machine; repeatedly writes intermediate data and lots of shuffling
Hadoop not great for iterative algorithms
Partition Data
[Figure: the inverted index is partitioned into Shard 1 and Shard 2, which are placed on Machine 1 and Machine 2]
Main Loop
foreach field
    foreach term
        get group stats
        evaluate splits
apply best splits for each group
repeat n times or until no more splits found
FTGS: foreach Field, foreach Term, get Group Stats
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsize=2|1|159|90
fsize=3|1|43|30
fsize=4|1|22|6
fsize=5|1|25|5
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|466|339
gender=m|1|843|161
Sorted
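Reading each stream line as field=term|group|count|vsum (an interpretation of the slide, not a documented format), a single-machine FTGS stream could be produced by a generator like the one below. The tiny index, `grps` and `vals` cover only docs 0 to 8 and are illustrative.

```python
def ftgs_stream(index, grps, vals):
    """Yield 'field=term|group|count|vsum' lines sorted by field, term, then group."""
    for field in sorted(index):
        for term in sorted(index[field]):
            count, vsum = {}, {}
            for doc in index[field][term]:
                grp = grps[doc]
                if grp == 0:          # group 0 is skipped, as in get group stats
                    continue
                count[grp] = count.get(grp, 0) + 1
                vsum[grp] = vsum.get(grp, 0) + vals[doc]
            for grp in sorted(count):
                yield f"{field}={term}|{grp}|{count[grp]}|{vsum[grp]}"

# Docs 0..8 from the slides: all first class, alternating f/m, all in group 1.
index = {
    "class": {1: [0, 1, 2, 3, 4, 5, 6, 7, 8]},
    "gender": {"f": [0, 2, 4, 6, 8], "m": [1, 3, 5, 7]},
}
vals = [1, 1, 0, 0, 0, 1, 1, 0, 1]
grps = [1] * len(vals)

lines = list(ftgs_stream(index, grps, vals))
print("\n".join(lines))
```

Emitting the stream in sorted order matters once several machines are involved, since sorted streams can be combined with a simple merge.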
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsize=2|1|159|90
fsize=3|1|43|30
fsize=4|1|22|6
fsize=5|1|25|5
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|466|339
gender=m|1|843|161
FTGS Stream - Single Machine
FTGS Stream
How to distribute?
Machine 1: Shard 1 → FTGS 1
Machine 2: Shard 2 → FTGS 2
Machine 3: Merge (FTGS 1, FTGS 2)
FTGS Stream Merge
Machine 1
class=1|1|198|111
class=2|1|277|119
class=3|1|511|129
fsize=0|1|790|239
fsize=1|1|94|53
fsize=2|1|75|48
fsize=3|1|21|17
fsize=4|1|3|1
fsize=5|1|3|1
gender=f|1|308|237
gender=m|1|678|122
Machine 2
class=1|1|125|89
class=3|1|198|52
fsize=1|1|141|73
fsize=2|1|84|42
fsize=3|1|22|13
fsize=4|1|19|5
fsize=5|1|22|4
fsize=6|1|16|4
fsize=7|1|8|0
fsize=10|1|11|0
gender=f|1|158|102
gender=m|1|165|39
Machine 1 + Machine 2
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
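The merge step shown on these slides can be sketched as a two-cursor merge: advance through both sorted streams, copy a row when only one side has the key, and add the stats when both sides have it. The function name merge_two and the tuple layout (field, term, group, stats) are assumptions made for illustration; the rows are the first entries of the Machine 1 and Machine 2 streams from the slides.

```python
def merge_two(a, b):
    """Merge two FTGS streams sorted by (field, term, group), summing stats on equal keys."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a, key_b = a[i][:3], b[j][:3]
        if key_a == key_b:
            # Same field/term/group on both machines: add the stats column by column.
            stats = tuple(x + y for x, y in zip(a[i][3], b[j][3]))
            out.append((*key_a, stats))
            i += 1
            j += 1
        elif key_a < key_b:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One stream is exhausted; the rest of the other is already sorted.
    out.extend(a[i:])
    out.extend(b[j:])
    return out

machine_1 = [
    ("class", 1, 1, (198, 111)),
    ("class", 2, 1, (277, 119)),
    ("class", 3, 1, (511, 129)),
    ("fsize", 0, 1, (790, 239)),
    ("fsize", 1, 1, (94, 53)),
]
machine_2 = [
    ("class", 1, 1, (125, 89)),
    ("class", 3, 1, (198, 52)),
    ("fsize", 1, 1, (141, 73)),
]

merged = merge_two(machine_1, machine_2)
for field, term, group, stats in merged:
    print(f"{field}={term}|{group}|{stats[0]}|{stats[1]}")
```

Terms are kept as integers here so that fsize=10 sorts after fsize=7, as it does on the slides.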
Shard 1 Shard 2 Shard 3 Shard 4 Shard 5 Shard 6
FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
k-way merge
FTGS 1-6
FTGS 1-6 FTGS 7-12 FTGS 13-18
FTGS 1-18
FTGS 1-18 FTGS 19-36
FTGS 1-36
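A rough sketch of the merge hierarchy in the diagram above: sorted streams are combined in batches with a k-way merge, and the batch results are merged again until a single stream remains. It uses Python's heapq.merge for the k-way step. The uniform fan_in parameter, the simplified (key, count) rows and the function names are assumptions; the diagram itself uses different fan-ins at each level (6, then 3, then 2).

```python
import heapq
from itertools import groupby

def merge_sorted(streams):
    """k-way merge of sorted (key, count) streams, adding counts for equal keys."""
    merged = heapq.merge(*streams)  # lazy; rows compare by key first
    for key, rows in groupby(merged, key=lambda row: row[0]):
        yield key, sum(count for _, count in rows)

def tree_merge(streams, fan_in):
    """Merge streams level by level, at most fan_in streams per merge."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(streams)
    if not level:
        return iter(())
    while len(level) > 1:
        level = [
            merge_sorted(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]

# 36 hypothetical shard streams, each contributing one document per term.
shard_streams = [[("class=1", 1), ("class=2", 1)] for _ in range(36)]
result = list(tree_merge(shard_streams, fan_in=6))
print(result)
```

With a fan-in of 6, the 36 shard streams collapse to 6 intermediate streams and then to 1, so no single merge has to read every shard directly.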
Main Loop
foreach field
    foreach term
        get group stats
        evaluate splits
apply best splits for each group
repeat n times or until no more splits found
FTGS
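The main loop can be sketched in a few lines on a single machine. This is an illustration under several assumptions, not Indeed's implementation: splits are binary ("document has this term" versus "does not"), the split score is information gain on a 0/1 label, and groups are numbered like a binary heap (group g splits into 2g and 2g+1). The timings later in the talk report 7 groups after the first regroup, so the real system evidently allows multi-way splits. All names here are hypothetical.

```python
import math
from collections import defaultdict

def entropy(pos, n):
    """Binary entropy of a group with n documents, pos of them positive."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def build_tree(docs, fields, label, max_rounds):
    group_of = [1] * len(docs)                # every document starts in group 1
    splits = {}                               # group -> (field, term) it was split on
    for _ in range(max_rounds):               # "repeat n times"
        # FTGS pass: count and positive count per (field, term, group).
        stats = defaultdict(lambda: [0, 0])
        totals = defaultdict(lambda: [0, 0])
        for doc, g in zip(docs, group_of):
            totals[g][0] += 1
            totals[g][1] += doc[label]
            for f in fields:
                cell = stats[(f, doc[f], g)]
                cell[0] += 1
                cell[1] += doc[label]
        # Evaluate splits: keep the highest-gain (field, term) for each group.
        best = {}
        for (f, t, g), (n_in, pos_in) in sorted(stats.items()):
            n, pos = totals[g]
            n_out, pos_out = n - n_in, pos - pos_in
            if n_out == 0:
                continue                      # term covers the whole group
            after = (n_in * entropy(pos_in, n_in)
                     + n_out * entropy(pos_out, n_out)) / n
            gain = entropy(pos, n) - after
            if gain > 1e-12 and (g not in best or gain > best[g][0]):
                best[g] = (gain, f, t)
        if not best:
            break                             # "until no more splits found"
        # Regroup: move each document of a split group into one of its children.
        for i, doc in enumerate(docs):
            g = group_of[i]
            if g in best:
                _, f, t = best[g]
                splits[g] = (f, t)
                group_of[i] = 2 * g + (1 if doc[f] == t else 0)
    return group_of, splits

# Hypothetical toy data in which gender fully determines the label.
docs = [
    {"class": 1, "gender": "f", "y": 1},
    {"class": 3, "gender": "f", "y": 1},
    {"class": 1, "gender": "m", "y": 0},
    {"class": 3, "gender": "m", "y": 0},
]
groups, splits = build_tree(docs, ["class", "gender"], "y", max_rounds=3)
print(groups, splits)
```

On the toy data the loop splits group 1 on gender=f in the first round and stops in the second, because both resulting groups are already pure.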
Regroup
FTGS 1-6 FTGS 7-12 FTGS 13-18
FTGS
Regroup 1-6 Regroup 7-12 Regroup 13-18
Imhotep
Distributed System that does efficient FTGS and Regroup operations on inverted indexes
Imhotep
32 machines
2 cpu x 6 core Xeon Westmere E5649
128GB RAM
10x1TB 7200 RPM SATA
Total: 384 cores, 4TB RAM, 320TB disk
Imhotep
Decision tree on 13 billion documents
330GB → ~25 bytes per doc
First FTGS: 314 seconds (36.3 million terms)
First Regroup: 9.6 seconds (7 groups)
Second FTGS: 57 seconds
Second Regroup: 23 seconds (217 groups)
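The cluster totals and the bytes-per-document figure follow from the per-machine numbers on the slides. A quick check, assuming decimal units (1 GB = 10^9 bytes), which is the reading that matches the ~25 bytes quoted; the variable names are illustrative only.

```python
MACHINES = 32

cores = MACHINES * 2 * 6           # 2 CPUs x 6 cores per machine
ram_gb = MACHINES * 128            # 128GB RAM per machine
disk_tb = MACHINES * 10 * 1        # 10 x 1TB disks per machine

index_bytes = 330e9                # 330GB, assuming 1 GB = 10^9 bytes
documents = 13e9                   # 13 billion documents
bytes_per_doc = index_bytes / documents

print(cores, ram_gb, disk_tb, round(bytes_per_doc, 1))
```

The RAM total comes to 4096GB, which the slide rounds to 4TB.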
Imhotep
Distributed System that does efficient FTGS and Regroup operations
Powers our internal analytical tools
… and more
Imhotep - Next @IndeedEng Talk
Sharding and shard management
Session / FTGS network protocol
Memory management
Inverted Indexes
FTGS Merge
Regroup operations
Fault Tolerance
Conclusion
Now scales to larger and larger data sets by adding more machines
Increased freshness and frequency of builds
Decision trees have lots of tunable components; we regularly get 1% wins via A/B tests
Continuous Improvement
Sponsored Job Click-through Rate (CTR)
Thanks.
Q & A
More Questions?
Jason David James Jeff
Next @IndeedEng Talk
Imhotep: Large Scale Analytics and Machine Learning at Indeed
Jeff Plaisance, Engineering Manager
March 26, 2014
http://engineering.indeed.com/talks