Identifying and Incorporating Latencies in Distributed Data Mining Algorithms

Michael Sevilla

Applicability of Mahout for Large Data Sets

Michael Sevilla

What is Mahout?

• Distributed machine learning libraries
  – “scalable to reasonably large data sets”
  – Runs on Hadoop

http://heureka.blogetery.com/

The Data: Million Song Data Set

• Large Data Set
  – 1,019,318 users
  – 384,546 MSD songs
  – 48,373,586 (user, song, count)

• Kaggle Competition: offline evaluation
  – Predict songs a user will listen to using
    • Training: 1M user listening history
    • Validation: 110K users

• “Martin L” blogged his methodology + results

22 machines vs. 1 guy

Motivations

• Can Mahout easily be modified?
• Can Mahout perform well for this workload?
• Can Mahout produce accurate results?
• Can Mahout work ‘out of box’?

• Hypothesis: 22 machines + Mahout > 1 guy

What kind of Recommender?

• Format: <userID, songID, count>
• Users interacting with items
• Users express preferences towards items

• We can use Collaborative Filtering

Collaborative Filtering

• Predicts preference of user towards an item
• Constructs a Top-N-Recommendation

1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries

Mahout has item-based Collaborative Filtering jobs!
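The first two collaborative filtering steps listed above (parse the input, create the user-item matrix) can be sketched in a few lines of Python. This is only an illustration, not code from Mahout or from Martin's blog: the function names are hypothetical, and the tab-separated `<userID, songID, count>` input layout is an assumption.

```python
from collections import defaultdict


def parse_triples(lines):
    """Step 1: parse '<userID>\\t<songID>\\t<count>' lines (assumed layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, song, count = line.split("\t")
        yield user, song, int(count)


def build_user_item_matrix(triples):
    """Step 2: sparse user-item matrix, matrix[user][song] = play count."""
    matrix = defaultdict(dict)
    for user, song, count in triples:
        matrix[user][song] = matrix[user].get(song, 0) + count
    return matrix


def missing_entries(matrix, user):
    """Step 3 input: songs with no entry for this user (cells to predict)."""
    all_songs = {song for row in matrix.values() for song in row}
    return sorted(all_songs - set(matrix[user]))


raw = ["u1\ts1\t3", "u1\ts2\t1", "u2\ts2\t5", "u2\ts3\t2"]
matrix = build_user_item_matrix(parse_triples(raw))
print(missing_entries(matrix, "u1"))
```

Storing the matrix as nested dictionaries keeps only the observed (user, song) pairs, which matters when almost every cell is empty, as in a listening history.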

CAN MAHOUT EASILY BE MODIFIED?

Martin’s Code

• Methodology: similarity vector of history
  – Sparse matrix
    • COLISTEN(i, j) – listeners who listened to i and j
  – Sum similarities for each song user x listens to

• The code: all Python
  – Parse: 27 lines of code (l.o.c)
  – Create Matrix: 46 l.o.c
  – Predict: 45 l.o.c
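A minimal sketch of the co-listen idea described above, assuming the similarity between two songs is simply the raw COLISTEN count. Martin's actual code may weight or normalize the counts differently; `build_colisten` and `recommend` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_colisten(histories):
    """COLISTEN(i, j): number of users who listened to both song i and song j."""
    colisten = defaultdict(lambda: defaultdict(int))
    for songs in histories.values():
        # Each unordered pair of distinct songs in one history counts once.
        for i, j in combinations(sorted(set(songs)), 2):
            colisten[i][j] += 1
            colisten[j][i] += 1
    return colisten


def recommend(user, histories, colisten, top_n=3):
    """Sum COLISTEN over the songs user x listens to; rank the unheard songs."""
    heard = set(histories[user])
    scores = defaultdict(int)
    for i in heard:
        for j, count in colisten[i].items():
            if j not in heard:
                scores[j] += count
    # Highest score first; ties broken by song id so the order is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:top_n]]


histories = {
    "u1": ["s1", "s2", "s3"],
    "u2": ["s1", "s2"],
    "u3": ["s2", "s3", "s4"],
    "x": ["s1"],
}
colisten = build_colisten(histories)
print(recommend("x", histories, colisten))
```

Only pairs that actually co-occur are stored, which is what keeps the matrix sparse.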

Mahout’s Code

• Methodology:
  – No Idea…

• The code: all Java
  – Poorly commented
  – 14 *.java files
  – Many directories

• ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
  – RecommenderJob.java: 284 lines of code (l.o.c)
  – SimilarityMatrixRowWrapperMapper.java: 47 l.o.c
  – UserVectorSplitterMapper.java: 138 l.o.c

CAN MAHOUT EASILY BE MODIFIED?

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

Martin’s Code

• Performance on 86MB:
  – Parse data: 10 minutes
  – Make Matrix: 22 minutes
  – Predict songs for 11000 users: 1 hour, 18 minutes

• Did not test scalability

$/ python convertToNumbers.py
$/ python colisten.py
$/ python predict_colisten.py

Mahout’s Code

• Performance on 86MB:
  – Parse Time: 10 minutes
  – Total Time: 25 minutes

• Tested scalability
  – 64MB, 128MB, 256MB, 1GB, 2GB, 3GB

• Total Time
  – ~ 12m, 43m, 1hr, 2hr, 4hr, >5hr ….
• 10 nodes failed
• Prepare Jobs (parse): seconds to minutes
• Recommend Jobs (predict): seconds to minutes
• Create Matrix Jobs: minutes to hours

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

CAN MAHOUT PRODUCE ACCURATE RESULTS?

Training Set

• Kaggle Million Song Subset: 110K users
  – User 2: 16 entries – took out 8
  – User 16: 32 entries – took out 8
  – User 17: 25 entries – took out 8
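Taking entries out of a user's history, as described above, gives a hidden set to score recommendations against. The sketch below assumes the hidden entries are chosen at random with a fixed seed; the slides do not say which 8 entries were removed, and `hold_out` is a hypothetical helper.

```python
import random


def hold_out(history, n_hidden=8, seed=0):
    """Split one user's songs into (visible, hidden); hidden has n_hidden songs."""
    songs = list(dict.fromkeys(history))  # drop duplicates, keep order
    if len(songs) <= n_hidden:
        raise ValueError("history too short to hide that many entries")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    hidden = set(rng.sample(songs, n_hidden))
    visible = [s for s in songs if s not in hidden]
    return visible, hidden


# A made-up 16-entry history, the same size as User 2's.
history = [f"song{k}" for k in range(16)]
visible, hidden = hold_out(history)
print(len(visible), len(hidden))
```

The recommender sees only `visible`; the songs in `hidden` are what a good recommendation list should recover.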

Martin’s Code

[Results for User 2, User 16, and User 17, with the scoring formula, where Q is the number of queries]

Mahout’s Code

[Results for User 2, User 16, and User 17, with the scoring formula, where Q is the number of queries]
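The phrase "where Q is the number of queries" matches the standard definition of mean average precision (MAP), in which the average precision of each query is summed and divided by Q. Assuming that is the metric behind these results, with each user treated as one query, a minimal sketch looks like this (function names are hypothetical):

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked list against the held-out songs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, held_out):
    """MAP: sum of per-query average precision divided by Q queries."""
    q = len(rankings)
    total = sum(average_precision(rankings[u], held_out[u]) for u in rankings)
    return total / q


# Made-up rankings and hidden songs for two users.
rankings = {"user2": ["a", "b", "c"], "user16": ["x", "y"]}
held_out = {"user2": {"a", "c"}, "user16": {"y"}}
print(mean_average_precision(rankings, held_out))
```

Because hits near the top of the list contribute more, two recommenders that recover the same hidden songs can still score differently depending on how they rank them.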

CAN MAHOUT PRODUCE ACCURATE RESULTS?

CAN MAHOUT WORK ‘OUT OF BOX’?

YES… but not well

Conclusion

• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable

• For performance and efficiency, it is better to
  – Understand the data set
  – Understand data mining
  – Understand the methodology
