Post on 10-Apr-2017

Predictive Models at Scale using Dumbo

Nikhil Ketkar

40k+ Brands

600k+ Sellers

700+ Million Products

7k+ Categories

10k+ Attributes

Motivation: Problem Space @ Indix

Developing Predictive Models

Unlabelled Data

Sample → Hand Label → Model → Predict

Data with Predicted Labels

HDFS

Statistical Model (repeated six times in the diagram)

Predictive Models at Scale

The Two Giants

PyData Ecosystem

● Native, C/C++, Fortran
● NumPy
● SciPy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels

Hadoop Ecosystem

● JVM
● Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding

Model, Predict

The Standard Options

● Port to Java/Scala and use as a library in the Mapper
○ Time consuming
○ Need to port parts of the PyData stack
○ Reduced velocity
○ Error prone
● Write a REST API/service for the model and call it from the Mapper
○ Slow due to network latency
○ Deployment is a nightmare
● Use Disco

Can we do better?

● Hadoop Streaming with Typedbytes support
● Python wrappers over Hadoop Streaming
○ Dumbo
○ MRJob
○ Hadoopy
○ Pydoop
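The wrappers listed above all build on the same Hadoop Streaming contract: a mapper process reads records from standard input and writes tab-separated key/value lines to standard output. A minimal sketch of that contract in plain Python follows; the function names and the token-count value are illustrative assumptions, not taken from the talk.

```python
import io
import sys


def map_lines(lines):
    """Yield (key, value) pairs; here the key is the title and the value its token count."""
    for line in lines:
        title = line.rstrip("\n")
        if not title:
            continue  # skip blank records
        yield title, len(title.split())


def run_streaming_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Follow the Hadoop Streaming text convention: one 'key<TAB>value' per output line."""
    for key, value in map_lines(stdin):
        stdout.write(f"{key}\t{value}\n")


# Local dry run with in-memory streams standing in for stdin and stdout.
demo_out = io.StringIO()
run_streaming_mapper(io.StringIO("Bosch HCFC2044B drill bit\n\nRohl faucet\n"), demo_out)
```

Because the plain text protocol relies on tabs and newlines as delimiters, it only works cleanly for text records, which is one reason Typedbytes support matters.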

Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

Two Minute MapReduce Refresher

Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
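As a refresher, the map, shuffle and reduce phases can be simulated locally in a few lines. This is a single-process sketch for intuition only; the `run_mapreduce` helper is hypothetical, and the mapper and reducer are written as generators over `(key, value)` and `(key, values)`, the general shape Dumbo-style jobs use.

```python
from collections import defaultdict


def mapper(_key, line):
    # Map phase: emit (word, 1) for every word in the record.
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    # Reduce phase: combine all values that share a key.
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical local driver: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # Hadoop also hands keys to reducers in sorted order
        results.extend(reducer(out_key, groups[out_key]))
    return results


records = [
    (0, "brushed nickel mirror"),
    (1, "polished chrome faucet"),
    (2, "brushed chrome handle"),
]
word_counts = dict(run_mapreduce(records, mapper, reducer))
```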

Sample Problem: Extract MPN from Product Titles

● 0.5 Billion Product Titles
● Many contain MPNs
● Humans can detect MPNs
● Can a model do the same?
● Use CRF on Full Title
● Use RF on Tokens
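To make the "RF on Tokens" option concrete, each token of a title can be turned into a small feature vector that a token-level classifier could consume. The features below are illustrative assumptions, not the feature set used in the talk.

```python
import re


def token_features(token):
    """Hypothetical per-token features hinting at whether a token looks like an MPN."""
    digits = sum(ch.isdigit() for ch in token)
    return {
        "length": len(token),
        "has_digit": digits > 0,
        "has_alpha": any(ch.isalpha() for ch in token),
        "all_upper": token.isupper(),
        "digit_ratio": digits / len(token),
        "has_punct": bool(re.search(r"[^A-Za-z0-9]", token)),
    }


def title_to_rows(title):
    """Split a title on whitespace and pair each token with its features."""
    return [(token, token_features(token)) for token in title.split()]


rows = title_to_rows("Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame")
```

Each row would then be hand labelled (MPN or not) for the sample and fed to the classifier; a CRF over the full title would additionally use neighbouring tokens as context.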

Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4

Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle

Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection

Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25

Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6

U12 23252 KUB QUATRON INDX DRILL

MPNs in Product Titles

Code Walkthrough
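The code from the slides is not part of this transcript. As a rough sketch of the general pattern, a mapper can be written as a class that builds the model once and then predicts per record. The class names and the toy rule are hypothetical; a real job would load a trained PyData model shipped with the job, and the class would be handed to the framework (for Dumbo, presumably via `dumbo.run`) rather than called directly.

```python
class StubMpnModel:
    """Hypothetical stand-in for a trained classifier."""

    def predict(self, tokens):
        # Toy rule, not a real model: flag longer tokens that mix letters and digits.
        return [
            len(t) >= 5
            and any(c.isdigit() for c in t)
            and any(c.isalpha() for c in t)
            for t in tokens
        ]


class PredictMapper:
    def __init__(self, model=None):
        # Built once per task, so the model loading cost is not paid per record.
        self.model = model or StubMpnModel()

    def __call__(self, key, title):
        tokens = title.split()
        flags = self.model.predict(tokens)
        candidates = [t for t, is_mpn in zip(tokens, flags) if is_mpn]
        yield key, (title, candidates)


# Local call standing in for the framework feeding one record to the mapper.
title = "Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet"
mapper = PredictMapper()
predictions = list(mapper(0, title))
```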

Important Learnings

● Dumbo is fairly stable, mature and ready for production
● Gets the two giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical if making predictions over binary data (images etc.)
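The point about binary data can be illustrated with a simplified length-prefixed framing. This is an assumption-laden sketch of the idea behind a binary record format, not the actual Typedbytes wire format: newline-delimited text breaks records apart when the payload itself contains newline bytes, while a length prefix does not.

```python
import io
import struct


def write_record(stream, payload):
    """Write a 4-byte big-endian length followed by the raw bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_records(stream):
    """Read back length-prefixed records until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)


# A payload with embedded newline and tab bytes, as image data often has.
image_like = b"\x89PNG\r\n\x1a\n\tpixels\n"

buf = io.BytesIO()
write_record(buf, image_like)
write_record(buf, b"second")
buf.seek(0)
decoded = list(read_records(buf))

# The same two records sent as newline-delimited text get split in the wrong places.
text_split = (image_like + b"\n" + b"second\n").split(b"\n")
```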
