Predictive Models at Scale
![Page 1: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/1.jpg)
Predictive Models at Scale using Dumbo
Nikhil Ketkar
![Page 2: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/2.jpg)
Motivation: Problem Space @ Indix
● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes
![Page 3: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/3.jpg)
Developing Predictive Models
Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels
![Page 4: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/4.jpg)
Predictive Models at Scale
[Diagram: HDFS with six Statistical Model instances]
![Page 5: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/5.jpg)
The Two Giants
● PyData Ecosystem (Model)
○ Native: C/C++, Fortran
○ NumPy
○ SciPy, Pandas, Matplotlib
○ scikit-learn, scikit-image, statsmodels
● Hadoop Ecosystem (Predict)
○ JVM
○ Java/Scala
○ HDFS, Hadoop MapReduce
○ Cascading/Scalding
![Page 6: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/6.jpg)
The Standard Options
● Port to Java/Scala and use as a library in the Mapper
○ Time consuming
○ Need to port parts of the PyData stack
○ Reduced velocity
○ Error prone
● Write a REST API/service for the model and call it from the Mapper
○ Slow due to network latency
○ Deployment is a nightmare
● Use Disco
![Page 7: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/7.jpg)
Can we do better?
● Hadoop Streaming with Typedbytes Support
● Python Wrappers over Hadoop Streaming
○ Dumbo
○ MRJob
○ Hadoopy
○ Pydoop
Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
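Hadoop Streaming runs an ordinary executable as the mapper or reducer, passing records in on stdin and reading tab-separated key/value lines back from stdout. A minimal sketch of a streaming-style mapper, assuming one product title per input line; the names `map_lines` and `main`, and the choice of the first token as the key, are illustrative and not from the talk.

```python
import sys


def map_lines(lines):
    """Yield 'key<TAB>value' strings, one per non-empty input line.

    Illustrative assumption: each line is a product title and the key
    is its first whitespace-separated token.
    """
    for line in lines:
        title = line.strip()
        if not title:
            continue  # skip blank records
        first_token = title.split()[0]
        yield "%s\t%d" % (first_token, 1)


def main():
    """Entry point for use under Hadoop Streaming: stdin in, stdout out."""
    for out in map_lines(sys.stdin):
        sys.stdout.write(out + "\n")
```

Calling `main()` at the bottom of the script would turn it into a streaming mapper. The Python wrappers listed above hide this stdin/stdout plumbing behind plain functions.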
![Page 8: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/8.jpg)
Two Minute MapReduce Refresher
Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
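As a companion to the refresher, the three phases (map, shuffle, reduce) can be sketched in memory in a few lines. This is a toy word count for illustration only; the function names are assumptions and nothing here talks to Hadoop.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record."""
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            yield out_key, out_value


def shuffle(pairs):
    """Group mapper output by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to every (key, list_of_values) group."""
    for key, values in groups:
        for out in reducer(key, values):
            yield out


def word_mapper(key, value):
    # Emit (word, 1) for each lower-cased word in the line
    for word in value.lower().split():
        yield word, 1


def count_reducer(key, values):
    # Sum the counts collected for one word
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence and return a list of results."""
    return list(reduce_phase(shuffle(map_phase(records, mapper)), reducer))
```

For example, `run_job([(0, "Oil Rubbed Bronze"), (1, "oil rubbed chrome")], word_mapper, count_reducer)` returns one `(word, count)` pair per distinct word, sorted by word.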
![Page 9: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/9.jpg)
Sample Problem: Extract MPN from Product Titles
● 0.5 Billion Product Titles
● Many contain MPNs (Manufacturer Part Numbers)
● Humans can detect MPNs
● Can a model do the same?
● Use a CRF (Conditional Random Field) on the full title
● Use an RF (Random Forest) on tokens
MPNs in Product Titles
Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4
Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle
Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection
Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25
Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6
U12 23252 KUB QUATRON INDX DRILL
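Classifying individual tokens, as in the "RF on Tokens" option, needs each token turned into a feature vector first. A minimal sketch of what that step might look like; the feature set and the names `token_features` and `title_features` are hypothetical and were not described in the talk.

```python
import re


def token_features(token, position, n_tokens):
    """Describe one token with a few shape features (hypothetical feature set)."""
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    return {
        "length": len(token),
        "digit_ratio": digits / float(len(token)),
        "has_digit": digits > 0,
        "has_letter": letters > 0,
        # True only when the token has letters and all of them are upper case
        "is_upper": letters > 0 and token.upper() == token,
        "has_punct": bool(re.search(r"[^A-Za-z0-9]", token)),
        # 0.0 for the first token, 1.0 for the last
        "relative_position": position / float(max(n_tokens - 1, 1)),
    }


def title_features(title):
    """Return a list of (token, features) pairs for a whitespace-tokenised title."""
    tokens = title.split()
    return [(tok, token_features(tok, i, len(tokens)))
            for i, tok in enumerate(tokens)]
```

Feature dictionaries like these could then be vectorised and passed to a classifier, together with hand-labelled examples marking which tokens are MPNs.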
![Page 10: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/10.jpg)
Code Walkthrough
![Page 11: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/11.jpg)
Code Walkthrough
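A minimal sketch of what a map-only Dumbo job for the MPN problem might look like, offered as an illustration and not as the speaker's code. Dumbo accepts a mapper class whose `__init__` runs once per task and whose `__call__` yields key/value pairs. The `MPNMapper` class and the regex-based `predict_token` are hypothetical stand-ins for a trained PyData model.

```python
import re

try:
    import dumbo  # only needed when the script is launched through Dumbo
except ImportError:
    dumbo = None

# Hypothetical stand-in for a trained model: a token "looks like" an MPN
# if it mixes digits with upper-case letters and has at least 5 characters.
MPN_LIKE = re.compile(r"^(?=.*\d)(?=.*[A-Z])[A-Z0-9/.\-]{5,}$")


class MPNMapper(object):
    def __init__(self):
        # Runs once per map task; a real job would load a serialized
        # scikit-learn model here instead of using a regex.
        self.predict_token = lambda token: bool(MPN_LIKE.match(token))

    def __call__(self, key, value):
        # value is assumed to be one product title; emit the first
        # token predicted to be an MPN, if any.
        for token in value.split():
            if self.predict_token(token):
                yield value, token
                break


if __name__ == "__main__" and dumbo is not None:
    dumbo.run(MPNMapper)  # map-only job: no reducer supplied
```

Because the model is loaded in `__init__`, the loading cost is paid once per map task and not once per record.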
![Page 12: Predictive Models at Scale](https://reader031.vdocuments.site/reader031/viewer/2022021500/58eba7291a28abdc638b469d/html5/thumbnails/12.jpg)
Important Learnings
● Dumbo is fairly stable, mature and ready for production
● Gets the two giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical if making predictions over binary data (images etc.)
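Why binary data needs something like Typedbytes can be illustrated with a short sketch. Plain text streaming separates records with newlines, so a payload that happens to contain a newline byte is split in the wrong place, while length-prefixed framing keeps record boundaries intact. The framing below is a simplified illustration in the same spirit and is not the actual Typedbytes wire format; `frame` and `read_frames` are hypothetical helpers.

```python
import struct


def frame(payload):
    """Prefix a bytes payload with its length as a 4-byte big-endian integer."""
    return struct.pack(">i", len(payload)) + payload


def read_frames(buf):
    """Split a buffer of concatenated frames back into the original payloads."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        yield buf[offset:offset + length]
        offset += length


# Two fake "image" records; the first contains newline and tab bytes.
records = [b"\x89PNG\n\t\x00\x01", b"\xff\xd8\xff\xe0"]

# Line-oriented streaming: splitting on newlines breaks the first record in two.
text_stream = b"\n".join(records)
broken = text_stream.split(b"\n")

# Length-prefixed framing: boundaries survive whatever bytes the payload holds.
binary_stream = b"".join(frame(r) for r in records)
recovered = list(read_frames(binary_stream))
```

Here `broken` ends up with three pieces for two input records, while `recovered` equals `records`.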