![Page 1: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/1.jpg)
Machine Learning in Big Data - Look forward or be left behind
V. William Porto, Hadoop Summit Dublin 2016
![Page 2: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/2.jpg)
Overview of RedPoint Global
2 RedPoint Global Inc. 2016 Confidential
Launched in 2006
Founded and staffed by industry veterans
Headquarters: Wellesley, Massachusetts
Offices in US, UK, Australia, Philippines
Global customer base
Serves most major industries
![Page 3: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/3.jpg)
Overview of RedPoint Global
MAGIC QUADRANT: Data Quality
MAGIC QUADRANT: Integrated Marketing Management
MAGIC QUADRANT: Multichannel Campaign Management
MAGIC QUADRANT: Digital Marketing Hubs
FORRESTER WAVE™: Cross-channel Campaign Management
FORRESTER WAVE™: Data Quality Solutions
![Page 4: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/4.jpg)
With apologies to Gary Larson
Hadoop
![Page 5: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/5.jpg)
Machine Learning – why bother?
“If you have always done it that way, it is probably wrong” - Charles Kettering
![Page 6: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/6.jpg)
Machine Learning – keeping ahead of the curve
• Three basic tenets for success in today’s world
• Prediction - you need to learn and use what you’ve learned
• Optimization - the world is a dynamic place
• Automation - because people don’t scale well
![Page 7: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/7.jpg)
Machine Learning – what really is it all about?
• Learning vs. instruction
• Humans learn instinctively – computers not so much
• Intelligent Systems
• Memory
• Prediction (modeling)
• Assessment
• Feedback
• Adaptation
![Page 8: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/8.jpg)
Data Modeling – what, why, how
• Regression – what happened in the past
• Prediction – what will happen in the future
“Prediction is very difficult – especially if it’s about the future”
- Niels Bohr
![Page 9: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/9.jpg)
Data Modeling – what, why, how
The wide world of data modeling
• Supervised models
  • you have historical data and known correlated outputs (truth)
• Unsupervised models
  • historical data, but may not have (or trust) associated outputs
![Page 10: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/10.jpg)
Decision Trees
Major Assumption: the world is discrete
• fast, easy to understand, no linearity assumptions
• ‘human time’ required, unbalanced and/or large trees
![Page 11: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/11.jpg)
Standard Linear Models
Assumption: the world is linear
• the real world really isn’t linear
• not all errors are equal
• easy to get misleading results
Which line is best?
![Page 12: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/12.jpg)
Generalized ‘Non-Linear’ Models
Assumptions
• underlying functional mapping is known
• all errors are equal
• data is ‘well-conditioned’
• ‘standard’ error distribution
• Polynomials
• Exponentials (e.g., Gaussian, Poisson)
• Piece-wise linear
![Page 13: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/13.jpg)
Non-Linear Models
Assumption: data is representative
• ‘universal’ modeling tools
• fast execution
• no linearity assumptions
• lots of parameters, many techniques
• difficult to explain
Artificial Neural Network
![Page 14: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/14.jpg)
User Story: Predict Retention / Attrition
Historical Behavioral Data
Columns: Customer Rating | Retention | Customer Name | Loyalty Member | Days Since Last Purchase | Immediate Relatives | Household Children | Customer ID | Latest Purchase Price | Latest Purchase Item ID | Region Code | Customer Capture Method | Customer Contact Code | Domicile
1 1 Allen, Geraldine yes 29 0 2 24160 211.39 B5 MW 2 6 St Louis, MO
1 1 Anderson, Harry no 48 0 3 19952 26.55 E12 NE 3 New York, NY
1 1 Andrews, Cynthia yes 63 1 0 13502 77.95 D7 NE 10 6 Hudson, NY
1 0 Andrews, Thomas Jr no 39 0 0 112050 0 A36 SW Los Angeles, CA
1 1 Appleton, Mary yes 53 2 3 11769 51.49 C101 NE D Bayside, Queens, NY
1 0 Ashbury, Jeffrey no 47 1 0 PC 17757 29.99 C62 C64 NE 124 New York, NY
1 1 Aston, Mrs. yes 18 1 0 PC 17757 29.99 C62 C64 NE 4 New York, NY
1 1 Barber, Ellen yes 26 0 2 19877 78.85 S 6
1 1 Barkley, Henry no 80 0 0 27042 30 A23 NE B Yorktown, PA
1 0 Baumann, David no 0 0 PC 17318 25.99 NE New York, NY
1 1 Bazzeno, Alice yes 32 0 1 11813 76.95 D15 C 8 34
1 0 Beattie, Mr. Samuel no 36 0 0 13050 75.29 C6 C A 11 Winnipeg, MN
1 1 Beckworth, June yes 47 1 1 11751 52.49 D35 NE 5 New York, NY
1 1 Behr, John no 26 0 0 111369 30 C148 NE 5 New York, NY
1 1 Biden, Roseanne yes 42 0 0 PC 17757 127.99 C 4
1 1 Bird, Ellen yes 29 0 0 PC 17483 18.95 C97 S 8
1 0 Birnbaum, Jason no 25 0 0 13905 26 C 148 San Francisco, CA
![Page 15: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/15.jpg)
User Story: Predict Customer Retention / Attrition
Machine Learning Processing Chain - Training
![Page 16: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/16.jpg)
User Story: Predict Retention / Attrition
Machine Learning Processing Chain - Prediction
Reward predicted ‘retainees’ with targeted product offerings
Give potential attrition customers special incentives to stay with the business
![Page 17: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/17.jpg)
User Story: Accurate vs. Useful Prediction
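Sparse data + Least-Squares (Linear) Classifier
• Task: predict chance of purchasing a sundry item
• Result: ‘best’ model always predicts “none”
• Analysis: LS algorithm assumes all errors are equal

| Bread | Cake & Pie | Chocolate | Coffee | Cookie | Diesel | Juice & Smoothies | Lubricants | Milk | Other Bakery | Premium | Sandwich | Snack | Tea | Total Transaction | Total Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3000 |
| 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2000 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1800 |
| 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 4800 |
| 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 100 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1828 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 16460 |
| 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1000 |
| 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1500 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 4600 |
| 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 19381.5 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1860 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3000 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 9838.82 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 11000 |
| 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 18225 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 500 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 800 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 7990 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 3820 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 43230 |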
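The effect described on this slide can be reproduced on a tiny made-up dataset. The sketch below is an illustration only: the `visits` and `bought` values are invented, not taken from the slide's data, and the 0.5 decision threshold is an assumption. It fits an ordinary least-squares line to a sparse 0/1 target and shows that the resulting classifier is "accurate" yet never finds a buyer.

```python
def fit_least_squares(xs, ys):
    """Closed-form ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sparse data: only one customer in ten bought the item.
visits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bought = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

slope, intercept = fit_least_squares(visits, bought)
scores = [slope * x + intercept for x in visits]

# Assumed decision rule: call it a purchase when the fitted score reaches 0.5.
predicted = [1 if s >= 0.5 else 0 for s in scores]

accuracy = sum(p == y for p, y in zip(predicted, bought)) / len(bought)
buyers_found = sum(1 for p, y in zip(predicted, bought) if p == 1 and y == 1)

print(predicted)      # every prediction is 0 ("none")
print(accuracy)       # high accuracy
print(buyers_found)   # but no actual buyer is identified
```

Because squared error treats a missed buyer and a false alarm as equally costly, the fit settles near the (very low) base rate and the thresholded model predicts "none" for everyone.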
![Page 18: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/18.jpg)
Clustering/Segmentation – group think
Collaborative Filtering
Relationship Matrix
![Page 19: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/19.jpg)
Personalization – not really
!=
![Page 20: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/20.jpg)
Clustering/Segmentation
Similarity?
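| Customer | Browser | Gender | Age Sector | Income Sector | Married | Children | Homeowner | Recent Baby Clothes Purchase |
|---|---|---|---|---|---|---|---|---|
| George | IE9 | M | 0 | A | N | 0 | 1 | N |
| Carol | Chrome | F | 1 | B | Y | 1 | 0 | Y |
| Mary | IE9 | F | 0 | A | N | 1 | 0 | Y |

Dist(George, Carol) = 8
Dist(George, Mary) = 4
Dist(Carol, Mary) = 4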
Can you afford to target (George,Mary) the same way as (Carol,Mary) ?
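The distances on this slide are consistent with a simple attribute-mismatch count (Hamming distance) over the eight attributes. The slide does not name the metric, so treat that as an assumption; the sketch below just shows that counting mismatches reproduces the three values.

```python
# Attribute order: Browser, Gender, Age Sector, Income Sector, Married,
# Children, Homeowner, Recent Baby Clothes Purchase (values from the slide).
customers = {
    "George": ("IE9", "M", 0, "A", "N", 0, 1, "N"),
    "Carol": ("Chrome", "F", 1, "B", "Y", 1, 0, "Y"),
    "Mary": ("IE9", "F", 0, "A", "N", 1, 0, "Y"),
}


def hamming(a, b):
    """Count the attributes on which two customers differ."""
    return sum(x != y for x, y in zip(a, b))


def differing_attributes(a, b):
    """Return the positions of the mismatches, not just how many there are."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


for left, right in [("George", "Carol"), ("George", "Mary"), ("Carol", "Mary")]:
    print(left, right, hamming(customers[left], customers[right]))
```

Mary is equally far from George and from Carol, but `differing_attributes` shows the two gaps involve completely different attributes, which is why an unweighted distance alone says little about how to target each pair.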
![Page 21: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/21.jpg)
Clustering/Segmentation
Basic Question – which one describes the data the best?
Raw data
How many clusters are there ?
Two Clusters
Four Clusters
Six Clusters
![Page 22: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/22.jpg)
Clustering/Segmentation with Statistics
• relatively simple
• data distribution assumptions
• initialization dependencies
[Figure: three scatter-plot panels on 0 to 100 axes: Raw Data, Ellipsoidal Clustering, K-Means Clustering]
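The "initialization dependencies" bullet can be seen on a four-point toy example. The sketch below is a plain Lloyd-style k-means written for illustration; the points and starting centroids are made up and are not the data in the slide's figure. The same algorithm on the same data ends in two different clusterings depending only on where the centroids start.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=20):
    """Run Lloyd's algorithm from the given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else old
                     for c, old in zip(clusters, centroids)]
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Hypothetical data: corners of a wide, short rectangle.
points = [(0, 0), (4, 0), (0, 1), (4, 1)]

good_centroids, good_sse = kmeans(points, [(0, 0.5), (4, 0.5)])  # left / right split
poor_centroids, poor_sse = kmeans(points, [(2, 0), (2, 1)])      # top / bottom split

print(good_sse, poor_sse)
```

Both runs converge, but the second is stuck with a much larger within-cluster error; in practice this is why k-means is usually restarted from several initializations.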
![Page 23: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/23.jpg)
Clustering/Segmentation – data driven
• let the data speak for itself
• multiple data projection ‘views’
• important boundary relationships (“swing voters”)
Customer Demographics
![Page 24: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/24.jpg)
User Story: Clustering / Segmentation
ML Clustering – Training
ML Clustering – Processing New Data
![Page 25: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/25.jpg)
Model Selection – how to choose?
• Basic Model Type (prediction or segmentation)
  • inputs + correlated outputs
  • inputs only?
• Basic Questions:
  • what to use for my problem?
  • parameters?
  • is this the best choice?
  • could I do better, and how?
![Page 26: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/26.jpg)
Optimization – Evolving better solutions
• Simulated Evolution
  • fast, efficient search
  • always have a solution
  • arbitrary ‘evaluation’ functions
  • can start with existing solution(s)
• Variation – alter model type, parameters
• Assessment – how well does the model work?
• Selection – survival of the fittest
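The variation / assessment / selection loop can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the Gaussian mutation, the population sizes, and the toy `cost` function are all assumptions chosen to keep the example short. Parents compete with their offspring, so the best solution found so far is never lost.

```python
import random


def evolve(evaluate, initial, generations=50, offspring_per_parent=4,
           sigma=0.5, seed=0):
    """Minimise `evaluate` by simulated evolution; lower scores are better."""
    rng = random.Random(seed)
    population = [list(p) for p in initial]   # can start from existing solutions
    size = len(population)
    best_history = []
    for _generation in range(generations):
        # Variation: perturb each parent's parameters with Gaussian noise.
        children = []
        for parent in population:
            for _child in range(offspring_per_parent):
                children.append([gene + rng.gauss(0, sigma) for gene in parent])
        # Assessment: score parents and children with the evaluation function.
        candidates = sorted(population + children, key=evaluate)
        # Selection: only the fittest survive to the next generation.
        population = candidates[:size]
        best_history.append(evaluate(population[0]))
    return population[0], best_history


def cost(params):
    """Hypothetical evaluation function: not differentiable at its optimum."""
    x, y = params
    return abs(x - 3) + abs(y + 1)


start = [[10.0, 10.0]]
best, history = evolve(cost, start)
print(cost(start[0]), history[-1], best)
```

The evaluation function is treated as a black box, so nothing in the loop requires it to be continuous, differentiable, or symmetric.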
![Page 27: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/27.jpg)
Evolutionary Optimization – Evaluation Function
• can use any measurable data
• no continuity assumptions
• no differentiability assumptions
• no symmetry assumptions
Evaluation (payoff) matrix:

| Prediction \ Reality (Truth) | Sunshine | Hurricane |
|---|---|---|
| Sunshine | 20 | -1000 |
| Hurricane | 5 | 50 |
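An asymmetric evaluation function of this kind is easy to express as a lookup table. The sketch below uses the four values from this slide, reading them as payoffs with the prediction first and the reality second; that orientation is an assumption, chosen because predicting sunshine when a hurricane arrives is the plausible -1000 case. The ten-day forecast data is made up for illustration.

```python
# (prediction, reality) -> payoff; orientation assumed, values from the slide.
PAYOFF = {
    ("sunshine", "sunshine"): 20,
    ("sunshine", "hurricane"): -1000,
    ("hurricane", "sunshine"): 5,
    ("hurricane", "hurricane"): 50,
}


def total_payoff(predictions, truths):
    """Evaluation function: sum the payoff of every (prediction, reality) pair."""
    return sum(PAYOFF[(p, t)] for p, t in zip(predictions, truths))


def accuracy(predictions, truths):
    """Fraction of predictions that match reality, all errors weighted equally."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


# Hypothetical ten days: nine sunny, then one hurricane.
truths = ["sunshine"] * 9 + ["hurricane"]
always_sunny = ["sunshine"] * 10
cautious = ["sunshine"] * 7 + ["hurricane"] * 3   # two false alarms, one hit

print(accuracy(always_sunny, truths), total_payoff(always_sunny, truths))
print(accuracy(cautious, truths), total_payoff(cautious, truths))
```

The always-sunny model is the more accurate of the two but by far the less useful once the cost of each kind of error is taken into account, which is the case for letting the optimizer score models with a payoff table rather than plain accuracy.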
![Page 28: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/28.jpg)
User Story: Optimizing Classification Models
Task: Predict Retention/Attrition
[Chart: Model Performance Optimization. X axis: Generation (0 to 6); Y axis: Performance (0 to 100)]
Classification Accuracy: 62, 70.2, 72.3, 73.4, 75.2
Test Set Error (RMS): 34.8, 28.8, 24.5, 22.1, 20.9
17 potential input features (customer demographics)
2 outputs (retention/attrition)
1300 training samples (50-50 A/B split)
1300 test samples (naïve test data)
![Page 29: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/29.jpg)
Use Case – Fully Adaptive Feedback (Next Best Offer)
DB
Historical User Behavior
(stimulus/response)
Train / Update Model
Non-Adaptive (Fixed) Mode
Randomized A/B/C Offer Selection
Adaptive ML Mode
ML Prediction Offer Selection
Operation (Trigger)
Ad / Offer (stimulus)
Feedback Cycle
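The two operating modes in this diagram can be sketched as one selection function over a model that is updated from feedback. Everything named here (`OfferModel`, `select_offer`, the smoothed response-rate estimate) is hypothetical and stands in for whatever model the real system trains; the sketch only shows the shape of the loop: offer, observe response, update, choose again.

```python
import random


class OfferModel:
    """Tracks how often each offer (stimulus) was shown and accepted (response)."""

    def __init__(self, offers):
        self.shown = {offer: 0 for offer in offers}
        self.accepted = {offer: 0 for offer in offers}

    def predicted_rate(self, offer):
        # Smoothed acceptance rate so unseen offers do not score exactly 0.
        return (self.accepted[offer] + 1) / (self.shown[offer] + 2)

    def update(self, offer, was_accepted):
        """Feedback cycle: fold one observed response back into the model."""
        self.shown[offer] += 1
        self.accepted[offer] += int(was_accepted)


def select_offer(model, adaptive, rng):
    offers = list(model.shown)
    if not adaptive:
        # Non-adaptive (fixed) mode: randomized A/B/C selection.
        return rng.choice(offers)
    # Adaptive ML mode: pick the offer with the best predicted response.
    return max(offers, key=model.predicted_rate)


model = OfferModel(["A", "B", "C"])
# Hypothetical history: 10 impressions per offer with 1, 6 and 3 acceptances.
for offer, accepted_count in [("A", 1), ("B", 6), ("C", 3)]:
    for i in range(10):
        model.update(offer, i < accepted_count)

rng = random.Random(0)
print(select_offer(model, adaptive=True, rng=rng))
print(select_offer(model, adaptive=False, rng=rng))
```

A purely greedy adaptive mode stops learning about the offers it no longer shows, so a real deployment would likely keep some randomized selection in the mix.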
![Page 30: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/30.jpg)
Five Keys to Successful Machine Learning
• Let the data speak for itself – don’t force fit your models
• Remember, not all errors are equal – use this to your advantage
• True learning requires continual adaptation!
• Automate the process with feedback – remove the “man-in-the-loop”
• Trust the optimization process – it really works!
![Page 31: Machine Learning in Big Data](https://reader035.vdocuments.site/reader035/viewer/2022062503/58713baa1a28abf0568b6ec9/html5/thumbnails/31.jpg)
Q&A
Contact Info
Visit: www.redpoint.net
Bill Porto, Sr. Engineering Analyst
RedPoint Global [email protected]
Want More Information about this topic?
Fill out your card or go to redpoint.net/hadoopeurope