data driven engineering 2014

48
The Unrealized Power of Data Roger Barga GPM, CloudML C+E

Upload: roger-barga

Post on 09-Feb-2017

259 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Data Driven Engineering 2014

The Unrealized Power of DataRoger Barga

GPM, CloudML C+E

Page 2: Data Driven Engineering 2014
Page 3: Data Driven Engineering 2014

Big Data is Here TodayDisruptive Innovation…

http://bit.ly/5a9KL

Page 4: Data Driven Engineering 2014

Competing on Analytics

If you can predict it, you can own it…• Forecasting• Targeting• Fraud detection, anti-fraud analytics• Risk• Customer churn, conversion • Propensity• Price Elasticity

By 2014, 30 percent of analytic applications will use predictive capabilities.– Gartner Business Intelligence Summit 2012

Page 5: Data Driven Engineering 2014

Predictive Analytics in Action

$3 Million in Health Analytics Prize Money Available http://bit.ly/groYXc

Predictive Analytics Applied to the Hospital Billing Process http://bit.ly/gpX9ak

Educational Data Mining is becoming popular http://bit.ly/fM0M1a

New Model Predict Presence of Meth Labs Very Accurately http://bit.ly/aIgUGH

Monitoring Physician Performance for Predictive Profiling http://bit.ly/bYt4sg

Predicting Employee Success http://bit.ly/clox2R

Forecast Career Paths by Resume Analytics http://bit.ly/gvlFKJ

Thorotrends offers research/analytics for horse racing http://bit.ly/diinec

How Target Figured Out A Teen Girl Was Pregnant http://onforb.es/SokL3j

Page 6: Data Driven Engineering 2014

Data Science65% of enterprises feel they have a strategic shortage of data scientists, a role many did not even know existed 12 months ago…

Data Science is far too complex today

• Software Development

• Data Exploration

• Math & Statistics Knowledge

• Machine Learning

• Data Management

this must get simpler, it simply won’t scale!

Page 7: Data Driven Engineering 2014

Talk Outline

Predictive Analytics TodayPredictive analytics can be a potent weapon in your business analytics toolbox

With increasing commoditization it will truly be the next differentiator

Advancement of Data ScienceUnique role and data science process;

Applying the principles of scientific experimentation to business;

Application of Predictive Analytics Within MicrosoftDemo our internal Azure service for machine learning over data to build predictive models;

Discuss three models built using Data Lab that we are using internally;

Page 8: Data Driven Engineering 2014

Predictive Analytics

“Trying to predict the future is like trying to drive down a country road at night with no lights while looking out the back window.”

– Peter Drucker• We must seek out patterns among seemingly

unconnected data and disciplines to gain valuable insights.

• It took 50 years for organizationsto fully leverage transactions systems, so it will time to realize significant value from knowledge, information and data.

• What gets measured, gets done.

Page 9: Data Driven Engineering 2014

Predictive AnalyticsSuccessful Predictions

Page 10: Data Driven Engineering 2014

Predictive AnalyticsSuccessful Predictions

Page 11: Data Driven Engineering 2014

Predictive AnalyticsCustomer discussions, from the front lines

By applying predictive analytics to data they already have, organizations are uncovering unexpected patterns and associations, and they are beginning to develop predictive models to guide front-line interactions.

Page 12: Data Driven Engineering 2014

Predictive AnalyticsCustomer discussions, from the front lines

By applying predictive analytics to data they already have, organizations are uncovering unexpected patterns and associations, and they are beginning to develop predictive models to guide front-line interactions.

Business Scenario:

• Downtime due to equipment failure or unnecessary maintenance bears huge cost;

• Goal: Predict failures from collected data and defer maintenance;

Five years of data, time series data fromhundreds of sensors on machines throughoutthe gas field, for dozens of gas fields acrossdifferent regions of the world…

Page 13: Data Driven Engineering 2014

Predictive AnalyticsCustomer discussions, from the front lines

By applying predictive analytics to data they already have, organizations are uncovering unexpected patterns and associations, and they are beginning to develop predictive models to guide front-line interactions.

Business Scenario:

• Downtime due to equipment failure or unnecessary maintenance bears huge cost;

• Goal: predict failures from collected data and defer maintenance;

Time series analysis to createfeatures and evaluate variousmachine learning approaches…

Page 14: Data Driven Engineering 2014

Predictive AnalyticsCustomer discussions, from the front lines

By applying predictive analytics to data they already have, organizations are uncovering unexpected patterns and associations, and they are beginning to develop predictive models to guide front-line interactions.

Business Scenario:

• Downtime due to equipment failure or unnecessary maintenance bears huge cost;

• Goal: predict failures from collected data and defer maintenance;

One-Class SVM

Page 15: Data Driven Engineering 2014

Predictive AnalyticsCustomer discussions, from the front lines

By applying predictive analytics to data they already have, organizations are uncovering unexpected patterns and associations, and they are beginning to develop predictive models to guide front-line interactions.

• Oil & Gas customer, building models for predictive maintenance of equipment;• Telecom customer, building customer churn predictive model to improve retention;

Enterprises are beginning to realize that data is a strategic asset and that they need to start treating it like one...

• If you don’t have or can’t find the data you need, consider creating it;

Page 16: Data Driven Engineering 2014

Predictive AnalyticsBridging the Gap

Predictive analytics must be simpler, today it requires highly specialized skills;• Tools that their current staff and/or new hires can use for predictive modelling;• Ability to integrate predictive models in line of business applications;

Page 17: Data Driven Engineering 2014

Predictive AnalyticsBridging the Gap

Predictive analytics must be simpler, today it requires highly specialized skills;• Tools that their current staff and/or new hires can use for predictive modelling;• Ability to integrate predictive models in line of business applications;

Enterprises require an end-to-end solution for their predictive modelling• Support for collaboration, lineage tracking;• Archive for predictive models, model governance;• Support for search & discovery, reuse;

Page 18: Data Driven Engineering 2014

Predictive AnalyticsBridging the Gap

Predictive analytics must be simpler, today it requires highly specialized skills;• Tools that their current staff and/or new hires can use for predictive modelling;• Ability to integrate predictive models in line of business applications;

Enterprises require an end-to-end solution for their predictive modelling• Support for collaboration, lineage tracking;• Archive for predictive models, model governance;• Support for search & discovery, reuse;

Predictions need to be operationalized, not locked in the backroom • Deployable predictions, minimize time to insight;• Ability to frequently update predictive model given new data, conditions change;

Page 19: Data Driven Engineering 2014

Predictive AnalyticsBridging the Gap

Familiar algorithms;• Classification and Regression Trees (CART);• Linear Regression;• Decision Forests, Boosted Decision Trees, see Kaggle models of choice;• Logistic Regression or other discrete choice;• Association Rules;• K-nearest Neighbors;• Box Jenkins (ARIMA);• Support Vector Machines;

• Deep Neural Networks, watch this one carefully…

More data & better features, in the right hands even a simple algorithm can produce a very effective predictive model – enter the data scientist…

Page 20: Data Driven Engineering 2014

Data ScientistA Relatively New Job Title

DJ Patil and Jeff Hammerbacher

Alternative Job Titles

Signature SkillsMath and Statistics Knowledge

Machine Learning

Programming Skills

Computer Science

Data Visualization

Data Curiosity

Emergence of Data Science Teams

Page 21: Data Driven Engineering 2014

Data ScienceDiscipline with Deep Academic Roots

Bill Cleveland, Bell LabsData Science Action Plan, 2001

The Data Science Journal, 2002 (CODATA)

Journal of Data Science, 2002

National Science Foundation (NSF), 2005Long-lived Digital Data Collections

JISC, 2008The Skills, Role & Career Structure of Data Scientists & Curators

Page 22: Data Driven Engineering 2014

“We have found that creative data scientists can solve problems in every field better than experts in those fields can…”

“The expertise you need is strategy in answering these questions.”

Interview with Jeremy Howard, Chief Scientist, Kaggle: http://slate/me/UM0wQ7

Data ScientistA Critical Role

Page 23: Data Driven Engineering 2014

Data Science ProcessMind the Gap, the space between the data set and the algorithm

Conventional descriptive analytics goes straight from a data set to applying analgorithm. But in Data Science there’s a huge space in between of very importantstuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hardpart. The hard part is data wrangling and doing it well.

Handling missing values

Page 24: Data Driven Engineering 2014

Data Science ProcessMind the Gap, the space between the data set and the algorithm

Conventional descriptive analytics goes straight from a data set to applying analgorithm. But in Data Science there’s a huge space in between of very importantstuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hardpart. The hard part is data wrangling and doing it well.

Handling missing values

Feature selection

Page 25: Data Driven Engineering 2014

Data Science ProcessMind the Gap, the space between the data set and the algorithm

Conventional descriptive analytics goes straight from a data set to applying analgorithm. But in Data Science there’s a huge space in between of very importantstuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hardpart. The hard part is data wrangling and doing it well.

Handling missing values

Feature selection

Feature creation and imputation

Page 26: Data Driven Engineering 2014

Data Science ProcessMind the Gap, the space between the data set and the algorithm

Conventional descriptive analytics goes straight from a data set to applying analgorithm. But in Data Science there’s a huge space in between of very importantstuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hardpart. The hard part is data wrangling and doing it well.

Handling missing values

Feature selection

Feature creation and imputation

Joining with external data to enrich training set

Page 27: Data Driven Engineering 2014

Data Science ProcessRequires Agile Experimentation

Define Objective

Access and

Understand the

Data

Pre-processing

Feature and/or

Target

construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.

2. Collect and understand the data, deal with vagaries and biases in data acquisition (missing data, outliers due to errors in data collection process, more sophisticated biases due to the data collection procedure, etc.

3. Frame the problem in terms of a ML problem – classification, regression, ranking, clustering, forecasting, outlier detection etc. – combination of domain knowledge and ML knowledge is useful.

4. Transform raw data into a “modeling dataset”, with features, targets etc., which can be used for modeling. Feature construction often improved with domain knowledge. Target must be identical or proxy of metric in step 1.

Page 28: Data Driven Engineering 2014

Data Science ProcessRequires Agile Experimentation

Feature

selection

Model

training

Model

scoring

Evaluation

Train/ Test

split 5. Train, test and evaluate, taking care to controlbias/variance and ensure the metrics are reported withthe right confidence intervals (cross-validation helpshere), be vigilant against target leaks (which typicallyleads to unbelievably good test metrics) – this is themachine learning heavy step.

Page 29: Data Driven Engineering 2014

Data Science ProcessRequires Agile Experimentation

Define Objective

Access and

Understand the

data

Pre-processing

Feature and/or

Target

construction

Feature

selection

Model

training

Model scoring

Evaluation

Train/ Test

split

6. Iterate steps (2) – (5) until the test metrics are satisfactory

Page 30: Data Driven Engineering 2014

Data Science ProcessSupport Algorithmic Creativity and Exploration

• Neural Networks

• Logistic Regression

• Support Vector Machine

• Decision Trees

• Ensemble Methods

• adaBoost

• Bayesian Networks

• Genetic Algorithms

• Random Forest

• Monte Carlo methods

• Principal Component Analysis

• Kalman Filter

• Evolutionary Fuzzy Modelling

• Deep Neural Networks

Page 31: Data Driven Engineering 2014

Data Science ProcessAbility to Determine if a Model Will Predict With Accuracy...

Overfitting has many faces, always measure performance on out-of-sample data.

• At least one test, train on some of your data and test on the rest;

• Do several, splitting the data differently each time;

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.” George E.P. Box

Page 32: Data Driven Engineering 2014
Page 33: Data Driven Engineering 2014

CloudML StudioModel Authoring Through Visual Composition

• Accessible through a web browser

• End2End support for data science workflow

• Flexible and extensible

Page 34: Data Driven Engineering 2014

CloudML StudioFoundational ML & Analytics Algorithms

Associative rule mining

Neural networks

Regression analysis

Clustering Random Forrest

Support for extensibility

Enable users to add their own algorithms as modules

Mathematical programming

Online analytical processing

Bayesian Machine Learning

Text analytics

Monte Carlo methods

Support for vector machines

Boosted Decision tree learning

Time series processing

Page 35: Data Driven Engineering 2014

Rapid Experimentation to Create a Better Model

• Train and test several models from existing library

• Try a wide range of strategies

• Extend existing models

Page 36: Data Driven Engineering 2014

Immutable Model Repository

Durable repository for predictive models and algorithms to build up the knowledge base and make CloudML Studio more valuable

Page 37: Data Driven Engineering 2014
Page 38: Data Driven Engineering 2014

Data LabAnalyze SQL Azure database traces, predict rack capacity for planning purposes

Page 39: Data Driven Engineering 2014

Data LabAnalyze SQL Azure database characterization model: bully, resource profiling, etc…

Page 40: Data Driven Engineering 2014

Daimler AG – Predicting gap and flush defects during the automobile manufacturing process…

Page 41: Data Driven Engineering 2014

Top 3 Obstacles to Using Predictive Analytics.

1. Too hard to use, can’t find skilled talent.

2. Cost of software, access to good algorithms.

3. Difficulty integrating into business processes

Page 42: Data Driven Engineering 2014

☁Intelligent Web

Service

consume publish

+

Machine Learning Services on Azure

enterprise

customer

data

scientist

CloudML Studio

☁• Automatically scale in response to actual usage, eliminate upfront costs for hardware resources.

• Scored in batch-mode or request-response mode;

• Actively monitor models in production to detect changes;

• Telemetry and model management (rollback, retrain).

CloudML StudioPublish as Azure Web Service(s)

Page 43: Data Driven Engineering 2014

☁CloudML Service

Page 44: Data Driven Engineering 2014

☁CloudML Service

Page 45: Data Driven Engineering 2014
Page 46: Data Driven Engineering 2014

Closing Comments

Data is a core strategic asset, it encodes your business’ collective experience• It is imperative to learn from your data, learn about your customers, products,…• As your org learns from its collective experience it will discover strategic insights

…and can put this knowledge into action.

Predictive Analytics capitalizes on all available data• Adds smarter decisions to systems, drives better outcomes, delivers greater value…• It requires specialized expertise, talent, and tools to execute well.

Rise of the Data Scientist• Applying the principles of scientific experimentation to business processes;• Adept at interpretation & use of numeric data, appreciation for quantitative methods

Better tooling for data science productivity and end-to-end management…

Emerging literacies and requirements….

Page 47: Data Driven Engineering 2014

Want to understand the what, why, and how of predictive analytics. Here’s a short, ordered reading list designed to get you up to speed super fast…

The Signal And The Noise: Why So Many Predictions Fail — but Some Don’t by Nate Silver.Nate Silver built an innovative system for predicting baseball performance and predicted the 2008 and 2012 elections within a hair ’s breadth.

The New York Times publishes FiveThirtyEight.com, where Silver is one of the nation’s most influential political forecasters. Why read: This

book will inspire your “predictive” imagination and ground you in the realities of predictive analytics.

Predictive Analytics: The Power To Predict Who Will Click, Buy, Lie, or Die by Eric Siegel. Eric Siegel has written a very accessible book on how predictive analytics works. It’s chock-full of dozens of real-world examples, such as how

Chase Bank predicted mortgage risk (before the recession), IBM Watson won Jeopardy!, and Hewlett-Packard predicted employee flight

risk. Why read: Predictive Analytics is a perfectly paced explanation of how predictive analytics works and a repository of dozens of real

examples across many use cases.

Uncontrolled: The Surprising Payoff Of of Trial-and-Error For Business, Politics, and Society by Jim Manzi. Predictive analytics is not magic — it’s science. Jim Manzi reminds us of the power of the scientific method and reveals the shocking truth that

many huge societal and business decisions are made based on misinterpreted data and statistics. Why read: Controlled experimentation

amplifies the results of predictive analytics and helps avert the risk of inaccurate predictive models.

Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank, and Mark Hall. If you write code or have a computer science background, you’ll probably want to know the gory details of how predictive analytics works.

Why read: Data Mining provides a thorough grounding in machine learning concepts as well as practical advice on applying machine

learning tools and techniques in real-world data mining situations.

Page 48: Data Driven Engineering 2014

Thank You