The Rise of the DataOps - Dataiku - J On The Beach 2016

Post on 16-Apr-2017

Category:

Technology

The Role of DevOps in Data Analytics Teams

J ON THE BEACH 05/21/16

MORPHED WITH DEEP LEARNING™

TYPICAL OPS GUY (source: Reddit)

TYPICAL YOUNG DATA SCIENTIST (source: Common Sense)

My initial interests

Type Systems, Automated Proving, Abstract Program Interpretation, Functional Programming, Garbage Collection and VMs

Graph Analytics, Chess AI, Natural Language Processing, 80% Emacs / 20% Vim

So to sum it up …

I (USED TO?) BE A BIG NERD

Collaboration

CLICKERS CODERS

Software is a Human Problem

I ended up building a collaborative software

for data science ...

DEV OPS && DATA

Let’s get back to the (brief) history of DevOps

Agile Conference, 2008

Scrum, and Agile in an operational context

Hey! We should have our own Velocity in Belgium

10+ Deploys per Day: Dev and Ops Cooperation at Flickr

O’Reilly Velocity, June 2009

Patrick Debois

2007

Dev

Ops

QA

DevOpsDays

Ghent, October 2009

DevOps

DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.

DevOps is also characterized by operations staff making use of many of the same techniques as developers for their systems work.

Invite Ops to the Dev Meeting

Oh. And let them SPEAK

Ops should know how to code

Let’s take an example: John, a DevOps from 2009

Learnt Python the Hard Way

Started with Puppet 1.0

Used EC2 before ELB and EBS !

Hegelian perspective

Conflict and Frustration

Concept Combination

Catharsis

Create Culture

Share

Create Tools

Dev + Ops

Hasn’t Ops been associated with data for a while?

It’s called Business Intelligence !

History of Data Analytics (Oversimplified)

Timeline: 2013 to 2018

Moving to a world of automated decision making

DATA FOR MORE INSIGHTS

DATA FOR AUTOMATED DECISIONS

The Age Of Distributed Intelligence

Global, Personalised and Real-Time Data-Driven Services

Data, Analytics and Data Science

Conflict and Frustration

Concept Combination

Catharsis

Create Culture

Share

Create Tools

Data + Science

Welcome to Technoslavia !

Classic Business Intelligence Team Organization

Business Leader Data Consumer

Line-of-business Data Consumer

Business Project Sponsor

BI Solution Architect

Model Designer

ETL Developer

Dashboard / Report Designer

SpecsDim

Big Boss

Data Science Team Organization

Business Leader Data Consumer

Line-of-business Data Consumer

Business Project Sponsor

Data Engineer

Data Analyst

System Engineer / Data Architect

Business Needs

Data Scientist

IT Constraints

I.T.

Is there room for a new role ?

Data Plumber

Data Engineer

Data Scientist

Data Waiter

Data Cleaner

Data Analyst

REAL JOB

DREAM JOB

DevOps For Data?

Imagine a company building a new “smart car” app: AutoFine™

“Revolutionary collaborative network that checks the quality of your driving and punishes you with virtual fines if you’re a bad driver”

Imagine a company building a new “smart car” service: AutoFine™

10 TB of Data Every Month

Hive / Spark / Python

10 Different Predictive Models

Real-Time API / Workflow

????

????

OPERATIONS: Who is responsible for …

Checking that the newly trained model performs as expected

Checking that the product catalog and the website tags remain consistent

Checking that the Hadoop cluster scales as expected and has enough bandwidth to handle the workload

Testing the performance of the real-time API

Monitoring the performance of the model and deciding to roll back / maintain / roll out
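
The first and last checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something from the talk: the `decide` function, the `Decision` enum, the `tolerance` value and the idea of comparing a single "higher is better" score (for example an AUC on a holdout set) are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ROLLBACK = "rollback"
    MAINTAIN = "maintain"
    ROLLOUT = "rollout"


def decide(live_score: float, candidate_score: float,
           baseline_score: float, tolerance: float = 0.01) -> Decision:
    """Pick an action from three scores where higher is better.

    All names and the tolerance are assumptions made for this sketch.
    """
    # The live model has fallen below the agreed baseline:
    # go back to the previous version.
    if live_score < baseline_score - tolerance:
        return Decision.ROLLBACK
    # The newly trained model is clearly better than the live one:
    # promote it.
    if candidate_score > live_score + tolerance:
        return Decision.ROLLOUT
    # Otherwise keep what is running.
    return Decision.MAINTAIN
```

Checking for degradation before looking at the candidate is a design choice: it puts a safe state first. A team could just as reasonably try the candidate first when the live model degrades.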

DATA OPS As a Philosophy

X OPS PHILOSOPHY

Highly consensual

Highly controversial

Create an API culture

Do not share

• Random piece of code

• Flat file

• Email

Do share

• Reproducible documented workflows

• Clean, documented APIs

Defensive Data Programming

• Software has errors.

• You are not your software, yet you are responsible for the errors.

• You can never remove the errors, only reduce their probability.

Defensive Data Programming

• Handle the case when one of the input files is empty

• Handle the case when a new value appears

• Handle the case when two columns become completely correlated

• Handle the case when a column is 16k long

• Etc.
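
The four cases on this slide can be turned into explicit checks that run before a workflow step. The sketch below uses only the standard library and works on rows held as a list of dicts. The function names, the `0.999` correlation threshold and the reading of "16k long" as a cell longer than 16,000 characters are assumptions made for illustration.

```python
import math

MAX_CELL_LENGTH = 16_000  # hypothetical limit, echoing the "16k long" case


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def check_rows(rows, known_values, numeric_pairs):
    """Return a list of human-readable problems found in `rows`."""
    if not rows:
        return ["input is empty"]
    problems = []
    # A category that was never seen before shows up.
    for column, allowed in known_values.items():
        unseen = {row[column] for row in rows} - allowed
        if unseen:
            problems.append(f"new values in {column}: {sorted(unseen)}")
    # Two numeric columns carry the same information.
    for a, b in numeric_pairs:
        r = pearson([row[a] for row in rows], [row[b] for row in rows])
        if abs(r) > 0.999:
            problems.append(f"{a} and {b} are almost perfectly correlated")
    # A text cell is unexpectedly long.
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and len(value) > MAX_CELL_LENGTH:
                problems.append(f"{column} holds a value longer than "
                                f"{MAX_CELL_LENGTH} characters")
    return problems
```

Returning a list of problems instead of raising on the first one lets the caller decide whether to stop the workflow, skip the batch or only send an alert.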

Monitoring : the alerts for people who love it

• Performance …

• Time Spent …

• Number of Errors …
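
A small sketch of how time spent and number of errors could be collected per workflow step. The `monitored` decorator, the in-memory `METRICS` dictionary and the `parse_speed` step are hypothetical names for illustration. A real setup would presumably send these numbers to a monitoring system instead of keeping them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory counters, one entry per step name (an assumption for this sketch).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "seconds": 0.0})


def monitored(step_name):
    """Record call count, error count and time spent for a pipeline step."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            stats = METRICS[step_name]
            stats["calls"] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then let the caller handle it.
                stats["errors"] += 1
                raise
            finally:
                # Time is recorded for successful and failed calls alike.
                stats["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@monitored("parse_speed")
def parse_speed(raw):
    """Hypothetical step: turn a raw sensor reading into a float."""
    return float(raw)
```

After a few calls, `METRICS["parse_speed"]` holds the totals, and an alert can be derived from them, for example when the ratio of errors to calls crosses a threshold the team has agreed on.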

Monitoring : Business Informal Monitoring

• % Opening

• Market Spent

• Exception User Events …

Resource Allocation

I’ve got this strange error “OutOfMemory”. Do you know what it is?

Why is the Hadoop Cluster going slower than my laptop ?

The Philosophy of pre-allocating more resources than necessary

Get to the latest package culture …

Data Scientist

I need the latest version of scikit and NetworkX ….

And could you repackage that to enable TensorFlow optimizations?

System Administrator

…..

The culture of containers

Developers’ Sandbox

DATA OPS As a Job Title

Job Title : a matter of name, $$ and social ladder

Data scientist Data Ops

Developer

Statistician

Full Stack Developer

Sys Admin

DevOps

Job Role : A matter of Do or Don’t

DO: Things you really want to do

DON’T: Things you really don’t want to get into

FIGHT THE TOY PLATFORM ANTI-PATTERN

Test and Invest in Infrastructure == Skilled People

or

Go For Cloud / Packaged Infrastructure

Your brand new Hadoop cluster is perceived as slow, not so used and not reliable

FIGHT THE TECHNO MISMATCH ANTI-PATTERN

Assume Being Polyglot

or

Be a Dictator

The Python Clan VS The R Tribe

The Old Elephant Fraternity VS The New Elephant Club

GETTING DATA POLITICS

> DATA NOT AVAILABLE

GETTING DATA POLITICS

THE FOX

Hunt for Big Problem!

Convince the CEO that you can solve a business-critical problem, and use it as an excuse to get all the data you want!

THE SPIDER

Create Network!

Create a set of trackers or addictive data collection internally to get data on your side!

PREDICTIVE ANALYTICS DEPLOYMENT STRATEGY

Website: 2000’s winners

Companies that were able to release fast

“Artificial Intelligence with Data for Internet of Things”: 2010’s winners

Companies able to put intelligence in production

?

Design a way to put “PREDICTIVE MODELS” IN PRODUCTION

OWN ANONYMISATION / PRIVACY / DATA SECURITY WITH PARTNERS ISSUES

Technical Feasibility ? What can or cannot be done ?

Let’s Wrap IT Up !

A Company Building a GPS-powered automated car fine system

10 TB of Data Every Month

Hive / Spark / Python

10 Different Predictive Models

Real-Time API / Workflow

Robust Workflow With Data Quality Checks

Functional Monitoring By Business People through Slack and Dashboards

Monitoring for the API

Feature Engineering Pipeline in Python

But you, where do you stand?

???? ???? ???? ?????

What's your roll-back strategy like?

What kind of multi-variate testing or strategies do you have in place for predictive models?

How do you manage the robustness of your data flow production scripts?

How can business people monitor the performance of the application?

http://bit.ly/production-survey

Food for thought

www.dataiku.com/blog

THANK YOU!

http://bit.ly/production-survey
