Data Science Perspective and DS demo

45
BUILT FOR THE SPEED OF BUSINESS

Upload: pivotalopensourcehub

Post on 15-Feb-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Data Science Perspective and DS demo

BUILT FOR THE SPEED OF BUSINESS


Page 2: Data Science Perspective and DS demo

2 © Copyright 2014 Pivotal. All rights reserved.

Our everyday devices are smart and talk to us

Page 3: Data Science Perspective and DS demo


Our everyday devices are smart and talk to us

Page 4: Data Science Perspective and DS demo


Connected devices take action to make daily life easier.

But what else?

Page 5: Data Science Perspective and DS demo

How can IoT help prevent accidents like the Macondo disaster?

Page 6: Data Science Perspective and DS demo

Making sense of your “big data”

•  Large volumes of data may be difficult to understand
    –  ~100 tables
    –  Tens of thousands of columns

Page 7: Data Science Perspective and DS demo

Making sense of your “big data”

•  Large volumes of data may be difficult to understand
    –  ~100 tables
    –  Tens of thousands of columns
•  How do you build models that use all the data? Score all the data?

Page 8: Data Science Perspective and DS demo

Making sense of your “big data”

•  Large volumes of data may be difficult to understand
    –  ~100 tables
    –  Tens of thousands of columns
•  How do you build models that use all the data? Score all the data?
•  Where do you focus your effort?
    –  Getting a rapid grasp of relevant fields is important
    –  Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
    –  Watch for columns with little or no variation or only null values
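The point about uninformative columns can be illustrated with a quick column profile. This is a minimal sketch only: it assumes rows arrive as Python dictionaries, and the helper name `profile_columns` and the sample columns are hypothetical, not taken from the talk. It flags columns that are entirely null or that hold a single distinct value.

```python
from collections import defaultdict


def profile_columns(rows):
    """Return {column: 'all_null' | 'constant' | 'varies'} for a list of dict rows."""
    distinct = defaultdict(set)   # non-null values seen per column
    columns = set()
    for row in rows:
        for name, value in row.items():
            columns.add(name)
            if value is not None:
                distinct[name].add(value)

    report = {}
    for name in sorted(columns):
        n_values = len(distinct[name])
        if n_values == 0:
            report[name] = "all_null"
        elif n_values == 1:
            report[name] = "constant"
        else:
            report[name] = "varies"
    return report


# Hypothetical sample: three readings with one dead column and one constant column.
sample_rows = [
    {"rpm": 120.0, "unit": "rig-7", "spare": None},
    {"rpm": 118.5, "unit": "rig-7", "spare": None},
    {"rpm": 121.2, "unit": "rig-7", "spare": None},
]
column_report = profile_columns(sample_rows)
```

In a database the same check would more likely be a per-column distinct count, so the data never has to leave the system.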

Page 9: Data Science Perspective and DS demo


What Can “Small Data” Scientists Bring on Their “Big Data” Journey?

http://factspy.net/the-difference-between-geeks-vs-nerds/

Page 10: Data Science Perspective and DS demo


What Can “Small Data” Scientists Bring on Their “Big Data” Journey?

Flat files

Distributed computing

HDFS

In-memory model building

Cloud computing

MapReduce

Command-line tools

Databases

Command-line tools

[Diagram: the items above placed along a spectrum from Small Data to Big Data]

Many tools and approaches are being adapted to big data technologies

Page 11: Data Science Perspective and DS demo

Data: The New Oil

[Image: Drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS]

•  Oil & gas generates large amounts of data from sensors, enabling data-driven approaches to improve operations

Predictive maintenance
•  Motivation: Failure costs estimated at $150,000/incident (billions annually)*
•  Goals
    –  Early warning system
    –  Insights into prominent features impacting operation and failure
    –  Reduction of non-productive drill time
    –  Reduced incidents

*http://blog.pivotal.io/pivotal/case-studies-2/data-as-the-new-oil-producing-value-for-the-oil-gas-industry

Page 12: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

•  A failure occurred at the end of this run

[Figure: sensor traces for the run; panels: Bit position, RPM, ROP, WOB]

Page 13: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

•  A failure occurred at the end of this run
•  Taking a window of time prior to failure, what features should we extract (e.g. variance of RPM, max bit position velocity)?

[Figure: sensor trace; panel: RPM]
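The two example features named on the slide can be sketched as follows. This is an illustration under stated assumptions, not the speakers' code: readings are assumed to be `(t_seconds, rpm, bit_position)` tuples sorted by time, velocity is approximated by finite differences, and the function name `window_features` is hypothetical.

```python
from statistics import pvariance


def window_features(readings, failure_time, window_seconds):
    """Summarize the readings in [failure_time - window_seconds, failure_time).

    readings: list of (t_seconds, rpm, bit_position) tuples sorted by time.
    """
    start = failure_time - window_seconds
    window = [r for r in readings if start <= r[0] < failure_time]
    if len(window) < 2:
        return None  # not enough points to compute a variance or a velocity

    rpm_values = [rpm for _, rpm, _ in window]
    # Finite-difference velocity between consecutive bit-position samples.
    velocities = [
        (p1 - p0) / (t1 - t0)
        for (t0, _, p0), (t1, _, p1) in zip(window, window[1:])
        if t1 > t0
    ]
    return {
        "rpm_variance": pvariance(rpm_values),
        "max_bit_velocity": max(velocities) if velocities else None,
    }


# Hypothetical run: one reading every 10 s, failure at t = 50 s.
run = [
    (0, 100.0, 0.0),
    (10, 100.0, 1.0),
    (20, 110.0, 2.0),
    (30, 90.0, 4.0),
    (40, 100.0, 5.0),
]
features = window_features(run, failure_time=50, window_seconds=30)
```

Each run then contributes one row of features, which is the shape a modeling step would expect.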

Page 14: Data Science Perspective and DS demo

How can noisy data create meaningful models?

[Figure: two plots; axis labels: WOB and Time; ROP and WOB]

Page 15: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing

[Figure: WOB vs. Time]

Page 16: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing

[Figure: scatter plots of raw WOB readings over time (df$wob against df$ts_utc); x-axis ticks 00:00 through 50:00, y-axis ticks 10, 15, 20]

Page 17: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing

[Figure: noisy sensor scatter plots; panels: Temperature vs. Time and WOB vs. Time (df$wob against df$ts_utc); x-axis ticks 00:00 through 50:00, y-axis ticks 10, 15, 20]

Page 18: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing
•  Window functions in SQL allow us to perform smoothing seamlessly, at scale

A cleansing approach: use average across a window

[Figure: noisy sensor scatter plots; panels: Temperature vs. Time and WOB vs. Time (df$wob against df$ts_utc); x-axis ticks 00:00 through 50:00, y-axis ticks 10, 15, 20]
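The windowed average mentioned on this slide can be sketched outside the database. The deck performs the smoothing with SQL window functions; the Python below is a stand-in that mirrors a centered frame (roughly `AVG(...) OVER (ORDER BY ts ROWS BETWEEN k PRECEDING AND k FOLLOWING)`), and the function name and sample readings are assumptions made for illustration.

```python
def moving_average(values, half_width):
    """Centered moving average; the window shrinks at the edges of the series."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical noisy WOB readings with one spike.
wob = [10.0, 12.0, 11.0, 30.0, 11.0, 12.0, 10.0]
wob_smooth = moving_average(wob, half_width=1)
```

A wider window removes more noise but also flattens genuine short-lived events, so the width is a modeling choice rather than a fixed setting.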

Page 19: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing
•  Window functions in SQL allow us to perform smoothing seamlessly, at scale

[Figure: panels: Temperature vs. Time and WOB vs. Time]

Page 20: Data Science Perspective and DS demo

How can noisy data create meaningful models?

•  Deriving signal from noisy sensor data requires data cleansing
•  Window functions in SQL allow us to perform smoothing seamlessly, at scale
•  Test many hypotheses in parallel to examine if features have an effect on potency

[Figure: panels: Temperature vs. Time and WOB vs. Time]

Page 21: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

Predict occurrence of equipment failure in a chosen future time window

Predict remaining life of equipment

Predict Rate-of-Penetration

Page 22: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

Predict occurrence of equipment failure in a chosen future time window
•  Logistic Regression
•  Elastic Net Regularized Regression (Binomial)
•  Support Vector Machines

Predict remaining life of equipment

Predict Rate-of-Penetration

Page 23: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

Predict occurrence of equipment failure in a chosen future time window
•  Logistic Regression
•  Elastic Net Regularized Regression (Binomial)
•  Support Vector Machines

Predict remaining life of equipment
•  Cox Proportional Hazards Regression

Predict Rate-of-Penetration

Page 24: Data Science Perspective and DS demo

How are models built using sensor data?

Integrating & Cleansing
Feature Building
Modeling

Predict occurrence of equipment failure in a chosen future time window
•  Logistic Regression
•  Elastic Net Regularized Regression (Binomial)
•  Support Vector Machines

Predict remaining life of equipment
•  Cox Proportional Hazards Regression

Predict Rate-of-Penetration
•  Linear Regression
•  Elastic Net Regularized Regression (Gaussian)
•  Support Vector Machines
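The first target, failure within a chosen future time window, implies a binary label for the classifiers listed. A minimal sketch of building that label, assuming observation and failure times are plain numbers on the same clock; the function name `label_failure_ahead` and the timeline are hypothetical:

```python
def label_failure_ahead(observation_times, failure_times, horizon):
    """Return 1 for each observation followed by a failure within `horizon`, else 0."""
    labels = []
    for t in observation_times:
        hit = any(t < f <= t + horizon for f in failure_times)
        labels.append(1 if hit else 0)
    return labels


# Hypothetical timeline in hours: observations every 2 h, one failure at hour 9.
obs = [0, 2, 4, 6, 8, 10]
labels = label_failure_ahead(obs, failure_times=[9], horizon=4)
```

The horizon trades lead time against label noise: a longer window gives earlier warning but marks more normal-looking observations as positive.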

Page 25: Data Science Perspective and DS demo

Calling MADlib Functions: Fast Training, Scoring

•  MADlib allows users to easily create models without moving data out of the systems
    –  Model generation
    –  Model validation
    –  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open-source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(       -- MADlib model function
    'houses',                      -- table containing training data
    'houses_linregr',              -- table in which to save results
    'price',                       -- column containing dependent variable
    'ARRAY[1, tax, bath, size]'    -- features included in the model
);

Page 26: Data Science Perspective and DS demo

Calling MADlib Functions: Fast Training, Scoring

•  MADlib allows users to easily create models without moving data out of the systems
    –  Model generation
    –  Model validation
    –  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open-source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(       -- MADlib model function
    'houses',                      -- table containing training data
    'houses_linregr',              -- table in which to save results
    'price',                       -- column containing dependent variable
    'ARRAY[1, tax, bath, size]',   -- features included in the model
    'bedroom'                      -- create multiple output models (one for each value of bedroom)
);
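What the grouping argument does can be sketched conceptually: split the rows by the grouping column and fit one model per group. The example below is a simplification and not MADlib's implementation. It fits a single-feature least-squares line (price on size) per bedroom count, and the rows, values, and function names are invented for illustration.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature).

    Assumes at least two distinct x values; otherwise the slope is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_group(rows, group_key, x_key, y_key):
    """Fit one line per distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        key: fit_line([r[x_key] for r in members], [r[y_key] for r in members])
        for key, members in groups.items()
    }


# Hypothetical rows; the values are made up for illustration.
houses = [
    {"bedroom": 2, "size": 800, "price": 90000},
    {"bedroom": 2, "size": 1000, "price": 110000},
    {"bedroom": 3, "size": 1200, "price": 150000},
    {"bedroom": 3, "size": 1600, "price": 210000},
]
models = fit_by_group(houses, "bedroom", "size", "price")
```

Because the groups are independent, a parallel database can fit them on different segments at the same time.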

Page 27: Data Science Perspective and DS demo

Calling MADlib Functions: Fast Training, Scoring

•  MADlib allows users to easily create models without moving data out of the systems
    –  Model generation
    –  Model validation
    –  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open-source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]'
);

SELECT houses.*,
       madlib.linregr_predict(     -- MADlib model scoring function
           ARRAY[1, tax, bath, size],
           m.coef
       ) AS predict
FROM houses,                       -- table with data to be scored
     houses_linregr m;             -- table containing model
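Scoring a linear model amounts to a dot product between the stored coefficients and the feature array, with the leading 1 picking up the intercept. A small sketch of that arithmetic follows; the coefficient values are placeholders rather than fitted results, and `linear_predict` is a hypothetical helper, not a MADlib function.

```python
def linear_predict(coef, features):
    """Dot product of coefficients and features; features[0] is 1 for the intercept."""
    if len(coef) != len(features):
        raise ValueError("coef and features must have the same length")
    return sum(c * x for c, x in zip(coef, features))


# Hypothetical coefficients for [intercept, tax, bath, size]; not fitted values.
coef = [10000.0, 5.0, 2000.0, 50.0]
row = {"tax": 600, "bath": 2, "size": 1000}
predicted_price = linear_predict(coef, [1, row["tax"], row["bath"], row["size"]])
```

The feature order used for scoring has to match the order used in training, which is why both calls on the slide pass the same array.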

Page 28: Data Science Perspective and DS demo

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

•  The interpreter/VM of the language ‘X’ is installed on each node of the HAWQ cluster
•  Data parallelism:
    –  PL/X piggybacks on HAWQ’s MPP architecture
•  Allows users to write HAWQ/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages

[Diagram: HAWQ cluster with SQL, a Master Host and Standby Master, the Interconnect, and four Segment Hosts with two Segments each]

Page 29: Data Science Perspective and DS demo

Parallelized R in Pivotal via PL/R: An Example

SQL & R

•  Parsimonious – R piggy-backs on Pivotal’s parallel architecture
•  Minimize data movement
•  Build predictive model for each state in parallel

[Diagram: per-state data mapped to per-state models]
Data:   TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA
Models: TN, CA, NY, PA, TX, CT, NJ, IL, MA, WA

Page 30: Data Science Perspective and DS demo

Parallelized R in Pivotal via PL/R: An Example

•  With placeholders in SQL, write functions in the native R language
•  Accessible, powerful modeling framework

Page 31: Data Science Perspective and DS demo

Parallelized R in Pivotal via PL/R: An Example

•  Execute PL/R function
•  Plain and simple table is returned

Page 32: Data Science Perspective and DS demo

Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees

Each tree makes a prediction

Aggregate and obtain final prediction
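The two steps on this slide, per-tree prediction followed by aggregation, can be sketched in a few lines. The deck runs this in parallel through PL/R; the Python below is a sequential stand-in under simplifying assumptions: each "tree" is a depth-one stump on a single feature, and the data and names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold on x that best separates labels 0 and 1 in the sample."""
    best_threshold, best_correct = None, -1
    for threshold, _ in sample:
        correct = sum((x > threshold) == bool(y) for x, y in sample)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagged_predict(data, x_new, n_trees=25, seed=0):
    """Train one stump per bootstrap sample, then take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]    # sample with replacement
        threshold = fit_stump(sample)
        votes.append(1 if x_new > threshold else 0)  # each tree makes a prediction
    return Counter(votes).most_common(1)[0][0]       # aggregate the votes


# Hypothetical training pairs (x, label).
training = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
high = bagged_predict(training, 10)
low = bagged_predict(training, 0)
```

Since every bootstrap sample is trained independently, the loop body is the part that a parallel database would spread across segments.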

Page 33: Data Science Perspective and DS demo


Genome Wide Association Study

Page 34: Data Science Perspective and DS demo

Genome Wide Association Study

SNP1

Page 35: Data Science Perspective and DS demo

Genome Wide Association Study

SNP1

Page 36: Data Science Perspective and DS demo

Genome Wide Association Study

SNP2

Page 37: Data Science Perspective and DS demo

Genome Wide Association Study

SNP3

Page 38: Data Science Perspective and DS demo

In-database genome-wide association study

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect; analysis expressed in SQL & R]

COVARIATES
Indiv   Covariates
        1     2     …     10
1       F     23    …     18
2       M     39    …     41
3       M     50    …     23
…
N       F     19    …     24

SNP     1     2     …     M
        AA    CC    …     TT
        AT    CG    …     TT
        AA    GG    …     TC
        …
        TT    CG    …     TC

Page 39: Data Science Perspective and DS demo

In-database genome-wide association study

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect; segments labeled SNP1, SNP2, …, SNPM; analysis expressed in SQL & R]

COVARIATES
Indiv   Covariates
        1     2     …     10
1       F     23    …     18
2       M     39    …     41
3       M     50    …     23
…
N       F     19    …     24

GENOTYPES
Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

Page 40: Data Science Perspective and DS demo

In-database genome-wide association study

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect; segments labeled SNP1, SNP2, …, SNPM, producing Pval1, Pval2, …, PvalM; analysis expressed in SQL & R]

COVARIATES
Indiv   Covariates
        1     2     …     10
1       F     23    …     18
2       M     39    …     41
3       M     50    …     23
…
N       F     19    …     24

GENOTYPES
Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

Page 41: Data Science Perspective and DS demo

In-database genome-wide association study

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect; segments labeled SNP1, SNP2, …, SNPM, producing Pval1, Pval2, …, PvalM; analysis expressed in SQL & R]

COVARIATES
Indiv   Covariates
        1     2     …     10
1       F     23    …     18
2       M     39    …     41
3       M     50    …     23
…
N       F     19    …     24

GENOTYPES
Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

RESULTS
SNP   P-value
1     2.34 x 10^-21
2     0.395
3     7.15 x 10^-17
…
M     0.000142

Page 42: Data Science Perspective and DS demo

In-database genome-wide association study

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect; segments labeled SNP1, SNP2, …, SNPM, producing Pval1, Pval2, …, PvalM; analysis expressed in SQL & R]

COVARIATES
Indiv   Covariates
        1     2     …     10
1       F     23    …     18
2       M     39    …     41
3       M     50    …     23
…
N       F     19    …     24

GENOTYPES
Indiv   SNP   Geno
1       1     AA
2       1     AT
3       1     AA
1       2     CC
2       2     CG
3       2     GG
…
N       M     TC

RESULTS
SNP   P-value
1     2.34 x 10^-21
2     0.395
3     7.15 x 10^-17
…
M     0.000142

•  In-database computation of ~1 million mutations for thousands of individuals occurs rapidly and in parallel
•  Results are easily manipulated and explored
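The genotype data appears in two layouts across these slides: one column per SNP, then one row per (individual, SNP) pair. The long layout is what lets work be split by SNP. A minimal sketch of that reshaping and grouping, using the first three individuals and two SNPs from the slide; the function names are hypothetical, and the per-SNP statistical test is left out because the deck does not specify it.

```python
from collections import defaultdict


def wide_to_long(wide_rows):
    """Turn {indiv: [geno_snp1, geno_snp2, ...]} into (indiv, snp, geno) rows."""
    long_rows = []
    for indiv, genotypes in wide_rows.items():
        for snp_index, geno in enumerate(genotypes, start=1):
            long_rows.append((indiv, snp_index, geno))
    return long_rows


def group_by_snp(long_rows):
    """Collect each SNP's genotypes so every SNP can be tested independently."""
    by_snp = defaultdict(dict)
    for indiv, snp, geno in long_rows:
        by_snp[snp][indiv] = geno
    return dict(by_snp)


# The first three individuals and first two SNPs shown on the slide.
wide = {1: ["AA", "CC"], 2: ["AT", "CG"], 3: ["AA", "GG"]}
long_rows = wide_to_long(wide)
per_snp = group_by_snp(long_rows)
```

Each entry of `per_snp` would be joined to the covariates and passed to one test, yielding one p-value per SNP as in the RESULTS table.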

Page 43: Data Science Perspective and DS demo

Visualize and analyze genomics data without movement

Generate relevant plots using tools like Tableau immediately after parallel statistical analysis in-database on Pivotal technology

Page 44: Data Science Perspective and DS demo


http://blog.pivotal.io/data-science-pivotal

Check out the Pivotal Data Science Blog!

Page 45: Data Science Perspective and DS demo

BUILT FOR THE SPEED OF BUSINESS