Cloud Architectures for Data Science

@MargrietGr Margriet Groenendijk, PhD Developer Advocate for IBM Cloud Data Services O’Reilly Software Architecture Conference San Francisco 16 November 2016 Cloud Architectures for Data Science

Upload: margriet-groenendijk

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

@MargrietGr

Margriet Groenendijk, PhD
Developer Advocate for IBM Cloud Data Services

O’Reilly Software Architecture Conference
San Francisco
16 November 2016

Cloud Architectures for Data Science

About me
• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global-scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests

A Brief History of Data Science

• Computer Science
• Data Technology
• Visualization
• Mathematics
• Statistics

http://www.datascienceassn.org/content/history-data-science

1781

http://visual.ly/exports-and-imports-scotland

1821

https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png

1855

http://visual.ly/diagram-causes-mortality-army-east

1960s

http://www.computerhistory.org/collections/catalog/102630767

1960s

http://www.climatecentral.org/news/first-climate-model-video-19007

2016

https://blog.rjmetrics.com/2015/10/05/how-many-data-scientists-are-there/

How many Data Scientists are there?

https://whatsthebigdata.com/2015/11/08/top-skills-and-backgrounds-of-data-scientists-on-linkedin/

Toolbox

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

Data Engineers

Data Scientists

Business Analysts

App Developers

Data Science is a Team Effort

Data

Data Science Workflow

Discover Data

Use Data

Publish Data

Socialize Data

Data Science Workflow

Data Science Workflow

Define Question

Find Data

Explore Data

Clean Data

Visualize and Summarize Data

Create Predictive Models

Present Results

Collect Data

APIs

Open Data

Maps

Web Scraping

Time Series

Store Data

Object Store - binary files

Relational database

Document store - json
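
The three kinds of store hold differently shaped data. As a rough local sketch of the distinction, the snippet below uses Python's built-in sqlite3 and json modules as stand-ins for the cloud services; the table name, field names and object key are made up for illustration.

```python
import json
import sqlite3

# Hypothetical observation used for all three representations.
observation = {"city": "San Francisco", "temp_c": 17.5, "tags": ["coast", "fog"]}

# Relational database: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
conn.execute(
    "INSERT INTO weather VALUES (?, ?)",
    (observation["city"], observation["temp_c"]),
)
row = conn.execute("SELECT city, temp_c FROM weather").fetchone()

# Document store: the whole record, nested fields included, as one JSON document.
document = json.dumps(observation)

# Object store: opaque bytes addressed by a key (a plain dict stands in here).
object_store = {"weather/sf.bin": document.encode("utf-8")}
```

Note that the nested `tags` list fits naturally in the JSON document but has no column in the relational table, which is the kind of trade-off that drives the choice of store.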

Explore Data

Explore Data

Clean Data

Store Data

Spark on a Cluster

The Spark Stack

from Karau et al.: Learning Spark

RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
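
The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is a pure-Python toy, not the Spark API: `ToyRDD` and its methods are hypothetical names that only mimic the behaviour described above (partitioned data, transformations that record work, actions that run it).

```python
class ToyRDD:
    """Pure-Python stand-in for an RDD: partitioned data plus a lazy pipeline."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions   # list of lists, one per partition
        self.steps = steps             # pending transformations, not yet run

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        # Distribute a collection of objects round-robin over the partitions.
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.steps + (("filter", fn),))

    def _run(self, partition):
        # Apply every recorded step to one partition.
        for kind, fn in self.steps:
            if kind == "map":
                partition = [fn(x) for x in partition]
            else:
                partition = [x for x in partition if fn(x)]
        return partition

    # Actions: trigger the computation, partition by partition.
    def collect(self):
        return [x for p in self.partitions for x in self._run(p)]

    def count(self):
        return sum(len(self._run(p)) for p in self.partitions)


calls = []                                   # records when the map function runs
rdd = ToyRDD.parallelize(range(6))
squares = rdd.map(lambda x: calls.append(x) or x * x)
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)            # still empty: nothing has run yet
result = sorted(evens.collect())             # the action executes the pipeline
```

In real PySpark the corresponding calls are `sc.parallelize(...)`, `.map(...)`, `.filter(...)` and `.collect()`, with the partitions spread over the machines of the cluster.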

Run Spark locally in a Python notebook

https://www.continuum.io/downloads

http://spark.apache.org/downloads.html

Create a new kernel to use in a Jupyter notebook
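
One possible way to create such a kernel is to write a Jupyter kernel spec (a `kernel.json` file) whose environment points at the Spark download. The sketch below only builds that file's contents; the function name, display name and both paths are placeholders, and the py4j archive under Spark's `python/lib` folder has a version-specific file name that is not filled in here.

```python
import json
import os


def spark_kernel_spec(spark_home, py4j_zip, master="local[*]"):
    """Build a kernel.json dict that exposes a local Spark to a Python kernel.

    spark_home and py4j_zip are placeholders: point them at the unpacked
    Spark download and at the py4j archive found under its python/lib folder.
    """
    spark_python = os.path.join(spark_home, "python")
    return {
        "display_name": "PySpark (local)",
        "language": "python",
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            # Make the pyspark package and py4j importable inside the kernel.
            "PYTHONPATH": os.pathsep.join([spark_python, py4j_zip]),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


spec = spark_kernel_spec("/path/to/spark", "/path/to/spark/python/lib/py4j-src.zip")
kernel_json = json.dumps(spec, indent=2)
```

Saving `kernel_json` as `kernel.json` in a folder and registering that folder with `jupyter kernelspec install <folder> --user` should make the kernel selectable in the notebook, where a SparkContext can then be created from `pyspark`.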

Jupyter Notebooks!

• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark

In the Cloud:

http://datascience.ibm.com/

Weather Data

Define Question

What will the weather be next weekend?

https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A

Find Data

https://console.ng.bluemix.net/

Explore Data

Python packages
• requests and json
  • API credentials and latitude/longitude of San Francisco
  • JSON data returned
• pandas, numpy and datetime
  • Convert JSON to a pandas DataFrame (table with multiple indices)
  • Add time as index
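
The parsing step above can be sketched with the standard library alone. The HTTP call is left out, and the payload below is made up for illustration: the field names `forecasts`, `time`, `temp` and `rh` are assumptions, not the weather API's actual schema.

```python
import json
from datetime import datetime, timezone

# Made-up payload for illustration; the real API's field names may differ.
raw = """{"forecasts": [
  {"time": 1479340800, "temp": 15, "rh": 72},
  {"time": 1479344400, "temp": 14, "rh": 75}
]}"""

payload = json.loads(raw)

# One row per forecast step, keyed by a timezone-aware timestamp.
table = {}
for step in payload["forecasts"]:
    when = datetime.fromtimestamp(step["time"], tz=timezone.utc)
    table[when] = {"temp": step["temp"], "rh": step["rh"]}

first_time = min(table)
```

With pandas available, `pandas.DataFrame(payload["forecasts"])` followed by `pandas.to_datetime(..., unit="s")` on the time column and `set_index` would give the time-indexed DataFrame the slide refers to.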

Weather forecast for San Francisco
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Visualize Data

Python packages
• pandas - rolling mean
• matplotlib
• Basemap
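
A rolling mean smooths a noisy series by averaging each value with its predecessors. In pandas this is `Series.rolling(window).mean()`; the sketch below shows what that computes in plain Python, using a hypothetical helper `rolling_mean` and made-up temperatures.

```python
def rolling_mean(values, window):
    """Mean over each run of `window` consecutive values.

    Positions that do not yet have a full window are None, mirroring the
    NaN that pandas puts there by default.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


temps = [12.0, 14.0, 13.0, 17.0, 15.0]   # made-up hourly temperatures
smoothed = rolling_mean(temps, window=3)
```

A wider window gives a smoother curve at the cost of more empty positions at the start of the series.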

Weather map - example for UK

https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/

Python packages
• matplotlib
• Basemap
• itertools
• urllib

Run as a daily cron job

Cloudant

Weather, Twitter and Sentiment

Weather, Twitter and Sentiment

• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore

Insights for Twitter

Add sentiment - example

• Watson Tone Analyzer

Emotion

Language style

Social propensities

Analyze how you are coming across to others

Simpler Workflow

Weather Company Data

crontab -e

0 23 * * * /path/to/file/do_something.sh

python do_something.py
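
The crontab entry above has five schedule fields followed by the command: minute 0, hour 23, and wildcards for day of month, month and day of week, so the script runs once a day at 23:00. As a small sketch, a hypothetical helper `parse_crontab_line` can split the entry into named fields:

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_crontab_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)
    if len(parts) != 6:
        raise ValueError("expected five schedule fields and a command")
    schedule = dict(zip(FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = parse_crontab_line("0 23 * * * /path/to/file/do_something.sh")

# Wildcards in the last three fields mean "every day".
runs_daily = all(
    schedule[f] == "*" for f in ("day_of_month", "month", "day_of_week")
)
```

The shell script then presumably wraps the `python do_something.py` call, which is where the data collection and the write to the database would happen.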

Tweets

Weather

Sentiment

Watson Tone Analyzer

Insights for Twitter

Cloudant NoSQL

PixieDust

https://github.com/ibm-cds-labs/pixiedust

Simpler Workflow

PixieDust: an Open Source Library that simplifies and improves Jupyter Python Notebooks
• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps

https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/

@DTAIEB55

Install Spark packages or plain jars in your Notebook Python kernel without needing to modify a configuration file

Uses the GraphFrame Python APIs

Install GraphFrames Spark Package

One simple API: display()

Call the Options dialog

Panning/Zooming options

Performance statistics

Easily export your data to csv, json, html, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage

Scala Bridge

Define a Python variable

Use the Python var in Scala

Define a Scala variable

Use the Scala var in Python

Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript

Customized Visualization for GraphFrame Graphs

Encapsulate your analytics into compelling User Interfaces better suited for Line of Business Users

https://github.com/ibm-cds-labs/ibmseti/

SETI

• Mission: To explore, understand and explain the origin and nature of life in the universe

• Origins: Started in 1959 by two physicists at Cornell

• NASA became interested in 1970, started working with SETI in 1988, funding cut in 1993

SETI@IBMCloud

http://www.seti.org/node/861

• The Allen Telescope Array
  • 198 million radio events detected in the last decade
  • 400,000 candidate signals identified
  • 5 TB of data generated in 10 hours
• No modern analysis or machine learning has been performed on this data
• 5 TB of special observations on IBM Object Store
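
To put the figures above in perspective, a quick back-of-the-envelope calculation gives the sustained data rate and the share of events that became candidate signals. It assumes decimal terabytes (10^12 bytes), which the slide does not specify.

```python
TB = 10 ** 12  # decimal terabyte in bytes (assumption; the slide does not say)

bytes_generated = 5 * TB
seconds = 10 * 60 * 60                       # 10 hours
rate_mb_per_s = bytes_generated / seconds / 10 ** 6

# Share of detected radio events that were identified as candidate signals.
candidate_fraction = 400_000 / 198_000_000
```

Under that assumption the array produces roughly 139 MB every second, and about 0.2% of detected events end up as candidate signals.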

SETI@IBMCloud - the Data

https://github.com/ibm-cds-labs/ibmseti/

Public Spark@SETI

4 TB of SETI Data stored in Object Storage

Web API provides Bluemix users access to download SETI data

Public Spark@SETI Bluemix Account: Object Storage, Web API

My Bluemix Account: Spark, Object Storage

Spark using Jupyter Notebook and IBM SETI Python Library

Goal: Amateur scientists/data scientists download and analyze SETI data

IBM Watson Data Platform
• Data Science Experience
• Watson Data Platform
• Machine Learning
• Sign up for beta: http://datascience.ibm.com/features#machinelearning

Data Science in the Cloud
• Flexible and quick to iterate, play and explore data
• APIs
  • Streaming data
  • Cloud databases
  • Watson
• Scaling up - add storage or Spark kernels
• Easy collaboration and presentation
  • Store Data
  • Share your analyses in notebooks
• Some useful packages: pandas, pyspark, requests, matplotlib, cloudant
• Notebooks can be extended! PixieDust

https://developer.ibm.com/clouddataservices/author/mgroenen/

Thanks!

Slides will be here: http://www.slideshare.net/MargrietGroenendijk