DON’T PANIC! Demystifying Big Data, Data Science and all that Dr Rob Baxter, Group Manager [email protected] EPCC-public. © 2018 The University of Edinburgh, licensed CC-BY 4.0


Don’t panic!

• EPCC

• Big data: science, engineering, management

• Manage data: RDM lifecycle & DMP

• Organise data: files, databases, objects

• Analyse data: methods and tools

• Big data at EPCC

The novel computing centre of the University of Edinburgh

• Self-sustained:

• 25+ years

• 100 staff

• £5M turnover, externally generated

• Multi-disciplinary and multi-funded:

• large spectrum of activities

• critical mass of expertise

• distributed systems

• high-performance computing

• data engineering

• research software engineering

• Support research through:

• access to facilities & services

• training courses

• collaborative projects, contract research


National & regional services

• ARCHER, the UK Tier 1 HPC service

• 118,080 cores, 1.6 Petaflops, high-speed Aries interconnect, high-performance Lustre filesystem

• UK Research Data Facility (RDF)

• 23 PB disk, 48 PB tape

• Cirrus, a UK Tier 2 HPC service

• 13,000+ cores

• support research, industry, Edinburgh Genomics

• National Data Safe Haven for Scotland

• secure data environment: NHS, ADRC, ScotGov

• run under Caldicott Guardian framework

• Software Sustainability Institute

• headquarters

• Alan Turing Institute

• founding university partner, RSE University Delivery Partner

• research computing service hosts

What is Big Data?

• A set of cutting-edge approaches to deriving new knowledge from digital data?

• An over-hyped marketing term designed to sell more business analytics software?

• Facebook?

• An out-of-memory error that tells you that you need to stop trying to do things on your laptop?

• The output of the Square Kilometre Array?

• All of the above?


LSST: a modest example

• The Large Synoptic Survey Telescope

• will capture two 6.4 GB images every 39 s

• = 15 TB of raw image data every night

• and ~2 million ‘transient events’

• Over the 10 years of the survey (2022-2032)

• image 24 billion galaxies and 14 billion stars

• from 5 trillion detections and 32 trillion measurements

• 10 PB in Y1 → 70 PB in Y10
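As a rough sanity check on the figures above, the sketch below works out how much observing time per night is implied by 15 TB of raw data at two 6.4 GB images every 39 s. It assumes decimal units (1 TB = 1000 GB) and uninterrupted observing; both are assumptions for illustration, not statements from the slides.

```python
# Rough check of the LSST figures quoted above.
# Assumptions (not from the slides): decimal units, uninterrupted observing.
GB_PER_IMAGE = 6.4
IMAGES_PER_VISIT = 2
SECONDS_PER_VISIT = 39
TB_PER_NIGHT = 15

gb_per_visit = GB_PER_IMAGE * IMAGES_PER_VISIT          # data captured every 39 s
visits_per_night = TB_PER_NIGHT * 1000 / gb_per_visit   # visits needed to reach 15 TB
hours_observing = visits_per_night * SECONDS_PER_VISIT / 3600

print(f"{gb_per_visit:.1f} GB per visit")
print(f"{visits_per_night:.0f} visits per night")
print(f"{hours_observing:.1f} hours of observing")
```

Under these assumptions the implied observing time is a little under 13 hours, so the slide's numbers are best read as order-of-magnitude figures rather than an exact budget.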

Big Data 2017: per year…

• 1PB (unlabelled)

• NASDAQ: 3PB

• US Census: 4PB

• US Library of Congress: 5PB

• NOAA archive: 6PB

• YouTube: 15PB

• CERN archive: 73PB

• searches on Google: 98PB

• uploads to Facebook: 180PB

…2025 per year

• Square Kilometre Array Phase 1: 300PB

• High Luminosity Large Hadron Collider: 1,000PB

• Square Kilometre Array Phase 2: 1,000PB


Not all bigness is the same

• Big Data is typically measured three ways

• volume – from gigabytes to terabytes to petabytes

• velocity – data streams at you or changes rapidly

• variety – no longer are data in nice, neat tables

• Some folk add others

• veracity, verifiability, validity, variegation…

• You may have one, three or more

• They demand different handling strategies


Don’t try this at home

• Handling Big Data really is a specialist field

• hiring the right people is increasingly important

• Data Scientists ≠ Data Engineers ( ≠ DBAs )

• cf. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists

Data Science

• maths

• stats

• analysis

Data Engineering

• programming

• distributed systems

• data pipelines

Data Science vs Data Engineering vs Data Management

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).”

Hadley Wickham, Journal of Statistical Software, August 2014, Volume 59, Issue 10.

“Literally hundreds of practicing data miners and statistical modelers, most of them working at major corporations supporting extensive analytical projects, have reported that they spend 80% of their effort in manipulating the data so that they can analyze it!”

Dan Steinberg 2013, “How Much Time Needs to be Spent Preparing Data for Analysis?”

http://1.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data-for-Analysis

Data [ science | engineering | management ]

~20% Data science

• analytics

• machine learning

~40% Data engineering

• data movement

• data pipelines

• data tech deployment (“data dev ops”)

• database design

• data preparation & cleaning

~40% Data management

• data storage

• data formats

• metadata management

• data preservation & backup

• data preparation & cleaning


Data lifecycle models

Plan

Acquire

Assure

Describe

Preserve

Discover

Combine

Process


Plan: What data will I create/acquire/record/measure? How will I describe them? How will I store them? Will I share them? If not, why not? How will others find them?

Acquire: Create, observe, measure, generate by simulation, write, re-use from existing sources (e.g. databases, the Internet). This is your raw data, part of your “laboratory notebook”. Organise it from the start; choose standard formats.

Assure: Validate, calibrate, test. Check the correctness of the methods used to create data. In simulation terms, testing your code properly! Checking the correctness of subsetting methods or synthetic data generators.

Describe: Use meaningful variable names (not a1, a2…). Record units (“length = 3”; metres, millimetres, parsecs?). Record information needed to interpret the data in 1, 10, 100 years. Use metadata standards!

Preserve: If data get this far, they are becoming part of the scientific record. Store them carefully; think about integrity, backup, replication; accessibility; ability to render in 10 years’ time; keeping data and metadata together.

Discover: How should you make your data discoverable by others? Description and accessibility are key.

Process: Applying computing power to create “new data from old”. Data analytics; analysis of digital sensor data; simulation input…

Combine: Combining, merging data to create new insights. Good metadata & tools are essential, as is an appreciation of licence conditions!
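The advice on meaningful names, recorded units and keeping data and metadata together can be sketched in a few lines. The example below writes measurements to CSV (a standard format) alongside a JSON sidecar describing each column. The file names, column names and sidecar layout are hypothetical choices for illustration and are not a metadata standard; a real project would prefer a community standard where one exists.

```python
import csv
import json
import tempfile
from pathlib import Path

def write_with_metadata(directory, rows):
    """Write measurements as CSV plus a JSON sidecar describing them."""
    directory = Path(directory)
    data_path = directory / "river_depth.csv"            # hypothetical file name
    meta_path = directory / "river_depth.metadata.json"  # sidecar next to the data

    # Meaningful column names rather than a1, a2...
    fieldnames = ["station_id", "measured_at_utc", "water_depth"]
    with data_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    metadata = {
        "title": "Example river depth measurements",
        "columns": {
            "station_id": {"description": "Gauge identifier"},
            "measured_at_utc": {"description": "Time of reading", "format": "ISO 8601"},
            "water_depth": {"description": "Depth at gauge", "units": "metres"},
        },
        "licence": "CC-BY-4.0",
        "data_file": data_path.name,  # keeps data and metadata linked
    }
    meta_path.write_text(json.dumps(metadata, indent=2))
    return data_path, meta_path

rows = [
    {"station_id": "S1", "measured_at_utc": "2018-01-01T00:00:00Z", "water_depth": 3.0},
    {"station_id": "S1", "measured_at_utc": "2018-01-01T01:00:00Z", "water_depth": 3.2},
]
with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = write_with_metadata(tmp, rows)
    loaded = json.loads(meta_path.read_text())
    depth_units = loaded["columns"]["water_depth"]["units"]
    print(depth_units)
```

With the sidecar in place, “water_depth = 3” is no longer ambiguous to someone opening the file years later.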

Tools for creating data management plans

dmptool.org

dmponline.dcc.ac.uk

Remember: a data management plan is for life, not just for Christmas!


Managing data

• The most important question to ask:

How will the data be used?

What kind of questions are you going to ask?

• Aggregation or reduction?

• counts, sums, averages and other summary statistics

• Filtering?

• age > 30, timestamp before 1 January 2018

• Mapping, applying a function?

• convert temperature from Kelvin into degrees Celsius

• Joining two datasets?

• Retrieving a time series?

• Retrieving a document by its id, or title, or…?

• Also:

• writing a lot vs reading a lot

• volume, velocity, variety
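The first four question types above can be shown on a tiny in-memory dataset. This is a minimal sketch in plain Python; the records, field names and site names are made up for illustration, and at real scale the same operations would be pushed into a database or an analytics library.

```python
from datetime import datetime
from statistics import mean

# Hypothetical sample records; field names are illustrative only.
readings = [
    {"site": "A", "age": 25, "timestamp": datetime(2017, 12, 30), "temp_k": 280.15},
    {"site": "A", "age": 42, "timestamp": datetime(2018, 1, 2), "temp_k": 283.15},
    {"site": "B", "age": 35, "timestamp": datetime(2017, 11, 5), "temp_k": 275.15},
]
sites = {"A": "Edinburgh", "B": "Boulder"}  # a second, smaller dataset

# Aggregation / reduction: summary statistics over the whole dataset.
count = len(readings)
mean_temp_k = mean(r["temp_k"] for r in readings)

# Filtering: age > 30 and timestamp before 1 January 2018.
cutoff = datetime(2018, 1, 1)
filtered = [r for r in readings if r["age"] > 30 and r["timestamp"] < cutoff]

# Mapping: apply a function to every record (Kelvin to degrees Celsius).
temps_c = [round(r["temp_k"] - 273.15, 2) for r in readings]

# Joining: attach the site name from the second dataset via a shared key.
joined = [{**r, "site_name": sites[r["site"]]} for r in readings]

print(count, filtered[0]["site"], temps_c, joined[0]["site_name"])
```

Each operation touches the data differently: aggregation reads everything, filtering benefits from an index, and joining needs a shared key, which is why the expected questions should shape how the data is stored.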


What to do with your data

• It’s possible to create a “Data Lake” and store everything in it

• This is not always a good idea!

• careful that it doesn’t turn into a “data swamp”

• storing data without thinking ≈ throwing it away

• Need to index, order, label, describe data to understand and use it in the future

• metadata and metadata management are key

• Planning, preparation, organisation are key to not sinking in the swamp


Files v databases v objects

• Files

• structured, unstructured, semi-structured data

• many standard, open formats

• CSV, HDF5, netCDF, geotiff, FITS, JSON, XML, GRIB, …

• Databases

• relational: Postgres, MySQL, MariaDB, …

• noSQL: MongoDB, DynamoDB, Cassandra, Neo4J, …

• Objects

• http-based PUT/GET APIs

• Amazon S3, OpenStack/Swift, …


Files v databases v objects

Files

Pros

• easy to use

• random access

• any content

• any size (well…)

Cons

• poor manageability at scale

• low security

• “all or nothing”

Databases

Pros

• highly optimised

• scalable (volume)

• scalable (velocity)

• high security

• SQL!

Cons

• higher setup costs

• dedicated systems

• poor portability

• higher storage costs

Objects

Pros

• Cloud/Web-friendly

• horizontal scalability

• rich metadata

Cons

• poor findability

• more “DIY”

• more proprietary


Managing the volumes

• If you have too many files…

• … or your files are too big…

• … or both…

• … it’s probably worth migrating to a database

• at the very least for cataloguing your metadata

• A database catalogue that “points to” files (or objects) is a good compromise
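A catalogue of this kind can be sketched with Python's built-in sqlite3 module. The table name, columns and file paths below are hypothetical, and the in-memory database is only for illustration; a production catalogue would live on disk or in a server such as Postgres.

```python
import sqlite3

# In-memory database for illustration only.
# Table name, columns and paths are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_catalogue (
        id          INTEGER PRIMARY KEY,
        path        TEXT NOT NULL UNIQUE,   -- where the file (or object) lives
        format      TEXT NOT NULL,          -- e.g. netCDF, HDF5, CSV
        instrument  TEXT,
        observed_on TEXT,                   -- ISO 8601 date
        size_bytes  INTEGER
    )
    """
)
conn.execute("CREATE INDEX idx_observed_on ON file_catalogue (observed_on)")

entries = [
    ("/data/2018/01/run_001.nc", "netCDF", "sensor-a", "2018-01-03", 1_200_000_000),
    ("/data/2018/01/run_002.nc", "netCDF", "sensor-b", "2018-01-04", 900_000_000),
    ("/data/2018/02/run_003.h5", "HDF5", "sensor-a", "2018-02-11", 2_500_000_000),
]
conn.executemany(
    "INSERT INTO file_catalogue (path, format, instrument, observed_on, size_bytes) "
    "VALUES (?, ?, ?, ?, ?)",
    entries,
)

# Find the files to open without walking the filesystem.
january_sensor_a = [
    row[0]
    for row in conn.execute(
        "SELECT path FROM file_catalogue "
        "WHERE instrument = ? AND observed_on BETWEEN ? AND ? ORDER BY observed_on",
        ("sensor-a", "2018-01-01", "2018-01-31"),
    )
]
print(january_sensor_a)
```

The bulk data stays in files, where the formats and tools already work; only the small, queryable description moves into the database.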

Relational v noSQL?

• Relational DBs have been around since the 1970s

• tables, relationships, keys, indexes, transactions, SQL

• noSQL databases are new-ish, “big data” inventions

• “not only” SQL (or sometimes “not even” SQL)

• specialised for certain kinds of data at “3V scale”

• Column stores (fast read, slow write)

• Google BigTable, Facebook/Apache Cassandra

• Key-value stores (simple, fast lookups)

• Amazon DynamoDB, BerkeleyDB

• Document stores (indexing arbitrary content)

• MongoDB, CouchDB

• Graph databases (relationships are primary objects)

• Neo4J, DataStax

NoSQL databases

• Designed for distributed storage with high horizontal scalability

• suitable for large structured, semi-structured or unstructured data

• built to address particular use-cases at scale

• by Google, Amazon, Facebook, LinkedIn

• No schemas are required (a.k.a. schema-less)

• gives flexibility in storing documents with different content

• No transactions

• No join operations

• The trade-off: give up things like transactions and joins to achieve horizontal scalability
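What “schema-less” and “no join operations” mean in practice can be sketched without any database at all. In the example below a plain list of dicts stands in for a document collection; the `find` helper, the field names and the sample documents are all hypothetical and do not correspond to any particular product's API.

```python
# Documents in one "collection" need not share the same fields.
# A plain list of dicts stands in for a document store here.
collection = [
    {"_id": 1, "type": "image", "format": "tiff", "owner": "u-001"},
    {"_id": 2, "type": "note", "text": "Recalibrate on Monday", "owner": "u-001"},
    {"_id": 3, "type": "image", "format": "png", "frames": 240, "owner": "u-002"},
]
owners = {"u-001": {"group": "astronomy"}, "u-002": {"group": "genomics"}}

def find(docs, **criteria):
    """Return documents whose fields match all criteria; missing fields never match."""
    missing = object()
    return [d for d in docs if all(d.get(k, missing) == v for k, v in criteria.items())]

images = find(collection, type="image")

# With no join operation in the store, combining datasets is the application's job.
images_with_group = [{**doc, "group": owners[doc["owner"]]["group"]} for doc in images]

print([d["_id"] for d in images], images_with_group[1]["group"])
```

The flexibility is real, but so is the cost: consistency checks and joins that a relational database would do for you move into application code.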

But: SQL databases are not dead

• Unless you are at Amazon/Google/Facebook/SKA scale you probably don’t need a noSQL database

• Relational databases remain very powerful, very effective tools for managing data

• and they all understand SQL!

• If you’re exploring social networks…

• … you might use a graph database

• If you have unpredictably-structured data…

• … you might use a document store
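Two of the things relational databases do well, joins and transactions, can be demonstrated with Python's built-in sqlite3 module. This is a minimal sketch; the tables and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE readings (
        station_id INTEGER NOT NULL REFERENCES stations(id),
        temp_c     REAL NOT NULL
    );
    INSERT INTO stations VALUES (1, 'north'), (2, 'south');
    INSERT INTO readings VALUES (1, 4.0), (1, 6.0), (2, 11.0);
    """
)

# A join plus an aggregate, expressed declaratively in SQL.
averages = conn.execute(
    """
    SELECT s.name, AVG(r.temp_c)
    FROM stations AS s JOIN readings AS r ON r.station_id = s.id
    GROUP BY s.name ORDER BY s.name
    """
).fetchall()

# A transaction: either both inserts happen or neither does.
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO readings VALUES (2, 12.0)")
        conn.execute("INSERT INTO readings VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

reading_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(averages, reading_count)
```

The failed second insert rolls back the first as well, so the table is left exactly as it was before the transaction started.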

A model for data analysis: CRISP-DM

• Cross Industry Standard Process for Data Mining

• Invented 1996-9 by SPSS (now IBM), Teradata, Daimler AG, NCR Corporation and OHRA

• C. Shearer, “The CRISP-DM model: the new blueprint for data mining”

• Journal of Data Warehousing, Vol. 5 (4), 2000

[CRISP-DM process diagram, annotated “80%”]

Data Science skills diagram

Fig1-1 from O’Neil and Shutt, “Doing Data Science”. This is based on Drew Conway’s Venn Diagram of Data Science

[Venn diagram; labels recovered from the figure: Domain Expertise, Machine Learning, Data Science]


The Danger Zone!

• People who know enough to be dangerous

• Capable of extracting and structuring data

• Know quite a bit about the field and can even run a linear regression

BUT…

• Lack understanding of what the regression coefficients mean

SO …

• Have ability to create what appears to be legitimate analysis without understanding how they got there or what they have created!
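To make the point about regression coefficients concrete, the sketch below fits a one-variable least-squares line by hand so that every quantity is visible. The data are invented for illustration, and the variable names are hypothetical.

```python
# Ordinary least squares for one predictor, written out by hand.
# The data are made up for illustration.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 55.0, 61.0, 64.0, 68.0]

n = len(hours_studied)
mean_x = sum(hours_studied) / n
mean_y = sum(exam_score) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours_studied, exam_score))
sxx = sum((x - mean_x) ** 2 for x in hours_studied)
syy = sum((y - mean_y) ** 2 for y in exam_score)

slope = sxy / sxx                    # change in score per extra hour studied
intercept = mean_y - slope * mean_x  # predicted score at zero hours
r_squared = sxy ** 2 / (sxx * syy)   # share of variance explained by the line

print(f"slope={slope:.2f} intercept={intercept:.2f} r_squared={r_squared:.3f}")
```

Reading the output correctly is the hard part: the slope says that in this sample one more hour goes with about 4.1 more points on average. It does not say that studying causes the increase, and the intercept is an extrapolation to zero hours, outside the range of the data.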

Python and R (and more)

Notebooks

• Jupyter

• Zeppelin

Languages

• Python

• R

• SQL

• JavaScript

• NodeJS

Libraries

• SciPy

• Pandas

• Scikit-learn

• GPText

• OpenNLP

• Mahout

• +many others

Visualisation

• D3.js

• matplotlib

• Seaborn

• R Shiny

• Leaflet

• PowerBI

• ggplot2

Python Data Science ecosystem

[Diagram; labels recovered from the figure: Machine Learning/Statistics, Business Intelligence, Scientific Computing/HPC, Web, Distributed Systems]

Dealing with Volume: MONC

• Met Office NERC Cloud model

• “Classic” FORTRAN/MPI simulation

• Scales to 10,000s of cores on ARCHER etc.

• Billions of grid points; terabytes of data

• how can we best analyse the data in a scalable fashion?

• the previous LEM model did this in line with computation: the model would stop and calculate diagnostics before continuing with computation

• could write to disk and analyse offline…?

[Inset slide “Contents”: Computational model; Background; MONC, a community model for atmospheric modelling; In-situ data analytics (sharing cores between computation and analysis); Hybrid MONC (offloading kernels to GPUs on Piz Daint)]

[Inset diagram; labels recovered from the figure: Prognostics, Diagnostics, prognostic data, diagnostic data]

Dealing with Volume: MONC

• Raw prognostic data is never written out

• would be too time consuming

• Instead do analytics in situ

• have many computational processes (C) and a number of data analytics cores (D)

• Typically one core per processor is dedicated to IO, serving the other cores running the computational model

• Computational cores “fire and forget” their data

• Avoids blocking the computational cores for analytics and IO
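The “fire and forget” pattern can be illustrated with a toy producer/consumer sketch. This is not how MONC is implemented (MONC is a FORTRAN/MPI code); Python threads and a queue are used here only to show the shape of the idea, and all names and sizes are hypothetical.

```python
import queue
import threading

NUM_COMPUTE = 3        # stand-ins for computational cores (C)
STEPS_PER_WORKER = 4
SENTINEL = None        # marks the end of one worker's output

outbox = queue.Queue()  # unbounded, so put() never blocks the compute side
diagnostics = []        # written only by the analytics thread (D)

def compute(worker_id):
    """Advance a toy 'model' and hand each field off without waiting."""
    for step in range(STEPS_PER_WORKER):
        field = [worker_id + step + i for i in range(5)]  # fake prognostic data
        outbox.put((worker_id, step, field))              # fire and forget
    outbox.put(SENTINEL)                                  # this worker is done

def analyse():
    """Reduce raw fields to small diagnostics; raw data is never stored."""
    finished = 0
    while finished < NUM_COMPUTE:
        item = outbox.get()
        if item is SENTINEL:
            finished += 1
            continue
        worker_id, step, field = item
        diagnostics.append((worker_id, step, sum(field) / len(field)))

analytics_thread = threading.Thread(target=analyse)
compute_threads = [threading.Thread(target=compute, args=(i,)) for i in range(NUM_COMPUTE)]
analytics_thread.start()
for t in compute_threads:
    t.start()
for t in compute_threads:
    t.join()
analytics_thread.join()

print(len(diagnostics), "diagnostics computed; no raw fields kept")
```

The compute side only ever calls `put`, so it never waits for analysis or IO; the analytics side keeps just the reduced values, mirroring the idea that raw prognostic data is never written out.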


Dealing with Velocity: International Centre for Earth Data

• ICED consortium

• UoE/EPCC

• CU Boulder

• Orbital Microsystems

• “The most meaningful and timely weather data available”

• 32-40 low Earth orbit CubeSats

• microwave measurements, moisture, temperature

• full global coverage

• improved temporal (30x) resolution

• improved spatial (2x) resolution

• reduced latency, 15min refresh

• Significant incoming data!

• EPCC/CU designing, building analytics pipeline

• processing

• storage

• delivery

https://www.ed.ac.uk/news/2018/satellite-system-to-improve-weather-forecasts


Dealing with Variety: National Safe Haven for Scotland

• Hosted since 2015

• Designed for secure access to sensitive data

• NHS register

• Government

• Alan Turing Institute

• private sector

• The tech is easy! (ish)

• Employs MongoDB & others to manage variety & volume in DICOM files

• Key is information governance process that sits above the tech

• Caldicott Guardianship framework

Bringing it All Together: the Edinburgh & South-East Scotland City Region Deal

• £800M investment programme from UK & Scottish Gov into the SE Scotland region, from Fife to the Borders

• signed August 2018

• Data Driven Innovation Programme

• strong focus on talent & skills

• underpinning World-Class Data Infrastructure

Wrap

• Digital instruments & IoT are opening up all sorts of interesting research avenues

• managing the data is becoming more of a specialised job

• Take a structured approach to management & analysis

• data management planning: it’s not just the law – it’s a good idea!

• Think about how your data will be used

• design storage around access patterns & queries

• Hire the right people

• data engineers, data scientists, research software engineers…

• Leverage open source tools

• for management, engineering and data science


Re-use

• © 2018 The University of Edinburgh

• You are free to reuse this presentation and its contents under the terms of CC-BY-4.0

• This presentation was originally created by Rob Baxter, EPCC, The University of Edinburgh