TRANSCRIPT
DON’T PANIC! Demystifying Big Data, Data Science and all that
Dr Rob Baxter, Group Manager
EPCC-public. © 2018 The University of Edinburgh, licensed CC-BY 4.0
Don’t panic!
• EPCC
• Big data: science, engineering, management
• Manage data: RDM lifecycle & DMP
• Organise data: files, databases, objects
• Analyse data: methods and tools
• Big data at EPCC
• Self-sustained:
• 25+ years
• 100 staff
• £5M turnover, externally generated
• Multi-disciplinary and multi-funded:
• large spectrum of activities
• critical mass of expertise
• distributed systems
• high-performance computing
• data engineering
• research software engineering
• Support research through:
• access to facilities & services
• training courses
• collaborative projects, contract research
The supercomputing centre of the University of Edinburgh
National & regional services
• ARCHER, the UK Tier 1 HPC service
• 118,080 cores, 1.6 Petaflops, high-speed Aries interconnect, high-performance Lustre filesystem
• UK Research Data Facility (RDF)
• 23 PB disk, 48 PB tape
• Cirrus, a UK Tier 2 HPC service
• 13,000+ cores
• support research, industry, Edinburgh Genomics
• National Data Safe Haven for Scotland
• secure data environment: NHS, ADRC, ScotGov
• run under Caldicott Guardian framework
• Software Sustainability Institute
• headquarters
• Alan Turing Institute
• founding university partner, RSE University Delivery Partner
• research computing service hosts
What is Big Data?
• A set of cutting-edge approaches to deriving new
knowledge from digital data?
• An over-hyped marketing term designed to sell
more business analytics software?
• Facebook?
• An out-of-memory error that tells you that you
need to stop trying to do things on your laptop?
• The output of the Square Kilometre Array?
• All of the above?
LSST: a modest example
• The Large Synoptic Survey Telescope
• will capture two 6.4 GB images every 39 s
• = 15 TB of raw image data every night
• and ~2 million ‘transient events’
• Over the 10 years of the survey (2022-2032)
• image 24 billion galaxies and 14 billion stars
• from 5 trillion detections and 32 trillion measurements
• 10 PB in Y1 → 70 PB in Y10
Big Data 2017: per year…
• NASDAQ: 1 PB
• US Census: 3 PB
• US Library of Congress: 4 PB
• NOAA archive: 5 PB
• YouTube: 6 PB
• …: 15 PB
• CERN archive: 73 PB
• searches on Google: 98 PB
• uploads to …: 180 PB
…2025: per year
• Square Kilometre Array Phase 1: 300 PB
• High Luminosity Large Hadron Collider: 1,000 PB
• Square Kilometre Array Phase 2: 1,000 PB
Not all bigness is the same
• Big Data is typically measured three ways
• volume – from gigabytes to terabytes to petabytes
• velocity – data streams at you or changes rapidly
• variety – no longer are data in nice, neat tables
• Some folk add others
• veracity, verifiability, validity, variegation…
• You may have one, three or more
• They demand different handling strategies
Don’t try this at home
• Handling Big Data really is a specialist field
• hiring the right people is increasingly important
• Data Scientists ≠ Data Engineers ( ≠ DBAs )
• cf. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
Data Science
• maths
• stats
• analysis
Data Engineering
• programming
• distributed systems
• data pipelines
Data Science vs Data Engineering vs Data Management
“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and
Johnson 2003).”
Hadley Wickham, Journal of Statistical Software, August 2014, Volume 59, Issue 10.
“Literally hundreds of practicing data miners and statistical modelers, most of them working at major corporations
supporting extensive analytical projects, have reported that they spend 80% of their effort in manipulating the data so
that they can analyze it!”
Dan Steinberg 2013, “How Much Time Needs to be Spent Preparing Data for Analysis?”
http://1.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data-for-Analysis
Data [ science | engineering | management ]
Data science (~20%)
• analytics
• machine learning
Data engineering (~40%)
• data movement
• data pipelines
• data tech deployment (“data dev ops”)
• database design
• data preparation & cleaning
Data management (~40%)
• data storage
• data formats
• metadata management
• data preservation & backup
• data preparation & cleaning
Data lifecycle models
• Plan: What data will I create/acquire/record/measure? How will I describe them? How will I store them? Will I share them? If not, why not? How will others find them?
• Acquire: Create, observe, measure, generate by simulation, write, re-use from existing sources (e.g. databases, the Internet). This is your raw data, part of your “laboratory notebook”. Organise it from the start – choose standard formats.
• Assure: Validate, calibrate, test. Check the correctness of the methods used to create data: in simulation terms, testing your code properly; checking the correctness of subsetting methods or synthetic data generators.
• Describe: Use meaningful variable names (not a1, a2…). Record units (“length = 3”; metres, millimetres, parsecs?). Record the information needed to interpret the data in 1, 10, 100 years. Use metadata standards!
• Preserve: If data get this far, they are becoming part of the scientific record. Store them carefully; think about integrity, backup and replication; accessibility; the ability to render them in 10 years’ time; and keeping data and metadata together.
• Discover: How should you make your data discoverable by others? Description and accessibility are key.
• Combine: Combining and merging data to create new insights. Good metadata & tools are essential, as is an appreciation of licence conditions!
• Process: Applying computing power to create “new data from old”: data analytics, analysis of digital sensor data, simulation input…
Tools for creating data management plans
dmptool.org
dmponline.dcc.ac.uk
Remember: a data
management plan is for
life, not just for Christmas!
Managing data
• The most important question to ask:
How will the data be used?
What kind of questions are you going to ask?
• Aggregation or reduction? (see the sketch after this list)
• counts, sums, averages and other summary statistics
• Filtering?
• age > 30, timestamp before 1 January 2018
• Mapping, applying a function?
• convert temperature from Kelvin into degrees Celsius
• Joining two datasets?
• Retrieving a time series?
• Retrieving a document by its id, or title, or…?
• Also:
• writing a lot vs reading a lot
• volume, velocity, variety
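A minimal sketch of these access patterns in pandas (the DataFrame and its column names are invented purely for illustration):

import pandas as pd

# Invented example data
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "age": [25, 41, 33, 19],
    "temperature_K": [288.1, 290.4, 275.0, 301.2],
})

# Filtering: age > 30
over_30 = df[df["age"] > 30]

# Mapping / applying a function: Kelvin to degrees Celsius
df["temperature_C"] = df["temperature_K"] - 273.15

# Aggregation / reduction: counts and averages per station
summary = df.groupby("station")["temperature_C"].agg(["count", "mean"])
print(over_30)
print(summary)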
What to do with your data
• It’s possible to create a “Data Lake” and simply store everything
• This is not always a good idea!
• be careful that it doesn’t turn into a “data swamp”
• storing data without thinking ≈ throwing it away
• Need to index, order, label, describe data to understand and use it in the future
• metadata and metadata management are key
• Planning, preparation, organisation are key to not sinking in the swamp
Files v databases v objects
• Files
• structured, unstructured, semi-structured data
• many standard, open formats
• CSV, HDF5, netCDF, geotiff, FITS, JSON, XML, GRIB, …
• Databases
• relational: Postgres, MySQL, MariaDB, …
• noSQL: MongoDB, DynamoDB, Cassandra, Neo4J, …
• Objects
• http-based PUT/GET APIs
• Amazon S3, OpenStack/Swift, … (see the sketch below)
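As an illustration of the object-store PUT/GET style, here is a minimal sketch using boto3 against an S3-compatible store (the bucket name, keys and metadata are assumptions for the example; credentials and endpoint are taken from the environment):

import boto3

# S3 client (credentials/endpoint assumed to be configured in the environment)
s3 = boto3.client("s3")

# PUT: upload a local file as an object under a key, with some metadata attached
s3.upload_file(
    "results.csv", "my-research-bucket", "experiments/run-042/results.csv",
    ExtraArgs={"Metadata": {"instrument": "sensor-7", "units": "celsius"}},
)

# GET: retrieve the object back to a local file
s3.download_file("my-research-bucket", "experiments/run-042/results.csv", "results_copy.csv")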
Files v databases v objects
Files
• Pros: easy to use; random access; any content; any size (well…)
• Cons: poor manageability at scale; low security; “all or nothing”
Databases
• Pros: highly optimised; scalable (volume); scalable (velocity); high security; SQL!
• Cons: higher setup costs; dedicated systems; poor portability; higher storage costs
Objects
• Pros: Cloud/Web-friendly; horizontal scalability; rich metadata
• Cons: poor findability; more “DIY”; more proprietary
Managing the volumes
• If you have too many files…
• … or your files are too big…
• … or both…
• … it’s probably worth migrating to a database
• at the very least for cataloguing your metadata
• A database catalogue that “points to” files (or objects) is a good compromise (see the sketch below)
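A minimal sketch of such a catalogue using SQLite from Python (the table and column names are assumptions for the example):

import sqlite3

# A small metadata catalogue that points at files rather than storing them
conn = sqlite3.connect("catalogue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        instrument TEXT,
        created    TEXT,          -- ISO 8601 timestamp
        file_path  TEXT NOT NULL  -- where the actual data live
    )
""")

# Register a file in the catalogue
conn.execute(
    "INSERT INTO datasets (title, instrument, created, file_path) VALUES (?, ?, ?, ?)",
    ("Run 042 temperatures", "sensor-7", "2018-06-01T12:00:00", "/data/run042/results.csv"),
)
conn.commit()

# Find the files you need by querying the metadata, then open them directly
for title, path in conn.execute(
    "SELECT title, file_path FROM datasets WHERE instrument = ?", ("sensor-7",)
):
    print(title, "->", path)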
Relational v noSQL?
• Relational DBs have been around since the 1970s
• tables, relationships, keys, indexes, transactions, SQL
• noSQL DBs are new-ish, “big data” inventions
• “not only” SQL (or sometimes “not even” SQL)
• specialised for certain kinds of data at “3V scale”
• Column stores (fast read, slow write)
• Google BigTable, Facebook/Apache Cassandra
• Key-value stores (simple, fast lookups)
• Amazon DynamoDB, BerkeleyDB
• Document stores (indexing arbitrary content)
• MongoDB, CouchDB
• Graph databases (relationships are primary objects)
• Neo4J, DataStax
NoSQL databases
• Designed for distributed storage with high
horizontal scalability
• suitable for large structured, semi-structured or
unstructured data
• built to address particular use-cases at scale
• by Google, Amazon, Facebook, LinkedIn
• No schemas are required (a.k.a. schema-less)
• gives flexibility in storing documents with different content
• No transactions
• No join operations
• you give up things like these to achieve that horizontal scalability
But: SQL databases are not dead
• Unless you are at Amazon/Google/Facebook/SKA
scale you probably don’t need a noSQL database
• Relational databases remain very powerful, very
effective tools for managing data
• and they all understand SQL!
• If you’re exploring social networks…
• … you might use a graph database
• If you have unpredictably-structured data…
• … you might use a document store (see the sketch below)
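A minimal sketch of the document-store idea using pymongo (the connection URL, database and collection names are assumptions for the example):

from pymongo import MongoClient

# Connect to a local MongoDB instance (URL assumed for the example)
client = MongoClient("mongodb://localhost:27017")
observations = client["research"]["observations"]

# Schema-less: documents in the same collection can have different fields
observations.insert_one({"site": "Edinburgh", "temperature_C": 14.2, "sensor": "sensor-7"})
observations.insert_one({"site": "Boulder", "wind_speed_ms": 6.1, "notes": "gusty afternoon"})

# Query by whatever fields a document happens to have
for doc in observations.find({"temperature_C": {"$gt": 10}}):
    print(doc["site"], doc["temperature_C"])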
A model for data analysis: CRISP-DM
• Cross Industry Standard
Process for Data Mining
• Invented 1996-9 by SPSS
(now IBM), Teradata,
Daimler AG, NCR
Corporation and OHRA
• C. Shearer, “The CRISP-
DM model: the new
blueprint for data mining”
• Journal of Data Warehousing,
Vol. 5 (4), 2000
(~80% of the effort sits in the data understanding and preparation phases)
Data Science skills diagram
Fig 1-1 from O’Neil and Schutt, “Doing Data Science”, based on Drew Conway’s Venn Diagram of Data Science
[Venn diagram: Domain Expertise and Machine Learning overlap to give Data Science]
The Danger Zone!
• People who know enough to be dangerous
• Capable of extracting and structuring data
• Know quite a bit about the field and can even run a linear regression
BUT…
• Lack understanding of what the regression coefficients mean
SO …
• Have ability to create what appears to be legitimate analysis without understanding how they got there or what they have created!
Python and R (and more)
Notebooks
Jupyter
Zeppelin
Languages
Python
R
SQL
Javascript
NodeJS
Libraries
SciPy
Pandas
Scikit-learn
GPText
OpenNLP
Mahout
+many others
Visualisation
D3.js
matplotlib
Seaborn
R
shiny
Leaflet
PowerBI
ggplot2
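As a small illustration of how a few of these pieces fit together, here is a minimal sketch using pandas, scikit-learn and matplotlib on synthetic data (everything in it is invented for the example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data purely for illustration
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2.5 * df["x"] + rng.normal(0, 1, 100)

# Fit a simple linear regression with scikit-learn
model = LinearRegression().fit(df[["x"]], df["y"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Visualise the fit with matplotlib
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("fit.png")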
Python Data Science ecosystem
[Diagram: the Python library ecosystem spanning Machine Learning/Statistics, Business Intelligence, Scientific Computing/HPC, Web and Distributed Systems]
Contents
• Computational model
• Background
• MONC, a community model for atmospheric modelling
• In-situ data analytics
• Sharing cores between computation and analysis
• Hybrid MONC
• Offloading kernels to GPUs on Piz Daint
Dealing with Volume: MONC
• Met Office NERC Cloud model
• “Classic” FORTRAN/MPI simulation
• Scales to 10,000s of cores on ARCHER etc.
• Billions of grid points; terabytes of data
• how can we best analyse the data in a scalable fashion?
• could write to disk and analyse offline…?
Data analytics
• With much larger domains (billions of grid points), how can we best analyse the data in a scalable fashion?
• The previous LEM model did this in line with computation, where the model would stop and calculate diagnostics before continuing with computation
• Could write to disk and analyse offline
[Diagram: Prognostics produce prognostic data; Diagnostics produce diagnostic data]
Dealing with Volume: MONC
• Raw prognostic data is never written out
• would be too time consuming
• Instead, do analytics in situ
• have many computational processes (C) and a number of data analytics cores (D)
• Typically one core per processor is dedicated to IO, serving the other cores running the computational model
• Computational cores “fire and forget” their data (see the sketch below)
• Avoids blocking the computational cores for analytics and IO
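This is not MONC’s actual implementation (MONC itself is FORTRAN/MPI); it is just a minimal sketch of the fire-and-forget pattern using mpi4py, assuming the last MPI rank acts as the dedicated IO/analytics core:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
io_rank = size - 1   # assume the last rank is the dedicated IO/analytics core
steps = 10

if rank == io_rank:
    # IO/analytics core: receive fields from any compute rank and reduce them
    for _ in range((size - 1) * steps):
        field = comm.recv(source=MPI.ANY_SOURCE, tag=0)
        print("received field, mean =", float(np.mean(field)))
else:
    # Compute cores: advance the model and send data without blocking on IO
    field = np.zeros(1000)
    requests = []
    for step in range(steps):
        field += 1.0                                   # stand-in for real computation
        requests.append(comm.isend(field.copy(), dest=io_rank, tag=0))
    for req in requests:
        req.wait()                                     # complete the sends once the work is done

Run with, e.g., mpirun -n 4 python monc_sketch.py (the script name is hypothetical): one rank serves IO while the others compute.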
Dealing with Velocity: International Centre for Earth Data
• ICED consortium
• UoE/EPCC
• CU Boulder
• Orbital Microsystems
• “The most meaningful and timely weather data available”
• 32-40 low Earth orbit CubeSats
• microwave measurements, moisture, temperature
• full global coverage
• improved temporal (30x) resolution
• improved spatial (2x) resolution
• reduced latency, 15min refresh
• Significant incoming data!
• EPCC/CU designing, building analytics pipeline
• processing
• storage
• delivery
https://www.ed.ac.uk/news/2018/
satellite-system-to-improve-weather-forecasts
Dealing with Variety: National Safe Haven for Scotland
• Hosted since 2015
• Designed for secure access to sensitive data
• NHS register
• Government
• Alan Turing Institute
• private sector
• The tech is easy! (ish)
• Employs MongoDB & others to manage variety & volume in DICOM files
• Key is information governance process that sits above the tech
• Caldicott Guardianship framework
Bringing it All Together:
the Edinburgh & South-East Scotland City Region Deal
• £800M investment programme from UK & Scottish Gov into the SE Scotland region, from Fife to the Borders
• signed August 2018
• Data Driven Innovation Programme
• strong focus on talent & skills
• underpinning World-Class Data Infrastructure
Wrap
• Digital instruments & IoT are opening up all sorts of
interesting research avenues
• managing the data is becoming more of a specialised job
• Take a structured approach to management & analysis
• data management planning: it’s not just the law – it’s a good idea!
• Think about how your data will be used
• design storage around access patterns & queries
• Hire the right people
• data engineers, data scientists, research software engineers…
• Leverage open source tools
• for management, engineering and data science
Re-use
• © 2018 The University of Edinburgh
• You are free to reuse this presentation and its contents under the terms of CC-BY-4.0
• This presentation was originally created by Rob Baxter, EPCC, The University of Edinburgh