TRANSCRIPT
DON’T PANIC! Demystifying Big Data, Data Science and all that
Dr Rob Baxter, Group Manager
EPCC-public. © 2018 The University of Edinburgh, licensed CC-BY 4.0
Don’t panic!
• EPCC
• Big data: science, engineering, management
• Manage data: RDM lifecycle & DMP
• Organise data: files, databases, objects
• Analyse data: methods and tools
• Big data at EPCC
• Self-sustained:
• 25+ years
• 100 staff
• £5M turnover, externally generated
• Multi-disciplinary and multi-funded:
• large spectrum of activities
• critical mass of expertise
• distributed systems
• high-performance computing
• data engineering
• research software engineering
• Support research through:
• access to facilities & services
• training courses
• collaborative projects, contract research
The supercomputing centre of the University of Edinburgh
National & regional services
• ARCHER, the UK Tier 1 HPC service
• 118,080 cores, 1.6 Petaflops, high-speed Aries interconnect, high-performance Lustre filesystem
• UK Research Data Facility (RDF)
• 23 PB disk, 48 PB tape
• Cirrus, a UK Tier 2 HPC service
• 13,000+ cores
• support research, industry, Edinburgh Genomics
• National Data Safe Haven for Scotland
• secure data environment: NHS, ADRC, ScotGov
• run under Caldicott Guardian framework
• Software Sustainability Institute
• headquarters
• Alan Turing Institute
• founding university partner, RSE University Delivery Partner
• research computing service hosts
What is Big Data?
• A set of cutting-edge approaches to deriving new
knowledge from digital data?
• An over-hyped marketing term designed to sell
more business analytics software?
• Facebook?
• An out-of-memory error that tells you that you
need to stop trying to do things on your laptop?
• The output of the Square Kilometre Array?
• All of the above?
LSST: a modest example
• The Large Synoptic Survey Telescope
• will capture two 6.4 GB images every 39 s
• = 15 TB of raw image data every night
• and ~2 million ‘transient events’
• Over the 10 years of the survey (2022-2032)
• image 24 billion galaxies and 14 billion stars
• from 5 trillion detections and 32 trillion measurements
• 10 PB in Y1 → 70 PB in Y10
Big Data 2017: per year…
• NASDAQ: 1 PB
• US Census: 3 PB
• US Library of Congress: 4 PB
• NOAA archive: 5 PB
• YouTube: 6 PB
• …: 15 PB
• CERN archive: 73 PB
• searches on Google: 98 PB
• uploads to …: 180 PB
…2025: per year
• Square Kilometre Array Phase 1: 300 PB
• High Luminosity Large Hadron Collider: 1,000 PB
• Square Kilometre Array Phase 2: 1,000 PB
Not all bigness is the same
• Big Data is typically measured three ways
• volume – from gigabytes to terabytes to petabytes
• velocity – data streams at you or changes rapidly
• variety – no longer are data in nice, neat tables
• Some folk add others
• veracity, verifiability, validity, variegation…
• You may have one, three or more
• They demand different handling strategies
Don’t try this at home
• Handling Big Data really is a specialist field
• hiring the right people is increasingly important
• Data Scientists ≠ Data Engineers ( ≠ DBAs )
• cf. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
Data Science
• maths
• stats
• analysis
Data Engineering
• programming
• distributed systems
• data pipelines
Data Science vs Data Engineering vs Data Management
“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and
Johnson 2003).”
Hadley Wickham, Journal of Statistical Software, August 2014, Volume 59, Issue 10.
“Literally hundreds of practicing data miners and statistical modelers, most of them working at major corporations
supporting extensive analytical projects, have reported that they spend 80% of their effort in manipulating the data so
that they can analyze it!”
Dan Steinberg 2013, “How Much Time Needs to be Spent Preparing Data for Analysis?”
http://1.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data-for-Analysis
Data [ science | engineering | management ]
Data science (~20%)
• analytics
• machine learning
Data engineering (~40%)
• data movement
• data pipelines
• data tech deployment (“data dev ops”)
• database design
• data preparation & cleaning
Data management (~40%)
• data storage
• data formats
• metadata management
• data preservation & backup
• data preparation & cleaning
Data lifecycle models
• Plan: What data will I create/acquire/record/measure? How will I describe them? How will I store them? Will I share them? If not, why not? How will others find them?
• Acquire: Create, observe, measure, generate by simulation, write, re-use from existing sources (e.g. databases, the Internet). This is your raw data, part of your “laboratory notebook”. Organise it from the start – choose standard formats.
• Assure: Validate, calibrate, test. Check the correctness of the methods used to create data: in simulation terms, testing your code properly; checking the correctness of subsetting methods or synthetic data generators.
• Describe: Use meaningful variable names (not a1, a2…). Record units (“length = 3”; metres, millimetres, parsecs?). Record the information needed to interpret the data in 1, 10, 100 years. Use metadata standards!
• Preserve: If data get this far, they are becoming part of the scientific record. Store them carefully; think about integrity, backup and replication; accessibility; the ability to render them in 10 years’ time; and keeping data and metadata together.
• Discover: How should you make your data discoverable by others? Description and accessibility are key.
• Combine: Combining and merging data to create new insights. Good metadata & tools are essential, as is an appreciation of licence conditions!
• Process: Applying computing power to create “new data from old”: data analytics, analysis of digital sensor data, simulation input…
Tools for creating data management plans
dmptool.org
dmponline.dcc.ac.uk
Remember: a data
management plan is for
life, not just for Christmas!
Managing data
• The most important question to ask:
How will the data be used?
What kind of questions are you going to ask?
• Aggregation or reduction? (see the sketch after this list)
• counts, sums, averages and other summary statistics
• Filtering?
• age > 30, timestamp before 1 January 2018
• Mapping, applying a function?
• convert temperature from Kelvin into degrees Celsius
• Joining two datasets?
• Retrieving a time series?
• Retrieving a document by its id, or title, or…?
• Also:
• writing a lot vs reading a lot
• volume, velocity, variety
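A minimal sketch of these access patterns in pandas (the DataFrame and its column names are invented purely for illustration):

import pandas as pd

# Invented example data
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "age": [25, 41, 33, 19],
    "temperature_K": [288.1, 290.4, 275.0, 301.2],
})

# Filtering: age > 30
over_30 = df[df["age"] > 30]

# Mapping / applying a function: Kelvin to degrees Celsius
df["temperature_C"] = df["temperature_K"] - 273.15

# Aggregation / reduction: counts and averages per station
summary = df.groupby("station")["temperature_C"].agg(["count", "mean"])
print(over_30)
print(summary)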
What to do with your data
• It’s possible to create a “Data Lake” and simply store everything
• This is not always a good idea!
• be careful that it doesn’t turn into a “data swamp”
• storing data without thinking ≈ throwing it away
• Need to index, order, label, describe data to understand and use it in the future
• metadata and metadata management are key
• Planning, preparation, organisation are key to not sinking in the swamp
Files v databases v objects
• Files
• structured, unstructured, semi-structured data
• many standard, open formats
• CSV, HDF5, netCDF, geotiff, FITS, JSON, XML, GRIB, …
• Databases
• relational: Postgres, MySQL, MariaDB, …
• noSQL: MongoDB, DynamoDB, Cassandra, Neo4J, …
• Objects
• http-based PUT/GET APIs
• Amazon S3, OpenStack/Swift, … (see the sketch below)
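As an illustration of the object-store PUT/GET style, here is a minimal sketch using boto3 against an S3-compatible store (the bucket name, keys and metadata are assumptions for the example; credentials and endpoint are taken from the environment):

import boto3

# S3 client (credentials/endpoint assumed to be configured in the environment)
s3 = boto3.client("s3")

# PUT: upload a local file as an object under a key, with some metadata attached
s3.upload_file(
    "results.csv", "my-research-bucket", "experiments/run-042/results.csv",
    ExtraArgs={"Metadata": {"instrument": "sensor-7", "units": "celsius"}},
)

# GET: retrieve the object back to a local file
s3.download_file("my-research-bucket", "experiments/run-042/results.csv", "results_copy.csv")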
Files v databases v objects
Files
• Pros: easy to use; random access; any content; any size (well…)
• Cons: poor manageability at scale; low security; “all or nothing”
Databases
• Pros: highly optimised; scalable (volume); scalable (velocity); high security; SQL!
• Cons: higher setup costs; dedicated systems; poor portability; higher storage costs
Objects
• Pros: Cloud/Web-friendly; horizontal scalability; rich metadata
• Cons: poor findability; more “DIY”; more proprietary
Managing the volumes
• If you have too many files…
• … or your files are too big…
• … or both…
• … it’s probably worth migrating to a database
• at the very least for cataloguing your metadata
• A database catalogue that “points to” files (or objects) is a good compromise (see the sketch below)
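A minimal sketch of such a catalogue using SQLite from Python (the table and column names are assumptions for the example):

import sqlite3

# A small metadata catalogue that points at files rather than storing them
conn = sqlite3.connect("catalogue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        instrument TEXT,
        created    TEXT,          -- ISO 8601 timestamp
        file_path  TEXT NOT NULL  -- where the actual data live
    )
""")

# Register a file in the catalogue
conn.execute(
    "INSERT INTO datasets (title, instrument, created, file_path) VALUES (?, ?, ?, ?)",
    ("Run 042 temperatures", "sensor-7", "2018-06-01T12:00:00", "/data/run042/results.csv"),
)
conn.commit()

# Find the files you need by querying the metadata, then open them directly
for title, path in conn.execute(
    "SELECT title, file_path FROM datasets WHERE instrument = ?", ("sensor-7",)
):
    print(title, "->", path)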
Relational v noSQL?
• Relational DBs have been around since the 1970s
• tables, relationships, keys, indexes, transactions, SQL
• noSQL DBs are new-ish, “big data” inventions
• “not only” SQL (or sometimes “not even” SQL)
• specialised for certain kinds of data at “3V scale”
• Column stores (fast read, slow write)
• Google BigTable, Facebook/Apache Cassandra
• Key-value stores (simple, fast lookups)
• Amazon DynamoDB, BerkeleyDB
• Document stores (indexing arbitrary content)
• MongoDB, CouchDB
• Graph databases (relationships are primary objects)
• Neo4J, DataStax
NoSQL databases
• Designed for distributed storage with high
horizontal scalability
• suitable for large structured, semi-structured or
unstructured data
• built to address particular use-cases at scale
• by Google, Amazon, Facebook, LinkedIn
• No schemas are required (a.k.a. schema-less)
• gives flexibility in storing documents with different content
• No transactions
• No join operations
• you give up things like these to achieve that horizontal scalability
But: SQL databases are not dead
• Unless you are at Amazon/Google/Facebook/SKA
scale you probably don’t need a noSQL database
• Relational databases remain very powerful, very
effective tools for managing data
• and they all understand SQL!
• If you’re exploring social networks…
• … you might use a graph database
• If you have unpredictably-structured data…
• … you might use a document store (see the sketch below)
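A minimal sketch of the document-store idea using pymongo (the connection URL, database and collection names are assumptions for the example):

from pymongo import MongoClient

# Connect to a local MongoDB instance (URL assumed for the example)
client = MongoClient("mongodb://localhost:27017")
observations = client["research"]["observations"]

# Schema-less: documents in the same collection can have different fields
observations.insert_one({"site": "Edinburgh", "temperature_C": 14.2, "sensor": "sensor-7"})
observations.insert_one({"site": "Boulder", "wind_speed_ms": 6.1, "notes": "gusty afternoon"})

# Query by whatever fields a document happens to have
for doc in observations.find({"temperature_C": {"$gt": 10}}):
    print(doc["site"], doc["temperature_C"])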
A model for data analysis: CRISP-DM
• Cross Industry Standard
Process for Data Mining
• Invented 1996-9 by SPSS
(now IBM), Teradata,
Daimler AG, NCR
Corporation and OHRA
• C. Shearer, “The CRISP-
DM model: the new
blueprint for data mining”
• Journal of Data Warehousing,
Vol. 5 (4), 2000
(~80% of the effort sits in the data understanding and preparation phases)
Data Science skills diagram
Fig 1-1 from O’Neil and Schutt, “Doing Data Science”, based on Drew Conway’s Venn Diagram of Data Science
[Venn diagram: Domain Expertise and Machine Learning overlap to give Data Science]
The Danger Zone!
• People who know enough to be dangerous
• Capable of extracting and structuring data
• Know quite a bit about the field and can even run a linear regression
BUT…
• Lack understanding of what the regression coefficients mean
SO …
• Have ability to create what appears to be legitimate analysis without understanding how they got there or what they have created!
Python and R (and more)
Notebooks
Jupyter
Zeppelin
Languages
Python
R
SQL
Javascript
NodeJS
Libraries
SciPy
Pandas
Scikit-learn
GPText
OpenNLP
Mahout
+many others
Visualisation
D3.js
matplotlib
Seaborn
R
shiny
Leaflet
PowerBI
ggplot2
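As a small illustration of how a few of these pieces fit together, here is a minimal sketch using pandas, scikit-learn and matplotlib on synthetic data (everything in it is invented for the example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data purely for illustration
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
df["y"] = 2.5 * df["x"] + rng.normal(0, 1, 100)

# Fit a simple linear regression with scikit-learn
model = LinearRegression().fit(df[["x"]], df["y"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Visualise the fit with matplotlib
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("fit.png")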
Python Data Science ecosystem
[Diagram: the Python library ecosystem spanning Machine Learning/Statistics, Business Intelligence, Scientific Computing/HPC, Web and Distributed Systems]
Contents
• Computational model
• Background
• MONC, a community model for atmospheric modelling
• In-situ data analytics
• Sharing cores between computation and analysis
• Hybrid MONC
• Offloading kernels to GPUs on Piz Daint
Dealing with Volume: MONC
• Met Office NERC Cloud model
• “Classic” FORTRAN/MPI simulation
• Scales to 10,000s of cores on ARCHER etc.
• Billions of grid points; terabytes of data
• how can we best analyse the data in a scalable fashion?
• could write to disk and analyse offline…?
Data analytics
• With much larger domains (billions of grid points), how can we best analyse the data in a scalable fashion?
• The previous LEM model did this in line with computation, where the model would stop and calculate diagnostics before continuing with computation
• Could write to disk and analyse offline
[Diagram: Prognostics produce prognostic data; Diagnostics produce diagnostic data]
Dealing with Volume: MONC
• Raw prognostic data is never written out
• would be too time consuming
• Instead, do analytics in situ
• have many computational processes (C) and a number of data analytics cores (D)
• Typically one core per processor is dedicated to IO, serving the other cores running the computational model
• Computational cores “fire and forget” their data (see the sketch below)
• Avoids blocking the computational cores for analytics and IO
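This is not MONC’s actual implementation (MONC itself is FORTRAN/MPI); it is just a minimal sketch of the fire-and-forget pattern using mpi4py, assuming the last MPI rank acts as the dedicated IO/analytics core:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
io_rank = size - 1   # assume the last rank is the dedicated IO/analytics core
steps = 10

if rank == io_rank:
    # IO/analytics core: receive fields from any compute rank and reduce them
    for _ in range((size - 1) * steps):
        field = comm.recv(source=MPI.ANY_SOURCE, tag=0)
        print("received field, mean =", float(np.mean(field)))
else:
    # Compute cores: advance the model and send data without blocking on IO
    field = np.zeros(1000)
    requests = []
    for step in range(steps):
        field += 1.0                                   # stand-in for real computation
        requests.append(comm.isend(field.copy(), dest=io_rank, tag=0))
    for req in requests:
        req.wait()                                     # complete the sends once the work is done

Run with, e.g., mpirun -n 4 python monc_sketch.py (the script name is hypothetical): one rank serves IO while the others compute.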
Dealing with Velocity: International Centre for Earth Data
• ICED consortium
• UoE/EPCC
• CU Boulder
• Orbital Microsystems
• “The most meaningful and timely weather data available”
• 32-40 low Earth orbit CubeSats
• microwave measurements, moisture, temperature
• full global coverage
• improved temporal (30x) resolution
• improved spatial (2x) resolution
• reduced latency, 15min refresh
• Significant incoming data!
• EPCC/CU designing, building analytics pipeline
• processing
• storage
• delivery
https://www.ed.ac.uk/news/2018/
satellite-system-to-improve-weather-forecasts
Dealing with Variety: National Safe Haven for Scotland
• Hosted since 2015
• Designed for secure access to sensitive data
• NHS register
• Government
• Alan Turing Institute
• private sector
• The tech is easy! (ish)
• Employs MongoDB & others to manage variety & volume in DICOM files
• Key is information governance process that sits above the tech
• Caldicott Guardianship framework
Bringing it All Together:
the Edinburgh & South-East Scotland City Region Deal
• £800M investment programme from UK & Scottish Gov into the SE Scotland region, from Fife to the Borders
• signed August 2018
• Data Driven Innovation Programme
• strong focus on talent & skills
• underpinning World-Class Data Infrastructure
Wrap
• Digital instruments & IoT are opening up all sorts of
interesting research avenues
• managing the data is becoming more of a specialised job
• Take a structured approach to management & analysis
• data management planning: it’s not just the law – it’s a good idea!
• Think about how your data will be used
• design storage around access patterns & queries
• Hire the right people
• data engineers, data scientists, research software engineers…
• Leverage open source tools
• for management, engineering and data science
Re-use
• © 2018 The University of Edinburgh
• You are free to reuse this presentation and its contents under the terms of CC-BY-4.0
• This presentation was originally created by Rob Baxter, EPCC, The University of Edinburgh