parallel python tools for handling big climate data · 2018-10-31 · parallel python tools for...

19
air planet people Parallel Python Tools for Handling Big Climate Data Sheri Mickelson Kevin Paul 2018 Workshop on Developing Python Frameworks for Earth System Sciences October 30, 2018 NCAR/CISL/TDD

Upload: others

Post on 08-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

air • planet • people

Parallel Python Tools for Handling Big Climate Data

Sheri MickelsonKevin Paul

2018 Workshop on Developing Python Frameworks for Earth System SciencesOctober 30, 2018

NCAR/CISL/TDD

Page 2: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

CESM’s CMIP5 Workflow

2

Model Run

Publication

Post-Processing

CESM Model Run

Time Series Conversion

(NCO)

CMOR

Diagnostics(NCO/NCL)

Push to ESGF

Page 3: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Lessons We Learned From CMIP5

3

CESM was the first model to complete their simulations, but the last to complete publication.

Why?• All of the post-processing was serial and it took a long time to run• Workflow was error prone and was time consuming to debug• Too much human intervention was needed between post-processing

steps and time was wasted• There was only one person who knew the status of all of the

experiments

Page 4: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Motivating FactorCMIP5 vs CMIP6

4

CMIP5

• 25 Experiments• Timeline: 3 years• Output size: 800TB• Published size: 200TB

CMIP6

• 102 Experiments• Timeline: 1 year• Output size: 8PB (estimate)

• Published size: 2PB (estimate)

http://www.bbc.com/earth/story/20170510-terrifying-20m-tall-rogue-waves-are-actually-real

Page 5: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

New CESM/CMIP6 Workflow

5

Model Run

Publication

Post-Processing

CESM Model Run

Time Series Conversion

(PyReshaper)

New Data Compliance

Tool (PyConform)

Re-Designed Diagnostics

(PyAverager)

Push to ESGF

Aut

omat

ed W

orkf

low

Usi

ng C

ylc

ExperimentsUpdate

Their Status in Run Database

The focus of this talk

Page 6: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Data Compliance

6

Taking model output and processing it into experiment compliant data

Page 7: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Previous Version Used for CMIP5

7

• Used Fortran and CMOR to read in the raw output, do all conversions, and write out complaint files

• Serial, no parallelization

SlowRigid and Hard to Expand

Error Prone

Page 8: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

PyConform: New Version Used for CMIP6

8

• Uses PythonnetCDF4, numpy, dreqPy, cf_units, pyNGL

• Parallelization done with MPI4Py

• A three step process

Faster Flexible User Interface

(16x to 38x speedup over Fortran method)

Page 9: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

First Step

9

Users need to create a text file with definitions that describe how to map model variables to requested variables

Examples:cfc11global=f11vmrch4=vinth2p(CH4, hyam, hybm, plev, PS, P0)mc=CMFMC+CMFMCDZMsiage=siage

Page 10: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Second Step

10

Then users run the iconform tool that matches the definitions to its variable information within the CMIP6 Data Request

The Data Request lists variable requirements:• Units• Dimensions• Descriptions• Positive Attribute on Vertical Dimensions• And a lot more …

Page 11: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Sample Portion of a PyConform Input File

11

"ua": {"attributes": {

"_FillValue": "1e+20",

"cell_measures": "area: areacella",

"cell_methods": "time: mean",

"comment": "\"Eastward\" indicates a vector component which is positive when directed eastward (negative westward). Wind is defined as a two-dimensional (horizontal) air velocity vector, with no vertical component. (Vertical motion in the atmosphere has the standard name upward_air_velocity.)",

"description": "\"Eastward\" indicates a vector component which is positive when directed eastward (negative westward). Wind is defined as a two-dimensional (horizontal) air velocity vector, with no vertical component. (Vertical motion in the atmosphere has the standard name upward_air_velocity.)",

"frequency": "mon",

"id": "ua",

"long_name": "Eastward Wind",

"mipTable": "Amon",

"out_name": "ua",

"prov": "Amon ((isd.003))",

"realm": "atmos",

"standard_name": "eastward_wind",

"time": "time",

"time_label": "time-mean",

"time_title": "Temporal mean",

"title": "Eastward Wind",

"type": "real",

"units": "m s-1",

"variable_id": "ua"},"datatype": "real",

"definition": "vinth2p(U,hyam,hybm,plev, PS,P0)",

Page 12: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Third Step

12

Then users run the xconform tool that generates requested variables based on the input specifications

“x = X1 + X2”Read:X1[i]

Read:X2[i]

Evaluate:(X1+X2)[i]

Map:iàj

Validate:> minimum< maximum

dimensions = [j]etcetera

Write:x[j] File

“y = X1 - X2”

Read:X1[i]

Read:X2[i]

Evaluate:(X1-X2)[i]

Map:iàj

Validate:> minimum< maximum

dimensions = [j]etcetera

Write:y[j] File

Page 13: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

13

• Is a subclass of the maskedArray in NumPy

• Additional features that were needed above the masked array class:

- Automatic Unit Conversion- Automatic Dimension Handling- Automatic Handling of the Positive Attribute

Page 14: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

14

“X = X1 + X2”Units:

K + Units: K

“X” = Units: K

“X = X1 + X2”Units:

K + Units: C

“X” =

Units: K

Convert to K before operation is performed

“X = X1 * X2”Units:

kg * Units: m-2

“X” = Units: kg m-2

* Must be cf compliant units

Units: K + Units:

K

Units: K =

Units: K = =Units:

kg m-2

Extra features we needed to generate the data correctly:Automatic Unit Handling

Page 15: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

15

“X = X1 + X2”

+

“X” =

“X = X1 + X2”

Switch the dimensions before operation is performed

Dims: TimeLatLonLev

=

Dims: TimeLatLonLev

Dims: TimeLatLonLev

Dims: TimeLatLonLev

+

“X” =

Dims: TimeLatLonLev

=

Dims: TimeLatLonLev

Dims: LevLatLon

Time

Dims: TimeLatLonLev

Extra features we needed to generate the data correctly:Automatic Dimension Handling

Dims: TimeLatLonLev

“X = X1 + X2”

+=Dims: TimeLatLon

Dims: TimeLatLon

Will give an error

Dims: TimeLatLonLev

Dims: TimeLatLonLev

+

Page 16: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Physarray Object

16

“X = X1 + X2”Pos: up + Pos:

up

“X” = Pos: up

“X = X1 + X2”Pos: up + Pos:

down

“X” =

Pos: up

Convert before operation is performed

Pos: up + Pos:

up

Pos: up = Pos:

up =

Extra features we needed to generate the data correctly:Automatic Handling of the Positive Attribute

(flipping the vertical dimension)

Page 17: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Switching to Xarray/Dask

17

• We are working on a new version that uses xarray

• While we no longer need the ability to handle dimension reordering, we still need functionality to handle the unit conversion and the flipping of the vertical dimension

• We will also need to evaluate the performance

Page 18: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

Parallel Python Tools for Handling Big Climate Data air • planet • people

Moving Forward ….

18

• We are currently using PyConform in its current form for our CMIP6 output

• We are looking at a redesign of the internal data structures to use new capabilities that didn’t exist when we started the project

• Performance and usability are key for this tool and we will move in those directions

Page 19: Parallel Python Tools for Handling Big Climate Data · 2018-10-31 · Parallel Python Tools for Handling Big Climate Data air •planet •people MotivatingFactor CMIP5 vs CMIP6 4

air • planet • people

Questions

Contact: mickelso .at. ucar.eduhttps://github.com/NCAR/PyConform

Big Data

http://m.sweetclipart.com/halloween-pumpkin-pail-with-candy/