
A common and sustainable big data infrastructure in support of weather prediction research and education in universities

Big Weather Web Nuclei

1. Large ensemble distributed over 7 universities: Gretchen Mullendore (UND), Brian Ancell (Texas Tech), William Capehart (SDSM), Clark Evans (UW Milwaukee), Robert Fovell (U Albany), Steven Greybush (Penn State), Russ Schumacher (CSU).

2. Common storage, linking, and cataloging methodology
○ Permanent naming and high availability of data and experiments
○ Connecting data, platform, tools, analysis

3. Software container technologies for easy deployment and reproducibility
○ Self-contained: software can be instantly deployed in common environments
○ Naming and versioning: compact reference mechanisms for complex environments
○ Good for reproducibility and education

● Education integration:
○ Gretchen Mullendore (UND): Numerical Weather Prediction Modules for Introductory and Advanced Undergraduate Classes

● Research integration:
○ Brian Ancell (Texas Tech): Using Large Ensembles to Determine the Adaptive Nature of Probabilistic Weather Prediction
○ William Capehart (SDSM): Application of a statistical confidence index to regional-scale ensembles
○ Clark Evans (UW Milwaukee): Investigating the Predictability of Mesoscale Convective Systems
○ Robert Fovell (U Albany): Parameterization Testing in a Distributed Ensemble: Improving Model Development in the Research Community
○ Russ Schumacher (CSU): Synoptic analysis and probabilistic post-processing with a distributed ensemble

Project Structure

mypaper-repo
| README.md
| .git/
| .popper.yml
| experiments
| |-- gassyfs
| | |-- README.md
| | |-- ansible/
| | | |-- setup.yml
| | | |-- vars.yml
| | |-- datasets/
| | | |-- input-data.csv
| | |-- results/
| | | |-- figure.png
| | | |-- postprocess.py
| | | |-- output.csv
| | |-- run.sh
| | `-- validate.sh
| paper
| |-- build.sh
| |-- figures/
| |-- paper.tex
| `-- references.bib
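The run.sh and validate.sh scripts are the experiment's entry points. A minimal sketch of what the gassyfs run.sh might contain, inferred only from the files listed above (the poster does not show the actual script, so every command here is an assumption):

    #!/bin/sh
    # Sketch: provision nodes with the playbook shipped in the experiment,
    # then regenerate the committed results and check them.
    set -e
    ansible-playbook ansible/setup.yml -e @ansible/vars.yml   # set up test nodes
    # ... run the gassyfs workload here, writing results/output.csv ...
    python results/postprocess.py                             # rebuild figure.png
    ./validate.sh                                             # test codified assertions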

Outreach so far

● 2015 Unidata Users Meeting
● 2015 AGU Town Hall
○ ~50 attendees
○ 10 new bww-users subscribers
● 2016 Presentation at AMS
● 2016 Unidata Workshops
● WRF in a box in the classroom:
○ 2016 UND class by Tim See
○ 2017 UND class by Gretchen Mullendore
● 2017 Presentations at AMS
○ Kevin Tyle (Albany) on BWW workflows
○ Tim See (UND) on BWW in class
○ William Capehart (SDSM) using BWW ensembles (2 papers)
● Popper:
○ Jimenez et al. VarSys'16, Chicago, IL
○ Jimenez et al. USENIX ;login:, Winter '16
○ Guest lecture in 2017 UND class by Gretchen Mullendore

General Approach

● Establish "nuclei": pieces of technology that
○ Are easily shareable
○ Have the ability to grow & improve over time
○ Ensure "buy-in" from researchers and students
● Examples:
○ Wikipedia
○ Linux kernel
● Infrastructures to enable community-driven review and improvement

● Docker offers open-source container software for packaging applications
● A container includes a minimal OS and all dependencies needed for an application
● WRF, processing, and post-processing precompiled binaries run under any Docker engine (see the usage sketch below)
○ Time for a new virtual machine (VM) to instantiate: 1 minute
○ Full WRF output and graphics completed (MacBook Pro, 2 CPUs): 5 minutes 30 seconds
● Allows scientific reproducibility and application portability from laptops to servers to clouds (AWS, Google, Azure)
○ Reproducible results across many platforms
○ Control over compiler uncertainty
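To illustrate the "WRF in a box" deployment model, a hedged sketch of pulling and running such a container; the image name, tag, and mount paths below are hypothetical, not the project's published ones:

    $ docker pull example/wrf-in-a-box:latest       # precompiled WRF + pre/post-processing
    $ docker run --rm \
        -v "$PWD/namelists:/wrf/run/namelists" \    # case configuration from the host
        -v "$PWD/output:/wrf/run/output" \          # model output and graphics land here
        example/wrf-in-a-box:latest

Because everything down to the compiler lives inside the image, the same run behaves identically on a laptop, a departmental server, or a cloud VM.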

Hacker, J., J. Exby, N. Chartier, D. Gill, I. Jimenez, C. Maltzahn, and G. Mullendore: Collaborative Research and Education with Numerical Weather Prediction Enabled by Software Containers. American Meteorological Society, 32nd Conference on Environmental Information Processing Technologies, Jan. 2016.

WRF in a box: Build, Deploy, Run

Problem
Informed by EarthCube Users workshops
● Poor reproducibility of data-intensive science
○ Impact on education and research
● Impaired availability of intermediate results
○ Unnecessary duplication of work, steep learning curves
● Communities of practice are falling behind
○ Limited ability to adopt new technologies

Domain: Numerical Weather Prediction

● NWP groups at universities use supercomputing time to create large ensembles
● Current practice:
○ Keep ensembles in scratch space or download to local infrastructure
○ Don't share ensemble data, don't share tools
○ Rewards for results, not data
● Testing the distributed ensemble framework and tools
● Sharing of "knowledge products"
○ Initialization methods
○ Physics options
○ Workflow scripts for producing & analyzing data
● Tracking data authorship and community impact
● Dissemination of framework & tools

Web site: http://bigweatherweb.org
Public email list: [email protected]
Contact: [email protected]

Funded by NSF Award #1450488.

Next
● Do science with the ensemble output
● Create infrastructure on TACC Wrangler
● Investigate cloud commons governance models

Popper: Practical Reproducible Evaluation of Systems

Reproducibility as a DevOps Problem
● Independently validating experimental results is challenging.
● Recreating an experimental setup is often difficult to impossible.
● Software engineers deal with reproducibility all the time:
○ Bug A can be reproduced in version X on platform Y using input Z.
● Shared (cloud) computing and storage services are readily available.
● Manage an academic article as a software project!

[Figure: generic experimentation workflow — Code Package → Execute (with Input Data) → Data and Metrics → Analyze/Visualize → Manuscript]

Our Approach

Popper: Take a common generic experimentation workflow (above) and apply a DevOps practice used in the development of open source software (OSS) projects (below).

The Convention:
1. Pick a DevOps tool for each stage of the scientific experimentation workflow.
2. Put all associated scripts (experiment and manuscript) in version control, in order to provide a self-contained repository.
3. Document changes as the experiment evolves, in the form of version control commits (see the sketch below).
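Steps 2 and 3 map directly onto everyday version-control usage. A minimal sketch, assuming the mypaper-repo layout shown under Project Structure (the commit messages and file change are illustrative, not from the poster):

    $ git init mypaper-repo && cd mypaper-repo
    $ git add .popper.yml experiments/ paper/
    $ git commit -m "Add gassyfs experiment and paper skeleton"
    # ... later, after editing a parameter in experiments/gassyfs/ansible/vars.yml ...
    $ git commit -am "gassyfs: increase node count"   # the history documents the experiment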

Popper Compliance

Tools: Generate referenceable assets (associate unique IDs with binaries, data, configuration, and infrastructure state); usable from scripts/CLI and capable of acting upon IDs.

Experiment: Provide all necessary assets in a single repository (self-contained), including experiment code, orchestration logic, data dependencies, results, and validation criteria.

Article: Provide the full text and figures of the article, as well as all Popper-compliant experiments.


Codified Validations:
WHEN NOT network_saturated AND num_nodes=*
EXPECT system_throughput >= (baseline_throughput * 0.9)
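Such an assertion ultimately gets tested against the experiment's output. One way a validate.sh could encode the throughput check above, assuming results/output.csv has columns num_nodes, network_saturated, system_throughput and a known baseline (all of these are assumptions; the poster does not show the actual script):

    #!/bin/sh
    # Sketch: WHEN NOT network_saturated AND num_nodes=*
    #         EXPECT system_throughput >= baseline_throughput * 0.9
    baseline_throughput=100.0   # assumed; would come from a recorded baseline run
    awk -F, -v base="$baseline_throughput" '
      NR > 1 && $2 == "false" {                 # rows where the network is not saturated
        if ($3 + 0 < base * 0.9) { print "FAIL: num_nodes=" $1; bad = 1 }
      }
      END { exit bad }
    ' results/output.csv && echo "PASS"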

PopperCI
1. Commit change to experiment.
2. Trigger execution.
3. Run multi-node experiment on one of the supported backends.
4. Experiment generates output datasets or runtime metrics.
5. Validate experiment results by testing codified assertions on output.
6. Keep track of the execution and associate its status with the corresponding commit (see the sketch below).
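In effect, PopperCI automates the following loop for every commit. A rough shell approximation using the run.sh/validate.sh entry points from the repository layout above (the log file name is made up for illustration):

    commit=$(git rev-parse HEAD)        # steps 1-2: a new commit triggers a run
    ( cd experiments/gassyfs &&
      ./run.sh &&                       # steps 3-4: execute, produce datasets/metrics
      ./validate.sh )                   # step 5: test the codified assertions
    status=$?
    echo "$commit $([ "$status" -eq 0 ] && echo pass || echo fail)" >> ci-status.log   # step 6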

Benefits and Challenges

Pros:
• Experiments are falsifiable with minimal re-execution effort.
• Facilitates collaboration by following the OSS model for sharing.
• Investing time in DevOps skills quickly pays off.
• The convention complements many existing efforts.

Challenges:
• Steep learning curve of DevOps practices and tools/frameworks.
• Big cultural change; a new experimentation paradigm.

Bootstrapping a Popper Project

$ cd mypaper-repo
$ popper init
-- Initialized Popper repo mypaper-repo

$ popper experiment list
-- available templates ---------------
ceph-rados   proteustm  mpi-comm    adam    sirius    comd-openmp
cloverleaf   gassyfs    zlog        bww     unum-py   cuddn-deeplrn
spark-stand  torpor     malacology  genevo  mantle    rita-idx
hadoop-yarn  kubsched   alg-encycl  macrob  dadvisor  obfuscdata

$ popper add gassyfs
-- Added gassyfs experiment to mypaper-repo

Communities adopting Popper
• Numerical Weather Prediction (Big Weather Web).
• Computer Systems Research (UCSC; UW Madison).
• High Performance Computing (Sandia, LLNL, LANL).
• Games and Playable Media (UCSC).
• Genomics (UCSC).