![Page 1: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/1.jpg)
VTK-m: Uniting GPU Acceleration Successes
Robert Maynard
Kitware Inc.
![Page 2: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/2.jpg)
VTK-m Project
• Supercomputer Hardware Advances Every Day
– More and more parallelism
• High-Level Parallelism
– “The Free Lunch Is Over” (Herb Sutter)
![Page 3: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/3.jpg)
VTK-m Project Goals
• A single place for the visualization community to collaborate, contribute, and leverage massively threaded algorithms.
• Reduce the challenges of writing highly concurrent algorithms by using data parallel algorithms
![Page 4: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/4.jpg)
VTK-m Project Goals
• Make it easier for simulation codes to take advantage of these parallel visualization and analysis tasks on a wide range of current and next-generation hardware.
![Page 5: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/5.jpg)
VTK-m Project
• Combines the strengths of multiple projects:
– EAVL, Oak Ridge National Laboratory
– Dax, Sandia National Laboratories
– PISTON, Los Alamos National Laboratory
![Page 6: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/6.jpg)
VTK-m Architecture
• In-Situ
• Post Processing
• Filters
• Worklets
• DataModel
• Execution
• Data Parallel Algorithms
• Arrays
![Page 7: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/7.jpg)
![Page 8: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/8.jpg)
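Gaps in Current Data Models
Point arrangement (coordinates, columns) versus cell arrangement (rows):

| Cells | | Explicit | Logical | Implicit |
|---|---|---|---|---|
| Structured | Strided | Structured Grid | ? | n/a |
| Structured | Separated | ? | Rectilinear Grid | Image Data |
| Unstructured | Strided | Unstructured Grid | ? | ? |
| Unstructured | Separated | ? | ? | ? |
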
• Traditional data set models target only common combinations of cell and point arrangements
• This limits their expressiveness and flexibility
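One way to read the three coordinate columns on this slide is as three storage strategies for point positions. The sketch below is illustrative only: the type names `ImplicitCoords`, `LogicalCoords`, `ExplicitStridedCoords` and the helper `expand` are hypothetical and are not the API of VTK, EAVL, or VTK-m. It assumes that "Implicit" means positions computed from an origin and spacing (the Image Data case), "Logical" means one array per axis (the Rectilinear Grid case), and "Explicit, Strided" means interleaved xyz triples per point (the Structured Grid case).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;

// Implicit coordinates: nothing is stored per point; positions are computed.
struct ImplicitCoords {
  Point origin;
  Point spacing;
  Point at(std::size_t i, std::size_t j, std::size_t k) const {
    return {origin[0] + spacing[0] * i,
            origin[1] + spacing[1] * j,
            origin[2] + spacing[2] * k};
  }
};

// Logical coordinates, separated storage: one array per axis,
// nx + ny + nz stored values in total.
struct LogicalCoords {
  std::vector<double> x, y, z;
  Point at(std::size_t i, std::size_t j, std::size_t k) const {
    return {x[i], y[j], z[k]};
  }
};

// Explicit coordinates, strided storage: interleaved x0 y0 z0 x1 y1 z1 ...,
// 3 * nx * ny * nz stored values in total.
struct ExplicitStridedCoords {
  std::size_t nx, ny, nz;
  std::vector<double> xyz;
  Point at(std::size_t i, std::size_t j, std::size_t k) const {
    const std::size_t p = (k * ny + j) * nx + i;  // x index varies fastest
    return {xyz[3 * p], xyz[3 * p + 1], xyz[3 * p + 2]};
  }
};

// Expands per-axis arrays into explicit triples. This is the kind of
// conversion a client needs when the data model lacks a native match.
ExplicitStridedCoords expand(const LogicalCoords& c) {
  ExplicitStridedCoords e{c.x.size(), c.y.size(), c.z.size(), {}};
  e.xyz.reserve(3 * e.nx * e.ny * e.nz);
  for (std::size_t k = 0; k < e.nz; ++k)
    for (std::size_t j = 0; j < e.ny; ++j)
      for (std::size_t i = 0; i < e.nx; ++i) {
        e.xyz.push_back(c.x[i]);
        e.xyz.push_back(c.y[j]);
        e.xyz.push_back(c.z[k]);
      }
  return e;
}
```

All three types answer the same query, `at(i, j, k)`, but their storage cost differs by orders of magnitude on large grids, which is why forcing data into the nearest supported combination can be expensive.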
![Page 9: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/9.jpg)
Arbitrary Compositions for Flexibility
[Table: the same grid of cell arrangement (Structured / Unstructured, each Strided / Separated) by point arrangement (Explicit / Logical / Implicit coordinates), labeled "EAVL Data Set"]
• EAVL allows clients to construct data sets from cell and point arrangements that exactly match their original data
– In effect, this allows for hybrid and novel mesh types
• Native data results in greater accuracy and efficiency
![Page 10: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/10.jpg)
Other Data Model Gaps Addressed in EAVL
• Low/high dimensional data (9D mesh in GenASiS)
• Multiple simultaneous coordinate systems (lat/lon + Cartesian xyz)
• Multiple cell groups in one mesh (e.g. subsets, face sets, flux surfaces)
• Non-physical data (graph, sensor, performance data)
• Mixed topology meshes (atoms + bonds, sidesets)
• Novel and hybrid mesh types (quadtree grid from MADNESS)
![Page 11: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/11.jpg)
Memory Efficiency in EAVL
• Data model designed for memory efficient representations
– Lower memory usage for same mesh relative to traditional data models
– Less data movement for common transformations leads to faster operation
• Example: threshold data selection
– 7x memory usage reduction
– 5x performance improvement
[Figure: Memory Usage, VTK vs. EAVL. Bytes per grid cell (axis ticks 1 to 128) for Original Data, Threshold (a), Threshold (b), Threshold (c)]
[Figure: Total Runtime, VTK vs. EAVL. Runtime in msec (axis ticks 1 to 16) against cells remaining]
[Figure label: 35 < Density < 45]
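The threshold example can be expressed with standard data-parallel primitives. The sketch below is a generic stream compaction (map, scan, scatter) written serially in standard C++; it is not EAVL's implementation, and the function name `thresholdCells` is hypothetical. It returns the indices of the surviving cells instead of copying them, which is one way to keep data movement low.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns indices of cells whose density lies strictly between lo and hi.
std::vector<std::size_t> thresholdCells(const std::vector<double>& density,
                                        double lo, double hi) {
  const std::size_t n = density.size();

  // Map: one independent pass/fail flag per cell.
  std::vector<std::size_t> flag(n);
  for (std::size_t i = 0; i < n; ++i)
    flag[i] = (density[i] > lo && density[i] < hi) ? 1 : 0;

  // Scan: an inclusive prefix sum counts survivors up to and including i.
  std::vector<std::size_t> inclusive(n);
  std::partial_sum(flag.begin(), flag.end(), inclusive.begin());

  // Scatter: each surviving cell writes its index to its own output slot,
  // so no two cells write to the same location.
  const std::size_t kept = (n == 0) ? 0 : inclusive[n - 1];
  std::vector<std::size_t> out(kept);
  for (std::size_t i = 0; i < n; ++i)
    if (flag[i]) out[inclusive[i] - 1] = i;
  return out;
}
```

Calling `thresholdCells(density, 35.0, 45.0)` corresponds to the 35 < Density < 45 selection. The map and scatter loops have no dependencies between iterations, and prefix sums have well-known parallel formulations, which is what makes this pattern portable to many-core devices.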
![Page 12: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/12.jpg)
![Page 13: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/13.jpg)
Dax: Data Analysis Toolkit for Extreme Scale
Kenneth Moreland Sandia National Laboratories
Robert Maynard Kitware, Inc.
![Page 14: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/14.jpg)
Dax Framework
• Control Environment (dax::cont): Grid Topology, Array Handle, Invoke
• Execution Environment (dax::exec): Cell Operations, Field Operations, Basic Math, Make Cells, Worklet
• Device Adapter: Allocate, Transfer, Schedule, Sort, …
![Page 15: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/15.jpg)
Execution Environment

    // Worklet: runs once per field value in the execution environment.
    struct Sine : public dax::exec::WorkletMapField
    {
      // Arguments passed to Invoke: one input field, one output field.
      typedef void ControlSignature(FieldIn, FieldOut);
      // operator() receives argument 1; its return value fills argument 2.
      typedef _2 ExecutionSignature(_1);

      DAX_EXEC_EXPORT
      dax::Scalar operator()(dax::Scalar v) const
      {
        return dax::math::Sin(v);
      }
    };

Control Environment

    dax::cont::ArrayHandle<dax::Scalar> inputHandle =
      dax::cont::make_ArrayHandle(input);
    dax::cont::ArrayHandle<dax::Scalar> sineResult;
    dax::cont::DispatcherMapField<Sine> dispatcher;
    dispatcher.Invoke(inputHandle, sineResult);
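To show the control/execution split without depending on Dax itself, here is a minimal stand-in in plain C++. The names `SineWorklet` and `MapFieldDispatcher` are hypothetical and only mimic the shape of the slide's code; the real Dax dispatcher also handles device selection, memory transfer, and signature matching, none of which is modeled here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Execution side: a stateless functor applied to one value at a time.
struct SineWorklet {
  double operator()(double v) const { return std::sin(v); }
};

// Control side: owns the scheduling decision. This stand-in uses a serial
// loop; a device back end could run the same worklet across many threads
// because every iteration is independent.
template <typename Worklet>
class MapFieldDispatcher {
public:
  explicit MapFieldDispatcher(Worklet worklet = Worklet()) : worklet_(worklet) {}

  void Invoke(const std::vector<double>& in, std::vector<double>& out) const {
    out.resize(in.size());  // the control side allocates the output
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = worklet_(in[i]);
  }

private:
  Worklet worklet_;
};
```

The worklet author writes only the per-element operation, and everything about where and how it runs stays behind `Invoke`. That separation is the design choice that lets one worklet target several devices.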
![Page 16: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/16.jpg)
Dax Success
• ParaView/VTK
– Zero-copy support for vtkDataArray
– Exposed as a plugin inside ParaView
  • Will fall back to CPU version
![Page 17: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/17.jpg)
Dax Success
• TomViz: an open, general S/TEM visualization tool
– Built on top of ParaView framework
– Operates on large (1024³ and greater) volumes
– Uses Dax for algorithm construction
• Implements streaming, interactive, incremental contouring
– Streams indexed sub-grids to threaded contouring algorithms
![Page 18: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/18.jpg)
![Page 19: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/19.jpg)
PISTON
• Focuses on developing data-parallel algorithms that are portable across multi-core and many-core architectures for use by LCF codes of interest
• Algorithms are integrated into LCF codes in-situ either directly or through integration with ParaView Catalyst
[Figure: PISTON isosurface with curvilinear coordinates]
[Figure: Ocean temperature isosurface generated across four GPUs using distributed PISTON]
[Figure: PISTON integration with VTK and ParaView]
![Page 20: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/20.jpg)
Distributed Parallel Halo Finder
• Particles are distributed among processors according to a decomposition of the physical space
• Overload zones (where particles are assigned to two processors) are defined such that every halo will be fully contained within at least one processor
• Each processor finds halos within its domain: drop in PISTON multi-/many-core accelerated algorithms
• At the end, the parallel halo finder performs a merge step to handle “mixed” halos (shared between two processors), such that a unique set of halos is reported globally
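The per-processor step, finding halos within one domain, can be sketched as a friends-of-friends grouping with a union-find structure. This is a generic serial illustration, not PISTON's algorithm: it uses a brute-force pair search, and it does not model the overload zones or the final merge step. The names `Particle`, `findRoot`, and `friendsOfFriends` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Particle { double x, y, z; };

// Union-find lookup with path halving.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t i) {
  while (parent[i] != i) {
    parent[i] = parent[parent[i]];
    i = parent[i];
  }
  return i;
}

// Labels each particle with a halo id. Two particles share a halo when a
// chain of neighbors, each within linkingLength of the next, connects them.
// The id is the smallest particle index in the halo.
std::vector<std::size_t> friendsOfFriends(const std::vector<Particle>& p,
                                          double linkingLength) {
  std::vector<std::size_t> parent(p.size());
  std::iota(parent.begin(), parent.end(), std::size_t{0});
  const double limit2 = linkingLength * linkingLength;

  for (std::size_t i = 0; i < p.size(); ++i) {
    for (std::size_t j = i + 1; j < p.size(); ++j) {
      const double dx = p[i].x - p[j].x;
      const double dy = p[i].y - p[j].y;
      const double dz = p[i].z - p[j].z;
      if (dx * dx + dy * dy + dz * dz > limit2) continue;
      const std::size_t a = findRoot(parent, i);
      const std::size_t b = findRoot(parent, j);
      // Attach the larger root to the smaller so ids stay deterministic.
      if (a < b) parent[b] = a;
      else if (b < a) parent[a] = b;
    }
  }

  std::vector<std::size_t> label(p.size());
  for (std::size_t i = 0; i < p.size(); ++i) label[i] = findRoot(parent, i);
  return label;
}
```

In a distributed setting each processor would run a step like this on its own particles plus its overload zone, and halos that touch a zone boundary would then be reconciled in the merge step described above.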
![Page 21: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/21.jpg)
Distributed Parallel Halo Finder
• This test problem has ~90 million particles per process.
• Due to memory constraints on the GPUs, we utilize a hybrid approach, in which the halos are computed on the CPU but the centers on the GPU.
• The PISTON most bound particle (MBP) center finding algorithm requires much less memory than the halo finding algorithm but provides the large majority of the speed-up, since MBP center finding takes much longer than friends-of-friends (FOF) halo finding with the original CPU code.
Performance Improvements
• On Moonlight with 1024³ particles on 128 nodes with 16 processes per node, PISTON on GPUs was 4.9x faster for halo + most bound particle center finding
• On Titan with 1024³ particles on 32 nodes with 1 process per node, PISTON on GPUs was 11x faster for halo + most bound particle center finding
• Implemented grid-based most bound particle center finder using a Poisson solver that performs fewer total computations than the standard O(n²) algorithm
Science Impact
• These performance improvements allowed halo analysis to be performed on a very large 8192³ particle data set across 16,384 nodes on Titan, for which analysis using the existing CPU algorithms was not feasible
Publications
• Submitted to PPoPP15: “Utilizing Many-Core Accelerators for Halo and Center Finding within a Cosmology Simulation”, Christopher Sewell, Li-ta Lo, Katrin Heitmann, Salman Habib, and James Ahrens
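For reference, the standard O(n²) center finder mentioned above can be sketched as a direct pairwise potential sum. This is a textbook formulation, not the PISTON code or the grid-based Poisson variant. Unit particle masses, G = 1, and distinct particle positions are assumptions made for the illustration, and the name `mostBoundParticle` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Particle { double x, y, z; };

// Returns the index of the particle with the lowest gravitational potential,
// phi_i = -sum over j != i of 1 / r_ij (unit masses, G = 1: assumptions).
// Assumes no two particles share a position.
std::size_t mostBoundParticle(const std::vector<Particle>& halo) {
  assert(!halo.empty());
  std::size_t best = 0;
  double bestPotential = std::numeric_limits<double>::infinity();

  for (std::size_t i = 0; i < halo.size(); ++i) {
    double potential = 0.0;
    for (std::size_t j = 0; j < halo.size(); ++j) {
      if (j == i) continue;
      const double dx = halo[i].x - halo[j].x;
      const double dy = halo[i].y - halo[j].y;
      const double dz = halo[i].z - halo[j].z;
      potential -= 1.0 / std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    if (potential < bestPotential) {
      bestPotential = potential;
      best = i;
    }
  }
  return best;
}
```

Each outer iteration is independent of the others, so the loop maps naturally onto one thread per particle, but the total work still grows quadratically with halo size, which is the cost the grid-based approach is meant to reduce.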
![Page 22: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/22.jpg)
Results: Visual comparison of halos
[Figure panels: Original Algorithm; VTK-m Algorithm]
![Page 23: VTK-m: Uniting GPU Acceleration Successes Processing VTK-m Architecture Worklets DataModel Filters • Focuses on developing data-parallel algorithms that are portable across multi-core](https://reader030.vdocuments.site/reader030/viewer/2022041200/5d357e1b88c993ee5c8b9cab/html5/thumbnails/23.jpg)
Questions?