high performance computing in multibody dynamics · 2014-04-06 · 2d convolution [image size: 8192...
HIGH PERFORMANCE COMPUTING IN MULTIBODY DYNAMICS
Dan Negrut
Vilas Associate Professor
Nvidia CUDA Fellow
University of Wisconsin-Madison
People Whose Work/Ideas Shaped This Presentation
• Current and past close collaborators:
  • Radu Serban, University of Wisconsin-Madison
  • Alessandro Tasora, University of Parma, Italy
  • Mihai Anitescu, University of Chicago, Argonne National Lab
• Students from University of Wisconsin-Madison, listed alphabetically:
  • Omkar Deshmukh, Toby Heyn, Hammad Mazhar, Daniel Melanz, Arman Pazouki, Andrew Seidl
2University of Wisconsin
Acknowledgements: Funding Sources
• National Science Foundation
• US Army
• NVIDIA
• Caterpillar
• Simertis GMBH
• MSC.Software
• FunctionBay, S. Korea
About this presentation
• Summary of my understanding of the advanced computing topic
• Title of talk somewhat misleading
• High Performance Computing (HPC) is often associated with big supercomputers
• This talk: what one can do to speed up a simulation in multibody dynamics
• Supercomputers can sometimes be an option
Multibody Dynamics: Commercial Solution
[Q] Why am I interested in this? [A] Large Dynamics problems: e.g., terrain simulation
• How is the Rover moving along on a slope with granular material?
• What wheel geometry is more effective?
• How much power is needed to move it?
• At what grade will it get stuck?
• And so on…
On wigs, ties, and t-shirts
On Computing Speeds: The Price of 1 Mflop/second
• 1961:
  • Combine 17 million IBM-1620 computers
  • At $64K apiece, when adjusted for inflation, this would cost $7 trillion
• 2000:
  • About $1,000
• 2013:
  • Less than 20 cents out of the value of a workstation
More Good News: Chip Feature Length & Consequences
• Moore’s law at work
• 2013 – 22 nm
• 2015 – 14 nm
• 2017 – 10 nm
• 2019 – 7 nm
• 2021 – 5 nm
• 2023 – ??? (carbon nanotubes?)
• More transistors = More computational units
• October 2013:
  • Intel Xeon w/ 12 cores – 3 billion transistors
• Projecting ahead, estimates:
  • 2015 – 24 cores
  • 2017 – 32 cores
  • 2019 – 64 cores
  • 2021 – 124 cores
Frictional Contact Simulation [Commercial Software Simulation - 2007]
• Model Parameters:
  • Spheres: 60 mm diameter and mass 0.882 kg
  • Penalty Approach: stiffness of 1E5, force exponent of 2.2, damping coefficient of 10.0
  • Simulation length: 3 seconds
Frictional Contact Simulation [Commercial Software Simulation - 2013]
• Same problem tested in 2013
• Simulation time reduced by a factor of six
• Simulation times still prohibitively long
• Conclusion: fast computers mean nothing to the bottom line
Should you decide it is time to look for the exit…
• One of the two important points that should come out of this talk is this
• Performing math operations is basically free
• Procuring the operands is very expensive
• In terms of energy
• In terms of time
• Corollary: a program that leverages spatial or temporal locality in data accesses is a fast program
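To make the corollary concrete, here is a minimal C sketch (not from the slides) of two loops that do the same arithmetic on a row-major matrix but differ in spatial locality. The function names are hypothetical. The second version is typically slower for large matrices because consecutive accesses are far apart in memory, though how much slower depends on the machine and the matrix size.

```c
#include <stddef.h>

/* Sum an n x n row-major matrix by walking along rows: consecutive
 * accesses touch adjacent addresses, so each fetched cache line is
 * fully used before moving on. */
double sum_row_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Same sum, walking down columns: consecutive accesses are n doubles
 * apart, so for large n most accesses land on a different cache line. */
double sum_column_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```

Both functions return the same value for the same input; only the order in which the operands are procured changes.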
NVIDIA’s Fermi GPU Architecture: Quick Facts
• Lots of ALUs (green), not much CU (orange)
• Explains why GPUs are fast for high arithmetic intensity applications
  • Arithmetic intensity: high when many operations are performed per word of memory
The Fermi GPU Architecture
• Late 2009, early 2010
• 40 nm technology
• Three billion transistors
• 512 Scalar Processors (SP, “shaders”)
• 64 KB L1 cache
• 768 KB L2 uniform cache (shared by all SMs)
• Up to 6 GB of global memory
• Operates at several clock rates
  • Memory
  • Scheduler
  • Execution unit
• High memory bandwidth
  • Close to 200 GB/s
Fermi: cost of arithmetic vs. cost of memory access
• Fact 1: 32 FMAs (Fused-Multiply-Add) single precision operations in one clock cycle
• Fact 2: One memory request takes 400-600 cycles to service unless it hits L1 or L2 cache
• Conclusions
  1. Hundreds of times more expensive to bring data into SM than to compute something with it
  2. Arithmetic is free, bringing data over is expensive
GPU Computing Requires High Bandwidth
• Required bandwidth:
  • 32 (SP) x 3 (operands) x 4 (bytes) x 1125 MHz x 15 (SMs) = 6,480 GB/s
• Available bandwidth
  • 200 GB/s, roughly 30 times smaller
• Two things save the day
  • Caching
  • High arithmetic intensity algorithms
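The bandwidth estimate above is plain arithmetic, and a small C sketch makes it easy to redo for other hardware. The function and parameter names are illustrative assumptions; the numbers in the usage note are the ones quoted on the slide.

```c
/* Bandwidth needed to feed every scalar processor from memory on every
 * clock cycle, in GB/s (1 GB = 1e9 bytes). */
double required_bandwidth_gbs(int sp_per_sm, int operands_per_op,
                              int bytes_per_operand, double clock_mhz,
                              int num_sms)
{
    double bytes_per_cycle = (double)sp_per_sm * operands_per_op
                           * bytes_per_operand * num_sms;
    return bytes_per_cycle * clock_mhz * 1.0e6 / 1.0e9;
}

/* How many times larger the demand is than what memory can deliver. */
double bandwidth_shortfall(double required_gbs, double available_gbs)
{
    return required_gbs / available_gbs;
}
```

With the slide's values, `required_bandwidth_gbs(32, 3, 4, 1125.0, 15)` gives 6,480 GB/s, and dividing by the available 200 GB/s gives a shortfall factor of about 32.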
My Lamborghini can drive at 250 mph [I drive it to get groceries]
Bandwidth in a CPU-GPU System
[Robert Strzodka, Max Planck Institute, Germany]
[Figure: bandwidth between the components of a CPU-GPU system; surviving labels: "1-8 GB/s", "GPU"]
NOTE: The width of the black lines is proportional to the bandwidth.
Intel Haswell
• Released in June 2013
• 22 nm technology
• Transistor budget: 1.4 billion
• Tri-gate, 3D transistors
• Typically comes in four cores
• Has an integrated GPU
• Deep pipeline – 16 stages
• Strong Instruction Level Parallelism (ILP) support
• Superscalar
• Supports HTT (hyper-threading technology)
Haswell Layout: “System on Chip” (SoC) Paradigm
• Three clocks:
  • A core’s clock ticks at 2.7 to 3.0 GHz – adjustable up to 3.7-3.9 GHz
  • Graphics processor ticking at 400 MHz – adjustable up to 1.3 GHz
  • Ring bus and the shared L3 cache – a frequency close, but not necessarily identical, to that of the cores
[Intel]
Caches
• Data:
  • L1 – 32 KB per core
  • L2 – 512 KB or 1024 KB per core
  • L3 – 8 MB per CPU
• Instruction:
  • L0 – storage for about 1500 microoperations (uops) per core
  • L1 – 32 KB per core
Running fast on one workstation
• Several options
  • Vectorization using AVX
  • Leverage multiple cores, up to 16 (AMD Opteron 6274)
  • Have four CPUs on a node (64 cores)
  • Intel Xeon Phi (60 cores)
Intel Xeon E5-2690 v2 Ivy Bridge-EP 3.0GHz 25MB L3 Cache 10-Core
Tesla K20X (Kepler architecture)
Intel® Xeon Phi™ Coprocessor 5110P (8GB, 1.053 GHz, 60 core)
How can you drive this HW?
• GPU:
  • CUDA (NVIDIA proprietary software ecosystem, freely distributed)
  • OpenCL (standard supporting parallel computing in hardware-agnostic fashion)
• x86
  • pthreads
  • OpenMP
  • MPI
Reflections on picking the winning horse
• The hardware is complex
• The compiler flags are numerous
• The problems can be formulated in so many ways
• The second point of this talk:
  • The proof of the pudding is in the eating
Next two slides: Intel MKL – bewildering results at times
“In our factory, we make lipstick. In our advertising, we sell hope.”
Charles Revson, cosmetics magnate, Revlon
DGEMM : C = alpha*A*B + beta*C using Intel MKL 11.1
(alpha = 1, beta = 0)
A(2000x200), B(200x1000)
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
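For reference, the DGEMM operation named above can be written as a plain triple loop. This is only a sketch of what the operation computes, assuming row-major storage; it is not how MKL implements it, and the name `dgemm_naive` is hypothetical. A tuned library typically adds blocking for the caches and vectorization on top of this.

```c
#include <stddef.h>

/* Reference DGEMM: C = alpha*A*B + beta*C, all matrices row-major.
 * A is m x k, B is k x n, C is m x n. */
void dgemm_naive(size_t m, size_t n, size_t k,
                 double alpha, const double *A, const double *B,
                 double beta, double *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            /* With beta == 0 the old contents of C are ignored entirely,
             * so uninitialized values or NaNs in C cannot leak through. */
            C[i * n + j] = (beta == 0.0)
                         ? alpha * acc
                         : alpha * acc + beta * C[i * n + j];
        }
    }
}
```

For the sizes on the slide, A(2000x200) times B(200x1000), this loop nest performs 2 x 2000 x 200 x 1000 = 8 x 10^8 floating-point operations.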
DGEMM on Intel Xeon Phi (60 Cores with 512-bit SIMD vector registers)
Intel’s MKL 11.1 – used in native mode
Next slides: Picking the winning horse is not obvious
2D Convolution
for (int yOut = 0; yOut < nHeight; yOut++) {                    // Image Y-dimension
    const int yInTopLeft = yOut;
    for (int xOut = 0; xOut < nWidth; xOut++) {                 // Image X-dimension
        const int xInTopLeft = xOut;
        float sum = 0.f;
        for (int r = 0; r < nFilterWidth; r++) {                // Kernel Y-dimension
            const int idxFtmp = r * nFilterWidth;
            const int yIn = yInTopLeft + r;
            const int idxIntmp = yIn * nInWidth + xInTopLeft;
            for (int c = 0; c < nFilterWidth; c++) {            // Kernel X-dimension
                const int idxF = idxFtmp + c;
                const int idxIn = idxIntmp + c;
                sum += pFilter[idxF]*pInput[idxIn];
            }
        }
        const int idxOut = yOut * nWidth + xOut;
        pOutput[idxOut] = sum;
    }
}
2D Convolution [Image size: 8192 x 8192]
• Intel® Core™ i5-3337U, 3M Cache, 1.90 GHz
  • Dual core chip, supports 256-bit AVX capable of 8 FMA per clock-cycle
  • Middle of the road laptop, single socket configuration
  • Used for AVX acceleration, implementation by Omkar
  • Vectorization with OpenCL built-in float4 data type
• AMD Opteron™ 6274, 2MB, 2.2 GHz
  • Has 16 physical cores
  • Used with OpenMP 4/16/64 Threads
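The slides do not show the OpenMP variant of the convolution, so the following is only a sketch of how the serial loop nest might be parallelized. It assumes the input image is padded so that its width is `width + fw - 1` (the serial code's `nInWidth` is not defined on the slide), and the function name `convolve2d_omp` is hypothetical.

```c
/* 2D convolution over a padded input: the input is
 * (height + fw - 1) x (width + fw - 1), the output is height x width.
 * Output rows are independent, so they are divided among the threads. */
void convolve2d_omp(const float *input, const float *filter, float *output,
                    int width, int height, int fw)
{
    const int in_width = width + fw - 1;   /* assumed padding convention */

    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int r = 0; r < fw; r++)
                for (int c = 0; c < fw; c++)
                    sum += filter[r * fw + c]
                         * input[(y + r) * in_width + (x + c)];
            output[y * width + x] = sum;
        }
    }
}
```

Built with an OpenMP flag such as `-fopenmp` (GCC/Clang), the thread count can be set through the `OMP_NUM_THREADS` environment variable; without the flag the pragma is ignored and the loop runs serially with the same result.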
Vectorization with AVX
• AVX: Advanced Vector Extensions – difficult to program, little compiler optimization
• Examples of AVX intrinsics
• Create and initialize variables that map to AVX registers
__m256 prod __attribute__ ((aligned (32))) = _mm256_set1_ps(0.0f);
• Carry out AVX multiplication using 256 bit wide registers
prod = _mm256_mul_ps(data, kernel);
• Multiplication maps eventually into assembly
c5 f4 59 c0 vmulps ymm0,ymm1,ymm0
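To show the intrinsics above in a complete routine, here is a hedged sketch of a dot product that multiplies eight floats at a time with `_mm256_mul_ps`. It is not the convolution code used for the results in this talk. The function names are hypothetical, and the `target("avx")` attribute and `__builtin_cpu_supports` are GCC/Clang-specific assumptions used so that the file builds without extra flags and falls back to scalar code on machines without AVX.

```c
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_AVX_PATH 1
#endif

/* Scalar fallback, also the reference for what the AVX path computes. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

#ifdef HAVE_X86_AVX_PATH
__attribute__((target("avx")))
static float dot_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();          /* eight partial sums */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);    /* unaligned loads */
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);              /* reduce the eight lanes */
    float sum = 0.0f;
    for (int k = 0; k < 8; k++)
        sum += lanes[k];
    for (; i < n; i++)                         /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
#endif

/* Picks the AVX path when the CPU supports it, scalar code otherwise. */
float dot_product(const float *a, const float *b, size_t n)
{
#ifdef HAVE_X86_AVX_PATH
    if (__builtin_cpu_supports("avx"))
        return dot_avx(a, b, n);
#endif
    return dot_scalar(a, b, n);
}
```

Note that the AVX path adds the products in a different order than the scalar loop, so for general floating-point data the two results can differ in the last bits.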
2D Convolution [Image size: 8192 x 8192]
2D Convolution: Scaling w/ image size [Kernel Size: 8 x 8]
2D Convolution: GPU Results – Tesla K20x, 5GB, 0.73 GHz
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
Reference: one core of Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Next slides: pthreads is not it
Stencil-Type Operation
• Average over the neighbors (grid[ i ± 1 ][ j ± 1 ])
• Hardware Setup: Dual socket, 4-core Intel Xeon CPUs with hyper-threading
  • Shows up as 16 virtual cores
• Software approaches
  • Pthreads – do-it-yourself, low-level, handle barriers, locking, synchronization
  • OpenMP – Single line of code
  • MPI – Middle ground, uses processes that communicate with each other
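As an illustration of the "single line of code" remark about OpenMP, here is a sketch of one stencil sweep parallelized with a single pragma. It is a simplification, not the code benchmarked in the talk: it uses a two-buffer (Jacobi-style) update so that rows can be processed in any order, and the name `stencil_sweep_omp` and the flat row-major layout are assumptions.

```c
/* One sweep of a 5-point averaging stencil on a ydim x xdim row-major
 * grid. Reads from `in` and writes to `out`, so no thread ever reads a
 * value another thread is writing. Boundary cells are copied unchanged. */
void stencil_sweep_omp(const double *in, double *out, int xdim, int ydim)
{
    #pragma omp parallel for
    for (int j = 0; j < ydim; j++) {
        for (int i = 0; i < xdim; i++) {
            const int idx = j * xdim + i;
            if (j == 0 || j == ydim - 1 || i == 0 || i == xdim - 1) {
                out[idx] = in[idx];
            } else {
                out[idx] = (in[idx - xdim] + in[idx - 1] + in[idx]
                          + in[idx + 1] + in[idx + xdim]) / 5.0;
            }
        }
    }
}
```

The pragma is the only OpenMP-specific line; a pthreads version would need explicit thread creation, work partitioning, and a barrier between sweeps.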
for (int t=0; t < timesteps/2; t++) {
    int flag = 1;
    int i, j, k;
    MPI_Status status;
    // Even indexed elements
    for (j=1; j < ydim-1; j++) {
        for (i=flag; i < xdim-1; i+=2) {
            grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
        }
        flag = (flag == 1 ? 2 : 1);
    }
    // Odd indexed elements
    flag = 2;
    for (j=1; j < ydim-1; j++) {
        for (i=flag; i < xdim-1; i+=2) {
            grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
        }
        flag = (flag == 1 ? 2 : 1);
    }
    Barrier();
}
Accelerators
Multicore chips
Many-node configurations
Our Lab’s Cluster
Lab’s Cluster
• More than 1,200 CPU cores
• Mellanox Infiniband Interconnect (QDR), 40Gb/sec
• Memory: about 2.7 TB of RAM
• More than 10 TFlops Double Precision out of x86 hardware
• 60 GPU cards (K40, K30, GTX480) – more than 15 Tflops
• BTW: you can get an account if interested
Rover Simulation
Distributed Computing Dynamics: Is it worth it?
• The cons
  • Communication is a major bottleneck – latencies are high, bandwidth is ok for dynamics
  • Software design is complex
• The pros
  • Access to lots of memory
  • Can put more cores to work
• Conclusion
  • Justifiable if one node is not large enough to store the entire problem
Breaking Up Dynamics for Distributed Computing
• Simulation divided into chunks executed on different cores
• Elements leave one chunk (subdomain) to move to a different one
• Key issues:
  • Dynamic load balancing
  • Establish a dynamic data exchange (DDE) protocol to implement migration at run time
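The slides do not spell out the DDE protocol, so the following is only a sketch of the bookkeeping that migration requires, under a strong simplifying assumption: the domain is cut into equal-width slabs along one axis. Both function names are hypothetical. In an MPI code, the bodies listed as migrants would then be packed and sent to the process that owns their new subdomain.

```c
/* Map an x-coordinate to one of num_subdomains equal-width slabs
 * covering [x_min, x_max). Positions outside the range are clamped
 * to the first or last slab. */
int subdomain_of(double x, double x_min, double x_max, int num_subdomains)
{
    if (x <= x_min) return 0;
    if (x >= x_max) return num_subdomains - 1;
    double width = (x_max - x_min) / num_subdomains;
    int s = (int)((x - x_min) / width);
    return s < num_subdomains ? s : num_subdomains - 1;
}

/* After a time step, find the bodies whose owning subdomain changed.
 * owner[] is updated in place; the indices of migrating bodies are
 * written to migrants[] (capacity n). Returns the number of migrants. */
int find_migrants(const double *x, int *owner, int n,
                  double x_min, double x_max, int num_subdomains,
                  int *migrants)
{
    int count = 0;
    for (int b = 0; b < n; b++) {
        int s = subdomain_of(x[b], x_min, x_max, num_subdomains);
        if (s != owner[b]) {
            owner[b] = s;
            migrants[count++] = b;
        }
    }
    return count;
}
```

Equal-width slabs ignore load balancing entirely; a production scheme would also move the slab boundaries as bodies cluster, which is the first key issue listed above.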
Computation Using Multiple CPUs & MPI
0.5 Million Bodies on 64 Cores [Penalty Approach, MPI-based]
Project Chrono: Using HPC in Multibody Dynamics
• Open source, distributed under BSD-3 license
• 100,000 lines of code
• https://github.com/projectchrono/chrono
Tracked Vehicle Simulation – on GPU
Chrono::Fluid
Chrono::Flex [GPU+CPU]
Additive Manufacturing (3D Printing)
Selective Laser Sintering (SLS) machine
Surface Roughness: Before and After Leveling
Put things in perspective: parallel computing options
• Accelerators (GPU/CUDA, Intel Xeon Phi)
• Up one level: multicore
• Up one more level: multi-node
Wrapping Things Up
• In the midst of a democratization of parallel computing
  • Many alternatives, landscape very fluid
  • Lots of hardware options: GPU, CPU (Intel/AMD), Phi, clusters, FPGAs, etc.
  • Lots of software ecosystems: CUDA, OpenCL, OpenMP, OpenACC, MPI, etc.
• Parallel computing can be a game changer
  • Can pursue new physics or new problems
  • Provides impetus for the development of new algorithms that expose more parallelism
Two pieces of practical advice
• Moving data around is hurting performance badly
  • Fixing this aspect calls for rethinking of implementation and maybe even numerical algorithm
• Fluid landscape, technology changing fast – best solution is a bit of a guessing game
  • The proof of the pudding is in the eating
• “80% of success is showing up” (Woody Allen, mentioned at breakfast by David Stewart)
Looking ahead
• Integration of the CPU and GPU – the “system on a chip” (SoC) paradigm takes over
• Intel
  • Haswell, but no clear strategy yet for CPU-GPU integration
• NVIDIA
  • Maxwell and Denver project – CUDA 6 sees a unified CPU-GPU memory space
• AMD
  • Kaveri chip, “Heterogeneous System Architecture” – pushing OpenCL to leverage the architecture
Thank You.
Simulation Based Engineering Lab
Wisconsin Applied Computing Center
University of Wisconsin-Madison
Next two slides – Intel MKL: What you get is not quite what’s advertised
“The very first law in advertising is to avoid the concrete promise and cultivate the delightfully vague.”
~ Bill Cosby