Advisor: Dr. Aamir Shafi
Co-Advisor: Mr. Ali Sajjad
Member: Dr. Hafiz Farooq
Member: Mr. Tahir Azim
Optimizing N-body Simulations for Multi-core Compute Clusters
Ammar Ahmad Awan
BIT-6
Presentation Outline
• Introduction
• Design & Implementation
• Performance Evaluation
• Conclusions and Future Work
Introduction
• Sea change in the basic computer architecture:
  – Power Consumption
  – Heat Dissipation
• Emergence of multiple energy-efficient processing cores instead of a single power-hungry core
• Moore’s law will now be realized by increasing core count instead of increasing clock speeds
• Impact on software applications:
  – Change of focus from Instruction Level Parallelism (higher clock frequency) to Thread Level Parallelism (increasing core count)
• Huge impact on the High Performance Computing (HPC) community:
  – 70% of the TOP500 supercomputers are based on multi-core processors
Source: Google Images
Source: www.intel.com
SMP vs Multicore
[Figure: two block diagrams side by side. Symmetric Multi-Processor: Main Memory and two Single Core Processors, each containing Core 1, Cache and MMU. Multi-core Processor: Main Memory and a Dual Core Processor containing Core 1, Core 2, Cache and MMU.]
HPC and Multi-core
• Message Passing Interface (MPI) is the de facto standard for programming today’s supercomputers
  – Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC)
• With the existing approaches, it is possible to run MPI on multi-core processors in two ways:
  – One MPI process per core: we call it the “Pure MPI” approach
  – OpenMP threads inside an MPI process: we call it the “MPI+threads” approach
• We expect the “MPI+threads” approach to perform well because:
  – Communication cost for threads is lower than for processes
  – Threads are light-weight
• We have evaluated this hypothesis by comparing both approaches
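The difference between the two approaches can be sketched as a job layout. This is only an illustration: the `job_layout` type and both function names are hypothetical, and placing exactly one MPI process on each node in the “MPI+threads” case is an assumption, since the slide only says that threads run inside an MPI process.

```c
/* Hypothetical descriptor of how a job is laid out on a cluster. */
typedef struct {
    int mpi_processes;        /* total number of MPI processes in the job */
    int threads_per_process;  /* worker threads inside each MPI process */
} job_layout;

/* "Pure MPI": one single-threaded MPI process per core. */
job_layout pure_mpi_layout(int nodes, int cores_per_node) {
    job_layout l = { nodes * cores_per_node, 1 };
    return l;
}

/* "MPI+threads": assumed here to be one MPI process per node,
   with one thread per core inside that process. */
job_layout mpi_threads_layout(int nodes, int cores_per_node) {
    job_layout l = { nodes, cores_per_node };
    return l;
}
```

For nine quad-core nodes, the first layout gives 36 MPI processes, while the second gives 9 processes with 4 threads each, so fewer processes take part in message passing.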
Pure MPI vs “MPI+threads” approach
Sample Application: N-body Simulations
• To demonstrate the usefulness of our “MPI+threads” approach, we chose an N-body simulation code
• The N-body or “many body” method is used for simulating the evolution of a system consisting of ‘n’ bodies
• It has found widespread use in the fields of:
  – Astrophysics
  – Molecular Dynamics
  – Computational Biology
Summation Approach to solving N-body problems
The most compute-intensive part of any N-body method is the “force calculation” phase
With direct summation, the cost of this calculation is O(n²)
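The quadratic cost comes from the doubly nested loop over all pairs of bodies. A minimal sketch follows, assuming bodies placed at distinct positions on a line so that the inverse-square law needs no square root; the function name `direct_forces_1d` and its parameters are illustrative and are not taken from Gadget-2.

```c
#include <stddef.h>

/* Direct summation for n bodies on a line: every body interacts with
   every other body, giving n * (n - 1) pair evaluations.
   Assumes all positions are distinct (no softening is applied). */
void direct_forces_1d(const double *x, const double *m, double *f,
                      size_t n, double G) {
    for (size_t i = 0; i < n; i++) {
        double fi = 0.0;
        for (size_t j = 0; j < n; j++) {
            if (j == i) continue;             /* no self-interaction */
            double d = x[j] - x[i];           /* signed separation */
            double r2 = d * d;                /* squared distance */
            double sign = (d > 0.0) ? 1.0 : -1.0;
            /* inverse-square magnitude, directed toward body j */
            fi += sign * G * m[i] * m[j] / r2;
        }
        f[i] = fi;
    }
}
```

Doubling the number of bodies roughly quadruples the number of pair evaluations, which is what motivates tree-based methods.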
Barnes Hut Tree
The Barnes-Hut algorithm is divided into 3 steps
1. Building the tree – O( n * log n )
2. Computing cell centers of mass – O (n)
3. Computing Forces – O( n * log n )
Other popular methods are:
• Fast Multipole Method
• Particle Mesh Method
• TreePM Method
• Symplectic Methods
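Step 2 of the Barnes-Hut algorithm listed on this slide (computing cell centers of mass) can be sketched as a post-order walk over the tree. The `bh_node` layout and the function names below are hypothetical and chosen for illustration; Gadget-2's actual data structures differ.

```c
#include <stddef.h>

#define BH_CHILDREN 8   /* octree: eight sub-cells per cell in 3D */

/* Hypothetical node layout, for illustration only. */
typedef struct bh_node {
    double mass;                         /* total mass below this node */
    double com[3];                       /* center of mass of that mass */
    struct bh_node *child[BH_CHILDREN];  /* all NULL for a leaf (one body) */
} bh_node;

static int bh_is_leaf(const bh_node *n) {
    for (int k = 0; k < BH_CHILDREN; k++)
        if (n->child[k] != NULL) return 0;
    return 1;
}

/* Post-order walk: children are summarized before their parent.
   Each node is visited once, so the cost is linear in the node count.
   Returns the total mass stored below n. */
double bh_compute_com(bh_node *n) {
    if (n == NULL) return 0.0;
    if (bh_is_leaf(n)) return n->mass;   /* leaf already holds its body */

    double m = 0.0;
    double weighted[3] = {0.0, 0.0, 0.0};
    for (int k = 0; k < BH_CHILDREN; k++) {
        bh_node *c = n->child[k];
        if (c == NULL) continue;
        double cm = bh_compute_com(c);
        m += cm;
        for (int d = 0; d < 3; d++)
            weighted[d] += cm * c->com[d];   /* mass-weighted position */
    }
    n->mass = m;
    for (int d = 0; d < 3; d++)
        n->com[d] = (m > 0.0) ? weighted[d] / m : 0.0;
    return m;
}
```

During force computation, a distant cell can then be treated as a single body located at `com` with mass `mass`, which is where the O(n log n) cost of step 3 comes from.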
Sample Application: Gadget-2
• Cosmological simulation code
• Simulates a system of “n” bodies
  – Implements the Barnes-Hut algorithm
• Written in C and parallelized with MPI
• As part of this project, we:
  – Studied the Gadget-2 code
  – Learned how it is used in production mode
  – Modified the C code to use threads in the Barnes-Hut tree algorithm
  – Added performance counters to the code for measuring cache utilization
Presentation Outline
• Introduction
• Design & Implementation
• Performance Evaluation
• Conclusions and Future Work
Gadget-2 Architecture
Code Analysis
Original Code

for ( i = 0 to No. of particles && n = 0 to BufferSize ) {
    calculate_force ( i );
    for ( j = 0 to No. of tasks ) {
        export_particles ( j );
    }
}

Modified Code

parallel for ( i = 0 to n ) {
    calculate_force ( i );
}
for ( i = 0 to No. of particles && n = 0 to BufferSize ) {
    for ( j = 0 to No. of tasks ) {
        export_particles ( j );
    }
}
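A minimal C sketch of the modified loop structure, assuming OpenMP is used for the `parallel for`. Both helper functions are hypothetical placeholders: `calculate_force` stands in for the per-particle tree walk and `export_particles` for the MPI exchange, and neither reproduces Gadget-2's real routines. Without OpenMP support enabled in the compiler, the pragma is ignored and the loop runs serially.

```c
/* Hypothetical placeholder for the per-particle tree walk. */
static double calculate_force(const double *pos, long i) {
    return -pos[i];
}

/* Hypothetical placeholder for the export step; returns the number
   of particles handed to the given task. */
static int export_particles(int task) {
    (void)task;
    return 1;
}

/* Returns the total number of exported particles. */
int forces_then_export(const double *pos, double *force, long n, int ntasks) {
    long i;

    /* Phase 1: iterations are independent (each writes only force[i]),
       so they can be shared among the threads of one MPI process. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        force[i] = calculate_force(pos, i);

    /* Phase 2: communication stays on a single thread,
       outside the parallel loop. */
    int exported = 0;
    for (int j = 0; j < ntasks; j++)
        exported += export_particles(j);
    return exported;
}
```

Separating the two phases keeps the communication calls out of the threaded loop, so only the compute-heavy part needs to be thread-safe.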
Presentation Outline
• Introduction
• Design & Implementation
• Performance Evaluation
• Conclusions and Future Work
Evaluation Testbed
• Our cluster, called Chenab, consists of nine nodes
• Each node has:
  – An Intel Xeon quad-core Kentsfield processor
    • 2.4 GHz with 1066 MHz FSB
    • 4 MB L2 cache per two cores
    • 32 KB L1 cache per core
  – 2 GB main memory
Performance Evaluation
• Performance evaluation is based on two main parameters:
  – Execution Time
    • Calculated directly from MPI wallclock timings
  – Cache Utilization
    • We patched the Linux kernel using the perfctr patch
    • We selected PerfAPI (PAPI) for hardware performance counting
    • Used PAPI_L2_TCM (Total Cache Misses) and PAPI_L2_TCA (Total Cache Accesses) to calculate the cache miss ratio
• Results are shown on the upcoming slides:
  – Execution Time for Colliding Galaxies
  – Execution Time for Cluster Formation
  – Execution Time for Custom Simulation
  – Cache Utilization for Cluster Formation
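The cache miss ratio mentioned above is the number of misses divided by the number of accesses. A small sketch, assuming the two counter values have already been read (for example from PAPI_L2_TCM and PAPI_L2_TCA); the function name `cache_miss_ratio` is illustrative, and no PAPI calls are made here.

```c
/* Returns misses / accesses as a fraction in [0, 1] for valid input,
   or 0.0 when no accesses were recorded. PAPI reports counter values
   as long long, so the same type is used for the parameters. */
double cache_miss_ratio(long long l2_misses, long long l2_accesses) {
    if (l2_accesses <= 0)
        return 0.0;   /* avoid division by zero on an empty measurement */
    return (double)l2_misses / (double)l2_accesses;
}
```

A lower ratio means that a larger share of L2 accesses were served from the cache.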
Execution Time for Colliding Galaxies
Execution Time for Cluster Formation
Execution Time for Custom Simulation
Cache Utilization for Cluster Formation
Cache utilization was measured using hardware counters provided by the kernel patch (Perfctr) and PerfAPI (PAPI)
Presentation Outline
• Introduction
• Design & Implementation
• Performance Evaluation
• Conclusions and Future Work
Conclusion
• We optimized Gadget-2, which was our sample application
  – The “MPI+threads” approach performs better than “Pure MPI”
  – The optimized code offers scalable performance
• We are witnessing dramatic changes in core designs for multicore systems
  – Heterogeneous and homogeneous designs
  – Targeting a 1000-core processor will require scalable frameworks and tools for programming
Conclusion
• Towards many-core computing
  – Multicore: 2x / 2 yrs ≈ 64 cores in 8 years
  – Manycore: 8x to 16x multicore
Source: Dave Patterson, Overview of the Parallel Laboratory
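The projection on this slide can be checked with a few lines of arithmetic. The sketch below assumes a starting point of 4 cores, which the slide does not state but which is the value that makes "2x every 2 years" reach 64 cores after 8 years; the function name `project_cores` is illustrative.

```c
/* Projects a core count that doubles once per doubling period.
   Returns base_cores unchanged if the period is not positive. */
long project_cores(long base_cores, int years, int doubling_period_years) {
    if (doubling_period_years <= 0)
        return base_cores;
    long cores = base_cores;
    for (int t = doubling_period_years; t <= years; t += doubling_period_years)
        cores *= 2;   /* one doubling per elapsed period */
    return cores;
}
```

With `project_cores(4, 8, 2)` the result is 64, and a manycore design at 8x to 16x that figure lands at 512 to 1024 cores.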
Future Work
• Scalable frameworks that provide programmer-friendly high-level constructs are very important
  – PeakStream supports GPU and hybrid CPU+GPU programs
  – Cilk++ augments the C++ compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn)
  – Research Accelerator for Multiple Processors (RAMP) can be used to simulate a 1000-core processor
  – Gadget-2 can be ported to GPUs using Nvidia’s CUDA framework
  – The ‘xlc’ compiler can be used to program the STI Cell processor
The Timeline
| ID | Task Name | Start | Finish | Duration |
|----|-----------|-------|--------|----------|
| 1 | Literature Review | 1/18/2008 | 2/28/2008 | 6w |
| 2 | Evaluation of Gadget-2 | 2/28/2008 | 3/26/2008 | 4w |
| 3 | Optimizations in Gadget-2 (prototype1) | 3/26/2008 | 4/29/2008 | 5w |
| 4 | Testing of prototype1 | 4/29/2008 | 5/12/2008 | 2w |
| 5 | Optimizations in prototype1 | 5/12/2008 | 5/30/2008 | 3w |
| 6 | Final Version | 5/30/2008 | 6/12/2008 | 2w |
| 7 | Simulation Snapshots and Results | 6/12/2008 | 6/25/2008 | 2w |
| 8 | Final Documentation and Finishing Tasks | 6/25/2008 | 7/21/2008 | 3.8w |
| 9 | Improvements in Documentation | 4/15/2008 | 7/18/2008 | 13.8w |
Barnes Hut Tree