Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture
High Performance Computing Group
Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture
A. Kumar1, G. Senthilkumar1, M. Krishna1, N. Jayam1, P.K. Baruah1, R. Sarma1, S. Kapoor2, A. Srinivasan3
1 Sri Sathya Sai University, Prashanthi Nilayam, India
2 IBM, Austin, skapoor@us.ibm.com
3 Florida State University, asriniva@cs.fsu.edu
Goals
1. Determine the feasibility of Intra-Cell MPI
2. Evaluate the impact of different design choices on performance
A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store
Shared 512 MB – 2 GB main memory; SPEs access it through DMA
Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for the SPEs
204.8 GB/s EIB bandwidth, 25.6 GB/s for memory
Two Cell processors can be combined to form a Cell blade with global shared memory
Cell Architecture
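The quoted single-precision peak follows directly from the SPE clock rate and SIMD width. A quick sanity check (the 3.2 GHz clock is from the results slide; the 4-wide SIMD and fused multiply-add figures are assumptions not stated above):

```python
# Hedged sanity check of the peak rate quoted above.
# Assumptions (not stated on the slide): 4-wide single-precision
# SIMD lanes and a fused multiply-add (2 flops) per lane per cycle.
CLOCK_HZ = 3.2e9      # clock rate quoted on the results slide
SPES = 8
SP_LANES = 4          # 128-bit registers, 4 x 32-bit floats
FLOPS_PER_FMA = 2     # multiply + add

sp_peak = SPES * CLOCK_HZ * SP_LANES * FLOPS_PER_FMA  # flops/s
print(sp_peak / 1e9)  # 204.8, matching the 204.8 Gflops on the slide
```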
DMA put times
Memory to memory copy using:
• SPE local store
• memcpy by PPE
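The local-store copy path is typically double-buffered: while one buffer is being filled from the source, the previously filled buffer is drained to the destination, so the two DMA directions overlap. A minimal sketch of the buffering pattern, with plain Python slicing standing in for the DMA engine (the buffer size and the `dma_get` name are illustrative, not the Cell API):

```python
# Simulation of a double-buffered memory-to-memory copy through a
# small "local store", mimicking the SPE copy path compared above.
# The DMA engine is modelled as plain slicing; names are illustrative.
LS_BUF = 16 * 1024  # two such buffers fit easily in a 256 KB local store

def dma_get(src: bytes, offset: int, n: int) -> bytes:
    """Stand-in for an mfc_get-style transfer into local store."""
    return src[offset:offset + n]

def copy_double_buffered(src: bytes) -> bytes:
    dst = bytearray()
    bufs = [b"", b""]
    # Prime the pipeline with the first chunk.
    bufs[0] = dma_get(src, 0, LS_BUF)
    offset, cur = len(bufs[0]), 0
    while bufs[cur]:
        nxt = cur ^ 1
        # In real code this next get would overlap the put below.
        bufs[nxt] = dma_get(src, offset, LS_BUF)
        offset += len(bufs[nxt])
        dst += bufs[cur]          # "dma_put" of the completed buffer
        cur = nxt
    return bytes(dst)

data = bytes(range(256)) * 300    # 76 800 bytes, several chunks
assert copy_double_buffered(data) == data
```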
Intra-Cell MPI Design Choices
Cell features
• In-order execution, but DMAs can be out of order
• Over 100 simultaneous DMAs can be in flight

Constraints
• Unconventional, heterogeneous architecture
• SPEs have limited functionality, and can act directly only on local stores
• SPEs access main memory through DMA
• Use of the PPE should be limited to get good performance

MPI design choices
• Application data in: (i) local store or (ii) main memory
• MPI meta-data in: (i) local store or (ii) main memory
• PPE involvement: (i) active or (ii) only during initialization and finalization
• Point-to-point communication mode: (i) synchronous or (ii) buffered
Blocking Point-to-Point Communication Performance
Results are from a 3.2 GHz Cell Blade, at IBM Rochester
The final version uses buffered mode for small messages and synchronous mode for long messages
The threshold for switching to synchronous mode is set to 2 KB
In these figures, the default configuration has application data in main memory, MPI data in the local store, no congestion, and limited PPE involvement
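The small/large split described above can be sketched as a simple dispatch on message size. The 2 KB threshold is from the slide; the function and mode names are illustrative:

```python
# Sketch of the send-mode selection described above: buffered mode for
# small messages (eager copy through an intermediate buffer, low
# latency), synchronous rendezvous mode for long ones (one direct
# copy, high bandwidth). The 2 KB threshold is from the slides.
SYNC_THRESHOLD = 2 * 1024  # bytes

def choose_send_mode(nbytes: int) -> str:
    """Pick the point-to-point protocol for a message of nbytes."""
    return "buffered" if nbytes < SYNC_THRESHOLD else "synchronous"

assert choose_send_mode(64) == "buffered"          # latency-sensitive
assert choose_send_mode(1 << 20) == "synchronous"  # bandwidth-sensitive
```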
MPI/Platform        Latency (0 byte)   Maximum throughput
MPICELL             0.41 µs            6.01 GB/s
MPICELL Congested   NA                 4.48 GB/s
MPICELL Small       0.65 µs            23.12 GB/s
Nemesis/Xeon        1.0 µs             0.65 GB/s
Shm/Xeon            1.3 µs             0.5 GB/s
Open MPI/Xeon       2.8 µs             0.5 GB/s
Nemesis/Opteron     0.34 µs            1.5 GB/s
Open MPI/Opteron    0.6 µs             1.0 GB/s

Comparison of MPICELL with MPI on Other Hardware
Collective Communication Example – Broadcast
Broadcast on 16 SPEs (2 processors)
• TREE: Pipelined tree-structured communication based on the local store
• TREEMM: Tree-structured Send/Recv-type implementation
• AG: Each SPE is responsible for a different portion of the data
• OTA: Each SPE copies data to its location
• G: Root copies all data
Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown
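The tree-structured variants above can be pictured as a doubling schedule: in each round, every SPE that already holds the data forwards it to one that does not, covering 16 SPEs in log2(16) = 4 rounds. A sketch of that schedule (this models only the round structure; the real implementations also pipeline chunks through the local store, which is omitted here):

```python
# Doubling-tree broadcast schedule: the set of ranks holding the data
# roughly doubles each round, so nranks are covered in ceil(log2) rounds.
def tree_broadcast(nranks: int, root: int = 0) -> int:
    has_data = {root}
    rounds = 0
    while len(has_data) < nranks:
        pending = [r for r in range(nranks) if r not in has_data]
        # Each current holder sends to one non-holder this round.
        for src, dst in zip(sorted(has_data), pending):
            has_data.add(dst)
        rounds += 1
    return rounds

assert tree_broadcast(16) == 4   # the 16-SPE case from the slide
```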
Application Performance – Matrix-Vector Multiplication
Used a 1-D decomposition (not very efficient)
Achieved a peak double precision throughput of 7.8 Gflop/s for matrices of size 1024
The collective used was from an older implementation on the Cell, built on top of Send/Recv using a tree structured communication
The Opteron results used LAM MPI
Performance of Double Precision matrix-vector multiplication
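The 1-D decomposition above gives each of P processes a contiguous block of rows plus the whole input vector; each computes its slice of y = Ax locally, and the slices are then combined, which is the tree-structured collective mentioned. A serial sketch with a loop standing in for the MPI ranks (pure illustration, not the paper's code):

```python
# Sketch of a 1-D row decomposition for y = A @ x: each "rank" owns
# n/nprocs contiguous rows and the full vector x, computes its partial
# result, and the partials are concatenated (the gather step).
def matvec_1d(A, x, nprocs):
    n = len(A)
    rows = n // nprocs                  # assume nprocs divides n evenly
    partials = []
    for rank in range(nprocs):          # one iteration per "rank"
        block = A[rank * rows:(rank + 1) * rows]
        partials.append([sum(a * b for a, b in zip(row, x)) for row in block])
    # Every rank ends up with the full result vector.
    return [v for part in partials for v in part]

A = [[1, 2], [3, 4]]
assert matvec_1d(A, [1, 1], 2) == [3, 7]
```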
Conclusions and Future Work
Conclusions
• The Cell processor has good potential for MPI applications
• The PPE should have a very limited role
• Very high bandwidths with application data in the local store
• High bandwidth and low latency even with application data in main memory
• The local store should be used effectively, with double buffering to hide latency; main memory bandwidth is then the bottleneck
• Good performance for collectives even with two Cell processors

Current and future work
Implemented
• Collective communication operations optimized for contiguous data
• Blocking and non-blocking communication
Future work
• Optimize collectives for derived data types with non-contiguous data
• Optimize point-to-point communication on a blade with two processors
• More features, such as topologies, etc.