Designing High Performance Computing Architectures for Reliable Space Applications
Designing High Performance Computing Architectures for Reliable Space Applications

Fisnik Kraja
PhD Defense, December 6, 2012
Advisors: 1st: Prof. Dr. Arndt Bode, 2nd: Prof. Dr. Xavier Martorell
Outline

1. Motivation
2. The Proposed Computing Architecture
3. The 2DSSAR Benchmarking Application
4. Optimizations and Benchmarking Results
   – Shared memory multiprocessor systems
   – Distributed memory multiprocessor systems
   – Heterogeneous CPU/GPU systems
5. Conclusions
Motivation

• Future space applications will demand:
  – Increased on-board computing capabilities
  – Preserved system reliability
• Future missions:
  – Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
  – Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s
• Challenges:
  – Costs (ASICs are very expensive)
  – Modularity (component change and reuse)
  – Portability (across various spacecraft platforms)
  – Scalability (hardware and software)
  – Programmability (compatibility with various environments)
  – Efficiency (power consumption and size)
The Proposed Architecture

[Architecture block diagram]
Legend:
• RHMU: Radiation-Hardened Management Unit
• PPN: Parallel Processing Node
• Control Bus
• Data Bus
The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar

• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
  [Figure: illuminated swath in side-looking Spotlight SAR; the spacecraft flies along the azimuth flight path at a given altitude and images a swath defined by range and cross-range]
• SAR Sensor Processing (SSP): Read Generated Data, Image Reconstruction (IR), Write Reconstructed Image
• The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm (a generic sketch of the matched-filtering step follows below).
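The matched-filtering core of IR can be illustrated, in very reduced form, with a 1D sketch in C: transform a range line, multiply by the conjugate spectrum of a reference signal, and transform back. The function name, the 1D formulation, and the choice of FFTW are assumptions for illustration only; the thesis application works on the full 2D problem and adds the interpolation step.

```c
#include <complex.h>
#include <fftw3.h>

/* Hedged 1D illustration of FFT-based matched filtering: correlate one range
 * line with a reference signal by multiplying spectra.  Names, the 1D
 * formulation, and the use of FFTW are assumptions for illustration; the
 * thesis applies the full 2D matched-filtering and interpolation algorithm. */
void matched_filter_line(int n, fftw_complex *line, const fftw_complex *ref_spec)
{
    fftw_plan fwd = fftw_plan_dft_1d(n, line, line, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(n, line, line, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);                       /* range line -> spectrum            */
    for (int k = 0; k < n; k++)
        line[k] *= conj(ref_spec[k]) / n;    /* multiply by conjugate reference;
                                                1/n undoes the unnormalized IFFT  */
    fftw_execute(bwd);                       /* spectrum -> matched-filtered line */

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}
```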
Profiling SAR Image Reconstruction

           Coverage (km)   Memory (GB)   FLOP (Giga)   Time (s)
Scale=10   3.8 x 2.5       0.25          29.54         23
Scale=30   11.4 x 7.5      2             115.03        230
Scale=60   22.8 x 15       8             1302          926

Goal: 30x speedup.

IR profiling (share of execution time):
• Interpolation loop: 69%
• FFTs: 22%
• Compression and decompression loops: 7%
• Transposition and FFT-shifting: 2%
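A hedged aside that is not on the slide: if, hypothetically, only the transposition and FFT-shifting share (about 2%) resisted parallelization, Amdahl's law $S(N) = 1/((1-p) + p/N)$ with $p \approx 0.98$ would cap the speedup at $1/(1-p) = 50$ even on arbitrarily many cores, so the 30x goal hinges on parallelizing the interpolation loop (69%) and the FFTs (22%) efficiently.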
IR Optimizations for Shared Memory Multiprocessing

• OpenMP
  – General optimizations:
    • Thread pinning and first-touch policy
    • Static/dynamic scheduling
  – FFT:
    • Manual multithreading of the loops of 1D FFTs (not of the FFT itself)
  – Interpolation loop (polar to rectangular coordinates), see the sketch below:
    • Atomic operations
    • Replication and Reduction (R&R)
• Other programming models
  – OmpSs, MPI, MPI+OpenMP
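A minimal sketch of the two interpolation-loop variants named above, assuming a generic scatter pattern in which input sample k adds a value into output bin idx[k]; the array names, sizes, and scheduling clause are hypothetical placeholders, and only the synchronization patterns (atomic updates versus per-thread replication followed by a reduction) correspond to the Atomic Operations and R&R strategies.

```c
#include <omp.h>
#include <stdlib.h>

/* Hypothetical scatter pattern standing in for the interpolation loop:
 * input sample k contributes val[k] to output bin idx[k].  Names and
 * sizes are placeholders, not taken from the thesis code. */

/* Variant 1: protect every scattered update with an atomic operation. */
void scatter_atomic(int nin, const int *idx, const float *val, float *out)
{
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < nin; k++) {
        #pragma omp atomic
        out[idx[k]] += val[k];
    }
}

/* Variant 2: Replication and Reduction (R&R) -- every thread scatters into
 * its own private copy of the output, and the copies are summed afterwards. */
void scatter_rr(int nin, int nout, const int *idx, const float *val, float *out)
{
    #pragma omp parallel
    {
        float *priv = calloc(nout, sizeof(float));   /* per-thread replica */

        #pragma omp for schedule(dynamic) nowait
        for (int k = 0; k < nin; k++)
            priv[idx[k]] += val[k];

        #pragma omp critical                         /* reduce replicas into out */
        for (int j = 0; j < nout; j++)
            out[j] += priv[j];

        free(priv);
    }
}
```

The R&R variant trades memory (one output replica per thread) for less synchronization, which is consistent with it outperforming the atomic variant in the results on the next slide.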
IR on a Shared Memory Node

The ccNUMA node:
• 2 x Nehalem CPUs, 6 cores / 12 threads each, 2.93-3.33 GHz
• QPI: 25.6 GB/s, IMC: 32 GB/s
• Lithography: 32 nm, TDP: 95 W
• 2 x 3 x 6 GB memory, 36 GB total, DDR3 SDRAM at 1066 MHz

Speedup (Scale=60):

Cores (Threads)   1      2      4      6      8      10     12     12 (24)
OpenMP Atomic     1      1.55   3.05   4.45   5.81   6.94   7.98   10.54
OpenMP R&R        1      1.78   3.51   5.02   6.36   7.74   9.03   11.06
OmpSs Atomic      1      1.61   3.12   4.62   5.92   7.02   8.13   10.72
OmpSs R&R         1      1.93   3.73   5.52   7.13   8.65   10.37  12.37
MPI R&R           1      1.92   3.65   5.30   6.57   7.94   9.81   11.20
MPI+OpenMP        1      1.89   3.54   4.88   6.40   8.02   9.94   11.69
IR Optimizations for Distributed Memory Multiprocessing

• Programming paradigms
  – MPI
    • Data replication
    • Process creation overhead
  – MPI+OpenMP
    • 1 process/node
    • 1 thread/core
• Communication optimizations
  – Transposition (new: All-to-All), see the MPI sketch below
  – FFT-shift (new: Peer-to-Peer)
  – Interpolation loop: Replication and Reduction
• Pipelined IR
  – Each node reconstructs a separate SAR image

[Diagram: block-row data layout across processes PID 0-3 (D00..D33) and its redistribution into tiles (A1/A2, B1/B2, C1/C2, D1/D2) by the transposition]
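A minimal sketch of the All-to-All transposition mentioned above, assuming a square complex matrix stored as float pairs and distributed by rows; the matrix size and buffer names are illustrative assumptions, not the thesis code.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hedged sketch of the All-to-All based transposition.  The matrix size,
 * the float-pair encoding of complex samples, and the buffer names are
 * illustrative; n is assumed divisible by the process count. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n    = 4096;                     /* global matrix dimension (illustrative) */
    const int rows = n / nprocs;               /* contiguous rows owned by this rank     */
    const int tile = rows * (n / nprocs) * 2;  /* floats in one rank-to-rank tile        */

    float *send = malloc((size_t)rows * n * 2 * sizeof(float));
    float *recv = malloc((size_t)rows * n * 2 * sizeof(float));
    /* ... pack this rank's rows into send, one tile per destination rank ... */

    /* One collective moves tile j of rank i to rank j; after a local transpose
     * of the received tiles, each rank owns a block of columns instead of rows. */
    MPI_Alltoall(send, tile, MPI_FLOAT, recv, tile, MPI_FLOAT, MPI_COMM_WORLD);

    /* ... locally transpose the nprocs received tiles and continue with IR ... */

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```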
IR on the Distributed Memory System

The Nehalem cluster:
• Each node: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12/24/48 GB RAM
• QPI: 25.6 GB/s, IMC: 32 GB/s
• Lithography: 45 nm, TDP: 95 W/CPU
• InfiniBand network, fat-tree topology, 6 backbone switches, 24 leaf switches

Speedup (Scale=60):

No. of Nodes (Cores)            1 (8)   2 (16)   4 (32)   8 (64)   12 (96)   16 (128)
MPI (4 Proc/Node)               3.54    5.46     7.92     8.52     7.69      7.37
Hybrid (1 Proc:16 Thr/Node)     6.68    10.19    14.41    17.11    17.19     17.73
MPI_new (8 Proc/Node, 24 GB)    6.45    10.87    15.93    23.69    28.90     31.06
Hyb_new (1 Proc:16 Thr/Node)    5.66    9.68     17.13    26.92    30.72     32.08
Pipelined (1 Proc:16 Thr/Node)  6.35    11.50    21.30    38.05    50.48     59.80
IR Optimizations for Heterogeneous CPU/GPU Computing

• ccNUMA Multi-Processor
  – Sequential optimizations
  – Minor load-balancing improvements
• Computing on CPU+GPU
• Accelerator (GP-GPU)
  – CUDA: tiling technique
  – cuFFT library (see the host-side sketch below)
  – Transcendental functions, such as sine and cosine
  – CUDA 3.2 lacks:
    • Some complex operations (multiplication and CEXP)
    • Atomic operations for complex/float data
  – Memory limitation:
    • Atomic operations are used in the SFI loop (R&R is not an option)
    • The large-scale IR dataset does not fit into GPU memory

[Diagram: tiled thread-block layout, with blockIndex.x/.y (bx, by) and threadIndex.x/.y (tx, ty) addressing tiles of size tsize x tsize]
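As a minimal illustration of the cuFFT usage mentioned above, the host-side sketch below plans and runs a batched 1D complex-to-complex FFT; the line length, batch size, in-place execution, and omitted kernels are assumptions for illustration, not the thesis implementation.

```c
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int n     = 2048;   /* samples per range line (illustrative)              */
    const int batch = 512;    /* number of lines transformed at once (illustrative) */

    /* Device buffer holding the whole batch of complex range lines. */
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * n * batch);
    /* ... cudaMemcpy the input lines into d_data ... */

    /* Plan and run a batched 1D complex-to-complex FFT, as in the FFT stages of IR. */
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    /* ... matched-filtering and interpolation kernels would run here ... */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```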
IR on a Heterogeneous Node

The machine:
• ccNUMA module: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12 GB RAM, TDP: 95 W/CPU, PCIe 2.0 (8 GB/s)
• Accelerator module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5 at 144 GB/s, TDP: 238 W

Speedup relative to the unoptimized sequential CPU run:

           CPU   CPU Best     CPU         CPU 16 Threads   GPU     CPU + GPU   2 GPUs   2 GPUs
                 Sequential   8 Threads   (SMT)                                         Pipelined
Scale=10   1     1.82         14.46       16.06            20.11   18.88       4.27     15.86
Scale=30   1     1.89         11.41       13.26            19.44   22.10       16.71    25.40
Scale=60   1     1.97         10.27       12.55            20.17   24.68       22.26    34.46
Conclusions

• Shared memory nodes
  – Performance is limited by hardware resources
  – 1 node (12 cores / 24 threads): speedup = 12.4
• Distributed memory systems
  – Low efficiency in terms of performance per power consumption and size
  – 8 nodes (64 cores): speedup = 38.05
• Heterogeneous CPU/GPU systems
  – Perfect compromise:
    • Better performance than current shared memory nodes
    • Better efficiency than distributed memory systems
    • 1 CPU + 2 GPUs: speedup = 34.46
• Final design recommendations
  – Powerful shared memory PPN
  – PPN with ccNUMA CPUs and GPU accelerators
  – Distributed memory only if multiple PPNs are needed
Thank You

kraja@in.tum.de