Designing High Performance Computing Architectures for Reliable Space Applications
Designing High Performance Computing Architectures for Reliable Space Applications

Fisnik Kraja
PhD Defense, December 6, 2012
Advisors: 1st: Prof. Dr. Arndt Bode, 2nd: Prof. Dr. Xavier Martorell
Outline

1. Motivation
2. The Proposed Computing Architecture
3. The 2DSSAR Benchmarking Application
4. Optimizations and Benchmarking Results
   – Shared memory multiprocessor systems
   – Distributed memory multiprocessor systems
   – Heterogeneous CPU/GPU systems
5. Conclusions
Motivation

• Future space applications will demand:
  – Increased on-board computing capabilities
  – Preserved system reliability
• Future missions:
  – Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
  – Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s
• Challenges:
  – Costs (ASICs are very expensive)
  – Modularity (component change and reuse)
  – Portability (across various spacecraft platforms)
  – Scalability (hardware and software)
  – Programmability (compatibility with various environments)
  – Efficiency (power consumption and size)
The Proposed Architecture

[Architecture block diagram]
Legend:
• RHMU: Radiation-Hardened Management Unit
• PPN: Parallel Processing Node
• Control Bus
• Data Bus
The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar

• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
  [Figure: illuminated swath in side-looking Spotlight SAR; the spacecraft flies along the azimuth flight path at a given altitude and images a swath defined by range and cross-range]
• SAR Sensor Processing (SSP): Read Generated Data, Image Reconstruction (IR), Write Reconstructed Image
• The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm (a generic sketch of the matched-filtering step follows below).
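The matched-filtering core of IR can be illustrated, in very reduced form, with a 1D sketch in C: transform a range line, multiply by the conjugate spectrum of a reference signal, and transform back. The function name, the 1D formulation, and the choice of FFTW are assumptions for illustration only; the thesis application works on the full 2D problem and adds the interpolation step.

```c
#include <complex.h>
#include <fftw3.h>

/* Hedged 1D illustration of FFT-based matched filtering: correlate one range
 * line with a reference signal by multiplying spectra.  Names, the 1D
 * formulation, and the use of FFTW are assumptions for illustration; the
 * thesis applies the full 2D matched-filtering and interpolation algorithm. */
void matched_filter_line(int n, fftw_complex *line, const fftw_complex *ref_spec)
{
    fftw_plan fwd = fftw_plan_dft_1d(n, line, line, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan bwd = fftw_plan_dft_1d(n, line, line, FFTW_BACKWARD, FFTW_ESTIMATE);

    fftw_execute(fwd);                       /* range line -> spectrum            */
    for (int k = 0; k < n; k++)
        line[k] *= conj(ref_spec[k]) / n;    /* multiply by conjugate reference;
                                                1/n undoes the unnormalized IFFT  */
    fftw_execute(bwd);                       /* spectrum -> matched-filtered line */

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
}
```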
Profiling SAR Image Reconstruction

           Coverage (km)   Memory (GB)   FLOP (Giga)   Time (s)
Scale=10   3.8 x 2.5       0.25          29.54         23
Scale=30   11.4 x 7.5      2             115.03        230
Scale=60   22.8 x 15       8             1302          926

Goal: 30x speedup.

IR profiling (share of execution time):
• Interpolation loop: 69%
• FFTs: 22%
• Compression and decompression loops: 7%
• Transposition and FFT-shifting: 2%
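A hedged aside that is not on the slide: if, hypothetically, only the transposition and FFT-shifting share (about 2%) resisted parallelization, Amdahl's law $S(N) = 1/((1-p) + p/N)$ with $p \approx 0.98$ would cap the speedup at $1/(1-p) = 50$ even on arbitrarily many cores, so the 30x goal hinges on parallelizing the interpolation loop (69%) and the FFTs (22%) efficiently.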
IR Optimizations for Shared Memory Multiprocessing

• OpenMP
  – General optimizations:
    • Thread pinning and first-touch policy
    • Static/dynamic scheduling
  – FFT:
    • Manual multithreading of the loops of 1D FFTs (not of the FFT itself)
  – Interpolation loop (polar to rectangular coordinates), see the sketch below:
    • Atomic operations
    • Replication and Reduction (R&R)
• Other programming models
  – OmpSs, MPI, MPI+OpenMP
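A minimal sketch of the two interpolation-loop variants named above, assuming a generic scatter pattern in which input sample k adds a value into output bin idx[k]; the array names, sizes, and scheduling clause are hypothetical placeholders, and only the synchronization patterns (atomic updates versus per-thread replication followed by a reduction) correspond to the Atomic Operations and R&R strategies.

```c
#include <omp.h>
#include <stdlib.h>

/* Hypothetical scatter pattern standing in for the interpolation loop:
 * input sample k contributes val[k] to output bin idx[k].  Names and
 * sizes are placeholders, not taken from the thesis code. */

/* Variant 1: protect every scattered update with an atomic operation. */
void scatter_atomic(int nin, const int *idx, const float *val, float *out)
{
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < nin; k++) {
        #pragma omp atomic
        out[idx[k]] += val[k];
    }
}

/* Variant 2: Replication and Reduction (R&R) -- every thread scatters into
 * its own private copy of the output, and the copies are summed afterwards. */
void scatter_rr(int nin, int nout, const int *idx, const float *val, float *out)
{
    #pragma omp parallel
    {
        float *priv = calloc(nout, sizeof(float));   /* per-thread replica */

        #pragma omp for schedule(dynamic) nowait
        for (int k = 0; k < nin; k++)
            priv[idx[k]] += val[k];

        #pragma omp critical                         /* reduce replicas into out */
        for (int j = 0; j < nout; j++)
            out[j] += priv[j];

        free(priv);
    }
}
```

The R&R variant trades memory (one output replica per thread) for less synchronization, which is consistent with it outperforming the atomic variant in the results on the next slide.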
IR on a Shared Memory Node

The ccNUMA node:
• 2 x Nehalem CPUs, 6 cores / 12 threads each, 2.93-3.33 GHz
• QPI: 25.6 GB/s, IMC: 32 GB/s
• Lithography: 32 nm, TDP: 95 W
• 2 x 3 x 6 GB memory, 36 GB total, DDR3 SDRAM at 1066 MHz

Speedup (Scale=60):

Cores (Threads)   1      2      4      6      8      10     12     12 (24)
OpenMP Atomic     1      1.55   3.05   4.45   5.81   6.94   7.98   10.54
OpenMP R&R        1      1.78   3.51   5.02   6.36   7.74   9.03   11.06
OmpSs Atomic      1      1.61   3.12   4.62   5.92   7.02   8.13   10.72
OmpSs R&R         1      1.93   3.73   5.52   7.13   8.65   10.37  12.37
MPI R&R           1      1.92   3.65   5.30   6.57   7.94   9.81   11.20
MPI+OpenMP        1      1.89   3.54   4.88   6.40   8.02   9.94   11.69
IR Optimizations for Distributed Memory Multiprocessing

• Programming paradigms
  – MPI
    • Data replication
    • Process creation overhead
  – MPI+OpenMP
    • 1 process/node
    • 1 thread/core
• Communication optimizations
  – Transposition (new: All-to-All), see the MPI sketch below
  – FFT-shift (new: Peer-to-Peer)
  – Interpolation loop: Replication and Reduction
• Pipelined IR
  – Each node reconstructs a separate SAR image

[Diagram: block-row data layout across processes PID 0-3 (D00..D33) and its redistribution into tiles (A1/A2, B1/B2, C1/C2, D1/D2) by the transposition]
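A minimal sketch of the All-to-All transposition mentioned above, assuming a square complex matrix stored as float pairs and distributed by rows; the matrix size and buffer names are illustrative assumptions, not the thesis code.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hedged sketch of the All-to-All based transposition.  The matrix size,
 * the float-pair encoding of complex samples, and the buffer names are
 * illustrative; n is assumed divisible by the process count. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n    = 4096;                     /* global matrix dimension (illustrative) */
    const int rows = n / nprocs;               /* contiguous rows owned by this rank     */
    const int tile = rows * (n / nprocs) * 2;  /* floats in one rank-to-rank tile        */

    float *send = malloc((size_t)rows * n * 2 * sizeof(float));
    float *recv = malloc((size_t)rows * n * 2 * sizeof(float));
    /* ... pack this rank's rows into send, one tile per destination rank ... */

    /* One collective moves tile j of rank i to rank j; after a local transpose
     * of the received tiles, each rank owns a block of columns instead of rows. */
    MPI_Alltoall(send, tile, MPI_FLOAT, recv, tile, MPI_FLOAT, MPI_COMM_WORLD);

    /* ... locally transpose the nprocs received tiles and continue with IR ... */

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```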
IR on the Distributed Memory System

The Nehalem cluster:
• Each node: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12/24/48 GB RAM
• QPI: 25.6 GB/s, IMC: 32 GB/s
• Lithography: 45 nm, TDP: 95 W/CPU
• InfiniBand network, fat-tree topology, 6 backbone switches, 24 leaf switches

Speedup (Scale=60):

No. of Nodes (Cores)            1 (8)   2 (16)   4 (32)   8 (64)   12 (96)   16 (128)
MPI (4 Proc/Node)               3.54    5.46     7.92     8.52     7.69      7.37
Hybrid (1 Proc:16 Thr/Node)     6.68    10.19    14.41    17.11    17.19     17.73
MPI_new (8 Proc/Node, 24 GB)    6.45    10.87    15.93    23.69    28.90     31.06
Hyb_new (1 Proc:16 Thr/Node)    5.66    9.68     17.13    26.92    30.72     32.08
Pipelined (1 Proc:16 Thr/Node)  6.35    11.50    21.30    38.05    50.48     59.80
IR Optimizations for Heterogeneous CPU/GPU Computing

• ccNUMA Multi-Processor
  – Sequential optimizations
  – Minor load-balancing improvements
• Computing on CPU+GPU
• Accelerator (GP-GPU)
  – CUDA: tiling technique
  – cuFFT library (see the host-side sketch below)
  – Transcendental functions, such as sine and cosine
  – CUDA 3.2 lacks:
    • Some complex operations (multiplication and CEXP)
    • Atomic operations for complex/float data
  – Memory limitation:
    • Atomic operations are used in the SFI loop (R&R is not an option)
    • The large-scale IR dataset does not fit into GPU memory

[Diagram: tiled thread-block layout, with blockIndex.x/.y (bx, by) and threadIndex.x/.y (tx, ty) addressing tiles of size tsize x tsize]
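As a minimal illustration of the cuFFT usage mentioned above, the host-side sketch below plans and runs a batched 1D complex-to-complex FFT; the line length, batch size, in-place execution, and omitted kernels are assumptions for illustration, not the thesis implementation.

```c
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int n     = 2048;   /* samples per range line (illustrative)              */
    const int batch = 512;    /* number of lines transformed at once (illustrative) */

    /* Device buffer holding the whole batch of complex range lines. */
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * n * batch);
    /* ... cudaMemcpy the input lines into d_data ... */

    /* Plan and run a batched 1D complex-to-complex FFT, as in the FFT stages of IR. */
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    /* ... matched-filtering and interpolation kernels would run here ... */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```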
IR on a Heterogeneous Node

The machine:
• ccNUMA module: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12 GB RAM, TDP: 95 W/CPU, PCIe 2.0 (8 GB/s)
• Accelerator module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5 at 144 GB/s, TDP: 238 W

Speedup relative to the unoptimized sequential CPU run:

           CPU   CPU Best     CPU         CPU 16 Threads   GPU     CPU + GPU   2 GPUs   2 GPUs
                 Sequential   8 Threads   (SMT)                                         Pipelined
Scale=10   1     1.82         14.46       16.06            20.11   18.88       4.27     15.86
Scale=30   1     1.89         11.41       13.26            19.44   22.10       16.71    25.40
Scale=60   1     1.97         10.27       12.55            20.17   24.68       22.26    34.46
Conclusions

• Shared memory nodes
  – Performance is limited by hardware resources
  – 1 node (12 cores / 24 threads): speedup = 12.4
• Distributed memory systems
  – Low efficiency in terms of performance per power consumption and size
  – 8 nodes (64 cores): speedup = 38.05
• Heterogeneous CPU/GPU systems
  – Perfect compromise:
    • Better performance than current shared memory nodes
    • Better efficiency than distributed memory systems
    • 1 CPU + 2 GPUs: speedup = 34.46
• Final design recommendations
  – Powerful shared memory PPN
  – PPN with ccNUMA CPUs and GPU accelerators
  – Distributed memory only if multiple PPNs are needed
Thank You

kraja@in.tum.de