architecture and compilers part ii - cs.fsu.eduengelen/courses/hpc-2008/architecture2.pdf
Architecture and Compilers PART II
HPC Fall 2008 Prof. Robert van Engelen
HPC Fall 2008 2 10/14/08
Overview
- PMS models
- Shared memory multiprocessors: basic shared memory systems; SMP, multicore, and COMA
- Distributed memory multicomputers: MPP systems; network topologies for message-passing multicomputers; distributed shared memory
- Pipeline and vector processors
- Comparison
- Taxonomies
PMS Architecture Models
Processor (P): a device that performs operations on data.
Memory (M): a device that stores data.
Switch (S): a device that facilitates transfer of data between devices.
Arcs denote connectivity.
Example: a computer system with CPU and peripherals as a simple PMS model.
Shared Memory Multiprocessor
Processors access shared memory via a common switch, e.g. a bus. Problem: a single bus becomes a bottleneck.
Shared memory has a single address space. This architecture is sometimes referred to as a "dance hall".
Bus Contention
Each processor competes for access to shared memory: fetching instructions, loading and storing data.
Bus contention: access to memory is restricted to one processor at a time, which limits the speedup and scalability with respect to the number of processors.
Assume that each instruction requires m memory operations (load/store), that F instructions are performed per unit of time, and that a maximum of W words can be moved over the bus per unit of time. Then

  SP < W / (m F)

regardless of the number of processors P. In other words, efficiency is limited unless

  P < W / (m F)
Work-Memory Ratio
The work-memory ratio (also known as the FP:M ratio) is the ratio of the number of floating-point operations to the number of distinct memory locations referenced.
Note that the definition counts distinct memory locations: the same location is counted just once. This implicitly accounts for the use of registers and cache.
Efficient utilization of shared memory multiprocessors requires

  P < FP:M × W / F

Examples:

  for (i=0; i<1000; i++) x = x + 2*i;    // 2 distinct locations: FP:M = 500
  for (i=0; i<1000; i++) x = x + a[i];   // 1001 distinct locations: FP:M = 1000:1001 ≈ 1
Shared Memory Multiprocessor with Local Cache
Add a local cache to improve performance when W / F is small. With today's systems we have W / F << 1.
Problem: how to ensure cache coherence?
Cache Coherence
A cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor.
Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing.
Example: thread 1 modifies shared data; thread 0 then reads the modified shared data; the cache coherence protocol ensures that thread 0 obtains the newly altered data.
COMA
Cache-only memory architecture (COMA): a large cache per processor replaces shared memory.
A data item is either in one cache (non-shared) or in multiple caches (shared).
The switch includes an engine that provides a single global address space and ensures cache coherence.
Distributed Memory Multicomputer
Massively parallel processor (MPP) systems with P > 1000.
Communication via message passing; nonuniform memory access (NUMA).
Network topologies: mesh, hypercube, cross-bar switch.
Computation-Communication Ratio
The computation-communication ratio: tcomp / tcomm
Usually assessed analytically and/or measured empirically
High communication overhead decreases speedup, so the ratio should be as high as possible.
For example, with data size n and P processors, suppose tcomp / tcomm = 1000n / 10n².
(Figure: computation/communication ratio tcomp / tcomm = 1000n / 10n² and the resulting speedup versus data size n, for P = 1, 2, 4, 8.)
SP = ts / tP = 1000n / (1000n/P + 10n²)
Mesh
A network of P nodes has mesh size √P × √P.
Diameter: 2(√P − 1).
A torus network wraps the ends, reducing the diameter to 2⌊√P/2⌋ ≈ √P.
Hypercube
A d-dimensional hypercube has P = 2^d nodes.
Diameter is d = log2 P.
Node addressing is simple: the node numbers of nearest-neighbor nodes differ in exactly one bit.
A routing algorithm flips bits to determine possible paths; e.g. from node 001 to 111 there are two shortest paths:
  001 → 011 → 111
  001 → 101 → 111
(Figures: hypercubes of dimension d = 2, 3, 4.)
Cross-bar Switches
Processors and memories are connected by a set of switches.
This enables simultaneous (contention-free) communication between processor i and memory σ(i), where σ is an arbitrary permutation of 1…P.
Example: σ(1)=2, σ(2)=1, σ(3)=3.
Multistage Interconnect Network
Each switch has an upper output (0) and a lower output (1).
A message travels through a switch based on the destination address: each bit of the destination address controls one switch along the path from source to destination.
For example, from 001 to 100: the first switch selects the lower output (1), the second switch selects the upper output (0), and the third switch selects the upper output (0).
Contention can occur when two messages are routed through the same switch.
(Figures: an 8×8 three-stage interconnect and a 4×4 two-stage interconnect.)
Distributed Shared Memory
Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory
Hardware is used to automatically translate a memory address into a local address or a remote memory address (via message passing)
Software approaches add a programming layer to simplify access to shared objects (hiding the communications)
Pipeline and Vector Processors
Vector processors run operations on multiple data elements simultaneously
Vector processor has a maximum vector length, e.g. 512
Strip mining the loop results in an outer loop with stride 512 to enable vectorization of longer vector operations
Pipelined vector architectures dispatch multiple vector operations per clock cycle
Vector chaining allows the result of a previous vector operation to be directly fed into the next operation in the pipeline
DO i = 0,9999
  z(i) = x(i) + y(i)
ENDDO

Strip-mined with maximum vector length 512:

DO j = 0,9216,512
  DO i = 0,511
    z(j+i) = x(j+i) + y(j+i)
  ENDDO
ENDDO
DO i = 0,271
  z(9728+i) = x(9728+i) + y(9728+i)
ENDDO

The same in Fortran array syntax:

DO j = 0,9216,512
  z(j:j+511) = x(j:j+511) + y(j:j+511)
ENDDO
z(9728:9999) = x(9728:9999) + y(9728:9999)
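For comparison, a C rendering of the same strip-mined loop; VLEN = 512 mirrors the slide's assumed maximum vector length, and computing the chunk length handles the 272-element remainder without a separate cleanup loop.

```c
#include <stddef.h>

#define N    10000
#define VLEN 512   /* assumed maximum vector length, as on the slide */

float x[N], y[N], z[N];

/* Strip-mined z = x + y: the outer loop steps in chunks of VLEN so
   each inner loop fits the vector length; the final partial chunk
   (272 elements, indices 9728..9999) gets len < VLEN. */
void add_strips(void) {
    for (size_t j = 0; j < N; j += VLEN) {
        size_t len = (N - j < VLEN) ? N - j : VLEN;
        for (size_t i = 0; i < len; i++)   /* vectorizable inner loop */
            z[j + i] = x[j + i] + y[j + i];
    }
}
```

A vectorizing compiler performs this transformation automatically; writing it out only makes the outer/inner loop structure explicit.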
Comparison: Bandwidth, Latency and Capacity
Flynn’s Taxonomy
Single instruction stream, single data stream (SISD): a traditional PC system.
Single instruction stream, multiple data stream (SIMD): similar to the MMX/SSE/AltiVec multimedia instruction sets; MASPAR.
Multiple instruction stream, multiple data stream (MIMD). Single program, multiple data (SPMD) programming: each processor executes a copy of the program.
                           Data stream
                       single     multiple
  Instruction  single   SISD       SIMD
  stream       multiple            MIMD
Task Parallelism versus Data Parallelism
Task parallelism: thread-level parallelism; MIMD.
Data parallelism: multiple processors (or units) operate on a segmented data set; SIMD, vector, and pipeline machines; SIMD-like multimedia extensions, e.g. MMX/SSE/AltiVec.
Vector operation X[0:3] ⊕ Y[0:3] with an SSE instruction on a Pentium 4:

  src1:  X3      X2      X1      X0
  src2:  Y3      Y2      Y1      Y0
  dest:  X3⊕Y3   X2⊕Y2   X1⊕Y1   X0⊕Y0
Further Reading
[PP2] pages 13-26; [SPC] pages 71-95; [HPC] pages 25-28.
Optional: [SRC] pages 15-42.