
The von Neumann Architecture and Alternatives

Matthias Fouquet-Lapar
Senior Principal Engineer, Application Engineering
[email protected]


Slide 2

The von Neumann bottleneck and Moore’s law


Slide 3

Physical semi-conductor limits

• Transistor count continues to increase according to Moore's Law for the foreseeable future

• However, we are hitting physical limits of current semiconductor technology when increasing the clock frequency – the heat has to be extracted from the chip somehow

• So what can be done with all these transistors to increase performance (wall-clock time) if we can't increase the clock frequency?


Slide 4

Caches and Hyper-Threading

• In a von Neumann architecture, larger and larger on- and off-chip caches are used to hide memory latencies

• This makes it easy for the programmer – but your effective silicon utilization will be relatively low (10%–20%)

• Hyper-Threading allows additional hiding of memory references by switching to a different thread


Slide 5

Current Top-End Microprocessor Technology

• Xeon Quad-Core: 268 million transistors, 4 MB L2 cache

• Tukwila Quad-Core: 2 billion transistors, 30 MB on-die cache

• The majority of the microprocessor die area (80–90%) is used for caches


Slide 6

Many / Multi-Core

• Additional transistors can be used to add cores on the same die

• Dual / quad cores are standard today; many-core processors will be available soon

• But (and you already guessed that there is no free lunch) there is ONE problem


Slide 7

The ONLY problem

• is your application / workflow, because it has to be parallelized

• and while you're doing this, why not look at alternatives to get
  – higher performance
  – better silicon utilization
  – less power consumption
  – better price/performance


Slide 8

Application Specific Acceleration

• The way and degree of parallelization is application specific, and we feel that it is important to have a set of accelerator choices to build the hybrid compute solution for your workflow

• SGI supports different Hybrid Acceleration Alternatives


Slide 9

SGI’s history and strategic directions in Application Specific Acceleration

• 25 years of experience in accelerating HPC problems
• 10 years of experience in building application-specific accelerators for large-scale supercomputer systems
  – 1984: first ASIC-based IRIS workstation
  – 1998: TPU product for MIPS-based O2K/O3K architectures
  – 2003: FPGA product for Altix IA64-based architectures

• 2008: Launch of the Accelerator Enabling Program


Slide 10

A Disruptive Technology, 1984: Implementing High-Performance Graphics in ASICs

• 1000/1200
  – 8 MHz Motorola 68000 – PM1 (variant of the Stanford University "SUN" board)
  – 3–4 MB RAM (Micro Memory Inc., Multibus), no hard disk
  – Ethernet: Excelan EXOS/101
• Graphics:
  – GF1 Frame Buffer
  – UC3 Update Controller
  – DC3 Display Controller
  – BP2 Bitplane


Slide 11

1998: Tensor Processing Unit

An advanced form of Array Processing that is:
• Adaptive in execution
• Hybrid in computational model
• Library-based in programming model

Primitive Library:
• FFT – Fourier Transform
• RFG – Reference Function Generator & Complex Multiply
• CTM – Corner Turn
• DET – I² + Q² Detect

[Diagram: chain development – synchronized / concatenated chains of primitives (forward and inverse FFTs, range compression / decompression with matched filter, phase change) are scheduled and controlled by the host across a TPU array (TPU-1 … TPU-4); each TPU contains buffer memory, a sequence controller, DMA and functional units (Mult/Acc, Acc, FxAdd, FxMult, BLU, shift, DDA), and the TPUs attach to the HUB and crossbar alongside the microprocessors, caches and memory.]


Slide 12

TPU: Application-Specific Acceleration

• 10 floating-point functional units (1.5 GFLOPS)
• 12 application-specific functional units (3.1 GOPS)
• 8 MB SRAM buffer memory (6.4 GB/s)
• Crossbar interconnect
• "Deeply pipelined & heavily chained vector operations"
• TPU XIO card

Example: Image Formation application acceleration (4096 × 8192 32-bit initial problem set)
[Diagram: four concatenated processing chains separated by global transposes, streaming to and from Origin memory.]


Slide 13

TPU: Adaptive Processing (customer co-development)

[Diagram: TPU data path – host memory is reached over XTalk via DMA; floating-point and fixed-point crossbars connect the input/output buffer, buffer memory, register files and functional units (FxAdd, FxMult, BLU, Shift, Mult/Acc, Acc, Flt/Fx conversion, DDAs), all driven by microcode memory and a sequence controller fed from data-source, data-destination and microcode-source logical arrays.]

Key characteristics:
• Programmable sequence controller
• Reconfigurable data paths
• Data-dependent dynamic addressing
• Very large buffer memory
• Multiple independent arithmetic, logical and addressing resources


Slide 14

SGI Tensor Processing Unit - Applications

• RADAR
• Medical Imaging (CT)
• Image Processing
  – Compression
  – Decompression
  – Filtering

• Largest installation with 150 TPUs
• Continues to be sold and fully supported today


Slide 15

Accelerator Enabling Program

• FPGA Technology:
  – SGI RC100
  – XDI 2260i

• GPU & TESLA / NVIDIA
• ClearSpeed
• PACT XPP
• Many-core CPUs
• IBM Cell


Slide 16

Accelerator Enabling Program

• Increased focus on work with ISVs and key customers providing the best accelerator choice for different scientific domains

• The Application Engineering group, with approximately 30 domain experts, will play an essential role

• Profiling existing applications and workflows to identify hot-spots – a very flat profile will be hard to accelerate


Slide 17

SGI Options for Compute Bound Problems

• Specialized computing elements
• Most of them come from the embedded space:
  – High density
  – Very low power consumption
  – Programmable in high-level languages

• These attributes make them a very good fit for High Performance Computing


Slide 18

FPGA - RASC RC100


Slide 19

ALTIX 450/4700 ccNUMA System Infrastructure

[Diagram: CPU nodes (CPUs with caches, physical memory and an interface chip) connect over the NUMAlink™ interconnect fabric to TIO nodes providing general-purpose I/O interfaces, scalable GPUs, and RASC™ FPGAs.]


Slide 20

SGI® RASC™ RC100 Blade

[Diagram: two application FPGAs (Xilinx Virtex-4 LX200), each with five SRAM banks and its own configuration path, attach through a loader FPGA (V4LX20) with PCI and SSP links to two TIO ASICs; the NL4 (NUMAlink 4) ports provide 6.4 GB/s each.]


Slide 21

RASCAL: Fully Integrated Software Stack for Programmer Ease of Use

[Diagram: in user space the application sits on top of performance tools (SpeedShop™), a debugger (GDB), the abstraction layer library, download utilities and the device manager; the algorithm device driver and download driver live in the Linux® kernel; underneath is the COP hardware (TIO, algorithm FPGA, memory, download FPGA).]


Slide 22

Software Overview: Code example for wide scaling

Application using 1 FPGA:

strcpy(ar.algorithm_id, "my_sort");
ar.num_devices = 1;
rasclib_resource_alloc(&ar, 1);

Application using 70 FPGAs:

strcpy(ar.algorithm_id, "my_sort");
ar.num_devices = 70;
rasclib_resource_alloc(&ar, 1);
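The only change needed to scale from one FPGA to 70 is the num_devices field; the RASCAL abstraction layer and device manager shown on the previous slide take care of allocating the additional devices.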


Slide 23

Demonstrated Sustained Bandwidth

[Chart: RC100 system bandwidth in GB/s (0–160) versus number of FPGAs (0–64).]


Slide 24

As well as running "off-the-shelf" real-world applications

• Using Mitrion Accelerated BLAST 1.0 without modifications
• Using RASCAL 2.1 without modifications
• Test case:

– Searching the Unigene Human and Refseq Human databases (6,733,760 sequences; 4,089,004,795 total letters) with the Human Genome U133 Plus 2.0 Array probe set from Affymetrix (604,258 sequences; 15,106,450 total letters).

– Representative of current top-end research in the pharmaceutical industry.

– Wall-Clock Speedup of 188X compared to top-end Opteron cluster


Slide 25

skynet9: World's largest FPGA SSI supercomputer

• World's largest FPGA SSI supercomputer:
  – Set up in 2 days using standard SGI components
  – Operational without any changes to HW or SW
  – 70 FPGAs (Virtex-4 LX200)
  – 256 GB of common shared memory
  – Successfully ran a bio-informatics benchmark
  – Running 70 Mitrionics BLAST-N execution streams: speedup > 188x compared to a 16-core Opteron cluster

• Aggregate I/O bandwidth running 64 FPGAs from a single process: 120 GB/s


Slide 26

Picture please ☺


Slide 27

FPGA – XDI 2260i

FPGA - RASC RC100


Slide 28

XDI 2260i FSB socket

[Diagram: two application FPGAs (Altera Stratix III SE 260), each with 8 MB SRAM, connect at 9.6 GB/s to a bridge FPGA backed by 32 MB of configuration flash; the bridge attaches to the Intel FSB at 8.5 GB/s.]


Slide 29

Intel Quick Assist

• Framework for x86-based accelerators
• Unified interface for different accelerator technologies:
  – FPGAs
  – Fixed- and floating-point accelerators

• It provides the flexibility to use SW implementations of accelerated functions behind a unified API, regardless of the accelerator technology used


Slide 30

Intel Quick Assist AAL overview

[Diagram: the application talks to the Accelerator Abstraction Services (AAS) and the AFU proxy through interfaces such as ISystem, IFactory, IEDS, IAFUFactory, IRegistrar and the IAFU / IPIP callbacks; the Accelerator Interface Adaptor (AIA) and the AHM (Accelerator HW Module) driver sit in kernel mode above the Intel FSB (Front-Side Bus).]


Slide 31

NVIDIA – G80 GPU & TESLA


Slide 32

TESLA and CUDA

• GPUs are massively multithreaded many-core chips
  – NVIDIA Tesla products have up to 128 scalar processors
  – Over 12,000 concurrent threads in flight
  – Over 470 GFLOPS sustained performance

• CUDA is a scalable parallel programming model and a software environment for parallel computing
  – Minimal extensions to the familiar C/C++ environment (see the sketch below)
  – Heterogeneous serial-parallel programming model

• NVIDIA's TESLA GPU architecture accelerates CUDA
  – Exposes the computational horsepower of NVIDIA GPUs
  – Enables general-purpose GPU computing

• CUDA also maps well to multi-core CPUs!
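To make the "minimal extensions to familiar C/C++" point concrete, here is a small sketch (not taken from the original slides; the names saxpy, n, a, x and y are illustrative only) of a CUDA kernel and its launch:

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    /* each GPU thread computes one element of y = a*x + y */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void run_saxpy(int n, float a, const float *d_x, float *d_y)
{
    /* launch enough 256-thread blocks to cover all n elements;
       d_x and d_y are assumed to already be device pointers */
    int blocks = (n + 255) / 256;
    saxpy<<<blocks, 256>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();   /* wait for the kernel to finish */
}

The loop body stays ordinary C; only the per-thread index computation and the <<<blocks, threads>>> launch syntax are CUDA-specific, which is what the heterogeneous serial-parallel model above refers to.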


Slide 33

TESLA G80 GPU architecture

[Diagram: the host feeds an input assembler and thread execution manager; groups of thread processors, each with its own parallel data cache, share load/store access to global memory.]


Slide 34

TESLA / NVIDIA

• Thread Execution Manager issues threads
• 12,288 concurrent threads managed by HW
• 128 Thread Processors grouped into 16 Multiprocessors (SMs)
• 1.35 GHz × 128 processors == 518 GFLOPS (SP)
• Parallel Data Cache (shared memory) enables thread cooperation
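(The 518 GFLOPS figure corresponds to 128 processors × 1.35 GHz × 3 single-precision operations per clock – a multiply-add plus a multiply.)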


Slide 35

PACT XPP Overview: From von Neumann Instructions to Configuration/Data Flow

• Intel FSB-based accelerator
  – Plug-in compatible with the Intel XEON family, fits a XEON socket
  – Intel QuickAssist support

• 48 FNC PAEs (Function Processing Array Elements)
  – 16-bit VLIW cores processing sequential algorithms
  – 84,500 MIPS raw performance
  – Optimized for control-flow-dominated, irregular code
  – Chained operations (up to 4 levels) can be combined with predicated execution, allowing for example an if-then-else statement to execute in a single cycle (see the sketch below)
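As a rough C illustration (not from the original slides), both branches of the if-then-else below are simple assignments, which is exactly the kind of statement that can be mapped to predicated execution instead of a branch:

/* Saturate each 16-bit sample to a limit; illustrative only. */
void saturate(short *dst, const short *src, int n, short limit)
{
    for (int i = 0; i < n; i++) {
        if (src[i] > limit)
            dst[i] = limit;     /* "then" branch */
        else
            dst[i] = src[i];    /* "else" branch */
    }
}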


Slide 36

PACT XPP Overview: From von Neumann Instructions to Configuration/Data Flow

• 912 ALU Processing Array Elements
  – 16-bit ALUs for dataflow processing
  – Logic operations, basic arithmetic operators (add/subtract)
  – "Special arithmetic" including comparators and multipliers
  – Reconfigurable compute fabric (reconfiguration through the FNC PAE or event driven)
  – Supports partial reconfiguration
  – 319 GOPS raw performance
  – Scalable ALU architecture: 16-, 32-, 64-bit integer processing

• Fast program development and short turnaround cycles
  – ANSI C programming and debugging environment


Slide 37

PACT XPP-3m Single Core


Slide 38

PACT XPP-3m XEON compatible Accelerator Module

• 4× XPP-3c DSP
• Fits into a XEON 5000/7000 socket
• FSB bandwidth 8 GB/s peak
• Module dimensions according to the keep-out area
• Local memory eliminates system memory accesses
• XPP-3c–to–FSB bridge interconnect via a proprietary serial high-speed interface
  – 2.5 GB/s per channel
  – 10 GB/s total

• Power dissipation 30 W


Slide 39

PACT XPP FSB module


Slide 40

PACT – XPP Tool Chain

C-based
  – ANSI-C / C++ based development system
  – Cycle-accurate simulator
  – Debugging environment

Libraries
  – Intel IPP functions
  – DSP library
  – Imaging
  – Video and audio
  – Crypto (DES, AES)

System integration
  – Intel AAL / QuickAssist


Slide 41

XPP-3 Benefits and Target Applications

Target Applications
  – High-end DSP processing: video, imaging, biomedical, recognition/analysis
  – Search and pattern matching
  – Integer and fixed-point algebra

Benefits
  – Software development flow (ANSI C / C++)
  – Fast compile times and turnaround cycles
  – Extensive libraries
  – Intel QuickAssist support
  – Low power dissipation (30 W including memory)
  – Reduced system load: local memory eliminates main memory accesses
  – Price/performance ratio


Slide 42

PACT XPP next generation

• Increasing core frequency from 350 MHz to 1.2 GHz
• Quadrupling the number of processing units (ALU-PAEs)
• Process shrink from 90 nm to 45 nm
• Addition of configurable floating-point ALUs
  – 16-bit FP: 240 GFLOPS
  – 32-bit FP: 120 GFLOPS
  – 64-bit FP: 60 GFLOPS


Slide 43

PACT Example: 2D Edge Detection

#include "XPP.h"
...
main() {
    int v, h, inp;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    /* Read input stream into the buffer */
    for (v = 0; v < VERLEN; v++)
        for (h = 0; h < HORLEN; h++) {
            XPP_getstream(1, 0, &inp);
            p1[v][h] = inp;
        }

    /* 2D edge detection */
    for (v = 0; v <= VERLEN - 3; v++) {
        for (h = 0; h <= HORLEN - 3; h++) {
            htmp = (p1[v+2][h]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0) htmp = -htmp;

            vtmp = (p1[v][h+2]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0) vtmp = -vtmp;

            sum = htmp + vtmp;
            if (sum > 255) sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    /* Stream the result from the buffer to a port */
    for (v = 1; v < VERLEN - 1; v++)
        for (h = 1; h < HORLEN - 1; h++)
            XPP_putstream(4, 0, p2[v][h]);
}


Slide 44

PACT Example: 2D Edge Detection – Flow graph of module edge_m2

Result: 3 modules
• Edge_m1: loads the RAMs
• Edge_m2: calculates
• Edge_m3: flushes the RAMs


Slide 45

ClearSpeed's "smart SIMD" MTAP processor core

• Multi-Threaded Array Processing:
  – Designed for high performance, low power
  – Modular design
  – Programmed in high-level languages
  – Asynchronous, overlapped I/O
• Scalable array of many Processing Elements (PEs):
  – Coarse-grained data-parallel processing
  – Supports redundancy and resiliency
• Programmed in an extended version of C called Cn:
  – Single "poly" data type modifier (see the sketch below)
  – Rich expressive semantics

[Diagram: MTAP core – a mono controller (with instruction and data caches), control-and-debug logic and a poly controller drive an array of PEs (PE 0, PE 1, … PE n-1); programmable I/O connects the PEs to DRAM, with links to the system and peripheral networks.]
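As a rough sketch of the "poly" modifier (assumed Cn-style syntax, not taken from the slides; the mono keyword for ordinary scalar data is likewise an assumption here), a poly variable has one instance on every PE, so a single statement operates on all PEs at once:

/* Hypothetical Cn-style fragment */
poly float sample;        /* one copy of 'sample' per Processing Element  */
mono float gain = 2.0f;   /* a single scalar value, broadcast to the PEs  */

sample = sample * gain;   /* executes simultaneously on every PE          */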


Slide 46

ClearSpeed: Smart SIMD Processing Elements

• Multiple execution units
  – Floating-point adder
  – Floating-point multiplier
  – Fixed-point MAC 16×16 → 32+64
  – Integer ALU with shifter
  – Load/store
• High-bandwidth, 5-port register file
• Fast inter-PE communication path (swazzle)
• Closely coupled SRAM for data
  – Keeping data close is key to low power
• High-bandwidth per-PE DMA (PIO)
• Per-PE address generators
  – Complete pointer model, including parallel pointer chasing and vectors of addresses
  – Key for gather/scatter and sparse operations
• Platform design enables PE variants
  – E.g. memory size, data path widths etc.

[Diagram: a single PE – FP multiplier, FP adder, MAC and ALU feed a 128-byte register file and PE SRAM (e.g. 6 KB); a memory address generator handles PIO collection and distribution, with swazzle connections to the neighbouring PEs (PEn–1, PEn+1).]


Slide 47

ClearSpeed – The ClearConnect™ "Network on Chip" system bus

• Scalable, re-usable System-on-Chip (SoC) platform interconnect
• "Network-on-a-chip" / system backbone
• Used to connect together all the major blocks on a chip:
  – Multiple MTAP smart SIMD cores
  – Multiple memory controllers
  – On-chip memories
  – System interfaces, e.g. PCI Express
• Blocks are connected into a unified memory architecture
• Distributed arbitration
• Scalable bandwidths
• Low power:
  – Power dissipation proportional to the volume of data and the distance travelled

[Diagram: ClearConnect nodes (CCNODEs) linking SoC blocks A and B.]


Slide 48

The platform realized: The ClearSpeed CSX700 processor

• Includes dual MTAP cores:
  – 96 GFLOPS peak (32- & 64-bit)
  – 10–12 W typical power consumption, 18 W maximum
  – 250 MHz clock speed
  – 192 Processing Elements (2×96)
  – 8 spare PEs for resiliency
  – ECC on all internal memories
• ClearConnect™ on-chip interconnect
• Dual integrated 64-bit DDR2 memory controllers with ECC
• Integrated PCI Express x16
• Integrated ClearConnect™ chip-to-chip bridge port (CCBR)
• 2×128 KB of on-chip SRAM
• IBM 90 nm process
• 266 million transistors


Slide 49

ClearSpeed software development environment

• Cn optimising compiler
  – C with the poly extension for SIMD data types
  – Uses the ACE CoSy compiler development system
• Assembler, linker
• Simulators
• Debuggers – csgdb
  – A port of the GNU debugger gdb for ClearSpeed's hardware
• Profiling – csprof
  – Heterogeneous visualization of an application's performance while running on both a multi-core host and ClearSpeed's hardware; intimately integrates with the debugger
• Libraries (BLAS, RNG, FFT, more…)
• High-level APIs, streaming etc.
• Available for Windows and Linux (Red Hat and SLES)
  – Open-source driver for Linux


Slide 50

ClearSpeed - Accelerated applications

• Molpro electronic molecular structure code
  – Collaboration with Dr Fred Manby's group at Bristol University's Centre for Computational Chemistry
• Sire QM/MM free energy code
  – Collaboration with Dr Christopher Woods
• BUDE molecular dynamics-based drug design
  – Collaboration with Dr Richard Sessions at Bristol University's Department of Biochemistry
• Amber 9 implicit (production code)
  – Molecular dynamics simulation with implicit solvent
• Amber 9 explicit (demonstration)
  – Molecular dynamics simulation with explicit solvent
• Monte Carlo based financial codes


Slide 51

ClearSpeed - Amber 9 sander application acceleration

• Accelerated Amber 9 sander implicit shipping now (download from the ClearSpeed website)
• Amber 9 sander explicit available in beta soon
• On CATS, both explicit and implicit should run as fast as 10 to 20 of the fastest contemporary x86 cores
  – Saving 90–95% of the simulation time -> more simulations
  – Alternatively, enabling larger, more accurate simulations
• While reducing power consumption by 66%
• And increasing server room capacity by 300%


Slide 52

ClearSpeed: Other acceleratable applications

• Computational Fluid Dynamics, Smoothed Particle Hydrodynamics methods
  – Collaborations with Jamil Appa at BAE Systems and Dr Graham Pullan at the Whittle Laboratory, Cambridge University
• Star-P from Interactive Supercomputing
• MATLAB and Mathematica when performing large matrix operations, such as solving systems of linear equations


Slide 53

ClearSpeed CATS: New Design Approach Delivers 1 TFLOP in 1U

• 1U standard server
  – Intel 5365 3.0 GHz, 2-socket, quad core
  – 0.096 DP TFLOPS peak
  – Approx. 600 watts
  – Approx. 3.5 TFLOPS peak in a 25 kW rack

• ClearSpeed Accelerated Terascale System
  – 24 CSX600 processors
  – ~1 DP TFLOPS peak
  – Approx. 600 watts
  – Approx. 19 TFLOPS peak in a 25 kW rack
  – 18 standard servers & 18 acceleration servers, connected by 2 PCIe x8 cables


Slide 54

Lessons learned - 1

• Not every accelerator technology is applicable to every HPC problem – and it's a moving target
• Not every HPC application can be accelerated
• The majority of users are interested in complete appliances/applications
• "Ease-of-use" is relative
• Standards are starting to emerge now, which will be key to broader acceptance


Slide 55

Lessons learned – 2

• Domain knowledge is absolutely critical when working on acceleration
• Enabling accelerators for ISVs will be key
• Strategic partnerships are a great way to create synergies
• Think "out-of-the-line" to create unique solutions


Slide 56

Lessons learned - 3

• Keep in mind that a solution has to be price/performance competitive.

• There should be at least a 10-20X speedup compared to current top-end CPUs

• CPU speeds (and multi-core architectures) continue to evolve

• Make sure your Algorithm Implementation scales across multiple Accelerators

• Talk to the experts – we are here to help you


Slide 57

A broader view of Hybrid Computing

[Diagram: hybrid computing environment – high-throughput processes run on SGI® Altix® XE clusters and SGI Altix® shared-memory systems; specialized scalable performance comes from SGI RASC™ RC100 FPGA solutions and accelerators (PACT XPP, ClearSpeed, XDI 2260i, TESLA); data flows to storage solutions and data archival across maximum-performance, medium-performance and online-archive storage tiers.]


Thank You