(Keynote) (From HPC to) New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances

Reiner Hartenstein

TU Kaiserslautern

Rhodes Island, Greece, April 25-26, 2006

Reconfigurable Supercomputing (VHPC) going commercial

Cray XD1

Silicon Graphics RASC

… and other vendors

… it‘s a paradigm shift!

The Pervasiveness of RC

[Figure: # of hits by Google for “FPGA and …” search terms, shown for the ECE-savvy scene and for the Math/SW-savvy scene; the per-term counts range from 113,000 to 1,620,000.]

unqualified for RC ?

Methodology ?

world-wide a mass movement

reminds me of the mass migration of lemmings

terminology chaos

not really a sense of direction

an urgent need to get organized

>> Outline <<

• Reconfigurable Computing Paradox

• The Supercomputing Paradox

• We are using the wrong model

• Coarse-grained Reconfigurable Devices

• Super Pentium for Desktop Supercomputer

http://www.uni-kl.de

The Reconfigurable Computing Paradox

poor FPGA technology: very poor effective integration density, „very power-hungry“ [Rick Kornfeld*], lower clock frequencies, and more expensive.

poor tools: very poor application development support; languages and tools unacceptable for software people; most hardware experts (86%**) hate their tools

poor education: RC education extremely poor, or none; ignored by CS curricula (… teach like for a 50 year old mainframe …)

However, brilliant results everywhere

what paradox ?

*) personal communication **) DeHon ‘98

Education ?

Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing

FPGA & synonyms: 0 hits (Google: 10 million hits)

not even here

Completed ?

Computing Curricula v.2005: no changes other than „… FPGA, etc.“ (not really mentioning that it‘s missing)

Taskforce activity completed? Next task force in 2020 or later?

Tools ?

End of this week: brainstorming session at DARPA (urgently needed, overdue!)

fine-grained RC: 1st DeHon‘s Law [1996: Ph. D, MIT]

Technology: immense area inefficiency

reconfigurability overhead: wiring overhead, routing congestion

overhead: >> 10 000

[Figure: density in transistors / microchip over the years 1980 to 2010, axis starting at 100: the Gordon Moore curve (microprocessor) compared with FPGA physical, FPGA routed, and FPGA logical density.]

published speed-up factors

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

[Figure: published speed-up factors, plotted on a scale from 100 to beyond 10 000 over the years 1980 to 2010. Reference growth curves: x2/yr, 50%/yr, and x1.25/yr (Moore); Pentium 4, microprocessor, and memory are marked.

Application areas: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics; crypto.

Data points (value where legible): Los Alamos traffic simulation; real-time face detection (6000); video-rate stereo vision (900); pattern recognition (730); SPIHT wavelet-based image compression (457); Smith-Waterman pattern matching; BLAST (52); protein identification; molecular dynamics simulation; Reed-Solomon decoding (2400); Viterbi decoding; FFT; GRAPE (20). A further value of 1000 carries a truncated label ("MA").

pre-FPGA era (no FPGA: DPLA on MoM Xputer architecture, by TU-KL): grid-based DRC (2000; 15000 in the „fair comparison“); 2-D FIR filter [TU-KL]; Lee routing (by TU-KL).]

pre FPGA era: Why DPLA* was so good

Close to Moore because of small overhead (wiring, programmability, routing)

Large arrays of canonical boolean expressions

PLA layout similar to RAM / ROM layout

Mid-1980s: first very tiny FPGAs available

*) designed by TU-KL, fabricated by E.I.S. German multi university project

(anti-von-Neumann machine paradigm)

Data Counter instead of Program Counter

Generalization of the DMA

GAG: Generic Address Generator to avoid address computation overhead

ASM: Auto-Sequencing Memory (data counter, GAG, RAM)

GAG & enabling technology: published 1989 [by TU-KL], patented by TI 1995

Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC & TU-KL

Storage scheme optimization methodology, etc.
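A minimal sketch of the data-counter idea, assuming a simple rectangular scan pattern: the address sequence is produced by a generator that sits next to the memory, so no instruction stream is spent on address arithmetic. The names `gag_block_scan` and `stream_from_memory` and the scan parameters are hypothetical illustrations, not the published GAG design.

```python
from typing import Iterator, List


def gag_block_scan(base: int, row_stride: int, width: int, height: int) -> Iterator[int]:
    """Hypothetical address generator: row-by-row scan of a width x height block."""
    for y in range(height):
        for x in range(width):
            # The "data counter" advances by itself; no program counter is involved.
            yield base + y * row_stride + x


def stream_from_memory(ram: List[int], addresses: Iterator[int]) -> Iterator[int]:
    """Auto-sequencing memory sketch: turns an address sequence into a data stream."""
    for addr in addresses:
        yield ram[addr]


# A 4 x 4 array stored row-major, holding the values 100..115.
ram = list(range(100, 116))

# Stream out the 2 x 2 block whose top-left element sits at address 5.
window = list(stream_from_memory(ram, gag_block_scan(base=5, row_stride=4, width=2, height=2)))
print(window)
```

Changing the scan pattern means changing only the generator parameters; the consumer of the data stream stays untouched.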

Thousands or Millions of $ for free

Application migration [from supercomputer] results not only in massive speed-ups.

Electricity bills are reduced by an order of magnitude and even more: you may get for free up to millions of dollars per year

(also a matter of national energy policy)

GoogleAmsterdam

Reconfigurable Scientific Computing

How do software types program the FPGAs? By hiring a good student from the EE Dept.?

Because of missing RC education: far away from optimum solutions?

Much higher speed-up achievable? 1 or 2 more orders of magnitude? 100,000? 1,000,000?

By education: better speed-up factors ?

[Figure: the published speed-up chart shown again, with the same axes (100 to 10 000, 1980 to 2010), reference curves (x2/yr, 50%/yr, x1.25/yr Moore), and application areas.]

The Supercomputing Paradox

Growing listed Teraflops

Often limited sustained Teraflops

Almost stalled application implementation progress

Increasing number of processors running in parallel

COTS processor decreasing cost

Very high total cost of the Tera(?)flops

promising technology

poor results

Scientists waiting for affordable compute capacity

The Law of More

Why traditional supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry

the wrong way in which the data are moved around

because of the wrong multi-core interconnect architecture

…y unbalanced

stolen from Bob Colwell

Earth Simulator

moving data around inside the Earth Simulator

5120 processors, 5000 pins each

ES 20: TFLOPS

Crossbar weight: 220 t, 3000 km of thick cable

Bringing together data and processor

Moving data to the processor by software: moving the grand piano

coarse-grained RC: Hartenstein‘s Law [1996: ISIS, Austin, TX]

area efficiency very close to Moore‘s law

KressArray family

[Figure: density in transistors / microchip over the years 1980 to 2010, axis starting at 100: the Gordon Moore curve compared with rDPA physical and rDPA logical density; FPGA routed (>> 10 000) is shown for reference.]

higher speed-up factors by coarse-grained?

[Figure: the published speed-up chart shown again, with the same axes, reference curves, and application areas, plus an added marker at 1000 whose label is truncated ("Coa…").]

Coarse grain is about computing, not logic

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

no CPU

Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing; port location marker; backbus connect

SW 2 coarse-grained CW migration example

[Figure: array of rDPUs]

Compare it to software solution on CPU

hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif);

simple conservative CPU example, C = 1, clock 200 MHz (5 nanosec)

step                              memory cycles   nanoseconds
if C then read A:
  read instruction                1               100
  instruction decoding
  read operand*                   1               100
  operate & reg. transfers
if not C then read B:
  read instruction                1               100
add & store:
  read instruction                1               100
  operate & reg. transfers
  store result                    1               100
total                             5               500

*) if no intermediate storage in register file
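The tally on this slide can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: every listed memory access (three instruction reads, one operand read, one result store) costs one memory cycle of 100 ns, and decoding and register transfers cost no memory cycle, which is consistent with the stated total of 5 cycles and 500 ns. The step list and function names are illustrative only.

```python
MEMORY_CYCLE_NS = 100  # assumption taken from the slide: 100 ns per memory cycle

# (step, memory cycles) for the case C = 1; decode/operate steps need no memory access.
STEPS = [
    ("if C then read A: read instruction", 1),
    ("instruction decoding", 0),
    ("read operand", 1),
    ("operate & register transfers", 0),
    ("if not C then read B: read instruction", 1),
    ("add & store: read instruction", 1),
    ("operate & register transfers", 0),
    ("store result", 1),
]


def total_cost(steps, cycle_ns=MEMORY_CYCLE_NS):
    """Return (memory cycles, nanoseconds) for an instruction-stream execution."""
    cycles = sum(count for _, count in steps)
    return cycles, cycles * cycle_ns


def branch_example(r, a, b, c):
    """Reference semantics of S = R + (if C then A else B endif)."""
    return r + (a if c else b)


cycles, nanoseconds = total_cost(STEPS)
print(cycles, nanoseconds)
```

The point of the comparison is that all five memory cycles are overhead of the instruction stream; none of them is the addition itself.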

Why the speed-up? What‘s the difference?

moving the locality of operation into the route of the data stream by P&R

instead of moving data by instruction streams

The wrong mind set ....

„but you can‘t implement decisions!“

[Figure: section of a very large pipe network containing a decision; same legend as the KressArray mapping by Ulrich Nageldinger]

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

We need Reconfigurable Computing Education
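A minimal sketch of how a decision can live inside a pipe network: the condition travels as one more data stream and drives a selector placed in the data path, so no branch instruction is fetched. Python generators stand in for pipeline stages here; the operator names `select` and `add` are hypothetical and not taken from the KressArray tools.

```python
from typing import Iterable, Iterator


def select(c_stream: Iterable[int], a_stream: Iterable[int], b_stream: Iterable[int]) -> Iterator[int]:
    """Multiplexer stage: forwards A where C is true, otherwise B."""
    for c, a, b in zip(c_stream, a_stream, b_stream):
        yield a if c else b


def add(x_stream: Iterable[int], y_stream: Iterable[int]) -> Iterator[int]:
    """Adder stage: element-wise sum of two data streams."""
    for x, y in zip(x_stream, y_stream):
        yield x + y


def pipe(r, c, a, b) -> Iterator[int]:
    """Pipe network for S = R + (if C then A else B endif)."""
    return add(r, select(c, a, b))


result = list(pipe(r=[10, 20, 30], c=[1, 0, 1], a=[1, 2, 3], b=[100, 200, 300]))
print(result)
```

Both operand streams flow into the selector for every item; the decision costs a multiplexer in the data path rather than a conditional jump in an instruction stream.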


The new paradigm: how the data are traveling

not transport-triggered: old hat

pipeline, or chaining

super systolic array

no, not by instruction execution

DPU DPU DPU

vN Move Processor

instruction-driven

+ instruction-driven

[Jack Lipovski, EUROMiCRO, Nice, 1975]

P&R: move locality of operation, not data !


Data streams (pipe network)

H. T. Kung paradigm (systolic array)

[Figure: input data streams and output data streams, each drawn as port # over time]

define: ... which data item at which time at which port

implemented by distributed memory: ASM (Auto-Sequencing Memory) with data counter, GAG, and RAM

50 & more on-chip ASM are feasible
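A data stream in this sense is fully described by a schedule that says which data item arrives at which port at which time. The sketch below builds such a schedule for a small matrix, assuming one port per matrix row and a one-step time skew between neighboring ports, as is common in systolic arrays; the skew rule and the function names are assumptions for illustration.

```python
def skewed_schedule(matrix):
    """Map (time, port) -> data item: row p enters at port p, delayed by p time steps."""
    schedule = {}
    for port, row in enumerate(matrix):
        for j, item in enumerate(row):
            schedule[(j + port, port)] = item
    return schedule


def items_at(schedule, time, n_ports):
    """Data items present at each port at the given time (None means the port is idle)."""
    return [schedule.get((time, port)) for port in range(n_ports)]


matrix = [[1, 2, 3],
          [4, 5, 6]]
schedule = skewed_schedule(matrix)

for t in range(4):
    print(t, items_at(schedule, t, n_ports=2))
```

With distributed auto-sequencing memories, each port's column of this schedule would be produced by its own address generator.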


The Generalization of the Systolic Array

systolic array: only for applications with regular data dependencies

remedy? discard algebraic synthesis methods

[R. Kress]: use optimization algorithms, e. g.: simulated annealing

Achievement: also non-linear and non-uniform pipes, and even more wild pipe structures possible

reconfigurability makes sense

Kress-Kung paradigm: super systolic array
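To illustrate what mapping by an optimization algorithm can look like, here is a small simulated-annealing placement sketch. It places operators on a grid of cells and minimizes the Manhattan wire length between connected operators. This is a generic textbook-style sketch, not the actual KressArray mapper; the cost function, the move set, the cooling schedule, and all names are assumptions.

```python
import math
import random


def wire_length(placement, nets):
    """Sum of Manhattan distances over all connected operator pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
               for a, b in nets)


def anneal_placement(ops, nets, cols, rows, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Return (best placement, best cost, initial cost) found by simulated annealing."""
    rng = random.Random(seed)
    cells = [(x, y) for y in range(rows) for x in range(cols)]
    rng.shuffle(cells)
    placement = {op: cells[i] for i, op in enumerate(ops)}
    free = cells[len(ops):]
    cost = initial_cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        op = rng.choice(ops)
        candidate = dict(placement)
        if free and rng.random() < 0.5:
            # Move the operator to a free cell; its old cell becomes free.
            i = rng.randrange(len(free))
            new_free = list(free)
            candidate[op], new_free[i] = free[i], placement[op]
        else:
            # Swap the cells of two operators.
            other = rng.choice(ops)
            candidate[op], candidate[other] = placement[other], placement[op]
            new_free = free
        delta = wire_length(candidate, nets) - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            placement, free, cost = candidate, new_free, cost + delta
            if cost < best_cost:
                best, best_cost = dict(placement), cost
    return best, best_cost, initial_cost


ops = ["load", "mul", "add", "store"]
nets = [("load", "mul"), ("mul", "add"), ("add", "store")]
best, best_cost, initial_cost = anneal_placement(ops, nets, cols=4, rows=4)
print(initial_cost, best_cost, best)
```

Because the search does not rely on regular data dependencies, the same loop handles irregular pipe structures; only the net list changes.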


Here is the common model

[Diagram: instruction-stream-based side with software code; data-stream-based side with configware code for the reconfigurable accelerator; a hardwired accelerator is also shown.]

it’s not von Neumann

the vN monopoly in our curricula is severely harmful

the tail is wagging the dog

we need dual paradigm education

A potential Pentium successor

Discard most caches

have 64* cores, 0.5 - 1 GHz

with clever interconnect for concurrent processes, for multithreading, and for the Kung-Kress pipe network

The Desk-top Supercomputer!

*) CPU mode / DPU mode capability

“Super Pentium” configuration example

[Figure: array of cores, some configured as rDPUs and some as CPUs]

World TV & game console & multi media center

e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

http://pactcorp.com

[Diagram: rDPA surrounded by Games, Music, Videos, SMeXPP, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards, LCD Display]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes

Dual Paradigm Application Development

[Diagram: a high level language source feeds the software/configware co-compiler, which produces instruction-stream-based software code and data-stream-based configware code; a hardwired accelerator is also shown.]

Software / Configware Co-Compilation

Juergen Becker’s CoDe-X, 1996

[Diagram: C language source enters the Partitioner, which takes Resource Parameters (supporting different platforms) and feeds the SW compiler for the CPU and the CW compiler with Placement & Routing (move the locality of operation).]

Bringing together data and processor

Move the stool by Configware: place the location of execution into the data pipe

>> Conclusions <<

• Conclusions

http://www.uni-kl.de

Conclusions (1): Hurdles

Obstacles are:

unbelievably disastrous tools market:

unbelievably ignorant curricula:

enabling technologies available, partly decades old, but not used

transdisciplinary models not available nor taught at CS, nor elsewhere

fragmentation into application-domain-specific cultures and trick boxes

… teach like for a 50 year old mainframe …


Conclusions (2): Future Work

CS must recognize and accept its strategic role and its responsibility toward all its application disciplines: embedded and scientific computing.

The monopoly of the von-Neumann-based mind set in CS education:

heavily stalls progress in R&D, not only in HPC

causes high cost in R&D, not only in supercomputing

The von-Neumann-only-based mind set in CS urgently needs to go: adopt the dual paradigm common model

CS graduates are not qualified for our job market

Conclusions (3): Chances

New horizons: chances are brilliant


thank you


Backup:


Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer

Compilation: Software vs. Configware

Software Engineering: source program, software compiler, software code

Configware Engineering: source „program“, configware compiler with mapper (placement & routing) and scheduler, configware code and flowware code

source languages: C, FORTRAN, MATLAB

Nick Tredennick’s Paradigm Shifts explain the differences

Software Engineering: resources: fixed; algorithm: variable (software); 1 programming source needed

Configware Engineering: resources: variable (configware); algorithm: variable (flowware); 2 programming sources needed

Co-Compilation

[Diagram: Software / Configware Co-Compiler. The source (C, FORTRAN, MATLAB) enters the automatic SW / CW partitioner (simulated annealing). The software compiler produces software code; the configware compiler, with mapper and scheduler, produces configware code and flowware code. Simulated annealing is annotated a second time inside the diagram.]

Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

[Diagram: Software / Flowware Co-Compiler. The source enters the automatic SW / CW partitioner; the software compiler produces software code, and the flowware compiler with its scheduler produces flowware code.]

The first archetype machine model: “von Neumann”

Software Industry’s Secret of Success: simple basic Machine Paradigm

mainframe

compile or assemble: procedural personalization

personalization: RAM-based

instruction-stream-based mind set

The 2nd archetype machine model: “Kress-Kung”

Configware Industry’s Secret of Success: simple basic Machine Paradigm

compile: structural personalization

personalization: RAM-based

data-stream-based mind set

„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
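The quoted saving can be turned into an implied power figure. This is a worked-arithmetic sketch, assuming the rack runs continuously all year (8760 hours); that duty cycle is an assumption and is not stated in the quote. Since the quote says "more than", the results are lower bounds.

```python
PRICE_PER_KWH = 0.07          # dollars per kWh, from the quote (7 cents / kWh)
SAVINGS_PER_YEAR = 10_000.0   # dollars per year, lower bound from the quote
HOURS_PER_YEAR = 365 * 24     # assumption: continuous operation, non-leap year

# Energy that no longer has to be paid for, per 64-processor rack and year.
saved_kwh_per_year = SAVINGS_PER_YEAR / PRICE_PER_KWH

# Average electrical power this corresponds to.
saved_kw_average = saved_kwh_per_year / HOURS_PER_YEAR

print(f"{saved_kwh_per_year:.0f} kWh per year, about {saved_kw_average:.1f} kW on average")
```

Under these assumptions the quote corresponds to roughly 143,000 kWh per year, or an average of about 16 kW per rack.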

modern FPGA bestsellers:

The new model is reality: FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip

DSP platform FPGA [courtesy Xilinx Corp.]

500 MHz Flexible Soft Logic Architecture

200K Logic Cells

500 MHz Programmable DSP Execution Units

0.6-11.1 Gbps Serial Transceivers

500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit

1 Gbps Differential I/O

500 MHz multi-port Distributed 10 Mb SRAM

500 MHz DCM Digital Clock Management
