reiner hartenstein, university of kaiserslautern, germany - xputer lab … · 2013. 5. 8. ·...


Enabling Technologies for System-on-Chip Development,

November 19-20, 2001, Tampere, Finland

http://www.cs.tut.fi/soc/

Reconfigurable Computing Architectures and Methodologies for System-on-Chip;

Reiner Hartenstein, Monday, November 19, 10:15 - 11:00 hrs.

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

Enabling Technologies for

Reconfigurable Computing

Reiner Hartenstein

University of Kaiserslautern

November 21, 2001, Tampere, Finland

Enabling Technologies for Reconfigurable Computing, part 1: Reconfigurable Computing (RC), Wednesday, November 21, 8.30 – 10.00 hrs. © 2001, http://www.fpl.uni-kl.de


Xputer Lab

2

Schedule

time slot

08.30 – 10.00 Reconfigurable Computing (RC)

10.00 – 10.30 coffee break

10.30 – 12.00 Compilation Techniques for RC

12.00 – 14.00 lunch break

14.00 – 15.30 Resources for Stream-based RC


16.00 – 17.30 FPGAs: recent developments



Xputer Lab

3

Reconfigurable: why?

• Exploding design costs and shrinking product life cycles of ASICs create demand for reconfigurable arrays (RAs) to extend product longevity.

• Performance is only one part of the story. The time has come to fully exploit their flexibility to support turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance, and field upgrades.

• A new “soft machine” paradigm and language framework is available for novel compilation techniques to cope with the new market structures transferring synthesis from vendor to customer.



Xputer Lab

4

SOC Alternatives… not including C/C++ CAD Tools [Gordon Bell]

• The blank sheet of paper: FPGA

• Auto design of a basic system: Tensilica

• Standardized, committee designed components*, cells, and custom IP

• Standard components including more application specific processors *, IP add-ons and custom

• One chip does it all: SMOP **

*) Processors, Memory, Communication & Memory Links, **) SMOP ??



Xputer Lab

5

SoC Alternatives [Gordon Bell]

product strategy vendor

FPGA “sea of uncommitted gate arrays” Xilinx, Altera

compile a system unique processor for every application

Tensilica

systolic array many pipelined or parallel processors + custom

DSP, VLIW special purpose processor cores + custom

TI

processor + RAM + ASICS

general purpose cores, specialized by I/O, etc.

IBM, Intel,

universal micro multiprocessor array, programmable I/O

Cradle



Xputer Lab

6

A Decade of Research in Reconfigurable Computing

• Due to the achievements of numerous research projects throughout the 1990s, the breakthrough in commercialization has started, and a quite comprehensive methodology is already available.

• Dear Colleague, the RC scene welcomes your contributions to improve it and to push for inclusion in contemporary CS&E curricula.

• One of the goals of this talk is to stimulate you with highlights and to introduce some key issues.










Xputer Lab

7

no more a strange niche area

• was “Hardware” design for a strange platform – CAD, but no Compilation

• Emerging awareness: – New mind set – New curricular embedding

• coming dichotomy of CS – SW <-> CW – HW <-> FW – computing in time <-> computing in space



Xputer Lab

8

flexibility / universality trade-off

trade-off flexibility efficiency

FPGA

Kress Array

Xplorer hardwired



Xputer Lab

9

RAs are heading for Mainstream

ASPP, application-specific programmable product is: • Application-specific standard product and: • embedded programmable logic

Soap Chip: System on a programmable Chip

Logic

Analog

DRAM/Flash/SRAM

Programmable Logic

Microprocessor

CSoC, configurable SoC is: • an industry standard µProcessor, • embedded reconfigurable array, • memory, dedicated system bus ...

Logic

Flash / RAM

memory banks

Reconfigurable

Accelerator

Array

... become indispensable for SoC products ?



Xputer Lab

10

Reconfigurable Logic going Mainstream

• Please, Lobby for New Curricula.

• Comprehensive Methodology

• One of the goals of this talk: to motivate You by Key Issues and Visionary Highlights.

• Fine grain: FPGAs killing the ASIC market

• Coarse grain: several startups

• Substantially improved design flow and libraries

• Fastest growing segment of semiconductor market



Xputer Lab

11

Designer-oriented Innovation stalled ?

• EDA industry: about 7 billion $

• leverages > 200 billion $ semiconductor industry

• FPGAs (7 billion $): fastest growing segment

• EDA industry constantly redefining itself

• “except logic synthesis, no really significant innovation in the past decade”

• CAD developers can’t deliver their ideas effectively

• CAD developers personally don’t appreciate the real problems facing designers



Xputer Lab

12

EDA the main bottleneck










Xputer Lab

13

Biggest Mistake of EDA guess it !



Xputer Lab

14

>> History

• History

• Paradigm Shift

• Coarse Grain: why ?

• Coarse Grain Architectures

• Reconfiguration Architecture http://www.uni-kl.de



Xputer Lab

15

Logic Gate Price Trend

Source: Altera

[Figure: price per logic element, normalized to Q1/1993 (vertical axis 0 to 1.2), plotted per year from Q1 '93 to Q1 '00. Data labels on the chart: 0.261, 0.086, 0.042, 0.029.]

Price per Logic Element: 40% lower per Year
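The "40% lower per year" trend can be checked with a few lines of arithmetic. The sketch below assumes the decline compounds once per year from a normalized price of 1.0 in Q1 '93; the function name and the constant are illustrative and not taken from the slide.

```python
# Compound a constant yearly price decline, normalized to 1.0 in Q1 '93.
YEARLY_DECLINE = 0.40  # from the slide: "40% lower per year"


def normalized_price(years_since_1993: int, decline: float = YEARLY_DECLINE) -> float:
    """Price relative to Q1 '93 after a whole number of years of compounding."""
    return (1.0 - decline) ** years_since_1993


if __name__ == "__main__":
    for year in range(8):  # Q1 '93 through Q1 '00
        label = (93 + year) % 100
        print(f"Q1 '{label:02d}: {normalized_price(year):.3f}")
```

After seven years this simple model gives roughly 0.028, which is close to the smallest data label on the chart (0.029). The slide does not say which quarter each label belongs to, so this is only a plausibility check.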



Xputer Lab

16

?

The History of Paradigm Shifts

“Mainstream Silicon Application is switching every 10 Years”

TTL µproc., memory

“The Programmable System-on-a-Chip is the next wave“

custom

standard

1957

1967

1977

1987

1997

2007

ASICs, accel’s

LSI, MSI

1st

Design Crisis

2nd

Design Crisis

?



Xputer Lab

17

Makimoto’s 3rd Wave

• Fine Grain Subsystems (FPGAs):

– 1st half of 3rd wave – universal (but less efficient)

• Coarse Grain Subsystems:

– 2nd half of 3rd wave – domain-specific – much more flexible than 2nd half of 2nd wave



Xputer Lab

18

What’s the next Wave ?

2007 FPGAs

custom

standard

1957

1967

1977

1987

1997

Tredennick’s Paradigm Shifts

procedural programming

algorithm: variable

resources: fixed

hardwired

algorithm: fixed

resources: fixed

2007

?

structural programming

algorithm: variable

resources: variable

Coarse grain RAs

no further wave !

Hartenstein’s Curve

? 4th wave ?










Xputer Lab

19

The Impact of Makimoto’s Paradigm Shifts

TTL µproc., memory

custom

standard

ASICs, accel’s

LSI, MSI

1957

1967

1977

1987

1997

2007

Procedural personalization via RAM-based

Machine Paradigm

Personalization (CAD) before fabrication

structural personalization:

RAM-based before run time

Dr. Makimoto: FPL 2000 keynote

Software Industry’s Secret of Success

Repeat Success Story by new Machine Paradigm !



Xputer Lab

20

>> Paradigm Shift

• History

• Paradigm Shift






Xputer Lab

21

Sequential vs. structural RAM

re-

download

conf. accelerator(s)

RAM

Logic Synthesis

Route and Place

FPGA

“von Neumann”

downloading

RAM

downloading

data path instruction sequencer

I / O

(procedural) Software

sequential

RAM

structural

RAM



Xputer Lab

22

Changing Models of Computing

“von Neumann” contemporary reconfigurable computing

downloading

RAM

downloading

data path instruction sequencer

I / O

host

hardwired

downloading

accelerator(s)

CAD

RAM

host

re-

downloading

conf. accelerator(s)

RAM RAM

(procedural) Software

Software Configware

(structural)

Flexware Hardware

occupies most silicon

the tail wagging the dog



Xputer Lab

23

The Microprocessor is a Methuselah

• 1st 4004

• 2nd 8008

• 3rd 8086

• 4th 80286

• 5th 80386

• 6th 80486

• 7th P5 (Pentium)

• 8th P6 (Pentium Pro / Pentium II)

• 9th Pentium III

9 technology generations ...

... the steam engine

of the silicon age



Xputer Lab

24

… Decline of Wintel Business Model

Billion Subscribers worldwide

1 Bio

0.5 Bio

20

Billion US-$ US Market [forrester]

15

10

20

1997 1998 1999 2000 2001 2002

Million Devices delivered in the U.S.

[IDC]

1000 $

1500 $










Xputer Lab

25

Basics of Binding Time

run time

loading time

compile time

time of “Instruction Fetch”

microprocessor parallel computer




Xputer Lab

26

Binding Time vs. Computing Domain

time domain (procedural)

Binding time: (Set-up of Communication Channels)

at run time microprocessor parallel computer

time & space (hybrid)

systolic arrays

later fabrication step ASICs

space domain (structural)

before fabrication full custom ICs

at loading time

at compile time


array processor

programming domain:

The KressArray is a generalization

of the systolic array



Xputer Lab

27

Dataquest Predicts Programmability to be Predominant in SOC

• With programmability as a standard feature, ASPPs will be predominant system-on-a-chip products in five years

Dataquest Semiconductors ‘98 conference

EETimes 10/21/98

Jordan Selburn, principal analyst, ASICs and system-level integration, Dataquest Inc.’s Semiconductors Group

• Application-specific programmable products (ASPPs) will be the next best thing in semiconductor technology



Xputer Lab

28

Applications

The 10th International Conference on Field-programmable Logic and Applications

The Roadmap to Reconfigurable Systems

*) keynotes and papers at FPL 2000 Villach, Austria, August 27 - 30, 2000

http://www.fpl.uni-kl.de/FPL/

• next generations’ wireless* • network processors* • many other areas*



Xputer Lab

29

Applications (2)

• Image Processing:

– for smart car (collision avoidance, others ...),

– Smart traffic pilots, robotics, fast material inspection,

– smart stub finders, motion detection (MPEG-4, ...)

• Signal Processing, Speech Processing, Software Radio,

• Correlation, Encryption, Comm. Switching / Protocols,

• Innovative consumer electronics:

– super smart cards, smart handies, wearable,

– portable, set-top, laptop, desktop, embedded, ...

• many others, ...



Xputer Lab

30

Applications

•new cellular standard: up to 2 Mbit/sec: new CDMA standard: > 500 MIPS needed just for RF receiver part

•wide variety of end-user‘s devices: smart handies, palm pilots, laptops, games, camcorder-likes, ..the internet car, many new types of devices to come ...

• increasing wide variety of services available from network provider:download just what a particular customer is subscribed to

•expert group [Vissers]: > 20% of it will be accelerator code*










Xputer Lab

31

Why coarse grain ?

[Figure: trends over time, 1960 to 2010. Labels on the chart: Algorithmic Complexity (Shannon’s Law) with wireless generations 1G, 2G, 3G, 4G; Transistors/chip (axis 1 to 100 000 000) for microprocessor / DSP and memory; normalized processor speed; battery performance; computational efficiency in mA/MIP (axis 0.001 to 100) with data points StrongARM and SH7752.]

Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld



Xputer Lab

32

Shannon‘s Law

• In a number of application areas throughput requirements are growing faster than Moore's law

• Fundamental flaws in

software processor solutions

• 32 soft ARM cores fit onto contemporary FPGA

• Stream-based distributed processing is the way to go



Xputer Lab

33

It’s a Paradigm Shift !

• Using FPGAs (fine grain reconfigurable) is mainly just classical Logic Synthesis on a “strange hardware” platform

• Coarse Grain Reconfigurable Arrays (Reconfigurable Computing), however, mean a really fundamental Paradigm Shift

• This is still ignored by CS and EE Curricula and almost all R&D scenes



Xputer Lab

34

>> Coarse Grain: why ?

• History

• Paradigm Shift






Xputer Lab

35

It’s a General Paradigm Shift !

• Using FPGAs (fine grain reconfigurable): just Logic Synthesis on a strange platform

• Coarse Grain Reconfigurable Arrays (Reconfigurable Computing): a fundamental Paradigm Shift

• ignored by Curricula & most R&D scenes

• Replacing Concurrent Processes by much more efficient parallelism: Stream-based Computing Arrays



Xputer Lab

36

Fine-grained vs. coarse-grained

• Fine-grained reconfiguration versus coarse-grained reconfiguration.

• fine grain is general purpose

• slow and area-inefficient, but high parallelism

• coarse grain is application domain-specific

• coarse grain is highly area-efficient

• extremely high performance










Xputer Lab

37

Reconfigurability Overhead

S S

S S

resources needed for reconfigurability

partly for configuration code storage

L

L L

L L

L

L L L

area used by application

“hidden RAM” not shown



Xputer Lab

38

Principle of a Typical FPGA

FF

FF

FF

FF

FF FFFF FF

Connection-Point

Tap

CLBCLB

CLBCLB

CLBCLBFF of hidden RAM



Xputer Lab

39

Routing Overhead in FPGAs

FF

FF

FF

FF

FF FF

>1000 transistors at each cross bar

FF part of the

hidden RAM most FPGA vendors’ gate count:

1 flipflop of configuration RAM = 4 gates

Routing Congestion [DeHon]: often 50% or less of CLBs used

FF FF

> 40 transistors at each switching point

> 15 transistors at each tap



Xputer Lab

40

Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld

Why Coarse Grain instead of FPGA ?

[Figure: Transistors / chip (axis 1000 to 100 000 000 000) versus year, 1980 to 2010. Curves: physical, logical, FPGA physical, FPGA logical, FPGA routed. Gap annotations on the chart: ~ 10 and ~ 10 000.]

reduced reconfigurability overhead by up to ~ 1000

drastically smaller configuration memory

much faster loading

many more benefits



Xputer Lab

41

>>> extremely high efficiency

1. avoiding address computation overhead

2. avoiding instruction fetch and interpretation overhead

3. high parallelism, massively multiple deep pipelines

4. much less configuration memory

5. no routing areas to configure functions from CLBs



Xputer Lab

42

Configurable Computing Systems

• combine programmable sequential processor with Flexware (structurally programmable „hard“ware):

• capitalize on the strengths of both, flexware and software.

• early 1960s: Estrin (UCLA): enabling technology not available

• 1990s: significant increase of research activities (DARPA ...)

• FPGAs: not the enabling technology: hardware skills needed

• Verilog or VHDL based systems often result in poor performance










Xputer Lab

43

Platforms available

• Soft Data Path Arrays – KressArray – Xtreme (PACT) – ACM (Quicksilver Tech) – CHESS Array (Elixent) – others

• Compilation techniques feasibility studies: – Partitioning Co-Compiler – Design Space Explorer – others



Xputer Lab

44

Also as an autonomous Machine

• New Machine Paradigm (Xputer)

• is the counterpart of the so-called von Neumann paradigm
– CONS: confuses customers (paradigm switch: the brain hurts)
– PROS: strong guidance of EDA tool development
– more effective hardware/software APIs
– compilation techniques similar to traditional compilation
– better Application Development Tools accepting C or Java

• easy to teach: simple machine principles
– scan patterns (data counter) similar to control flow (program counter)
– general model of hardware / software co-design
– fascination for freak effect: opening up a new R&D discipline
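The analogy between a data counter and a program counter can be made concrete with a small address generator. The sketch below is only an illustration under assumptions: the names `video_scan`, `snake_scan`, `linear_address`, and `stream` are hypothetical and do not come from the Xputer tools. A scan pattern steps a data counter through a 2-D memory, and the visited words are streamed to a datapath without any instruction fetch.

```python
from typing import Callable, Iterator, List, Tuple

Position = Tuple[int, int]


def video_scan(width: int, height: int) -> Iterator[Position]:
    """Row-by-row scan: the data counter visits (x, y) like a raster beam."""
    for y in range(height):
        for x in range(width):
            yield (x, y)


def snake_scan(width: int, height: int) -> Iterator[Position]:
    """Like video_scan, but every second row is traversed right to left."""
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        for x in xs:
            yield (x, y)


def linear_address(x: int, y: int, width: int, base: int = 0) -> int:
    """Map a 2-D data-counter position to a flat memory address."""
    return base + y * width + x


def stream(memory: List[int], width: int, height: int,
           scan: Callable[[int, int], Iterator[Position]] = video_scan) -> Iterator[int]:
    """Feed memory words to a datapath in the order given by the scan pattern."""
    for x, y in scan(width, height):
        yield memory[linear_address(x, y, width)]
```

Changing the scan function changes the data stream while the memory contents and the datapath stay untouched, which is the sense in which a scan pattern plays the role that control flow plays for a program counter.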



Xputer Lab

45

>> Coarse Grain Architectures

• History

• Paradigm Shift






Xputer Lab

46

Some Players in Silicon Valley and ….

Company           | Architecture              | Business Model | Markets
Triscend          | System on Chip            | Sell Chips     | Embedded Systems
Adaptive Silicon  | Not disclosed             | Sell Cores     | Embedded DSP
Chameleon Systems | 32 bit datapath array     | Sell Chips     | Networking
Malleable         | Not disclosed             | Sell Chips     | Voice over IP
Silicon Spice     | Not disclosed             | Sell Solutions | Networking
Systolix          | Bit Serial Systolic Array | Sell Cores     | Signal Conditioning
MorphICs          | Not disclosed             | Sell Cores     | Wireless Commun.

Network Processors: > 20 Players



Xputer Lab

47

Commercial rDPAs

XPU family (IP cores): PACT Corp., Munich

XPU128

**) bought

**

**

flexible array: MorphICs

CALISTO: Silicon Spice

CS2000 family: Chameleon Systems

MECA family: Malleable

FIPSOC: SIDSA

ACM: Quicksilver Tech

CHESS array: Elixent

MorphoSys: Morpho Tech

*

*

*) here at SoC



Xputer Lab

48

PACT Corp

• Xtreme Processor Platform (XPP) family of IP cores: high-speed, data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories and high-speed I/O ports.

• Application development support software featuring a flow graph-style algorithm mapping language to minimize training requirements.

• XPP's fabrics feature automatic DataFlow synchronization and a flagged Event Network to dynamically configure the execution flow.

• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order.

• Automatic event-based task swapping along with data streams: released resources automatically reconfigured immediately










Xputer Lab

49

Reconfigurable Interconnect Fabric

separate routing area

rDPA (Reconfigurable Datapath Array)

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

rDPU rDPU rDPU rDPU

RIF laid out over rDPUs: rDPA wired by abutment



Xputer Lab

50

Generically defined Fabrics: KressArray Family

f)g)

i)

a)

e)

routing

routing

d)b)

h)

only

andfunction

c) rDPU:

rDPU:

rDPU

+

Some Application Areas, e.g. Wireless Communication, need extraordinarily powerful Communication Resources



Xputer Lab

51

Universal RAs are not always feasible

... often Functional Resources are not the Throughput Bottleneck

Some Application Areas, such as Wireless Communication, need extremely rich Communication Resources

Use Domain-specific Platform Generators !

The General Purpose (coarse grain) Reconfigurable Array

may appear to be an Illusion ...



Xputer Lab

52

KressArray Family Example

16 24

32

4

8

2 rDPU external view: only NNport Abutment Architecture shown

tailored KressArray rDPU example

http://kressarray.de



Xputer Lab

53

KressArray Family generic Fabrics: a few examples

Examples of 2nd Level Interconnect: laid out over rDPU cell - no separate routing areas !

+

route-through and function

route-through only

more NNports: rich Route Resources

Select Function

Repertory

select Nearest Neighbour (NN) Interconnect: an example

16 32 8 24

4

2 rDPU

Select mode, number, width of NNports

http://kressarray.de


Xputer Lab

54

CMOS interconnect resources

Foundries offer up to 8 metal layers

and up to 3 poly layers

reconfigurable interconnect fabric

laid out over the rDPU cell










Xputer Lab

55

Super Pipe Networks

systolic array
– applications: regular data dependencies only
– pipeline properties, shape: linear only
– pipeline properties, resources: uniform only
– mapping and scheduling (data stream formation): linear projection or algebraic synthesis

super-systolic rDPA
– applications, pipeline shape, pipeline resources: no restrictions
– mapping: simulated annealing or P&R algorithm
– scheduling (data stream formation): (e.g. force-directed) scheduling algorithm *

*) KressArray [1995]
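To make the "linear, uniform, regular data dependencies" properties of a classical systolic array concrete, here is a small clock-by-clock simulation of a linear systolic FIR filter. It is a textbook-style design chosen for illustration and is not taken from the slides: each cell keeps one stationary weight, samples move right through two registers per cell, partial sums move right through one register per cell, and all communication is nearest-neighbour. The function name `systolic_fir` is hypothetical.

```python
from typing import List, Sequence


def systolic_fir(weights: Sequence[int], samples: Sequence[int]) -> List[int]:
    """Simulate a linear systolic array computing y[n] = sum_k w[k] * x[n - k]."""
    k_cells = len(weights)
    x_regs = [[0, 0] for _ in range(k_cells)]  # two-stage sample delay per cell
    y_regs = [0] * k_cells                     # one partial-sum register per cell
    observed = []

    # Append zeros so that the last partial sums are flushed out of the pipe.
    for x in list(samples) + [0] * k_cells:
        # Values each cell sees on its left-hand ports during this clock.
        x_in = [x] + [x_regs[k][1] for k in range(k_cells - 1)]
        y_in = [0] + [y_regs[k] for k in range(k_cells - 1)]
        observed.append(y_regs[k_cells - 1])   # output of the rightmost cell

        # Clock edge: every cell updates its registers simultaneously.
        for k in range(k_cells):
            y_regs[k] = y_in[k] + weights[k] * x_in[k]
            x_regs[k] = [x_in[k], x_regs[k][0]]

    # The pipe has a latency of k_cells clocks; drop the warm-up outputs.
    return observed[k_cells:]
```

Every cell is identical and only talks to its neighbours, which is what restricts this style to regular data dependencies. The super-systolic rDPA in the comparison above lifts those restrictions at the cost of needing a placement and routing step.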



Xputer Lab

56

Communication Resource Requirements

... often Functional Resources are not the Throughput Bottleneck

In some Application Areas, such as Wireless Communication, Reconfigurable Computing Arrays need extraordinarily rich and powerful Communication Resources

The Solution: Generators for Domain-specific RA Platforms



Xputer Lab

57

Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect

array size: 10 x 16 = 160 rDPUs


SNN filter KressArray Mapping Example

rout thru only

not used backbus connect



Xputer Lab

58

Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect

route-thru-only rDPU

3 vert. NNports, 32 bit


Xplorer Plot: SNN Filter Example

+ [13]

2 hor. NNports, 32 bit

operator

result

operand

operand

route thru

backbus connect



Xputer Lab

59

Super Pipe Networks

systolic array
– applications: regular data dependencies only
– pipeline properties, shape: linear only
– pipeline properties, resources: uniform only
– mapping and scheduling (data stream formation): linear projection or algebraic synthesis

super-systolic RA
– applications, pipeline shape, pipeline resources: no restrictions
– mapping: simulated annealing or P&R algorithm
– scheduling (data stream formation): (e.g. force-directed) scheduling algorithm *

*) KressArray [ASP-DAC-1995]



Xputer Lab

60

KressArray: try it out yourself !

• You may experiment yourself

• You may use it over the internet

• Map an application onto a KressArray

• Start with a simple example

• Visit http://kressarray.de

• Click the link to Xplorer

• ... does not run on internet explorer ....

• ... since Bill Gates does not like Java

try Netscape 4.7x

http://kressarray.de/










Xputer Lab

61

Michael Herz

Dissertation Michael Herz: • ... on mapping parallel memory architectures for stream-based arrays onto KressArrays

• ... also transformation of storage schemes to optimize memory bandwidth

• (MoM scan pattern transformations)

Agilent, Sindelfingen



Xputer Lab

62

Ulrich Nageldinger

Dissertation

Ulrich Nageldinger:

• ... on mapping applications onto KressArrays

• ... simultaneous routing and placement by simulated annealing

• Supporting a huge family of KressArrays

• fuzzy logic improvement proposal generator

• profiling

• design space exploration

infineon technologies, Munich
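Placement by simulated annealing, as named on this slide, can be sketched in a few dozen lines. The code below is a minimal illustration under stated assumptions and not the Xplorer implementation: it places the nodes of a small dataflow graph on a rows × cols mesh and approximates routing cost by the total Manhattan distance of all edges, so the simultaneous routing described in the dissertation is not modelled. All names and parameter values are hypothetical.

```python
import math
import random


def wire_length(placement, edges):
    """Total Manhattan distance over all edges; placement maps node -> (row, col)."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)


def anneal_placement(nodes, edges, rows, cols,
                     steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Return (placement, cost) found by swapping cell contents at random."""
    rng = random.Random(seed)                  # fixed seed keeps runs repeatable
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells) or len(cells) < 2:
        raise ValueError("mesh too small for this graph")
    rng.shuffle(cells)
    # slots[i] is the node sitting in cells[i], or None for an unused cell.
    slots = list(nodes) + [None] * (len(cells) - len(nodes))

    def current():
        return {n: cells[i] for i, n in enumerate(slots) if n is not None}

    cost = wire_length(current(), edges)
    best, best_cost = current(), cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(len(cells)), 2)
        slots[i], slots[j] = slots[j], slots[i]          # propose a swap
        new_cost = wire_length(current(), edges)
        accept = new_cost <= cost or \
            rng.random() < math.exp((cost - new_cost) / temperature)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = current(), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]      # undo the swap
    return best, best_cost
```

For example, `anneal_placement(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")], rows=2, cols=2)` places a four-node chain on a 2 × 2 mesh. A real mapper would also have to respect the number and width of nearest-neighbour ports, which this cost function ignores.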



Xputer Lab

63

Rainer Kress

Dissertation

Rainer Kress:

• ... on mapping applications onto his* KressArray

• DPSS datapath synthesis system

• Including a data scheduler

• (data stream scheduler)

• Generalization of the Systolic Array

• (KressArray is a super systolic array)

• 32 bit design via Eurochip support

infineon technologies, Munich



Xputer Lab

64

Jürgen Becker

Dissertation

Jürgen Becker:

• ... Automatically partitioning Co-compiler

• (configware / software co-compilation)

• Resource-parameter-driven retargetable

• Profiler-driven optimization

• Accepts HLL „ALE-X“ (extended C subset)

• (subset: pointers not supported)

Professor at Univ. Karlsruhe



Xputer Lab

65

Karin Schmidt

Dissertation

Karin Schmidt:

• Compilation Techniques for Xputers

• modified loop transformations

• Modified parts of implementation used for Jürgen Becker‘s Ph. D. thesis

DaimlerChrysler Research



Xputer Lab

66

CHESS Array w. embedded RAM (Elixent)

[Figure: CHESS array with embedded RAM blocks. Labelled elements: RAM, ALU, User Registers, Clock Control, Memory Interface, Sequencer.]

multi-granular e. g. 16 * 4 Bits = 64 Bits
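The "16 * 4 Bits = 64 Bits" note can be illustrated by composing narrow ALU slices into a wide operator. The sketch below assumes that slices are chained through a carry signal; the slides do not say how CHESS actually links its 4-bit ALUs, so `alu4_add` and `wide_add` are hypothetical names for an illustrative model only.

```python
from typing import Tuple


def alu4_add(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """One 4-bit ALU slice in add mode: returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def wide_add(a: int, b: int, slices: int = 16) -> Tuple[int, int]:
    """Add two (4 * slices)-bit words by chaining 4-bit slices via the carry."""
    result, carry = 0, 0
    for i in range(slices):
        shift = 4 * i
        nibble_sum, carry = alu4_add(a >> shift, b >> shift, carry)
        result |= nibble_sum << shift
    return result, carry


if __name__ == "__main__":
    total, carry_out = wide_add(0x0123_4567_89AB_CDEF, 0x1111_1111_1111_1111)
    print(hex(total), carry_out)
```

With `slices=16` the chain behaves like a 64-bit adder. Choosing a smaller `slices` value gives narrower operators from the same building block, which is the point of a multi-granular fabric.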










Xputer Lab

67

Chameleon Systems

• RISC processor and an array of 108 arithmetic processing units. Each of those 32-bit processing cores runs at 125 MHz.

• The CS2112 is the industry's first Reconfigurable Communications

Processor (RCP), a streaming data processor.

• The vendor claims a performance of 20 billion 16-bit operations per second, and 2.4 billion 16-bit multiply-accumulates per second - and 1.6 GBytes / sec for its programmable I/O (PIO) banks.

• It also has a PCI interface.

• Tool suite C~SIDE for developing, verifying and optimizing.



Xputer Lab

68

Coarse Grain Architectures

style | project | first publ. | source | architecture | granularity | fabrics | mapping | intended target application
mesh | DP-FPGA | 1994 | [4] | 2-D array | 1 & 4 bit, multi-granular | inhomog. routing channels | switchbox routing | regular datapaths
mesh | KressArray | 1995 | [5,11] | 2-D mesh | family: sel. pathwidth | multiple NN & bus segments | (co-)compilation | (adaptable)
mesh | Colt | 1996 | [12] | 2-D array | 1 & 16 bit | inhomogeneous | run time reconfiguration | highly dynamic reconfig.
mesh | Matrix | 1996 | [15] | 2-D mesh | 8 bit, multi-granular | 8NN, length 4 & global lines | multi-length | general purpose
mesh | RAW | 1997 | [17] | 2-D mesh | 8 bit, multi-granular | 8NN switched connections | switchbox rout | experimental
mesh | Garp | 1997 | [16] | 2-D mesh | 2 bit | global & semi-global lines | heuristic routing | loop acceleration
mesh | REMARC | 1998 | [18] | 2-D mesh | 16 bit | NN & full length buses | (info not available) | multimedia
mesh | MorphoSys | 1999 | [19] | 2-D mesh | 16 bit | NN, length 2 & 3 global lines | manual P&R | (not disclosed)
mesh | CHESS | 1999 | [20] | hexagon mesh | 4 bit, multi-granular | 8NN and buses | JHDL compilation | multimedia
mesh | DReAM | 2000 | [21] | 2-D array | 8 & 16 bit | NN, segmented buses | co-compilation | next generation wireless
mesh | CS2000 family | 2000 | [23] | 2-D array | 16 & 32 bit | inhomogeneous array | (not disclosed) | communication
mesh | MECA family | 2000 | [24] | 2-D array | multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
mesh | CALISTO | 2000 | [25] | 2-D array | 16 bit, multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
mesh | FIPSOC | 2000 | [26] | 2-D array | 4 bit, multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
linear | RaPiD | 1996 | [27] | 1-D array | 16 bit | segmented buses | channel routing | pipelining
linear | PipeRench | 1998 | [29] | 1-D array | 128 bit | (sophisticated) | scheduling | pipelining
crossbar | PADDI | 1990 | [30] | crossbar | 16 bit | central crossbar | routing | DSP
crossbar | PADDI-2 | 1993 | [32] | crossbar | 16 bit | multiple crossbar | routing | DSP and others
crossbar | Pleiades | 1997 | [33] | mesh+crossbar | multi-granular | multiple segmented crossbar | switchbox routing | multimedia



Xputer Lab

69

Primarily Mesh-based ….

market project bits granularity source

KressArray variable U. Kaiserslautern

Garp 2 UC Berkeley

CHESS 4 Hewlett Packard

Matrix, RAW 8 M.I.T.

Colt 1 & 16 Virginia Tech

DReAM 8 &16 TU Darmstadt

REMARC Stanford

research

MorphoSys UC Irvine

CALISTO Silicon Spice

MECA family

16

Malleable

CS2000 family 16 & 32 Chameleon Systems

FIPSOC 16 & analog SIDSA

commercial

XPP XPU128 32 PACT Corp.



Xputer Lab

70

UC Berkeley (Jan Rabaey)

market project bits granularity source

PADDI

PADDI-2

research

Pleiades

16 UC Berkeley



Xputer Lab

71

Crossbar-based Architectures

1993: PADDI-II (Jan Rabaey)

EXUCTL

EXUCTL

EXUCTL

EXUCTL

EXUCTL

EXUCTL

EXUCTL

EXUCTL

crossbar switch I/O I/O

1990: UC Berkeley (Jan Rabaey)

16 bit

1997: Pleiades (mesh & crossbar)

32 bit



Xputer Lab

72

PADDI-II Architecture

Network

[Figure: array of processing elements P1 to P48]

break-switch

break-switch

I/O I/O I/O I/O

I/O I/O I/O I/O

6 x 16b

16 x 6 switch matrix

Level-2

16 x 16b

Level-1 Network

4-PE Cluster










Xputer Lab

73

MorphoSys



Xputer Lab

74

PipeRench Architecture (CMU 1998)

highly dynamic reconfiguration

alternating data/instruction stream



Xputer Lab

75

M.I.T.

MIPS-like processor

core

cross bar

global lines

global lines

RAW (M.I.T. 1997)

Reconfigurable Architecture Workbench

MATRIX (1996) Multiple Alu archiTecture with Reconfigurable Interconnect eXperiment

0.5 µm CMOS 8 bit 10 x 10 1.8 mm² 100 MHz

100 MHz

ALU 8 bit

256x8 bit

Mem

WE mode

Network Port A

Network Port B

Mem Func Port

ALU Func Port

compare / reduce 2

C / R Network

compare / reduce 1

C / R Network Level-1 Network

BFU opc operation

0 1 2 3 4 5 6 7 8 9

10 11 12 13 14 15

× × +

× + + × const

insh nsh dsh csh

+ +0 +1

:=

nand nor xor



Xputer Lab

76

MATRIX Interconnect Fabrics

BFU

its neighbours

BFUs

Communication Resources are often the bottleneck



Xputer Lab

77

More Research Projects

.... and others

Garp (UC Berkeley)

RaPiD (U. Washington )

REMARC (Stanford) published between 1996 - 2000

DReAM (U. Karlsruhe)

Asia / Pacific: also see embedded tutorials by Prof. Amano (ASP_DAC’99, FPL-2000)



Xputer Lab

78

RaPiD Architecture

[Figure: RaPiD linear datapath. Labelled elements: ALU, RAM, MULT, Datapath Registers, Bus Connectors, Input Multiplexers, Output Drivers.]










Xputer Lab

79

REMARC



Xputer Lab

80

Future Coarse Grain RA Development

• It is indispensable to operate within the Convergence Area of Compilers, Co-Compilers, Architecture and full-custom-style VLSI Design (array cells).

• It is a must that Products come with a Development Platform which encourages users, especially those with a limited Hardware Background.



Xputer Lab

81

>> Reconfiguration Architecture

• History

• Paradigm Shift






Xputer Lab

82

statically re-configurable

Dimensions of Reconfigurability

Class of processor: product (vendor)
– ASIP: Tensilica (Tensilica)
– Network Processor: MECA family (Malleable); CALISTO (Silicon Spice); many others

configuration time

ASIP

fabrication time

run time Network

Processor

design time

compile time

dynamically

reconfigurable

*) Application-Specific Instruction set Processors

ASIPs* vs. Network Processors

Extremes:



Xputer Lab

83

Configuration Architectures

host

Compiler, Mapper, RTOS

etc.

Soft

Data

Path RAM

RAM

RAM

RAM

multi-context:

Soft

Data

Path

RAM

host


etc.

straight forward:

host


etc.

Config. Cache

RAM

RAM

RAM

RAM

Soft

Data

Path

RAM

Configuration caching*:

Configuration Loading Resources: • separate configuration fabrics (e.g. FPGA)

• wormhole routing (KressArray, Colt, PipeRench)

• RA part computes code for other RA part (self reconfiguration)

(dynamic vs. static)

dynamic

*) no cache as usual !
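The difference between the straightforward and the multi-context configuration architectures can be sketched with a toy model. The class below is an assumption-laden illustration, not a description of any particular device: it keeps several configuration RAMs on chip, counts the words the host has to download, and treats switching between already loaded contexts as free of host traffic. All names are hypothetical.

```python
from typing import List, Optional


class MultiContextArray:
    """Toy model of a reconfigurable array with several configuration RAMs."""

    def __init__(self, num_contexts: int) -> None:
        self.contexts: List[Optional[List[int]]] = [None] * num_contexts
        self.active: Optional[int] = None  # context currently driving the datapath
        self.words_downloaded = 0          # host traffic, in configuration words

    def download(self, slot: int, config: List[int]) -> None:
        """Slow path: the host writes a whole configuration into one context RAM."""
        self.contexts[slot] = list(config)
        self.words_downloaded += len(config)

    def switch(self, slot: int) -> None:
        """Fast path: select an already loaded context without host traffic."""
        if self.contexts[slot] is None:
            raise ValueError(f"context {slot} has not been downloaded")
        self.active = slot


if __name__ == "__main__":
    array = MultiContextArray(num_contexts=2)
    array.download(0, [0xA1, 0xA2, 0xA3])
    array.download(1, [0xB1, 0xB2])
    for slot in (0, 1, 0):
        array.switch(slot)
    print(array.active, array.words_downloaded)
```

With a single context (the straightforward case), every change of configuration would go through `download` again, so the download counter grows with each reconfiguration instead of staying constant.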



Xputer Lab

84

Colt Architecture (P. Athanas 1996)

Multiplier

DP

DP

DP

Smart

Crossbar

IFUIFUIFUIFU

IFUIFUIFUIFU

IFUIFUIFUIFU

IFUIFUIFUIFU

DP

DP

DPI/O Pins

I/O Pins

I/O Pins

I/O Pins

I/O Pins

I/O Pins

Studying highly dynamic reconfiguration

wormhole routing










Xputer Lab

85

Schedule

time slot

08.30 – 10.00 Reconfigurable Computing (RC)


10.30 – 12.00 Compilation Techniques for RC

12.00 – 14.00 lunch break

14.00 – 15.30 Resources for Stream-based RC


16.00 – 17.30 FPGAs: recent developments



Xputer Lab

86

- END -
