(keynote)(from HPC to)
New Horizons of Very High Performance Computing
(VHPC): Hurdles and Chances
Reiner Hartenstein
TU Kaiserslautern
Rhodes Island, Greece, April 25-26, 2006
© 2006, reiner@hartenstein.de http://hartenstein.de
Reconfigurable Supercomputing (VHPC) going commercial
Cray XD1
Silicon Graphics RASC
… and other vendors
… it's a paradigm shift!
The Pervasiveness of RC
[Figure: # of hits by Google for "FPGA and …" queries. First group: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Second group: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000. Annotations: ECE-savvy scene; Math/SW-savvy scene: unqualified for RC?]
World-wide: a mass movement
Methodology?
Reminds me of the mass migration of lemmings
Terminology chaos, not really a sense of direction
An urgent need to get organized
>> Outline <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
The Reconfigurable Computing Paradox
Poor FPGA technology:
• very poor effective integration density
• "very power-hungry" [Rick Kornfeld*]
• lower clock frequencies, and more expensive
Poor tools:
• very poor application development support
• languages and tools unacceptable for software people
• most hardware experts (86%**) hate their tools
Poor education:
• RC education: extremely poor, or none
• ignored by CS curricula
• … teaching as if for a 50-year-old mainframe …
However, brilliant results everywhere
What paradox?
*) personal communication **) DeHon '98
Education?
Computing Curricula 2004 (Joint Task Force for …) fully ignores Reconfigurable Computing
FPGA & synonyms: 0 hits, not even here
(Google: 10 million hits)
Completed?
Computing Curricula v. 2005: no changes other than "… FPGA, etc." (not really mentioning that it's missing)
Task force activity completed? Next task force in 2020 or later?
Tools?
End of this week: brainstorming session at DARPA (urgently needed, overdue!)
Fine-grained RC: DeHon's 1st Law [1996: Ph.D., MIT]
Technology: immense area inefficiency
Overhead: reconfigurability overhead, wiring overhead, routing congestion
[Figure: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010). Curves: microprocessor (Gordon Moore curve) and FPGA density: physical, logical, routed. Annotation: overhead >> 10 000]
Published speed-up factors
[Figure: relative performance (10^0 to 10^9) versus year (1980 to 2010). Curve labels: Microprocessor (8080 to Pentium 4), Memory, 7%/yr, 50%/yr, x1.25 / yr (Moore), FPGA x2/yr. The pre-FPGA era is marked, with DPLA on the MoM Xputer architecture. Source: http://xputers.informatik.uni-kl.de/faq-pages/fqa.html]
Application domains: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics
Speed-up factors as labeled on the chart:
• Los Alamos traffic simulation: 47
• real-time face detection: 6000
• video-rate stereo vision: 900
• pattern recognition: 730
• SPIHT wavelet-based image compression: 457
• Smith-Waterman pattern matching: 288
• BLAST: 52
• protein identification: 40
• molecular dynamics simulation: 88
• Reed-Solomon decoding: 2400
• Viterbi decoding: 400
• FFT: 100
• MAC: 1000
• crypto: 1000
• GRAPE: 20
• 2-D FIR filter [TU-KL]: 39.4
• Lee Routing (by TU-KL): 160
• Grid-based DRC (no FPGA: DPLA on MoM by TU-KL): 2000
• Grid-based DRC ("fair comparison"): 15000
Pre-FPGA era: why DPLA* was so good
Large arrays of canonical boolean expressions
PLA layout ~similar to RAM / ROM layout: close to Moore because of small overhead (wiring, programmability, routing)
Mid-'80s: first very tiny FPGAs available
GAG (Generic Address Generator) to avoid address computation overhead
ASM: Auto-Sequencing Memory [M. Herz et al.: ICECS 2003, Dubrovnik]
*) designed by TU-KL, fabricated by E.I.S., the German multi-university project
(anti-von-Neumann machine paradigm)
Data counter instead of program counter
Generalization of the DMA
ASM: Auto-Sequencing Memory (RAM with GAG and data counter)
GAG & enabling technology: published 1989 [by TU-KL]
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC & TU-KL
**) patented by TI, 1995
Storage scheme optimization methodology, etc.
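A minimal software sketch of the data-counter idea: an address generator produces the whole address sequence of a scan pattern from a few parameters, so no instruction has to be fetched to compute each address. The names `gag_scan` and `asm_read` and their parameters are hypothetical illustrations, not the interface of the TU-KL hardware, which realizes this in dedicated logic next to the RAM.

```python
from typing import Iterator, List


def gag_scan(base: int, row_stride: int, rows: int, cols: int,
             col_step: int = 1) -> Iterator[int]:
    """Yield the address sequence of a 2-D block scan, row by row."""
    for r in range(rows):
        row_base = base + r * row_stride
        for c in range(0, cols, col_step):
            yield row_base + c


def asm_read(ram: List[int], addresses: Iterator[int]) -> Iterator[int]:
    """Auto-sequencing memory: emit a data stream in address-generator order."""
    for addr in addresses:
        yield ram[addr]


# 16 memory words viewed as a 4 x 4 array (row stride 4).
ram = list(range(100, 116))
# Stream a 2 x 2 window whose top-left word sits at address 5.
stream = list(asm_read(ram, gag_scan(base=5, row_stride=4, rows=2, cols=2)))
```

Here `stream` holds the four words of the window in scan order; changing only the parameters yields a different data stream from the same memory.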
Thousands or millions of $ for free
Application migration [from supercomputer] results not only in massive speed-ups:
electricity bills are reduced by an order of magnitude and even more
you may get for free … up to millions of dollars per year
(also a matter of national energy policy)
Google; Amsterdam; NY
Reconfigurable Scientific Computing
How do software types program the FPGAs? Hiring a good student from the EE Dept.?
Because of missing RC education: far away from optimum solutions?
Much higher speed-up achievable? 1 or 2 more orders of magnitude? 100,000? 1,000,000?
By education: better speed-up factors?
[Figure: the published speed-up factors chart repeated, with the added annotation "tools & edu available?"]
The Supercomputing Paradox
Promising technology:
• growing listed Teraflops
• increasing number of processors running in parallel
• decreasing cost of COTS processors
Poor results:
• often limited sustained Teraflops
• almost stalled application implementation progress
• very high total cost of the Tera(?)flops
• scientists waiting for affordable compute capacity
The Law of More
Why traditional supercomputing / HPC failed
instruction-stream-based: memory-cycle-hungry
the wrong way the data are moved around, because of the wrong multi-core interconnect architecture
extremely unbalanced
[Figure: CPU (stolen from Bob Colwell)]
Earth Simulator
5120 processors, 5000 pins each
ES 20: TFLOPS
Crossbar weight: 220 t, 3000 km of thick cable
moving data around inside the
Bringing together data and processor
by Software
Moving data to the processor: moving the grand piano
Coarse-grained RC: Hartenstein's Law [1996: ISIS, Austin, TX]
Area efficiency very close to Moore's law
[Figure: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010). Curves: Gordon Moore curve, rDPA physical, rDPA logical, FPGA routed. Annotation: >> 10 000. rDPA example: the KressArray family]
Higher speed-up factors by coarse-grained?
[Figure: the published speed-up factors chart repeated, with the added annotation "Coarse-grained arrays?"]
Coarse grain is about computing, not logic
SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]
Array size: 10 x 16 = 160 rDPUs
rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide; no CPU
Legend: rDPU not used; rDPU used for routing only (route-through); operator and routing; port location marker; backbus connect
SW to coarse-grained CW migration example
[Figure: array of rDPUs with the operator + and the output S mapped onto it]
Compare it to software solution on CPU
Hypothetical branching example to illustrate software-to-configware migration:
S = R + (if C then A else B endif);
Simple conservative CPU example, C = 1:
step                                             memory cycles   nanoseconds
if C then read A:     read instruction                 1             100
                      instruction decoding
                      read operand*                    1             100
                      operate & reg. transfers
if not C then read B: read instruction                 1             100
                      instruction decoding
add & store:          read instruction                 1             100
                      instruction decoding
                      operate & reg. transfers
                      store result                     1             100
total                                                  5             500
*) if no intermediate storage in register file
Configware solution (inputs A, B, R, C with C = 1; output S): clock 200 MHz (5 nanosec), no memory cycles: speed-up factor = 100
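The arithmetic behind the factor of 100 can be sketched as follows. The per-step cycle counts, the 100 ns memory cycle and the 200 MHz clock are the slide's figures; the assumption that the data path delivers one result per clock period is the slide's simplification, and the variable names are only illustrative.

```python
MEMORY_CYCLE_NS = 100   # one memory cycle on the slide's simple CPU
CLOCK_MHZ = 200         # clock of the reconfigurable data path

# Memory cycles per step of S = R + (if C then A else B endif) for C = 1.
cpu_steps = {
    "read instruction (if C then read A)": 1,
    "read operand A": 1,
    "read instruction (if not C then read B)": 1,
    "read instruction (add & store)": 1,
    "store result": 1,
}

cpu_cycles = sum(cpu_steps.values())
cpu_time_ns = cpu_cycles * MEMORY_CYCLE_NS

# Assumption: the configured data path needs no memory cycles and
# produces the result within one clock period.
datapath_time_ns = 1000 / CLOCK_MHZ

speedup = cpu_time_ns / datapath_time_ns
```

With these numbers the CPU spends 500 ns in memory cycles against 5 ns for the data path, which gives the quoted speed-up.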
Why the speed-up? What's the difference?
Moving the locality of operation into the route of the data stream by P&R, instead of moving data by instruction streams
The wrong mind set ....
"but you can't implement decisions!"
S = R + (if C then A else B endif);
[Figure: section of a very large pipe network [Ulrich Nageldinger]: inputs A, B, R, C; the decision (C = 1) and the adder (+) are mapped onto rDPUs. Legend: rDPU not used; used for routing only (route-through); operator and routing; port location marker; backbus connect]
Not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
We need Reconfigurable Computing Education
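A small sketch of how the decision becomes part of the pipe: the condition C only selects which operand flows on, so no program branch is needed. The functions `mux` and `pipe` are hypothetical names for this illustration; on a real array the selector would be an operator placed on an rDPU rather than a Python expression.

```python
from typing import Iterable, Iterator, Tuple


def mux(c: int, a: int, b: int) -> int:
    """Two-way multiplexer: pass a when c is 1, otherwise b."""
    return a if c == 1 else b


def pipe(stream: Iterable[Tuple[int, int, int, int]]) -> Iterator[int]:
    """Push (R, C, A, B) tuples through a mux stage feeding an adder stage."""
    for r, c, a, b in stream:
        selected = mux(c, a, b)   # decision stage: a data selector
        yield r + selected        # adder stage: S = R + selected


inputs = [(10, 1, 1, 2), (10, 0, 1, 2), (5, 1, 7, 9)]
results = list(pipe(inputs))
```

Every tuple takes the same path through the two stages; only the selected value differs.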
The new paradigm: how the data are traveling
Not transport-triggered: old hat
No, not by instruction execution: vN and Move Processor are both instruction-driven
[Jack Lipovski, EUROMICRO, Nice, 1975]
Pipeline, or chaining: DPU, DPU, DPU (super systolic array)
P&R: move locality of operation, not data!
Data streams (pipe network)
H. T. Kung paradigm (systolic array)
"Data streams" define: ... which data item at which time at which port
[Figure: DPA with input data streams and output data streams, each plotted as port # versus time]
Implemented by distributed memory: ASM (Auto-Sequencing Memory) = RAM with GAG and data counter
50 & more on-chip ASM are feasible
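As a rough illustration of "which data item at which time at which port", the sketch below clocks a stream through a linear pipe of three DPUs and records the clock tick at which each result leaves the output port. The function `run_pipe` and the three stage operations are invented for this example; `None` stands for an empty pipeline slot.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_pipe(stages: List[Callable[[int], int]],
             stream: Iterable[int]) -> List[Tuple[int, int]]:
    """Clock a data stream through a linear pipe of DPUs.

    Returns (clock, value) pairs: which result leaves the output port
    at which clock tick.
    """
    regs: List[Optional[int]] = [None] * len(stages)   # one register per DPU
    out: List[Tuple[int, int]] = []
    # Append empty slots so the last items are flushed out of the pipe.
    feed: List[Optional[int]] = list(stream) + [None] * len(stages)
    for clock, item in enumerate(feed, start=1):
        if regs[-1] is not None:
            out.append((clock, regs[-1]))   # last DPU drives the output port
        # Update from the back so each item advances one DPU per clock.
        for k in range(len(stages) - 1, 0, -1):
            prev = regs[k - 1]
            regs[k] = stages[k](prev) if prev is not None else None
        regs[0] = stages[0](item) if item is not None else None
    return out


pipe = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
schedule = run_pipe(pipe, [1, 2, 3])
```

After the pipe has filled, one result leaves per clock tick, although each item passes through three operations.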
The Generalization of the Systolic Array
Systolic array: only for applications with regular data dependencies
Remedy? [R. Kress]: discard algebraic synthesis methods; use optimization algorithms, e.g. simulated annealing
Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures are possible
Reconfigurability makes sense
Kress-Kung paradigm: super systolic array
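A generic sketch of placement by simulated annealing, assuming a simple cost model (total Manhattan wire length between connected operators on a grid of rDPUs). This is not the KressArray mapper itself: the function names, the move set, the cooling schedule and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Pos = Tuple[int, int]
Net = Tuple[str, str]


def wire_length(placement: Dict[str, Pos], nets: List[Net]) -> int:
    """Sum of Manhattan distances over all producer/consumer pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops: List[str], nets: List[Net], rows: int, cols: int,
                     steps: int = 5000, t_start: float = 5.0,
                     t_end: float = 0.01, seed: int = 0):
    """Place each operator on its own grid cell, minimizing wire length."""
    if len(ops) > rows * cols:
        raise ValueError("more operators than rDPU cells")
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    placement = dict(zip(ops, cells))       # random initial placement
    free = cells[len(ops):]                 # unoccupied cells
    cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        a = rng.choice(ops)
        if free and rng.random() < 0.5:
            i = rng.randrange(len(free))    # move a to a free cell
            placement[a], free[i] = free[i], placement[a]
            undo = ("move", i)
        else:
            b = rng.choice(ops)             # swap the cells of a and b
            placement[a], placement[b] = placement[b], placement[a]
            undo = ("swap", b)
        new_cost = wire_length(placement, nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                 # accept, sometimes uphill
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif undo[0] == "move":             # reject: restore the old cell
            placement[a], free[undo[1]] = free[undo[1]], placement[a]
        else:                               # reject: swap back
            placement[a], placement[undo[1]] = placement[undo[1]], placement[a]
    return best, best_cost


# A four-operator pipe to be placed on a 4 x 4 array.
ops = ["mul", "add", "mux", "sub"]
nets = [("mul", "add"), ("add", "mux"), ("mux", "sub")]
best, best_cost = anneal_placement(ops, nets, rows=4, cols=4)
```

Because the cost function accepts any connection list, the same loop handles irregular, non-uniform pipe structures that algebraic synthesis of regular systolic arrays would not cover.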
Here is the common model
Instruction-stream-based: CPU, software code
Data-stream-based: reconfigurable accelerator (configware code), hardwired accelerator
It's not von Neumann: the vN monopoly in our curricula is severely harmful
The tail is wagging the dog
We need dual paradigm education
A potential Pentium successor
Discard most caches
Have 64* cores, 0.5 - 1 GHz, with clever interconnect for concurrent processes, for multithreading, and for a Kung-Kress pipe network
The Desk-top Supercomputer!
*) CPU mode / DPU mode capability
"Super Pentium" configuration example
[Figure: array of cores, most configured as rDPUs and a few as CPUs]
World TV & game console & multi media center
e.g.: ~ 8 x 8 rDPA: all feasible under 500 MHz
[Figure: SMeXPP block diagram: rDPA, Games, Music, Videos, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards, LCD Display]
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise reduction and artifact removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness enhancement
• Shadow enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
http://pactcorp.com
Dual Paradigm Application Development
High level language source -> software/configware co-compiler
Instruction-stream-based: CPU, software code
Data-stream-based: reconfigurable accelerator (configware code), hardwired accelerator
Software / Configware Co-Compilation
Juergen Becker's CoDe-X, 1996
C language source -> Partitioner -> SW compiler (for the CPU) and CW compiler (for the rDPU array)
Resource parameters: supporting different platforms
CW compiler: Placement & Routing (move the locality of operation)
Bringing together data and processor
by Configware
Place the location of execution into the data pipe: move the stool
Conclusions (1): Hurdles
Obstacles are:
• unbelievably disastrous tools market
• unbelievably ignorant curricula: … teaching as if for a 50-year-old mainframe …
• enabling technologies available, partly decades old, but not used
• transdisciplinary models neither available nor taught in CS, nor elsewhere
• fragmentation into application-domain-specific cultures and trick boxes
Conclusions (2): Future Work
CS must recognize and accept its strategic role and its responsibility toward all its application disciplines: embedded and scientific computing.
The monopoly of the von-Neumann-based mind set in CS education:
• heavily stalls progress in R&D, not only in HPC
• causes high cost in R&D, not only in supercomputing
• CS graduates are not qualified for our job market
The von-Neumann-only mind set in CS urgently needs to go: adopt the dual paradigm common model
Conclusions (3): Chances
New horizons: chances are brilliant
thank you
Backup:
Co-Compiler Enabling Technology
is available from academia
only a small team needed for commercial re-implementation
on the road map to the Personal Supercomputer
Compilation: Software vs. Configware
Software Engineering: source program (C, FORTRAN, MATLAB) -> software compiler -> software code
Configware Engineering: source "program" -> configware compiler (mapper: placement & routing; data scheduler) -> configware code and flowware code
Nick Tredennick's Paradigm Shifts explain the differences
Software Engineering (CPU): resources fixed; algorithm variable (software); 1 programming source needed
Configware Engineering: resources variable (configware); algorithm variable (flowware); 2 programming sources needed
Co-Compilation
Software / Configware Co-Compiler: source (C, FORTRAN, MATLAB) -> automatic SW / CW partitioner (simulated annealing) -> software compiler and configware compiler
Software compiler -> software code
Configware compiler (mapper, data scheduler; simulated annealing) -> configware code and flowware code
Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]
Software / Flowware Co-Compiler: source -> automatic SW / CW partitioner -> software compiler and flowware compiler
Software compiler -> software code
Flowware compiler (data scheduler) -> flowware code
The first archetype machine model: "von Neumann"
Mainframe, CPU
Simple basic machine paradigm: the Software Industry's secret of success
Procedural personalization: compile or assemble
Personalization: RAM-based
Instruction-stream-based mind set
The 2nd archetype machine model: "Kress-Kung"
Reconfigurable accelerator
Simple basic machine paradigm: the Configware Industry's secret of success
Structural personalization: compile
Personalization: RAM-based
Data-stream-based mind set
„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
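A back-of-the-envelope check of what this figure implies, assuming the rack runs around the clock all year (an assumption not stated in the quote). Since the quote says "more than", the results are lower bounds; the variable names are illustrative only.

```python
SAVINGS_USD_PER_YEAR = 10_000    # from the quote
PRICE_USD_PER_KWH = 0.07         # 7 cents per kWh, from the quote
PROCESSORS_PER_RACK = 64         # from the quote
HOURS_PER_YEAR = 24 * 365        # assumption: continuous operation

# Energy that must be saved per year to save that much money.
energy_kwh = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH

# Average power reduction for the whole rack and per processor.
avg_power_kw = energy_kwh / HOURS_PER_YEAR
watts_per_processor = avg_power_kw * 1000 / PROCESSORS_PER_RACK
```

Under these assumptions the saving corresponds to roughly 143,000 kWh per year, or about 16 kW of average power per rack.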
Modern FPGA bestsellers
The new model is reality: FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip
DSP platform FPGA [courtesy Xilinx Corp.]
• 500 MHz flexible soft logic architecture, 200K logic cells
• 500 MHz programmable DSP execution units
• 0.6-11.1 Gbps serial transceivers
• 500 MHz PowerPC™ processors (680 DMIPS) with Auxiliary Processor Unit
• 1 Gbps differential I/O
• 500 MHz multi-port distributed 10 Mb SRAM
• 500 MHz DCM Digital Clock Management