how to cope with the power wall reiner hartenstein tu kaiserslautern patmos 2015, the 25 th...
TRANSCRIPT
How to cope with the Power
Wall
Reiner Hartenstein
TU Kaiserslautern
PATMOS 2015, the 25th International Workshop on Power and Timing Modeling, Optimization and Simulation;
Salvador, Bahia, Brazil, Sept 1-5, 2015
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
2
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
Oldest conference series on power efficiencyPower efficiency is going to become an industry-wide issue
Some incremental improvements are on track, at all abstraction levels - not only in hardware design and software development
however, there is still a lot to be done
http://hartenstein.de/PATMOS/
3
Project .leader: Reiner Hartenstein
partner . . leader: Antonio Núñez
partner . . leader: Francis Jutand
spin-off from thePATMOS project
The Workshop Series
© 2015, [email protected] http://hartenstein.de
Power Efficiency of Programming
Languages(an example)
4
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
5 5© New York Times Data Center at Dallas
Columbia river
G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008
Power consumption by internet: x30 til 2030 if trends continue
Already the internet’s 2007 carbon footprint was higher than that of worldwide air traffic
It‘s more than the entire world‘s total power consumption to-day !!!
IDC predicts: the total number of data centers around
the world will peak at 8.6 million in 2017.
ICT infrastructures
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
6
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
Terminology Problems
Searching a more power-efficient machine paradigm:
Stressing differences to „Control-Flow“ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MITXputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“…
… although the „Dataflow“ scene is I-Structure-centered. 7
Tagged Token Flow
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
A Second OpinionD. D. Gajski, D. A. Padua, D. J. Kuck, R. H.
Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE
COMPUTER, February 1982the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future."
( Still active workshops …, e. g.:5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA )
However, the power efficiency break-thru did not happen here
8
© 2015, [email protected] http://hartenstein.de9
Computing Paradigms
term program counter
execution triggered by paradigm
CPUyes instruction
fetchinstruction-
stream-based (von Neumann)program
counter
DPUCPUCPU
term
nodata
arrival**data-driven or
data-stream-baseddata
counters
rDPURPURPU
Xputer
Reconfigurable Computing
no
complicated I-
structure handling
“Dataflow” Computer
**) “transport-triggered”*) based on tagged token “I-Structure”
I-structure
DPUsDFCDFC
DFC*
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
10
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
11
FFTFFT100
Reed-Solomon Decoding
Reed-Solomon Decoding 2400
Viterbi DecodingViterbi Decoding400
1000
MACMAC
DSP and wireless
Smith-Waterman pattern matchingSmith-Waterman pattern matching288
molecular dynamics simulation
molecular dynamics simulation
88
BLASTBLAST52
protein identification
protein identification
40
Bioinformatics
GRAPEGRAPE
2020
Astrop
hysics
Astrop
hysics
SPIHT wavelet-based image compression
SPIHT wavelet-based image compression 457
real-time face detection
real-time face detection
60006000
video-rate stereo vision
video-rate stereo vision
900pattern recognition
pattern recognition
730
Image processing,Pattern matching,Multimedia
3000CT imagingCT imaging
106
Speedup-Factor
1995 2000 20102005
~300
DNA & protein sequencing
8723 779
x2/y
r
1985 1990100
*) TU Kaiserslautern
103
100
+ Pre-FPGA solutionsXputer
15000
no FPGA: DPLA on MoM by TU-KL*
no FPGA: DPLA on MoM by TU-KL*
Design Rule Check accelerator PISA
(„fair comparizon“)
Design Rule Check accelerator PISA
(„fair comparizon“)
1984: 1 DPLA replaces
256 FPGAs
105
Speedup-Factor
Speed-ups by vN Software to FPGA Migrations
103
~10
% o
f
spee
d-up
Pow
er
savi
ng:
http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf
http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html
fabricated by E.I.S. Multi
University Project Chip
19841984>15 years earlier
3439
CryptoCrypto1000
28514DES breakingDES breaking
Saving Power
© 2015, [email protected] http://hartenstein.de12
The Reconfigurable Computing Paradoxalthough the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of:•wiring overhead• reconfigurability overhead•routing congestion
“von Neumann Syndrome”
C.V. RamamoorthyVon Neumann Syndrome
von Neumann: an extremely power-
inefficient paradigm
Reinvent Computing
© 2015, [email protected] http://hartenstein.de13
Obstacles to widespread FPGA adoption go well beyond the required skill set
- Workshop at FPL_2015
http://reconfigurablecomputing4themasses.net/
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
14
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2006, [email protected] http://hartenstein.de
TU Kaiserslautern
Dual paradigm mind set: an old hat – but was ignored
15
time to space mapping: procedural to structural:
1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.C. G. Bell et al: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972
FF
token bitevoke
FF FF
„This is
so simple ……
1971
loop to pipe mapping
Tunnel Vision!
why did it take 25 years to find out?
PDP-16 RTMs:
© 2015, [email protected] http://hartenstein.de16
xxx
xxx
xxx
|
||
x x
x
x
x
x
x x
x
- -
-
input data stream
xx
x
x
x
x
xx
x
--
-
-
-
-
-
-
-
-
-
-
xxx
xxx
xxx
|
|
|
|
|
|
|
|
|
|
|
|
|
|
output data streams
„data streams“
time
port #
time
time
port #time
port #
define: ... which data item at which time at which port
The Systolic Arrays (1)no instruction streams needed
1980
DPA(pipe network)DataPath Array (array of DPUs)
M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips ... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 H. T. Kung
execution transport-triggered
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
17
just only . special
purpose ? just only . special
purpose ?
M. J. Foster and H. T. Kung: “The Design of Special-Purpose VLSI Chips ... “
(Tunnel Vision)
Karl Steinbuch
It is not sufficient to invent something. You need to recognize, that you have invented something.
Systolic Arrays (2)
http
://ka
rl-st
einb
uch.
org
17
© 2015, [email protected] http://hartenstein.de18
H. T. Kung: „of course algebraic!“
My student Rainer Kress replaced it by simulated annealing: this supports also any irregular & wild shape pipe networks
What Synthesis Method?
The super-systolic array: a generalization of the systolic array:
The super-systolic array: a generalization of the systolic array:
http://kressarray.de/
supports only very special applications with strictly regular data
dependencies
only linear pipes
*) KressArray [ASP-DAC-1995]
now a general purpose methodology !
now a general purpose methodology !
(linear projection)
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
H. T. Kung: “It’s not our job”
19*) or receives
another Tunnel Vision Symptom
without a sequencer: missed to invent a new machine paradigm
Who generates* the datastreams?
Who generates* the datastreams?
(the Xputer)
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
20
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
The Xputer machine Paradigm
obtained by adding auto-sequencing memory (ASM)
GAG: Generic Address GenerstorGAG: Generic Address Generstor
With data counters instead of a program counter
is the TU-KL‘s Symbiosis of Time to Space Mapping and Reconfigurable Computing!
(TU-KL)
ASM
ASM
ASM
datacounter
GAG RAM
ASM
Xputer literature
21
© 2006, [email protected] http://hartenstein.de
TU Kaiserslautern
Duality of procedural Languages
Flowware Languagesread next data itemgoto (data address)jump to (data address)data loopdata loop nestingdata loop escapedata stream branching
yes: internally parallel loops
22
Software Languagesread next instructiongoto (instruction address)jump to (instruction address)instruction loopinstruction loop nestinginstruction loop escapeinstruction stream branching
no: no internally parallel loops
But there is an Asymmetry
program counter:data counter(s):
more simple: no ALU tasks
MoPLstate register(s):
Xp
ute
r lit
era
ture
Xp
ute
r p
ag
es
easy
to
lear
n
cont
rol-p
roce
dura
l
data
-pro
cedu
ral
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
23
Compilation: Software vs. Configware u. Flowware
source program
softwarecompiler
software code
Software Engineeri
ng
Software Engineeri
ng
configware code
mapper
configwarecompiler
scheduler
flowware code
source „program“
Configware
Engineering
Configware
Engineering
time to space
mapping
data
C, etc.
procedural: time domain)
space domain
time domain
Xputer literature
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
24
Heterogeneous: Co-Compilation
softwarecompiler
software code
Software / Configware Co-Compiler
Software / Configware Co-Compiler
configware code
mapperconfigware
compiler
scheduler
flowware code
data
C, or other high level language
automatic SW / CW partitioner
Xputer pages
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
25
>> Outline <<
http://www.uni-kl.de
• The Power Wall• “Dataflow” Computing• Reconfigurable Computing• Time to Space Mapping• The Xputer Paradigm• Conclusions
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
26
Illustrating the Paradigm Trap
The von Neumann Manycore Approach
the watering can model [Hartenstein]
( many crippled watering cans )
many von
Neumann
bottle-necks
many von
Neumann
bottle-necks
The von Neumann single core Approach
The Memory Wall
The Memory Wall ( crippled
watering can )
(1)
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
27
Illustrating the Paradigm Trap
The “Dataflow“ Computer
extremely complicate
d: no watering
can model !
(2)
(a power efficiency break-thru did not happen here)
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
28
Xputer: the only massively power-efficient Paradigm
The Xputer Paradigmhas no von Neumann bottle-neck
has no von Neumann bottle-neck
the watering can model [Hartenstein]
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
We need a Seismic Shift …
The Aristotelian model with the von Neumann CPU as the center of the world has become obsolete To cope with power wall and manycore we have to accept a Copernican world of Heterogeneous SystemsFor many more years we need a heterogeneous triple-paradigm mind set: Configware, Flowware, and still even Software
This is an extremely massive challenge, since more than a half century of software sits squarely on top
… to avoid unaffordable ICT power consumption cost in the
future
29
© 2015, [email protected] http://hartenstein.de30
thank you !
© 2015, [email protected] http://hartenstein.de31
END
© 2015, [email protected] http://hartenstein.de32
Backup for discussion:
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
LUT
LUT
LUT
LUT
LUT
LUT
TableLUT Table
First FPGA available 1984 from Xilinx
© 2015, [email protected] http://hartenstein.de
Transformations since the 70ies
time domain:procedure domain
space domain:structure domain
34
Pipelinek time steps, n DPUs
time algorithm space/time algorithmus
Strip Mining Transformation
Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
n x k time steps, program loop
1 CPU
(time to time/space mapping)
© 2015, [email protected] http://hartenstein.de35
The Reconfigurable Computing Paradox
Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing.
Reinvent Computing
although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of:•wiring overhead• reconfigurability overhead•routing congestion•etc.
© 2015, [email protected] http://hartenstein.de36
datacounter
GAG RAM
ASM: Auto-Sequencing
Memory
ASM: Auto-Sequencing
Memory
use data counters, no program
counter
rDPArDPA
x x
x
x
x
x
x x
x
- -
-
xx
x
x
x
x
xx
x
--
-
-
-
-
-
-
-
-
-
-
ASM Data stream generator usage
(pipe network)
xxx
xxx
xxx
|
||
xxx
xxx
xxx
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ASM
ASM
ASM
ASM
ASM
ASM
AS
M
AS
M
AS
M
AS
M
AS
M
AS
M
implemented by distributed
on-chip- memory
implemented by distributed
on-chip- memory
Xputermachine paradigm
Xputer pages
GAG: Generic Address
Generators
GAG: Generic Address
Generators(reconfigurable: to avoid
memory-cycle-hungry address
computation)
(reconfigurable: to avoid memory-cycle-
hungry address computation)
general purpose
reconfigurable
general purpose
reconfigurable
also more irregular
data routes supported
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
37
Paradigm Shift Consequences
Software EngineeringSoftware Engineering1 programming source needed
algorithm: variable
resources: fixedsoftware
CPUProgram Counter (PC)
configware resources: variable 2 programming sources needed
flowware algorithm: variable
Configware EngineeringConfigware Engineering
Data Counters (DCs) sequencing code (e. g. see MoPL language)
Configuration Code (CC)
von Neumann:
Xputer:
Xputer literature
Heterogenous EngineeringHeterogenous EngineeringXputer and vN:all 3 programming sources needed
© 2015, [email protected] http://hartenstein.de38
no memory wall
DPA
xxx
xxx
xxx
|
||
x x
x
x
x
x
x x
x
- -
-
input data streams
xx
x
x
x
x
xx
x
--
-
-
-
-
-
-
-
-
-
-
xxx
xxx
xxx
|
|
|
|
|
|
|
|
|
|
|
|
|
| output data
streams
DPU operation is transport-triggered
no instruction streamsno message passing
nor thru common memory
massively avoiding memory cycles
Pipelining through DPU Arrays: the TU-KL Xputer principle
DPA = DPU arrayDPU = Data Path Unit
© 2015, [email protected] http://hartenstein.de39
I-Structure Storage
Communication Network
PEI-Structure Storage
PE
MIT Tagged Token Dataflow Architecture*
Wait-Match Unit &Waiting Token Store
Instruction Fetch Unit
Token Queue
Program Store &Constant Store
Form Token Unit
ALU & Form Tag
to/from the Communication Network
RU
SU
PE:
*) source: Jurij Silc
„instruction is executed even if some of its operands are not yet available“
no PC
no updateable global store
?
Tagged Token FlowI would call it
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
Solution: I-Structure concept*
40
can be read but not written
at least one read request has been deferred
read request deferrend but write
operation is allowed
[source: Jurij Silc]
•very complex data structures !
•“each update consumes the structure and the value produces a new data structure”•“awkward or even impossible to implement”
Problems with “Dataflow” Architectures
*) „I-Structure Flow“ instead of „Dataflow“
?
?
= Tagged Token Structure
© 2015, [email protected] http://hartenstein.de
I-Structures (I = incremental) - part 1http://csd.ijs.si/courses/dataflow/Jurij Silc: Dataflow Architectures
28
27
28
26
41
© 2015, [email protected] http://hartenstein.de
MIT Tagged-Token
Dataflow Architecture
(PE)
31
29
http://csd.ijs.si/courses/dataflow/Jurij Silc: Dataflow Architectures
I-structure select
I-structure
assign
30 20
I-Structures - part 2
42
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
43
[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]
much less equipment
neededmuch less memory and bandwidth needed
massively saving energy
Reconfigurable Computing (RC): the intensive Impact
SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster
Tarek El-Ghazawi
Speed-ups by von Neumann to RC Migrations
ApplicationSpeed-up
factor
Savings divisorPower Cost Size
.DNA and Proteinsequencing 8723 779 22 253DES breaking 28514 3439 96 1116
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
Taxonomy
44
Flynn‘s taxonomy:von Neumann only
Diana‘s taxonomy:Reconfigurable Computing
Reiner‘s taxonomy:heterogeneous systems
Reiner‘s 2nd taxonomy:Xputers only
noI
© 2015, [email protected] http://hartenstein.de45
source: iStockphoto
Programming Language Popularity
not yet determined by power efficiency …
(IEEE)
© 2015, [email protected] http://hartenstein.de
TU Kaiserslautern
Von Neumann Syndrome
46
Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011
Shifting to the
dominance of von-
Neumann-only caused an
cumulated damage of
at least trillions of
Dollars, if not
quadrillions ….