TRANSCRIPT
Vanguard TTI, 12/6/11
Q: What do these have in common?
Tianhe-1A: ~4 PFLOPS peak, 2.5 PFLOPS sustained, ~7,000 NVIDIA GPUs, about 5 MW
3G smartphone: baseband processing 10 GOPS, applications processing 1 GOPS and increasing, power limit of 300 mW
?
Both are Based on NVIDIA Chips
Fermi: 3 × 10⁹ transistors, 512 cores
Tegra 2 (T20): 3 ARM cores, GPU, audio, video, etc.
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
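As a quick arithmetic check (not on the slides): 100x in 10 years requires sustained growth of roughly 58% per year.

```python
# Annual growth rate needed for 100x performance in 10 years:
# (1 + r)**10 = 100  ->  r = 100**(1/10) - 1
required_rate = 100 ** (1 / 10) - 1
print(f"{required_rate:.1%} per year")
```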
100x performance in 10 years, Moore’s Law will take care of that, right?
Wrong!
Moore’s Law gives us transistors, which we used to turn into scalar performance
Moore, Electronics 38(8) April 19, 1965
But ILP was ‘mined out’ in 2000
[Figure: single-thread performance (ps/instruction) on a log scale, 1980–2020, with trend lines at 74%/year, 52%/year, and 19%/year; the post-2000 slowdown opens gaps of 30:1, 1,000:1, and eventually 30,000:1 against the historic trend.]
Dally et al., “The Last Classical Computer”, ISAT Study, 2001
And L³ energy scaling ended in 2005
Moore, ISSCC Keynote, 2003
Result: The End of Historic Scaling
C. Moore, “Data Processing in ExaScale-Class Computer Systems”, Salishan, April 2011
Historic scaling is at an end!
To continue performance scaling of all sizes of computer systems requires addressing two challenges:
Power and Parallelism
Much of the economy depends on this
The Power Challenge
In the past we had constant-field scaling:
L' = L/2, V' = V/2
E' = CV² = E/8, f' = 2f
D' = 1/L² = 4D, P' = P
Halve L and get 8x the capability for the same power
Now voltage is held nearly constant:
L' = L/2, V' = V
E' = CV² = E/2, f' = 2f*
D' = 1/L² = 4D, P' = 4P
Halve L and get 2x the capability for the same power in ¼ the area
*f no longer scales as 1/L, but it doesn't matter; we couldn't power it if it did
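The two regimes above can be checked numerically; a sketch following the slide's relations (energy E per op, frequency f, density D, halving feature size L):

```python
# Scaling arithmetic from the slides: halve feature size L (s = 2).
s = 2

# Constant-field (Dennard) scaling: V scales with L, so E = C*V^2
# shrinks by s**3; total power at fixed area stays flat.
e1, f1, d1 = 1 / s**3, s, s**2
p1 = e1 * f1 * d1            # relative power: 1.0 (unchanged)

# Constant-voltage scaling: only C shrinks, so E drops by just s.
e2, f2, d2 = 1 / s, s, s**2
p2 = e2 * f2 * d2            # relative power: 4.0 (4x in the same area)

print(p1, p2)  # 1.0 4.0
```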
Performance = Efficiency
Efficiency = Locality
Locality
The High Cost of Data Movement: fetching operands costs more than computing on them
[Figure: energy per operation at 28nm on a 20mm die with 256-bit buses: 64-bit DP op, 20 pJ; 256-bit access to an 8 kB SRAM, 50 pJ; moving 256 bits on-chip, 26 pJ to 256 pJ depending on distance, up to 1 nJ across the full die; efficient off-chip link, 500 pJ; DRAM read/write, 16 nJ.]
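A back-of-the-envelope use of these numbers (the per-word amortization is my assumption: a 256-bit access holds four 64-bit words):

```python
# Energy to feed one DP FLOP from different levels of the hierarchy,
# using the slide's 28nm figures (picojoules).
PJ_FLOP = 20           # 64-bit DP operation
PJ_SRAM_256B = 50      # 256-bit access to an 8 kB SRAM
PJ_DRAM_256B = 16_000  # DRAM read/write

words_per_access = 256 // 64   # four 64-bit operands per access

def energy_per_operand(pj_access):
    return pj_access / words_per_access

# Two operands from local SRAM vs. two from DRAM, per FLOP:
sram_fetch = 2 * energy_per_operand(PJ_SRAM_256B)   # 25 pJ
dram_fetch = 2 * energy_per_operand(PJ_DRAM_256B)   # 8000 pJ
print(sram_fetch / PJ_FLOP, dram_fetch / PJ_FLOP)   # 1.25 400.0
```

Even a local SRAM fetch costs more than the FLOP; from DRAM it is hundreds of times more.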
Scaling makes locality even more important
It's not about the FLOPS
It's about data movement
Algorithms should be designed to perform more work per unit data movement.
Programming systems should further optimize this data movement.
Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
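A sketch of "more work per unit data movement": tiling a matrix multiply so each word fetched is reused. The load counts are an idealized toy model, not a cache simulation.

```python
# Work per unit data movement for matrix multiply, naive vs. tiled.
def flops(n):
    return 2 * n**3                    # n^3 multiply-adds

def loads_naive(n):
    return 2 * n**3                    # every operand refetched

def loads_tiled(n, b):
    # (n/b)^3 tile-multiplies, each loading two b-by-b tiles once
    return 2 * (n // b) ** 3 * b**2

n, b = 1024, 32
print(flops(n) / loads_naive(n))       # 1.0 FLOP per word moved
print(flops(n) / loads_tiled(n, b))    # 32.0 FLOPs per word moved
```

Arithmetic intensity grows with the tile size b, so the same FLOPs cost far less data movement.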
System Sketch
Echelon Chip Floorplan
[Floorplan, 10nm process, 290 mm² (17mm on a side): 128 multi-lane SMs grouped four to a NOC tile, 8 latency-optimized cores (LOCs), L2 banks and crossbar (XBAR), with DRAM I/O and network (NW) I/O at the chip perimeter.]
Overhead
An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
SM Lane Architecture
[Lane diagram: a control path (L0 I$, thread PCs / active PCs, scheduler) feeding a datapath of operand register files (ORFs), FP/Int units and an LS/BR unit, a main RF, L0/L1 address networks, and four local-memory (LM) banks on the LD/ST path.]
64 threads, 4 active threads, 2 DFMAs (4 FLOPS/clock); ORF bank: 16 entries (128 bytes); L0 I$: 64 instructions (1 KB); LM bank: 8 KB (32 KB total)
Solving the Power Challenge – 1, 2, 3
Solving the ExaScale Power Problem
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
Parallelism
Parallel programming is not inherently any more difficult than serial programming
However, we can make it a lot more difficult
A simple parallel program
forall molecule in set {                   // launch a thread array
  forall neighbor in molecule.neighbors {  // nested
    forall force in forces {               // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}
Why is this easy?
forall molecule in set {                   // launch a thread array
  forall neighbor in molecule.neighbors {  // nested
    forall force in forces {               // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}
No machine details. All parallelism is expressed. Synchronization is semantic (in the reduction).
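The same pattern can be sketched in ordinary Python (the force kernel and neighbor lists are placeholders I've invented; the parallel map stands in for the thread array):

```python
# The doubly-nested forall: all parallelism expressed, and the only
# synchronization is semantic, inside the sum() reduction.
from concurrent.futures import ThreadPoolExecutor

def pair_force(m, n):                  # placeholder pairwise kernel
    return 1.0 / (abs(m - n) + 1)

forces = [pair_force]                  # the set of force kernels
neighbors = {m: [m - 1, m + 1] for m in range(4)}

def total_force(molecule):
    return sum(f(molecule, nb)
               for f in forces
               for nb in neighbors[molecule])

with ThreadPoolExecutor() as pool:     # "launch a thread array"
    force = dict(zip(neighbors, pool.map(total_force, neighbors)))
print(force)
```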
We could make it hard
pid = fork() ;                  // explicitly managing threads

lock(struct.lock) ;             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;

code = send(pid, tag, &msg) ;   // partition across nodes
Programmers, tools, and architecture need to play their positions
Programmer
Architecture
Tools
Programmers, tools, and architecture need to play their positions
Programmer: the algorithm, all of the parallelism, abstract locality
Architecture: fast mechanisms, exposed costs
Tools: combinatorial optimization, mapping, selection of mechanisms
Programmers, tools, and architecture need to play their positions
Programmer:
forall molecule in set {                   // launch a thread array
  forall neighbor in molecule.neighbors {  // nested
    forall force in forces {               // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}
Tools: map foralls in time and space, map molecules across memories, stage data up and down the hierarchy, select mechanisms
Architecture: exposed storage hierarchy, fast comm/sync/thread mechanisms
Fundamental and Incidental Obstacles to Programmability
Fundamental: expressing 10⁹-way parallelism; expressing locality to deal with a >100:1 global:local energy ratio; balancing load across 10⁹ cores
Incidental: dealing with multiple address spaces; partitioning data across nodes; aggregating data to amortize message overhead
The fundamental problems are hard enough. We must eliminate the incidental ones.
Execution Model
[Diagram: a global address space and an abstract memory hierarchy; threads and objects A and B interact through loads/stores, active messages, and bulk transfers.]
Thread array creation, messages, block transfers, collective operations – at the “speed of light”
Fermi
Hardware thread-array creation
Fast __syncthreads()
Shared memory
Scalar ISAs don’t matter
Parallel ISAs – the mechanisms for threads, communication, and synchronization make a huge difference.
Abstract description of Locality – not mapping
compute_forces::inner(molecules, forces) {
  tunable N ;
  set part_molecules[N] ;
  part_molecules = subdivide(molecules, N) ;

  forall(i in 0:N-1) {
    compute_forces(part_molecules[i]) ;
  }
}
Abstract description of Locality – not mapping
Autotuner picks number and size of partitions - recursively
No need to worry about “ghost molecules” with global address space, it just works
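A toy autotuner in this spirit (the overhead-vs-imbalance cost model is invented for illustration; a real autotuner measures actual runs and recurses):

```python
# Pick the tunable partition count N by trying candidates and keeping
# the cheapest under a simple cost model.
def cost(work, n_parts):
    per_partition_overhead = 1.0 * n_parts   # launch/sync cost grows with N
    largest_chunk = work / n_parts           # fewer parts = less parallelism
    return per_partition_overhead + largest_chunk

def autotune(work, candidates):
    return min(candidates, key=lambda n: cost(work, n))

best_n = autotune(work=10_000, candidates=[2**k for k in range(9)])
print(best_n)  # 128: the power of two closest to sqrt(10000)
```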
Autotuning Search Spaces
T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, "Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation", IEEE PACT, pp. 237–248, 2000.
[Figure: execution time of matrix multiplication across tiling and unrolling choices.]
Architecture enables simple and effective autotuning
Performance of Auto-tuner
                  Conv2D   SGEMM   FFT3D   SUmb
Cell       Auto    96.4    129     57      10.5
           Hand    85      119     54
Cluster    Auto    26.7    91.3    5.5     1.65
           Hand    24      90      5.5
Cluster    Auto    19.5    32.4    0.55    0.49
of PS3s    Hand    19      30      0.23

Measured raw performance of benchmarks: auto-tuner vs. hand-tuned version, in GFLOPS.
For FFT3D, performance is with fusion of leaf tasks.
SUmb is too complicated to be hand-tuned.
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
A Prescription
Research
Need a research vehicle (an experimental system): co-design the architecture, programming system, and applications
Productive parallel programming: express all the parallelism and locality
Compiler and run-time map to the target machine
Leverage an existing ecosystem
Mechanisms for threads, communication, and synchronization: eliminate 'incidental' programming issues
Enable fine-grain execution
Power: locality (an exposed memory hierarchy and software to use it)
Overhead: move scheduling to the compiler
Others are investing; if we don't invest, we will be left behind.
Education
We need parallel programmers, but we are training serial programmers and serial thinkers
Parallelism throughout the CS curriculum:
Programming
Algorithms: parallel algorithms; analysis focused on communication, not counting ops
Systems: models need to include locality
A Bright Future from Supercomputers to Cellphones
Eliminate overhead and exploit locality to get 100x power efficiency
Easy parallelism with a coordinated team
Programmer
Tools
Architecture
100x performance increase in 10 years
Granularity
Number of threads is increasing faster than problem size
[Figure: machine size vs. problem size, with weak-scaling and strong-scaling trajectories.]
Smaller sub-problem per thread
More frequent comm, sync, and thread operations
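The shrinking grain size is simple arithmetic; a sketch with made-up problem and machine sizes:

```python
# Per-thread grain size under weak vs. strong scaling.
def grain(problem_size, threads):
    return problem_size // threads

base = 1_000_000
for threads in (1_000, 10_000, 100_000):
    weak = grain(base * (threads // 1_000), threads)    # problem grows too
    strong = grain(base, threads)                       # problem fixed
    print(threads, weak, strong)
# weak scaling holds the grain at 1000; strong scaling shrinks it
# 1000 -> 100 -> 10, so comm, sync, and thread operations happen
# more often per unit of useful work
```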
This fine-grain parallelism is multi-level and irregular
To support this requires fast mechanisms for:
Thread arrays: create, terminate, suspend, resume; hardware allocation of resources (threads, registers, shared memory) to a thread array, with locality
Communication: data movement up and down the hierarchy; fast active messages (message-driven computing)
Synchronization: collective operations (e.g., barrier, reduce) and pairwise (producer-consumer)
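The two synchronization styles can be sketched with Python's stdlib primitives (a software stand-in for the hardware mechanisms, not a model of their cost):

```python
# Collective (barrier) and pairwise (producer-consumer) synchronization.
import threading
import queue

barrier = threading.Barrier(4)     # collective: all 4 threads rendezvous
channel = queue.Queue(maxsize=1)   # pairwise: blocking handoff
received = []

def worker(tid):
    barrier.wait()                 # no thread proceeds until all arrive
    if tid == 0:
        channel.put("payload")     # producer
    elif tid == 1:
        received.append(channel.get())  # consumer blocks until put()

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # ['payload']
```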
J-Machine Speedup with Strong Scaling
Noakes et al., "The J-Machine Multicomputer: An Architectural Evaluation", ISCA 1993, pp. 224–235
[Figure: speedup holds up at very fine grain, down to 2 characters per node.]