IPDPS Keynote May 21, 2014
Reading the Tea-Leaves: How Architecture Has Evolved at the High End
Peter M. Kogge
McCourtney Prof. of CS & Engr., University of Notre Dame
Acknowledgement: This work was funded in part by the US Dept. of
Energy, Sandia National Labs, as part of their XGC project,
and in part by the DOE PSAAP C-SWARM Project.
Entrails
Rationale Behind This Talk
• Track technology trends
• Understand “best of breed”
• Project ahead
• Identify “tall poles”
• Do so on a yearly basis
– First as part of the EXASCALE technology report
– Then as yearly updates under Sandia XGC project
– In future under DOE PSAAP C-SWARM Center
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
The Technology Environment
• We all believe death of Moore’s Law is in sight
• And “TOP500” has continued unabated for 20 years
• But the world already changed in 2004
• Problem: Power Wall Limits Clock
• Problem: Loss of Memory Density
• Problem: Loss of off-chip bandwidth
• Result: Have entered era of very massive parallelism
We All Know The Story: Unbroken Growth in TOP500 Rmax
[Chart: Rmax (Gflops/sec), 10 to 10^8, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; 2004 marked]
But When We Look Deeper, Something Big Happened in 2004!
• Cores/socket increased
• Clock rates went flat
• Memory/core went flat
• Total cores increased even faster
• Flops/cycle increased even faster
• Rmax growth continues
[Chart: Compound Annual Growth Rate (CAGR), 0.50 to 2.50, vs. year 1996-2012, for Rmax (Gflop/s), Total Cores, Ave Cycles/sec per core (MHz), Mem/Core (GB), Ave. Cores/Socket, and TC: Total Concurrency (Rmax); 2004 marked]
The 2004 "Perfect Storm"
Before 2004 (the regime since the 1970s):
• Single-core microprocessors: more capable & faster; power increases offset by lower voltages
• Memory: more memory per chip, due to density increases & bigger chips; memory latency improves slowly
• Interconnect: tracks clock; wire driven
At 2004 — Moore's Law continues (transistor size and intrinsic delay keep decreasing), but:
• Operating voltage stops decreasing
• Chip power exceeds inexpensive cooling
• No more performance gains from more complex cores
• Off-chip I/O maxes out
• Economics of DRAM inhibits bigger chips
• Wire interconnect peaks
From today out to ~2020 — the rise of multi-core:
• More, simpler cores per die; slower clocks; relatively constant off-die bandwidth
• Memory: slow density increase; slow growth in off-memory bandwidth
• Interconnect: complex, power-consuming wire; very complex fiber optics
Power: THE 1st Design Constraint
• Moore’s Law: feature reduction by “S” per year
• Power in a CMOS circuit: αCFV²
– αC: average capacitance switched per clock cycle; scales ≈ 1/S
– F: clock rate; intrinsic peak scales as S
– V: operating voltage: this is the problem!
• αCF is essentially flat across Moore generations
• But we can pack S² more circuits on the same chip size
• Thus chip power scales as S²V²
• Before 2004, V declined, reducing increase in power
• After 2004, V has flattened; power can run wild!
– And better cooling is too expensive
• Only option: reduce clock
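To make the arithmetic concrete, here is a minimal sketch of this scaling argument, assuming an illustrative shrink factor of S = 1.4 per generation (not the talk's exact numbers):

```python
def relative_power(n_generations, s=1.4, voltage_scales=True):
    """Chip power relative to generation 0.

    Power ~ (circuits per die) * (alpha*C*F per circuit) * V^2.
    alpha*C*F per circuit stays ~flat (C falls ~1/S while peak F rises ~S),
    circuits per die grow by S^2 per generation, so power ~ S^(2n) * (V_n/V_0)^2.
    """
    circuits = s ** (2 * n_generations)
    v_ratio = s ** (-n_generations) if voltage_scales else 1.0  # Dennard era: V shrinks with features
    return circuits * v_ratio ** 2

print(relative_power(5, voltage_scales=True))    # pre-2004 behavior: ~1.0 (power stays flat)
print(relative_power(5, voltage_scales=False))   # post-2004: ~29x growth -> the power wall
```

With voltage scaling, the S² extra circuits are exactly paid for by the V² drop; once V flattens, the only remaining lever is the clock.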
Flattening Operating Voltage
[Chart: Operating Voltage (Vdd), 0.0 to 5.0 V, vs. year 1980-2024; series: Actual MPU, High Performance Logic, Low Power Logic, DRAM; 2004 and Today marked]
We’ve Hit a Power Ceiling
data from www.cpudb.stanford.edu
[Chart: Total Die Power (Watts), 1 to 1,000, vs. year 1980-2012]
The Clock Ceiling
data from www.cpudb.stanford.edu
[Chart: Clock (MHz), 10 to 10,000, vs. year 1980-2012]
Memory Density Increasing, But
[Chart: Memory Density (Gb/cm²), 0.1 to 10,000, vs. year 1985-2025; series: DRAM, SLC Flash, MLC Flash (2b cells), MLC Flash (3b cells), MLC Flash (4b cells), MLC Flash (3D)]
DRAM Die Sizes Are Flattening or Decreasing
[Chart: Max Production Die Size (sq. mm), 10 to 1,000, vs. year 1985-2025; series: Microprocessor, DRAM, ASIC, Flash]
So Memory Density Growth/Die is Slowing
[Chart: Memory Capacity (Gb/chip), 0.001 to 10,000, vs. year 1985-2025; series: DRAM, SLC Flash, MLC Flash (2b cells), MLC Flash (3b cells), MLC Flash (4b cells), MLC Flash (3D)]
Chip I/O Pitch is Flattening
[Chart: Contact Pitch (microns), 1 to 1,000, vs. year 1985-2025; series: Area Array Flip Chip, 2-Row Staggered, Wedge Bond, Ball Bond, 3-Row Staggered, TSV (thermocompression), TSV (solder microbump), Eqvt. Cost/Perf MPU, Eqvt. High Perf MPU, Eqvt. ASIC]
And Commodity Off-Chip Signaling Rates Rise Slowly at Best
[Chart: Signalling Rate (GT/s), 0.1 to 100, vs. year 2000-2025; series: DDR, DDR2, DDR3, DDR4, GDDR, PCI-Express, Hypertransport, QPI, Differential, Differential per wire; CAGRs of 12% and 17% marked]
Thus “Per Unit Logic” Off-Chip B/W Decaying
[Chart: Max Off-Chip B/W per Million Transistors (Gbps), 0.1 to 10, vs. year 2000-2020; series: All DDRx, All GDDR, All I/O Protocol]
With Even Less B/W When We Normalize to “per Cycle”
[Chart: B/W per Million Transistors per Power-Limited Cycle, vs. year 2000-2020; series: All DDRx, All GDDR, All I/O Protocol]
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
Today's Heavyweight
[Diagram: a Power 7 drawer — four processor sockets, each with its own set of DIMMs, connected to a router/I/O socket]
Lightweight Architectures: Starting with BlueGene/L
2 Nodes per “Compute Card.” Each node:
• A low power compute chip
• Some memory chips
• “Nothing Else”
System Architecture:
• Multiple Identical Boards/Rack
• Each board holds multiple Compute Cards
• “Nothing Else”
The BG/L compute chip:
• 2 simple dual-issue cores, each with dual FPUs
• Memory controller
• Large eDRAM L3
• 3D torus message interface
• Collective interface
• All at a sub-GHz clock
"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005
BlueGene/Q
http://www.heise.de/newsticker/meldung/SC-2010-IBM-zeigt-BlueGene-Q-mit-17-Kernen-1138226.html
Integrated on the compute chip:
• NIC
• Memory controllers
[Diagram: a BlueGene/Q node — processor socket directly attached to its DRAM chips]
Other Lightweight Systems Emerging
• Calxeda quad-socket, quad-core ARM nodes
• HP Moonshot
Heterogeneous Architectures
http://www.nvidia.com/object/fermi_architecture.html
[Diagram: a heterogeneous node — a conventional host processor with host memory attached to a GPU (Graphics Processing Unit); the GPU has a kernel scheduler, many Streaming Multiprocessors (SMs), L2 & memory channel interfaces, and its own graphics memory chips; each SM contains multiple stream processors, a multithreaded instruction unit, and local memory]
http://www.nvidia.com/docs/IO/43395/BD-05238-001_v03.pdf
[Photo: a Titan blade]
Big Little Architectures
• Heterogeneous multi-core with same ISA
• “Bigger” cores:
– Higher performance
– But less energy efficient
• “Littler” cores:
– Less performance
– But more energy efficient
• Ability to move program state from core to core
• Examples:
– ARM Cortex-A15 and A7, Cortex-A53 and A57
– Almost: Intel Xeon and Xeon Phi
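As a purely illustrative sketch of the tradeoff (the speed and power ratios below are made-up placeholders, not vendor figures), the scheduling idea is to run work on the smallest core that still meets its deadline:

```python
CORES = {
    # name: (relative speed, relative power) -- placeholder numbers
    "big":    (2.0, 4.0),   # higher performance, less energy efficient
    "little": (1.0, 1.0),   # lower performance, more energy efficient
}

def pick_core(work_units, deadline_s):
    """Return (core, energy) for the cheapest core that still makes the deadline."""
    best = None
    for name, (speed, power) in CORES.items():
        t = work_units / speed
        if t <= deadline_s:
            energy = power * t              # energy = power x time
            if best is None or energy < best[1]:
                best = (name, energy)
    return best

print(pick_core(work_units=10, deadline_s=12))  # relaxed deadline -> little core wins on energy
print(pick_core(work_units=10, deadline_s=6))   # tight deadline -> only the big core makes it
```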
[Diagram: example big.LITTLE configurations — (a) 1-to-1: one big core and one little core sharing a cache; (b) 2-to-1: one big core and two little cores sharing a cache]
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
The Exascale Study Analysis: 67 MW for 1 EF/s = 67 pJ/flop
[Figures: (a) quilt packaging, (b) through-via chip stack]
Projected energy breakdown:
• FPU: 21%
• Register File: 11%
• Cache Access: 20%
• On-chip: 6%
• Off-chip: 13%
• DRAM Access: 1%
• Leakage: 28%
But this IGNORED most of the memory access path!
The Access Path: Interconnect-Driven
[Diagram: microprocessor → router → more routers → some sort of memory structure]
Sample Path – Off-Module Access
1. Check local L1 (miss)
2. Go thru TLB to remote L3 (miss)
3. Across chip to correct port (thru routing table RAM)
4. Off-chip to router chip
5. 3 times thru router and out
6. Across microprocessor chip to correct DRAM I/F
7. Off-chip to get to correct DRAM chip
8. Cross DRAM chip to correct array block
9. Access DRAM Array
10. Return data to correct I/F
11. Off-chip to return data to microprocessor
12. Across chip to router table
13. Across microprocessor to correct I/O port
14. Off-chip to correct router chip
15. 3 times thru router and out
16. Across microprocessor to correct core
17. Save in L2, L1 as required
18. Into Register File
Relook at Exascale Strawman
[Flow diagram: the full access path — L1 tag check (hit/miss), TLB, move to L2/L3 and tag check, directory access, then for misses a chain of moves: on-module chip-to-chip, chip-to-chip (TSV), module-to-module and rack-to-rack hops, multiple router traversals, DRAM block access, and the same chain in reverse to return data to the register file; shown for both local and non-local cases]
AND THIS DOESN'T ACCOUNT FOR TLB MISSES!!!
In 2015, core energy per flop for Linpack is < 10 pJ.
Assumed energy per operation:
  Operation                   Energy (pJ/bit)
  Register File Access        0.16
  SRAM Access                 0.23
  DRAM Access                 1
  On-chip movement            0.0187
  Through-Silicon Via (TSV)   0.011
  Chip-to-Board               2
  Chip-to-Optical             10
  Router on-chip              2
Resulting energy for a Linpack inner loop:
  Step              Target   pJ       # Occurrences   Total pJ   % of Total
  Read Alphas       Remote   13,819   4               55,276     16.5%
  Read pivot row    Remote   13,819   4               55,276     16.5%
  Read 1st Y[i]     Local    1,380    88              121,400    36.3%
  Read Other Y[i]s  L1       39       264             10,425     3.1%
  Write Y's         L1       39       352             13,900     4.2%
  Flush Y's         Local    891      88              78,380     23.4%
  Total                                               334,656
  Ave per Flop                                        475
That average of 475 pJ/flop is ~50X the <10 pJ core energy. If this is true, 1 EF/s = 0.5 GW!
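A minimal sketch of the bookkeeping behind the Linpack table above; the per-step energies and counts are the slide's, while the flop count (~704) is the value implied by 334,656 pJ / 475 pJ per flop. Recomputed totals differ slightly from the slide's because the displayed per-occurrence energies are rounded.

```python
steps = [
    # (step, target, pJ per occurrence, occurrences) -- values from the table above
    ("Read Alphas",      "Remote", 13_819,   4),
    ("Read pivot row",   "Remote", 13_819,   4),
    ("Read 1st Y[i]",    "Local",   1_380,  88),
    ("Read Other Y[i]s", "L1",         39, 264),
    ("Write Y's",        "L1",         39, 352),
    ("Flush Y's",        "Local",     891,  88),
]
flops = 704  # assumed flop count for the accounted loop (334,656 pJ / 475 pJ per flop)

total = sum(pj * n for _, _, pj, n in steps)
for name, target, pj, n in steps:
    print(f"{name:17s} {target:7s} {pj * n:>9,} pJ  {100 * pj * n / total:5.1f}%")
print(f"Total {total:,} pJ -> ~{total / flops:.0f} pJ per flop, ~50X the <10 pJ/flop core energy")
```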
Access vs Reach
[Chart: pJ to access a word, 1 to 10^5, vs. reachable GB, 10^-5 to 10^7; curve fit = 626*GB^0.2; Tianhe-2 annotated at ~7,500 and ~9,000 pJ]
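For reference, evaluating the fitted curve at a few reach sizes (the sample points are chosen for illustration):

```python
def access_pj(reachable_gb):
    """Fitted access energy from the chart: pJ per word ~ 626 * GB^0.2."""
    return 626.0 * reachable_gb ** 0.2

for gb in (1e-5, 1.0, 1e3, 1e6):
    print(f"{gb:>10g} GB reachable -> ~{access_pj(gb):6.0f} pJ per word")
```

At a full-system reach of ~10^6 GB the fit gives roughly 10,000 pJ per word, in the neighborhood of the Tianhe-2 annotations.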
And This Means You Can Reference Memory Occasionally….
• 100% L1 Hit, once every 6.5 Flops
• 80% L1 Hit and 100% L2/L3 Hit, once every 18 Flops
• 80% L1 Hit, 90% L2/L3 Hit, and all the rest is local, once every 35 Flops
• 80% L1 Hit, 90% L2/L3 Hit, 60% local, and 40% Global, once every 118 Flops
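The arithmetic behind these bullets is simple; the sketch below is generic, with a per-flop energy budget and per-level access energies that are placeholders (so it will not reproduce the 6.5/18/35/118 figures exactly):

```python
def flops_between_refs(budget_pj_per_flop, levels):
    """levels: list of (fraction of references served at this level, pJ per reference)."""
    avg_pj_per_reference = sum(frac * pj for frac, pj in levels)
    return avg_pj_per_reference / budget_pj_per_flop

BUDGET = 6.0  # assumed pJ per flop available for memory traffic (placeholder)
print(flops_between_refs(BUDGET, [(1.00, 39)]))                              # all L1 hits
print(flops_between_refs(BUDGET, [(0.80, 39), (0.20, 400)]))                 # 80% L1, rest L2/L3
print(flops_between_refs(BUDGET, [(0.80, 39), (0.18, 400), (0.02, 1_380)]))  # + local DRAM tail
```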
What Does This Tell Us?
• Cannot afford ANY memory references
• Many more energy sinks than you think
• Cost of Interconnect Dominates
• Must design for on-board or stacked DRAM
• Need to redesign the entire access path:
– Alternative memory technologies – reduce access cost
– Alternative packaging – reduce bit movement cost
– Alternative transport protocols – reduce # bits moved
– Alternative execution models – reduce # of movements
AND IT GETS MUCH WORSE
FOR CACHE UNFRIENDLY PROBLEMS
The Key Take Away
You can hide the latency…
But
You Can’t Hide the Energy
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
Emergence of Lightweight, Hybrid
[Chart: Rmax (Gflops/sec), 10 to 10^8, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; trend CAGR = 1.88]
Sockets and Cores Growing
[Chart: counts, 1 to 10^7, vs. year 1992-2016; series: Total Sockets and Total Cores for Heavyweight (H), Lightweight (L), and Hybrid (M) systems]
Flops/Cycle: This Drives Programming Complexity
[Chart: TC — Flops per Cycle (Rmax), 10 to 10^8, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; trend CAGR = 1.99]
Floating Point Efficiency
[Chart: Efficiency (Rmax/Rpeak), 0.0 to 1.0, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid]
Vs. 1%-3% for Tianhe-2 on the HPCG benchmark.
Memory Growth Has Slowed
[Chart: Memory (GB), 1 to 10^9, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; trend CAGR = 2.00]
And Memory per Flop/s Is Dropping!
[Chart: Memory per GF(Rmax) (GB), 0.01 to 10, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; trend CAGR = 0.72]
Ceiling on System Power "per Socket"
[Chart: Watts/Socket, 10 to 1,000, vs. year 1992-2016; series: Heavyweight, Lightweight, Hybrid; trend CAGR = 1.32]
Energy per Flop is Dropping
[Chart: Energy per Flop (pJ), 10 to 10^7, vs. year 1992-2024; series: Heavyweight, Lightweight, and Heterogeneous systems (actual, scaled, and constant-technology projections), Historical trend, CMOS Projection (Hi Perf), CMOS Projection (Low Power), UHPC Goal; one-time gains from the lightweight and hybrid transitions are marked]
Trends vs. the 20 pJ/flop goal (technology alone):
• Heavyweight: never
• Lightweight: close
• Hybrid: possible
But this is Rpeak.
But Energy Efficiency Limits Performance
[Chart: pJ/Flop (Rmax), 10^2 to 10^6, vs. Gflops (Rmax), 10^2 to 10^8; series: Top500 and Green500 data for Heavyweight, Lightweight, and Hybrid systems]
Conclusions
• TOP500 continues to grow at 2X/year
– With 80% efficiency
• But less and less relevance to real problems
• And need to program for 10s of millions of flops/cycle
• With falling relative memory footprint
• New HPCG benchmark due out in June
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
Graph 500
• Start with a root, find all reachable nodes (a breadth-first search)
• Simplifications: only 1 kind of edge, no weights
• Performance metric: TEPS — Traversed Edges Per Second
[Figure: a small example graph with edges e0-e8; starting at vertex 1, the search reaches 1, 0, 3, 2, 9, 5]
Problem classes:
  Level   Scale   Size     GB          Edge (B)
  10      26      Toy      17          272
  11      29      Mini     140         280
  12      32      Small    1,024       256
  13      36      Medium   17,408      272
  14      39      Large    143,360     280
  15      42      Huge     1,153,434   281.6
  Average                              273.6
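The kernel being timed is a breadth-first search; a toy sketch of it might look like the code below. This is not the reference implementation — real Graph500 runs count the graph's input edges, use far larger scales, and are distributed — and the adjacency list is an assumed reconstruction consistent with the slide's example, not necessarily the figure's exact edges.

```python
from collections import deque
import time

def bfs_teps(adj, root):
    """BFS from root over an adjacency-list graph; returns (parents, TEPS)."""
    parent = {root: root}
    queue = deque([root])
    traversed = 0
    t0 = time.perf_counter()
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            traversed += 1                 # one traversed edge
            if w not in parent:            # first visit to w
                parent[w] = v
                queue.append(w)
    return parent, traversed / (time.perf_counter() - t0)

# The slide's example: starting at 1, the search reaches 1, 0, 3, 2, 9, 5.
adj = {0: [1], 1: [0, 3], 2: [3], 3: [1, 2, 9], 5: [9], 9: [3, 5]}
parent, teps = bfs_teps(adj, root=1)
print(sorted(parent), f"~{teps:,.0f} TEPS on this toy graph")
```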
TEPS vs Time
[Chart: TEPS (billions/sec), 0.01 to 10^5, vs. year 2010-2014; series: Heavyweight, Lightweight, Hybrid, Other, Trend]
TEPS vs # Nodes
[Chart: TEPS (billions/sec), 0.01 to 10^5, vs. number of nodes, 1 to 10^5; series: Heavyweight, Lightweight, Hybrid, Other, Trend]
TEPS per Node vs Time
[Chart: GTEPS/Node, 10^-5 to 10^2, vs. year 2010-2014; series: Heavyweight, Lightweight, Hybrid, Other]
Combined Metric: TEPS × # Vertices
[Chart: GTEPS × # Vertices, 10^5 to 10^17, vs. year 2010-2014; series: Heavyweight, Lightweight, Hybrid, Other]
TEPS per Node vs # Nodes
[Chart: GTEPS per Node, 10^-4 to 10, vs. number of nodes, 1 to 10^5; series: Heavyweight, Lightweight, Hybrid, Other]
Where Does Improvement Come From? BlueGene Analysis
[Chart: GTEPS/Node vs. year 2010-2014 for BGP and BGQ; the gap between them splits into a part due to hardware and a part due to software/algorithm improvements]
TEPS/Node vs Node Count
[Chart: GTEPS/Node, 10^-4 to 1, vs. number of nodes, 10^2 to 10^5; series: BGQ, BGP]
ns per TEP vs # Nodes
[Chart: ns per TEP per core, 0.1 to 10^6, vs. number of nodes, 1 to 10^5; series: Heavyweight, Lightweight, Hybrid, Other]
Single-node systems are 100X better.
ns per TEP vs Memory
[Chart: ns per TEP per core, 0.1 to 10^6, vs. problem scale, 10 to 40; series: Heavyweight, Lightweight, Hybrid, Other]
Large memory systems are not shared.
Conclusions
• GRAPH500 has two metrics: TEPS & problem size
– With problem size growing towards petabytes
• Algorithm: heavy on
– remote atomic accesses
– all-to-all communication patterns
• Near perfect weak scaling, but lose 100X in going from shared memory to distributed
• Comparative BlueGene data: ½ hardware, ½ algorithm (driven by new architectural features)
• New benchmarks coming: search, shortest path, max independent set
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
The HPCC Challenge Benchmarks
• Outgrowth of the DARPA HPCS program
• Focus on large-scale parallel MPI implementations
• 362 system measurements reported since 2003
• Benchmarks:
– HPL*: the LINPACK benchmark used in TOP500
– DGEMM: DP FP matrix-matrix multiply
– STREAM*: Sustainable memory B/W for 4 long vector ops
• TRIAD: a[i] = b[i] + c[i]*SCALAR
– PTRANS: Comm capabilities for parallel matrix transpose
– RandomAccess*: Integer random updates (GUPS)
– FFT*: DP FP complex one-dimensional FFT
• Starred (*) versions are reported at the HPC Challenge Awards at SC
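For instance, the TRIAD kernel named above is just a[i] = b[i] + c[i]*SCALAR; a minimal NumPy sketch follows, where the array size and the 3x8-byte traffic model are illustrative assumptions, not the official benchmark harness:

```python
import numpy as np, time

n, scalar = 10_000_000, 3.0
b = np.random.rand(n)
c = np.random.rand(n)
a = np.empty_like(b)

t0 = time.perf_counter()
a[:] = b + scalar * c           # the TRIAD update
dt = time.perf_counter() - t0

bytes_moved = 3 * n * 8         # nominal traffic: read b, read c, write a (8-byte doubles)
print(f"TRIAD: ~{bytes_moved / dt / 1e9:.1f} GB/s (NumPy temporaries make this approximate)")
```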
Performance/Process vs Time
[Charts: per-process performance vs. year 2002-2016 for Heavyweight, Lightweight, Hybrid, and Other systems — HPL (TF/s per process), EP STREAM Triad (GB/s per process), Global FFT (GF per process), Global RandomAccess (GUP/s per process)]
Performance/Process vs # Processes
[Charts: per-process performance vs. number of processes (1 to 10^6) for Heavyweight, Lightweight, Hybrid, and Other systems — HPL (TF/s per process), EP STREAM Triad (GB/s per process), Global FFT (GF per process), Global RandomAccess (GUP/s per process)]
How Does HPL Relate to STREAM?
[Chart: HPL TF/s per process vs. EP STREAM Triad GB/s per process; Heavyweight, Lightweight, Hybrid, Other]
Seems like a linear correlation between memory B/W & flops.
How Does HPL Relate to Interconnect?
[Chart: HPL TF/s per process vs. Global RandomAccess GUP/s per process; Heavyweight, Lightweight, Hybrid, Other]
Seems like a weak correlation between network B/W & flops.
Another View Using PTRANS
[Chart: HPL TF/s per process vs. PTRANS GB/s per process; Heavyweight, Lightweight, Hybrid, Other]
Seems like some correlation between network B/W & flops.
Topics
• 2004: Technology “Perfect Storm”
• Our Current Architecture Spectrum
• The Exascale Memory Energy Horror Show
• TOP500 Lessons
• GRAPH500 Lessons
• HPCC Lessons
• 3D Stack Projections
Emerging Technology: 3D Stacks
• Stackable memory chips
– Same memory cells as in today’s DRAM
– Internally re-architected for more simultaneous access
• “Through Silicon Vias” (TSVs) to go vertically up and down
• Logic chip on bottom
– Multiple memory controllers
– More sophisticated off-stack interfaces than DDR
• Prototype demonstrated in 2011
• 1st Product in 2014
– Spec: http://www.hybridmemorycube.org
– Capacity: up to 8GB: 8X single chip
– Bandwidth: up to 480GB/s: 40X
– Lots of room on logic chip
http://www.micron.com/products/hybrid-memory-cube
[Figure: HMC memory die — 16 64 Mb "partitions," each a 2-bank partition]
The HMC Architecture
[Diagram: HMC architecture — a logic base die beneath a stack of memory die; each memory die is divided into multi-bank partitions; a "vault" is a TSV-connected vertical stack of partitions with its own vault controller on the logic base die; intra-logic-chip routing connects all vault controllers to the shared off-stack I/O (high-speed, full-duplex, multi-lane links: 160 GB/s => 320 => …), plus a built-in test engine and a maintenance interface]
Projecting Forward
[Chart: Estimated DRAM Stack Memory Capacity (GB), ~0.1 to 1,000, vs. year 2014-2026; series: 4-High Stack, 8-High Stack, 16-High Stack, Production DRAM Die]
Remember: a stack takes ~ the area of a single DRAM die, and leading-edge DDR DIMMs have up to 128 die — so a TB-scale DIMM-like module is possible.
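Back-of-envelope for that "possible TB DIMM-like module" note, using today's 8 GB stack capacity and the 128-die count of a leading-edge DIMM; the module packaging itself is hypothetical:

```python
gb_per_stack = 8               # current HMC capacity from the spec slide above
stack_sites_per_module = 128   # die count of a leading-edge DDR DIMM, one stack per site (assumed)
print(f"{gb_per_stack * stack_sites_per_module} GB per module")  # 1024 GB ~= 1 TB
```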
Bandwidth Estimates
[Chart: GB/s, 0 to 1,200, vs. year 2011-2025; series: Peak Memory-to-Logic, Peak I/O (Full Duplex), Peak DDRx B/W (GB/s)]
Stack Power Projections
[Chart: Estimated Max Stack Power (Watts), ~0.1 to 100, vs. year 2014-2026; series: 16-High, 8-High, and 4-High stacks, broken down into Base Die (I/O, Router, Memory Controller) and DRAM Die contributions]
SNL Xcaliber Architecture
[Figures: (a) X-caliber node architecture, (b) X-caliber node mockup, (c) X-caliber stack notional architecture]
Thought Experiment: Extrapolate from HMC to "CMC"
• Add a simple core + atomics per vault
• Ditch the processors and NICs and use only stacks
• Package multiple stacks per "DIMM"
[Chart: pJ/flop (Rpeak), 10 to 100, vs. year 2014-2026 for 16-, 8-, and 4-high stacks; the 20 pJ/flop goal is marked]
[Chart: Estimated Max Stack Power with 1 Core per Vault (Watts), 1 to 100, vs. year 2014-2026 for 16-, 8-, and 4-high stacks, plus total cores]
This assumes all resources running at 100%. If off-chip and memory bandwidth utilization is < 100%, we are almost there!
Possible 3D Architectures
• Advantages
– Already has significant internal resiliency
– Huge bandwidth to logic chip
– Huge off-stack bandwidth with better protocol
– Very much in line with lightweight architectures
• Future Possibilities
– Direct mount multiple stacks to same substrate
– Lightweight cores + atomics natural per vault
– Heterogeneous memory die in stack to provide on die checkpointing, storage, etc.
– Multitude of novel resiliency options
– Augment protocol to support active messages
• Study of a stack-based Big Data solution: Huge increase in performance & performance/unit volume
Overall Summary
• 2004: Turning point in HPC architecture
• Heavyweight systems alone are going flat
• Hybrid more energy efficient for flops – but there are memory issues
• Lightweight more scalable for both flops and communication
• No escape from massive parallelism
• New algorithms will blend multiple metrics
• 3D stacks an interesting alternative