the future of computer architecture

1This research was, in part, funded by the U.S. Government. The views and

conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the

U.S. Government.

The Future of Computer Architecture

Shekhar Borkar Intel Corp.

June 9, 2012

2

OutlineCompute performance goalsChallenges & solutions for: • Compute,

• Memory,

• Interconnect

Importance of resiliencyParadigm shiftSummary

3

Performance Roadmap

1.E-04

1.E-02

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1960 1970 1980 1990 2000 2010 2020

GFL

OP

Mega

Giga

Tera

Peta

Exa

12 Years 11 Years 10 Years

Client

Hand-held

4

From Giga to Exa, via Tera & Peta

1

10

100

1000

1986 1996 2006 2016

Rela

tive

Tr P

erfo

rman

ce

G

Tera

Peta

30X 250X

0.001

0.01

0.1

1

1986 1996 2006 2016

Rel

ativ

e En

ergy

/Op G

TeraPeta

5V

Vcc scaling

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1986 1996 2006 2016

G

Tera

Peta

36X

Exa

4,000XConcurrency

2.5M X

Transistor Performance

1.E+00

1.E+02

1.E+04

1.E+06

1.E+08

1986 1996 2006 2016

G

Tera

Peta

80X

Exa

4,000X

1M X

Power

5

Energy per Operation

45nm 32nm 22nm 14nm 10nm 7nm0

50

100

150

200

250

300

pJ/bit CompJ/bit DRAMDP RF OppJ/DP FP

Ener

gy (p

J)

FP Op

DRAM

Communication

Operands

75 pJ/bit

25 pJ/bit

10 pJ/bit

100 pJ/bit

6

Where is the Energy Consumed?

100pJ per FLOP100W

150W

100W

100W

2.5KW

Decode and controlAddress translations…Power supply lossesBloated with inefficient architectural features

~3KW

Compute

Memory

Com

Disk

10TB disk @ 1TB/disk @10W

0.1B/FLOP @ 1.5nJ per Byte

100pJ com per FLOP

5W2W

~5W~3W5W

Goal

~20W

7

Voltage Scaling

0

0.2

0.4

0.6

0.8

1

0.3 0.5 0.7 0.9

Vdd (Normal)

Nor

mal

ized

0

2

4

6

8

10

Freq

Total PowerLeakage

Energy Efficiency

When designed to voltage scale

8

Near Threshold-Voltage (NTV)

1

101

103

104

102

10-2

10-1

1

101

102

0.2 0.4 0.6 0.8 1.0 1.2 1.4Supply Voltage (V)

Max

imum

Fre

quen

cy (M

Hz)

Tota

l Pow

er (m

W)

320mV

65nm CMOS, 50°C

320mVSu

bthr

esho

ld R

egio

n

9.6X

65nm CMOS, 50°C

10-2

10-1

1

101

0

50

100

150

200

250

300

350

400

450

0.2 0.4 0.6 0.8 1.0 1.2 1.4Supply Voltage (V)

Ener

gy E

ffici

ency

(GO

PS/W

att)

Act

ive

Leak

age

Pow

er (m

W)

H. Kaul et al, 16.6: ISSCC08

4 O

rder

s

< 3

Ord

ers

NTV Across Technology Generations

9

0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4 0.6 0.8 1.0 1.2 1.4Supply Voltage (V)

Ener

gy E

ffici

ency

(TO

PS/W

)

10-3

10-2

10-1

1

10

Act

ive

Leak

age

Pow

er (m

W)

Reconfigurable Fabric, 32nm CMOS, 50 � C

340mV

0.8mW5.7x

Sub-

thre

shol

d R

egio

n

0.2 0.4 0.6 0.8 1.0 1.2En

ergy

Effi

cien

cy (G

OPS

/W)

Supply Voltage (V)

Leak

age

Pow

er (m

W)

103

102

10

10-2

1

10-3

103

102

10

1

10-1

Sub-

thre

shol

d Re

gion

22nm CMOS, 50°C

9x

9x

Register FilePermute Crossbar

0123456789

0.15 0.40 0.65 0.90 1.15 1.40Nor

mal

ized

Ene

rgy

Effic

ienc

y

Supply Voltage (V)

300mV

8X

45nm CMOS 50 � C

32b Multiply

16b SIMD Multiply

72b Add 1.1V

0.980.870.740.590.370.15VhiVlo

H. Kaul, et. al., ISSCC 2009

A. Agarwal, et. al., ISSCC 2010

S. K. Hsu, et. al., ISSCC 2012

NTV operation improves energy efficiency across 45nm-22nm CMOS

Impact of Variation on NTV

10

0%

10%

20%

30%

40%

50%

60%

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0

Freq

(Rel

ativ

e)

Vdd (Relative)

Frequency

Spread

+/- 5% Variation in Vdd or Vt

0

1

2

3

4

5

6

1.0 0.9 0.8 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

Circ

uit v

ulne

rabi

lity t

o 5%

noi

se

Vdd scaling towards threshold Threshold

5% variation in Vt or Vdd results in 20 to 50% variation in circuit performance

dd

tdd

VVVfrequency

)(

Mitigating Impact of Variation

11

1.Variation control with body biasingBody effect is substantially reduced in advanced technologiesEnergy cost of body biasing could become substantialFully-depleted transistors have no body left

f

f

f f

f

f

f

f

f

f

f f

Example: Many-core System

f

f

f f

f/2

f/2

f/2

f/2

f/4

f/4

f/4 f/4

Running all cores at full frequency exceeds energy budget

Run cores at the native frequencyLaw of large numbers—averaging

2. Variation tolerance at the system level

Subthreshold Leakage at NTV

12

45nm 32nm 22nm 14nm 10nm 7nm 5nm0%

10%

20%

30%

40%

50%

60%SD

Lea

kage

Pow

er

100% Vdd

75% Vdd

50% Vdd

40% Vdd Increasing Variations

NTV operation reduces total power, improves energy efficiencySubthreshold leakage power is substantial portion of the total

13

Experimental NTV Processor

IA-32 CoreLogic

Scan

RO ML1$-I L1$-D

Level Shifters + clk spine

1.1 mm

1.8

mm

Custom Interposer951 Pin FCBGA Package

Legacy Socket-7 Motherboard

Technology 32nm High-K Metal GateInterconnect 1 Poly, 9 Metal (Cu)Transistors 6 Million (Core)Core Area 2mm2

S. Jain, et al, “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012

14

Power and Performance915MHz

500MHz

100MHz

3MHz

737mW

174mW

17mW2mW

0

100

200

300

400

500

600

700

800

1

10

100

1000

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.20.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2

Total Power (m

W)Fr

eque

ncy

(MH

z)32nm CMOS, 25oC

Logic Vcc / Memory Vcc (V)

4%

33%

62%

1%

53%

27%

15%

5% 81%

11%3%

5%

Logic dynamicMemory leakage

Logic leakage

Subthreshold NTV Super-threshold

15

Observations

Sub-Vt NTV Full Vdd

0%

20%

40%

60%

80%

100%

Mem LkgMem DynLogic LkgLogic Dyn

Leakage power dominatesFine grain leakage power management is required

16

Memory & Storage Technologies

SRAM DRAM NAND, PCM Disk0.01

0.1

1

10

100

1000

10000

100000

Energy/bit(Pico Joule)

Capacity(G Bit)

Cost/bit(Pico $)

(Endurance issue)

17

Revise DRAM Architecture

Page Page Page

RASCAS

Activates many pagesLots of reads and writes (refresh)Small amount of read data is usedRequires small number of pins

Traditional DRAM New DRAM architecture

Page Page PageAddr

Addr

Activates few pagesRead and write (refresh) what is neededAll read data is usedRequires large number of IO’s (3D)

Energy cost today:~150 pJ/bit

Signaling

DRAM Array

M Control

18

Package

3D Integration of DRAM

Logic Buffer

Thin Logic and DRAM die

Through silicon vias

Power delivery through logic die

Energy efficient, high speed IO to logic buffer

Detailed interface signals to DRAMs created on the logic die

The most promising solution for energy efficient BW

19

Communication Energy

0.1 1 10 100 1000012345678

Interconnect Distance (cm)

Ener

gy/b

it (p

J)

On Die

Chip to chip

Board to Board

Between cabinets

On DieChip to chip

Board to Board

Between cabinets

20

On-Die Communication Power

IMEM + DMEM21%

10-port RF4%

Router + Links28%

Clock dist.11%

Dual FPMACs

36%

Global Clocking

1%Routers

& 2D-mesh10%

MC & DDR3-

80019%

Cores70%

21.7

2mm

12.64mmI/O Area

I/O AreaPLL

single tile1.5mm

2.0mm

TAP

21.7

2mm

12.64mmI/O Area

I/O AreaPLL

single tile1.5mm

2.0mm

TAP

8 X 10 Mesh 32 bit links320 GB/sec bisection BW @ 5 GHz

80 Core TFLOP Chip (2006)

VRC

21.4

mm

26.5mm

System Interface + I/O

DD

R3

MC

DD

R3

MC

DD

R3

MC

DD

R3

MC

PLL

TILE

TILE

JTAG

CC

CC

2 Core clusters in 6 X 4 Mesh (why not 6 x 8?)128 bit links256 GB/sec bisection BW @ 2 GHz

48 Core Single Chip Cloud (2009)

21

On-die Communication Energy

0.01

0.1

1

10

100

1000

0.5u 0.18u 65nm 22nm 7nm

Switc

h En

ergy

/bit

(pJ)

Measured

Extrapolated

Traditional Homogeneous Network

0.08 pJ/bit/switch0.04 pJ/mm (wire)Assume Byte/Flop

27 pJ/Flop on-die communication energy

1. Network power too high (27MW for EFLOP)2. Worse if link width scales up each generation3. Cache coherency mechanism is complex

22

Packet Switched Interconnect

STOP STOP5mm

3-5 Clk 3-5 Clk<1 Clk

0.5mW 0.5mW 0.5mW5 Clk 3-5 Clk

STOP STOP10mm

3-5 Clk2-3 Clk

0.5mW 1mW 0.5mW6-7 Clk

1. Router acts like STOP signs—adds latency2. Each hop consumes power (unnecessary)

23

Mesh—RetrospectiveBus: Good at board level, does not extend well• Transmission line issues: loss and signal integrity, limited frequency

• Width is limited by pins and board area

• Broadcast, simple to implement

Point to point busses: fast signaling over longer distance• Board level, between boards, and racks

• High frequency, narrow links

• 1D Ring, 2D Mesh and Torus to reduce latency

• Higher complexity and latency in each node

Hence, emergence of packet switched network

But, pt-to-pt packet switched network on a chip?

24

Interconnect Delay & Energy

100

1000

10000

100000

0 5 10 15 20 25

Length (mm)

Del

ay (p

s)

00.10.20.30.40.50.60.70.80.9

pJ/B

it

0.3u pitch, 0.5V

Wire Delay

Router Energy

Wire Energy

Repeated wire delayRouter Delay

25

Bus—The Other Extreme…Issues:Slow, < 300MHzShared, limited scalability?

Solutions:Repeaters to increase freqWide busses for bandwidthMultiple busses for scalability

Benefits:Power?Simpler cache coherency

Move away from frequency, embrace parallelism

26

Repeated Bus (Circuit Switched)

O R R

R R

R R

R

R

Arbitration:Each cycle for the next cycleDecision visible to all nodes

Repeaters:Align repeater directionNo driving contention

Core (mm)

Bus Seg Delay (ps)

Max Bus Freq (GHz)

65nm 5 195 2.245nm 3.5 99 232nm 2.5 51 1.822nm 1.8 26 1.516nm 1.3 13 1.2

Assume:10mm die,1.5u bus pitch50ps repeater delay

Anders et al, A 2.9Tb/s 8W 64-Core Circuit-switched Network-on-Chip in 45nm CMOS, ESSCIRC 2008Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008

27

A Circuit Switched Network

PClk

CClk

Packet-switched Request

Circuit-switchedAcknowledge

Circuit-switched Data Transmission

8x8 Circuit-switched NoC

CClk

Src Dest0 1 n

Routers

Routers2mm links

Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008

● Circuit-switched NoC eliminates intra-route data storage● Packet-switching used only for channel requests

Þ High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)

28

Hierarchical & Heterogeneous

Bus

C C

C C

Bus to connect over short distances

BusC C

C CBus

C C

C C

BusC C

C CBus

C C

C C2nd Level Bus

Hierarchy of BussesOr hierarchical circuit and packet switched networks

R R

R R

29

Tapered Interconnect Fabric

0

500

1000

1500

2000

2500

Cores Unit Chip System

BW

(a.u

.)

0

1

2

3

4

5

Byt

es/F

lop

(a.u

.)

Tapered, but over-provisioned bandwidthPay (energy) as you go (afford)

30

But wait, what about Optical?

Source: Hirotaka Tamure (Fujitsu), ISSCC 08 Workshop on HS Transceivers

65nm

Possible today

Hopeful

Optical:Pre-driverDriverVCSEL

TIALADeMUXCDRClk Buffer

31

Impact of Exploding Parallelism

100

150

200

250

300

350

400

450

65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm

Mill

ion

Cor

es/E

FLO

P

1x Vdd

0.7x Vdd

0.5x Vdd

Almost flat because Vdd close to Vt4X increase in the number of cores (Parallelism)

Increased communication and related energyIncreased HW, and unreliability

1. Strike a balance between Com & Computation2. Resiliency (Gradual, Intermittent, Permanent faults)

32

Road to Unreliability?From Peta to Exa Reliability Issues

1,000X parallelism More hardware for something to go wrong>1,000X intermittent faults due to soft errors

Aggressive Vcc scaling to reduce power/energy

Gradual faults due to increased variationsMore susceptible to Vcc droops (noise)More susceptible to dynamic temp variationsExacerbates intermittent faults—soft errors

Deeply scaled technologies

Aging related faultsLack of burn-in?Variability increases dramatically

Resiliency will be the corner-stone

Soft Errors and Reliability

33

0

0.2

0.4

0.6

0.8

1

0.5 1 1.5 2Voltage (V)

n-SE

R/c

ell (

sea-

leve

l)

65nm90nm130nm180nm250nm

Soft error/bit reduces each generationNominal impact of NTV on soft error rate

1

10

180 130 90 65 45 32Technology (nm)

Rel

ativ

e to

130

nm

Memory

Latch

Assuming 2X bit/latch count increase per

generation

Soft error at the system level will continue to increase

Positive impact of NTV on reliabilityLow V lower E fields, low power lower temperatureDevice aging effects will be less of a concernLower electromigration related defects

N. Seifert et al, "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006

34

ResiliencyFaults ExamplePermanent faults Stuck-at 0 & 1Gradual faults Variability

TemperatureIntermittent faults Soft errors

Voltage droopsAging faults Degradation

Faults cause errors (data & control)Datapath errors Detected by parity/ECCSilent data corruption Need HW hooksControl errors Control lost (Blue screen)

Minimal overhead for resiliency

Circuit & Design

Microarchitecture

Microcode, Platform

Programming system

Applications Error detectionFault isolationFault confinementReconfigurationRecovery & Adapt

System Software

35

Compute vs Data Movement

0.001 0.01 0.1 1 10 100 10001E+00

1E+01

1E+02

1E+03

1E+04

1E+05

1E+06

1E+07

Distance (meters)

Ener

gy (p

J)

Read a bit from internal SRAM

Computation Read a bit from DRAM

1 bit Xfer using Bluetooth1 bit Xfer over Ethernet

1 bit Xfer over WiFi1 bit Xfer over 3G

1nJ

1pJ

1mJ

Instruction Execution

Data movement energy will dominate

1 bit over linkR/W Operands (RF)

System Level Optimization

36

00.20.40.60.8

11.2

45 32 22 14 10 7

Rela

tive

Technology (nm)

Compute Energy

Interconnect Energy

6X

1.6X

Compute energy reduces faster than global interconnect energyFor constant throughput, NTV demands more parallelismIncreases data movement at the system levelSystem level optimization is required to determine NTV operating

point

ComputeGlobal Interconnect

Supply Voltage

Ener

gy

37

Architecture needs a Paradigm Shift

Must revisit and evaluate each (even legacy) architecture feature

Single thread performance FrequencyProgramming productivity Legacy, compatibility

Architecture features for productivity

Constraints (1) Cost(2) Reasonable Power/Energy

Throughput performance Parallelism, application specific HWPower/Energy Architecture features for energy

Simplicity

Constraints (1) Programming productivity(2) Cost

Architect’s past and present priorities—

Architect’s future priorities should be—

38

A non-architect’s biased view…Power, energy

VariationsInterconnects Soft-errors

ResiliencyBandwidth

39

Appealing the ArchitectsExploit transistor integration capacity

Dark Silicon is a self-fulfilling prophecy

Compute is cheap, reduce data movement

Use futuristic work-loads for evaluation

Fearlessly break the legacy shackles

40

SummaryPower & energy challenge continues

Opportunistically employ NTV operation

3D integration for DRAM

Hierarchical, heterogeneous, tapered interconnect

Resiliency spanning the entire stack

Does computer architecture have a future?• Yes, if you acknowledge issues & challenges, and embrace

the paradigm shift

• No, if you keep “head buried in the sand”!

the future of computer architecture

Documents

total power

energy efficient

energy budgetrun cores

tera peta energy

nm cmosimpact of variation

silicon vias power delivery

variation tolerance

6x65nm cmos