14 memory interconnect

7/31/2019 14 Memory Interconnect

1/25

Memory-Interconnect Design

Bharadwaj Amrutur


2/25

AMD x86-64

32nm process with high-k metal gate
35 million xtors, 3GHz, 2W to 25W
8T memory cell (as opposed to 6T cell)
Read followed by write in same cycle for L1 D$
Shallow bitlines: 8 cells/line

[Jotwani et al., ISSCC 2010]


3/25

Sun 16-Core SPARC

TSMC 40nm, 11 Cu levels, 1B xtors
8 threads x 16 cores
4-way glueless connection of 4 chips
Unified 6MB L2; each L2 is 384KB (6MB / 16)
Crossbar: 461GB/s

[Shin et al., ISSCC 2010]


4/25

IBM Eight-core POWER7

45nm CMOS SOI process, 1.2B xtors
11 layer Cu with low-k
32MB embedded DRAM for L3$
L2: 8-way, 256KB L2 per core
L1: 32KB each for I$ and D$, 2 cycle access time
GPR: 4R, 4W, 112 entry, 64+8 register file

8MB eDRAM per L2 in each core, 8-way
Small SRAM directory (probably to select the way)
25 cycle load-to-use latency
16B/cycle bandwidth to and from L2

[Wendel et al., ISSCC 2010]


5/25

IBM wirespeed 16-core processor

16 cores, 4 threads per core
45nm SOI
H/W accelerators
Shared bus also used for power management
EDRAM L2 cache: 4 x 292Kb x 1200 blocks x 16
3x SRAM density, 1/5th SRAM power
Dynamic voltage scaling: 0.7V and higher
Can hook up 4 of these chips to scale up to 64 cores
65W at 2.0GHz

[Johnson et al., ISSCC 2010]


6/25

Intel 48-core in 45nm

45nm CMOS, 1.3B xtors
48 IA-32 cores, 256KB L2, 2 per tile
6x4 mesh with router in each tile
L1: 16KB each for I$ and D$
L2: unified, 4-way associative, write back, 10 cycle hit, SECDED
64-entry TLB + 256 entry LUT extension
16KB message passing buffer to support MPI and OpenMP
5-port virtual cut-through router, 16B flits

[Howard et al., ISSCC 2010]



7/25

View from the processor

[Figure: Processor connected to Memory through the signals Clk, MemOp, Address, WriteData and ReadData]

Memory Operations (MemOp)
(DLX): Load, Store
(Other RISC processors): Prefetch, Load/Store coprocessor, Cache Flush, Synchronization

Address is 32 bits or 64 bits (modern processors)

Data bus width is 64 bits (accesses can be in bytes, 32 bits, 64 bits)



8/25

The Gap

[Figure: processor vs. DRAM performance, 1980 to 2000, on a log-scale Performance axis (1 to 1000). CPU curve ("Moore's Law"): Proc 60%/yr. DRAM curve ("Less' Law?"): DRAM 7%/yr. Processor-Memory Performance Gap: grows 50% / year.]

From Kubiatowicz/UCB
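The roughly 50%/yr gap growth follows from the two rates quoted on this slide (processor 60%/yr, DRAM 7%/yr): the ratio of the two performances is multiplied by 1.60/1.07 every year. A minimal sketch of that arithmetic, where the function name is illustrative and not from the slides:

```python
def gap_growth_per_year(proc_rate: float, dram_rate: float) -> float:
    """Yearly growth factor of the processor/DRAM performance ratio.

    Rates are fractional yearly improvements, e.g. 0.60 for 60%/yr.
    """
    return (1.0 + proc_rate) / (1.0 + dram_rate)


# Rates taken from the slide: processor 60%/yr, DRAM 7%/yr.
yearly = gap_growth_per_year(0.60, 0.07)
print(f"gap grows about {100 * (yearly - 1):.0f}% per year")

# Compounding the same factor over the 20 years covered by the plot.
print(f"ratio after 20 years: about {yearly ** 20:.0f}x")
```

Because the gap compounds, even a modest yearly difference in growth rates ends up as several orders of magnitude over two decades, which is what motivates the memory hierarchy on the following slides.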



9/25

Closing the gap

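Use fast, high-speed RAMs close to the processor

Caches

Take up ~90% of the transistors on the processor chip!

[Figure: memory hierarchy, from the processor outward: Processor Registers, L1 $, L2 $, Main Memory (DRAM), Disk. Levels get bigger away from the processor and faster toward it.]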



10/25

Memory Hierarchy Characteristics

Level                 Size             Latency, bandwidth
Regs                  16-128 64-bit    cycle latency, ~1000 Gb/s
L1 $                  4KB-32KB         1 cycle latency, ~400 Gb/s
L2 $                  1MB-8MB          5-10 cycles latency, ~200 Gb/s
Main Memory (DRAM)    1GB-64GB         40-100 cycles latency, ~50 Gb/s
Disk                  80GB-few TB      1000s of cycles, ~1 Gb/s

Integration (from the processor outward): Chip, Chip/Package, PCB, Box


11/25

Memory Hierarchy

Exercise

Find Power/Mbps/bit for each layer of the memory hierarchy
Plot Power/Mbps versus Bit as well as Bit^0.5
Which is better?



12/25

Register File

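[Figure: register file array with a ReadDecoder and a WriteDecoder; labels R0 to R63 and W0 to W63; read wordlines RWL0 to RWL31 and write wordlines WWL0 to WWL31; inputs are two ReadAddress ports and one Write Address port]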



13/25

Register File

Can add more ports
Each port needs one switch and bitline per cell, and one decoder
Wire dominated
Register file cell can be 10x bigger than an SRAM cell (used in L1/L2 cache), hence register files are small in size
Register files are explicitly visible to the processor, unlike caches
Access latency can be a fraction of a clock cycle, to allow for reading and execution (or execution and writeback) in the same cycle
Easy to scale up word width (64/128/256/512), at a power cost



14/25

Cache concept

Small, fast storage to exploit spatial and temporal locality
Found in other places: file caches, name caches etc.
H/w managed: programming is easy

Consider the memory as a sequence of lines (also known as blocks)
A line can contain multiple bytes.
Cache allows storage of a subset of the lines from main memory
Cache is first searched to satisfy the memory access request.
A hit will return fast. A miss will incur a penalty.

[Figure: Main Memory with lines 0 to 15 and a Cache with lines 0 to 3. Main memory lines are temporarily stored in the Cache.]



15/25

Average Memory Access Time

Program Execution Time is given as:

$$\text{CPUtime} = \text{IC} \times \left( \frac{\text{ALUops}}{\text{Instr}} \times \text{CPI}_{\text{ALUops}} + \frac{\text{MemAccess}}{\text{Instr}} \times \text{AMAT} \right) \times \text{CycleTime}$$

Average Memory Access Time (AMAT) is given as:

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

HitTime and MissPenalty are in number of clock cycles
IC is the Instruction Count in the program
To reduce AMAT, reduce HitTime, MissRate and MissPenalty
HitTime is usually the lowest possible, 1 cycle
MissPenalty is a function of the upper levels of the memory hierarchy
MissRate is a function of Cache Size & Associativity, which also impact CycleTime: hence an optimization problem
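The two formulas on this slide can be evaluated directly. The sketch below is a minimal illustration; the function names and the numbers in the example (5% miss rate, 40-cycle penalty, 30% memory instructions, 1 ns cycle) are assumptions chosen for the example and do not come from the slides.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in clock cycles:
    AMAT = HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty


def cpu_time(ic: int, alu_per_instr: float, cpi_alu: float,
             mem_per_instr: float, amat_cycles: float,
             cycle_time: float) -> float:
    """Program execution time in seconds:
    IC * (ALUops/Instr * CPI_ALUops + MemAccess/Instr * AMAT) * CycleTime."""
    cycles_per_instr = alu_per_instr * cpi_alu + mem_per_instr * amat_cycles
    return ic * cycles_per_instr * cycle_time


# Assumed example values (not from the slides).
access_cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)
seconds = cpu_time(ic=1_000_000, alu_per_instr=0.7, cpi_alu=1.0,
                   mem_per_instr=0.3, amat_cycles=access_cycles,
                   cycle_time=1e-9)
print(f"AMAT = {access_cycles:.2f} cycles, CPU time = {seconds * 1e3:.3f} ms")
```

Varying `miss_rate` while holding the other inputs fixed shows how strongly the memory term can dominate execution time even when only a minority of instructions access memory.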



16/25

Exercise

Write the corresponding equation for the energy

consumed by a program



17/25

Cache issues

Where should a line be placed in the cache?
How is a line searched for in the cache?
Which line should be replaced on a cache miss?
What to do on a write?



18/25

Block Placement: Direct Map

Direct Mapped

Main Memory lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache lines: 0 1 2 3

The main memory lines which map to specific cache lines are:

Cache line 0: 0, 4, 8, 12
Cache line 1: 1, 5, 9, 13
Cache line 2: 2, 6, 10, 14
Cache line 3: 3, 7, 11, 15

The formula is:



19/25

Direct Mapped: Placement

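Direct Mapped

Main Memory blocks: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache blocks (selected by index): 0 1 2 3

The main memory blocks which map to specific cache blocks are:

Cache block 0: 0, 4, 8, 12
Cache block 1: 1, 5, 9, 13
Cache block 2: 2, 6, 10, 14
Cache block 3: 3, 7, 11, 15

The formula is:

$$\text{index} = \text{lineAddress} \bmod \text{cacheSize}$$

(cacheSize is in lines)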
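The placement formula can be checked against the 16-line memory and 4-line cache used on this slide. This is a minimal sketch; `direct_mapped_index` and the constant names are illustrative, not from the slides.

```python
def direct_mapped_index(line_address: int, cache_size_lines: int) -> int:
    """index = lineAddress mod cacheSize, with cacheSize in lines."""
    return line_address % cache_size_lines


CACHE_LINES = 4    # cache size from the slide's figure
MEMORY_LINES = 16  # main memory size from the slide's figure

# Group every main memory line under the cache line it maps to.
mapping = {index: [] for index in range(CACHE_LINES)}
for line in range(MEMORY_LINES):
    mapping[direct_mapped_index(line, CACHE_LINES)].append(line)

for index, lines in mapping.items():
    print(f"cache line {index}: memory lines {lines}")
```

When the cache size is a power of two, the modulo reduces to taking the low-order bits of the line address, which is why the index can be wired straight out of the address in hardware.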

Direct Mapped: Search


20/25


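Direct Mapped

Main Memory lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache lines: 0 1 2 3, each holding Cache Data and a Cache Tag

The main memory lines which map to specific cache lines are:

Cache line 0: 0, 4, 8, 12
Cache line 1: 1, 5, 9, 13
Cache line 2: 2, 6, 10, 14
Cache line 3: 3, 7, 11, 15

Address fields (bit 31 down to bit 0): Tag | CacheIndex | ByteSel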
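The Tag, CacheIndex and ByteSel fields are plain bit slices of the 32-bit address. The slide does not give the field widths, so the sketch below assumes 32-byte lines (5 ByteSel bits) and 128 cache lines (7 CacheIndex bits); those widths and the function name are assumptions for illustration only.

```python
def split_address(addr: int, byte_sel_bits: int = 5, index_bits: int = 7):
    """Split an address into (tag, cache_index, byte_sel).

    The default widths are assumed: 32-byte lines and 128 cache lines.
    """
    byte_sel = addr & ((1 << byte_sel_bits) - 1)                    # low bits
    cache_index = (addr >> byte_sel_bits) & ((1 << index_bits) - 1)  # middle bits
    tag = addr >> (byte_sel_bits + index_bits)                      # remaining high bits
    return tag, cache_index, byte_sel


tag, cache_index, byte_sel = split_address(0x12345678)
print(f"tag={tag:#x} index={cache_index} byte_sel={byte_sel}")

# Reassembling the three fields gives back the original address.
rebuilt = (tag << 12) | (cache_index << 5) | byte_sel
print(f"rebuilt={rebuilt:#x}")
```

The tag is what has to be stored next to each cache line, since the index and byte select are implied by where the data sits.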



21/25


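[Figure: direct-mapped lookup. The address (bit 31 down to bit 0) is split into Tag | CacheIndex | ByteSel. A Decoder selects one entry of the Tag and Data arrays, and a comparator (=) on the tag produces Hit/Miss.]

What is missing?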



22/25


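[Figure: direct-mapped lookup with a Valid bit. The address (bit 31 down to bit 0) is split into Tag | CacheIndex | ByteSel. A Decoder selects one entry of the Valid, Tag and Data arrays, and a comparator (=) on the tag produces Hit/Miss.]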
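The role of the Valid bit shows up in a small behavioural model of the lookup. This is a sketch only: the class name, the use of line addresses rather than byte addresses, and the access sequence are assumptions made for the example.

```python
class DirectMappedCache:
    """Behavioural model of a direct-mapped tag array with valid bits."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.valid = [False] * num_lines  # cleared at power-up
        self.tags = [0] * num_lines       # contents are meaningless until valid

    def access(self, line_address: int) -> bool:
        """Return True on a hit; on a miss, fill the line and return False."""
        index = line_address % self.num_lines
        tag = line_address // self.num_lines
        # A hit needs a matching tag AND a set valid bit.
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(num_lines=4)
# Lines 0 and 4 share cache line 0, so they keep evicting each other.
results = [cache.access(line) for line in [0, 4, 0, 1, 1]]
print(results)
```

Without the valid bit, the very first access to line 0 would compare against a stored tag that happens to be 0 and report a false hit on data that was never loaded.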



23/25

Block Placement: 2 way Associative

Main Memory lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache: Set 0 and Set 1 (selected by setIndex), each with locations 0 and 1

The main memory lines which map to specific cache lines are:

Set 0: 0, 2, 4, 6, 8, 10, 12, 14
Set 1: 1, 3, 5, 7, 9, 11, 13, 15

The formula is:

setIndex =

Within each set, the blocks can be in either of the locations



24/25

Block Placement: 2 way Associative

Main Memory lines: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache: Set 0 and Set 1 (selected by setIndex), each with locations 0 and 1

The main memory lines which map to specific cache blocks are:

Set 0: 0, 2, 4, 6, 8, 10, 12, 14
Set 1: 1, 3, 5, 7, 9, 11, 13, 15

The formula is:

$$\text{setIndex} = \text{lineAddress} \bmod \left( \frac{\text{cacheSize}}{\text{Associativity}} \right)$$

Within each set, the lines can be in either of the locations

Note: Other formulae for mapping into cache/set index are possible
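The set-index formula can be checked against the same 4-line cache, now organised as 2 sets of 2 locations. A minimal sketch, with function and constant names chosen for illustration:

```python
def set_index(line_address: int, cache_size_lines: int, associativity: int) -> int:
    """setIndex = lineAddress mod (cacheSize / Associativity)."""
    num_sets = cache_size_lines // associativity
    return line_address % num_sets


CACHE_LINES = 4     # cache size from the slide's figure
ASSOCIATIVITY = 2   # 2-way set associative
MEMORY_LINES = 16

num_sets = CACHE_LINES // ASSOCIATIVITY
sets = {s: [] for s in range(num_sets)}
for line in range(MEMORY_LINES):
    sets[set_index(line, CACHE_LINES, ASSOCIATIVITY)].append(line)

for s, lines in sets.items():
    print(f"set {s}: memory lines {lines}")
```

Setting the associativity to 1 gives back the direct-mapped index, and setting it equal to the cache size leaves a single set, which is the fully associative case.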



25/25

2 Way Associative: Search

[Figure: 2-way associative lookup. The address (bit 31 down to bit 0) is split into Tag | CacheIndex | ByteSel. Each way has its own tag comparator (=), producing Hit/Miss_Set0 and Hit/Miss_Set1, and a Tristate Driver on its data output.]

Exercises:
a) Complete the wiring
b) How do you generate the final Hit/Miss signal?
c) Extend the design to a Fully Associative Cache
d) What happens to MissRate with associativity?
e) What happens to MissRate with size?
f) What happens to cycle time with Associativity and Size?
