14 memory interconnect

Upload: madhu-yalaka

Post on 05-Apr-2018


  • 7/31/2019 14 Memory Interconnect

    1/25

    Memory-Interconnect Design

    Bharadwaj Amrutur


    AMD x86-64

    32nm process with high-k metal gate

    35 million xtors, 3GHz, 2W to 25W

    8T memory cell (as opposed to 6T cell)

    Read followed by write in the same cycle for L1 D$

    Shallow bitlines: 8 cells/line

    [Jotwani et al., ISSCC 2010]


    Sun 16-Core SPARC

    TSMC 40 nm, 11 Cu levels, 1B xtors

    8 threads x 16 cores

    4-way glueless connection of 4 chips

    Unified 6MB L2 (each L2 bank is 384KB)

    Crossbar: 461GB/s

    [Shin et al., ISSCC 2010]


    IBM Eight-core POWER7

    45nm CMOS SOI process, 1.2B xtors

    11-layer Cu with low-k

    32MB embedded DRAM for L3$

    L2: 8-way, 256KB per core

    L1: 32KB each for I$ and D$, 2-cycle access time

    GPR: 4R, 4W, 112-entry, 64+8 register file

    8MB eDRAM per L2 in each core, 8-way

    Small SRAM directory (probably to select the way)

    25-cycle load-to-use latency

    16B/cycle bandwidth to/from L2

    [Wendel et al., ISSCC 2010]


    IBM wirespeed 16-core processor

    16 cores, 4 threads per core

    45nm SOI

    H/W accelerators

    Shared bus also used for power management

    eDRAM L2 cache: 4 x 292Kb x 1200 blocks x 16

    3x SRAM density, 1/5th SRAM power

    Dynamic voltage scaling: 0.7V and higher

    Can hook up 4 of these chips to scale up to 64 cores

    65W at 2.0GHz

    [Johnson et al., ISSCC 2010]


    Intel 48-core in 45nm

    45nm CMOS, 1.3B xtors

    48 IA-32 cores, 256KB L2, 2 per tile

    6x4 mesh with a router in each tile

    L1: 16KB each for I$ and D$

    L2: unified, 4-way associative, write-back, 10-cycle hit, SECDED

    64-entry TLB + 256-entry LUT extension

    16KB message passing buffer to support MPI and OpenMP

    5-port virtual cut-through router, 16B flits

    [Howard et al., ISSCC 2010]


    View from the processor

    [Figure: Processor and Memory blocks connected by the signals Clk, MemOp, Address, WriteData and ReadData]

    Memory Operations (MemOp)

    DLX: Load, Store

    Other RISC processors: Prefetch, Load/Store coprocessor, Cache Flush, Synchronization

    Address is 32 bits or 64 bits (modern processors)

    Data bus width is 64 bits (accesses can be in bytes, 32 bits, 64 bits)
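The processor-side interface above can be pictured with a small behavioural model. This is only a sketch: the `Memory` class, the `access` method, the little-endian byte order and the alignment rule are assumptions made for illustration, not details given on the slide.

```python
from enum import Enum


class MemOp(Enum):
    """The two DLX memory operations named on the slide."""
    LOAD = "load"
    STORE = "store"


class Memory:
    """Byte-addressable memory behind a 64-bit data bus.

    Little-endian byte order and naturally aligned accesses are assumptions.
    """

    def __init__(self, size_bytes):
        self.data = bytearray(size_bytes)

    def access(self, op, address, width=8, write_data=0):
        """Perform one MemOp; width is 1, 4 or 8 bytes (byte, 32 bits, 64 bits)."""
        if width not in (1, 4, 8):
            raise ValueError("width must be 1, 4 or 8 bytes")
        if address < 0 or address % width != 0 or address + width > len(self.data):
            raise ValueError("unaligned or out-of-range address")
        if op is MemOp.STORE:
            # WriteData is driven by the processor; nothing comes back.
            self.data[address:address + width] = write_data.to_bytes(width, "little")
            return None
        # ReadData is returned to the processor.
        return int.from_bytes(self.data[address:address + width], "little")
```

For instance, storing a 64-bit word and then loading a 32-bit half or a single byte from it exercises the three access widths the slide mentions.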


    The Gap

    [Figure: Processor-Memory Performance Gap (grows 50% / year). Performance on a log scale from 1 to 1000 versus year, 1980 to 2000. CPU ("Moore's Law"): 60%/yr. DRAM ("Less' Law?"): 7%/yr.]

    From Kubiatowicz/UCB
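The "grows 50% / year" figure follows from compounding the two growth rates on the plot. The short sketch below works through that arithmetic; the function name `gap_after` is hypothetical, and the rates are simply the 60%/yr and 7%/yr values read off the figure.

```python
PROC_GROWTH = 0.60  # processor performance growth per year (from the figure)
DRAM_GROWTH = 0.07  # DRAM performance growth per year (from the figure)


def gap_after(years):
    """Ratio of processor to DRAM performance after `years`, starting from parity."""
    yearly_ratio = (1 + PROC_GROWTH) / (1 + DRAM_GROWTH)
    return yearly_ratio ** years


if __name__ == "__main__":
    # One year of compounding gives roughly 1.5x, i.e. the gap grows about 50% per year.
    print(f"gap after 1 year:   {gap_after(1):.2f}x")
    print(f"gap after 20 years: {gap_after(20):.0f}x")
```

Under these two rates the ratio compounds to a factor of a few thousand over the twenty years shown on the x-axis, which is why the curves diverge so visibly on a log scale.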


    Closing the gap

    Use fast RAMs close to the processor

    Caches

    Take up ~90% of the transistors in the processor chip!

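    [Figure: memory hierarchy pyramid. From the processor outward: Processor Registers, L1 $, L2 $, Main Memory (DRAM), Disk. Levels get bigger away from the processor and faster toward it.]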


    Memory Hierarchy Characteristics

    Level                Size             Latency, bandwidth                Integration
    Regs                 16-128 64-bit    cycle latency, ~1000 Gb/s         Chip
    L1 $                 4KB-32KB         1 cycle latency, ~400 Gb/s        Chip
    L2 $                 1MB-8MB          5-10 cycles latency, ~200 Gb/s    Chip/Package
    Main Memory (DRAM)   1GB-64GB         40-100 cycles latency, ~50 Gb/s   PCB
    Disk                 80GB-few TB      1000s of cycles, ~1 Gb/s          Box


    Memory Hierarchy

    Exercise: Find Power/Mbps/bit for each layer of the memory hierarchy

    Plot Power/Mbps versus Bit as well as Bit^0.5

    Which is better?


    Register File

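    [Figure: register file array. A ReadDecoder driven by ReadAddress selects read wordlines RWL0 to RWL31; a WriteDecoder driven by WriteAddress selects write wordlines WWL0 to WWL31. The read data lines are R0 to R63 and the write data lines are W0 to W63.]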


    Register File

    Can add more ports

    Each port needs one switch and bitline per cell, and one decoder

    Wire dominated

    Register file cell can be 10x bigger than the SRAM cell used in L1/L2 caches

    Hence small in size

    Register files are explicitly visible to the processor

    Unlike caches

    Access latency can be a fraction of a clock cycle, to allow for reading and execution (or execution and writeback) in the same cycle

    Easy to scale up word width (64/128/256/512)

    Power cost
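As a rough illustration of what "ports" mean at the behavioural level, the sketch below models a register file with a configurable number of read and write ports. The 32 x 64-bit size mirrors the RWL0 to RWL31 and R0 to R63 labels in the figure; the class name, the default of two read ports and one write port, and the read-before-write ordering within a cycle are assumptions for illustration only.

```python
class RegisterFile:
    """Behavioural model of a multi-ported register file."""

    def __init__(self, num_regs=32, width=64, read_ports=2, write_ports=1):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1
        self.read_ports = read_ports
        self.write_ports = write_ports

    def cycle(self, read_addrs, writes):
        """Simulate one clock cycle.

        read_addrs: register numbers to read, at most one per read port.
        writes: (register number, value) pairs, at most one per write port.
        Reads return the values held before this cycle's writes (an assumption).
        """
        if len(read_addrs) > self.read_ports:
            raise ValueError("not enough read ports")
        if len(writes) > self.write_ports:
            raise ValueError("not enough write ports")
        values = [self.regs[addr] for addr in read_addrs]
        for addr, value in writes:
            self.regs[addr] = value & self.mask
        return values
```

Each extra entry allowed in `read_addrs` or `writes` corresponds in hardware to another decoder plus another switch and bitline in every cell, which is where the area and power cost noted above comes from.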


    Cache concept

    Small, fast storage to exploit spatial and temporal locality

    Found in other places: File caches, Name Caches etc.

    H/w managed: Programming is easy

    Consider the memory as a sequence of lines

    Also known as blocks

    The line can contain multiple bytes.

    Cache allows storage of a subset of the lines from main memory

    Cache is first searched to satisfy the memory access request.

    A hit will return fast. A miss will incur a penalty.

    [Figure: Main Memory drawn as lines 0 to 15 and a Cache with lines 0 to 3. Main memory lines are temporarily stored in the Cache.]


    Average Memory Access Time

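    Program Execution Time is given as:

    $$\mathrm{CPUtime} = IC \times \left( \frac{\mathrm{ALUops}}{\mathrm{Instr}} \times CPI_{\mathrm{ALUops}} + \frac{\mathrm{MemAccess}}{\mathrm{Instr}} \times AMAT \right) \times \mathrm{Cycletime}$$

    Average Memory Access Time (AMAT) is given as:

    $$AMAT = \mathrm{HitTime} + \mathrm{MissRate} \times \mathrm{MissPenalty}$$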

    HitTime and MissPenalty are in number of clock cycles

    IC is the Instruction Count of the program

    To reduce AMAT, reduce HitTime, MissRate and MissPenalty

    HitTime is usually the lowest possible: 1 cycle

    MissPenalty is a function of the upper levels of the memory hierarchy (those farther from the processor)

    MissRate is a function of Cache Size and Associativity, which also impact Cycletime: hence an optimization problem
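The two formulas on this slide translate directly into code. The sketch below is a minimal version; the function and parameter names are chosen here for illustration, and the numbers in the example (5% miss rate, 40-cycle penalty, 0.3 memory accesses per instruction, 1 ns cycle) are made-up inputs, not measurements from any of the processors above.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in clock cycles."""
    return hit_time + miss_rate * miss_penalty


def cpu_time(ic, alu_ops_per_instr, cpi_alu_ops, mem_access_per_instr,
             amat_cycles, cycle_time):
    """Program execution time, in the same time unit as cycle_time."""
    cycles_per_instr = (alu_ops_per_instr * cpi_alu_ops
                        + mem_access_per_instr * amat_cycles)
    return ic * cycles_per_instr * cycle_time


if __name__ == "__main__":
    # Hypothetical inputs: 1-cycle hit, 5% miss rate, 40-cycle miss penalty.
    a = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)
    t = cpu_time(ic=1_000_000, alu_ops_per_instr=1.0, cpi_alu_ops=1.0,
                 mem_access_per_instr=0.3, amat_cycles=a, cycle_time=1e-9)
    print(f"AMAT = {a:.2f} cycles, CPU time = {t * 1e3:.2f} ms")
```

Varying `miss_rate` and `cycle_time` together in such a sketch is one way to see the optimization problem: a larger or more associative cache lowers the first but may raise the second.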


    Exercise

    Write the corresponding equation for the energy consumed by a program


    Cache issues

    Where should a line be placed in the cache?

    How is a line searched for in the cache?

    Which line should be replaced on a cache miss?

    What to do on a write?


    Direct Mapped: Placement

    Direct Mapped

    [Figure: Main Memory lines 0 to 15 and a Cache with lines 0 to 3, with the cache line number labelled as the index]

    The main memory blocks which map to specific cache blocks are:

    Cache line 0: 0, 4, 8, 12

    Cache line 1: 1, 5, 9, 13

    Cache line 2: 2, 6, 10, 14

    Cache line 3: 3, 7, 11, 15

    The formula is:

    index = lineAddress mod cacheSize

    (cacheSize is in lines)
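The placement formula can be checked against the figure with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the 16-line memory with a 4-line cache is the example drawn on the slide.

```python
def direct_mapped_index(line_address, cache_size_lines):
    """index = lineAddress mod cacheSize, with cacheSize counted in lines."""
    return line_address % cache_size_lines


def lines_per_cache_line(num_memory_lines, cache_size_lines):
    """Group main memory lines by the cache line they map to."""
    mapping = {index: [] for index in range(cache_size_lines)}
    for line in range(num_memory_lines):
        mapping[direct_mapped_index(line, cache_size_lines)].append(line)
    return mapping


if __name__ == "__main__":
    for index, lines in lines_per_cache_line(16, 4).items():
        print(f"cache line {index}: {lines}")
```

When cacheSize is a power of two, the mod reduces to taking the low-order bits of the line address, which is why hardware can form the index with no arithmetic at all.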


    Direct Mapped: Search

    Direct Mapped

    [Figure: Main Memory lines 0 to 15 map onto Cache lines 0 to 3 (line 0: 0, 4, 8, 12; line 1: 1, 5, 9, 13; line 2: 2, 6, 10, 14; line 3: 3, 7, 11, 15). Each cache line holds a Cache Tag alongside its Cache Data.]

    The address, bits 31 down to 0, is split into the fields: Tag | CacheIndex | ByteSel


    Direct Mapped: Search

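    [Figure: the address (bits 31 to 0) is split into Tag | CacheIndex | ByteSel. CacheIndex feeds a Decoder that selects one entry of the Tag and Data arrays; a comparator (=) checks the stored tag against the address Tag to produce Hit/Miss.]

    What is missing?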


    Direct Mapped: Search

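    [Figure: the address (bits 31 to 0) is split into Tag | CacheIndex | ByteSel. CacheIndex feeds a Decoder that selects one entry of the Valid, Tag and Data arrays; a comparator (=) checks the stored tag against the address Tag to produce Hit/Miss.]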
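A software model of this lookup makes the role of the Valid bit concrete. The sketch below is an assumption-laden illustration, not the circuit itself: the 16-byte line size, the 4-line cache, and the names `split_address`, `Entry` and `DirectMappedCache` are all chosen here for the example.

```python
from dataclasses import dataclass

LINE_BYTES = 16  # bytes per cache line (assumed for illustration)
NUM_LINES = 4    # cache size in lines, as in the placement figure


@dataclass
class Entry:
    valid: bool = False
    tag: int = 0
    data: bytes = bytes(LINE_BYTES)


def split_address(addr):
    """Split an address into (Tag, CacheIndex, ByteSel)."""
    byte_sel = addr % LINE_BYTES
    line_address = addr // LINE_BYTES
    index = line_address % NUM_LINES
    tag = line_address // NUM_LINES
    return tag, index, byte_sel


class DirectMappedCache:
    def __init__(self):
        self.entries = [Entry() for _ in range(NUM_LINES)]

    def lookup(self, addr):
        """Return (hit, byte): the tag compare only counts if the entry is valid."""
        tag, index, byte_sel = split_address(addr)
        entry = self.entries[index]            # the decoder selects one entry
        hit = entry.valid and entry.tag == tag  # the comparator, gated by Valid
        return hit, (entry.data[byte_sel] if hit else None)

    def fill(self, addr, line_data):
        """Install a whole line fetched from the next level after a miss."""
        tag, index, _ = split_address(addr)
        self.entries[index] = Entry(True, tag, line_data)
```

Without the `valid` check, a freshly reset cache whose tag array happens to hold zeros would report a hit for address 0, which is the failure the Valid bit prevents.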


    Block Placement: 2 way Associative

    [Figure: Main Memory lines 0 to 15 and a 4-line Cache organised as two sets, Set 0 (setIndex 0) and Set 1 (setIndex 1), with two locations each]

    The main memory lines which map to specific cache blocks are:

    Set 0: 0, 2, 4, 6, 8, 10, 12, 14

    Set 1: 1, 3, 5, 7, 9, 11, 13, 15

    The formula is:

    setIndex = lineAddress mod (cacheSize / Associativity)

    Within each set, the lines can be in either of the locations

    Note: Other formulae for mapping into cache/set index are possible
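The set-index formula generalises the direct-mapped one, and a short sketch shows the two extremes as special cases. The function name `set_index` and the divisibility check are assumptions added for the example; the 4-line cache and 16-line memory are the ones drawn on the slide.

```python
def set_index(line_address, cache_size_lines, associativity):
    """setIndex = lineAddress mod (cacheSize / Associativity)."""
    if cache_size_lines % associativity != 0:
        raise ValueError("cache size must be a multiple of the associativity")
    num_sets = cache_size_lines // associativity
    return line_address % num_sets


if __name__ == "__main__":
    for ways in (1, 2, 4):  # direct mapped, 2-way, fully associative
        sets = {}
        for line in range(16):
            sets.setdefault(set_index(line, 4, ways), []).append(line)
        print(f"{ways}-way: {sets}")
```

With an associativity of 1 every set holds one line, which is the direct-mapped case; with an associativity equal to the cache size there is a single set, which is the fully associative case.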


    2 Way Associative: Search

    [Figure: the address (bits 31 to 0) is split into Tag | CacheIndex | ByteSel. Two identical arrays, each with a Decoder, Valid, Tag and Data columns, a tag comparator (=) and a Tristate Driver on the data output. The comparators produce Hit/Miss_Set0 and Hit/Miss_Set1.]

    Exercises:

    a) Complete the wiring

    b) How do you generate the final Hit/Miss signal?

    c) Extend the design to a Fully Associative Cache

    d) What happens to MissRate with associativity?

    e) What happens to MissRate with size?

    f) What happens to cycle time with Associativity and Size?
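One possible software reading of this figure is sketched below, for readers who want something to experiment with while working the exercises. It is not the intended wiring: the class name, the 16-byte lines, the two sets, and in particular the round-robin replacement policy are assumptions made for illustration, and the final hit is taken here as the OR of the per-way hits.

```python
class TwoWayCache:
    """Behavioural 2-way set-associative cache (replacement policy assumed)."""

    WAYS = 2

    def __init__(self, num_sets=2, line_bytes=16):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        # ways[w][s] is the (valid, tag, data) entry of way w in set s.
        self.ways = [[(False, 0, None)] * num_sets for _ in range(self.WAYS)]
        self.victim = [0] * num_sets  # next way to replace in each set

    def split(self, addr):
        """Split an address into (tag, set index, byte select)."""
        byte_sel = addr % self.line_bytes
        line_address = addr // self.line_bytes
        return line_address // self.num_sets, line_address % self.num_sets, byte_sel

    def lookup(self, addr):
        """Compare the tag in both ways; the hitting way drives the data out."""
        tag, index, byte_sel = self.split(addr)
        way_hits = []
        for way in self.ways:
            valid, stored_tag, _ = way[index]
            way_hits.append(valid and stored_tag == tag)
        hit = any(way_hits)
        data = None
        for w, way_hit in enumerate(way_hits):
            if way_hit:  # plays the role of the tristate driver enable
                data = self.ways[w][index][2][byte_sel]
        return hit, data

    def fill(self, addr, line_data):
        """Install a line in the set, alternating between the two ways."""
        tag, index, _ = self.split(addr)
        w = self.victim[index]
        self.ways[w][index] = (True, tag, line_data)
        self.victim[index] = 1 - w
```

Lines 0 and 4 of main memory would evict each other in the 4-line direct-mapped cache, but in this model both map to set 0 and can be resident at once; a third conflicting line then forces a replacement.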