TRANSCRIPT
Lecture 13 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 13: Basic Caches
Winter 2019
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 13 Slide 2 EECS 470
Readings
For today:
❒ H&P 2.1
For Wednesday:
❒ H&P 2.2, 2.3, B.3
❒ N. Jouppi. Improving direct-mapped cache performance…
Lecture 13 Slide 3 EECS 470
Memory Systems: Basic Caches
Lecture 13 Slide 4 EECS 470
Memory Systems
Basic caches (start today)
❒ introduction
❒ fundamental questions
❒ cache size, block size, associativity
Advanced caches
Main memory
Virtual memory
Lecture 13 Slide 5 EECS 470
Motivation
Want memory to appear:
❒ as fast as the CPU
❒ as large as required by all of the running applications
[Figure: relative performance (log scale, 1 to 10000) vs. year, 1985-2010; the "Processor" curve climbs far faster than the "Memory" curve, opening the processor-memory gap.]
Lecture 13 Slide 6 EECS 470
Memory Hierarchy
Make the common case fast:
❒ common: temporal & spatial locality
❒ fast: smaller, more expensive memory
[Figure: hierarchy of Registers, Caches, Memory, Disk (MEMS?); levels get larger toward the bottom, faster toward the top.]
Lecture 13 Slide 7 EECS 470
Storage Hierarchies
Storage is layered into hierarchies in order of
❒ increasing latency (t_i): t_i < t_{i+1}
❒ increasing size (s_i): s_i < s_{i+1} ⇒ decreasing unit cost (c_i): c_i > c_{i+1}
❒ decreasing bandwidth (b_i): b_i > b_{i+1}
❒ increasing transfer unit (x_i): x_i < x_{i+1}
Level 0: Registers (an ISA feature)
Level 1: (n levels of) Caches
Level 1.5: NVRAM?
Level 2: Main Memory (Primary Storage)
Level 2.5: Flash?
Level 3: Disks (Secondary Storage)
Level 4: Tape Backup (Tertiary Storage)
(The levels below the registers are memory abstractions.)
Lecture 13 Slide 8 EECS 470
Processor/Memory Boundaries
[Diagram: the processor contains the register file (Regs), I-Unit and E-Unit, an L1 I-Cache with I-TLB and an L1 D-Cache with D-TLB, plus an on-chip L2 Cache (SRAM); an off-chip L3 Cache (SRAM) sits between the processor and Main Memory (DRAM).]
Lecture 13 Slide 9 EECS 470
Caches
An automatically managed hierarchy
"A hiding place, esp. of goods, treasure, etc." -- OED
Keep recently accessed blocks
❒ temporal locality
Break memory into blocks (several bytes) and transfer data to/from the cache in blocks
❒ spatial locality
A lot of architectures opt for software-managed scratch-pad memory instead, e.g. the Cray-1 and embedded processors. Why?
[Diagram: CPU <-> cache ($) <-> Memory]
Lecture 13 Slide 10 EECS 470
Cache (Abstractly)
Keep recently accessed blocks in "block frames":
❒ state (e.g., valid)
❒ address tag
❒ data
The address tag and state are bookkeeping overhead; storing multiple bytes of data per block frame amortizes that overhead.
Lecture 13 Slide 11 EECS 470
Cache (Abstractly)
On a memory read:
if the incoming address matches one of the stored address tags then
❍ HIT
❍ return data
else
❍ MISS
❍ choose & displace a block currently in use
❍ fetch the new (referenced) block from memory into the frame
❍ return data
Fundamental questions:
- Where and how to look for a block? (Block placement)
- Which block is replaced on a miss? (Block replacement)
- What happens on a write? (Write strategy -- later)
- What is kept? (Bookkeeping, data)
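The read handling above maps directly to code. Here is a minimal sketch, not the course's implementation: a fully associative cache modeled as a Python dict, with the displacement choice left deliberately naive (oldest-inserted block); real policies are discussed under block replacement.

```python
class TinyCache:
    """Fully associative cache: any block may occupy any frame."""

    def __init__(self, num_frames, memory):
        self.num_frames = num_frames
        self.memory = memory      # backing store: block address -> data
        self.frames = {}          # address tag -> data (insertion-ordered)

    def read(self, block_addr):
        # HIT: incoming address matches a stored address tag
        if block_addr in self.frames:
            return ("hit", self.frames[block_addr])
        # MISS: choose & displace a current block if all frames are in use
        if len(self.frames) >= self.num_frames:
            victim = next(iter(self.frames))   # naive: oldest-inserted block
            del self.frames[victim]
        # fetch the new (referenced) block from memory into a frame
        self.frames[block_addr] = self.memory[block_addr]
        return ("miss", self.frames[block_addr])
```

With 2 frames, reading blocks 0, 0, 1, 2 hits on the second access and displaces block 0 on the fourth.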
Lecture 13 Slide 12 EECS 470
Terminology
block (cache line) -- the minimum unit that may be present
hit -- the block is found in the cache
miss -- the block is not found in the cache
miss ratio -- the fraction of references that miss
hit time -- time to access the cache
miss penalty
❒ time to replace the block in the cache + deliver data to the upper level
❒ access time -- time to get the first word
❒ transfer time -- time for the remaining words
Lecture 13 Slide 13 EECS 470
Cache Performance
Assume:
❒ cache access time = 1 cycle
❒ cache miss ratio = 0.01
❒ cache miss penalty = 20 cycles
Mean access time = cache access time + miss ratio × miss penalty
                 = 1 + 0.01 × 20 = 1.2 cycles
Typically:
❒ level 1 is 16K-64K, level 2 is 512K-4M, memory is 128M-4G
❒ level 1 is as fast as the processor (increasingly 2 cycles)
❒ level 1 is 1/10000 the capacity but contains 98% of references
Memoization & amortization
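The arithmetic above generalizes to any hit time, miss ratio, and miss penalty; a one-line helper (name ours) makes the formula concrete:

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    """AMAT = cache access time + miss ratio * miss penalty (cycles)."""
    return hit_time + miss_ratio * miss_penalty

# The slide's numbers: 1-cycle hit, 1% miss ratio, 20-cycle penalty.
print(mean_access_time(1, 0.01, 20))   # 1.2 cycles
```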
Lecture 13 Slide 14 EECS 470
Fundamental Cache Parameters
Parameters that affect the miss rate:
Cache size (C)
Block size (b)
Cache associativity (a)
Lecture 13 Slide 15 EECS 470
Cache Size
Cache size is the total data capacity (not including tags)
❒ bigger caches can exploit temporal locality better
❒ but bigger is not ALWAYS better
Too large a cache:
❒ smaller is faster => bigger is slower
❒ access time may degrade the critical path
Too small a cache:
❒ doesn't exploit temporal locality well
❒ useful data is constantly replaced
[Plot: hit rate vs. cache size C, holding b and a constant; hit rate climbs until C reaches the "working set" size, then levels off.]
Lecture 13 Slide 16 EECS 470
Block Size
Block size is the amount of data that is
❒ associated with one address tag
❒ not necessarily the unit of transfer between hierarchy levels (sub-blocking)
Too-small blocks:
❒ don't exploit spatial locality well
❒ have inordinate tag overhead
Too-large blocks:
❒ useless data is transferred
❒ useful data is prematurely replaced -- too few total blocks
[Plot: hit rate vs. block size b, holding C and a constant.]
Lecture 13 Slide 17 EECS 470
Associativity
Fully associative: a block goes in any frame (think: all frames in 1 set)
Direct-mapped: a block goes in exactly one frame (think: 1 frame per set)
Set-associative: a block goes in any frame in exactly one set (frames grouped into sets)
Where does block 12 (0b1100) go?
[Figure: an 8-frame cache. Fully associative: any of frames 0-7. Direct-mapped: only frame 12 mod 8 = 4. 2-way set-associative (4 sets): either frame of set 12 mod 4 = 0.]
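The three placement rules differ only in how many frames make up a set. A small helper (ours, for illustration; frames are assumed numbered contiguously within each set) reproduces the block-12 example:

```python
def eligible_frames(block_addr, num_frames, associativity):
    """Return (set index, list of frames where the block may be placed)."""
    num_sets = num_frames // associativity
    set_idx = block_addr % num_sets
    first = set_idx * associativity
    return set_idx, list(range(first, first + associativity))

# Where does block 12 (0b1100) go in an 8-frame cache?
print(eligible_frames(12, 8, 8))  # fully associative: set 0, any of frames 0-7
print(eligible_frames(12, 8, 1))  # direct-mapped: set 4, frame 4 only
print(eligible_frames(12, 8, 2))  # 2-way: set 12 mod 4 = 0, frames 0 and 1
```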
Lecture 13 Slide 18 EECS 470
Impact of Associativity
Typical values for associativity:
❒ 1-, 2-, 4-, 8-way associative
Larger associativity:
❒ lower miss rate, less variation among programs
❒ only important for small "C/b"
Smaller associativity:
❒ lower cost, faster hit time
[Plot: hit rate vs. associativity a, holding C and b constant; the curve flattens beyond roughly a = 5.]
Lecture 13 Slide 19 EECS 470
Direct Mapped Caches
[Diagram: the address is split into tag | idx | b.o. (block offset). The index drives a decoder that selects one entry from the tag array and one frame from the data array; the stored tag is compared against the address tag ("= Tag match (hit?)"), and a multiplexor uses the block offset to select the requested word.]
Don't forget to check the valid/state bits.
Lecture 13 Slide 20 EECS 470
Fully Associative Cache
[Diagram: the address is split into tag | block offset only (no index). The address tag is compared against every stored tag in parallel (associative search); the matching frame's data feeds a multiplexor indexed by the block offset.]
Lecture 13 Slide 21 EECS 470
N-Way Set Associative Cache
[Diagram: the address is split into tag | idx | b.o. Each of the N ways (banks) has its own decoder, tag array, and data array; the index selects one set (one frame per way), all N stored tags are compared in parallel ("= Tag match"), and a multiplexor picks the data from the matching way.]
Cache Size = N × 2^(B+b), where B is the number of index bits and b the number of block-offset bits.
Lecture 13 Slide 22 EECS 470
Mark Hill's DM vs. SA: "Bigger & Dumber is Better"
t_avg = t_hit + miss ratio × t_miss
❒ compare DM and SA caches with the same t_miss
❒ but the associativity that minimizes t_avg is often smaller than the associativity that minimizes the miss ratio
Remember:
diff(t_cache) = t_cache(SA) - t_cache(DM) ≥ 0   (SA needs a slower clock)
diff(miss) = miss(SA) - miss(DM) ≤ 0            (DM misses more)
e.g., if diff(t_cache) = 0 => SA is better; but assuming diff(miss) = -1% and t_miss = 20,
⇒ if diff(t_cache) > 0.2 cycles then SA loses
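The break-even point falls out of plugging both designs into t_avg. A sketch with illustrative miss rates (6% for DM, 5% for SA, so diff(miss) = -1% as on the slide):

```python
def t_avg(t_hit, miss_ratio, t_miss):
    """Average access time = hit time + miss ratio * miss penalty."""
    return t_hit + miss_ratio * t_miss

T_MISS = 20   # cycles, as on the slide

dm      = t_avg(1.0, 0.06, T_MISS)   # direct-mapped: fast hit, more misses
sa_win  = t_avg(1.1, 0.05, T_MISS)   # diff(t_cache) = 0.1 < 0.2 cycles
sa_lose = t_avg(1.3, 0.05, T_MISS)   # diff(t_cache) = 0.3 > 0.2 cycles
```

SA beats DM only while its extra hit latency stays under diff(miss) × t_miss = 0.01 × 20 = 0.2 cycles, matching the slide's break-even point.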
Lecture 13 Slide 23 EECS 470
Associative Block Replacement
Which block in a set to replace on a miss?
Ideally -- Belady's algorithm: replace the block that "will" be accessed furthest in the future
❒ How do you implement it?
Approximations:
Least recently used -- LRU
❒ optimized for (assumes) temporal locality (expensive for more than 2-way)
Not most recently used -- NMRU
❒ track the MRU block, select randomly from the others; a good compromise
Random
❒ nearly as good as LRU, simpler (usually pseudo-random)
How much can the block replacement policy matter?
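LRU for a single set can be sketched with an ordered map: a hit moves the block to the MRU end, and eviction pops the LRU end. This is a simple software model (ours), not how hardware tracks recency:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with `ways` frames and true-LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, LRU first, MRU last

    def access(self, tag, fetch):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # hit: mark most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # miss: evict least recently used
        self.blocks[tag] = fetch(tag)          # fill from the next level
        return "miss"
```

For a 2-way set and the access sequence A, B, A, C, the miss on C evicts B rather than A, because A was touched more recently.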
Lecture 13 Slide 24 EECS 470
Example: a=2, C=1kB, b=4B, word size=2B -- Basic Solution
[Diagram: two ways, each with a data array (128 lines × 4 bytes), a tag array (128 lines × 23 bits), and valid bits (128 × 1 bit). The address is split as tag = PA[31:9] (23 bits), idx = PA[8:2] (7 bits), b.o. = PA[1]. The 7-bit index reads both ways in parallel; each way's stored tag is compared against PA[31:9], producing hit0 and hit1. 2-to-1 muxes driven by the block offset and by hit0/hit1 select the 16-bit DATA, and HIT = hit0 OR hit1.]
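The address split in this example follows from the geometry: 1 kB / (2 ways × 4 B/block) = 128 sets, so 7 index bits; a 4 B block holds two 2 B words, so 1 block-offset bit; the remaining 23 bits are the tag. A sketch of the decomposition (function name ours):

```python
def decompose(pa):
    """Split a 32-bit physical address per the slide: tag | idx | b.o."""
    bo  = (pa >> 1) & 0x1     # PA[1]: selects one of two 2-byte words
    idx = (pa >> 2) & 0x7F    # PA[8:2]: 7-bit set index (128 sets)
    tag = pa >> 9             # PA[31:9]: 23-bit tag
    return tag, idx, bo

print(decompose(0x1FC))   # (0, 127, 0): tag 0, last set, first word
```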
Lecture 13 Slide 25 EECS 470
Write Policies
Writes are more interesting:
❒ on reads, data can be accessed in parallel with the tag compare
❒ on writes, two steps are needed
❒ is turn-around time important for writes? A common cache optimization defers writes in favor of reads
Choices of write policies:
❒ On write hits, update memory?
❍ Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth)
❍ No: write-back
❒ On write misses, allocate a cache block frame?
❍ Yes: write-allocate
❍ No: no-write-allocate
Lecture 13 Slide 26 EECS 470
Write Policies (Cont.)
Write-through:
❒ update memory on each write
❒ keeps memory up to date
❒ traffic/reference = f_writes, e.g. 0.20, independent of cache performance (miss rate)
Write-back:
❒ update memory only on block replacement
❒ many cache lines are only read and never written to
❒ add a "dirty" bit to the status word
❍ originally cleared after replacement
❍ set when a block frame is written to
❍ only write back a dirty block; "drop" clean blocks without a memory update
❒ traffic/reference = f_dirty × miss × B
❍ e.g., traffic/reference = 1/2 × 0.05 × 4 = 0.1
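The two traffic formulas compare directly in code (helper names ours; B is the block size, 4 in the slide's running example, and units follow the slide):

```python
def write_through_traffic(f_writes):
    # every write reaches memory, independent of the miss rate
    return f_writes

def write_back_traffic(f_dirty, miss_ratio, block_size):
    # only dirty victim blocks are written back, a whole block at a time
    return f_dirty * miss_ratio * block_size

# Slide's numbers: write-through moves 0.20 per reference;
# write-back moves 1/2 * 0.05 * 4 = 0.1 per reference.
print(write_through_traffic(0.20))
print(write_back_traffic(0.5, 0.05, 4))
```

Note that write-back traffic shrinks as the miss rate improves, while write-through traffic does not.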
Lecture 13 Slide 27 EECS 470
Store Buffers
Buffer CPU writes:
❒ allows reads to proceed
❒ stall only when the buffer is full
❒ data dependence?
❍ What happens on dependent loads/stores?
[Diagram: a store buffer sits between the CPU and the cache ($).]
Lecture 13 Slide 28 EECS 470
Writeback Buffers
Between a write-back cache and the next level:
1. Move replaced, dirty blocks to the buffer
2. Read the new line
3. Move the replaced data to memory
Usually only 1 or 2 write-back buffer entries are needed.
[Diagram: the writeback buffer sits between the cache ($) and the next level ($$/Memory).]
Lecture 13 Slide 29 EECS 470
"Harvard" vs. "Princeton"
Unified (sometimes known as Princeton):
❒ less costly, dynamic response, handles writes to instructions
Split I and D (sometimes known as Harvard):
❒ most of the time code and data don't mix
❒ 2x bandwidth; place each close to the I/D ports
❒ can customize sizes (I-footprint is generally smaller than D-footprint); no interference between I and D
❒ self-modifying code can cause "coherence" problems
Caches should be split for frequent simultaneous I & D access
❒ no longer a question in "high-performance" on-chip L1 caches