eecs 470 lecture 15

Lecture 13 Slide 1 EECS 470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar

EECS 470 Lecture 15: Basic Caches

Winter 2022

Prof. Ronald Dreslinski

http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.



Readings

For today:

❒  H&P 2.1

For Thursday:

❒  H&P 2.2, 2.3, B.3
❒  N. Jouppi. Improving direct-mapped cache performance…



Announcements

Midterm grades released.

If you are more than 2 std. dev. from the mean, please email me to set up a time to chat.

Look for HW4 to be released sometime tomorrow.



Staff Midterm Outcome

Lots of small suggestions; here is a list of actionable ones we will try to address:

1)  Fix the website/calendar
2)  More GSIs (less actionable this semester)
3)  Grades back sooner
4)  Office hours queues are long



Wide Fetch - Non-sequential

Two related questions

❒  How many branches are predicted per cycle?
❒  Can we fetch from multiple taken branches per cycle?

Simplest, most common organization: “1” and “No”

❒  One prediction; discard post-branch insns if the prediction is “Taken”
   –  Lowers effective fetch width and IPC
❒  Average number of instructions per taken branch?
   ❒  Assume: 20% branches, 50% taken → ~10 instructions
❒  Consider a 10-instruction loop body with an 8-issue processor
   ❒  Without smarter fetch, ILP is limited to 5 (not 8)

Compiler can help

❒  Unroll loops, reduce taken branch frequency
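The two numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes that fetch stops at a taken branch and that the loop's only taken branch is the last instruction of the body.

```python
import math

def insns_per_taken_branch(branch_frac, taken_frac):
    """Average number of instructions between consecutive taken branches."""
    return 1.0 / (branch_frac * taken_frac)

def loop_fetch_ipc(body_len, fetch_width):
    """Fetch-limited IPC for a loop body that ends in a taken branch.

    Assumption: a fetch group cannot continue past a taken branch, so the
    body takes ceil(body_len / fetch_width) fetch cycles per iteration.
    """
    cycles_per_iteration = math.ceil(body_len / fetch_width)
    return body_len / cycles_per_iteration

# 20% branches, 50% of them taken: roughly one taken branch per 10 instructions
print(f"{insns_per_taken_branch(0.20, 0.50):.1f} instructions per taken branch")

# 10-instruction loop on an 8-wide machine: 8 + 2 instructions over 2 cycles
print(loop_fetch_ipc(10, 8))  # 5.0
```

With an 8-wide fetch, the second cycle of every iteration fetches only two useful instructions, which is where the limit of 5 comes from.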



Multiple Branch Predictions

Issues with multiple branch predictions:

❒  Latency resulting from sequential predictions
❒  Later predictions based on stale/speculative history
❒  Don’t forget, 0.95 x 0.95 x 0.95 ≈ 0.86

[Figure: fetch address, three BTBs, and fetch blocks 1, 2, and 3]
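The compounding of prediction accuracy is easy to check numerically. This is a minimal sketch that assumes each prediction is independent with the same accuracy; the function name is hypothetical.

```python
def all_correct_probability(per_branch_accuracy, n_predictions):
    """Probability that n independent predictions are all correct."""
    return per_branch_accuracy ** n_predictions

for n in (1, 2, 3):
    print(n, round(all_correct_probability(0.95, n), 3))
# 1 0.95
# 2 0.902
# 3 0.857
```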



Examples of Multi-Branch Predictors

[Figure: a BHR (bits bn ... b0) indexes a PHT, which supplies predictions p0, p1, p2]

How do you update this thing after a branch resolves?



Examples of Multi-Branch Predictors

[Figure: multi-branch predictor. BHR (bn ... b0) with fields bn:2, bn-1:1, and bn-2:0; selector pairs b1 b0, b0 p0, and p0 p1; outputs p0, p1, p2; PHT of 2^(n-2) x 4 entries]



Multiple Predicted Taken Branches

Issues with multiple taken branches:

❒  Long latency with multiple sequential I-cache accesses
❒  or, multi-ported I-cache with slower access latency
❒  or, multi-banked I-cache to approximate multi-port

[Figure: fetch addresses (FA) for Block 1, Block 2, and Block 3; Block 1 instructions]



Multi-ported I-cache



Instruction Alignment and Collapsing

Issues with alignment and collapsing:

❒  Misalignment between fetch group and cache line.
❒  Packing of variable-sized blocks into fetch buffer.

[Figure: I-cache ports 1, 2, and 3 and the fetch buffer]



Memory Systems: Basic Caches



Memory Systems

Basic caches

❒  introduction
❒  fundamental questions
❒  cache size, block size, associativity

Advanced caches

Main memory

Virtual memory

Start today



Motivation

Want memory to appear:

❒  as fast as the CPU
❒  as large as required by all of the running applications

[Figure: processor vs. memory performance, 1985-2010; performance axis runs from 1 to 10000 in powers of ten]



Memory Hierarchy

Make common case fast:

❒  common: temporal & spatial locality
❒  fast: smaller, more expensive memory

[Figure: hierarchy of Registers, Caches, Memory, Disk (MEMS?), with directions labeled Larger and Faster]



Storage Hierarchies

Storage is layered in a hierarchy in order of

❒  increasing latency ($t_i$): $t_i < t_{i+1}$
❒  increasing size ($s_i$): $s_i < s_{i+1}$
   ⇒ decreasing unit cost ($c_i$): $c_i > c_{i+1}$
❒  decreasing bandwidth ($b_i$): $b_i > b_{i+1}$
❒  increasing transfer unit ($x_i$): $x_i < x_{i+1}$

Level 0: Registers
Level 1: (n levels of) Caches
Level 1.5: NVRAM?
Level 2: Main Memory (Primary Storage)
Level 2.5: Flash?
Level 3: Disks (Secondary Storage)
Level 4: Tape Backup (Tertiary Storage)

[Figure labels: ISA feature; Memory Abstractions]



Processor/Memory Boundaries

[Figure: I-Unit and E-Unit, Regs, I-TLB and D-TLB, L1 I-Cache and L1 D-Cache, L2 Cache (SRAM on-chip), L3 Cache (SRAM off-chip), and Main Memory (DRAM), with the processor boundary marked]



Caches

An automatically managed hierarchy

“A hiding place, esp. of goods, treasure, etc.” -- OED

Keep recently accessed blocks

❒  temporal locality

Break memory into blocks (several bytes) and transfer data to/from the cache in blocks

❒  spatial locality

A lot of architectures opt for software-managed scratch-pad memory instead, e.g., Cray-1, embedded processors. Why?

[Figure: CPU, cache ($), Memory]



Cache (Abstractly)

Keep recently accessed block in a “block frame”

❒  state (e.g., valid)
❒  address tag
❒  data

[Figure: a block frame. The address tag and state are bookkeeping overhead; the data field holds multiple bytes per block frame to amortize that overhead]



Cache (Abstractly)

On memory read

if the incoming address corresponds to one of the stored address tags then
❍  HIT
❍  return data

else
❍  MISS
❍  choose & displace a current block in use
❍  fetch new (referenced) block from memory into frame
❍  return data

- Where and how to look for a block? (Block placement)
- Which block is replaced on a miss? (Block replacement)
- What happens on a write? Write strategy (Later)
- What is kept? (Bookkeeping, data)
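The read algorithm above can be sketched as a small software model. This is a minimal illustration, not a hardware description: the class name `TinyCache` is hypothetical, the cache is modeled as fully associative, and round-robin victim selection is an assumption made only to keep the example short.

```python
class TinyCache:
    """Toy read-only cache: each block frame holds state (valid), tag, data."""

    def __init__(self, num_frames, block_size, memory):
        self.block_size = block_size
        self.memory = memory  # backing store, a bytes-like object
        self.frames = [{"valid": False, "tag": None, "data": None}
                       for _ in range(num_frames)]
        self.next_victim = 0  # round-robin replacement (assumption)

    def read(self, addr):
        # Block number serves as the tag; the remainder is the byte offset.
        tag, offset = divmod(addr, self.block_size)
        for frame in self.frames:
            if frame["valid"] and frame["tag"] == tag:
                return "HIT", frame["data"][offset]
        # MISS: choose and displace a frame, then fetch the referenced block.
        victim = self.frames[self.next_victim]
        self.next_victim = (self.next_victim + 1) % len(self.frames)
        base = tag * self.block_size
        victim.update(valid=True, tag=tag,
                      data=self.memory[base:base + self.block_size])
        return "MISS", victim["data"][offset]

memory = bytes(range(64))            # memory[i] == i, for easy checking
cache = TinyCache(num_frames=2, block_size=4, memory=memory)
print(cache.read(5))   # ('MISS', 5)  block 1 is fetched
print(cache.read(6))   # ('HIT', 6)   same block: spatial locality
print(cache.read(9))   # ('MISS', 9)  block 2 fills the second frame
print(cache.read(17))  # ('MISS', 17) block 4 displaces block 1
```

The four questions on the slide map onto the parts of this sketch: the search loop is block placement, the victim choice is block replacement, and the frame dictionary is the bookkeeping. Writes are not modeled.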



Terminology

block (cache line): minimum unit that may be present

hit: block is found in the cache

miss: block is not found in the cache

miss ratio: fraction of references that miss

hit time: time to access the cache

miss penalty

❒  time to replace block in the cache + deliver to upper level
❒  access time: time to get first word
❒  transfer time: time for remaining words



Cache Performance

Assume

❒  Cache access time is equal to 1 cycle
❒  Cache miss ratio is 0.01
❒  Cache miss penalty is 20 cycles

Mean access time

= Cache access time + miss ratio * miss penalty

= 1 + 0.01 * 20 = 1.2 cycles

Typically

❒  level-1 is 16K-64K, level-2 is 512K-4M, memory is 128M-4G
❒  level-1 is as fast as the processor (increasingly 2 cycles)
❒  level-1 has 1/10000 of the capacity but satisfies 98% of references

Memoization & amortization
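The mean access time formula is simple enough to evaluate directly. The sketch below uses the slide's numbers; the function name is made up for illustration, and the second call (a 2-cycle level-1 with a 0.02 miss ratio) is a hypothetical variation rather than a figure from the lecture.

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    """Mean access time in cycles: hit time plus the expected miss cost."""
    return hit_time + miss_ratio * miss_penalty

# Slide example: 1-cycle cache, 1% miss ratio, 20-cycle miss penalty
print(f"{mean_access_time(1, 0.01, 20):.2f} cycles")  # 1.20 cycles

# Hypothetical variation: 2-cycle cache, 2% miss ratio, same penalty
print(f"{mean_access_time(2, 0.02, 20):.2f} cycles")  # 2.40 cycles
```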



Fundamental Cache Parameters That Affect Miss Rate

Cache size (C)

Block size (b)

Cache associativity (a)



Cache Size

Cache size is the total data (not including tag) capacity

❒  bigger can exploit temporal locality better
❒  not ALWAYS better

Too large a cache

❒  smaller is faster => bigger is slower
❒  access time may degrade critical path

Too small a cache

❒  doesn’t exploit temporal locality well
❒  useful data constantly replaced

[Figure: hit rate vs. C, holding b and a constant, with the “working set” size marked on the C axis]



Block Size

Block size is the data that is

❒  associated with an address tag
❒  not necessarily the unit of transfer between hierarchies (sub-blocking)

Too small blocks

❒  don’t exploit spatial locality well
❒  have inordinate tag overhead

Too large blocks

❒  useless data transferred
❒  useful data permanently replaced: too few total # blocks

[Figure: curve plotted against b, holding C and a constant]



Associativity

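Fully-associative: block goes in any frame
(think all frames in 1 set)

Direct-mapped: block goes in exactly one frame
(think 1 frame per set)

Set-associative: a block goes in any frame in exactly one set
(frames grouped into sets)

Where does block 12 (b’1100) go?

[Figure: an 8-frame cache (block numbers 0-7) drawn three ways: fully associative, direct-mapped, and set-associative with sets 0-3 of two blocks each]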
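The placement question for block 12 can be worked out with modular arithmetic. This sketch assumes an 8-frame cache, a 2-way set-associative organization for the middle case, and the usual convention that the set index is the block address modulo the number of sets; the function name is hypothetical.

```python
def candidate_frames(block_addr, num_frames, assoc):
    """Return (set index, list of frames) where the block may be placed."""
    num_sets = num_frames // assoc
    set_index = block_addr % num_sets
    first_frame = set_index * assoc
    return set_index, list(range(first_frame, first_frame + assoc))

for name, assoc in [("fully associative", 8),
                    ("direct-mapped", 1),
                    ("2-way set-associative", 2)]:
    print(name, candidate_frames(12, num_frames=8, assoc=assoc))
# fully associative (0, [0, 1, 2, 3, 4, 5, 6, 7])
# direct-mapped (4, [4])
# 2-way set-associative (0, [0, 1])
```

Direct-mapped and fully associative are the two extremes of the same formula: associativity 1 gives one frame per set, and associativity equal to the number of frames gives a single set.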



Impact of Associativity

Typical values for associativity

❒  1-, 2-, 4-, 8-way associative

Larger associativity

❒  lower miss rate, less variation among programs
❒  only important for small “C/b”

Smaller associativity

❒  lower cost, faster hit time

[Figure: hit rate vs. a, holding C and b constant; the a axis is annotated “~5”]



Direct Mapped Caches

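[Figure: direct-mapped cache. The address is split into tag, idx, and b.o. (block offset); a decoder selects the entry, the stored tag is compared (=) with the address tag to produce the hit signal, and a multiplexor selects the data. Labels: tag index, block index]

Don’t forget to check the valid/state bits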



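Fully Associative Cache

[Figure: fully associative cache. The address is split into tag and blk.offset; the tag is compared (=) against every stored tag in an associative search, and a multiplexor selects the data]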



N-Way Set Associative Cache

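[Figure: N-way set-associative cache. The address is split into tag, idx, and b.o.; each way (bank) has its own decoder and tag match (=), a row across the ways forms a set, and a multiplexor selects the data]

Cache Size = $N \times 2^{B+b}$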
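The size formula can be checked against a concrete configuration. The sketch below assumes that B is the number of index bits and b is the number of block-offset bits, which the slide does not state explicitly; the function name is illustrative.

```python
def cache_size_bytes(n_ways, index_bits, offset_bits):
    """Data capacity of an N-way cache: N * 2^(B + b) bytes.

    Assumption: index_bits is B (selects the set) and offset_bits is b
    (selects the byte within a block).
    """
    return n_ways * 2 ** (index_bits + offset_bits)

# 2 ways, 128 sets (7 index bits), 4-byte blocks (2 offset bits)
print(cache_size_bytes(2, 7, 2))  # 1024
```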



Associative Block Replacement

Which block in a set to replace on a miss?

Ideally: Belady’s algorithm, replace the block that “will” be accessed the furthest in the future

❒  How do you implement it?

Approximations:

Least recently used (LRU)

❒  optimized (assume) for temporal locality (expensive for more than 2-way)

Not most recently used (NMRU)

❒  track MRU, random select from others, good compromise

Random

❒  nearly as good as LRU, simpler (usually pseudo-random)

How much can block replacement policy matter?
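To get a feel for LRU behaviour, a single cache set can be modeled with an ordered dictionary. This is a software sketch under simple assumptions (one set, tags as strings, no data stored); `LRUSet` and `count_misses` are hypothetical names, and real hardware tracks recency with a few state bits rather than a list.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; keys are kept oldest-first."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()

    def access(self, tag):
        """Return True on a hit, False on a miss (which fills the set)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return False

def count_misses(trace, ways):
    lru_set = LRUSet(ways)
    return sum(not lru_set.access(tag) for tag in trace)

# Cyclic reuse of 3 blocks in a 2-way set: LRU misses on every access
print(count_misses(["A", "B", "C", "A", "B", "C"], ways=2))  # 6

# Reuse concentrated on one block: LRU keeps it resident
print(count_misses(["A", "B", "A", "C", "A", "B"], ways=2))  # 4
```

The first trace shows a case where the policy matters a great deal: the working set is only one block larger than the set, yet LRU always evicts the block that is needed next.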



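Example: a=2, C=1kB, b=4B, word-size=2B

Basic Solution

[Figure: 2-way set-associative cache datapath, summarized in the lists that follow]

Storage arrays (one of each per way):

❒  data0, data1: 128 lines x 4 bytes
❒  tag0, tag1: 128 lines x 23 bits
❒  v0, v1: 128 lines x 1 bit

Address fields:

❒  tag: PA[31:9] (23 bits)
❒  idx: PA[8:2] (7 bits), indexes every array
❒  b.o.: PA[1]
❒  PA[0]: byte within the 2-byte word

Datapath:

❒  each way compares its stored tag with the address tag (=) to produce hit0 and hit1
❒  a 2-to-1 mux per way selects the word using b.o.
❒  a final 2-to-1 mux selects between the ways using hit0 and hit1
❒  outputs: HIT and DATA (16 bits)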
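The field widths in this example follow from the parameters: C / (a x b) = 1024 / (2 x 4) = 128 sets, so 7 index bits, and a 4-byte block of 2-byte words needs 1 block-offset bit plus 1 byte-in-word bit. The sketch below splits a 32-bit physical address accordingly; the function names and the sample address 0x1234 are illustrative choices, not part of the lecture.

```python
def num_sets(cache_bytes, assoc, block_bytes):
    """Number of sets for the given capacity, associativity, and block size."""
    return cache_bytes // (assoc * block_bytes)

def split_address(pa):
    """Split a 32-bit physical address for a=2, C=1kB, b=4B, word-size=2B."""
    byte_in_word = pa & 0x1         # PA[0]
    block_offset = (pa >> 1) & 0x1  # PA[1]: which 2-byte word in the block
    index = (pa >> 2) & 0x7F        # PA[8:2]: one of 128 sets
    tag = (pa >> 9) & 0x7FFFFF      # PA[31:9]: 23 bits
    return tag, index, block_offset, byte_in_word

print(num_sets(1024, 2, 4))   # 128
print(split_address(0x1234))  # (9, 13, 0, 0)
```

As a check on the sample address: 9 x 512 + 13 x 4 = 4660 = 0x1234.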



Write Policies

Writes are more interesting

❒  on reads, data can be accessed in parallel with tag compare
❒  on writes, needs two steps
❒  is turn-around time important for writes? cache optimizations often defer writes for reads

Choices of Write Policies

❒  On write hits, update memory?
   ❍  Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth)
   ❍  No: write-back
❒  On write misses, allocate a cache block frame?
   ❍  Yes: write-allocate
   ❍  No: no-write-allocate



Write Policies (Cont.)

Write-through

❒  update memory on each write
❒  keeps memory up-to-date
❒  traffic/reference = $f_{\mathrm{writes}}$, e.g. 0.20, independent of cache performance (miss rate)

Write-back

❒  update memory only on block replacement
❒  many cache lines are only read and never written to
❒  add “dirty” bit to status word
   ❍  originally cleared after replacement
   ❍  set when a block frame is written to
   ❍  only write back a dirty block, and “drop” clean blocks w/o memory update
❒  traffic/reference = $f_{\mathrm{dirty}} \times \mathrm{miss} \times B$
   ❍  e.g., traffic/reference = 1/2 x 0.05 x 4 = 0.1
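The two traffic estimates can be compared side by side. This sketch plugs in the slide's example numbers; the function names are made up, and B is taken to be the block size in the same units as a single write (an assumption about the slide's notation).

```python
def write_through_traffic(f_writes):
    """Memory writes per reference: every write goes to memory."""
    return f_writes

def write_back_traffic(f_dirty, miss_ratio, block_units):
    """Memory write traffic per reference: only dirty blocks are written
    back, and only when a miss replaces them."""
    return f_dirty * miss_ratio * block_units

print(f"write-through: {write_through_traffic(0.20):.2f}")       # 0.20
print(f"write-back:    {write_back_traffic(0.5, 0.05, 4):.2f}")  # 0.10
```

Write-through traffic does not depend on the miss ratio, whereas write-back traffic shrinks as the miss ratio drops and grows with block size.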



Store Buffers

Buffer CPU writes

❒  allows reads to proceed
❒  stall only when full
❒  data dependence?
   ❍  What happens on dependent loads/stores?

[Figure: CPU and cache ($)]
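One common answer to the dependence question is to have loads search the store buffer and take the value of the youngest matching store before falling back to the cache. The sketch below models that choice in software; it is an assumption for illustration, not necessarily the design the lecture has in mind, and the class name, the dictionary standing in for the cache, and the default value of 0 for untouched addresses are all hypothetical.

```python
from collections import deque

class StoreBuffer:
    """FIFO of pending (address, value) writes between the CPU and cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest entry on the left

    def store(self, addr, value):
        """Buffer a write; return False if the CPU must stall (buffer full)."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def load(self, addr, cache):
        """Forward from the youngest matching store, else read the cache."""
        for pending_addr, value in reversed(self.entries):
            if pending_addr == addr:
                return value
        return cache.get(addr, 0)

    def drain_one(self, cache):
        """Retire the oldest buffered write into the cache, if any."""
        if self.entries:
            addr, value = self.entries.popleft()
            cache[addr] = value

cache = {0x10: 1}
sb = StoreBuffer(capacity=2)
sb.store(0x10, 7)
print(sb.load(0x10, cache))  # 7, forwarded from the buffer
print(cache[0x10])           # 1, the cache still holds the old value
sb.drain_one(cache)
print(cache[0x10])           # 7
```

A simpler alternative is to stall a load that matches a buffered store until the buffer drains, trading performance for less comparison hardware.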



Writeback Buffers

Between write-back cache and next level

1. Move replaced, dirty blocks to buffer
2. Read new line
3. Move replaced data to memory

Usually only need 1 or 2 write-back buffer entries

[Figure: cache ($) and next level ($$/Memory)]



“Harvard” vs. “Princeton”

Unified (sometimes known as Princeton)

❒  less costly, dynamic response, handles writes to instructions

Split I and D (sometimes known as Harvard)

❒  most of the time code and data don’t mix
❒  2x bandwidth, place close to I/D ports
❒  can customize size (I-footprint generally smaller than D-footprint), no interference between I/D
❒  self-modifying code can cause “coherence” problems

Caches should be split for frequent simultaneous I & D access

❒  no longer a question in “high-performance” on-chip L-1 caches
