eecs 470 lecture 15
TRANSCRIPT
Lecture 13 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 15: Basic Caches
Winter 2022
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 13 Slide 2 EECS 470
Readings
For Today:
❒ H&P 2.1
For Thursday:
❒ H&P 2.2, 2.3, B.3
❒ N. Jouppi. Improving direct-mapped cache performance…
Lecture 13 Slide 3 EECS 470
Announcements
Midterm grades released.
If you are more than 2 Std. Dev. from the mean, please email me to set up a time to chat.
Look for HW4 to be released sometime tomorrow.
Lecture 13 Slide 4 EECS 470
Staff Midterm Outcome
Lots of small suggestions; here is a list of actionable ones we will try to address:
1) Fix the website/calendar
2) More GSIs (less actionable this semester)
3) Grades back sooner
4) Office hours queues long
Lecture 12 Slide 5 EECS 470
Wide Fetch - Non-sequential
Two related questions
❒ How many branches predicted per cycle?
❒ Can we fetch from multiple taken branches per cycle?
Simplest, most common organization: “1” and “No”
❒ One prediction, discard post-branch insns if prediction is “Taken”
– Lowers effective fetch width and IPC
❒ Average number of instructions per taken branch?
❒ Assume: 20% branches, 50% taken → ~10 instructions
❒ Consider a 10-instruction loop body with an 8-issue processor
❒ Without smarter fetch, ILP is limited to 5 (not 8)
Compiler can help
❒ Unroll loops, reduce taken branch frequency
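The two numbers on this slide can be checked with a short sketch. The function names are illustrative, not from the lecture, and the model assumes that fetch stops at the first taken branch in a cycle and that the loop body ends in exactly one taken branch.

```python
import math

def insns_per_taken_branch(branch_frac, taken_frac):
    # On average, one taken branch every 1 / (branch_frac * taken_frac) instructions.
    return 1.0 / (branch_frac * taken_frac)

def effective_fetch_rate(loop_body_len, fetch_width):
    # Assumption: fetch cannot continue past the taken loop-back branch,
    # so each iteration costs ceil(len / width) fetch cycles.
    cycles_per_iteration = math.ceil(loop_body_len / fetch_width)
    return loop_body_len / cycles_per_iteration

print(insns_per_taken_branch(0.20, 0.50))  # 20% branches, 50% taken
print(effective_fetch_rate(10, 8))         # 10-insn loop, 8-wide fetch
```

Under these assumptions the 10-instruction loop needs two fetch cycles per iteration (8 instructions, then 2), which is where the limit of 5 instructions per cycle comes from.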
Lecture 12 Slide 6 EECS 470
Multiple Branch Predictions
Issues with multiple branch predictions:
❒ Latency resulting from sequential predictions
❒ Later predictions based on stale/speculative history
❒ Don’t forget, 0.95 x 0.95 x 0.95 ≈ 0.86
[Figure: fetch address, three BTBs, Block 1, Block 2, Block 3]
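The compounding effect in the last bullet can be made concrete with a small sketch. It assumes the predictions in one cycle are independent, which is a simplification, and the function name is hypothetical.

```python
def all_correct_probability(per_branch_accuracy, num_predictions):
    # Assumption: each prediction is correct independently of the others.
    return per_branch_accuracy ** num_predictions

for n in (1, 2, 3, 4):
    print(n, round(all_correct_probability(0.95, n), 3))
```

With three predictions per cycle at 95% accuracy each, the whole fetch group is correct only about 86% of the time.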
Lecture 12 Slide 7 EECS 470
Examples of Multi-Branch Predictors
[Figure: BHR (bn ... b0), PHT, predictions p0, p1, p2]
How do you update this thing after a branch resolves?
Lecture 12 Slide 8 EECS 470
Examples of Multi-Branch Predictors
[Figure: BHR (bn ... b0); history slices bn:2, bn-1:1, bn-2:0, extended with b1 b0, b0 p0, and p0 p1; predictions p0, p1, p2; PHT with 2^(n-2) x 4 entries]
Lecture 12 Slide 9 EECS 470
Multiple Predicted Taken Branches
Issues with multiple taken branches:
❒ Long latency with multiple sequential I-cache accesses
❒ or, multi-ported I-cache with slower access latency
❒ or, multi-banked I-cache to approximate multi-port
[Figure: fetch addresses (FA) for Blocks 1, 2, 3 into a multi-ported I-cache; outputs are Block 1, 2, 3 instructions]
Lecture 12 Slide 10 EECS 470
Instruction Alignment and Collapsing
Issues with alignment and collapsing:
❒ Misalignment between fetch group and cache line.
❒ Packing of variable-sized blocks into fetch buffer.
[Figure: I-cache Ports 1, 2, 3 and the fetch buffer]
Lecture 13 Slide 11 EECS 470
Memory Systems: Basic Caches
Lecture 13 Slide 12 EECS 470
Memory Systems
Basic caches
❒ introduction
❒ fundamental questions
❒ cache size, block size, associativity
Advanced caches
Main memory
Virtual memory
Start today
Lecture 13 Slide 13 EECS 470
Motivation
Want memory to appear:
❒ as fast as CPU
❒ as large as required by all of the running applications
[Figure: Processor vs. Memory performance, log scale from 1 to 10000, years 1985 to 2010]
Lecture 13 Slide 14 EECS 470
Memory Hierarchy
Make common case fast:
❒ common: temporal & spatial locality
❒ fast: smaller, more expensive memory
[Figure: hierarchy of Registers, Caches, Memory, Disk (MEMS?), with arrows labeled Larger and Faster]
Lecture 13 Slide 15 EECS 470
Storage Hierarchies
Storages are layered by hierarchies in order of
❒ increasing latency (t_i): t_i < t_{i+1}
❒ increasing size (s_i) ⇒ decreasing unit cost (c_i): s_i < s_{i+1}, c_i > c_{i+1}
❒ decreasing bandwidth (b_i): b_i > b_{i+1}
❒ increasing xfer unit (x_i): x_i < x_{i+1}
Level 0: Registers
Level 1: (n levels of) Caches
Level 1.5: NVRAM?
Level 2: Main Memory (Primary Storage)
Level 2.5: Flash?
Level 3: Disks (Secondary Storage)
Level 4: Tape Backup (Tertiary Storage)
[Figure labels: ISA feature, Memory Abstractions]
Lecture 13 Slide 16 EECS 470
Processor/Memory Boundaries
[Figure: Processor containing I-Unit, E-Unit, Regs, I-TLB, D-TLB, L1 I-Cache, L1 D-Cache, and L2 Cache (SRAM on-chip); outside it, L3 Cache (SRAM off-chip) and Main Memory (DRAM)]
Lecture 13 Slide 17 EECS 470
Caches
An automatically managed hierarchy
“A hiding place, esp. of goods, treasure, etc.” -- OED
Keep recently accessed block
❒ temporal locality
Break memory into blocks (several bytes) and transfer data to/from cache in blocks
❒ spatial locality
A lot of architectures opt for software-managed scratch-pad memory instead, e.g. Cray-1, embedded processors. Why??
[Figure: CPU, $ (cache), Memory]
Lecture 13 Slide 18 EECS 470
Cache (Abstractly)
Keep recently accessed block in “block frame”
❒ state (e.g., valid)
❒ address tag
❒ data
[Figure: a block frame with address and state fields (bookkeeping overhead) and a data field (multiple bytes per block frame to amortize overhead)]
Lecture 13 Slide 19 EECS 470
Cache (Abstractly)
On memory read
if incoming address corresponds to one of the stored address tags then
❍ HIT
❍ return data
else
❍ MISS
❍ choose & displace a current block in use
❍ fetch new (referenced) block from memory into frame
❍ return data
- Where and how to look for a block? (Block placement)
- Which block is replaced on a miss? (Block replacement)
- What happens on a write? Write strategy (Later)
- What is kept? (Bookkeeping, data)
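The read procedure on this slide can be sketched as a toy model. Everything here is an assumption made for illustration: the class name, a fully associative organization, block-granularity addresses, and a first-in-first-out choice of which block to displace (the slide leaves the replacement policy open).

```python
class AbstractCache:
    """Toy fully associative cache; addresses are block addresses."""

    def __init__(self, num_frames, memory):
        self.num_frames = num_frames
        self.memory = memory   # backing store: block address -> data
        self.frames = {}       # stored tag -> data; presence means valid
        self.hits = 0
        self.misses = 0

    def read(self, block_addr):
        if block_addr in self.frames:          # tag match: HIT
            self.hits += 1
            return self.frames[block_addr]
        self.misses += 1                       # MISS
        if len(self.frames) == self.num_frames:
            # Displace the block that was filled earliest (FIFO, an assumption).
            victim = next(iter(self.frames))
            del self.frames[victim]
        # Fetch the referenced block from memory into a frame, then return it.
        self.frames[block_addr] = self.memory[block_addr]
        return self.frames[block_addr]


memory = {addr: addr * 10 for addr in range(8)}
cache = AbstractCache(num_frames=2, memory=memory)
for addr in [0, 1, 0, 2, 0]:
    cache.read(addr)
print(cache.hits, cache.misses)
```

The four questions at the bottom of the slide map onto the pieces of this sketch: the dictionary lookup is block placement, the victim choice is block replacement, and the stored tag plus data is the bookkeeping. Writes are not modeled.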
Lecture 13 Slide 20 EECS 470
Terminology
block (cache line): minimum unit that may be present
hit: block is found in the cache
miss: block is not found in the cache
miss ratio: fraction of references that miss
hit time: time to access the cache
miss penalty
❒ time to replace block in the cache + deliver to upper level
❒ access time: time to get first word
❒ transfer time: time for remaining words
Lecture 13 Slide 21 EECS 470
Cache Performance
Assume
❒ Cache access time is equal to 1 cycle
❒ Cache miss ratio is 0.01
❒ Cache miss penalty is 20 cycles
Mean access time
= Cache access time + miss ratio * miss penalty
= 1 + 0.01 * 20 = 1.2 cycles
Typically
❒ level-1 is 16K-64K, level-2 is 512K-4M, memory is 128M-4G
❒ level-1 as fast as the processor (increasingly 2-cycles)
❒ level-1 is 1/10000 capacity but contains 98% of references
Memoization & amortization
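The mean access time formula is easy to experiment with. The sketch below uses a hypothetical function name and the slide's numbers; the second call is an extra data point, not from the lecture, showing how sensitive the result is to the miss ratio.

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    # Mean access time = cache access time + miss ratio * miss penalty (in cycles).
    return hit_time + miss_ratio * miss_penalty

print(mean_access_time(1, 0.01, 20))  # the slide's example
print(mean_access_time(1, 0.05, 20))  # five times the miss ratio
```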
Lecture 13 Slide 22 EECS 470
Fundamental Cache Parameters that affect miss rate
Cache size (C)
Block size (b)
Cache associativity (a)
Lecture 13 Slide 23 EECS 470
Cache Size
Cache size is the total data (not including tag) capacity
❒ bigger can exploit temporal locality better
❒ not ALWAYS better
Too large a cache
❒ smaller is faster => bigger is slower
❒ access time may degrade critical path
Too small a cache
❒ don’t exploit temporal locality well
❒ useful data constantly replaced
[Figure: hit rate vs. C, holding b and a constant; “working set” size marked on the C axis]
Lecture 13 Slide 24 EECS 470
Block Size
Block size is the data that is
❒ associated with an address tag
❒ not necessarily the unit of transfer between hierarchies (sub-blocking)
Too small blocks
❒ don’t exploit spatial locality well
❒ have inordinate tag overhead
Too large blocks
❒ useless data transferred
❒ useful data permanently replaced: too few total # blocks
[Figure: curve plotted against b, holding C and a constant]
Lecture 13 Slide 25 EECS 470
Associativity
Fully-associative: block goes in any frame (think all frames in 1 set)
Direct-mapped: block goes in exactly one frame (think 1 frame per set)
Set-associative: a block goes in any frame in exactly one set (frames grouped into sets)
Where does block 12 (b’1100) go?
[Figure: eight frames numbered 0-7 under each organization; the set-associative case shows sets 0-3 with blocks 0-1 inside each set]
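The block 12 question can be worked through with a short sketch. It assumes the usual convention that the set is chosen by the block number modulo the number of sets and that a set occupies adjacent frames; the function name is hypothetical.

```python
def candidate_frames(block_number, num_frames, associativity):
    # Frames are grouped into num_frames // associativity sets of adjacent frames.
    num_sets = num_frames // associativity
    set_index = block_number % num_sets   # assumed placement rule
    first = set_index * associativity
    return list(range(first, first + associativity))

print("fully associative:", candidate_frames(12, 8, 8))
print("direct mapped:    ", candidate_frames(12, 8, 1))
print("2-way set assoc.: ", candidate_frames(12, 8, 2))
```

Direct-mapped and fully associative fall out as the two extremes of the same function: associativity 1 gives one frame per set, and associativity equal to the number of frames gives a single set.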
Lecture 13 Slide 26 EECS 470
Impact of Associativity
Typical values for associativity
❒ 1, 2-, 4-, 8-way associative
Larger associativity
❒ lower miss rate, less variation among programs
❒ only important for small “C/b”
Smaller associativity
❒ lower cost, faster hit time
[Figure: hit rate vs. a, holding C and b constant; ~5 marked on the a axis]
Lecture 13 Slide 27 EECS 470
Direct Mapped Caches
[Figure: direct-mapped cache datapath. Address fields: tag, idx, b.o.; components: decoders, tag comparators (= tag match, hit?), multiplexor; labels: tag index, block index]
Don’t forget to check the valid/state bits
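A minimal sketch of the direct-mapped lookup follows, including the valid-bit check the slide warns about. The class and method names are assumptions, sizes are assumed to be powers of two, and only tags are modeled (no data array).

```python
class DirectMappedCache:
    def __init__(self, num_lines, block_bytes):
        # Both sizes are assumed to be powers of two.
        self.offset_bits = block_bytes.bit_length() - 1
        self.index_bits = num_lines.bit_length() - 1
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def split(self, addr):
        """Split an address into (tag, idx, block offset)."""
        offset = addr & ((1 << self.offset_bits) - 1)
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index, _ = self.split(addr)
        # A tag match only counts if the frame is valid.
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(num_lines=8, block_bytes=4)
print(cache.split(0x1234))
print(cache.access(0x0))   # tag 0 equals the reset tag value, but the frame is invalid
print(cache.access(0x0))
```

The first access to address 0 shows why the valid bit matters: the stored tag happens to equal the incoming tag, yet the frame holds no real data, so the access has to be treated as a miss.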
Lecture 13 Slide 28 EECS 470
Fully Associative Cache
[Figure: address fields tag and blk.offset; the tag is compared (=) against every stored Tag in an associative search; a multiplexor selects the data]
Lecture 13 Slide 29 EECS 470
N-Way Set Associative Cache
[Figure: address fields tag, idx, b.o.; each way (bank) has its own decoder and tag match comparator (=); a multiplexor selects among the ways; labels: a set, a way (bank)]
Cache Size = N x 2^(B+b)
Lecture 13 Slide 30 EECS 470
Associative Block Replacement
Which block in a set to replace on a miss?
Ideally: Belady’s algorithm, replace the block that “will” be accessed the furthest in the future
❒ How do you implement it?
Approximations:
Least recently used (LRU)
❒ optimized (assume) for temporal locality (expensive for more than 2-way)
Not most recently used (NMRU)
❒ track MRU, random select from others, good compromise
Random
❒ nearly as good as LRU, simpler (usually pseudo-random)
How much can block replacement policy matter?
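One way to get a feel for the closing question is to count misses for a single set under LRU and under Belady's policy on the same trace. This is a sketch with hypothetical function names; Belady is computed offline here because the whole trace is known in advance, which real hardware does not have. The cyclic trace is chosen deliberately as a bad case for LRU.

```python
def count_misses_lru(trace, ways):
    frames = []  # least recently used first, most recently used last
    misses = 0
    for block in trace:
        if block in frames:
            frames.remove(block)
        else:
            misses += 1
            if len(frames) == ways:
                frames.pop(0)  # evict the least recently used block
        frames.append(block)
    return misses

def count_misses_belady(trace, ways):
    frames = []
    misses = 0
    for i, block in enumerate(trace):
        if block in frames:
            continue
        misses += 1
        if len(frames) == ways:
            future = trace[i + 1:]
            # Distance to next use; blocks never used again sort last.
            def next_use(b):
                return future.index(b) if b in future else len(future)
            frames.remove(max(frames, key=next_use))
        frames.append(block)
    return misses


trace = [0, 1, 2] * 3  # three blocks cycling through a 2-way set
print(count_misses_lru(trace, 2), count_misses_belady(trace, 2))
```

On this trace LRU misses on every reference, while the oracle keeps one block resident across iterations, so policy can matter a great deal when the working set slightly exceeds the set's capacity.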
Lecture 13 Slide 31 EECS 470
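Example: a=2, C=1kB, b=4B, word-size=2B
Basic Solution
❒ data0, data1: 128 lines x 4 bytes each
❒ tag0, tag1: 128 lines x 23 bits each
❒ v0, v1: 128 lines x 1 bit each
❒ tag = PA[31:9] (23 bits), idx = PA[8:2] (7 bits), b.o. = PA[1]; PA[0] selects the byte within a word
[Figure: idx (7 bits) drives all six arrays; two comparators (=) match the stored tags against the 23-bit address tag to produce hit0 and hit1; 2-to-1 muxes pick the word by b.o. and the way by hit0/hit1; outputs are HIT and 16-bit DATA]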
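The array sizes in this example follow from the three parameters. The sketch below recomputes them; the function name is hypothetical, and the 32-bit physical address width is taken from the PA[31:9] label in the figure.

```python
def cache_geometry(capacity_bytes, block_bytes, associativity, addr_bits=32):
    # Sizes are assumed to be powers of two.
    num_sets = capacity_bytes // (block_bytes * associativity)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "sets": num_sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }

print(cache_geometry(capacity_bytes=1024, block_bytes=4, associativity=2))
```

With 2-byte words, only the upper of the two offset bits (PA[1]) is needed to pick a word out of the 4-byte block, which matches the single b.o. bit shown in the figure.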
Lecture 13 Slide 32 EECS 470
Write Policies
Writes are more interesting
❒ on reads, data can be accessed in parallel with tag compare
❒ on writes, needs two steps
❒ is turn-around time important for writes? cache optimizations often defer writes in favor of reads
Choices of Write Policies
❒ On write hits, update memory?
❍ Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth)
❍ No: write-back
❒ On write misses, allocate a cache block frame?
❍ Yes: write-allocate
❍ No: no-write-allocate
Lecture 13 Slide 33 EECS 470
Write Policies (Cont.)
Write-through
❒ update memory on each write
❒ keeps memory up-to-date
❒ traffic/reference = f_writes, e.g. 0.20, independent of cache performance (miss rate)
Write-back
❒ update memory only on block replacement
❒ many cache lines are only read and never written to
❒ add “dirty” bit to status word
❍ originally cleared after replacement
❍ set when a block frame is written to
❍ only write back a dirty block, and “drop” clean blocks w/o memory update
❒ traffic/reference = f_dirty x miss x B
❍ e.g., traffic/reference = 1/2 x 0.05 x 4 = 0.1
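The two traffic formulas can be compared directly. In this sketch the function names are hypothetical, and B is read as the block size in transfer units (words), which is an interpretation of the slide rather than something it states.

```python
def write_through_traffic(write_frac):
    # Every write goes to memory, regardless of the miss rate.
    return write_frac

def write_back_traffic(dirty_frac, miss_ratio, block_words):
    # Only replaced dirty blocks are written, but a whole block at a time.
    return dirty_frac * miss_ratio * block_words

print(write_through_traffic(0.20))
print(write_back_traffic(dirty_frac=0.5, miss_ratio=0.05, block_words=4))
```

With the slide's numbers write-back generates half the write traffic of write-through, and the gap widens as the miss ratio falls or narrows as blocks get larger.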
Lecture 13 Slide 34 EECS 470
Store Buffers
Buffer CPU writes
❒ allows reads to proceed
❒ stall only when full
❒ data dependence?
❍ What happens on dependent loads/stores?
[Figure: CPU, store buffer, $ (cache)]
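One common way to handle the dependence question is to have loads search the store buffer before the cache. The sketch below illustrates that idea only; it is not necessarily the design the lecture has in mind. The class name, the dictionary standing in for the cache, and whole-word matching on exact addresses are all assumptions.

```python
from collections import deque

class StoreBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # (address, value) pairs, oldest first

    def store(self, addr, value):
        """Return False if the buffer is full, meaning the CPU must stall."""
        if len(self.entries) == self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def load(self, addr, cache):
        # A dependent load must see the youngest buffered store to its address.
        for buffered_addr, value in reversed(self.entries):
            if buffered_addr == addr:
                return value
        return cache.get(addr)

    def drain_one(self, cache):
        # Retire the oldest store into the cache, preserving store order.
        if self.entries:
            addr, value = self.entries.popleft()
            cache[addr] = value


cache = {0x10: 1}
sb = StoreBuffer(capacity=2)
sb.store(0x10, 7)
sb.store(0x10, 9)
print(sb.store(0x20, 3))     # buffer full: stall
print(sb.load(0x10, cache))  # forwarded from the youngest matching store
```

Searching youngest-first and draining oldest-first keeps loads consistent with program order even when several buffered stores target the same address.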
Lecture 13 Slide 35 EECS 470
Writeback Buffers
Between write-back cache and next level
1. Move replaced, dirty blocks to buffer
2. Read new line
3. Move replaced data to memory
Usually only need 1 or 2 write-back buffer entries
[Figure: $ (cache), writeback buffer, $$/Memory]
Lecture 13 Slide 36 EECS 470
“Harvard” vs. “Princeton”
Unified (sometimes known as Princeton)
❒ less costly, dynamic response, handles writes to instructions
Split I and D (sometimes known as Harvard)
❒ most of the time code and data don’t mix
❒ 2x bandwidth, place close to I/D ports
❒ can customize size (I-footprint generally smaller than D-footprint), no interference between I/D
❒ self-modifying code can cause “coherence” problems
Caches should be split for frequent simultaneous I & D access
❒ no longer a question in “high-performance” on-chip L-1 caches