TRANSCRIPT
Lecture 13 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 13: Basic Caches
Winter 2019
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
Lecture 13 Slide 2 EECS 470
Readings
For today:
❒ H&P 2.1
For Wednesday:
❒ H&P 2.2, 2.3, B.3
❒ N. Jouppi. Improving direct-mapped cache performance…
Lecture 13 Slide 3 EECS 470
Memory Systems: Basic Caches
Lecture 13 Slide 4 EECS 470
Memory Systems
Basic caches (start today)
❒ introduction
❒ fundamental questions
❒ cache size, block size, associativity
Advanced caches
Main memory
Virtual memory
Lecture 13 Slide 5 EECS 470
Motivation
Want memory to appear:
❒ as fast as the CPU
❒ as large as required by all of the running applications
[Figure: relative performance (log scale, 1 to 10000) vs. year, 1985-2010; the "Processor" curve climbs far faster than the "Memory" curve, opening the processor-memory gap.]
Lecture 13 Slide 6 EECS 470
Memory Hierarchy
Make the common case fast:
❒ common: temporal & spatial locality
❒ fast: smaller, more expensive memory
[Figure: hierarchy of Registers, Caches, Memory, Disk (MEMS?); levels get larger toward the bottom, faster toward the top.]
Lecture 13 Slide 7 EECS 470
Storage Hierarchies
Storage is layered into hierarchies in order of
❒ increasing latency (t_i): t_i < t_{i+1}
❒ increasing size (s_i): s_i < s_{i+1} ⇒ decreasing unit cost (c_i): c_i > c_{i+1}
❒ decreasing bandwidth (b_i): b_i > b_{i+1}
❒ increasing transfer unit (x_i): x_i < x_{i+1}
Level 0: Registers (an ISA feature)
Level 1: (n levels of) Caches
Level 1.5: NVRAM?
Level 2: Main Memory (Primary Storage)
Level 2.5: Flash?
Level 3: Disks (Secondary Storage)
Level 4: Tape Backup (Tertiary Storage)
(The levels below the registers are memory abstractions.)
Lecture 13 Slide 8 EECS 470
Processor/Memory Boundaries
[Diagram: the processor contains the register file (Regs), I-Unit and E-Unit, an L1 I-Cache with I-TLB and an L1 D-Cache with D-TLB, plus an on-chip L2 Cache (SRAM); an off-chip L3 Cache (SRAM) sits between the processor and Main Memory (DRAM).]
Lecture 13 Slide 9 EECS 470
Caches
An automatically managed hierarchy
"A hiding place, esp. of goods, treasure, etc." -- OED
Keep recently accessed blocks
❒ temporal locality
Break memory into blocks (several bytes) and transfer data to/from the cache in blocks
❒ spatial locality
A lot of architectures opt for software-managed scratch-pad memory instead, e.g. the Cray-1 and embedded processors. Why?
[Diagram: CPU <-> cache ($) <-> Memory]
Lecture 13 Slide 10 EECS 470
Cache (Abstractly)
Keep recently accessed blocks in "block frames":
❒ state (e.g., valid)
❒ address tag
❒ data
The address tag and state are bookkeeping overhead; storing multiple bytes of data per block frame amortizes that overhead.
Lecture 13 Slide 11 EECS 470
Cache (Abstractly)
On a memory read:
if the incoming address matches one of the stored address tags then
❍ HIT
❍ return data
else
❍ MISS
❍ choose & displace a block currently in use
❍ fetch the new (referenced) block from memory into the frame
❍ return data
Fundamental questions:
- Where and how to look for a block? (Block placement)
- Which block is replaced on a miss? (Block replacement)
- What happens on a write? (Write strategy -- later)
- What is kept? (Bookkeeping, data)
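The read handling above maps directly to code. Here is a minimal sketch, not the course's implementation: a fully associative cache modeled as a Python dict, with the displacement choice left deliberately naive (oldest-inserted block); real policies are discussed under block replacement.

```python
class TinyCache:
    """Fully associative cache: any block may occupy any frame."""

    def __init__(self, num_frames, memory):
        self.num_frames = num_frames
        self.memory = memory      # backing store: block address -> data
        self.frames = {}          # address tag -> data (insertion-ordered)

    def read(self, block_addr):
        # HIT: incoming address matches a stored address tag
        if block_addr in self.frames:
            return ("hit", self.frames[block_addr])
        # MISS: choose & displace a current block if all frames are in use
        if len(self.frames) >= self.num_frames:
            victim = next(iter(self.frames))   # naive: oldest-inserted block
            del self.frames[victim]
        # fetch the new (referenced) block from memory into a frame
        self.frames[block_addr] = self.memory[block_addr]
        return ("miss", self.frames[block_addr])
```

With 2 frames, reading blocks 0, 0, 1, 2 hits on the second access and displaces block 0 on the fourth.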
Lecture 13 Slide 12 EECS 470
Terminology
block (cache line) -- the minimum unit that may be present
hit -- the block is found in the cache
miss -- the block is not found in the cache
miss ratio -- the fraction of references that miss
hit time -- time to access the cache
miss penalty
❒ time to replace the block in the cache + deliver data to the upper level
❒ access time -- time to get the first word
❒ transfer time -- time for the remaining words
Lecture 13 Slide 13 EECS 470
Cache Performance
Assume:
❒ cache access time = 1 cycle
❒ cache miss ratio = 0.01
❒ cache miss penalty = 20 cycles
Mean access time = cache access time + miss ratio × miss penalty
                 = 1 + 0.01 × 20 = 1.2 cycles
Typically:
❒ level 1 is 16K-64K, level 2 is 512K-4M, memory is 128M-4G
❒ level 1 is as fast as the processor (increasingly 2 cycles)
❒ level 1 is 1/10000 the capacity but contains 98% of references
Memoization & amortization
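The arithmetic above generalizes to any hit time, miss ratio, and miss penalty; a one-line helper (name ours) makes the formula concrete:

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    """AMAT = cache access time + miss ratio * miss penalty (cycles)."""
    return hit_time + miss_ratio * miss_penalty

# The slide's numbers: 1-cycle hit, 1% miss ratio, 20-cycle penalty.
print(mean_access_time(1, 0.01, 20))   # 1.2 cycles
```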
Lecture 13 Slide 14 EECS 470
Fundamental Cache Parameters
Parameters that affect the miss rate:
Cache size (C)
Block size (b)
Cache associativity (a)
Lecture 13 Slide 15 EECS 470
Cache Size
Cache size is the total data capacity (not including tags)
❒ bigger caches can exploit temporal locality better
❒ but bigger is not ALWAYS better
Too large a cache:
❒ smaller is faster => bigger is slower
❒ access time may degrade the critical path
Too small a cache:
❒ doesn't exploit temporal locality well
❒ useful data is constantly replaced
[Plot: hit rate vs. cache size C, holding b and a constant; hit rate climbs until C reaches the "working set" size, then levels off.]
Lecture 13 Slide 16 EECS 470
Block Size
Block size is the amount of data that is
❒ associated with one address tag
❒ not necessarily the unit of transfer between hierarchy levels (sub-blocking)
Too-small blocks:
❒ don't exploit spatial locality well
❒ have inordinate tag overhead
Too-large blocks:
❒ useless data is transferred
❒ useful data is prematurely replaced -- too few total blocks
[Plot: hit rate vs. block size b, holding C and a constant.]
Lecture 13 Slide 17 EECS 470
Associativity
Fully associative: a block goes in any frame (think: all frames in 1 set)
Direct-mapped: a block goes in exactly one frame (think: 1 frame per set)
Set-associative: a block goes in any frame in exactly one set (frames grouped into sets)
Where does block 12 (0b1100) go?
[Figure: an 8-frame cache. Fully associative: any of frames 0-7. Direct-mapped: only frame 12 mod 8 = 4. 2-way set-associative (4 sets): either frame of set 12 mod 4 = 0.]
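The three placement rules differ only in how many frames make up a set. A small helper (ours, for illustration; frames are assumed numbered contiguously within each set) reproduces the block-12 example:

```python
def eligible_frames(block_addr, num_frames, associativity):
    """Return (set index, list of frames where the block may be placed)."""
    num_sets = num_frames // associativity
    set_idx = block_addr % num_sets
    first = set_idx * associativity
    return set_idx, list(range(first, first + associativity))

# Where does block 12 (0b1100) go in an 8-frame cache?
print(eligible_frames(12, 8, 8))  # fully associative: set 0, any of frames 0-7
print(eligible_frames(12, 8, 1))  # direct-mapped: set 4, frame 4 only
print(eligible_frames(12, 8, 2))  # 2-way: set 12 mod 4 = 0, frames 0 and 1
```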
Lecture 13 Slide 18 EECS 470
Impact of Associativity
Typical values for associativity:
❒ 1-, 2-, 4-, 8-way associative
Larger associativity:
❒ lower miss rate, less variation among programs
❒ only important for small "C/b"
Smaller associativity:
❒ lower cost, faster hit time
[Plot: hit rate vs. associativity a, holding C and b constant; the curve flattens beyond roughly a = 5.]
Lecture 13 Slide 19 EECS 470
Direct Mapped Caches
[Diagram: the address is split into tag | idx | b.o. (block offset). The index drives a decoder that selects one entry from the tag array and one frame from the data array; the stored tag is compared against the address tag ("= Tag match (hit?)"), and a multiplexor uses the block offset to select the requested word.]
Don't forget to check the valid/state bits.
Lecture 13 Slide 20 EECS 470
Fully Associative Cache
[Diagram: the address is split into tag | block offset only (no index). The address tag is compared against every stored tag in parallel (associative search); the matching frame's data feeds a multiplexor indexed by the block offset.]
Lecture 13 Slide 21 EECS 470
N-Way Set Associative Cache
[Diagram: the address is split into tag | idx | b.o. Each of the N ways (banks) has its own decoder, tag array, and data array; the index selects one set (one frame per way), all N stored tags are compared in parallel ("= Tag match"), and a multiplexor picks the data from the matching way.]
Cache Size = N × 2^(B+b), where B is the number of index bits and b the number of block-offset bits.
Lecture 13 Slide 22 EECS 470
Mark Hill's DM vs. SA: "Bigger & Dumber is Better"
t_avg = t_hit + miss ratio × t_miss
❒ compare DM and SA caches with the same t_miss
❒ but the associativity that minimizes t_avg is often smaller than the associativity that minimizes the miss ratio
Remember:
diff(t_cache) = t_cache(SA) - t_cache(DM) ≥ 0   (SA needs a slower clock)
diff(miss) = miss(SA) - miss(DM) ≤ 0            (DM misses more)
e.g., if diff(t_cache) = 0 => SA is better; but assuming diff(miss) = -1% and t_miss = 20,
⇒ if diff(t_cache) > 0.2 cycles then SA loses
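The break-even point falls out of plugging both designs into t_avg. A sketch with illustrative miss rates (6% for DM, 5% for SA, so diff(miss) = -1% as on the slide):

```python
def t_avg(t_hit, miss_ratio, t_miss):
    """Average access time = hit time + miss ratio * miss penalty."""
    return t_hit + miss_ratio * t_miss

T_MISS = 20   # cycles, as on the slide

dm      = t_avg(1.0, 0.06, T_MISS)   # direct-mapped: fast hit, more misses
sa_win  = t_avg(1.1, 0.05, T_MISS)   # diff(t_cache) = 0.1 < 0.2 cycles
sa_lose = t_avg(1.3, 0.05, T_MISS)   # diff(t_cache) = 0.3 > 0.2 cycles
```

SA beats DM only while its extra hit latency stays under diff(miss) × t_miss = 0.01 × 20 = 0.2 cycles, matching the slide's break-even point.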
Lecture 13 Slide 23 EECS 470
Associative Block Replacement
Which block in a set to replace on a miss?
Ideally -- Belady's algorithm: replace the block that "will" be accessed furthest in the future
❒ How do you implement it?
Approximations:
Least recently used -- LRU
❒ optimized for (assumes) temporal locality (expensive for more than 2-way)
Not most recently used -- NMRU
❒ track the MRU block, select randomly from the others; a good compromise
Random
❒ nearly as good as LRU, simpler (usually pseudo-random)
How much can the block replacement policy matter?
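LRU for a single set can be sketched with an ordered map: a hit moves the block to the MRU end, and eviction pops the LRU end. This is a simple software model (ours), not how hardware tracks recency:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with `ways` frames and true-LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data, LRU first, MRU last

    def access(self, tag, fetch):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # hit: mark most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # miss: evict least recently used
        self.blocks[tag] = fetch(tag)          # fill from the next level
        return "miss"
```

For a 2-way set and the access sequence A, B, A, C, the miss on C evicts B rather than A, because A was touched more recently.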
Lecture 13 Slide 24 EECS 470
Example: a=2, C=1kB, b=4B, word size=2B -- Basic Solution
[Diagram: two ways, each with a data array (128 lines × 4 bytes), a tag array (128 lines × 23 bits), and valid bits (128 × 1 bit). The address is split as tag = PA[31:9] (23 bits), idx = PA[8:2] (7 bits), b.o. = PA[1]. The 7-bit index reads both ways in parallel; each way's stored tag is compared against PA[31:9], producing hit0 and hit1. 2-to-1 muxes driven by the block offset and by hit0/hit1 select the 16-bit DATA, and HIT = hit0 OR hit1.]
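The address split in this example follows from the geometry: 1 kB / (2 ways × 4 B/block) = 128 sets, so 7 index bits; a 4 B block holds two 2 B words, so 1 block-offset bit; the remaining 23 bits are the tag. A sketch of the decomposition (function name ours):

```python
def decompose(pa):
    """Split a 32-bit physical address per the slide: tag | idx | b.o."""
    bo  = (pa >> 1) & 0x1     # PA[1]: selects one of two 2-byte words
    idx = (pa >> 2) & 0x7F    # PA[8:2]: 7-bit set index (128 sets)
    tag = pa >> 9             # PA[31:9]: 23-bit tag
    return tag, idx, bo

print(decompose(0x1FC))   # (0, 127, 0): tag 0, last set, first word
```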
Lecture 13 Slide 25 EECS 470
Write Policies
Writes are more interesting:
❒ on reads, data can be accessed in parallel with the tag compare
❒ on writes, two steps are needed
❒ is turn-around time important for writes? A common cache optimization defers writes in favor of reads
Choices of write policies:
❒ On write hits, update memory?
❍ Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth)
❍ No: write-back
❒ On write misses, allocate a cache block frame?
❍ Yes: write-allocate
❍ No: no-write-allocate
Lecture 13 Slide 26 EECS 470
Write Policies (Cont.)
Write-through:
❒ update memory on each write
❒ keeps memory up to date
❒ traffic/reference = f_writes, e.g. 0.20, independent of cache performance (miss rate)
Write-back:
❒ update memory only on block replacement
❒ many cache lines are only read and never written to
❒ add a "dirty" bit to the status word
❍ originally cleared after replacement
❍ set when a block frame is written to
❍ only write back a dirty block; "drop" clean blocks without a memory update
❒ traffic/reference = f_dirty × miss × B
❍ e.g., traffic/reference = 1/2 × 0.05 × 4 = 0.1
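The two traffic formulas compare directly in code (helper names ours; B is the block size, 4 in the slide's running example, and units follow the slide):

```python
def write_through_traffic(f_writes):
    # every write reaches memory, independent of the miss rate
    return f_writes

def write_back_traffic(f_dirty, miss_ratio, block_size):
    # only dirty victim blocks are written back, a whole block at a time
    return f_dirty * miss_ratio * block_size

# Slide's numbers: write-through moves 0.20 per reference;
# write-back moves 1/2 * 0.05 * 4 = 0.1 per reference.
print(write_through_traffic(0.20))
print(write_back_traffic(0.5, 0.05, 4))
```

Note that write-back traffic shrinks as the miss rate improves, while write-through traffic does not.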
Lecture 13 Slide 27 EECS 470
Store Buffers
Buffer CPU writes:
❒ allows reads to proceed
❒ stall only when the buffer is full
❒ data dependence?
❍ What happens on dependent loads/stores?
[Diagram: a store buffer sits between the CPU and the cache ($).]
Lecture 13 Slide 28 EECS 470
Writeback Buffers
Between a write-back cache and the next level:
1. Move replaced, dirty blocks to the buffer
2. Read the new line
3. Move the replaced data to memory
Usually only 1 or 2 write-back buffer entries are needed.
[Diagram: the writeback buffer sits between the cache ($) and the next level ($$/Memory).]
Lecture 13 Slide 29 EECS 470
"Harvard" vs. "Princeton"
Unified (sometimes known as Princeton):
❒ less costly, dynamic response, handles writes to instructions
Split I and D (sometimes known as Harvard):
❒ most of the time code and data don't mix
❒ 2x bandwidth; place each close to the I/D ports
❒ can customize sizes (I-footprint is generally smaller than D-footprint); no interference between I and D
❒ self-modifying code can cause "coherence" problems
Caches should be split for frequent simultaneous I & D access
❒ no longer a question in "high-performance" on-chip L1 caches