EECS 470 Lecture 15: Basic Caches
Winter 2022
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470

© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.




Readings

For Today:
❒ H&P 2.1

For Thursday:
❒ H&P 2.2, 2.3, B.3
❒ N. Jouppi. Improving direct-mapped cache performance…


Announcements

Midterm grades released.

If you are more than 2 Std. Dev. from the mean, please email me to set up a time to chat.

Look for HW4 to be released tomorrow sometime.


Staff Midterm Outcome

Lots of small suggestions; here is a list of actionable ones we will try to address:

1) Fix the website/calendar
2) More GSIs (less actionable this semester)
3) Grades back sooner
4) Office hours queues long


Wide Fetch - Non-sequential

Two related questions
❒ How many branches predicted per cycle?
❒ Can we fetch from multiple taken branches per cycle?

Simplest, most common organization: "1" and "No"
❒ One prediction, discard post-branch insns if prediction is "Taken"
  – Lowers effective fetch width and IPC
❒ Average number of instructions per taken branch?
  ❒ Assume: 20% branches, 50% taken → ~10 instructions
  ❒ Consider a 10-instruction loop body with an 8-issue processor
  ❒ Without smarter fetch, ILP is limited to 5 (not 8)

Compiler can help
❒ Unroll loops, reduce taken branch frequency
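
The arithmetic behind the 10-instruction example can be sketched as follows. This is a rough model rather than anything from the slides: it assumes the fetch unit delivers at most `width` instructions per cycle and that a fetch group always ends at a taken branch. The function names are illustrative.

```python
import math

def insns_per_taken_branch(branch_frac, taken_frac):
    """Average run length between taken branches."""
    return 1.0 / (branch_frac * taken_frac)

def effective_fetch_rate(run_length, width):
    """Instructions fetched per cycle when each fetch group stops at a taken branch.

    A run of `run_length` instructions needs ceil(run_length / width) cycles,
    because the last (partial) group cannot be filled past the taken branch.
    """
    cycles = math.ceil(run_length / width)
    return run_length / cycles

run = insns_per_taken_branch(0.20, 0.50)   # 20% branches, 50% of them taken
rate = effective_fetch_rate(10, 8)         # 10-instruction loop body, 8-wide fetch
print(run, rate)
```

Under these assumptions the 10-instruction loop body takes two fetch cycles on the 8-wide machine, which is where the limit of 5 instructions per cycle comes from.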


Multiple Branch Predictions

Issues with multiple branch predictions:
❒ Latency resulting from sequential predictions
❒ Later predictions based on stale/speculative history
❒ Don't forget, 0.95 x 0.95 x 0.95 ≈ 0.86

[Figure: the fetch address and three BTB lookups, one each for Block 1, Block 2, and Block 3]
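
The compounding of prediction accuracy can be checked with a short calculation. This sketch assumes the predictions are independent, which the slide does not state, and the function name is illustrative.

```python
def all_correct_probability(accuracy, n_predictions):
    """Probability that n independent predictions are all correct."""
    return accuracy ** n_predictions

p3 = all_correct_probability(0.95, 3)
print(round(p3, 3))
```

With three 95%-accurate predictions per cycle, roughly one fetch cycle in seven contains at least one misprediction under this assumption.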


Examples of Multi-Branch Predictors

[Figure: the BHR (bits bn … b0) indexes the PHT, which supplies three predictions p0, p1, p2]

How do you update this thing after a branch resolves?


Examples of Multi-Branch Predictors

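[Figure: multi-branch predictor. Labels: BHR bits bn … b0; index slices bn:2, bn-1:1, bn-2:0; selector inputs b1 b0, b0 p0, p0 p1; predictions p0, p1, p2; PHT with 2^(n-2) x 4 entries]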


Multiple Predicted Taken Branches

Issues with multiple taken branches:
❒ Long latency with multiple sequential I-cache accesses
❒ or, multi-ported I-cache with slower access latency
❒ or, multi-banked I-cache to approximate multi-port

[Figure: a multi-ported I-cache receives the Block 1, Block 2, and Block 3 fetch addresses (FA) and returns the Block 1, Block 2, and Block 3 instructions]


Instruction Alignment and Collapsing

Issues with alignment and collapsing:
❒ Misalignment between fetch group and cache line.
❒ Packing of variable-sized blocks into fetch buffer.

[Figure: I-cache Ports 1, 2, and 3 feeding the fetch buffer]


Memory Systems: Basic Caches


Memory Systems

Basic caches (start today)
❒ introduction
❒ fundamental questions
❒ cache size, block size, associativity

Advanced caches

Main memory

Virtual memory


Motivation

Want memory to appear:
❒ as fast as CPU
❒ as large as required by all of the running applications

[Figure: Processor vs. Memory performance, 1985 to 2010, on a log scale from 1 to 10000]


Memory Hierarchy

Make common case fast:
❒ common: temporal & spatial locality
❒ fast: smaller, more expensive memory

[Figure: hierarchy of Registers, Caches, Memory, and Disk (MEMS?), with directions labeled "Larger" and "Faster"]


Storage Hierarchies

Storages are layered by hierarchies in order of
❒ increasing latency (t_i): t_i < t_{i+1}
❒ increasing size (s_i) ⇒ decreasing unit cost (c_i): s_i < s_{i+1}, c_i > c_{i+1}
❒ decreasing bandwidth (b_i): b_i > b_{i+1}
❒ increasing xfer unit (x_i): x_i < x_{i+1}

Level 0: Registers
Level 1: (n levels of) Caches
Level 1.5: NVRAM?
Level 2: Main Memory (Primary Storage)
Level 2.5: Flash?
Level 3: Disks (Secondary Storage)
Level 4: Tape Backup (Tertiary Storage)

[Figure labels: "ISA feature" and "Memory Abstractions"]


Processor/Memory Boundaries

[Figure: block diagram of the Processor (I-Unit, E-Unit, Regs, I-TLB, D-TLB, L1 I-Cache, L1 D-Cache, L2 Cache (SRAM on-chip)), the L3 Cache (SRAM off-chip), and Main Memory (DRAM)]


Caches

An automatically managed hierarchy

"A hiding place, esp. of goods, treasure, etc." -- OED

Keep recently accessed block
❒ temporal locality

Break memory into blocks (several bytes) and transfer data to/from cache in blocks
❒ spatial locality

A lot of architectures opt for software-managed scratch-pad memory instead, e.g. Cray-1, embedded processors. Why?

[Figure: CPU, cache ($), and Memory]


Cache (Abstractly)

Keep recently accessed block in "block frame"
❒ state (e.g., valid)
❒ address tag
❒ data

[Figure: a block frame holds the address and state (bookkeeping overhead) plus the data; multiple bytes per block frame to amortize overhead]


Cache (Abstractly)

On memory read

if incoming address corresponds to one of the stored address tags then
  ❍ HIT
  ❍ return data
else
  ❍ MISS
  ❍ choose & displace a current block in use
  ❍ fetch new (referenced) block from memory into frame
  ❍ return data

- Where and how to look for a block? (Block placement)
- Which block is replaced on a miss? (Block replacement)
- What happens on a write? Write strategy (Later)
- What is kept? (Bookkeeping, data)
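
The read procedure above can be sketched as a toy model. This is only an illustration under stated assumptions: the cache is fully associative, the victim is simply the oldest-installed block, and the class and method names are hypothetical.

```python
class AbstractCache:
    """Fully associative toy cache: any block may occupy any frame."""

    def __init__(self, num_frames, block_size, memory):
        self.num_frames = num_frames
        self.block_size = block_size
        self.memory = memory            # backing store: a bytes-like object
        self.frames = {}                # address tag -> block data (presence == valid)

    def read(self, addr):
        """Return (byte value, hit flag) for a one-byte read at `addr`."""
        tag, offset = divmod(addr, self.block_size)
        if tag in self.frames:                      # HIT
            return self.frames[tag][offset], True
        if len(self.frames) >= self.num_frames:     # MISS: displace a block in use
            victim = next(iter(self.frames))        # oldest-installed block (assumed policy)
            del self.frames[victim]
        start = tag * self.block_size               # fetch referenced block into a frame
        self.frames[tag] = self.memory[start:start + self.block_size]
        return self.frames[tag][offset], False

memory = bytes(range(64))
cache = AbstractCache(num_frames=2, block_size=4, memory=memory)
results = [cache.read(a) for a in (5, 6, 20, 40, 5)]
print(results)
```

The second read hits because address 6 shares a block with address 5. The final read of address 5 misses because its block was displaced when the third distinct block arrived.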


Terminology

block (cache line): minimum unit that may be present

hit: block is found in the cache

miss: block is not found in the cache

miss ratio: fraction of references that miss

hit time: time to access the cache

miss penalty
❒ time to replace block in the cache + deliver to upper level
❒ access time: time to get first word
❒ transfer time: time for remaining words


Cache Performance

Assume
❒ Cache access time is equal to 1 cycle
❒ Cache miss ratio is 0.01
❒ Cache miss penalty is 20 cycles

Mean access time
= Cache access time + miss ratio * miss penalty
= 1 + 0.01 * 20 = 1.2
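
The same formula as a small function, for experimenting with other parameters. The function name is illustrative and all times are in cycles.

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    """Mean access time = cache access time + miss ratio * miss penalty."""
    return hit_time + miss_ratio * miss_penalty

amat = mean_access_time(hit_time=1, miss_ratio=0.01, miss_penalty=20)
print(amat)
```

Doubling the miss ratio to 0.02 in this model raises the mean access time to about 1.4 cycles, so even small changes in miss ratio are visible.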

Typically
❒ level-1 is 16K-64K, level-2 is 512K-4M, memory is 128M-4G
❒ level-1 as fast as the processor (increasingly 2-cycles)
❒ level-1 is 1/10000 capacity but contains 98% of references

Memoization & amortization


Fundamental Cache Parameters That Affect Miss Rate

Cache size (C)

Block size (b)

Cache associativity (a)


Cache Size

Cache size is the total data (not including tag) capacity
❒ bigger can exploit temporal locality better
❒ not ALWAYS better

Too large a cache
❒ smaller is faster => bigger is slower
❒ access time may degrade critical path

Too small a cache
❒ doesn't exploit temporal locality well
❒ useful data constantly replaced

[Figure: hit rate vs. C, holding b and a constant, with the "working set" size marked]


Block Size

Block size is the data that is
❒ associated with an address tag
❒ not necessarily the unit of transfer between hierarchies (sub-blocking)

Too small blocks
❒ don't exploit spatial locality well
❒ have inordinate tag overhead

Too large blocks
❒ useless data transferred
❒ useful data permanently replaced: too few total # blocks

[Figure: plot against b, holding C and a constant]


Associativity

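Fully-associative: block goes in any frame (think all frames in 1 set)

Direct-mapped: block goes in exactly one frame (think 1 frame per set)

Set-associative: a block goes in any frame in exactly one set (frames grouped into sets)

Where does block 12 (b'1100) go?

[Figure: three 8-frame caches. Fully associative: blocks 0-7. Direct-mapped: blocks 0-7. Set-associative: sets 0-3, each holding blocks 0 and 1]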
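
One way to work out where block 12 goes is sketched below. It assumes the usual modulo placement (set index = block number mod number of sets), 8 frames as in the figure, and frames numbered consecutively within each set. The function name is illustrative.

```python
def candidate_frames(block, num_frames, ways):
    """Return (set index, list of frame numbers) where `block` may be placed.

    ways == 1           -> direct-mapped
    ways == num_frames  -> fully associative (a single set)
    """
    num_sets = num_frames // ways
    set_index = block % num_sets
    first = set_index * ways
    return set_index, list(range(first, first + ways))

direct = candidate_frames(12, num_frames=8, ways=1)
two_way = candidate_frames(12, num_frames=8, ways=2)
fully = candidate_frames(12, num_frames=8, ways=8)
print(direct, two_way, fully)
```

Under these assumptions block 12 maps to frame 4 when direct-mapped (12 mod 8), to either frame of set 0 when 2-way set-associative (12 mod 4), and to any of the 8 frames when fully associative.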


Impact of Associativity

Typical values for associativity
❒ 1, 2-, 4-, 8-way associative

Larger associativity
❒ lower miss rate, less variation among programs
❒ only important for small "C/b"

Smaller associativity
❒ lower cost, faster hit time

[Figure: hit rate vs. a, holding C and b constant, with "~5" marked]


Direct Mapped Caches

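[Figure: direct-mapped cache datapaths. The address is split into tag, idx, and b.o. fields; a decoder selects the frame, a comparator (=) checks the stored tag to produce "tag match (hit?)", and a multiplexor selects the requested data. A second diagram labels the address fields "tag index" and "block index".]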

Don’t forget to check the valid/state bits
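
A minimal sketch of the direct-mapped lookup, including the valid-bit check. The field widths, class name, and method names are illustrative assumptions, and data storage is omitted to keep the tag logic visible.

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self, index_bits, offset_bits):
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        n = 1 << index_bits
        self.valid = [False] * n
        self.tags = [0] * n

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index, _ = split_address(addr, self.index_bits, self.offset_bits)
        hit = self.valid[index] and self.tags[index] == tag   # tag match AND valid
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

dm = DirectMappedCache(index_bits=3, offset_bits=2)
hits = [dm.access(a) for a in (0x00, 0x01, 0x20, 0x00)]
print(hits)
```

Addresses 0x00 and 0x20 share index 0 but differ in tag, so they keep evicting each other. This is the conflict behavior that associativity is meant to reduce.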


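Fully Associative Cache

[Figure: the address is split into tag and blk.offset; the tag is compared (=) against every stored Tag in an associative search, and a multiplexor selects the data]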


N-Way Set Associative Cache

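[Figure: the address is split into tag, idx, and b.o.; each way (bank) has its own decoder and tag match comparator (=), and a multiplexor selects among the ways. "a set" and "a way (bank)" are marked.]

Cache Size = N x 2^(B+b)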
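
The size formula can be evaluated directly. This sketch reads B as the number of index bits and b as the number of block-offset bits, which is an interpretation of the slide's symbols rather than something it states. The function name is illustrative.

```python
def cache_size_bytes(ways, index_bits, offset_bits):
    """Data capacity: N ways x 2^B sets x 2^b bytes per block."""
    return ways * (1 << (index_bits + offset_bits))

size = cache_size_bytes(ways=2, index_bits=7, offset_bits=2)
print(size)
```

For example, a 2-way cache with 7 index bits and 2 offset bits has 128 sets of two 4-byte blocks, giving 1 kB of data.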


Associative Block Replacement

Which block in a set to replace on a miss?

Ideally: Belady's algorithm, replace the block that "will" be accessed the furthest in the future
❒ How do you implement it?

Approximations:

Least recently used (LRU)
❒ optimized (assume) for temporal locality (expensive for more than 2-way)

Not most recently used (NMRU)
❒ track MRU, random select from others, good compromise

Random
❒ nearly as good as LRU, simpler (usually pseudo-random)

How much can block replacement policy matter?
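
The LRU and NMRU approximations can be sketched for a single set as follows. This is a software model only: real hardware tracks recency with a few state bits per set rather than an ordered dictionary, and the names `LRUSet` and `nmru_victim` are hypothetical.

```python
import random
from collections import OrderedDict

class LRUSet:
    """One cache set with true LRU replacement, tracked by access order."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()     # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, evict the LRU tag if the set is full."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)        # now most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)     # evict least recently used
        self.blocks[tag] = None
        return False

def nmru_victim(tags, mru_tag, rng=random):
    """NMRU: pick a random victim among all tags except the most recently used."""
    return rng.choice([t for t in tags if t != mru_tag])

lru = LRUSet(ways=2)
hits = [lru.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits, list(lru.blocks))
```

In the trace, the access to C evicts B (the least recently used), so the following access to B misses as well. NMRU only needs to remember one tag per set, which is why the slide calls it a good compromise.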


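Example: a=2, C=1kB, b=4B, word-size=2B

Basic Solution

[Figure: 2-way set-associative datapath.
Address fields: tag = PA[31:9] (23 bits), idx = PA[8:2] (7 bits), b.o. = PA[1]; PA[0] is shown without a label.
Arrays, each indexed by the 7-bit idx: data 0 and data 1 (128 lines x 4 bytes), tag0 and tag1 (128 lines x 23 bits), v0 and v1 (128 lines x 1 bit).
Each way's stored tag is compared (=) with the 23-bit address tag to produce hit0 and hit1. A 2-to-1 mux per way selects a 16-bit word using b.o., and a final 2-to-1 mux selects between the ways using hit0 and hit1. Outputs: HIT and DATA.]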
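
The field widths in this example can be derived mechanically. The sketch below assumes power-of-two sizes and a 32-bit physical address (taken from the PA[31:9] label). The function name is illustrative.

```python
def field_widths(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Assumes all sizes are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1      # log2(block size)
    index_bits = num_sets.bit_length() - 1          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

widths = field_widths(cache_bytes=1024, block_bytes=4, ways=2)
print(widths)
```

With C = 1 kB, b = 4 B, and a = 2 there are 128 sets, so the index is 7 bits, the block offset is 2 bits (PA[1:0]), and the remaining 23 bits form the tag, matching the array dimensions in the figure.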


Write Policies

Writes are more interesting
❒ on reads, data can be accessed in parallel with tag compare
❒ on writes, needs two steps
❒ is turn-around time important for writes? cache optimizations often defer writes for reads

Choices of Write Policies
❒ On write hits, update memory?
  ❍ Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth)
  ❍ No: write-back
❒ On write misses, allocate a cache block frame?
  ❍ Yes: write-allocate
  ❍ No: no-write-allocate


Write Policies (Cont.)

Write-through
❒ update memory on each write
❒ keeps memory up-to-date
❒ traffic/reference = f_writes, e.g. 0.20; independent of cache performance (miss rate)

Write-back
❒ update memory only on block replacement
❒ many cache lines are only read and never written to
❒ add "dirty" bit to status word
  ❍ originally cleared after replacement
  ❍ set when a block frame is written to
  ❍ only write back a dirty block, and "drop" clean blocks w/o memory update
❒ traffic/reference = f_dirty x miss x B
  ❍ e.g., traffic/reference = 1/2 x 0.05 x 4 = 0.1
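
The two traffic estimates can be compared with a short calculation. This sketch assumes B is the block size in words, so that traffic is counted in words written to memory per reference. The function names are illustrative.

```python
def write_through_traffic(f_writes):
    """Every write goes to memory, regardless of hit or miss."""
    return f_writes

def write_back_traffic(f_dirty, miss_ratio, block_words):
    """Only dirty blocks are written back, one whole block per dirty eviction."""
    return f_dirty * miss_ratio * block_words

wt = write_through_traffic(0.20)
wb = write_back_traffic(f_dirty=0.5, miss_ratio=0.05, block_words=4)
print(wt, wb)
```

With the slide's numbers, write-back generates about half the write traffic of write-through, and the gap widens as the miss ratio falls.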


Store Buffers

Buffer CPU writes
❒ allows reads to proceed
❒ stall only when full
❒ data dependence?
  ❍ What happens on dependent loads/stores?

[Figure: store buffer between the CPU and the cache ($)]
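
One common answer to the dependent-load question is to have loads search the store buffer and forward the youngest matching value. The slide only poses the question, so the sketch below is one possible design rather than the course's stated solution. The class name, the dict used as a stand-in cache, and the default value 0 for unknown addresses are all assumptions.

```python
from collections import deque

class StoreBuffer:
    """FIFO of pending (address, value) writes between the CPU and the cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def store(self, addr, value):
        """Buffer a write; return False if the buffer is full (CPU must stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((addr, value))
        return True

    def load(self, addr, cache):
        """Forward from the youngest matching store, else read the cache."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return cache.get(addr, 0)

    def drain_one(self, cache):
        """Retire the oldest buffered store into the cache."""
        if self.entries:
            a, v = self.entries.popleft()
            cache[a] = v

cache = {0x10: 1}
sb = StoreBuffer(capacity=2)
sb.store(0x10, 7)
sb.store(0x10, 9)
forwarded = sb.load(0x10, cache)     # sees the youngest buffered store
accepted = sb.store(0x20, 3)         # buffer is full, so this write must wait
sb.drain_one(cache)
print(forwarded, accepted, cache[0x10])
```

Without the search in `load`, the read would return the stale value still held in the cache.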


Writeback Buffers

Between write-back cache and next level
1. Move replaced, dirty blocks to buffer
2. Read new line
3. Move replaced data to memory

Usually only need 1 or 2 write-back buffer entries

[Figure: write-back buffer between the cache ($) and the next level ($$/Memory)]
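
The three steps can be sketched as follows. This is a simplified model: the frame is a plain dict, memory maps a block tag to its data, and the helper names are hypothetical. A real design would also have to check the buffer on later reads of a block that is still waiting there.

```python
from collections import deque

def handle_miss(frame, new_tag, memory, wb_buffer):
    """Replace the block in `frame` with block `new_tag`.

    `frame` is a dict with keys "tag", "data", "dirty"; `memory` maps tag -> data.
    """
    if frame["dirty"]:
        wb_buffer.append((frame["tag"], frame["data"]))             # 1. victim to buffer
    frame.update(tag=new_tag, data=memory[new_tag], dirty=False)    # 2. read new line
    return frame

def drain(memory, wb_buffer):
    """3. Write buffered victims to memory after the read has been serviced."""
    while wb_buffer:
        tag, data = wb_buffer.popleft()
        memory[tag] = data

memory = {1: "old-1", 2: "blk-2"}
frame = {"tag": 1, "data": "new-1", "dirty": True}
wb = deque()
handle_miss(frame, 2, memory, wb)
pending = list(wb)
stale = memory[1]
drain(memory, wb)
print(pending, stale, memory[1])
```

The point of the buffer is ordering: the read of the new line no longer waits behind the write of the victim, which stays in the buffer until `drain` runs.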


"Harvard" vs. "Princeton"

Unified (sometimes known as Princeton)
❒ less costly, dynamic response, handles writes to instructions

Split I and D (sometimes known as Harvard)
❒ most of the time code and data don't mix
❒ 2x bandwidth, place close to I/D ports
❒ can customize size (I-footprint generally smaller than D-footprint), no interference between I/D
❒ self-modifying code can cause "coherence" problems

Caches should be split for frequent simultaneous I & D access
❒ no longer a question in "high-performance" on-chip L-1 caches