TRANSCRIPT
WU UCB CS252 SP17
CS252 Spring 2017, Graduate Computer Architecture
Lecture 13: Cache Coherence Part 2, Multithreading Part 1
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Last Time in Lecture 12
• Reviewed store policies and cache read/write policies
  - Write-through vs. write-back
  - Write-allocate vs. write-no-allocate
• Shared memory multiprocessor cache coherence
  - Snoopy protocols: MSI, MESI
  - Intervention
  - False sharing
Review: Cache Coherence vs. Memory Consistency
"For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date."
"A Primer on Memory Consistency and Cache Coherence", D. J. Sorin, M. D. Hill, and D. A. Wood
Cache Coherence: Directory Protocol
© Krste Asanovic, 2015. CS252, Fall 2015, Lecture 13
Scalable Approach: Directories
§ Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  - in scalable networks, communication with directory and copies is through network transactions
§ Many alternatives for organizing directory information
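One common organization is a full-map directory: per-line state plus a presence-bit vector with one bit per processor. A minimal sketch in Python (the class and field names are illustrative, not from the slides):

```python
# Minimal sketch of per-memory-line directory information: a state
# tag plus a bit vector with one presence bit per processor.
class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "Uncached"             # Uncached, Shared, or Exclusive
        self.sharers = [False] * num_procs  # one presence bit per processor

    def add_sharer(self, proc):
        self.sharers[proc] = True

    def sharer_ids(self):
        # Recover the set of caching sites from the bit vector.
        return [p for p, present in enumerate(self.sharers) if present]

entry = DirectoryEntry(num_procs=4)
entry.state = "Shared"
entry.add_sharer(0)
entry.add_sharer(2)
print(entry.sharer_ids())  # prints [0, 2]
```

On a miss, the directory consults exactly this entry and messages only the sites whose bits are set, which is what makes the approach scale past a shared bus.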
Directory Cache Protocol
§ Assumptions: Reliable network, FIFO message delivery between any given source-destination pair
[Diagram: several CPUs, each with a private cache, connected through an interconnection network to multiple directory controllers, each fronting a DRAM bank. Each line in a cache has a state field plus tag; each line in memory has a state field plus a bit-vector directory with one bit per processor.]
Cache States
§ For each cache line, there are 4 possible states:
  - C-invalid (= Nothing): The accessed data is not resident in the cache.
  - C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
  - C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
  - C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).
Home Directory States
§ For each memory line, there are 4 possible states:
  - R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
  - W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
  - TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
  - TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
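The four cache states and four home directory states can be transcribed directly as enumerations (a minimal sketch; the Python names are mine, the values follow the slides' abbreviations):

```python
from enum import Enum

# Per-cache-line states, as named on the slide.
class CacheState(Enum):
    INVALID  = "Nothing"   # not resident in this cache
    SHARED   = "Sh"        # resident, possibly cached at other sites; memory valid
    MODIFIED = "Ex"        # exclusively resident and modified; memory is stale
    PENDING  = "Pending"   # protocol request issued, reply not yet received

# Per-memory-line home directory states, as named on the slide.
class HomeState(Enum):
    R  = "R(dir)"   # shared by the sites in dir; memory valid
    W  = "W(id)"    # exclusively cached and modified at site id
    TR = "TR(dir)"  # transient: waiting for invalidation acknowledgements
    TW = "TW(id)"   # transient: waiting for the dirty line from site id

print(CacheState.MODIFIED.value)  # prints Ex
```

The transient states (Pending, TR, TW) are what distinguish a real directory protocol from the textbook three-state diagram: they hold a line's identity while requests and replies are in flight.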
Directory Protocol Messages

Message type     | Source         | Destination    | Msg content
Read miss        | Local cache    | Home directory | P, A
  Processor P reads data at address A; send data and make P a read sharer.
Write miss       | Local cache    | Home directory | P, A
  Processor P writes data at address A; send data and make P the exclusive owner.
Invalidate       | Home directory | Remote caches  | A
  Invalidate a shared copy at address A.
Fetch            | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply | Home directory | Local cache    | Data
  Return a data value from the home memory.
Data write-back  | Remote cache   | Home directory | A, Data
  Write back a data value for address A.
Dave Patterson, CS252, Fall 1996
Example Directory Protocol
• A message sent to the directory causes two actions:
  - Update the directory
  - More messages to satisfy the request
• Block is in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  - Read miss: the requesting processor is sent data from memory and the requestor is made the only sharing node; the state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  - Read miss: the requesting processor is sent back the data from memory, and the requesting processor is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
Example Directory Protocol
• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  - Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
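The three directory-state cases above can be condensed into one request handler. The sketch below is illustrative only: the message names follow the protocol-message table earlier, but the dict-based entry and the return-a-list-of-messages shape are assumptions for the example.

```python
# Sketch of directory actions for read/write misses, keyed on the
# block's directory state (Uncached / Shared / Exclusive).
# entry: {"state": str, "sharers": set of processor ids}
# Returns the list of (message, destination) pairs the directory sends.
def handle_request(entry, req, proc):
    msgs = []
    if entry["state"] == "Uncached":
        # Memory is current; reply directly and record the requester.
        entry["state"] = "Shared" if req == "ReadMiss" else "Exclusive"
        entry["sharers"] = {proc}
        msgs.append(("DataValueReply", proc))
    elif entry["state"] == "Shared":
        if req == "ReadMiss":
            entry["sharers"].add(proc)      # memory is up-to-date
        else:  # WriteMiss: invalidate every current sharer first
            msgs += [("Invalidate", p) for p in entry["sharers"]]
            entry["state"], entry["sharers"] = "Exclusive", {proc}
        msgs.append(("DataValueReply", proc))
    elif entry["state"] == "Exclusive":
        owner = next(iter(entry["sharers"]))
        if req == "ReadMiss":
            # Owner downgrades to Shared and supplies the data.
            msgs.append(("Fetch", owner))
            entry["state"] = "Shared"
            entry["sharers"].add(proc)      # owner keeps a readable copy
        else:  # WriteMiss: ownership moves to the requester
            msgs.append(("Fetch/Invalidate", owner))
            entry["sharers"] = {proc}
        msgs.append(("DataValueReply", proc))
    return msgs

e = {"state": "Shared", "sharers": {0, 1}}
print(handle_request(e, "WriteMiss", 2))
# e is now Exclusive with sharers == {2}; Invalidate went to 0 and 1
```

Note what this sketch leaves out: the transient states (TR/TW) that a real directory needs while these messages are in flight, which is exactly the concurrency problem discussed later.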
Example
A1 and A2 map to the same cache block. The operations, in order:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2
The table tracking P1 state, P2 state, bus actions, directory state, and memory value is filled in step by step below.
Example
A1 and A2 map to the same cache block.

Step               | P1 state    | P2 state    | Bus action    | Directory        | Memory
P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
P1: Read A1        | Excl. A1 10 |             |               |                  |
P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                   | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                   |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                   | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10
P2: Write 40 to A2 |             |             | WrMs P2 A2    | A2 Excl. {P2}    | 0
                   |             |             | WrBk P2 A1 20 | A1 Unca. {}      | 20
                   |             | Excl. A2 40 | DaRp P2 A2 0  | A2 Excl. {P2}    | 0
Read miss, to uncached or shared line

[Diagram: CPU with cache, connected through the interconnection network to the directory controller and its DRAM bank. The numbered steps:]
1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data, and return load data to CPU.
Write miss, to read shared line

[Diagram: requesting CPU/cache plus multiple sharer CPUs/caches, connected through the interconnection network to the directory controller and its DRAM bank. The numbered steps:]
1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at a sharer's cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.
Concurrency Management
§ Protocol would be easy to design if only one transaction were in flight across the entire system
§ But, want greater throughput and don't want to have to coordinate across the entire system
§ Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to the same cache line!
Multithreading: Intro to MT and SMT
Multithreading
§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
  T1: LD   x1, 0(x2)    F  D  X  M  W
  T2: ADD  x7, x1, x4      F  D  X  M  W
  T3: XORI x5, x4, 12          F  D  X  M  W
  T4: SD   0(x7), x5              F  D  X  M  W
  T1: LD   x5, 12(x1)                F  D  X  M  W

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file.
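The guarantee follows from simple arithmetic: on a 5-stage pipe, an instruction fetched at cycle t writes the register file at t+4, while the same thread's next instruction is fetched at t+N (N interleaved threads) and reads registers one stage later. A sketch of that check (stage offsets are assumptions matching the F D X M W pipe above):

```python
# Check the fixed-interleave hazard guarantee on a non-bypassed
# 5-stage pipe (F D X M W). An instruction fetched at cycle t
# writes back in W at t+4; the same thread's next instruction is
# fetched at t+N and reads registers in D at t+N+1.
def no_hazard(num_threads, writeback_stage=4, read_stage=1):
    issue_gap = num_threads  # cycles between same-thread fetches
    return issue_gap + read_stage > writeback_stage

for n in (2, 3, 4):
    print(n, no_hazard(n))
# With 4 interleaved threads the register read at t+5 strictly
# follows the write-back at t+4, so no bypassing is needed.
```

This is why the slide interleaves exactly 4 threads on the 5-stage pipe: fewer threads would reintroduce read-after-write hazards through the register file.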
CDC 6600 Peripheral Processors (Cray, 1964)
§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs

[Diagram: 5-stage pipeline with replicated PCs and GPR files, one per thread; a 2-bit thread select chooses which PC feeds the I$ and which GPR file is read/written at each stage.]
Multithreading Costs
§ Each thread requires its own user state
  - PC
  - GPRs
§ Also, needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
§ Other overheads:
  - Additional cache/TLB conflicts from competing threads
  - (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)
Thread Scheduling Policies
§ Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme
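The fixed-interleave policy can be sketched as a round-robin scan that inserts a bubble whenever the slot's owning thread is not ready (an illustration of the policy, not any real machine's scheduler; the data shapes are assumptions):

```python
# Fixed interleave: thread (cycle mod N) owns each slot; if that
# thread is not ready, the slot becomes a pipeline bubble (None).
# ready_by_cycle: per-cycle set of thread ids that are ready to go.
def fixed_interleave(ready_by_cycle, num_threads):
    schedule = []
    for cycle, ready in enumerate(ready_by_cycle):
        t = cycle % num_threads                 # slot owner this cycle
        schedule.append(t if t in ready else None)  # None = bubble
    return schedule

# 3 threads; thread 1 is stalled (e.g., on memory) in cycles 1 and 4.
print(fixed_interleave(
    [{0, 1, 2}, {0, 2}, {0, 1, 2}, {0, 1, 2}, {0, 2}, {0, 1, 2}],
    num_threads=3))
# prints [0, None, 2, 0, None, 2]
```

The hardware-controlled HEP-style policy differs exactly at the `None` slots: instead of a bubble, it would pick any other ready thread for that cycle.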
Issue Slots: Vertical vs. Horizontal Waste
"Simultaneous Multithreading: Maximizing On-Chip Parallelism", D. M. Tullsen, S. J. Eggers, and H. M. Levy, University of Washington, ISCA 1995
Simultaneous Multithreading (SMT) for OoO Superscalars
§ Techniques presented so far have all been "vertical" multithreading where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. Gives better utilization of machine resources.
For most apps, most execution units lie idle in an OoO superscalar
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995. For an 8-way superscalar.
Superscalar Machine Efficiency

[Diagram: issue width vs. time grid of instruction issue. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.]
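The two kinds of waste can be counted directly from an issue trace (instructions issued per cycle on a machine of a given width); a small sketch with made-up numbers:

```python
# Count vertical waste (slots lost to completely idle cycles) and
# horizontal waste (unused slots in partially filled cycles) for a
# machine that can issue `width` instructions per cycle.
def issue_waste(issued_per_cycle, width=4):
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    return vertical, horizontal

# 6 cycles on a 4-wide machine: two idle cycles, three partial ones.
print(issue_waste([3, 0, 4, 1, 0, 2], width=4))  # prints (8, 6)
```

This framing makes the next slides precise: cycle-by-cycle vertical multithreading drives the first number toward zero, CMP attacks the second, and SMT attacks both.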
Vertical Multithreading
Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste.

[Diagram: issue width vs. time grid; a second thread is interleaved cycle-by-cycle, so no cycle is completely idle, but partially filled cycles, i.e., IPC < 4 (horizontal waste), remain.]
Chip Multiprocessing (CMP)
§ What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on peak throughput of each thread.

[Diagram: two narrower issue width vs. time grids, one per processor.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
§ Interleave multiple threads to multiple issue slots with no restrictions

[Diagram: issue width vs. time grid with every slot filled by an instruction from some thread.]
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

[Diagram: two issue width vs. time grids contrasting the two regimes.]
Multithreaded Design Discussion
§ Want to build a multithreaded processor; how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file
Summary: Multithreaded Categories

[Diagram: time (processor cycle) vs. issue slots for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots marked as Thread 1-5 or idle.]
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)