TRANSCRIPT
WU UCB CS252 SP17
CS252 Spring 2017, Graduate Computer Architecture
Lecture 13: Cache Coherence Part 2, Multithreading Part 1
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
Last Time in Lecture 12
• Reviewed store policies and cache read/write policies
  - Write-through vs. write-back
  - Write-allocate vs. write-no-allocate
• Shared memory multiprocessor cache coherence
  - Snoopy protocols: MSI, MESI
  - Intervention
  - False sharing
Review: Cache Coherence vs. Memory Consistency
"For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date."
"A Primer on Memory Consistency and Cache Coherence", D. J. Sorin, M. D. Hill, and D. A. Wood
Cache Coherence: Directory Protocol
© Krste Asanovic, 2015. CS252, Fall 2015, Lecture 13
Scalable Approach: Directories
§ Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  - in scalable networks, communication with directory and copies is through network transactions
§ Many alternatives for organizing directory information
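One common organization is a full-map directory: per-line state plus a presence-bit vector with one bit per processor. A minimal sketch in Python (the class and field names are illustrative, not from the slides):

```python
# Minimal sketch of per-memory-line directory information: a state
# tag plus a bit vector with one presence bit per processor.
class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "Uncached"             # Uncached, Shared, or Exclusive
        self.sharers = [False] * num_procs  # one presence bit per processor

    def add_sharer(self, proc):
        self.sharers[proc] = True

    def sharer_ids(self):
        # Recover the set of caching sites from the bit vector.
        return [p for p, present in enumerate(self.sharers) if present]

entry = DirectoryEntry(num_procs=4)
entry.state = "Shared"
entry.add_sharer(0)
entry.add_sharer(2)
print(entry.sharer_ids())  # prints [0, 2]
```

On a miss, the directory consults exactly this entry and messages only the sites whose bits are set, which is what makes the approach scale past a shared bus.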
Directory Cache Protocol
§ Assumptions: Reliable network, FIFO message delivery between any given source-destination pair
[Diagram: several CPUs, each with a private cache, connected through an interconnection network to multiple directory controllers, each fronting a DRAM bank. Each line in a cache has a state field plus tag; each line in memory has a state field plus a bit-vector directory with one bit per processor.]
Cache States
§ For each cache line, there are 4 possible states:
  - C-invalid (= Nothing): The accessed data is not resident in the cache.
  - C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
  - C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
  - C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).
Home Directory States
§ For each memory line, there are 4 possible states:
  - R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
  - W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
  - TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
  - TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
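The four cache states and four home directory states can be transcribed directly as enumerations (a minimal sketch; the Python names are mine, the values follow the slides' abbreviations):

```python
from enum import Enum

# Per-cache-line states, as named on the slide.
class CacheState(Enum):
    INVALID  = "Nothing"   # not resident in this cache
    SHARED   = "Sh"        # resident, possibly cached at other sites; memory valid
    MODIFIED = "Ex"        # exclusively resident and modified; memory is stale
    PENDING  = "Pending"   # protocol request issued, reply not yet received

# Per-memory-line home directory states, as named on the slide.
class HomeState(Enum):
    R  = "R(dir)"   # shared by the sites in dir; memory valid
    W  = "W(id)"    # exclusively cached and modified at site id
    TR = "TR(dir)"  # transient: waiting for invalidation acknowledgements
    TW = "TW(id)"   # transient: waiting for the dirty line from site id

print(CacheState.MODIFIED.value)  # prints Ex
```

The transient states (Pending, TR, TW) are what distinguish a real directory protocol from the textbook three-state diagram: they hold a line's identity while requests and replies are in flight.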
Directory Protocol Messages

Message type     | Source         | Destination    | Msg content
Read miss        | Local cache    | Home directory | P, A
  Processor P reads data at address A; send data and make P a read sharer.
Write miss       | Local cache    | Home directory | P, A
  Processor P writes data at address A; send data and make P the exclusive owner.
Invalidate       | Home directory | Remote caches  | A
  Invalidate a shared copy at address A.
Fetch            | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply | Home directory | Local cache    | Data
  Return a data value from the home memory.
Data write-back  | Remote cache   | Home directory | A, Data
  Write back a data value for address A.
Dave Patterson, CS252, Fall 1996
Example Directory Protocol
• A message sent to the directory causes two actions:
  - Update the directory
  - More messages to satisfy the request
• Block is in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  - Read miss: the requesting processor is sent data from memory and the requestor is made the only sharing node; the state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  - Read miss: the requesting processor is sent back the data from memory, and the requesting processor is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
Example Directory Protocol
• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  - Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
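The three directory-state cases above can be condensed into one request handler. The sketch below is illustrative only: the message names follow the protocol-message table earlier, but the dict-based entry and the return-a-list-of-messages shape are assumptions for the example.

```python
# Sketch of directory actions for read/write misses, keyed on the
# block's directory state (Uncached / Shared / Exclusive).
# entry: {"state": str, "sharers": set of processor ids}
# Returns the list of (message, destination) pairs the directory sends.
def handle_request(entry, req, proc):
    msgs = []
    if entry["state"] == "Uncached":
        # Memory is current; reply directly and record the requester.
        entry["state"] = "Shared" if req == "ReadMiss" else "Exclusive"
        entry["sharers"] = {proc}
        msgs.append(("DataValueReply", proc))
    elif entry["state"] == "Shared":
        if req == "ReadMiss":
            entry["sharers"].add(proc)      # memory is up-to-date
        else:  # WriteMiss: invalidate every current sharer first
            msgs += [("Invalidate", p) for p in entry["sharers"]]
            entry["state"], entry["sharers"] = "Exclusive", {proc}
        msgs.append(("DataValueReply", proc))
    elif entry["state"] == "Exclusive":
        owner = next(iter(entry["sharers"]))
        if req == "ReadMiss":
            # Owner downgrades to Shared and supplies the data.
            msgs.append(("Fetch", owner))
            entry["state"] = "Shared"
            entry["sharers"].add(proc)      # owner keeps a readable copy
        else:  # WriteMiss: ownership moves to the requester
            msgs.append(("Fetch/Invalidate", owner))
            entry["sharers"] = {proc}
        msgs.append(("DataValueReply", proc))
    return msgs

e = {"state": "Shared", "sharers": {0, 1}}
print(handle_request(e, "WriteMiss", 2))
# e is now Exclusive with sharers == {2}; Invalidate went to 0 and 1
```

Note what this sketch leaves out: the transient states (TR/TW) that a real directory needs while these messages are in flight, which is exactly the concurrency problem discussed later.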
Example
A1 and A2 map to the same cache block. The operations, in order:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2
The table tracking P1 state, P2 state, bus actions, directory state, and memory value is filled in step by step below.
Example
A1 and A2 map to the same cache block.

Step               | P1 state    | P2 state    | Bus action    | Directory        | Memory
P1: Write 10 to A1 |             |             | WrMs P1 A1    | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0  |                  |
P1: Read A1        | Excl. A1 10 |             |               |                  |
P2: Read A1        |             | Shar. A1    | RdMs P2 A1    |                  |
                   | Shar. A1 10 |             | Ftch P1 A1 10 |                  | 10
                   |             | Shar. A1 10 | DaRp P2 A1 10 | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1    |                  | 10
                   | Inv.        |             | Inval. P1 A1  | A1 Excl. {P2}    | 10
P2: Write 40 to A2 |             |             | WrMs P2 A2    | A2 Excl. {P2}    | 0
                   |             |             | WrBk P2 A1 20 | A1 Unca. {}      | 20
                   |             | Excl. A2 40 | DaRp P2 A2 0  | A2 Excl. {P2}    | 0
Read miss, to uncached or shared line

[Diagram: CPU with cache, connected through the interconnection network to the directory controller and its DRAM bank. The numbered steps:]
1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data, and return load data to CPU.
Write miss, to read shared line

[Diagram: requesting CPU/cache plus multiple sharer CPUs/caches, connected through the interconnection network to the directory controller and its DRAM bank. The numbered steps:]
1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at a sharer's cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.
Concurrency Management
§ Protocol would be easy to design if only one transaction were in flight across the entire system
§ But, want greater throughput and don't want to have to coordinate across the entire system
§ Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to the same cache line!
Multithreading: Intro to MT and SMT
Multithreading
§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
  T1: LD   x1, 0(x2)    F  D  X  M  W
  T2: ADD  x7, x1, x4      F  D  X  M  W
  T3: XORI x5, x4, 12          F  D  X  M  W
  T4: SD   0(x7), x5              F  D  X  M  W
  T1: LD   x5, 12(x1)                F  D  X  M  W

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file.
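The guarantee follows from simple arithmetic: on a 5-stage pipe, an instruction fetched at cycle t writes the register file at t+4, while the same thread's next instruction is fetched at t+N (N interleaved threads) and reads registers one stage later. A sketch of that check (stage offsets are assumptions matching the F D X M W pipe above):

```python
# Check the fixed-interleave hazard guarantee on a non-bypassed
# 5-stage pipe (F D X M W). An instruction fetched at cycle t
# writes back in W at t+4; the same thread's next instruction is
# fetched at t+N and reads registers in D at t+N+1.
def no_hazard(num_threads, writeback_stage=4, read_stage=1):
    issue_gap = num_threads  # cycles between same-thread fetches
    return issue_gap + read_stage > writeback_stage

for n in (2, 3, 4):
    print(n, no_hazard(n))
# With 4 interleaved threads the register read at t+5 strictly
# follows the write-back at t+4, so no bypassing is needed.
```

This is why the slide interleaves exactly 4 threads on the 5-stage pipe: fewer threads would reintroduce read-after-write hazards through the register file.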
CDC 6600 Peripheral Processors (Cray, 1964)
§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state
Simple Multithreaded Pipeline
§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs

[Diagram: 5-stage pipeline with replicated PCs and GPR files, one per thread; a 2-bit thread select chooses which PC feeds the I$ and which GPR file is read/written at each stage.]
Multithreading Costs
§ Each thread requires its own user state
  - PC
  - GPRs
§ Also, needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
§ Other overheads:
  - Additional cache/TLB conflicts from competing threads
  - (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)
Thread Scheduling Policies
§ Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme
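The fixed-interleave policy can be sketched as a round-robin scan that inserts a bubble whenever the slot's owning thread is not ready (an illustration of the policy, not any real machine's scheduler; the data shapes are assumptions):

```python
# Fixed interleave: thread (cycle mod N) owns each slot; if that
# thread is not ready, the slot becomes a pipeline bubble (None).
# ready_by_cycle: per-cycle set of thread ids that are ready to go.
def fixed_interleave(ready_by_cycle, num_threads):
    schedule = []
    for cycle, ready in enumerate(ready_by_cycle):
        t = cycle % num_threads                 # slot owner this cycle
        schedule.append(t if t in ready else None)  # None = bubble
    return schedule

# 3 threads; thread 1 is stalled (e.g., on memory) in cycles 1 and 4.
print(fixed_interleave(
    [{0, 1, 2}, {0, 2}, {0, 1, 2}, {0, 1, 2}, {0, 2}, {0, 1, 2}],
    num_threads=3))
# prints [0, None, 2, 0, None, 2]
```

The hardware-controlled HEP-style policy differs exactly at the `None` slots: instead of a bubble, it would pick any other ready thread for that cycle.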
Issue Slots: Vertical vs. Horizontal Waste
"Simultaneous Multithreading: Maximizing On-Chip Parallelism", D. M. Tullsen, S. J. Eggers, and H. M. Levy, University of Washington, ISCA 1995
Simultaneous Multithreading (SMT) for OoO Superscalars
§ Techniques presented so far have all been "vertical" multithreading where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. Gives better utilization of machine resources.
For most apps, most execution units lie idle in an OoO superscalar
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995. For an 8-way superscalar.
Superscalar Machine Efficiency

[Diagram: issue width vs. time grid of instruction issue. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.]
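The two kinds of waste can be counted directly from an issue trace (instructions issued per cycle on a machine of a given width); a small sketch with made-up numbers:

```python
# Count vertical waste (slots lost to completely idle cycles) and
# horizontal waste (unused slots in partially filled cycles) for a
# machine that can issue `width` instructions per cycle.
def issue_waste(issued_per_cycle, width=4):
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    return vertical, horizontal

# 6 cycles on a 4-wide machine: two idle cycles, three partial ones.
print(issue_waste([3, 0, 4, 1, 0, 2], width=4))  # prints (8, 6)
```

This framing makes the next slides precise: cycle-by-cycle vertical multithreading drives the first number toward zero, CMP attacks the second, and SMT attacks both.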
Vertical Multithreading
Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste.

[Diagram: issue width vs. time grid; a second thread is interleaved cycle-by-cycle, so no cycle is completely idle, but partially filled cycles, i.e., IPC < 4 (horizontal waste), remain.]
Chip Multiprocessing (CMP)
§ What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on peak throughput of each thread.

[Diagram: two narrower issue width vs. time grids, one per processor.]
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
§ Interleave multiple threads to multiple issue slots with no restrictions

[Diagram: issue width vs. time grid with every slot filled by an instruction from some thread.]
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

[Diagram: two issue width vs. time grids contrasting the two regimes.]
Multithreaded Design Discussion
§ Want to build a multithreaded processor; how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file
Summary: Multithreaded Categories

[Diagram: time (processor cycle) vs. issue slots for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots marked as Thread 1-5 or idle.]
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)