cs 152 computer architecture and engineering cs252 ... · 14 oracle/sun niagara processors §...

42
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 14 – Mul=threading Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152

Upload: others

Post on 11-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CS152ComputerArchitectureandEngineeringCS252GraduateComputerArchitecture

Lecture14–Mul=threading

KrsteAsanovicElectricalEngineeringandComputerSciences

UniversityofCaliforniaatBerkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152

Page 2: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

LastTimeLecture13:VLIW

§  InaclassicVLIW,compilerisresponsibleforavoidingallhazards->simplehardware,complexcompiler.LaterVLIWsaddedmoredynamichardwareinterlocks

§ UseloopunrollingandsoIwarepipeliningforloops,traceschedulingformoreirregularcode

§  StaJcschedulingdifficultinpresenceofunpredictablebranchesandvariablelatencymemory

2

Page 3: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Mul=threading

§ DifficulttoconJnuetoextractinstrucJon-levelparallelism(ILP)fromasinglesequenJalthreadofcontrol

§ Manyworkloadscanmakeuseofthread-levelparallelism(TLP)– TLPfrommulJprogramming(runindependentsequenJaljobs)

– TLPfrommulJthreadedapplicaJons(runonejobfasterusingparallelthreads)

§ MulJthreadingusesTLPtoimproveuJlizaJonofasingleprocessor

3

Page 4: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Mul=threading

HowcanweguaranteenodependenciesbetweeninstrucJonsinapipeline?

OnewayistointerleaveexecuJonofinstrucJonsfromdifferentprogramthreadsonsamepipeline

4

F D XMWt0 t1 t2 t3 t4 t5 t6 t7 t8

T1:LD x1,0(x2) T2:ADD x7,x1,x4 T3:XORI x5,x4,12 T4:SD 0(x7),x5 T1:LD x5,12(x1)

t9

F D XMWF D X M W

F D XMWF D XMW

Interleave4threads,T1-T4,onnon-bypassed5-stagepipe

Priorinstruc:oninathreadalwayscompleteswrite-backbeforenextinstruc:oninsamethreadreadsregisterfile

Page 5: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CDC6600PeripheralProcessors(Cray,1964)

§  FirstmulJthreadedhardware§  10“virtual”I/Oprocessors§  Fixedinterleaveonsimplepipeline§  Pipelinehas100nscycleJme§  EachvirtualprocessorexecutesoneinstrucJonevery1000ns§  Accumulator-basedinstrucJonsettoreduceprocessorstate

5

Page 6: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SimpleMul=threadedPipeline

§ Havetocarrythreadselectdownpipelinetoensurecorrectstatebitsread/wri_enateachpipestage

§ AppearstosoIware(includingOS)asmulJple,albeitslower,CPUs

6

+1

2 Thread select

PC 1 PC

1 PC 1 PC

1 I$ IR GPR1 GPR1 GPR1 GPR1

X

Y

2

D$

Page 7: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Mul=threadingCosts

§ Eachthreadrequiresitsownuserstate–  PC–  GPRs

§ Also,needsitsownsystemstate–  Virtual-memorypage-table-baseregister–  ExcepJon-handlingregisters

§ Otheroverheads:–  AddiJonalcache/TLBconflictsfromcompeJngthreads–  (oraddlargercache/TLBcapacity)–  MoreOSoverheadtoschedulemorethreads(wheredoallthesethreadscomefrom?)

7

Page 8: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

ThreadSchedulingPolicies

§  Fixedinterleave(CDC6600PPUs,1964)–  EachofNthreadsexecutesoneinstrucJoneveryNcycles–  Ifthreadnotreadytogoinitsslot,insertpipelinebubble

§  SoIware-controlledinterleave(TIASCPPUs,1971)–  OSallocatesSpipelineslotsamongstNthreads–  HardwareperformsfixedinterleaveoverSslots,execuJngwhicheverthreadisinthatslot

§ Hardware-controlledthreadscheduling(HEP,1982)–  Hardwarekeepstrackofwhichthreadsarereadytogo–  Picksnextthreadtoexecutebasedonhardwarepriorityscheme

8

Page 9: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

DenelcorHEP(BurtonSmith,1982)

FirstcommercialmachinetousehardwarethreadinginmainCPU–  120threadsperprocessor–  10MHzclockrate–  Upto8processors–  precursortoTeraMTA(MulJthreadedArchitecture)

9

Page 10: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CS252

TeraMTA(1990-)

§ Upto256processors§ Upto128acJvethreadsperprocessor§  Processorsandmemorymodulespopulateasparse3DtorusinterconnecJonfabric

§  Flat,sharedmainmemory–  Nodatacache–  Sustainsonemainmemoryaccesspercycleperprocessor

§ GaAslogicinprototype,1KW/processor@260MHz–  SecondversionCMOS,MTA-2,50W/processor–  Newerversion,XMT,fitsintoAMDOpteronsocket,runsat500MHz

–  Newestversion,XMT2,hashighermemorybandwidthandcapacity

10

Page 11: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CS252

MTAPipeline

11

A

W

C

W

M

InstFetch

Mem

oryPo

ol

RetryPool

Interconnec=onNetwork

WritePo

ol

W

Memorypipeline

IssuePool• Everycycle,oneVLIWinstrucJonfromoneacJvethreadislaunchedintopipeline

• InstrucJonpipelineis21cycleslong

• MemoryoperaJonsincur~150cyclesoflatency

AssumingasinglethreadissuesoneinstrucJonevery21cycles,andclockrateis260MHz…

Whatissingle-threadperformance?

EffecJvesingle-threadissuerateis260/21=12.4MIPS

Page 12: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Coarse-GrainMul=threading

§  TeraMTAdesignedforsupercompuJngapplicaJonswithlargedatasetsandlowlocality–  Nodatacache–  Manyparallelthreadsneededtohidelargememorylatency

§ OtherapplicaJonsaremorecachefriendly–  Fewpipelinebubblesifcachemostlyhashits–  Justaddafewthreadstohideoccasionalcachemisslatencies–  Swapthreadsoncachemisses

12

Page 13: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

MITAlewife(1990)

13

§ ModifiedSPARCchips–  registerwindowsholddifferentthreadcontexts

§ Uptofourthreadspernode§ Threadswitchonlocalcachemiss

Page 14: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

IBMPowerPCRS64-IV(2000)

§ Commercialcoarse-grainmulJthreadingCPU§ BasedonPowerPCwithquad-issuein-orderfive-stagepipeline

§  EachphysicalCPUsupportstwovirtualCPUs§ OnL2cachemiss,pipelineisflushedandexecuJonswitchestosecondthread–  shortpipelineminimizesflushpenalty(4cycles),smallcomparedtomemoryaccesslatency

–  flushpipelinetosimplifyexcepJonhandling

14

Page 15: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Oracle/SunNiagaraprocessors

§  Targetisdatacentersrunningwebserversanddatabases,withmanyconcurrentrequests

§  ProvidemulJplesimplecoreseachwithmulJplehardwarethreads,reducedenergy/operaJonthoughmuchlowersinglethreadperformance

§ Niagara-1[2004],8cores,4threads/core§ Niagara-2[2007],8cores,8threads/core§ Niagara-3[2009],16cores,8threads/core§  T4[2011],8cores,8threads/core§  T5[2012],16cores,8threads/core§ M5[2012],6cores,8threads/core§ M6[2013],12cores,8threads/core

15

Page 16: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Oracle/SunNiagara-3,“RainbowFalls”2009

16

Page 17: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

OracleM6-2013

17

Page 18: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

OracleM6-2013

18

Page 19: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

OracleM6-2013

19

Page 20: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CS152Administrivia

§  PS3duetoday§  PS4outWednesdayMarch18,dueFridayApril3§  Lab3dueMondayApril6

20

Page 21: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

CS252

CS252Administrivia

§ Readingtoday3:30onZoom,linkonpiazza

21

Page 22: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SimultaneousMul=threading(SMT)forOoOSuperscalars

§  Techniquespresentedsofarhaveallbeen“verJcal”mulJthreadingwhereeachpipelinestageworksononethreadataJme

§  SMTusesfine-graincontrolalreadypresentinsideanOoOsuperscalartoallowinstrucJonsfrommulJplethreadstoenterexecuJononsameclockcycle.Givesbe_eruJlizaJonofmachineresources.

22

Page 23: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Formostapps,mostexecu=onunitslieidleinanOoOsuperscalar

23

From:Tullsen,Eggers,andLevy,“SimultaneousMulJthreading:MaximizingOn-chipParallelism”,ISCA1995.

Foran8-waysuperscalar.

Page 24: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SuperscalarMachineEfficiency

24

Issuewidth

Time

Completelyidlecycle(ver:calwaste)

Instruc:onissue

Par:allyfilledcycle,i.e.,IPC<4(horizontalwaste)

Page 25: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Ver=calMul=threading

25

§ Cycle-by-cycleinterleavingremovesverJcalwaste,butleavessomehorizontalwaste

Issuewidth

Time

Secondthreadinterleavedcycle-by-cycle

Instruc:onissue

Par:allyfilledcycle,i.e.,IPC<4(horizontalwaste)

Page 26: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

ChipMul=processing(CMP)

26

§ WhatistheeffectofsplixngintomulJpleprocessors?–  reduceshorizontalwaste,–  leavessomeverJcalwaste,and–  putsupperlimitonpeakthroughputofeachthread.

Issuewidth

Time

Page 27: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

IdealSuperscalarMul=threading[Tullsen,Eggers,Levy,UW,1995]

27

§  InterleavemulJplethreadstomulJpleissueslotswithnorestricJons

Issuewidth

Time

Page 28: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

O-o-OSimultaneousMul=threading[Tullsen,Eggers,Emer,Levy,Stamm,Lo,DEC/UW,1996]

§ AddmulJplecontextsandfetchenginesandallowinstrucJonsfetchedfromdifferentthreadstoissuesimultaneously

§ UJlizewideout-of-ordersuperscalarprocessorissuequeuetofindinstrucJonstoissuefrommulJplethreads

§ OOOinstrucJonwindowalreadyhasmostofthecircuitryrequiredtoschedulefrommulJplethreads

§ AnysinglethreadcanuJlizewholemachine

28

Page 29: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SMTadapta=ontoparallelismtype

29

Forregionswithhighthread-levelparallelism(TLP)enJremachinewidthissharedbyallthreads

Issuewidth

Time

Issuewidth

Time

Forregionswithlowthread-levelparallelism(TLP)enJremachinewidthisavailableforinstrucJon-levelparallelism(ILP)

Page 30: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Pen=um-4Hyperthreading(2002)

§  FirstcommercialSMTdesign(2-waySMT)§  Logicalprocessorssharenearlyallresourcesofthephysicalprocessor

–  Caches,execuJonunits,branchpredictors§  Dieareaoverheadofhyperthreading~5%§  Whenonelogicalprocessorisstalled,theothercanmakeprogress

–  NologicalprocessorcanuseallentriesinqueueswhentwothreadsareacJve

§  ProcessorrunningonlyoneacJvesoIwarethreadrunsatapproximatelysamespeedwithorwithouthyperthreading

§  HyperthreadingdroppedonOoOP6basedfollowonstoPenJum-4(PenJum-M,CoreDuo,Core2Duo),unJlrevivedwithNehalemgeneraJonmachinesin2008.

§  IntelAtom(in-orderx86core)hastwo-wayverJcalmulJthreading–  Hyperthreading==(SMTforIntelOoO&VerJcalforIntelInO)

30

Page 31: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

IBMPower4

31

Single-threadedpredecessortoPower5.8execuJonunitsinout-of-orderengine,eachmayissueaninstrucJoneachcycle.

Page 32: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

32

Power 4

Power 5

2 fetch (PC), 2 initial decodes

2 commits (architected register sets)

Page 33: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Power5dataflow...

33

Whyonly2threads?With4,oneofthesharedresources(physicalregisters,cache,memorybandwidth)wouldbepronetobo_leneck

Page 34: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Ini=alPerformanceofSMT

§  PenJum-4ExtremeSMTyields1.01speedupforSPECint_ratebenchmarkand1.07forSPECfp_rate–  PenJum-4isdual-threadedSMT–  SPECRaterequiresthateachSPECbenchmarkberunagainstavendor-selectednumberofcopiesofthesamebenchmark

§ RunningonPenJum-4eachof26SPECbenchmarkspairedwitheveryother(262runs)speed-upsfrom0.90to1.58;averagewas1.20

§  Power5,8-processorserver1.23fasterforSPECint_ratewithSMT,1.16fasterforSPECfp_rate

§  Power5running2copiesofeachappspeedupbetween0.89and1.41–  Mostgainedsome–  Fl.Pt.appshadmostcacheconflictsandleastgains

34

Page 35: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SMTPerformance:Applica=onInterac=on

35Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”

Solongastheyaren’tbangingon

theL2too.

Notaffectedbyotherprograms

YourfavoritebenchmarkfromLab2

Page 36: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SMTPerformance:Applica=onInterac=on

36Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”

Doesn’tplaynice

YourfavoritebenchmarkfromLab2

Page 37: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

37Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”

SMTPerformance:Applica=onInterac=on

VerysensiJvetosecondprogram

Page 38: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

IcountChoosingPolicy

38

Whydoesthisenhancethroughput?

FetchfromthreadwiththeleastinstrucJonsinflight.

Page 39: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Summary:Mul=threadedCategories

39

Time(processorcycle) Superscalar Fine-Grained Coarse-Grained Mul=processing

SimultaneousMul=threading

Thread1Thread2

Thread3Thread4

Thread5Idleslot

Page 40: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Mul=threadedDesignDiscussion

40

§ WanttobuildamulJthreadedprocessor,howshouldeachcomponentbechangedandwhatarethetradeoffs?§ L1caches(instrucJonanddata)§ L2caches§ Branchpredictor§ TLB§ Physicalregisterfile

Page 41: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

SMT&Security

41

§ Mosthardwarea_acksrelyonsharedhardwareresourcestoestablishaside-channel– Eg.Sharedoutercaches,DRAMrowbuffers

§  SMTgivesa_ackershigh-BWaccesstopreviouslyprivatehardwareresourcesthataresharedbyco-residentthreads:

§  TLBs:TLBleed(June,‘18)§  L1caches:CacheBleed(2016)§  FuncJonalunitports:PortSmash(Nov,’18)OpenBSD6.4àDisabledHTinBIOS,AMDSMTtofollow

Page 42: CS 152 Computer Architecture and Engineering CS252 ... · 14 Oracle/Sun Niagara processors § Target is datacenters running web servers and databases, with many concurrent requests

Acknowledgements

§  ThiscourseispartlyinspiredbypreviousMIT6.823andBerkeleyCS252computerarchitecturecoursescreatedbymycollaboratorsandcolleagues:–  Arvind(MIT)–  Sco_Beamer(UCB)–  JoelEmer(Intel/MIT)–  JamesHoe(CMU)–  JohnKubiatowicz(UCB)–  DavidPa_erson(UCB)

42