
CS152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture

Lecture 14 – Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last Time, Lecture 13: VLIW

§ In a classic VLIW, compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks

§ Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code

§ Static scheduling difficult in presence of unpredictable branches and variable-latency memory

2

Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control

§ Many workloads can make use of thread-level parallelism (TLP)
– TLP from multiprogramming (run independent sequential jobs)
– TLP from multithreaded applications (run one job faster using parallel threads)

§ Multithreading uses TLP to improve utilization of a single processor

3

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline

4

[Figure: pipeline timing diagram over cycles t0–t9, one thread fetching per cycle through stages F D X M W:
T1: LD x1, 0(x2)
T2: ADD x7, x1, x4
T3: XORI x5, x4, 12
T4: SD 0(x7), x5
T1: LD x5, 12(x1)]

Interleave 4 threads, T1–T4, on non-bypassed 5-stage pipe

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file
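The fixed interleave above can be sketched in a few lines of Python (a hypothetical model, not from the slides): with 4 threads rotated round-robin through the 5-stage F/D/X/M/W pipe, an instruction's write-back always happens before the same thread's next instruction reads registers in decode, so no bypassing or interlocks are needed.

```python
STAGES = ["F", "D", "X", "M", "W"]  # classic 5-stage pipeline
N_THREADS = 4                       # fixed interleave, T1-T4

def issue_schedule(instrs_per_thread):
    """Map (thread, instruction) -> {stage: cycle} under fixed
    round-robin interleave: exactly one thread fetches each cycle."""
    timing = {}
    for i in range(instrs_per_thread):
        for t in range(N_THREADS):
            fetch_cycle = i * N_THREADS + t
            timing[(t, i)] = {s: fetch_cycle + k
                              for k, s in enumerate(STAGES)}
    return timing

timing = issue_schedule(2)
# Each thread's instruction writes back (W) before that thread's
# next instruction reads the register file in decode (D):
for t in range(N_THREADS):
    assert timing[(t, 0)]["W"] < timing[(t, 1)]["D"]
```

With 4 threads and 5 stages the gap is exactly one cycle (W in cycle t+4, next D in cycle t+5), which is why this scheme needs at least as many threads as there are stages between register read and write-back.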

CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state

5

Simple Multithreaded Pipeline

§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

§ Appears to software (including OS) as multiple, albeit slower, CPUs

6

[Figure: simple multithreaded pipeline datapath; a 2-bit thread-select signal chooses among per-thread PCs and per-thread register files, which share one I$, D$, and execution datapath, with the thread select carried down the pipeline alongside the instruction]

Multithreading Costs

§ Each thread requires its own user state
– PC
– GPRs

§ Also, needs its own system state
– Virtual-memory page-table-base register
– Exception-handling registers

§ Other overheads:
– Additional cache/TLB conflicts from competing threads
– (or add larger cache/TLB capacity)
– More OS overhead to schedule more threads (where do all these threads come from?)

7
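The replicated state can be summarized in a sketch (field names are ours, loosely RISC-V-flavored; real hardware of course holds this in flip-flops and SRAM, not objects):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ThreadContext:
    """Per-thread state a multithreaded core must replicate."""
    # User state
    pc: int = 0
    gprs: List[int] = field(default_factory=lambda: [0] * 32)
    # System state
    page_table_base: int = 0  # virtual-memory page-table-base register
    exception_regs: Dict[str, int] = field(
        default_factory=lambda: {"epc": 0, "cause": 0})

# A 4-thread core keeps four full contexts in hardware:
contexts = [ThreadContext() for _ in range(4)]
assert len(contexts[0].gprs) == 32
```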

Thread Scheduling Policies

§ Fixed interleave (CDC 6600 PPUs, 1964)
– Each of N threads executes one instruction every N cycles
– If thread not ready to go in its slot, insert pipeline bubble

§ Software-controlled interleave (TI ASC PPUs, 1971)
– OS allocates S pipeline slots amongst N threads
– Hardware performs fixed interleave over S slots, executing whichever thread is in that slot

§ Hardware-controlled thread scheduling (HEP, 1982)
– Hardware keeps track of which threads are ready to go
– Picks next thread to execute based on hardware priority scheme

8
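The hardware-controlled scheme can be sketched as follows (our simplification of HEP-style scheduling; function and parameter names are ours):

```python
def pick_next_thread(ready, priority):
    """Among threads whose next instruction is ready to go, pick the
    one with the highest priority (lowest number).  Return None to
    insert a pipeline bubble if no thread is ready."""
    candidates = [t for t, ok in enumerate(ready) if ok]
    if not candidates:
        return None
    return min(candidates, key=lambda t: priority[t])

# Thread 0 not ready; of threads 1 and 2, thread 2 has priority 1 < 2:
assert pick_next_thread([False, True, True], priority=[0, 2, 1]) == 2
# Nothing ready -> bubble:
assert pick_next_thread([False, False], priority=[0, 1]) is None
```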

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU
– 120 threads per processor
– 10 MHz clock rate
– Up to 8 processors
– precursor to Tera MTA (Multithreaded Architecture)

9

CS252

Tera MTA (1990–)

§ Up to 256 processors
§ Up to 128 active threads per processor
§ Processors and memory modules populate a sparse 3D torus interconnection fabric

§ Flat, shared main memory
– No data cache
– Sustains one main memory access per cycle per processor

§ GaAs logic in prototype, 1 kW/processor @ 260 MHz
– Second version CMOS, MTA-2, 50 W/processor
– Newer version, XMT, fits into AMD Opteron socket, runs at 500 MHz
– Newest version, XMT2, has higher memory bandwidth and capacity

10

CS252

MTA Pipeline

11

[Figure: MTA pipeline; instruction fetch feeds an issue pool and the execution pipeline, while memory operations go through a memory pipeline with a write pool and retry pool across the interconnection network to the memory pool]

• Every cycle, one VLIW instruction from one active thread is launched into pipeline

• Instruction pipeline is 21 cycles long

• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…

What is single-thread performance?

Effective single-thread issue rate is 260/21 = 12.4 MIPS
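The arithmetic above can be checked directly (values from the slide):

```python
clock_hz = 260e6      # 260 MHz prototype clock
pipeline_depth = 21   # cycles between issues from one thread

# One instruction every 21 cycles gives the effective
# single-thread issue rate:
mips = clock_hz / pipeline_depth / 1e6
assert abs(mips - 12.4) < 0.1   # ~12.4 MIPS

# Conversely, 21 ready threads suffice to issue every cycle.
threads_to_saturate = pipeline_depth
assert threads_to_saturate == 21
```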

Coarse-Grain Multithreading

§ Tera MTA designed for supercomputing applications with large data sets and low locality
– No data cache
– Many parallel threads needed to hide large memory latency

§ Other applications are more cache friendly
– Few pipeline bubbles if cache mostly has hits
– Just add a few threads to hide occasional cache miss latencies
– Swap threads on cache misses

12
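A back-of-the-envelope model (our own; the miss penalty and switch cost are illustrative numbers in the spirit of the surrounding slides) shows why swapping threads on misses pays off for cache-friendly workloads:

```python
def total_cycles(miss_trace, miss_penalty=150, switch_cost=4):
    """Each instruction takes 1 cycle.  On a cache miss, a single
    thread stalls for miss_penalty cycles; a coarse-grain
    multithreaded core instead pays only the pipeline-flush cost
    (switch_cost) and runs another thread during the miss."""
    single = sum(1 + (miss_penalty if miss else 0) for miss in miss_trace)
    multi  = sum(1 + (switch_cost  if miss else 0) for miss in miss_trace)
    return single, multi

# 100 instructions with a 2% miss rate:
single, multi = total_cycles([False] * 98 + [True] * 2)
assert single == 400   # 98 + 2 * (1 + 150)
assert multi == 108    # 98 + 2 * (1 + 4)
```

Even at a 2% miss rate, hiding misses behind a second thread nearly quadruples throughput in this toy model, without needing the MTA's hundreds of threads.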

MIT Alewife (1990)

13

§ Modified SPARC chips
– register windows hold different thread contexts

§ Up to four threads per node
§ Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)

§ Commercial coarse-grain multithreading CPU
§ Based on PowerPC with quad-issue in-order five-stage pipeline

§ Each physical CPU supports two virtual CPUs
§ On L2 cache miss, pipeline is flushed and execution switches to second thread
– short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
– flush pipeline to simplify exception handling

14

Oracle/Sun Niagara processors

§ Target is datacenters running web servers and databases, with many concurrent requests

§ Provide multiple simple cores each with multiple hardware threads; reduced energy/operation though much lower single-thread performance

§ Niagara-1 [2004], 8 cores, 4 threads/core
§ Niagara-2 [2007], 8 cores, 8 threads/core
§ Niagara-3 [2009], 16 cores, 8 threads/core
§ T4 [2011], 8 cores, 8 threads/core
§ T5 [2012], 16 cores, 8 threads/core
§ M5 [2012], 6 cores, 8 threads/core
§ M6 [2013], 12 cores, 8 threads/core

15

Oracle/Sun Niagara-3, "Rainbow Falls", 2009

16

Oracle M6, 2013

17


CS152 Administrivia

§ PS3 due today
§ PS4 out Wednesday March 18, due Friday April 3
§ Lab 3 due Monday April 6

20

CS252

CS252 Administrivia

§ Reading today 3:30 on Zoom, link on Piazza

21

Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time

§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

22

For most apps, most execution units lie idle in an OoO superscalar

23

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

For an 8-way superscalar.

Superscalar Machine Efficiency

24

[Figure: instruction issue slots (issue width 4) over time; a completely idle cycle is vertical waste, a partially filled cycle, i.e., IPC < 4, is horizontal waste]
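These two kinds of waste can be stated precisely in a small sketch (our own formulation for a 4-wide machine):

```python
ISSUE_WIDTH = 4

def classify_waste(issued_per_cycle):
    """Count wasted issue slots: a completely idle cycle contributes
    ISSUE_WIDTH slots of vertical waste; a partially filled cycle
    (0 < IPC < width) contributes its empty slots as horizontal
    waste."""
    vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
    horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle
                     if 0 < n < ISSUE_WIDTH)
    return vertical, horizontal

# Cycles issuing 0, 2, 4, 1 instructions: one idle cycle (4 vertical
# slots) and 2 + 3 = 5 horizontal slots wasted:
assert classify_waste([0, 2, 4, 1]) == (4, 5)
```

Vertical multithreading attacks the first term, SMT attacks both, which is exactly the progression the next slides walk through.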

Vertical Multithreading

25

§ Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste

[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles, i.e., IPC < 4 (horizontal waste), remain]

Chip Multiprocessing (CMP)

26

§ What is the effect of splitting into multiple processors?
– reduces horizontal waste,
– leaves some vertical waste, and
– puts upper limit on peak throughput of each thread.

[Figure: issue slots over time split between two narrower processors]

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

27

§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots over time fully packed with instructions from multiple threads]

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

§ Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously

§ Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads

§ OoO instruction window already has most of the circuitry required to schedule from multiple threads

§ Any single thread can utilize whole machine

28

SMT adaptation to parallelism type

29

For regions with high thread-level parallelism (TLP), entire machine width is shared by all threads

For regions with low thread-level parallelism (TLP), entire machine width is available for instruction-level parallelism (ILP)

[Figure: issue slots over time for the high-TLP and low-TLP cases]

Pentium-4 Hyperthreading (2002)

§ First commercial SMT design (2-way SMT)
§ Logical processors share nearly all resources of the physical processor
– Caches, execution units, branch predictors
§ Die area overhead of hyperthreading ~5%
§ When one logical processor is stalled, the other can make progress
– No logical processor can use all entries in queues when two threads are active
§ Processor running only one active software thread runs at approximately same speed with or without hyperthreading
§ Hyperthreading dropped on OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem-generation machines in 2008
§ Intel Atom (in-order x86 core) has two-way vertical multithreading
– Hyperthreading == (SMT for Intel OoO & vertical for Intel InO)

30

IBMPower4

31

Single-threaded predecessor to Power5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

32

[Figure: Power 4 vs. Power 5 pipeline diagrams; Power 5 adds 2 fetch streams (PCs) with 2 initial decodes, and 2 commits (architected register sets)]

Power5 dataflow ...

33

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Initial Performance of SMT

§ Pentium-4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium-4 is dual-threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark

§ Running on Pentium-4, each of 26 SPEC benchmarks paired with every other (26² runs): speedups from 0.90 to 1.58; average was 1.20

§ Power5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate

§ Power5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl. Pt. apps had most cache conflicts and least gains

34

SMT Performance: Application Interaction

35

Bulpin et al., "Multiprogramming Performance of Pentium 4 with Hyper-Threading"

[Figure: per-pair speedup chart; annotations: "So long as they aren't banging on the L2 too", "Not affected by other programs", "Your favorite benchmark from Lab 2"]

SMT Performance: Application Interaction

36

Bulpin et al., "Multiprogramming Performance of Pentium 4 with Hyper-Threading"

[Figure: per-pair speedup chart; annotations: "Doesn't play nice", "Your favorite benchmark from Lab 2"]

SMT Performance: Application Interaction

37

Bulpin et al., "Multiprogramming Performance of Pentium 4 with Hyper-Threading"

[Figure: per-pair speedup chart; annotation: "Very sensitive to second program"]

Icount Choosing Policy

38

Fetch from thread with the least instructions in flight.

Why does this enhance throughput?
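A minimal sketch of the policy (function name is ours; real hardware tracks per-thread in-flight counts in the front end):

```python
def icount_pick(in_flight):
    """ICOUNT fetch policy: fetch from the thread with the fewest
    instructions currently in flight.  This favors fast-moving
    threads and keeps any one stalled thread from filling the
    issue queue with unissuable instructions."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Thread 1 has the fewest instructions in the machine, so fetch it:
assert icount_pick([12, 3, 7, 9]) == 1
```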

Summary: Multithreaded Categories

39

[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1–5 and idle slots]

Multithreaded Design Discussion

§ Want to build a multithreaded processor; how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file

SMT & Security

§ Most hardware attacks rely on shared hardware resources to establish a side-channel
– e.g., shared outer caches, DRAM row buffers

§ SMT gives attackers high-BW access to previously private hardware resources that are shared by co-resident threads:

§ TLBs: TLBleed (June '18)
§ L1 caches: CacheBleed (2016)
§ Functional unit ports: PortSmash (Nov '18)

OpenBSD 6.4 -> disabled HT in BIOS; AMD SMT to follow

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Scott Beamer (UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

42
