projects - github pages

Projects

•  Worktogetherfortheimplementa1on–  Discussionanddebugging,butnotthecodeitself•  Eachsubmityourownimplementa1onandreport•  Presenta1on–  Onepresenta1on

•  Addi1onalmee1ng1me

1

NameAdi0Pa0l Project1AparnaPuram Project1ErikHoggard Project1JacquesBreaux Project1KaramAbughalieh Project2KevinWeinert Project2MarkEasterly Project2NathanSketch Project2ShayanMukhtar Project2

Lecture16:ParallelArchitecture–ThreadLevelParallelism

ConcurrentandMul0coreProgramming

DepartmentofComputerScienceandEngineeringYonghongYan

[email protected]/~yan

2

Topics(Part1)

•  Introduc1on•  Principlesofparallelalgorithmdesign(Chapter3)•  Programmingonsharedmemorysystem(Chapter7)–  OpenMP–  Cilk/Cilkplus–  PThread,mutualexclusion,locks,synchroniza0ons•  Analysisofparallelprogramexecu1ons(Chapter5)–  PerformanceMetricsforParallelSystems•  Execu0onTime,Overhead,Speedup,Efficiency,Cost

–  ScalabilityofParallelSystems–  Useofperformancetools

3

Topics(Part2)

•  Parallelarchitecturesandhardware–  Parallelcomputerarchitectures•  Threadlevelparallelismanddatalevelparallelism

–  Memoryhierarchyandcachecoherency•  ManycoreGPUarchitecturesandprogramming–  GPUsarchitectures–  CUDAprogramming–  Introduc1ontooffloadingmodelinOpenMP•  Programmingonlargescalesystems(Chapter6)–  MPI(pointtopointandcollec0ves)–  Introduc1ontoPGASlanguages,UPCandChapel•  Parallelalgorithms(Chapter8,9&10)

4

Moore’sLaw

Source:hZp://en.wikipedia.org/wki/Images:Moores_law.svg

•  Long-termtrendonthenumberoftransistorperintegratedcircuit•  Numberoftransistorsdoubleevery~18month

BinaryCodeandInstruc0ons

6

StagestoExecuteanInstruc0on

7

Pipeline

8

PipelineandSuperscalar

9

Whatdowedowiththatmanytransistors?

•  Op1mizingtheexecu1onofasingleinstruc1onstreamthrough–  Pipelining•  Overlaptheexecu1onofmul1pleinstruc1ons•  Example:allRISCarchitectures;Intelx86underneaththehood

–  Out-of-orderexecu1on:•  Allowinstruc1onstoovertakeeachotherinaccordancewithcodedependencies(RAW,WAW,WAR)

•  Example:allcommercialprocessors(Intel,AMD,IBM,SUN)–  Branchpredic1onandspecula1veexecu1on:•  Reducethenumberofstallcyclesduetounresolvedbranches

•  Example:(nearly)allcommercialprocessors

Whatdowedowiththatmanytransistors?(II)

–  Mul1-issueprocessors:•  Allowmul1pleinstruc1onstostartexecu1onperclockcycle•  Superscalar(Intelx86,AMD,…)vs.VLIWarchitectures

–  VLIW/EPICarchitectures:•  Allowcompilerstoindicateindependentinstruc1onsperissuepacket

•  Example:IntelItanium–  Vectorunits:•  Allowfortheefficientexpressionandexecu1onofvectoropera1ons

•  Example:SSE-SSE4,AVXinstruc1ons

Limita0onsofop0mizingasingleinstruc0onstream(II)

•  Problem:withinasingleinstruc1onstreamwedonotfindenoughindependentinstruc1onstoexecutesimultaneouslydueto–  datadependencies–  limita1onsofspecula1veexecu1onacrossmul1plebranches–  difficul1estodetectmemorydependenciesamonginstruc1on

(aliasanalysis)•  Consequence:significantnumberoffunc1onalunitsareidlingat

anygiven1me•  Ques1on:Canwemaybeexecuteinstruc1onsfromanother

instruc1onsstream–  Anotherthread?–  Anotherprocess?

The“Future”ofMoore’sLaw

•  ThechipsaredownforMoore’slaw–  hZp://www.nature.com/news/the-chips-are-down-for-moore-

s-law-1.19338•  SpecialReport:50YearsofMoore'sLaw–  hZp://spectrum.ieee.org/sta1c/special-report-50-years-of-

moores-law•  Moore’slawreallyisdeadthis1me–  hZp://arstechnica.com/informa1on-technology/2016/02/

moores-law-really-is-dead-this-1me/•  Reboo1ngtheITRevolu1on:ACalltoAc1on(SIA/SRC,2015)–  hZps://www.semiconductors.org/clientuploads/Resources/

RITR%20WEB%20version%20FINAL.pdf

13

Thread-levelparallelism

•  Problemsforexecu1nginstruc1onsfrommul1plethreadsatthesame1me–  Theinstruc1onsineachthreadmightusethesameregister

names–  Eachthreadhasitsownprogramcounter•  Virtualmemorymanagementallowsfortheexecu1onofmul1plethreadsandsharingofthemainmemory

•  Whentoswitchbetweendifferentthreads:–  Finegrainmul1threading:switchesbetweeneveryinstruc1on–  Coursegrainmul1threading:switchesonlyoncostlystalls(e.g.

level2cachemisses)

SimultaneousMul0-Threading(SMT)

•  ConvertThread-levelparallelismtoinstruc1on-levelparallelism

Superscalar CourseMT FineMT SMT

Simultaneousmul0-threading(II)

•  DynamicallyscheduledprocessorsalreadyhavemosthardwaremechanismsinplacetosupportSMT(e.g.registerrenaming)

•  Requiredaddi1onalhardware:–  Registerfileperthread–  Programcounterperthread•  Opera1ngsystemview:–  IfaCPUsupportsnsimultaneousthreads,theOpera1ng

Systemviewsthemasnprocessors–  OSdistributesmost1meconsumingthreads‘fairly’acrossthe

nprocessorsthatitsees.

ExampleforSMTarchitectures(I)

•  IntelHyperthreading:–  FirstreleasedforIntelXeonprocessorfamilyin2002–  SupportstwoarchitecturalsetsperCPU,–  Eacharchitecturalsethasitsown•  Generalpurposeregisters•  Controlregisters•  Interruptcontrolregisters•  Machinestateregisters

–  Addslessthan5%totherela1vechipsizeReference:D.T.Marret.al.“Hyper-ThreadingTechnologyArchitectureandMicroarchitecture”,IntelTechnologyJournal,6(1),2002,pp.4-15.xp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf

ExampleforSMTarchitectures(II)

•  IBMPower5–  SamepipelineasIBMPower4processorbutwithSMTsupport–  Furtherimprovements:•  Increaseassocia1vityoftheL1instruc1oncache•  IncreasethesizeoftheL2andL3caches•  Addseparateinstruc1onprefetchandbufferingunitsforeachSMT

•  Increasethesizeofissuequeues•  Increasethenumberofvirtualregistersusedinternallybytheprocessor.

SimultaneousMul0-Threading

•  Workswellif–  Numberofcomputeintensivethreadsdoesnotexceedthenumberof

threadssupportedinSMT–  Threadshavehighlydifferentcharacteris1cs(e.g.onethreaddoingmostly

integeropera1ons,anothermainlydoingfloa1ngpointopera1ons)•  Doesnotworkwellif–  Threadstrytou1lizethesamefunc1onunits–  Assignmentproblems:•  e.g.adualprocessorsystem,eachprocessorsuppor1ng2threadssimultaneously(OSthinksthereare4processors)

•  2computeintensiveapplica1onprocessesmightenduponthesameprocessorinsteadofdifferentprocessors(OSdoesnotseethedifferencebetweenSMTandrealprocessors!)

Synchroniza0onbetweenprocessors

•  Requiredonalllevelsofmul1-threadedprogramming–  Lock/unlock–  Mutualexclusion–  Barriersynchroniza1on

•  Keyhardwarecapability:*cp++–  Uninterruptableinstruc1oncapableofautoma1callyretrieving

orchangingavalue

RaceCondi0onintcount=0;int*cp=&count;….*cp++;/*bytwothreads*/

21Picturesfromwikipedia:hZp://en.wikipedia.org/wiki/Race_condi1on

SimpleExample(IIIb)

void *thread_func (void *arg){ int * cp (int *) arg; pthread_mutex_lock (&mymutex); *cp++; // read, increment and write shared variable pthread_mutex_unlock (&mymutex); return NULL; }

Synchroniza0on

•  Lock/unlockopera1onsonthehardwarelevel,e.g.–  Lockreturning1iflockisfree/available–  Lockreturning0iflockisunavailable•  Implementa1onusingatomicexchange(compareandswap)–  Processsetsthevalueofaregister/memoryloca1ontothe

requiredopera1on–  Se�ngthevaluemustnotbeinterruptedinordertoavoid

racecondi1ons–  Accessbymul1pleprocesses/threadswillberesolvedbywrite

serializa1on

Synchroniza0on(II)

•  Othersynchroniza1onprimi1ves:–  Test-and-set–  Fetch-and-increment•  Problemswithallthreealgorithms:–  Requireareadandwriteopera1oninasingle,uninterruptable

sequence–  Hardwarecannotallowanyopera1onsbetweenthereadand

thewriteopera1on–  Complicatescachecoherence–  Mustnotdeadlock

Loadlinked/storecondi0onal

•  Pairofinstruc1onswherethesecondinstruc1onreturnsavalueindica1ng,whetherthepairofinstruc1onswasexecutedasiftheinstruc1onswereatomic

•  Specialpairofloadandstoreopera1ons–  Loadlinked(LL)–  Storecondi8onal(SC):returns1ifsuccessful,0otherwise•  Storecondi1onalreturnsanerrorif–  Contentsofmemoryloca1onspecifiedbyLLchangedbefore

callingSC–  Processorexecutesacontextswitch

Loadlinked/storecondi0onal(II)

•  AssemblercodesequencetoatomicallyexchangethecontentsofregisterR4andthememoryloca1onspecifiedbyR1

try: MOV R3, R4

LL R2, 0(R1)

SC R3, 0(R1)

BEQZ R3, try

MOV R4, R2

Loadlinked/storecondi0onal(III)

•  Implemen1ngfetch-and-incrementusingloadlinkedandcondi1onalstore

try: LL R2, 0(R1) DADDUI R3, R2, #1 SC R3, 0(R1)

BEQZ R3, try •  Implementa1onofLL/SCbyusingaspecialLinkRegister,whichcontainstheaddressoftheopera1on

Spinlocks

•  Alockthataprocessorcon1nuouslytriestoacquire,spinningaroundinaloopun1litsucceeds.

•  Trivialimplementa1on

DADDUI R2, R0, #1

lockit: EXCH R2, 0(R1) !atomic exchange

BNEZ R2, lockit

•  SincetheEXCHopera1onincludesareadandamodifyopera1on–  Valuewillbeloadedintothecache•  Goodifonlyoneprocessortriestoaccessthelock•  Badifmul1pleprocessorsinanSMPtrytogetthelock(cachecoherence)

–  EXCHincludesawriteaZempt,whichwillleadtoawrite-missforSMPs

Spinlocks(II)

•  ForcachecoherentSMPs,slightmodifica1onofthelooprequired

lockit: LD R2, 0(R1) !load the lock

BNEZ R2, lockit !lock available?

DADDUI R2, R0, #1 !load locked value

EXCH R2, 0(R1) !atomic exchange

BNEZ R2, lockit !EXCH successful?

Spinlocks(III)

•  …orusingLL/SClockit: LL R2, 0(R1) !load the lock

BNEZ R2, lockit !lock available?

DADDUI R2, R0, #1 !load locked value

SC R2, 0(R1) !atomic exchange

BNEZ R2, lockit !SC successful?

projects - github pages

Documents