CS 61C: Great Ideas in Computer Architecture
Lecture 19: Thread-Level Parallel Processing
Krste Asanović & Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa17
11/2/17 Fall 2017 - Lecture #19

Agenda

• MIMD - multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion…

Improving Performance

1. Increase clock rate fs
   − Reached practical maximum for today’s technology
   − < 5 GHz for general purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, “instruction level parallelism”
3. Perform multiple tasks simultaneously
   − Multiple CPUs, each executing different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web http requests over different computers
     § E.g. run pptx (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
   − High fs, SIMD, multiple parallel tasks

Today’s lecture

New-School Machine Structures (It’s a bit more complicated!)

• Parallel Requests: assigned to computer, e.g., Search “Katz”
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages

[Figure: software/hardware stack that harnesses parallelism to achieve high performance, from warehouse scale computer and smart phone, through computer (cores, memory (cache), input/output) and core (instruction unit(s), functional unit(s) computing A0+B0 … A3+B3, cache memory), down to logic gates. Callout: Projects 3 and 5!]

Parallel Computer Architectures

• Several separate computers, some means for communication (e.g., Ethernet)
• Massive array of computers, fast communication between processors
• Multi-core CPU: more than one datapath in a single chip
  − Cores share L3 cache, memory, peripherals
  − Example: Hive machines
• GPU “graphics processing unit”

Example: CPU with Two Cores

[Figure: Processor “Core” 1 and Processor “Core” 2, each with its own control and datapath (PC, registers, ALU). Processor 0 and Processor 1 memory accesses go to a shared memory (bytes), with input and output attached through the I/O-memory interfaces.]

Multiprocessor Execution Model

• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest level caches (e.g., 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often 3rd level cache
    § Often on same silicon chip
    § But not a requirement
• Nomenclature
  − “Multiprocessor Microprocessor”
  − Multicore processor
    § E.g., four core CPU (central processing unit)
    § Executes four different instruction streams simultaneously

Transition to Multicore

[Figure: sequential app performance]

Pixel 2 vs. iPhone 8

[Figure: spec comparison. Column headings: ALUs, nm, MHz, GFlops. Caption: 2.35 GHz + 1.9 GHz, 64-bit octa-core]

Multiprocessor Execution Model

• Shared memory
  − Each “core” has access to the entire memory in the processor
  − Special hardware keeps caches consistent (next lecture!)
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o “Slow” memory shared by many “customers” (cores)
      o May become bottleneck (Amdahl’s Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of single task between several cores
    § E.g., each performs part of large matrix multiplication

Parallel Processing

• It’s difficult!
• It’s inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g., smartphones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.,
    § Motion processor, image processor, neural processor in iPhone 8 + X
    § GPU (graphics processing unit)
• Warehouse-scale computers (next week!)
  − Multiple “nodes”
    § “Boxes” with several CPUs, disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits/Core   Core * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128               256                   4
2005     4          128               512                   8
2007     6          128               768                  12
2009     8          128              1024                  16
2011    10          256              2560                  40
2013    12          256              3072                  48
2015    14          512              7168                 112
2017    16          512              8192                 128
2019    18         1024             18432                 288
2021    20         1024             20480                 320

Growth rates: MIMD +2 cores / 2 yrs, SIMD 2X / 4 yrs
Over 12 years: MIMD 2.5X, SIMD 8X, MIMD & SIMD 20X

20x in 12 years: 20^(1/12) = 1.28x → 28% per year or 2x every 3 years!

IF (!) we can use it
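The growth-rate arithmetic on this slide can be checked by hand. The 20x-in-12-years figure comes from the table; the doubling-time step is an added derivation, not something stated on the slide.

```latex
\[
20^{1/12} = e^{\ln 20 / 12} \approx e^{0.2496} \approx 1.28
\quad\Rightarrow\quad \text{about } 28\% \text{ growth per year}
\]
\[
t_{2\times} = \frac{\ln 2}{\ln 1.28} \approx \frac{0.693}{0.247} \approx 2.8 \text{ years}
\]
```

A doubling time of roughly 2.8 years is what the slide rounds to "2x every 3 years".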

Programs Running on my Computer

$ ps -x
PID TTY TIME CMD
220 ?? 0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ?? 0:10.60 /usr/sbin/distnoted agent
224 ?? 0:09.11 /usr/sbin/cfprefsd agent
229 ?? 0:04.71 /usr/sbin/usernoted
230 ?? 0:02.35 /usr/libexec/nsurlsessiond
232 ?? 0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ?? 0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ?? 0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ?? 0:49.72 /usr/libexec/secinitd
239 ?? 0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ?? 0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ?? 0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ?? 0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ?? 0:00.74 /System/Library/CoreServices/mapspushd
244 ?? 0:00.79 /usr/libexec/fmfd
246 ?? 0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ?? 0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ?? 0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ?? 0:04.81 /usr/libexec/secd
254 ?? 0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ?? 0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ?? 0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ?? 0:03.91 /usr/libexec/nsurlstoraged
274 ?? 0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ?? 0:00.09 /usr/sbin/pboard
283 ?? 0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ?? 0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ?? 0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ?? 0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ?? 0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ?? 0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ?? 0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ?? 0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
…

156 total at this moment. How does my laptop do this?

Imagine doing 156 assignments all at the same time!

Threads

• Sequential flow of instructions that performs some task
  − Up to now we just called this a “program”
• Each thread has:
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each physical core provides one (or more)
  − Hardware threads that actively execute instructions
  − Each executes one “hardware thread”
• Operating system multiplexes multiple
  − Software threads onto the available hardware threads
  − All threads except those mapped to hardware threads are waiting

Operating System Threads

Give illusion of many “simultaneously” active threads
1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g., cache miss, user input, network access)
   b) Timer (e.g., switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   a) Interrupting its execution
   b) Saving its registers and PC to memory
3. Start executing a different software thread by
   a) Loading its previously saved registers into a hardware thread’s registers
   b) Jumping to its saved PC
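Steps 2 and 3 above can be pictured with a toy model in C. This is only a sketch: a real context switch runs in privileged assembly inside the OS, and the names `ThreadContext`, `HardwareThread`, `context_switch`, and the choice of 32 registers are assumptions made for illustration.

```c
#include <stdint.h>

#define NUM_REGS 32  /* assumption: 32 integer registers, as in RISC-V */

/* Architectural state the OS saves and restores for a software thread. */
typedef struct {
    uint64_t pc;
    uint64_t regs[NUM_REGS];
} ThreadContext;

/* Toy stand-in for one hardware thread: just its live PC and registers. */
typedef struct {
    ThreadContext live;
} HardwareThread;

/*
 * Save the outgoing software thread's registers and PC to memory, then load
 * the incoming thread's saved state so it resumes at its saved PC.
 */
void context_switch(HardwareThread *hw, ThreadContext *outgoing,
                    const ThreadContext *incoming) {
    *outgoing = hw->live;   /* step 2b: save registers and PC to memory */
    hw->live = *incoming;   /* steps 3a, 3b: restore registers and saved PC */
}
```

The cost of the switch is the two copies, which is why a larger register file (e.g. with AVX state) makes switching more expensive.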

Example: Four Cores

• Thread pool: list of threads competing for processor
• OS maps threads to cores and schedules logical (software) threads
• Each “Core” actively runs one instruction stream at a time

[Figure: thread pool mapped onto Core 1, Core 2, Core 3, Core 4]

Multithreading

• Typical scenario:
  − Active thread encounters cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run different thread until data available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform switch in ≪ 1000 cycles
• Can hardware help?
  − Moore’s Law: transistors are plenty

Hardware Assisted Software Multithreading

• Two copies of PC and Registers inside processor hardware
• Looks identical to two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading:
  • Both threads can be active simultaneously

[Figure: processor (1 core, 2 threads) with one control unit and one datapath holding PC 0, Registers 0, PC 1, Registers 1 and a shared ALU, connected to memory (bytes) and to input/output through the I/O-memory interfaces]

Multithreading

• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate Processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Randy’s Laptop

$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 4,194,304

• 2 Cores
• 4 Threads total

Example: 6 Cores, 24 Logical Threads

• Thread pool: list of threads competing for processor
• OS maps threads to cores and schedules logical (software) threads
• 4 logical threads per core (hardware) thread

[Figure: thread pool mapped onto Core 1 through Core 6, each core running Thread 1 through Thread 4]

Break!

Languages Supporting Parallel Programming

ActorScript     Concurrent Pascal     JoCaml       Orc
Ada             Concurrent ML         Join         Oz
Afnix           Concurrent Haskell    Java         Pict
Alef            Curry                 Joule        Reia
Alice           CUDA                  Joyce        SALSA
APL             E                     LabVIEW      Scala
Axum            Eiffel                Limbo        SISAL
Chapel          Erlang                Linda        SR
Cilk            Fortran 90            MultiLisp    Stackless Python
Clean           Go                    Modula-3     SuperPascal
Clojure         Io                    Occam        VHDL
Concurrent C    Janus                 occam-π      XC

Which one to pick?

Why So Many Parallel Programming Languages?

• Why “intrinsics”?
  − TO Intel: fix your #()&$! Compiler!
• It’s happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o General problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages

• Number of choices is indication of
  − No universal solution
    § Needs are very problem specific
  − E.g.,
    § Scientific computing/machine learning (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it’s all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly “easy” to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP

Parallel Loops

• Serial execution:

    for (int i=0; i<100; i++) {
        …
    }

• Parallel Execution:

    for (int i=0; i<25; i++) {
        …
    }

    for (int i=25; i<50; i++) {
        …
    }

    for (int i=50; i<75; i++) {
        …
    }

    for (int i=75; i<100; i++) {
        …
    }
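The split of one loop into four sub-ranges can be computed rather than written out by hand. The helper below is a sketch of one common way to divide `n` iterations into contiguous chunks; `chunk_bounds` is a hypothetical name, and an OpenMP runtime is free to pick a different split.

```c
/*
 * Compute the half-open range [*start, *end) of iterations that thread `id`
 * handles when n iterations are split into contiguous chunks.
 * Assumes num_threads > 0 and 0 <= id < num_threads.
 */
void chunk_bounds(int n, int num_threads, int id, int *start, int *end) {
    int base = n / num_threads;   /* iterations every thread gets */
    int extra = n % num_threads;  /* leftovers, one each to the first threads */
    *start = id * base + (id < extra ? id : extra);
    *end = *start + base + (id < extra ? 1 : 0);
}
```

With `n = 100` and four threads this yields the ranges 0-25, 25-50, 50-75, and 75-100 shown above.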

Parallel for in OpenMP

    #include <omp.h>

    #pragma omp parallel for
    for (int i=0; i<100; i++) {
        …
    }
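A fuller sketch of the same pattern records which thread ran each iteration. The names `record_owners` and `current_thread` are made up for this example, and the `_OPENMP` guard is an added convenience so the file still builds (and runs serially) when compiled without `-fopenmp`.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread id inside a parallel region; 0 when built without OpenMP. */
static int current_thread(void) {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Fill owner[i] with the id of the thread that executed iteration i. */
void record_owners(int *owner, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        owner[i] = current_thread();
    }
}
```

Each iteration writes a different array element, so the threads never touch the same memory location and no synchronization is needed.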

OpenMP Example

$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5

OpenMP

• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler Directives, #pragma
  − Runtime Library Routines, #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g., same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model

• Fork - Join Model:
• OpenMP programs begin as single process (master thread)
  − Sequential execution
• When parallel region is encountered
  − Master thread “forks” into team of parallel threads
  − Executed simultaneously
  − At end of parallel region, parallel threads “join”, leaving only master thread
• Process repeats for each parallel region
  − Amdahl’s Law?
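The Amdahl's Law question can be made concrete: the sequential stretches between parallel regions bound the overall speedup. A minimal sketch of the standard formula, with `amdahl_speedup` as a hypothetical helper name:

```c
/*
 * Amdahl's Law: speedup when a fraction p (0 <= p <= 1) of the run time is
 * parallelized across n threads and the remaining (1 - p) stays sequential.
 */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, with 90% of the run time inside parallel regions and 16 threads, the speedup is 1 / (0.1 + 0.9/16) = 6.4, well short of 16.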

What Kind of Threads?

• OpenMP threads are operating system (software) threads
• OS will multiplex requested OpenMP threads onto available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on machine compete for hardware threads!
• Be “careful” (?) when timing results for Project 3!
  − 5 AM?
  − Job queue?

Example 2: Computing π

http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π

pi = 3.142425985001

• Resembles π, but not very accurate
• Let’s increase num_steps and parallelize
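The slide's code did not survive extraction. The sketch below assumes the numerical-integration approach of the OpenMP hands-on tutorial referenced above: a midpoint-rule estimate of the integral of 4/(1+x²) from 0 to 1, which equals π. The function name `sequential_pi` is made up; `num_steps` is the name the slide uses.

```c
/*
 * Midpoint-rule estimate of the integral of 4/(1+x^2) over [0, 1], which
 * equals pi. num_steps is the number of rectangles (assumed > 0).
 */
double sequential_pi(long num_steps) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;  /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

With only 10 steps the estimate is about 3.1424, in line with the value shown on the slide; the error shrinks roughly as 1/num_steps².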

Parallelize (1) …

• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …

[Figure: the work split into two partial sums, sum[0] and sum[1]]

1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π
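The code for this slide was an image and is missing from the transcript. The following sketch applies the partial-sum idea (one `sum[id]` per thread, combined sequentially afterwards) to the same midpoint-rule integral. `NUM_THREADS`, `parallel_pi`, and the serial fallbacks under `#else` are assumptions added so the example also builds without `-fopenmp`.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define NUM_THREADS 4  /* upper bound on the team size requested below */

double parallel_pi(long num_steps) {
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS] = {0.0};   /* one partial sum per thread */

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();  /* may be fewer than requested */
        double local = 0.0;
        /* Thread id handles iterations id, id + nthreads, id + 2*nthreads, ... */
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        sum[id] = local;   /* each thread writes only its own slot */
    }

    /* Sequential part: combine the partial sums after the threads join. */
    double pi = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pi += step * sum[t];
    }
    return pi;
}
```

Accumulating into a private `local` and writing `sum[id]` once is a design choice of this sketch; it keeps the threads from repeatedly writing neighbouring array elements that share a cache line.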

Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001

Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

Can We Parallelize Computing sum?

Summation inside parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?

Always looking for ways to beat Amdahl’s Law …

Peer Instruction

What are the possible values of *(x1) after executing this code by two concurrent threads?

    # *(x1) = 100
    lw   x2, 0(x1)
    addi x2, x2, 1
    sw   x2, 0(x1)

Answer    *(x1)
RED       100 or 101
GREEN     101
ORANGE    101 or 102
YELLOW    100 or 101 or 102
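One way to reason about the question is to replay the two extreme schedules by hand. The function below is a deterministic simulation, not real multithreaded code: `run_two_increments` and its `overlap` flag are hypothetical names, and each thread's private copy of `x2` is modelled as a local variable.

```c
/*
 * Simulate two threads each running lw / addi / sw on the same word.
 * overlap == 0: thread B starts after thread A has stored its result.
 * overlap != 0: both threads load before either stores (the race).
 * Returns the final value in memory.
 */
int run_two_increments(int initial, int overlap) {
    int mem = initial;
    int a_x2, b_x2;           /* each thread's private register x2 */
    if (overlap) {
        a_x2 = mem;           /* A: lw */
        b_x2 = mem;           /* B: lw, sees the same old value */
        a_x2 += 1;            /* A: addi */
        b_x2 += 1;            /* B: addi */
        mem = a_x2;           /* A: sw */
        mem = b_x2;           /* B: sw, overwrites with the same value */
    } else {
        a_x2 = mem; a_x2 += 1; mem = a_x2;   /* A runs to completion */
        b_x2 = mem; b_x2 += 1; mem = b_x2;   /* then B */
    }
    return mem;
}
```

Starting from 100, the non-overlapping schedule ends at 102 and the overlapping one at 101. A final value of 100 cannot occur once both threads have finished, because every store writes a loaded value plus one.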

What’s Going On?

• Operation is really pi = pi + sum[id]
• What if >1 threads read the current (same) value of pi, compute the sum, and store the result back to pi?
• Each processor reads same intermediate value of pi!
• Result depends on who gets there when
• A “race” → result is not deterministic

Administrivia

• Homework 4 (Caches, Floating Point) due tomorrow at 11:59pm
• Project 2-2 due Monday
  − Project Office hours that Monday will be well staffed!
  − Test your CPU thoroughly!
    § Write programs with Venus and load them into your circuit
• Project 3 will be released Monday night
  − A two-week performance project
  − Can earn extra credit from the performance contest (Project 5)
• Midterm scores will be released before Tuesday on Gradescope

Break!

Synchronization

• Problem:
  − Limit access to shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution:
  • Take turns:
    • Only one person gets the microphone & talks at a time
    • Also good practice for classrooms, btw …

Locks

• Computers use locks to control access to shared resources
  − Serves purpose of microphone in example
  − Also referred to as “semaphore”
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks

    // wait for lock released
    while (lock != 0) ;
    // lock == 0 now (unlocked)

    // set lock
    lock = 1;

    // access shared resource ...
    // e.g. pi
    // sequential execution! (Amdahl ...)

    // release lock
    lock = 0;

Lock Synchronization

Thread 1

    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

Thread 2

    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

• Thread 2 finds lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!

Hardware Synchronization

• Solution:
  − Atomic read/write
  − Read & write in single instruction
    § No other access permitted between read and write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for “linked” read and write
    § write fails if memory location has been “tampered” with after linked read
• RISC-V has variations of both, but for simplicity we will focus on the former

RISC-V Atomic Memory Operations (AMOs)

• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-Type Instruction Format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned

[Figure: R-type instruction format, annotated as follows]
• Load from address in rs1 to “t”
• rd = “t”, i.e., the value in memory
• Store at address in rs1 the calculation “t” <operation> rs2
• aq and rl ensure in-order execution

amoadd.w rd,rs2,(rs1):
    t = M[x[rs1]]; x[rd] = t; M[x[rs1]] = t + x[rs2]
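The `amoadd.w` semantics have a close counterpart in C11 atomics. This sketch uses `atomic_fetch_add` from `<stdatomic.h>`; the wrapper name `amoadd_w` is made up to mirror the instruction, and whether a compiler actually emits `amoadd.w` for it depends on the target.

```c
#include <stdatomic.h>

/*
 * C11 analogue of amoadd.w rd, rs2, (rs1): atomically add rs2 to *addr and
 * return the value that was in memory before the add (what rd receives).
 */
int amoadd_w(atomic_int *addr, int rs2) {
    return atomic_fetch_add(addr, rs2);
}
```

Because the load, add, and store happen as one indivisible step, two threads calling this on the same address can never both observe the same old value.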

RISC-V Critical Section

• Assume that the lock is in memory location stored in register a0
• The lock is “set” if it is 1; it is “free” if it is 0 (its initial value)

          li t0, 1                     # Get 1 to set lock
    Try:  amoswap.w.aq t1, t0, (a0)    # t1 gets old lock value
                                       # while we set it to 1
          bnez t1, Try                 # if it was already 1, another
                                       # thread has the lock,
                                       # so we need to try again
          … critical section goes here …
          amoswap.w.rl x0, x0, (a0)    # store 0 in lock to release
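The same acquire/release pattern can be written in portable C11, where `atomic_exchange_explicit` plays the role of `amoswap.w`. The type and function names (`SpinLock`, `spin_try_lock`, `spin_lock`, `spin_unlock`) are hypothetical; this is a sketch of a basic spinlock, not a production lock.

```c
#include <stdatomic.h>

/* 0 = free (initial value), 1 = set, matching the convention above. */
typedef struct {
    atomic_int state;
} SpinLock;

/* One attempt: atomically swap in 1 and report whether the old value was 0. */
int spin_try_lock(SpinLock *l) {
    int old = atomic_exchange_explicit(&l->state, 1, memory_order_acquire);
    return old == 0;
}

/* Retry until the swap returns 0, like the bnez loop around amoswap.w.aq. */
void spin_lock(SpinLock *l) {
    while (!spin_try_lock(l)) {
        /* another thread holds the lock; try again */
    }
}

/* Store 0 with release ordering, like amoswap.w.rl x0, x0, (a0). */
void spin_unlock(SpinLock *l) {
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A caller would wrap its critical section between `spin_lock(&l)` and `spin_unlock(&l)`, after initializing `l.state` to 0 with `atomic_init`.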

Lock Synchronization

Broken Synchronization

    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

Fix (lock is at location (a0))

            li t0, 1
    Try:    amoswap.w.aq t1, t0, (a0)
            bnez t1, Try
    Locked:
            # critical section
    Unlock:
            amoswap.w.rl x0, x0, (a0)

OpenMP Locks

Synchronization in OpenMP

• Locks are typically used in libraries of higher level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − See online documentation
  − Or tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

OpenMP Critical Section
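The slide's code is not in the transcript. As a sketch of how `#pragma omp critical` can repair the racy summation of π, the version below lets each thread accumulate a private partial sum and protects only the final update of the shared variable. The function name `pi_with_critical` is an assumption; compiled without `-fopenmp` the pragmas are ignored and the code runs serially.

```c
double pi_with_critical(long num_steps) {
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;   /* shared by all threads in the parallel region */

    #pragma omp parallel
    {
        double local = 0.0;   /* private partial sum, no sharing while looping */

        #pragma omp for
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }

        /* Only one thread at a time may update the shared variable pi. */
        #pragma omp critical
        pi += step * local;
    }
    return pi;
}
```

Each thread enters the critical section once, so the serialized portion stays small compared with the loop.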

The Trouble with Locks …

• … is dead-locks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs salt
  − Cook 2 grabs pepper
  − Cook 1 notices s/he needs pepper
    § it’s not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it’s not there, so s/he waits
• A not so common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …

And, in Conclusion, …

• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction level parallelism
    § Implemented in all high performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread level parallelism
    § Multicore processors
    § Supported by Operating Systems (OS)
    § Requires programmer intervention to exploit at single program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks
