EECS 570 Lecture 2: Message Passing & Shared Memory
TRANSCRIPT
EECS 570, Lecture 2, Slide 1
EECS 570 Lecture 2: Message Passing & Shared Memory, Winter 2018
Prof. Satish Narayanasamy
http://www.eecs.umich.edu/courses/eecs570/
Slides developed in part by Drs. Adve, Falsafi, Martin, Musuvathi, Narayanasamy, Nowatzyk, Wenisch, Sarkar, Mikko Lipasti, Jim Smith, John Shen, Mark Hill, David Wood, Guri Sohi, Natalie Enright Jerger, Michel Dubois, Murali Annavaram, Per Stenström, and probably others
(photo: Intel Paragon XP/S)
Slide 2
Announcements
Discussion this Friday.
• Will discuss course project
Slide 3
Readings
For today:
❒ David Wood and Mark Hill. "Cost-Effective Parallel Computing," IEEE Computer, 1995.
❒ Mark Hill et al. "21st Century Computer Architecture." CCC White Paper, 2012.
For Wednesday 1/13:
❒ Seiler et al. "Larrabee: A Many-Core x86 Architecture for Visual Computing." SIGGRAPH 2008.
Slide 4
Sequential Merge Sort
(diagram: 16 MB input of 32-bit integers; recurse on left half, recurse on right half, merge to scratch array, copy back to input array; sequential execution shown over time)
Slide 5
Parallel Merge Sort (as Parallel Directed Acyclic Graph)
(diagram: 16 MB input of 32-bit integers; recurse on left half and right half in parallel, merge to scratch array, copy back to input array; parallel execution shown over time)
Slide 6
Parallel DAG for Merge Sort (2-core)
(diagram: two sequential sorts in parallel, followed by a merge, shown over time)
Slide 7
Parallel DAG for Merge Sort (4-core)
Slide 8
Parallel DAG for Merge Sort (8-core)
Slide 9
The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished
Slide 10
Sorting 16 elements on four cores
Slide 11
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
(diagram: DAG with four unit-cost leaf sorts, two 8-element merges of weight 8, and one 16-element merge of weight 16)
Slide 12
Performance Measures
• Given a graph G, a scheduler S, and P processors
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources
Slide 13
Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG
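Work and depth can be computed directly from a weighted DAG. Below is a minimal sketch, assuming the DAG is represented as a dict mapping each node to its weight and predecessor list; the node names and weights mirror the 16-element merge-sort example (unit-cost leaf sorts, merges weighted by elements merged) and are illustrative, not the lecture's own notation.

```python
# Sketch: computing work (T1) and depth (T_inf) of a weighted parallel DAG.
# DAG shape assumed here: node -> (weight, list of predecessor nodes).
from functools import lru_cache

dag = {
    "sort0": (1, []), "sort1": (1, []), "sort2": (1, []), "sort3": (1, []),
    "merge8a": (8, ["sort0", "sort1"]),
    "merge8b": (8, ["sort2", "sort3"]),
    "merge16": (16, ["merge8a", "merge8b"]),
}

def work(dag):
    """T1: total weight of all nodes."""
    return sum(w for w, _ in dag.values())

def depth(dag):
    """T_inf: heaviest chain of dependent nodes (the critical path)."""
    @lru_cache(maxsize=None)
    def longest(node):
        w, preds = dag[node]
        return w + max((longest(p) for p in preds), default=0)
    return max(longest(n) for n in dag)

print(work(dag))   # 36 = 4*1 + 2*8 + 16
print(depth(dag))  # 25 = 1 + 8 + 16 (sort -> 8-merge -> 16-merge)
```

Parallelism is then work(dag)/depth(dag) = 36/25 = 1.44, which is why this DAG cannot usefully occupy many cores.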
Slide 14
T∞ (Depth): Critical Path Length (Sequential Bottleneck)
Slide 15
T1 (Work): Time to Run Sequentially
Slide 16
Sorting 16 elements on four cores (4-element arrays sorted in constant time)
(diagram: the same DAG with node weights filled in)
Work = 4·1 + 2·8 + 16 = 36
Depth = 1 + 8 + 16 = 25
Slide 17
Some Useful Theorems
Slide 18
Work Law
• "You cannot avoid work by parallelizing"
T1 / P ≤ TP
Slide 19
Work Law (continued)
Speedup = T1 / TP
Slide 20
Work Law (continued)
• Can speedup be more than 2 when we go from 1-core to 2-core in practice?
Slide 21
Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞
Slide 22
Amount of Parallelism
Parallelism = T1 / T∞
Slide 23
Maximum Speedup Possible
"Speedup is bounded above by available parallelism"
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism
Slide 24
Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
Slide 25
More reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf
Slide 26
Performance of the Greedy Scheduler
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP(Greedy) ≤ T1 / P + T∞
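The greedy bound can be checked empirically with a small simulation. This is a sketch under illustrative assumptions: the weighted DAG (nodes a–g, shaped like the merge-sort example) and the arbitrary choice of which ready nodes to run first are both made up; any choice satisfies the bound.

```python
# Sketch: greedy list scheduling on a weighted DAG, checking the bound
# T_P(greedy) <= T1/P + T_inf from the slide.
import heapq

dag = {  # node -> (weight, predecessors); illustrative example DAG
    "a": (1, []), "b": (1, []), "c": (1, []), "d": (1, []),
    "e": (8, ["a", "b"]), "f": (8, ["c", "d"]),
    "g": (16, ["e", "f"]),
}

def greedy_schedule(dag, P):
    indeg = {n: len(preds) for n, (_, preds) in dag.items()}
    succs = {n: [] for n in dag}
    for n, (_, preds) in dag.items():
        for p in preds:
            succs[p].append(n)
    ready = [n for n in dag if indeg[n] == 0]
    running = []  # min-heap of (finish_time, node)
    time = 0
    while ready or running:
        while ready and len(running) < P:   # greedy: never leave a
            n = ready.pop()                 # processor idle if work exists
            heapq.heappush(running, (time + dag[n][0], n))
        time, done = heapq.heappop(running)  # advance to next completion
        for s in succs[done]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return time

T1 = sum(w for w, _ in dag.values())   # work = 36
Tinf = 25                              # critical path a -> e -> g
T2 = greedy_schedule(dag, 2)
assert T2 <= T1 / 2 + Tinf             # Brent-style greedy bound holds
print(T2)
```

With P = 1 the simulation degenerates to sequential execution, so `greedy_schedule(dag, 1)` returns exactly T1.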
Slide 27
Greedy is optimal within a factor of 2
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
TP ≤ TP(Greedy) ≤ 2 TP
Slide 28
Work/Depth of Merge Sort (Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ Takes O(n) time to merge n elements
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
Slide 29
Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work/Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ better increase Work by a lot more!
Slide 30
Amdahl's Law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multi-core hardware
• How many cores do you need to get a 4× speedup on the program?
Slide 31
Amdahl's law, f = 70%
f = the parallel portion of execution
1 − f = the sequential portion of execution
c = number of cores used
Speedup(f, c) = 1 / ((1 − f) + f / c)
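Plugging the slide's numbers into the formula makes the answer to the 4× question concrete. A minimal sketch:

```python
# Sketch: Amdahl's law, Speedup(f, c) = 1 / ((1 - f) + f/c).
def speedup(f, c):
    """f: parallel fraction of execution; c: number of cores."""
    return 1.0 / ((1.0 - f) + f / c)

# f = 70%: even with infinitely many cores the speedup is capped at
# 1/(1 - f) = 3.33x, so the desired 4x speedup is unreachable.
print(round(speedup(0.70, 16), 2))   # 2.91
print(round(1 / (1 - 0.70), 2))      # 3.33
```

No finite core count reaches 4× here, which is exactly the point of the next few slides.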
Slide 32
Amdahl's law, f = 70%
(plot: speedup vs. #cores for 1–16 cores; the desired 4× speedup is shown against the speedup achieved with perfect scaling on 70% of execution)
Slide 33
Amdahl's law, f = 70%
(same plot, with the asymptote marked: limit as c → ∞ = 1/(1 − f) = 3.33)
Slide 34
Amdahl's law, f = 10%
(plot: speedup vs. #cores for 1–16 cores with perfect scaling on 10% of execution; the Amdahl's-law limit is just 1.11×)
Slide 35
Amdahl's law, f = 98%
(plot: speedup vs. #cores for 1–128 cores; speedup climbs toward the 1/(1 − f) = 50× limit)
Slide 36
Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup
Slide 37
21st Century Computer Architecture
A CCC community white paper: http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf
Slides from M. Hill, HPCA 2014 keynote
Slide 38
20th Century ICT Set Up
• Information & Communication Technology (ICT) has changed our world
❒ <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers of (cost-)performance gains:
❒ Semiconductor technology ("Moore's Law")
❒ Computer architecture (~80× per Danowitz et al.)
Slide 39
Enablers: Technology + Architecture
(figure: Danowitz et al., CACM 04/2012, Figure 1, separating the technology and architecture contributions to performance)
Slide 40
21st Century ICT Promises More
Data-centric personalized health care. Computation-driven scientific discovery. Human network analysis. Much more: known & unknown.
Slide 41
21st Century App Characteristics
BIG DATA. ALWAYS ONLINE. SECURE/PRIVATE.
Whither enablers of future (cost-)performance gains?
Slide 42
Technology's Challenges 1/2

| Late 20th Century | The New Reality |
|---|---|
| Moore's Law: 2× transistors/chip | Transistor count still 2×, BUT… |
| Dennard Scaling: ~constant power/chip | Gone. Can't repeatedly double power/chip |
Slide 43
Technology's Challenges 2/2

| Late 20th Century | The New Reality |
|---|---|
| Moore's Law: 2× transistors/chip | Transistor count still 2×, BUT… |
| Dennard Scaling: ~constant power/chip | Gone. Can't repeatedly double power/chip |
| Modest (hidden) transistor unreliability | Increasing transistor unreliability can't be hidden |
| Focus on computation over communication | Communication (energy) more expensive than computation |
| One-time costs amortized via mass market | One-time costs much worse & want specialized platforms |

How should architects step up as technology falters?
Slide 44
"Timeline" from DARPA ISAT
(plot: system capability (log) vs. decade, 80s through 50s, with a "Fallow Period" marked)
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf)
Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012. Approved for Public Release, Distribution Unlimited.
The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
Slide 45
21st Century Comp Architecture

| 20th Century | 21st Century |
|---|---|
| Single-chip in stand-alone computer | Architecture as Infrastructure: spanning sensors to clouds. Performance + security, privacy, availability, programmability, … |
| Performance via invisible instr.-level parallelism | Energy First: parallelism, specialization, cross-layer design |
| Predictable technologies: CMOS, DRAM, & disks | New technologies (non-volatile memory, near-threshold, 3D, photonics, …). Rethink: memory & storage, reliability, communication |

Cross-cutting: break current layers with new interfaces
Slide 50
Cost-Effective Computing [Wood & Hill, IEEE Computer 1995]
• Premise: isn't Speedup(P) < P inefficient?
❒ If only throughput matters, use P computers instead…
• Key observation: much of a computer's cost is NOT CPU
• Let Costup(P) = Cost(P) / Cost(1). Parallel computing is cost-effective if:
❒ Speedup(P) > Costup(P)
• E.g., for SGI PowerChallenge w/ 500 MB:
❒ Costup(32) = 8.6
Slide 51
Parallel Programming Models and Interfaces
Slide 52
Programming Models
• High-level paradigm for expressing an algorithm
❒ Examples:
❍ Functional
❍ Sequential, procedural
❍ Shared memory
❍ Message passing
• Embodied in languages that support concurrent execution
❒ Incorporated into language constructs
❒ Incorporated as libraries added to an existing sequential language
• Top-level features (for conventional models: shared memory, message passing):
❒ Multiple threads are conceptually visible to the programmer
❒ Communication/synchronization are visible to the programmer
Slide 53
An Incomplete Taxonomy
(diagram) MP systems:
• Shared memory: bus-based, or switching-fabric-based (incoherent vs. coherent)
• Message passing: COTS-based, or switching-fabric-based
• Emerging & lunatic fringe: VLIW, SIMD, Vector, Dataflow, GPU, MapReduce/WSC, Systolic Array, Reconfigurable/FPGA, …
Slide 54
Programming Model Elements
• For both shared memory and message passing
• Processes and threads
❒ Process: a shared address space and one or more threads of control
❒ Thread: a program sequencer and private address space
❒ Task: less formal term, part of an overall job
❒ Created, terminated, scheduled, etc.
• Communication
❒ Passing of data
• Synchronization
❒ Communicating control information
❒ To assure reliable, deterministic communication
Slide 55
Historical View
(diagram: three organizations of processor (P), memory (M), and I/O (IO) blocks, distinguished by where the machines join)
• Join at I/O (network) → program with message passing
• Join at memory → program with shared memory
• Join at processor → program with dataflow, SIMD, VLIW, CUDA, other data-parallel models
Slide 56
Message Passing Programming Model
Slide 57
Message Passing Programming Model
• User-level send/receive abstraction
❒ Match via local buffer (x, y), process (Q, P), and tag (t)
❒ Need naming/synchronization conventions
(diagram: process P's local address space holds address x, process Q's holds address y; `send x, Q, t` in P matches `recv y, P, t` in Q)
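The send/receive matching above can be emulated in a few lines with threads and per-process mailboxes. This is a simplified sketch, not a real message-passing runtime: mailbox names, the one-mailbox-per-receiver scheme, and the assumption that messages arrive in tag order are all illustrative; real systems match on sender and tag among buffered messages.

```python
# Sketch: user-level send/recv with tag matching, emulated via queues.
import queue
import threading

mailboxes = {"P": queue.Queue(), "Q": queue.Queue()}  # one per process

def send(data, dest, tag):
    mailboxes[dest].put((tag, data))

def recv(own_box, tag):
    # Simplification: assume the next message carries the expected tag.
    got_tag, data = mailboxes[own_box].get()   # blocks until a message arrives
    assert got_tag == tag, "tag mismatch"
    return data

result = {}

def process_p():                      # "send x, Q, t"
    x = [1, 2, 3]
    send(x, dest="Q", tag=7)

def process_q():                      # "recv y, P, t"
    result["y"] = recv("Q", tag=7)

tp = threading.Thread(target=process_p)
tq = threading.Thread(target=process_q)
tq.start(); tp.start()
tp.join(); tq.join()
print(result["y"])  # [1, 2, 3]
```

Note that the receiver blocks until the matching send occurs: the communication itself is the synchronization, a point the later slides make explicit.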
Slide 58
Message Passing Architectures
• Cannot directly access memory of another node
• IBM SP-2, Intel Paragon, Myrinet, Quadrics QSW
• Cluster of workstations (e.g., MPI on the Flux cluster)
(diagram: four CPU($)/Mem/NIC nodes connected by an interconnect)
Slide 59
MPI: Message Passing Interface API
• A widely used standard
❒ For a variety of distributed-memory systems
• SMP clusters, workstation clusters, MPPs, heterogeneous systems
• Also works on shared-memory MPs
❒ Easy to emulate distributed memory on shared-memory HW
• Can be used with a number of high-level languages
• Available on the Flux cluster at Michigan
Slide 60
Processes and Threads
• Lots of flexibility (advantage of message passing):
1. Multiple threads sharing an address space
2. Multiple processes sharing an address space
3. Multiple processes with different address spaces
• and different OSes
• 1 and 2 easily implemented on shared-memory HW (with a single OS)
❒ Process and thread creation/management similar to shared memory
• 3 probably more common in practice
❒ Process creation often external to the execution environment, e.g. a shell script
❒ Hard for a user process on one system to create a process on another OS
Slide 61
Communication and Synchronization
• Combined in the message-passing paradigm
❒ Synchronization of messages is part of the communication semantics
• Point-to-point communication
❒ From one process to another
• Collective communication
❒ Involves groups of processes
❒ e.g., broadcast
Slide 62
Message Passing: Send()
• Send(<what>, <where-to>, <how>)
• What:
❒ A data structure or object in user space
❒ A buffer allocated from special memory
❒ A word or signal
• Where-to:
❒ A specific processor
❒ A set of specific processors
❒ A queue, dispatcher, scheduler
• How:
❒ Asynchronously vs. synchronously
❒ Typed
❒ In-order vs. out-of-order
❒ Prioritized
Slide 63
Message Passing: Receive()
• Receive(<data>, <info>, <what>, <how>)
• Data: mechanism to return message content
❒ A buffer allocated in the user process
❒ Memory allocated elsewhere
• Info: meta-info about the message
❒ Sender ID
❒ Type, size, priority
❒ Flow-control information
• What: receive only certain messages
❒ Sender ID, type, priority
• How:
❒ Blocking vs. non-blocking
Slide 64
Synchronous vs. Asynchronous
• Synchronous Send
❒ Stall until message has actually been received
❒ Implies a message acknowledgement from receiver to sender
• Synchronous Receive
❒ Stall until message has actually been received
• Asynchronous Send and Receive
❒ Sender and receiver can proceed regardless
❒ Returns a request handle that can be tested for message receipt
❒ Request handle can be tested to see if the message has been sent/received
Slide 65
Deadlock
• Blocking communications may deadlock
• Requires careful (safe) ordering of sends/receives

<Process 0>                        <Process 1>
Send(Process1, Message);           Receive(Process0, Message);
Receive(Process1, Message);        Send(Process0, Message);

<Process 0>                        <Process 1>
Send(Process1, Message);           Send(Process0, Message);
Receive(Process1, Message);        Receive(Process0, Message);
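The first ordering above is safe; the second can deadlock when sends are synchronous (each send stalls waiting for a receive that never executes). A minimal sketch of the safe ordering, with synchronous sends emulated by an acknowledgement queue (the `SyncChannel` class and message strings are illustrative, not an MPI API):

```python
# Sketch: safe send/receive ordering with blocking (synchronous) sends.
import queue
import threading

class SyncChannel:
    """Rendezvous channel: send() stalls until the receiver takes the
    message and acknowledges it, like a synchronous send."""
    def __init__(self):
        self.data = queue.Queue(maxsize=1)
        self.ack = queue.Queue(maxsize=1)
    def send(self, msg):
        self.data.put(msg)
        self.ack.get()           # stall until the matching recv completes
    def recv(self):
        msg = self.data.get()
        self.ack.put(None)
        return msg

c01, c10 = SyncChannel(), SyncChannel()   # one channel per direction
log = []

def process0():            # safe: send first, then receive
    c01.send("hello from 0")
    log.append(c10.recv())

def process1():            # complementary: receive first, then send
    log.append(c01.recv())
    c10.send("hello from 1")

t0 = threading.Thread(target=process0)
t1 = threading.Thread(target=process1)
t0.start(); t1.start()
t0.join(); t1.join()
print(log)   # ['hello from 0', 'hello from 1']
```

If both processes instead called `send` first (the second ordering on the slide), each `send` would stall forever in `ack.get()` waiting for a receive that the other process never reaches: a classic circular wait.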
Slide 66
Message Passing Paradigm Summary
Programming model (software) point of view:
• Disjoint, separate name spaces
• "Shared nothing"
• Communication via explicit, typed messages: send & receive
Slide 67
Message Passing Paradigm Summary
Computer engineering (hardware) point of view:
• Treat inter-process communication as an I/O device
• Critical issues:
❒ How to optimize API overhead
❒ Minimize communication latency
❒ Buffer management: how to deal with early/unsolicited messages, message typing, high-level flow control
❒ Event signaling & synchronization
❒ Library support for common functions (barrier synchronization, task distribution, scatter/gather, data-structure maintenance)
Slide 68
Shared Memory Programming Model
Slide 69
Shared-Memory Model
(diagram: P1–P4 connected to a single memory system)
❒ Multiple execution contexts sharing a single address space
❍ Multiple programs (MIMD)
❍ Or more frequently: multiple copies of one program (SPMD)
❒ Implicit (automatic) communication via loads and stores
❒ Theoretical foundation: PRAM model
Slide 70
Global Shared Physical Address Space
• Communication, sharing, synchronization via loads/stores to shared variables
• Facilities for address translation between local/global address spaces
• Requires OS support to maintain this mapping
(diagram: a common physical address space with a shared portion and private portions for processes P0 through Pn; a store by P0 to the shared portion is observed by a load from Pn)
Slide 71
Why Shared Memory?
Pluses:
❒ For applications, looks like a multitasking uniprocessor
❒ For the OS, only evolutionary extensions required
❒ Easy to do communication without the OS
❒ Software can worry about correctness first, then performance
Minuses:
❒ Proper synchronization is complex
❒ Communication is implicit, so harder to optimize
❒ Hardware designers must implement it
Result:
❒ Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
❒ And the first with multi-billion-dollar markets
Slide 72
Thread-Level Parallelism
• Thread-level parallelism (TLP)
❒ Collection of asynchronous tasks: not started and stopped together
❒ Data shared loosely, dynamically
• Example: database/web server (each query is a thread)
❒ accts is shared; can't register-allocate it even if it were scalar
❒ id and amt are private variables, register-allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
    accts[id].bal -= amt;
    spew_cash();
}

0: addi r1,accts,r3
1: ld   0(r3),r4
2: blt  r4,r2,6
3: sub  r4,r2,r4
4: st   r4,0(r3)
5: call spew_cash
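The danger in this example is the unprotected read-modify-write of the shared balance: two threads can both pass the `bal >= amt` check before either stores the decrement. A minimal sketch of the same account in Python threads, with a lock making the check-and-update atomic (the dict layout and amounts are illustrative):

```python
# Sketch: the shared-account example as threads. The lock makes the
# check + decrement atomic; without it, concurrent withdrawals race.
import threading

accts = {0: {"bal": 100}}       # shared data, like the accts array
lock = threading.Lock()

def withdraw(acct_id, amt):
    with lock:                  # mutual exclusion around check + update
        if accts[acct_id]["bal"] >= amt:
            accts[acct_id]["bal"] -= amt
            return True         # would call spew_cash() here
        return False

threads = [threading.Thread(target=withdraw, args=(0, 30)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(accts[0]["bal"])  # 10: exactly three of the four withdrawals succeed
```

With the lock removed, interleaved executions could let all four withdrawals pass the balance check against the stale value of 100, overdrawing the account: precisely the race the assembly listing exposes between the `ld` and the `st`.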
Slide 73
Synchronization
• Mutual exclusion: locks, …
• Order: barriers, signal-wait, …
• Implemented using read/write/modify to a shared location
❒ Language level:
❍ libraries (e.g., locks in pthreads)
❍ Programmers can write custom synchronizations
❒ Hardware ISA:
❍ e.g., test-and-set
• OS provides support for managing threads
❒ scheduling, fork, join, futex signal/wait
We'll cover synchronization in more detail in a few weeks.
![Page 74: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/74.jpg)
Lecture 2 Slide 74EECS 570
Paired vs. Separate Processor/Memory?• Separateprocessor/memory
❒ Uniformmemoryaccess(UMA):equallatencytoallmemory+ Simplesoftware,doesn’tmatterwhereyouputdata– Lowerpeakperformance❒ Bus-basedUMAscommon:symmetricmulti-processors(SMP)
• Pairedprocessor/memory❒ Non-uniformmemoryaccess(NUMA):fastertolocalmemory– Morecomplexsoftware:whereyouputdatamatters+ Higherpeakperformance:assumingproperdataplacement
[Figure: left, four CPU($) chips and four memories connected over a shared interconnect (separate processor/memory); right, four paired CPU($)+Mem nodes, each with a router (R), in a point-to-point network]
![Page 75: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/75.jpg)
Lecture 2 Slide 75 EECS 570

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
+ Low latency
– Low bandwidth: doesn't scale beyond ~16 processors
+ Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
– Longer latency: may need multiple "hops" to communicate
+ Higher bandwidth: scales to 1000s of processors
– Cache coherence protocols are complex
[Figure: left, four CPU($)+Mem nodes on a shared bus; right, four CPU($)+Mem nodes, each with a router (R), connected point-to-point]
![Page 76: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/76.jpg)
Lecture 2 Slide 76 EECS 570

Implementation #1: Snooping Bus MP
• Two basic implementations
• Bus-based systems
❒ Typically small: 2–8 (maybe 16) processors
❒ Typically processors split from memories (UMA)
❍ Sometimes multiple processors on a single chip (CMP)
❍ Symmetric multiprocessors (SMPs)
❍ Common, I use one every day
[Figure: four CPU($) chips and four memories connected by a shared bus]
![Page 77: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/77.jpg)
Lecture 2 Slide 77 EECS 570

Implementation #2: Scalable MP
• General point-to-point network-based systems
❒ Typically processor/memory/router blocks (NUMA)
❍ Glueless MP: no need for additional "glue" chips
❒ Can be arbitrarily large: 1000's of processors
❍ Massively parallel processors (MPPs)
❒ In reality only government (DoD) has MPPs…
❍ Companies have much smaller systems: 32–64 processors
❍ Scalable multi-processors
[Figure: four CPU($)+Mem nodes, each with a router (R), in a point-to-point network]
![Page 78: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/78.jpg)
Lecture 2 Slide 78 EECS 570

Cache Coherence
• Two $100 withdrawals from account #241 at two ATMs
❒ Each transaction maps to a thread on a different processor
❒ Track accts[241].bal (address is in r3)

Processor 0:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

Processor 1:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
[Figure: CPU0 and CPU1 sharing a single Mem]
![Page 79: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/79.jpg)
Lecture 2 Slide 79 EECS 570

No-Cache, No-Problem
• Scenario I: processors have no caches
❒ No problem

(both processors run the same withdrawal code as on the previous slide)

Memory value of accts[241].bal over time:
500  (initial balance; P0 loads 500)
400  (P0 stores 500 − 100 = 400)
400  (P1 loads 400)
300  (P1 stores 400 − 100 = 300)
![Page 80: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/80.jpg)
Lecture 2 Slide 80 EECS 570

Cache Incoherence
• Scenario II: processors have write-back caches
❒ Potentially 3 copies of accts[241].bal: memory, P0 $, P1 $
❒ Can get incoherent (inconsistent)

(both processors run the same withdrawal code as on the previous slide)

State of accts[241].bal over time (P0 $ / Mem / P1 $):
V:500   500   —      (P0 loads 500 into its cache)
D:400   500   —      (P0 stores 400; dirty in P0's cache, memory stale)
D:400   500   V:500  (P1 loads stale 500 from memory)
D:400   500   D:400  (P1 stores 400; one $100 withdrawal is lost)
![Page 81: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/81.jpg)
Lecture 2 Slide 81 EECS 570

Snooping Cache-Coherence Protocols
Bus provides serialization point
Each cache controller "snoops" all bus transactions
❒ take action to ensure coherence
❍ invalidate
❍ update
❍ supply value
❒ depends on state of the block and the protocol
![Page 82: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/82.jpg)
Lecture 2 Slide 82 EECS 570

Scalable Cache Coherence
• Scalable cache coherence: two-part solution
• Part I: bus bandwidth
❒ Replace non-scalable bandwidth substrate (bus)…
❒ …with scalable bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
❒ Interesting: most snoops result in no action
❒ Replace non-scalable broadcast protocol (spam everyone)…
❒ …with scalable directory protocol (only spam processors that care)
• We will cover this in Unit 3
![Page 83: EECS 570 Lecture 2 Message Passing & Shared Memory](https://reader034.vdocuments.site/reader034/viewer/2022042619/586a10f81a28ab432d8b5a18/html5/thumbnails/83.jpg)
Lecture 2 Slide 83 EECS 570

Shared Memory Summary
• Shared-memory multiprocessors
+ Simple software: easy data sharing, handles both DLP and TLP
– Complex hardware: must provide illusion of global address space
• Two basic implementations
❒ Symmetric (UMA) multi-processors (SMPs)
❍ Underlying communication network: bus (ordered)
+ Low-latency, simple protocols that rely on global order
– Low-bandwidth, poor scalability
❒ Scalable (NUMA) multi-processors (MPPs)
❍ Underlying communication network: point-to-point (unordered)
+ Scalable bandwidth
– Higher-latency, complex protocols