“OpenMP Does Not Scale”
TRANSCRIPT
Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
SantaClara,CA,USA
“OpenMP Does Not Scale”
Agenda
• The Myth
• Deep Trouble
• Get Real
• The Wrapping
The Myth
“A myth, in popular use, is something that is widely believed but false.” ......
(source: www.reference.com)
“OpenMP Does Not Scale”
A Common Myth
A Programming Model Cannot “Not Scale”
What Cannot Scale:
• The Implementation
• The System Versus The Resource Requirements
• Or ..... You
Hmmm .... What Does That Really Mean?
Some Questions I Could Ask
“Do you mean you wrote a parallel program, using OpenMP, and it doesn't perform?”
“I see. Did you make sure the program was fairly well optimized in sequential mode?”
Some Questions I Could Ask
“Oh. You just think it should and you used all the cores. Have you estimated the speedup using Amdahl's Law?”
“Oh. You didn't. By the way, why do you expect the program to scale?”
“No, this law is not a new EU financial bailout plan. It is something else.”
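(For reference, and not part of the original exchange: Amdahl's Law says that if a fraction p of the run time is parallelized, the speedup on N threads is bounded by S(N) = 1 / ((1 - p) + p/N). With p = 0.8, for example, the speedup can never exceed 5x, however many cores are used.)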
Some Questions I Could Ask
“I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?”
“Oh. You didn't. Did you minimize the number of parallel regions then?”
“Oh. You didn't. It just worked fine the way it was.”
“Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?”
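As a minimal sketch of that last point (an illustrative fragment, not the questioner's code; assume a is an m x n array): parallelizing the outer loop of a nest pays the parallel-region overhead once, instead of once per iteration of the enclosing loop.

   /* Costly: a parallel region is started m times */
   for (int i = 0; i < m; i++)
   {
      #pragma omp parallel for
      for (int j = 0; j < n; j++)
         a[i][j] = 0.0;
   }

   /* Better: one parallel region, threads split the outer loop */
   #pragma omp parallel for
   for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
         a[i][j] = 0.0;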
More Questions I Could Ask
“Did you at least use the nowait clause to minimize the use of barriers?”
“Oh. You've never heard of a barrier. Might be worth reading up on.”
“Do all threads roughly perform the same amount of work?”
“You don't know, but think it is okay. I hope you're right.”
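A minimal sketch of the nowait idea (illustrative fragment; a, b, n and m are assumed): when two worksharing loops in one parallel region work on independent data, the implicit barrier after the first loop can be removed.

   #pragma omp parallel
   {
      #pragma omp for nowait        /* no barrier after this loop */
      for (int i = 0; i < n; i++)
         a[i] = 1.0;

      #pragma omp for               /* implicit barrier only here */
      for (int i = 0; i < m; i++)
         b[i] = 2.0;
   }

This is only safe because the second loop does not depend on the results of the first.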
I Don’t Give Up That Easily
“Did you make optimal use of private data, or did you share most of it?”
“Oh. You didn't. Sharing is just easier. I see.”
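A small illustration of the private-data point (a hypothetical fragment, not the questioner's code): thread-local scratch variables should be declared private, or simply defined inside the loop body, instead of being shared by all threads.

   double scratch;
   #pragma omp parallel for private(scratch) shared(a, b, n)
   for (int i = 0; i < n; i++)
   {
      scratch = 2.0 * b[i];   /* every thread works on its own copy */
      a[i]    = scratch + 1.0;
   }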
I Don’t Give Up That Easily
“You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?”
“Oh. Never heard of that either. May come in handy to learn a little more about both.”
“You seem to be using a cc-NUMA system. Did you take that into account?”
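To make the false-sharing remark concrete, here is a small sketch (hypothetical code; MAX_THREADS, a and n are placeholders): per-thread partial sums stored in adjacent array elements sit on the same cache line, so each update invalidates that line in the other threads' caches. Padding each element to a full cache line avoids this; in OpenMP the cleaner fix is usually a reduction.

   /* Prone to false sharing: neighbouring sums share a cache line */
   double sums[MAX_THREADS];

   /* One fix: pad every partial sum to its own 64-byte cache line */
   struct padded_sum { double value; char pad[64 - sizeof(double)]; };
   struct padded_sum psums[MAX_THREADS];

   /* The idiomatic OpenMP alternative: let a reduction do the work */
   double total = 0.0;
   #pragma omp parallel for reduction(+:total)
   for (int i = 0; i < n; i++)
      total += a[i];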
The Grass Is Always Greener ...
“So, what did you do next to address the performance?”
“Switched to MPI. I see. Does that perform any better then?”
“Oh. You don't know. You're still debugging the code.”
Going Into Pedantic Mode
“While you're waiting for your MPI debug run to finish (are you sure it doesn't hang, by the way?), please allow me to talk a little more about OpenMP and Performance.”
Deep Trouble
OpenMP And Performance /1
• The transparency and ease of use of OpenMP are a mixed blessing
  → Makes things pretty easy
  → May mask performance bottlenecks
• In the ideal world, an OpenMP application “just performs well”
• Unfortunately, this is not always the case
OpenMP And Performance /2
• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing
• Neither of these is restricted to OpenMP
  → They come with shared memory programming on modern cache based systems
  → But they might show up because you used OpenMP
  → In any case they are important enough to cover here
Considerations for cc-NUMA
A Generic cc-NUMA Architecture
[Diagram: two processors, each with local memory, connected by a cache coherent interconnect; local access is fast, remote access is slower]
Main Issue: How To Distribute The Data?
About Data Distribution
• Important aspect on cc-NUMA systems
  → If not optimal, longer memory access times and hotspots
• OpenMP 4.0 does provide support for cc-NUMA
  → Placement under control of the Operating System (OS)
  → User control through OMP_PLACES
• Windows, Linux and Solaris all use the “First Touch” placement policy by default
  → May be possible to override the default (check the docs)
Example First Touch Placement /1

   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: two processors with local memories connected by a cache coherent interconnect; a[0] : a[9999] all reside in the memory of the processor running the initializing thread]
First Touch: all array elements are in the memory of the processor executing this thread
Example First Touch Placement /2

   #pragma omp parallel for num_threads(2)
   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: a[0] : a[4999] reside in the memory of one processor, a[5000] : a[9999] in the memory of the other]
First Touch: both memories now have “their own half” of the array
Get Real
“Don’t Try This At Home (yet)”
The Initial Performance (35 GB)
[Chart: SPARC T5-2 Performance (SCALE 26), Billion Traversed Edges per Second (GTEPS) versus number of threads, up to 48]
Game over beyond 16 threads
That doesn’t scale very well
Let’s use a bigger machine!
Initial Performance (35 GB)
[Chart: SPARC T5-2 and T5-8 Performance, GTEPS versus number of threads; T5-2 SCALE 26 and T5-8 SCALE 26]
Oops! That can’t be true
Let’s run a larger graph!
Initial Performance (280 GB)
[Chart: SPARC T5-2 and T5-8 Performance, GTEPS versus number of threads; T5-2 SCALE 29 and T5-8 SCALE 29]
Let’s Get Technical
Total CPU Time Distribution
[Stacked bar chart: Total CPU Time percentage distribution (Baseline, SCALE 26) at 1, 2, 4, 8, 16 and 20 threads; categories: atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, other, make_bfs_tree, verify_bfs_tree]
0"
10"
20"
30"
40"
50"
60"
70"
80"
0" 750" 1500" 2250" 3000" 3750" 4500" 5250" 6000" 6750" 7500" 8250" 9000"
Band
width"(G
B/s)"
Elapsed"Time"(seconds)"
SPARC"T5F2"Measured"Bandwidth"(BASE,"SCALE"28,"16"threads)""
Total" Read"(Socket"0)" Read"(Socket"1)" Write"(Socket"0)" Write"(Socket"1)"
BandwidthOfTheOriginalCode
39
Lessthanhalfofthememorybandwidthisused
40
Summary Original Version
• Communication costs are too high
  → Increase as threads are added
  → This seriously limits the number of threads used
  → This in turn affects memory access on larger graphs
• The bandwidth is not balanced
• Fixes:
  → Find and fix many OpenMP inefficiencies
  → Use some efficient atomic functions (see the sketch below)
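As a rough illustration of the atomic-functions fix (a generic sketch, not the actual Graph 500 code): replacing a critical section around a simple update with an atomic update removes a heavily contended lock; the real code additionally relies on efficient platform atomic functions.

   /* Original style: all threads serialize on one lock */
   #pragma omp critical
   {
      counter += 1;
   }

   /* Cheaper: a hardware-supported atomic update */
   #pragma omp atomic
   counter += 1;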
Methodology
If The Code Does Not Scale Well:
• Use A Profiling Tool
• Use The Checklist To Identify Bottlenecks
• Tackle Them One By One
This Is An Incremental Approach, But Very Rewarding
BO: Secret Sauce
Comparison Of The Two Versions
[Timeline view comparing the two versions, colored by atomic/critical section wait, implicit barrier and idle state]
Note the much shorter run time for the modified version
Performance Comparison
[Chart: SPARC T5-2 Performance (SCALE 29), GTEPS versus number of threads, up to 256; T5-2 Baseline versus T5-2 OPT1]
Peak performance is 13.7x higher
MO: More Secret Sauce
Observations
• The Code Does Not Exploit Large Pages, But Needs It ....
• First Touch Placement Is Not Used
• Used A Smarter Memory Allocator
0"
25"
50"
75"
100"
125"
150"
0" 400" 800" 1200" 1600" 2000" 2400" 2800" 3200" 3600" 4000"
Band
width"(G
B/s)"
Elasped"Time"(seconds)"
SPARC"T5E2"Measured"Bandwidth"(OPT2,"SCALE"29,"224"threads)""
Total"BW"(GB/s)" Read"(GBs/s)" Write"(GB/s)"
BandwidthOfTheNewCode
47
MaximumReadBandwidth135GB/s
48
The Result
[Bar chart: Billion Searched Edges per Second (GTEPS) on SPARC T5-8]

                     SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
T5-8 OPT0 (BO)             0.04                0.03                 0.03
T5-8 OPT1 (BO)             1.13                0.79                 0.89
T5-8 OPT2 (MO)             1.47                1.73                 1.67

39-52x improvement over original code
0"
10"
20"
30"
40"
50"
60"
70"
80"
0"
100"
200"
300"
400"
500"
600"
700"
800"
1" 2" 4" 8" 16" 32" 64" 128" 256" 384" 512"
Elap
sed"Time"(m
inutes)"
Number"of"threads"
SPARC"T5E8"Graph"Search"Performance""(SCALE"30"E"Memory"Requirements:"580"GB)"
Search"Time"(minutes)" Speed"Up"
BiggerIsDefinitelyBe[er!
49
72xspeedup!
SearchDmereducedfrom12hoursto10
minutes
50
A 2.3 TB Sized Problem
[Chart: SPARC T5-8 (SCALE 32, OPT2), GTEPS versus number of threads, up to 1024]
896 Threads!
Even starting at 32 threads the speedup is still 11x
0"5"
10"15"20"25"30"35"40"45"50"55"
26" 27" 28" 29" 30" 31"
Speed"Up"Over"T
he"OPT
0"Ve
rsion"
SCALE"Value"
SPARC"T5D8"Speed"Up"Over"OPT0"
OPT1" OPT2"
TuningBenefitBreakdown
51
Somewhatdiminishingreturn
Biggerisbe[er
5252
MOBO: Different Secret Sauce
A Simple OpenMP Change
[Bar chart: Billion Searched Edges per Second (GTEPS) on SPARC T5-8]

                     SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
T5-8 OPT0 (BO)             0.04                0.03                 0.03
T5-8 OPT1 (BO)             1.13                0.79                 0.89
T5-8 OPT2 (MO)             1.47                1.73                 1.67
T5-8 OPT2 (MOBO)           2.15                2.43                 2.42

57-75x improvement
“I Value My Personal Space”
My Favorite Simple Algorithm

   void mxv(int m, int n, double *a, double *b[], double *c)
   {
      for (int i=0; i<m; i++)       /* parallel loop */
      {
         double sum = 0.0;
         for (int j=0; j<n; j++)
            sum += b[i][j]*c[j];
         a[i] = sum;
      }
   }

[Diagram: a = B * c]
The OpenMP Source

   #pragma omp parallel for default(none) \
           shared(m,n,a,b,c)
   for (int i=0; i<m; i++)
   {
      double sum = 0.0;
      for (int j=0; j<n; j++)
         sum += b[i][j]*c[j];
      a[i] = sum;
   }

[Diagram: a = B * c]
0"
5,000"
10,000"
15,000"
20,000"
25,000"
30,000"
35,000"
0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"
Performan
ce"in"M
flop/s"
Memory"Footprint"(KByte,"logD2"scale)"
1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"
PerformanceOnIntelNehalem
Maxspeedupis~1.6xonly
Waitaminute!Thisalgorithmishighly
parallel....
NotaDon:Numberofcoresxnumberofthreadswithincore
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz
58
?
Let's Get Technical
A Two Socket Nehalem System
[Diagram: two sockets connected through the QPI interconnect; each socket has four cores (two hardware threads per core) with per-core caches, a shared cache and local memory]
Data Initialization Revisited

   #pragma omp parallel default(none) \
           shared(m,n,a,b,c) private(i,j)
   {
      #pragma omp for
      for (j=0; j<n; j++)
         c[j] = 1.0;

      #pragma omp for
      for (i=0; i<m; i++)
      {
         a[i] = -1957.0;
         for (j=0; j<n; j++)
            b[i][j] = i;
      } /*-- End of omp for --*/

   } /*-- End of parallel region --*/

[Diagram: a = B * c]
0"
5,000"
10,000"
15,000"
20,000"
25,000"
30,000"
35,000"
0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"
Performan
ce"in"M
flop/s"
Memory"Footprint"(KByte,"logD2"scale)"
1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"
DataPlacementMa[ers!
Theonlychangeisthewaythedataisdistributed
Maxspeedupis~3.3xnow
NotaDon:Numberofcoresxnumberofthreadswithincore
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz
63
A Two Socket SPARC T4-2 System
[Diagram: two sockets connected through the system interconnect; each socket has eight cores (eight hardware threads per core) with per-core caches, a shared cache and local memory]
0"
5,000"
10,000"
15,000"
20,000"
25,000"
30,000"
35,000"
0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"
Performan
ce"in"M
flop/s"
Memory"Footprint"(KByte,"logD2"scale)"
1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"
PerformanceOnSPARCT4-2
Maxspeedupis~5.8x
Scalingonlargermatricesisaffectedbycc-NUMAeffects(similarasonNehalem)
Notethattherearenoidlecyclestofillhere
NotaDon:Numberofcoresxnumberofthreadswithincore
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz
65
0"
5,000"
10,000"
15,000"
20,000"
25,000"
30,000"
35,000"
0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"
Performan
ce"in"M
flop/s"
Memory"Footprint"(KByte,"logD2"scale)"
1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"
DataPlacementMa[ers!
Maxspeedupis~11.3x
Theonlychangeisthewaythedataisdistributed
NotaDon:Numberofcoresxnumberofthreadswithincore
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz
66
Summary Matrix Times Vector
[Chart: performance in Mflop/s versus memory footprint (KByte, log-2 scale); Nehalem OMP 16 threads, Nehalem OMP-FT 16 threads, T4-2 OMP 16 threads, T4-2 OMP-FT 16 threads; annotated gains of 1.9x and 2.1x]
The Wrapping
Wrapping Things Up
“While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful.”
“Yes, it is overwhelming. I know.”
“And OpenMP is somewhat obscure in certain areas. I know that as well.”
Wrapping Things Up
“I understand. You're not a Computer Scientist and just need to get your scientific research done.”
“I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there.”
It Never Ends
“Oh, your MPI job just finished! Great.”
“Your program does not write a file called 'core' and it wasn't there when you started the program?”
“You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first.”
“WAIT! What did you just say?”
It Really Never Ends
“Somebody told you WHAT??”
“You think GPUs and OpenCL will solve all your problems?”
“Let's make that an XL Triple Espresso. I'll buy.”
Thank You And ..... Stay Tuned ! [email protected]
Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
SantaClara,CA,USA
DTrace: Why It Can Be Good For You
Motivation
• DTrace is a Dynamic Tracing Tool
  → Supported on Solaris, Mac OS X and some Linux flavours
• Monitors the Operating System (OS)
• Through “probes”, the user can see what the OS is doing
• Main target: OS related performance issues
• Surprisingly, it can also greatly help to find out what your application is doing though *

*) A regular profiling tool should be used first
How DTrace Works
• A DTrace probe is written by the user
• When the probe “fires”, the code in the probe is executed
• The probes are based on “providers”
• The providers are pre-defined
  → Example: the “sched” provider to get info from the scheduler
  → You can also instrument your own code to have DTrace probes, but there is little need for that

   provider:module:function:name
Example – Thread Affinity /1

   sched:::off-cpu
   / pid == $target && self->on_cpu == 1 /      /* the "predicate" */
   {
      self->time_delta = (timestamp - ts_base)/1000;

      @thread_off_cpu [tid-1]      = count();
      @total_thr_state[probename]  = count();

      printf("Event %8u %4u %6u %6u %-16s %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id,
             curcpu->cpu_lgrp, probename, probefunc);

      self->on_cpu     = 0;
      self->time_delta = 0;
   }
Example – Thread Affinity /2

   pid$target::processor_bind:entry
   / (processorid_t) arg2 >= 0 /
   {
      self->time_delta       = (timestamp - ts_base)/1000;
      self->target_processor = (processorid_t) arg2;

      @proc_bind_info[tid-1, curcpu->cpu_id, self->target_processor] = count();

      printf("Event %8u %4u %6u %6u %9s/%-6d %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id, curcpu->cpu_lgrp,
             "proc_bind", self->target_processor, probename);

      self->time_delta       = 0;
      self->target_processor = 0;
   }
Example – Example Code

   #pragma omp parallel
   {
      #pragma omp single
      {
         printf("Single region executed by thread %d\n",
                omp_get_thread_num());
      } // End of single region
   } // End of parallel region

   export OMP_NUM_THREADS=4
   export OMP_PLACES=cores
   export OMP_PROC_BIND=close
   ./affinity.d -c ./omp-par.exe
Example – Result

   ===========================================================
                      Affinity Statistics
   ===========================================================

   Thread   On HW Thread   Lgroup   Created Thread   Count
        0            787        7                1       1
        0            787        7                2       1
        0            787        7                3       1

   Thread   Running on HW Thread   Bound to HW Thread
        0                    787                  784
        1                    771                  792
        2                    848                  800
        3                    813                  808
Thank You And ..... Stay Tuned ! [email protected]