“OpenMP Does Not Scale”

Ruud van der Pas
Distinguished Engineer, Architecture and Performance
SPARC Microelectronics, Oracle
Santa Clara, CA, USA

Upload: dinhtuyen

Post on 13-Feb-2017


TRANSCRIPT

Page 1: “OpenMP Does Not Scale”

Ruud van der Pas
Distinguished Engineer, Architecture and Performance
SPARC Microelectronics, Oracle
Santa Clara, CA, USA

“OpenMP Does Not Scale”

Page 2: “OpenMP Does Not Scale”

Agenda
• The Myth
• Deep Trouble
• Get Real
• The Wrapping

Page 3: “OpenMP Does Not Scale”

The Myth

Page 4: “OpenMP Does Not Scale”

“A myth, in popular use, is something that is widely believed but false.”

(source: www.reference.com)

Page 5: “OpenMP Does Not Scale”

“OpenMP Does Not Scale”

A Common Myth

A Programming Model Cannot “Not Scale”

What Can Not Scale:
• The Implementation
• The System Versus The Resource Requirements
• Or ..... You

Page 6: “OpenMP Does Not Scale”

Hmmm .... What Does That Really Mean?

Page 7: “OpenMP Does Not Scale”

Some Questions I Could Ask

“Do you mean you wrote a parallel program, using OpenMP, and it doesn't perform?”

“I see. Did you make sure the program was fairly well optimized in sequential mode?”

Page 8: “OpenMP Does Not Scale”

Some Questions I Could Ask

“Oh. You just think it should, and you used all the cores. Have you estimated the speedup using Amdahl's Law?”

“Oh. You didn't. By the way, why do you expect the program to scale?”

“No, this law is not a new EU financial bailout plan. It is something else.”

Page 9: “OpenMP Does Not Scale”

Some Questions I Could Ask

“I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?”

“Oh. You didn't. Did you minimize the number of parallel regions then?”

“Oh. You didn't. It just worked fine the way it was.”

“Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?”

Page 10: “OpenMP Does Not Scale”

More Questions I Could Ask

“Did you at least use the nowait clause to minimize the use of barriers?”

“Oh. You've never heard of a barrier. Might be worth reading up on.”

“Do all threads roughly perform the same amount of work?”

“You don't know, but think it is okay. I hope you're right.”

Page 11: “OpenMP Does Not Scale”

I Don’t Give Up That Easily

“Did you make optimal use of private data, or did you share most of it?”

“Oh. You didn't. Sharing is just easier. I see.”

Page 12: “OpenMP Does Not Scale”

I Don’t Give Up That Easily

“You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?”

“Oh. Never heard of that either. May come in handy to learn a little more about both.”

“You seem to be using a cc-NUMA system. Did you take that into account?”

Page 13: “OpenMP Does Not Scale”

The Grass Is Always Greener ...

“So, what did you do next to address the performance?”

“Switched to MPI. I see. Does that perform any better then?”

“Oh. You don't know. You're still debugging the code.”

Page 14: “OpenMP Does Not Scale”

Going Into Pedantic Mode

“While you're waiting for your MPI debug run to finish (are you sure it doesn't hang, by the way?), please allow me to talk a little more about OpenMP and Performance.”

Page 15: “OpenMP Does Not Scale”

Deep Trouble

Page 16: “OpenMP Does Not Scale”

OpenMP And Performance / 1

• The transparency and ease of use of OpenMP are a mixed blessing
  → Makes things pretty easy
  → May mask performance bottlenecks
• In the ideal world, an OpenMP application “just performs well”
• Unfortunately, this is not always the case

Page 17: “OpenMP Does Not Scale”

OpenMP And Performance / 2

• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing
• Neither of these is restricted to OpenMP
  → They come with shared memory programming on modern cache based systems
  → But they might show up because you used OpenMP
  → In any case they are important enough to cover here

Page 18: “OpenMP Does Not Scale”

Considerations for cc-NUMA

Page 19: “OpenMP Does Not Scale”

A Generic cc-NUMA Architecture

[Diagram: two Processor + Memory pairs connected by a Cache Coherent Interconnect. Local access is fast; remote access is slower.]

Main Issue: How To Distribute The Data?

Page 20: “OpenMP Does Not Scale”

About Data Distribution

• Important aspect on cc-NUMA systems
  → If not optimal, longer memory access times and hotspots
• OpenMP 4.0 does provide support for cc-NUMA
  → Placement under control of the Operating System (OS)
  → User control through OMP_PLACES
• Windows, Linux and Solaris all use the “First Touch” placement policy by default
  → May be possible to override the default (check the docs)

Page 21: “OpenMP Does Not Scale”

Example First Touch Placement / 1

    for (i=0; i<10000; i++)
        a[i] = 0;

[Diagram: one processor's memory holds all of a[0] : a[9999]; the other memory holds none of it.]

First Touch: all array elements are in the memory of the processor executing this thread.

Page 22: “OpenMP Does Not Scale”

Example First Touch Placement / 2

    #pragma omp parallel for num_threads(2)
    for (i=0; i<10000; i++)
        a[i] = 0;

[Diagram: one memory holds a[0] : a[4999], the other a[5000] : a[9999].]

First Touch: both memories now have “their own half” of the array.

Page 23: “OpenMP Does Not Scale”

Get Real

Page 24: “OpenMP Does Not Scale”

“Don’t Try This At Home (yet)”

Page 25: “OpenMP Does Not Scale”

The Initial Performance (35 GB)

[Chart: SPARC T5-2 Performance (SCALE 26) — Billion Traversed Edges per Second (GTEPS, 0.000 to 0.050) versus number of threads (0 to 48).]

Game over beyond 16 threads

Page 26: “OpenMP Does Not Scale”

That doesn’t scale very well

Let’s use a bigger machine!

Page 27: “OpenMP Does Not Scale”

Initial Performance (35 GB)

[Chart: SPARC T5-2 and T5-8 Performance — GTEPS (0.000 to 0.050) versus number of threads (0 to 48); series: T5-2 SCALE 26, T5-8 SCALE 26.]

Page 28: “OpenMP Does Not Scale”

Oops! That can’t be true

Let’s run a larger graph!

Page 29: “OpenMP Does Not Scale”

Initial Performance (280 GB)

[Chart: SPARC T5-2 and T5-8 Performance — GTEPS (0.000 to 0.050) versus number of threads (0 to 48); series: T5-2 SCALE 29, T5-8 SCALE 29.]

Page 30: “OpenMP Does Not Scale”

Let’s Get Technical

Page 31: “OpenMP Does Not Scale”

Total CPU Time Distribution

[Stacked bar chart: Total CPU Time Percentage Distribution (Baseline, SCALE 26) for 1, 2, 4, 8, 16 and 20 threads. Categories: Atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, Other, make_bfs_tree (Function 1), verify_bfs_tree (Function 2). The OpenMP wait categories take a growing share as threads are added.]

Page 32: “OpenMP Does Not Scale”

Bandwidth Of The Original Code

[Chart: SPARC T5-2 Measured Bandwidth (BASE, SCALE 28, 16 threads) — Bandwidth in GB/s (0 to 80) versus elapsed time in seconds (0 to 9000); series: Total, Read (Socket 0), Read (Socket 1), Write (Socket 0), Write (Socket 1).]

Less than half of the memory bandwidth is used

Page 33: “OpenMP Does Not Scale”

Summary Original Version

• Communication costs are too high
  – Increases as threads are added
  – This seriously limits the number of threads used
  – This in turn affects memory access on larger graphs
• The bandwidth is not balanced
• Fixes:
  – Find and fix many OpenMP inefficiencies
  – Use some efficient atomic functions

Page 34: “OpenMP Does Not Scale”

Methodology

If the code does not scale well:
• Use a profiling tool
• Use the checklist to identify bottlenecks
• Tackle them one by one

This is an incremental approach, but very rewarding.

Page 35: “OpenMP Does Not Scale”

BO: Secret Sauce

Page 36: “OpenMP Does Not Scale”

Comparison Of The Two Versions

[Timeline comparison of thread activity in the two versions, showing atomic/critical section wait, implicit barrier, and idle state.]

Note the much shorter runtime for the modified version

Page 37: “OpenMP Does Not Scale”

Performance Comparison

[Chart: SPARC T5-2 Performance (SCALE 29) — GTEPS (0.00 to 0.70) versus number of threads (0 to 256); series: T5-2 Baseline, T5-2 OPT1.]

Peak performance is 13.7x higher

Page 38: “OpenMP Does Not Scale”

MO: More Secret Sauce

Page 39: “OpenMP Does Not Scale”

Observations

• The code does not exploit large pages, but needs them
• First Touch placement is not used
• Used a smarter memory allocator

Page 40: “OpenMP Does Not Scale”

Bandwidth Of The New Code

[Chart: SPARC T5-2 Measured Bandwidth (OPT2, SCALE 29, 224 threads) — Bandwidth in GB/s (0 to 150) versus elapsed time in seconds (0 to 4000); series: Total BW, Read, Write.]

Maximum Read Bandwidth: 135 GB/s

Page 41: “OpenMP Does Not Scale”

The Result

[Bar chart: Billion Searched Edges Per Second (GTEPS):

    Version           SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
    T5-8 OPT0 (BO)          0.04                0.03                 0.03
    T5-8 OPT1 (BO)          1.13                0.79                 0.89
    T5-8 OPT2 (MO)          1.47                1.73                 1.67
]

39-52x improvement over the original code

Page 42: “OpenMP Does Not Scale”

Bigger Is Definitely Better!

[Chart: SPARC T5-8 Graph Search Performance (SCALE 30 — memory requirements: 580 GB) — elapsed time in minutes (0 to 800) and speedup (0 to 80) versus number of threads (1 to 512); series: Search Time (minutes), Speed Up.]

72x speedup!

Search time reduced from 12 hours to 10 minutes

Page 43: “OpenMP Does Not Scale”

A 2.3 TB Sized Problem

[Chart: SPARC T5-8 (SCALE 32, OPT2) — GTEPS (0.00 to 1.60) versus number of threads (0 to 1024).]

896 threads!

Even starting at 32 threads, the speedup is still 11x

Page 44: “OpenMP Does Not Scale”

Tuning Benefit Breakdown

[Chart: SPARC T5-8 Speed Up Over the OPT0 Version (0 to 55x) versus SCALE value (26 to 31); series: OPT1, OPT2.]

Somewhat diminishing return

Bigger is better

Page 45: “OpenMP Does Not Scale”

MOBO: Different Secret Sauce

Page 46: “OpenMP Does Not Scale”

A Simple OpenMP Change

[Bar chart: Billion Searched Edges Per Second (GTEPS):

    Version             SCALE 29 (282 GB)   SCALE 30 (580 GB)   SCALE 31 (1150 GB)
    T5-8 OPT0 (BO)            0.04                0.03                 0.03
    T5-8 OPT1 (BO)            1.13                0.79                 0.89
    T5-8 OPT2 (MO)            1.47                1.73                 1.67
    T5-8 OPT2 (MOBO)          2.15                2.43                 2.42
]

57-75x improvement

Page 47: “OpenMP Does Not Scale”

“I Value My Personal Space”

Page 48: “OpenMP Does Not Scale”

My Favorite Simple Algorithm

    void mxv(int m, int n, double *a, double *b[], double *c)
    {
        for (int i=0; i<m; i++) {
            double sum = 0.0;
            for (int j=0; j<n; j++)
                sum += b[i][j]*c[j];
            a[i] = sum;
        }
    }

[Diagram: a = B * c; the outer loop over i is the parallel loop.]

Page 49: “OpenMP Does Not Scale”

The OpenMP Source

    #pragma omp parallel for default(none) \
            shared(m,n,a,b,c)
    for (int i=0; i<m; i++) {
        double sum = 0.0;
        for (int j=0; j<n; j++)
            sum += b[i][j]*c[j];
        a[i] = sum;
    }

Page 50: “OpenMP Does Not Scale”

Performance On Intel Nehalem

[Chart: Performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale, 0.25 to 262144); series: 1x1, 2x1, 4x1, 8x1 and 8x2 threads.]

Max speedup is ~1.6x only

Wait a minute! This algorithm is highly parallel ....

Notation: number of cores x number of threads within a core
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

Page 51: “OpenMP Does Not Scale”


?

Page 52: “OpenMP Does Not Scale”

Let's Get Technical

Page 53: “OpenMP Does Not Scale”

A Two Socket Nehalem System

[Diagram: two sockets connected via the QPI interconnect. Each socket has its own memory and shared cache, and four cores; each core has its own caches and two hardware threads.]

Page 54: “OpenMP Does Not Scale”

Data Initialization Revisited

    #pragma omp parallel default(none) \
            shared(m,n,a,b,c) private(i,j)
    {
        #pragma omp for
        for (j=0; j<n; j++)
            c[j] = 1.0;

        #pragma omp for
        for (i=0; i<m; i++) {
            a[i] = -1957.0;
            for (j=0; j<n; j++)
                b[i][j] = i;
        } /*-- End of omp for --*/
    } /*-- End of parallel region --*/

Page 55: “OpenMP Does Not Scale”

Data Placement Matters!

[Chart: Performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale); series: 1x1, 2x1, 4x1, 8x1 and 8x2 threads.]

The only change is the way the data is distributed

Max speedup is ~3.3x now

Notation: number of cores x number of threads within a core
System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

Page 56: “OpenMP Does Not Scale”

A Two Socket SPARC T4-2 System

[Diagram: two sockets connected via the system interconnect. Each socket has its own memory and shared cache, and eight cores; each core has its own caches and eight hardware threads.]

Page 57: “OpenMP Does Not Scale”

Performance On SPARC T4-2

[Chart: Performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale); series: 1x1, 8x1, 16x1 and 16x2 threads.]

Max speedup is ~5.8x

Scaling on larger matrices is affected by cc-NUMA effects (similar as on Nehalem)

Note that there are no idle cycles to fill here

Notation: number of cores x number of threads within a core
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

Page 58: “OpenMP Does Not Scale”

Data Placement Matters!

[Chart: Performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale); series: 1x1, 8x1, 16x1 and 16x2 threads.]

Max speedup is ~11.3x

The only change is the way the data is distributed

Notation: number of cores x number of threads within a core
System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

Page 59: “OpenMP Does Not Scale”

Summary Matrix Times Vector

[Chart: Performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale); series: Nehalem OMP 16 threads, Nehalem OMP-FT 16 threads (1.9x), T4-2 OMP 16 threads, T4-2 OMP-FT 16 threads (2.1x).]

Page 60: “OpenMP Does Not Scale”

The Wrapping

Page 61: “OpenMP Does Not Scale”

Wrapping Things Up

“While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful.”

“Yes, it is overwhelming. I know.”

“And OpenMP is somewhat obscure in certain areas. I know that as well.”

Page 62: “OpenMP Does Not Scale”

Wrapping Things Up

“I understand. You're not a Computer Scientist and just need to get your scientific research done.”

“I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there.”

Page 63: “OpenMP Does Not Scale”

It Never Ends

“Oh, your MPI job just finished! Great.”

“Your program does not write a file called 'core' and it wasn't there when you started the program?”

“You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first.”

“WAIT! What did you just say?”

Page 64: “OpenMP Does Not Scale”

It Really Never Ends

“Somebody told you WHAT??”

“You think GPUs and OpenCL will solve all your problems?”

“Let's make that an XL Triple Espresso. I’ll buy.”

Page 65: “OpenMP Does Not Scale”

Thank You And ..... Stay Tuned!

[email protected]

Page 66: “OpenMP Does Not Scale”

Ruud van der Pas
Distinguished Engineer, Architecture and Performance
SPARC Microelectronics, Oracle
Santa Clara, CA, USA

DTrace: Why It Can Be Good For You

Page 67: “OpenMP Does Not Scale”

Motivation

• DTrace is a Dynamic Tracing Tool
  → Supported on Solaris, Mac OS X and some Linux flavours
• Monitors the Operating System (OS)
• Through “probes”, the user can see what the OS is doing
• Main target: OS related performance issues
• Surprisingly, it can also greatly help to find out what your application is doing though *

*) A regular profiling tool should be used first

Page 68: “OpenMP Does Not Scale”

How DTrace Works

• A DTrace probe is written by the user
• When the probe “fires”, the code in the probe is executed
• The probes are based on “providers”
• The providers are pre-defined
  → Example: “sched” provider to get info from the scheduler
  → You can also instrument your own code to have DTrace probes, but there is little need for that

Probe name format:

    provider:module:function:name

Page 69: “OpenMP Does Not Scale”

Example – Thread Affinity / 1

    sched:::off-cpu
    / pid == $target && self->on_cpu == 1 /
    {
        self->time_delta = (timestamp - ts_base)/1000;

        @thread_off_cpu [tid-1]     = count();
        @total_thr_state[probename] = count();

        printf("Event %8u %4u %6u %6u %-16s %8s\n",
            self->time_delta, tid-1, curcpu->cpu_id,
            curcpu->cpu_lgrp, probename, probefunc);

        self->on_cpu = 0;
        self->time_delta = 0;
    }

(The / ... / clause is the “predicate”: the probe body only runs when it evaluates to true.)

Page 70: “OpenMP Does Not Scale”

Example – Thread Affinity / 2

    pid$target::processor_bind:entry
    / (processorid_t) arg2 >= 0 /
    {
        self->time_delta = (timestamp - ts_base)/1000;
        self->target_processor = (processorid_t) arg2;

        @proc_bind_info[tid-1, curcpu->cpu_id,
                        self->target_processor] = count();

        printf("Event %8u %4u %6u %6u %9s/%-6d %8s\n",
            self->time_delta, tid-1, curcpu->cpu_id,
            curcpu->cpu_lgrp, "proc_bind",
            self->target_processor, probename);

        self->time_delta = 0;
        self->target_processor = 0;
    }

Page 71: “OpenMP Does Not Scale”

Example – Example Code

    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("Single region executed by thread %d\n",
                   omp_get_thread_num());
        } // End of single region
    } // End of parallel region

    export OMP_NUM_THREADS=4
    export OMP_PLACES=cores
    export OMP_PROC_BIND=close
    ./affinity.d -c ./omp-par.exe

Page 72: “OpenMP Does Not Scale”

Example – Result

    ===========================================================
                       Affinity Statistics
    ===========================================================

    Thread   On HW Thread   Lgroup   Created Thread   Count
       0         787           7           1            1
       0         787           7           2            1
       0         787           7           3            1

    Thread   Running on HW Thread   Bound to HW Thread
       0             787                   784
       1             771                   792
       2             848                   800
       3             813                   808

Page 73: “OpenMP Does Not Scale”

Thank You And ..... Stay Tuned!

[email protected]