HPC Performance Profiling using Intel VTune Amplifier XE
Thanh Phung, SSG/DPD/TCAR, thanh.phung@intel.com
Dmitry Prohorov, VTune HPC Lead

Agenda
• Intel Parallel Studio XE - An Introduction
• VTune Amplifier XE: 2016 U4, 2017 U1 and U2
  - Analysis Configuration and Workflow
  - VTune Performance Metrics:
    - Memory Access analysis
    - Micro-arch analysis with General Exploration
    - Advanced Hotspots
    - Performance Overview with HPC Performance Characterization

Intel Parallel Studio XE: An Introduction

Intel® Parallel Studio XE (Linux, Windows, macOS)
• Intel® C/C++ & Fortran Compilers
• Intel® Math Kernel Library: Optimized Routines for Science, Engineering & Financial
• Intel® Data Analytics Acceleration Library: Optimized for Data Analytics & Machine Learning
• Intel® MPI Library
• Intel® Threading Building Blocks: Task Based Parallel C++ Template Library
• Intel® Integrated Performance Primitives: Image, Signal & Compression Routines
• Intel® VTune™ Amplifier: Performance Profiler
• Intel® Advisor: Threading & Vectorization Architecture
• Intel® Trace Analyzer & Collector: MPI Profiler
• Intel® Inspector: Memory & Threading Checking
• Intel® Distribution for Python: Performance Scripting - Coming Soon - Q3'16
[Diagram group labels: Profiling, Analysis & Architecture; Performance Libraries; Cluster Tools]

Optimizing Workload Performance - It's an iterative process…
[Flowchart: decision points "Cluster Scalable?", "Memory Bandwidth Sensitive?" and "Effective threading?" with Y/N branches leading to the actions Tune MPI, Optimize Bandwidth, Thread and Vectorize. Note on the cluster step: ignore if you are not targeting clusters.]

Intel Parallel Studio XE: A complete tool suite for code and HW performance characterization
[Flowchart: decision points "Cluster Scalable?", "Memory Bandwidth Sensitive?" and "Effective threading?" with Y/N branches leading to the actions Tune MPI, Optimize Bandwidth, Thread and Vectorize, annotated with the tools Intel® Trace Analyzer & Collector (ITAC), Intel® MPI Snapshot, Intel® MPI Tuner, Intel® VTune™ Amplifier (shown twice) and Intel® Advisor.]

VTune: System Configuration - Prerequisites for HW EBS event based collections
• VTune on KNL works with SEP driver (recommended) + PIN or upon perf
  - Related to: Advanced Hotspots, Memory Access, General Exploration, HPC Performance Characterization, custom event analysis
• Perf-based collection limitations:
  - Memory Access analysis is not enabled with perf
  - To collect General Exploration, increase the default limit of opened file descriptors. In /etc/security/limits.conf increase the default number to 100 * <number_of_logic_CPU_cores>:
      <user> hard nofile <100 * number_of_logic_CPU_cores>
      <user> soft nofile <100 * number_of_logic_CPU_cores>
  - To enable system-wide collections and uncore event collections, set:
      > echo 0 > /proc/sys/kernel/perf_event_paranoid
• Default sampling interval on KNL is 10 ms
• EMON driver for counter mode

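The file-descriptor rule on this slide (100 times the number of logical CPU cores) can be turned into limits.conf entries with a few lines of Python. This is only a minimal sketch: the helper name `nofile_limit_lines` and the user name are illustrative assumptions and are not part of VTune.

```python
import os


def nofile_limit_lines(user, logical_cpus=None, factor=100):
    """Return the hard and soft nofile lines for /etc/security/limits.conf.

    The limit follows the slide's rule: factor * number of logical CPU cores.
    """
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs; fall back to 1 if it is unknown.
        logical_cpus = os.cpu_count() or 1
    limit = factor * logical_cpus
    return [
        f"{user} hard nofile {limit}",
        f"{user} soft nofile {limit}",
    ]


if __name__ == "__main__":
    # Hypothetical user name; 256 logical CPUs, e.g. 64 cores x 4 threads.
    for line in nofile_limit_lines("hpcuser", logical_cpus=256):
        print(line)
```

The generated lines still have to be added to /etc/security/limits.conf by an administrator; the script only does the arithmetic.
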
VTune Amplifier XE: Performance Analyzer

Overview: Explore Performance on Intel® Xeon and Xeon Phi™ Processor
• Use VTune Amplifier XE 2017 U1 (2017 U2 will be available in WW12)
• Memory Access - BW traffic and memory accesses
  - Memory hierarchy and high BW usage (MCDRAM vs DDR4)
• General Exploration - Micro-architectural issues
  - Explore how efficiently your code is passing through the core pipeline
• Advanced Hotspots - Algorithmic tuning opportunities
• HPC Performance Characterization
  - Scalability aspects for OpenMP and hybrid MPI + OpenMP apps
  - CPU utilization: serial vs parallel time, imbalance, parallel runtime overhead cost, parallel loop parameters
  - Memory access efficiency
  - FPU utilization (upper bound), FLOPS (upper bound), basic loop vectorization info

Analysis Configuration - How to Run VTune CLI on MPI Applications
<mpi_launcher> -n N <vtune_command_line> ./app_to_run
srun -n 48 -N 16 amplxe-cl -collect advanced-hotspots -trace-mpi -r result_dir ./my_mpi_app
mpirun -n 48 -ppn 16 amplxe-cl -collect advanced-hotspots -r result_dir ./my_mpi_app
• Add the -trace-mpi option to the VTune CLI to enable per-node result directories for non-Intel MPIs
• Works for software and Intel driver-based collectors
• Combine plain application launches with the VTune command line so that only selected ranks are profiled, which reduces trace size
Example: profile rank 1 of ranks 0-15: mpirun -n 1 ./my_app : -n 1 <vtune_command_line> ./my_app : -n 14 ./my_app

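The selective-rank launch can be generated programmatically. The sketch below assumes the launcher numbers ranks in the order of the colon-separated segments (as MPICH-style launchers do), so profiling rank 1 of 16 needs one plain rank before the profiled one and 14 after it. The function name `selective_rank_command` is hypothetical.

```python
def selective_rank_command(total_ranks, profiled_rank, vtune_cmd, app,
                           launcher="mpirun"):
    """Build an MPMD launch line in which only `profiled_rank` runs under VTune."""
    if not 0 <= profiled_rank < total_ranks:
        raise ValueError("profiled_rank must be in [0, total_ranks)")
    before = profiled_rank                   # plain ranks launched before it
    after = total_ranks - profiled_rank - 1  # plain ranks launched after it
    segments = []
    if before:
        segments.append(f"-n {before} {app}")
    # The single profiled rank is wrapped in the VTune command line.
    segments.append(f"-n 1 {vtune_cmd} {app}")
    if after:
        segments.append(f"-n {after} {app}")
    return launcher + " " + " : ".join(segments)


if __name__ == "__main__":
    vtune = "amplxe-cl -collect advanced-hotspots -r result_dir"
    print(selective_rank_command(16, 1, vtune, "./my_app"))
```

Building the line from the rank count keeps the segment sizes consistent, which is easy to get wrong when the command is typed by hand.
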
Analysis Configuration - MPI Profiling Command Line Generation from GUI
1. Create a VTune project
2. Choose “Arbitrary Targets/Local”
3. Set processor arch and OS
4. Set application name and parameters
5. Check “Use MPI Launcher”
   Provide the launcher name, number of ranks, ranks to profile, and set the result directory
6. Choose analysis type
7. Generate command line

Analysis workflow
Result finalization and viewing on the KNL target might be slow. Use the recommended workflow:
1. Run collection on KNL, deferring finalization to the host:
   amplxe-cl -collect memory-access -no-auto-finalize -r <my_result_dir> ./my_app
2. Finalize the result on the host
   • Provide search directories to the binaries of interest for resolving with the -search-dir option
   amplxe-cl -finalize -r <my_result_dir> -search-dir <my_binary_dir>
3. Generate reports, work with GUI:
   amplxe-cl -report hotspots -r <my_result_dir>

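The three workflow steps can be scripted so the same result directory is passed to each stage. The following is a sketch that only assembles the command lines quoted on the slide; the function name and the example paths are hypothetical, and actually running the commands (for instance with `subprocess.run`) is left to the reader because step 1 runs on the KNL target and steps 2 and 3 on the host.

```python
import shlex


def knl_workflow_commands(result_dir, app, binary_dir,
                          analysis="memory-access", report="hotspots"):
    """Return the collect, finalize and report commands as argv lists."""
    # Step 1 (on KNL): collect, but defer finalization.
    collect = ["amplxe-cl", "-collect", analysis, "-no-auto-finalize",
               "-r", result_dir, app]
    # Step 2 (on the host): finalize, pointing VTune at the binaries.
    finalize = ["amplxe-cl", "-finalize", "-r", result_dir,
                "-search-dir", binary_dir]
    # Step 3 (on the host): generate a report.
    report_cmd = ["amplxe-cl", "-report", report, "-r", result_dir]
    return collect, finalize, report_cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in knl_workflow_commands("my_result_dir", "./my_app", "/path/to/binaries"):
        print(shlex.join(cmd))
```
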
VTune Amplifier XE: Performance Analyzer - Memory Access

BT Class D with 4 MPI ranks and 16 OMP threads/rank: memory bandwidth ~100 GB/s with DDR4 (left) and ~280 GB/s with MCDRAM (right)
[Screenshot annotations: ~100 GB/s with DDR4; ~260-280 GB/s with MCDRAM]

BT Class D with 4 MPI ranks and 16 OMP threads/rank: hotspots from run on DDR4 (left) versus on MCDRAM (right)

NPB-MZ Class D run time (sec) comparison on DDR4 vs. MCDRAM with various MPI ranks x OMP threads → MCDRAM speedup as high as 2.5X
[Bar chart "Performance: DDR4 vs MCDRAM": run time (sec), axis 0 to 3000, for SP-MZ, BT-MZ and LU-MZ at MPI ranks x OMP threads of 2x32, 4x16, 8x8 and 16x4, each shown without and with "Numa".]

Allocate HBW memory with the Intel Fortran compiler directive fastmem and link with -lmemkind. The memkind library can be downloaded from http://memkind.github.io/memkind/ (for C codes: void *hbw_malloc(size_t size)).
[Screenshot: Intel Fortran compiler directive]

Example of run script with VTune command line amplxe-cl
[Screenshot annotation: numactl to allocate all memory to the 4 MCDRAM memory nodes 4-7]

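A run script like the one on this slide typically prefixes the application with numactl. The sketch below builds such a command line with the `-m` (membind) option that the deck itself uses elsewhere. The helper name is hypothetical, and the node numbers 4-7 are those quoted on the slide; they depend on how the KNL system is configured.

```python
def numactl_membind(nodes, command):
    """Prefix `command` (a list of argv strings) with `numactl -m <nodes>`."""
    if not nodes:
        raise ValueError("at least one NUMA node is required")
    # numactl accepts a comma-separated node list for -m / --membind.
    node_spec = ",".join(str(n) for n in nodes)
    return ["numactl", "-m", node_spec] + list(command)


if __name__ == "__main__":
    # Bind all allocations to the MCDRAM nodes 4-7 named on the slide.
    print(" ".join(numactl_membind([4, 5, 6, 7], ["./my_app"])))
```
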
“watch -n 1 numastat -m” shows the NUMA nodes with DDR4 (0-3) and MCDRAM (4-7), with only MCDRAM memory being allocated for the LU Class D benchmark on KNL
[Screenshot column labels: MCDRAM, DDR4]

BW usage on 64 threads (cores) (Animation code) - max 38 GB/s with DDR4 (left) and 240 GB/s with MCDRAM (right)
[Screenshot annotations: Max DDR4 BW ~38 GB/s; Max MCDRAM BW ~240 GB/s; large L2 cache misses in both runs]

Top memory objects and large L2 cache misses with MCDRAM as HBM

Performance of animation code with DDR4 BW (left) compared to MCDRAM BW (right)
Running on DDR4: numactl -m 0
Running on MCDRAM: numactl -m 1
Low CPU loads due to back-end bound
[Screenshot annotations: DDR4 and MCDRAM timelines with CPU load; bandwidth labels 30 GB/s, 220 GB/s, 200 GB/s as cache, 2 GB/s]

VTune Amplifier XE: Performance Analyzer - General Exploration

Micro-arch analysis with General Exploration
• Execution pipeline slots distribution by Retiring, Front-End, Back-End, Bad Speculation
• Second level metrics for each aspect of the execution pipeline to understand the reason for stalls

Performance summary and top OMP regions for BT-MZ

Hot functions and OMP hotspots with most run time and CPU usage profile
CPU load on all 64 cores is not as high compared to that of 40 cores → an indication of non-optimal data load balancing for this run

CPU performance statistics of different OMP regions for BT-MZ
Very small number of instances with relatively large time duration

Summary of all HW events collected using general-exploration for BT-MZ on KNL: AVX-512 instructions are included in UOPS_RETIRED_PACK.SIMD and UOPS_RETIRED_SCALAR_SIMD, or ~60%+ of all UOPS_RETIRED_ALL
[Screenshot annotations: Memory Uops / Insts retired; SIMD / all UOPS]

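The "SIMD / all UOPS" ratio quoted on this slide is the sum of the packed and scalar SIMD uop counts divided by all retired uops. A minimal sketch of that arithmetic follows; the function name is hypothetical and the counts in the example are made-up placeholders, not measurements from the BT-MZ run.

```python
def simd_uop_fraction(packed_simd_uops, scalar_simd_uops, all_uops):
    """Fraction of retired uops that are SIMD (packed + scalar)."""
    if all_uops <= 0:
        raise ValueError("all_uops must be positive")
    return (packed_simd_uops + scalar_simd_uops) / all_uops


if __name__ == "__main__":
    # Placeholder event counts chosen only to illustrate the calculation.
    fraction = simd_uop_fraction(5_000_000, 1_000_000, 10_000_000)
    print(f"SIMD share of retired uops: {fraction:.0%}")
```
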
VTune Amplifier XE: Performance Analyzer - Advanced Hotspots

Advanced-hotspot performance analysis - summary view

Advanced-hotspot performance analysis - bottom-up view

VTune Amplifier XE: Performance Analyzer - HPC Performance

HPC Performance Characterization Analysis
Show important aspects of application performance in one analysis
  - Entry point to assess application efficiency on system resources utilization, with definition of the next steps to investigate pathologies with significant performance cost
  - Monitor how code changes impact different important performance aspects to better understand their impact on elapsed time
Customers asking
  - I eliminated imbalance with dynamic scheduling but elapsed time of my application became worse, why?
  - I vectorized the code but don't have much benefit, why?
  - I'm moving from pure MPI to MPI + OpenMP but the results are worse, why?
CPU utilization, memory efficiency and FPU utilization aspects are important for performance study and are correlated - let's explore them in one view
> amplxe-cl -collect hpc-performance -data-limit=0 -r result_dir ./my_app

Performance Aspects: CPU Utilization (1/2)
CPU Utilization
  - % of “Effective” CPU usage by the application under profiling (threshold 90%)
    - Under the assumption that the app should use all available logical cores on a node
    - Subtracting spin/overhead time spent in MPI and threading runtimes
Metrics in CPU utilization section
  - Average CPU usage
  - Intel OpenMP scalability metrics impacting effective CPU utilization
  - CPU utilization histogram

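One way to read the definition above is as CPU time minus spin/overhead time, divided by the CPU time that would be available if every logical core were busy for the whole run. The sketch below implements that simplified reading; it is an assumption for illustration and not VTune's actual formula, and the function names and input numbers are hypothetical.

```python
THRESHOLD = 0.90  # the slide's 90% threshold for effective CPU utilization


def effective_cpu_utilization(cpu_time_s, spin_overhead_s, elapsed_s, logical_cores):
    """Effective CPU time as a fraction of elapsed time x logical cores."""
    available = elapsed_s * logical_cores
    if available <= 0:
        raise ValueError("elapsed_s and logical_cores must be positive")
    return (cpu_time_s - spin_overhead_s) / available


def below_threshold(utilization, threshold=THRESHOLD):
    """True when utilization falls under the threshold and deserves a look."""
    return utilization < threshold


if __name__ == "__main__":
    # Placeholder numbers: 10 s elapsed on 100 logical cores, 900 s of CPU
    # time in total, of which 100 s was spin/overhead in the runtimes.
    u = effective_cpu_utilization(900.0, 100.0, 10.0, 100)
    print(f"effective utilization: {u:.0%}, flagged: {below_threshold(u)}")
```
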
Performance Aspects: CPU Utilization (2/2) - Specifics for hybrid MPI + OpenMP apps
• MPI communication spinning metric for MPICH-based MPIs (Intel MPI, CRAY MPI, ...)
• Difference in MPI communication spinning between ranks can signal MPI imbalance
• Showing OpenMP metrics and serial time per process, sorted by processes lying on the critical path of MPI execution

Performance Aspects: Memory Bound
Memory Bound
  - % of potential execution pipeline slots lost because of fetching memory from different levels of hierarchy (threshold 20%)
Metrics in Memory Bound section
  - Cache bound
  - DRAM bound
    - Issue description specifies if the code is bandwidth or latency bound with proper advice of how to fix
    - NUMA: % of remote accesses
    - Important to explore if the code is bandwidth bound
    - Bandwidth utilization histogram
[Screenshot label: NUMA Access]

Performance Aspects: Memory Bound on KNL
Since there is no memory stall measurement on KNL, the “Memory Bound” high-level metric is replaced with Backend-Bound, with second-level metrics based on misses and bandwidth measurement from uncore events:
  - L2 Hit Bound - cost of L1 misses served in L2
  - L2 Miss Bound - cost of L2 misses
  - DRAM Bandwidth Bound - % of app elapsed time consuming high DRAM bandwidth
  - MCDRAM Bandwidth Bound - % of app elapsed time consuming high MCDRAM bandwidth
  - Bandwidth utilization histogram

Performance Aspects: FPU Utilization
FPU utilization
  - % of FPU load (100% - FPU is fully loaded, threshold 50%)
Metrics in FPU utilization section
  - SP FLOPs per Cycle (vector code generation and execution efficiency)
  - Vector Capacity Usage and FP Instruction Mix, FP Arith/Mem ratios (vector code generation efficiency)
  - Top 5 loops/functions by FPU usage
  - Dynamically generated issue descriptions on low FPU usage help to define the reason and next steps: non-vectorized, vectorized with legacy instruction set, memory bound limited loops not benefiting from vectorization, etc.
These renewed FPU Utilization metrics will be available in 2017 Update 2

Performance Aspects: FPU Utilization on KNL
No FLOP counters on KNL to calculate FLOPS and FPU Utilization
Showing SIMD Instructions per cycle and SIMD Packed vs SIMD Scalar based on available HW counters + Vector instruction set per loop based on static analysis
These renewed FPU Utilization metrics will be available in 2017 Update 2

HPC Performance Characterization - Command Line Reporting
• Generated after collection is done or with the “-R summary” option of amplxe-cl
• With issue descriptions that can be suppressed by the “-report-knob show-issues=false” option

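The two reporting options named on this slide can be combined into one command line. The sketch below only assembles the arguments; the function name is hypothetical, and passing the result directory with `-r` follows the usage shown on the earlier workflow slide.

```python
def summary_report_command(result_dir, show_issues=True):
    """Build the amplxe-cl summary report command as an argv list."""
    cmd = ["amplxe-cl", "-R", "summary", "-r", result_dir]
    if not show_issues:
        # Suppress the issue descriptions, as described on the slide.
        cmd += ["-report-knob", "show-issues=false"]
    return cmd


if __name__ == "__main__":
    print(" ".join(summary_report_command("result_dir", show_issues=False)))
```
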
Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Back-up Slides

Code tuning requirements: know your code, know the compiler and know the platform microarchitecture
• Core tuning:
  - Cache or vector friendly or both:
    - AVX-2 and AVX-512
    - Use best compiler options and check compiler report
      mpiifort -g -O3 -xMIC-AVX512 -align array64byte … -qopt-report=5 -qopt-report-phase=loop,vec,openmp…
  - Compiler directives and pragmas: SIMD, Alignment, …
  - OpenMP 4.0 with OMP SIMD directives/pragmas
  - NUMA: MCDRAM vs DDR - allocate memory for active arrays or use the NUMA command to use MCDRAM for better performance

Knights Landing (KNL) Overview
• 36 Tiles w/ 72 new Silvermont-based cores
• 4 Threads per core
• 2 Vector Processing Units per core
• 6 channels of DDR4 2400 up to 384GB
• 8 to 16 GB of on-package MCDRAM memory
• 36 lanes PCIE Gen 3. 4 lanes of DMI
[Package diagram: eight MCDRAM devices attached via OPIO, two DDR4 blocks, PCIE gen3 (2 x16, 1 x4), x4 DMI, and 36 tiles connected with a mesh. Tile detail: two cores, each with 2 VPUs, plus 1MB L2 and CHA.]

3 Memory Modes
• Cache Model: 16GB MCDRAM with DDR4; 64B cache lines, direct mapped
• Flat Models: 16GB MCDRAM and DDR4 in the physical address space
• Hybrid Model: DDR4 with 4 or 8 GB MCDRAM plus 8 or 12GB MCDRAM; split options: 25/75% or 50/50%
• Mode selected at boot time
• MCDRAM-Cache covers all HBM

Code tuning requirements: Parallel Scalability with MPI, OMP, Hybrid MPI + OMP
• Scalability:
  - OMP
    - Load balance over all threads
    - Private vs shared data
    - Synchronization
    - Lock, wait and spinning vs doing work
    - SIMD directives
  - MPI
    - Timing cost due to communication vs computing
    - Blocking vs non-blocking message types
    - Global synchronizations
    - All-to-all communication