Intel® Xeon Phi™ Coprocessor: Introduction
Dmitry Sergeev
2014-11-20
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Intel® Xeon Phi™ Coprocessors
Key platform characteristics…
Up to 61 IA-based cores / 1.1 GHz / 244 threads
Up to 16 GB of memory with 352 GB/s bandwidth
512-bit SIMD instructions
Linux OS, accessible by IP address
Standard programming tools and languages!
…leading to outstanding results
Up to 1.2 teraflops peak performance (note 1)
Up to 2.2x higher memory bandwidth than Intel® Xeon® E5 (note 2)
Up to 4x more energy efficient than Intel® Xeon® E5 (note 3)
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. For more information go to
http://www.intel.com/performance Notes 1, 2 & 3, see backup for system configuration details.
Intel® Xeon Phi™ Coprocessors
3 Family: an impressive solution for parallel computing (price/performance). Models 3120P, 3120A. 6GB GDDR5, 240GB/s, >1TF DP.
5 Family: high-density systems (power/performance). Models 5110P, 5120D. 8GB GDDR5, >300GB/s, >1TF DP, 225-245W.
7 Family: high-performance systems (highest performance). Models 7120P, 7120X. 16GB GDDR5, 352GB/s, >1.2TF DP.
Complementary technologies
Intel® Xeon Phi™ coprocessor: highly parallel HPC computations
Intel® Xeon® processor: general-purpose HPC computations
General architectural characteristics
Characteristic | Intel® Xeon® Processor E5-2690 | Intel® Xeon Phi™ Coprocessor 5110P
Frequency | 2.9 GHz | 1.053 GHz
Cores | 8 (multi-core) | 60 (many-core)
Threads | 16 | 240
SIMD width (bits) | 256 | 512
Cache | Coherent | Coherent
Memory | Shared memory | Shared memory
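The frequency, core count, and SIMD width in this comparison can be combined into a rough theoretical peak. The sketch below is an illustration only: the function name is hypothetical, and the value of 2 floating-point operations per lane per cycle (a fused multiply-add on the coprocessor, one add plus one multiply on the Xeon) is an assumption that the slides do not state.

```c
#include <assert.h>

/* Theoretical peak double-precision GFLOPS:
 * cores x clock (GHz) x 64-bit SIMD lanes x flops per lane per cycle.
 * flops_per_lane is an assumption supplied by the caller. */
double peak_dp_gflops(int cores, double ghz, int simd_bits, int flops_per_lane)
{
    assert(cores > 0 && simd_bits > 0 && simd_bits % 64 == 0);
    int dp_lanes = simd_bits / 64;   /* number of 64-bit doubles per vector */
    return cores * ghz * dp_lanes * flops_per_lane;
}
```

Under these assumptions, peak_dp_gflops(60, 1.053, 512, 2) gives about 1011 GFLOPS for the 5110P, in line with the ">1TF DP" figure for the 5 Family, while peak_dp_gflops(8, 2.9, 256, 2) gives about 186 GFLOPS for one E5-2690. Real applications reach only a fraction of either number.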
Typical platform with Intel® Xeon Phi™ coprocessors
[Figure, for illustration only.]
Intel® Xeon® platform (the "host"): 1-2 CPUs per node, linked by QPI, each with DDR3 memory and IBA/10GbE networking.
Intel® Xeon Phi™ coprocessor(s): 1-4 per node, each with GDDR5 memory, attached to the host over x16 PCIe.
Intel® Xeon Phi™ microarchitecture overview
[Figure, for illustration only: cores, each with its own L2 cache and tag directory (TD), together with GDDR memory controllers (MC) and the PCIe client logic.]
TD: Tag Directory; L2: L2 cache; MC: Memory Controller
Intel® Xeon Phi™ core
[Figure, for illustration only: block diagram of one core.]
Labels recoverable from the diagram: 4 in-order hardware threads (T0 IP to T3 IP); L1 TLB and 32KB code cache; decode and uCode at 16B/cycle (2 IPC); two pipes (Pipe 0 and Pipe 1); x87 and scalar register files with x87, ALU 0, and ALU 1; VPU register file and VPU with 512b SIMD; L1 TLB and 32KB data cache; L2 TLB, TLB miss handler, and HWP; L2 control and a 512KB L2 cache connected to the on-die interconnect.
Applications that run efficiently on Intel® Xeon Phi™
Allow massive parallelism
Have high computational complexity
Vectorization
A large amount of computation per unit of data
Fit into the available memory
Multicore (8+)
Many-core (60)
SIMD / data parallelism
SIMD: Single Instruction, Multiple Data
Intel Xeon Phi vector: 512 bits (bits 0 to 511)
Types:
• integer (32 and 64 bit)
• float (F32)
• double (F64)
[Figure: one instruction applies the operation ◦ to every pair of elements at once, producing X1◦Y1, X2◦Y2, …, X16◦Y16.]
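A minimal sketch of what the SIMD picture means in numbers. The function names are hypothetical, and the loop only models the lanes one at a time; on real hardware a single vector instruction would process all of them together.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of a given width that fit in one SIMD vector. */
int simd_lanes(int vector_bits, int element_bits)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}

/* Conceptual model of one SIMD multiply: the same operation is applied
 * to every lane. Here the lanes are visited by a plain scalar loop. */
void lanewise_mul(const float *x, const float *y, float *out, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        out[i] = x[i] * y[i];
}
```

For a 512-bit vector, simd_lanes(512, 32) is 16 (float or 32-bit integer) and simd_lanes(512, 64) is 8 (double or 64-bit integer), which is why the picture shows elements 1 to 16.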
Code vectorization
• Makes sequential code use the data-parallel (SIMD) capabilities of Intel processors
– Manually, through special syntax
– Automatically, by the compiler
for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];
[Figure: the scalar loop computes one sum c[i] = a[i] + b[i] per iteration; the vectorized loop adds eight elements at once, a[i] … a[i+7] plus b[i] … b[i+7] giving c[i] … c[i+7].]
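The transformation a vectorizing compiler applies to such a loop can be sketched by hand. This is only an illustration under stated assumptions: the name add_arrays and the width VLEN = 8 (taken from the eight-element picture) are hypothetical, and the inner loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8   /* elements per vector step, matching the 8-wide picture */

void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    /* Main loop: each step handles VLEN consecutive elements. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t lane = 0; lane < VLEN; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    /* Remainder loop: leftover elements when n is not a multiple of VLEN. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With n = 19, the main loop runs twice (16 elements) and the remainder loop handles the last three.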
Why does vectorization matter?
Saturation add:
#define MAX(x,y) ((x)>(y)?(x):(y))
#define MIN(x,y) ((x)<(y)?(x):(y))
#define SAT2SI16(x) \
    MAX(MIN((x),32767),-32768)
void foo1(int n, short *A, short *B){
    int i;
#pragma ivdep
#pragma vector aligned
    for (i=0; i<n; i++)
        A[i] = SAT2SI16(A[i]+B[i]);
}
Scalar code (11 instructions per element):
    movsx r11d, [rdx+r9*2]
    movsx ebx, [r8+r9*2]
    add r11d, ebx
    cmp r11d, 32767
    cmovge r11d, eax
    cmp r11d, -32768
    cmovl r11d, ecx
    mov [rdx+r9*2], r11w
    inc r9
    cmp r9, r10
    jb .B1.8
Vector code (SSSE3) (6 instructions per 8 elements):
    movdqa xmm0, [rdx+rax*2]
    paddsw xmm0, [r8+rax*2]
    movdqa [rdx+rax*2], xmm0
    add rax, 8
    cmp rax, r9
    jb .B1.4
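The instruction counts above can be turned into a per-element comparison. The sketch below gives a scalar reference for a single lane of the saturating add, plus a small helper for the arithmetic; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for one lane of a saturating 16-bit add. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot wrap */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Instructions executed per array element for one loop body. */
double instr_per_element(int instr_per_iter, int elems_per_iter)
{
    assert(elems_per_iter > 0);
    return (double)instr_per_iter / elems_per_iter;
}
```

With the slide's numbers, the scalar loop costs 11 instructions per element and the vector loop 6 / 8 = 0.75, so the vector version executes roughly 14.7 times fewer instructions per element. Instruction count is not the same as run time, so this ratio is only an indication of the possible gain.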
Intel parallel programming models
Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism (available as source code and as an Intel product)
Intel® Threading Building Blocks: C++ template library for parallelism (available as source code and as an Intel product)
Specialized libraries: Intel® Integrated Performance Primitives, Intel® Math Kernel Library
Standards: Message Passing Interface (MPI), OpenMP*, Coarray Fortran, OpenCL*
R&D: Intel® Concurrent Collections, Offload Extensions, Intel® SPMD Parallel Compiler
All of these apply to both multicore and many-core
Heterogeneous programming
Parallel programming is the same on both sides: C++, Fortran (CAF), OpenMP, TBB, OpenCL, Cilk Plus, MKL, and the same tools.
[Figure: the same languages, libraries, and tools produce an executable for the CPU and an executable for the Xeon Phi coprocessor; the two sides are connected over PCIe.]
Flexible execution models
Optimal performance for different workloads
Native model: the program runs on the Xeon Phi™ coprocessor.
Offload model: the program runs on the Xeon® and uses directives to offload work to the Xeon Phi™.
Symmetric model: MPI processes run on both the Xeon® and the Xeon Phi™.
MPI+Offload
MPI ranks on Intel® Xeon® processors (only)
All messages into/out of processors
Offload models used to accelerate MPI ranks
Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC Architecture
Homogeneous network of hybrid nodes
[Figure: Xeon + MIC nodes on a network; MPI messages travel between the Xeon processors, and each Xeon exchanges offload data with its MIC.]
MPI + Offload: how to run
Compile your code with the offload directives:
$ mpiifort -openmp test.f -o test.offload
Create your hosts file (Xeon only):
$ cat hosts
node0
node1
Run your application (Xeon only):
$ mpirun -f hosts -n 2 ./test.offload
Offload example: computing π (demonstration only)
#include <stdio.h>
#include <stdlib.h>
#define NSET 1000000
int main(int argc, const char **argv)
{
    long int i, num_inside = 0;
    float Pi;
    /* The one added line: run the following loop on the coprocessor */
    #pragma offload target(mic)
    #pragma omp parallel for reduction(+:num_inside)
    for (i = 0; i < NSET; i++)
    {
        float x, y, distance2;
        /* Generate x, y random numbers in [0,1).
           rand() is not thread-safe; acceptable for a demonstration only. */
        x = (float)(rand() / (RAND_MAX + 1.0));
        y = (float)(rand() / (RAND_MAX + 1.0));
        distance2 = x*x + y*y;
        if (distance2 <= 1.0f)
            num_inside++;
    }
    Pi = 4.0f * ((float)num_inside / NSET);
    printf("Value of Pi = %f \n", Pi);
    return 0;
}
Adding just one line produces the heterogeneous (Xeon + Xeon Phi) version
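The demonstration relies on rand(), which keeps hidden global state and is therefore a poor fit for a parallel loop. A possible variant, sketched here under assumptions, gives each chunk of work its own generator state. The generator (a 64-bit linear congruential generator with commonly used constants) and all function names are illustrative choices, not part of the original example.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit linear congruential generator with caller-owned state, so each
 * thread or chunk can use its own stream. Returns a float in [0,1) built
 * from the top 24 bits of the state. */
static float next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (float)(*state >> 40) / 16777216.0f;   /* divide by 2^24 */
}

/* Count how many of n random points in the unit square fall inside the
 * quarter circle of radius 1. */
long count_inside(uint64_t seed, long n)
{
    assert(n >= 0);
    uint64_t state = seed;
    long inside = 0;
    for (long i = 0; i < n; i++) {
        float x = next_uniform(&state);
        float y = next_uniform(&state);
        if (x * x + y * y <= 1.0f)
            inside++;
    }
    return inside;
}

/* Monte Carlo estimate of pi from n sample points. */
double estimate_pi(uint64_t seed, long n)
{
    assert(n > 0);
    return 4.0 * (double)count_inside(seed, n) / (double)n;
}
```

In a threaded or offloaded version, each thread could call count_inside with a different seed and the partial counts could be summed, which is what the reduction clause does in the slide's example.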
Automatic offload with Intel® Math Kernel Library
void foo() /* Intel® Math Kernel Library */
{
    float *A, *B, *C; /* Matrices */
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
Implicit automatic offload requires no changes to the source code: the same call can execute on the Xeon or on the Xeon Phi.
Many-core Hosted (Native)
MPI ranks on Intel® Xeon Phi™ coprocessors (only)
All messages into/out of Intel® Xeon Phi™ coprocessors
Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
Programmed as a homogeneous network of many-core CPUs
[Figure: Xeon + MIC nodes on a network; data and MPI messages are handled by the MIC coprocessors.]
Many-core Hosted (Native): how to run
Compile your code for the Intel® Xeon Phi™ Coprocessor:
$ mpiifort -mmic test.f -o test.mic
Copy the MIC-enabled executable to the coprocessor:
$ scp test.mic mic0:/home/user/
$ scp test.mic mic1:/home/user/
Create your hosts file (MIC only):
$ cat hosts
mic0
mic1
Let the library know you're planning on running on MIC:
$ export I_MPI_MIC=1
Run your application (from the Xeon):
$ mpirun -f hosts -n 4 /home/user/test.mic
Symmetric
MPI ranks on Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors
Messages to/from any core
Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
Programmed as a heterogeneous network of homogeneous nodes
[Figure: Xeon + MIC nodes on a network; data and MPI messages are handled by both the Xeon processors and the MIC coprocessors.]
Symmetric: how to run
Compile for the Intel® Xeon® and the Intel® Xeon Phi™ Coprocessor:
$ mpiifort test.f -o /home/user/test
$ mpiifort -mmic test.f -o test.mic
Copy the MIC-enabled executable to the coprocessor (rename during copy, so the same path works on host and coprocessor):
$ scp test.mic mic0:/home/user/test
$ scp test.mic mic1:/home/user/test
Create your hosts file (Xeon+MIC):
$ cat hosts
node0
mic0
mic1
Let the library know you're planning on running on MIC:
$ export I_MPI_MIC=1
Run your application (from the Xeon):
$ mpirun -f hosts -n 4 /home/user/test
NFS support via environment variables
Two environment variables are available to support NFS on the coprocessor:
I_MPI_MIC_PREFIX: prepends its value to the executable name (directory)
I_MPI_MIC_POSTFIX: appends its value to the executable name (extension)
Procedure:
Set I_MPI_MIC=1
Run the job as normal: mpirun … ./app args
Host nodes will launch the command as specified: ./app args
Coprocessor nodes will launch the modified command: $I_MPI_MIC_PREFIX./app$I_MPI_MIC_POSTFIX args
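The expansion shown for coprocessor nodes is plain string concatenation. The sketch below only illustrates that expansion as written on the slide; it is not the library's implementation, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Build the executable name a coprocessor rank would launch:
 * prefix + executable + postfix. A NULL prefix or postfix means "not set".
 * Returns the length reported by snprintf, or a negative value on error. */
int mic_executable(char *out, size_t out_size,
                   const char *prefix, const char *exe, const char *postfix)
{
    assert(out && exe && out_size > 0);
    return snprintf(out, out_size, "%s%s%s",
                    prefix ? prefix : "", exe, postfix ? postfix : "");
}
```

For example, with no prefix and a postfix of ".mic", the host command ./app becomes ./app.mic on the coprocessor, so both binaries can sit side by side in one NFS directory.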
Configuration files for complex runs
Configuration files allow different MPI options, different executables, different program arguments, etc.
One argument set per line, # for comments
The run command should specify only the configuration file
$ cat theconfigfile
-n 1 -host node1 ./master
-n 3 -env OMP_NUM_THREADS 8 -host node1 ./worker
-n 4 -env OMP_NUM_THREADS 60 -host node1-mic0 ./worker.mic
-n 4 -env OMP_NUM_THREADS 8 -host node2 ./worker
-n 4 -env OMP_NUM_THREADS 60 -host node2-mic0 ./worker.mic
$ mpirun -configfile theconfigfile
Intel® MPI Library 5.0: What's New
MPI-3 Standard Support
Non-Blocking Collectives
Fast RMA
Large Counts
MPICH ABI Compatibility
Compatibility with MPICH* v3.1, IBM* MPI v1.4, Cray* MPT v7.0
Performance & Scaling
Memory Consumption Optimizations
Scaling up to 150K Ranks*
Gains up to 35% reduction on Collectives
Hydra now default job manager on Windows*
Configuration: Hardware: Intel® Xeon® CPU E5-2680 @ 2.70GHz, RAM 64GB; Interconnect: InfiniBand, ConnectX adapters; FDR. MIC: C0-KNC 1238095 kHz; 61 cores. RAM: 15872 MB per card. Software: RHEL 6.2, OFED 1.5.4.1, MPSS Version: 3.2, Intel® C/C++ Compiler XE 13.1.1, Intel® MPI Benchmarks 3.2.4.;
* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
Superior Performance with Intel® MPI Library 5.0
64 processes, 8 nodes (InfiniBand + shared memory), Linux* 64
Relative (geomean) MPI latency benchmarks (higher is better)
[Chart: speedup (times) of Intel MPI 5.0 relative to MVAPICH2-2.0 RC2, which is normalized to 1 at every message size.]
4 bytes: 2.0x faster
512 bytes: 1.9x faster
16 Kbytes: 1.1x faster
128 Kbytes: 1.6x faster
4 Mbytes: 1.8x faster
Tuning MPI Application Performance
Performance Tuning Tools for Distributed Applications
Intel® Trace Analyzer and Collector
Tune cross-node MPI
Visualize MPI behavior
Evaluate MPI load balancing
Find communication hotspots
Intel® VTune™ Amplifier XE
Tune single node threading
Visualize thread behavior
Evaluate thread load balancing
Find thread sync bottlenecks
Intel® Trace Analyzer and Collector Overview
Intel® Trace Analyzer and Collector helps the developer:
Visualize and understand parallel application behavior
Evaluate profiling statistics and load balancing
Identify communication hotspots
Features
Event-based approach
Low overhead
Excellent scalability
Powerful aggregation and filtering functions
Idealizer
NEW in 9.0: Automatic Performance Assistant
[Figure: workflow from source code through compiler, objects, linker, binary, and runtime to output. The Intel® Trace Collector is attached through the API and -tcollect at compile time or through -trace at link time, and writes a trace file (.stf) that the Intel® Trace Analyzer reads.]
Using the Intel® Trace Analyzer and Collector is … Easy!
Step 1: Run your binary and create a tracefile
$ mpirun -trace -n 2 ./test
Step 2: View the results
$ traceanalyzer &
Multiple Methods for Data Collection
Collection Mechanism | Advantages | Disadvantages
Run with -trace or preload trace collector library. | Automatically collects all MPI calls, requires no modification to source, compile, or link. | No user code collection.
Link with -trace. | Automatically collects all MPI calls. | No user code collection. Must be done at link time.
Compile with -tcollect. | Automatically instruments all function entries/exits. | Requires recompile of code.
Add API calls to source code. | Can selectively instrument desired code sections. | Requires code modification.
Tracing on Intel® Xeon Phi™ Coprocessor
Tracing libraries have been ported
Ensure libraries are available on card
Installation path available via NFS (preferred)
Manually copy files via scp:
scp /opt/intel/itac/<version>/mic/slib/libVT.so mic0:/lib64
Run as a normal job
All trace files stored in working directory
If not on NFS share, files will need to be copied from coprocessor
Analyze using Intel® Trace Analyzer:
traceanalyzer test.stf &
Intel® Trace Analyzer and Collector
Compare the event timelines of two communication profiles
Blue = computation
Red = communication
Chart showing how the MPI processes interact
Improving Load Balance: Real World Case
Collapsed data per node and coprocessor card
Host: 16 MPI procs x 1 OpenMP thread
Coprocessor: 8 MPI procs x 28 OpenMP threads
Too high load on host = too low load on coprocessor
Improving Load Balance: Real World Case
Collapsed data per node and coprocessor card
Host: 16 MPI procs x 1 OpenMP thread
Coprocessor: 24 MPI procs x 8 OpenMP threads
Too low load on host = too high load on coprocessor
Improving Load Balance: Real World Case
Collapsed data per node and coprocessor card
Host: 16 MPI procs x 1 OpenMP thread
Coprocessor: 16 MPI procs x 12 OpenMP threads
Perfect balance: host load = coprocessor load
NEW in 9.0: MPI Performance Assistant
Automatic Performance Assistant
Detect common MPI performance issues
Automated tips on potential solutions
Automatically detect performance issues and their impact on runtime
Which Performance Issues are automatically identified?
Point-to-point exchange
Late Sender
Late Receiver
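The two point-to-point patterns can be described with a simplified timing model. This sketch rests on the usual meaning of the terms (in a Late Sender the receiver waits for a send that starts late; in a Late Receiver a blocking send waits for a receive that is posted late). The function names and the model are illustrative and may differ from how the tool actually measures these issues.

```c
#include <assert.h>

/* Late Sender: the receive was posted at recv_start, but the matching send
 * only started at send_start; the receiver idles for the difference. */
double late_sender_wait(double send_start, double recv_start)
{
    assert(send_start >= 0.0 && recv_start >= 0.0);
    return send_start > recv_start ? send_start - recv_start : 0.0;
}

/* Late Receiver: a blocking send started at send_start, but the matching
 * receive was only posted at recv_start; the sender idles. */
double late_receiver_wait(double send_start, double recv_start)
{
    assert(send_start >= 0.0 && recv_start >= 0.0);
    return recv_start > send_start ? recv_start - send_start : 0.0;
}
```

For a message whose send starts at t = 5 and whose receive was posted at t = 2, the receiver loses 3 time units to a Late Sender and the sender loses none.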
Which Performance Issues are automatically identified?
Global collective operation performance
Wait at Barrier
Early Reduce
Late Broadcast
NEW in 9.0: Summary page shows computation vs. communication breakdown
Is your application
MPI-bound?
Is your application
CPU-bound?
Resource usage
Largest MPI consumers
Next Steps
NEW in 9.0: Initial MPI-3.0 Support
Support for major MPI-3.0 features
Non-blocking collectives, for example non-blocking Allreduce (MPI_Iallreduce)
Fast RMA
Large counts
Intel® VTune™ Amplifier XE with MPI
Tune for Scalable Multicore Performance
Launch Intel® VTune™ Amplifier XE
Use mpirun
List your app as a parameter
Results organized by MPI rank
Review results
Graphical user interface
Command line report
Using Intel® VTune™ Amplifier XE with MPI
Use the command-line tool under the MPI run script to gather report data:
mpirun -n <ranks> amplxe-cl -result-dir ampl_results -collect hotspots -- ./test
Argument Sets can be used for more control
Required: Only run one driver collection per node
Only collect data on certain ranks
Different collections or options on different ranks
A unique results directory is created for each analyzed MPI rank
Launch the GUI and view the results for each rank
Intel® Inspector XE with MPI
Where are my application's…
Memory errors: invalid accesses, memory leaks, uninitialized memory accesses
Threading errors: races, deadlocks, cross stack references
Security errors: buffer overflows and underflows, incorrect pointer usage, over 250 error types…
• MPI aware, cluster friendly
• Both dynamic and static analysis
• Multiple tools, common GUI
• Windows* & Linux*
"Having such a tool this early in the development stage frees the validation from trivial bug reports and gives our engineers the opportunity to code more efficiently from the very beginning of the product cycle."
Jean Kypreos, Advanced Video Processing Team Manager, Envivio
Multi-threading problems are hard to reproduce, difficult to debug, and expensive to fix!
Intel® Inspector XE
Dynamic Analysis
Launch Intel® Inspector XE
Use mpirun
List your app as a parameter
Results organized by MPI rank
Review results
Graphical user interface
Command line report
Static Analysis
Source analyzed for errors (similar to a build)
Review results
Graphical user interface
Find errors earlier when they are less expensive to fix
Using Intel® Inspector XE with MPI
Use the command-line tool under the MPI run script to gather report data:
mpirun -n <ranks> inspxe-cl -result-dir insp_results -collect mi1 -- ./test
Argument Sets can be used for more control
Only collect data on certain ranks
Different collections or options on different ranks
A unique results directory is created for each analyzed MPI rank
Launch the GUI and view the results for each rank
Online Resources
Intel® MPI Library product page
www.intel.com/go/mpi
Intel® Trace Analyzer and Collector product page
www.intel.com/go/traceanalyzer
Intel® Clusters and HPC Technology forums
http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
Intel® Xeon Phi™ Coprocessor Developer Community
http://software.intel.com/en-us/mic-developer
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804