Intel® Xeon Phi™ Coprocessor
James Reinders
http://tinyurl.com/inteljames twitter @jamesreinders
It’s all about parallel programming
[Diagram labels: Source; Compilers, Libraries, Parallel Models; Multicore CPU; Intel® MIC architecture coprocessor]
Game Changer
“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL
R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011
Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
Intel® C/C++ and Fortran Compilers w/OpenMP
Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP
Intel® Parallel Studio XE
Intel® Trace Analyzer and Collector
Intel® MPI Library
Software Development Ecosystem for Intel Xeon Phi coprocessors
Compilers, run environments
  Open source: gcc (kernel build only, not for applications), Python
  Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* compiler, ScaleMP*
Debugger
  Open source: gdb
  Commercial: Intel Debugger, Rogue Wave* TotalView*, Allinea* DDT
Libraries
  Open source: TBB [1], MPICH2, FFTW, NetCDF
  Commercial: NAG*, Intel® MKL, Intel® MPI, OpenMP* (in Intel compilers), Cilk™ Plus (in Intel compilers), Rogue Wave* IMSL, Intel® OpenCL* SDK
Profiling & Analysis Tools
  Commercial: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Rogue Wave ThreadSpotter*
Workload Scheduler
  Commercial: Altair* PBS Professional, Adaptive* Computing Moab
[1] Commercial support of TBB available from Intel.
*Other names and brands may be claimed as the property of others.
Knights Corner Coprocessor
[Diagram labels: Intel® Xeon® Processor; System Memory; PCIe x16; KNC Card; > 50 Cores; >= 8GB GDDR5 memory; GDDR5 Channels; Linux OS; TCP/IP]
Knights Corner Micro-architecture
[Diagram labels: PCIe Client Logic; Cores, each with an L2; TD blocks; GDDR MC blocks]
Knights Corner Core
X86 specific logic < 2% of core + L2 area
[Diagram labels: 4 Threads In-Order (T0 IP, T1 IP, T2 IP, T3 IP); L1 TLB and 32KB Code Cache; L1 TLB and 32KB Data Cache; Decode, uCode, 16B/Cycle (2 IPC); Pipe 0 and Pipe 1; X87 RF, Scalar RF, VPU RF; X87, ALU 0, ALU 1; VPU 512b SIMD; L2 TLB and TLB Miss Handler; L2 Ctl; HWP; 512KB L2 Cache; To On-Die Interconnect]
Core pipeline stages: PPF PF D0 D1 D2 E WB
Vector Processing Unit
[Diagram labels: core pipeline PPF PF D0 D1 D2 E WB; vector pipeline stages D2, E, VC1, VC2, V1-V4, WB; DEC; VPU RF (3R, 1W); Mask RF; Scatter Gather; ST; LD; EMU]
Vector ALUs: 16 Wide x 32 bit, 8 Wide x 64 bit
Fused Multiply Add
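The lane counts on this slide follow from dividing the 512-bit vector width by the element size. A minimal C sketch of that arithmetic follows. The function names are illustrative only, and counting a fused multiply-add as two floating-point operations per lane is the usual convention, not something stated on the slide.

```c
/* Illustrative helpers (hypothetical names, not from the slides). */

/* Number of SIMD lanes for a given vector width and element size, in bits. */
int simd_lanes(int vector_bits, int element_bits)
{
    return vector_bits / element_bits;
}

/* Peak floating-point operations per cycle for one vector unit,
   ASSUMING one fused multiply-add per lane per cycle and counting
   each FMA as two operations (multiply + add). */
int peak_flops_per_cycle(int vector_bits, int element_bits)
{
    return 2 * simd_lanes(vector_bits, element_bits);
}
```

For the 512-bit VPU, `simd_lanes(512, 32)` gives 16 and `simd_lanes(512, 64)` gives 8, matching the "16 Wide x 32 bit, 8 Wide x 64 bit" label.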
Interconnect
[Diagram labels: Cores, each with an L2; TD blocks; BL - 64 Bytes (Data); AD (Command and Address); AK (Coherence and Credits)]
Interleaved Memory Access
[Diagram labels: Cores, each with an L2; TD blocks; GDDR MC blocks]
A picture can be worth a thousand words.
Picture worth many words
[Chart not recoverable; slide callouts follow.]
SMALL NUMBER OF THREADS IS UNINTERESTING
AT LOW PERFORMANCE LEVELS, MORE THREADS NEEDED FOR SAME PERFORMANCE
THE PAYOFF IS HIGHER ACHIEVABLE RESULTS ON CERTAIN WORKLOADS AND LOWER POWER USAGE
Over 100 threads?
    !$OMP PARALLEL DO PRIVATE(j,k)
    do i=1, M
      ! each thread will work its own part of the problem
      do j=1, N
        do k=1, X
          ! calculations
        end do
      end do
    end do
Fortran do loop transformed to create many threads using an OpenMP directive
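The same pattern can be written in C. The sketch below is an assumed stand-in, not code from the talk: the function name `triple_loop` and the placeholder calculation (summing i + j + k) are hypothetical. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
/* Hypothetical workload: row_sum[i] = sum over j,k of (i + j + k). */
void triple_loop(long *row_sum, int m, int n, int x)
{
    int i, j, k;

    /* Iterations of the outer loop are divided among the OpenMP threads.
       j and k are private so each thread has its own copies; the loop
       variable i is private automatically. */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < m; i++) {
        long acc = 0;   /* declared inside the loop, so private per iteration */
        for (j = 0; j < n; j++) {
            for (k = 0; k < x; k++) {
                acc += (long)i + j + k;   /* stand-in for "calculations" */
            }
        }
        row_sum[i] = acc;   /* each iteration writes a distinct element */
    }
}
```

Calling `triple_loop(out, 3, 2, 2)` fills `out` with {4, 8, 12}. Because every outer iteration writes only its own element, no synchronization is needed between threads.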
Where does my program run?
1. On the CPU, with “offload” to the coprocessor (a model popular with GPUs)
2. All the cores (CPU or coprocessor) are just peers in a system (probably connected with MPI)
Your choice. Whatever works best for you.
On CPU and “offload” to coprocessor model popular with GPUs
Supported by:
1. Automatic use by Intel® Math Kernel Library (MKL)
2. Program control by compiler directives (C, C++, Fortran)
3. APIs available to build additional tools or low-level programs
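As a rough illustration of the compiler-directive route, here is a minimal sketch using the `#pragma offload` syntax of Intel's Language Extensions for Offload. The function name and the a*x + y workload are hypothetical. Compilers without LEO support ignore the pragma, so the loop then runs on the host.

```c
/* Sketch only: hypothetical function and workload, not from the slides. */
void scale_add_offload(float a, float *x, float *y, int n)
{
    int i;

    /* Copy x to the coprocessor; copy y there and back again.
       Ignored by compilers that do not implement LEO. */
    #pragma offload target(mic) in(x : length(n)) inout(y : length(n))
    {
        /* Inside the offloaded region, spread iterations across threads. */
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

The `in` and `inout` clauses state which arrays move in which direction, which is the "user-managed" data movement that this style of offload relies on.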
Offload Directives and Standard Requirements
Feature columns: NVidia’s OpenACC, Intel’s LEO, Desired Standard
Support for C and C++, Fortran ✔ ✔ ✔
Support single code base of hetero-machine ✔ ✔ ✔
Overlap communication and computation ✔ ✔ ✔
Interoperate with MPI ✔ ✔ ✔
Interoperate with OpenMP* ✔ ✔
Offload to GPU ✔ ✔
Offload to MIC Coprocessor ✔ ✔
Ability to support all accelerators ✔
Ability to support all GPUs ✔
Ability to support all co-processors ✔
Proof of performance portability ✔
Support for nested parallelism ✔ ✔
User-managed memory consistency ✔ ✔ ✔
Multiple vendor support ✔ ✔
Restrict clause support ✔
Support for dynamic dispatch ✔ ✔
Parallel on/off separate from offload ✔ ✔
PGI*, CAPS* compiler support 2012 ✔
Cray* compiler support soon ✔
Intel® compiler support 2010* ✔
Broad standards body approval ✔ OpenMP* 4.0 (early 2013) planned
* public product in 2012
Two pre-standard approaches to directives to control “offload”
nVidia OpenACC
Data parallelism only
Optimized for SIMT GPU
No general purpose threading
Targets “GPU Computing”
Closed spec
Intel Language Extensions for Offload (LEO)
Broad range of parallelism
Multicore, many-core CPU, GPU
General purpose threading
Supports Intel CPU, GPU & coprocessor
Standards body with broad participation
Intel LEO supports diverse parallel programming models and is an ideal path to OpenMP 4.0
Other brands and names are the property of their respective owners.
OpenMP “omp target”: open, standard, supports diverse hardware
Intel will support the OpenMP/TR in our C/C++ and Fortran compilers
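For comparison, a minimal sketch of the `target` construct with `map` clauses as later standardized in OpenMP 4.0 follows. The technical-report syntax referenced on the slide may have differed in detail. The function name and dot-product workload are hypothetical. Without OpenMP device support, the pragmas are ignored or the region falls back to the host.

```c
/* Sketch only: hypothetical function and workload, not from the slides. */
double dot_target(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;

    /* Map the inputs to the device and bring the accumulated result back. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(tofrom: sum)
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The array sections `a[0:n]` and `b[0:n]` play the same role as the `length(n)` modifiers in the LEO style: they tell the compiler how much data to transfer.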
Where does your program RUN? Everywhere
More flexible possibility: consider the program to run on cores everywhere. This opens up many possibilities.
Peer cores, or groups of cores, can be organized in many ways.
Peers? Well, it is an SMP-on-a-chip running Linux.
As peers, a distributed program runs on processors and coprocessors, communicating with each other.
Many ways to think about this. Starts with MPI.
Intel Xeon Phi coprocessors stand out here because of how very flexible this model is. Limited only by imagination!
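In the peer model, each MPI rank, whether it runs on a processor or a coprocessor, typically owns a slice of the problem. The sketch below shows only the index arithmetic for one possible even split. The helper name `block_range` is hypothetical and no MPI calls are made. In a real program, `rank` and `size` would come from `MPI_Comm_rank` and `MPI_Comm_size`.

```c
/* Hypothetical helper: split n work items across `size` peer ranks.
   Rank `rank` owns the half-open index range [*lo, *hi). */
void block_range(int rank, int size, int n, int *lo, int *hi)
{
    int base = n / size;                         /* items every rank gets */
    int rem  = n % size;                         /* first `rem` ranks get one extra */
    int extra_before = rank < rem ? rank : rem;  /* extras held by lower ranks */

    *lo = rank * base + extra_before;
    *hi = *lo + base + (rank < rem ? 1 : 0);     /* exclusive upper bound */
}
```

With 10 items and 4 ranks this yields the ranges [0,3), [3,6), [6,8), and [8,10). The split treats all ranks as equal. Ranks on processors and coprocessors may well run at different speeds, so a weighted split could be a better fit in practice.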
HotChips presentation (architecture details)
Where to Learn More
http://intel.com/software/mic
“This is a really great book… I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations. Finally I have that book.” - Martin Watt, Principal Engineer, Dreamworks Animation
(c) 2012, publisher: Morgan Kaufmann
Available in early 2013. (limited partial “proof” version available at SC12 for reviewers)
Completely focused on Intel Xeon Phi coprocessors. Volume 1: essentials. ~350 pages of explanation of programming. It all comes down to PARALLEL PROGRAMMING! (Applicable to processors and the Intel® Xeon Phi™ coprocessor.)
(c) 2013
http://tinyurl.com/inteljames
my blogs
Summary
Intel® Xeon Phi™ coprocessor provides:
• Performance and performance/watt for highly parallel HPC workloads, with cores, threads, wide SIMD, caches, and memory BW
• While maintaining the advantages of the Intel Architecture general purpose programming environment
• Advanced power management technology
Delivers programmability and performance/watt for highly parallel HPC
parallel programming
Thank you.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Today’s presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing for more information on the risk factors that could cause actual results to differ. If we use any non-GAAP financial measures during the presentations, you will find on our website, intc.com, the required reconciliation to the most directly comparable GAAP financial measure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
Legal Information
Legal Information: Performance
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information. SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.
Intel® Many Integrated Core (Intel MIC) Architecture
Targeted at highly parallel HPC workloads
• Physics, Chemistry, Biology, Financial Services
Power efficient cores, support for parallelism
• Cores: less speculation, threads, wider SIMD
• Scalability: high BW on die interconnect and memory
General Purpose Programming Environment
• Runs Linux (full service, open source OS)
• Runs applications written in Fortran, C, C++, …
• Supports X86 memory model, IEEE 754
• x86 collateral (libraries, compilers, Intel® VTune™, debuggers, etc.)