Intel® Xeon Phi™ Coprocessor

James Reinders

http://tinyurl.com/inteljames twitter @jamesreinders

It’s all about parallel programming

[Figure: a single Source feeds common Compilers, Libraries, and Parallel Models, which target both Multicore CPUs and the Intel® MIC architecture coprocessor]

Game Changer

“Unparalleled productivity… most of this software does not run on a GPU” - Robert Harrison, NICS, ORNL

R. Harrison, “Opportunities and Challenges Posed by Exascale Computing - ORNL's Plans and Perspectives”, National Institute of Computational Sciences, Nov 2011

Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor

Intel® C/C++ and Fortran Compilers w/OpenMP

Intel® MKL, Intel® Cilk Plus, Intel® TBB, and Intel® IPP

Intel® Parallel Studio XE

Intel® Trace Analyzer and Collector

Intel® MPI Library

Software Development Ecosystem for Intel Xeon Phi coprocessors

Compilers, run environments
• Open source: gcc (kernel build only, not for applications), Python
• Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* compiler, ScaleMP*

Debugger
• Open source: gdb
• Commercial: Intel Debugger, Rogue Wave* TotalView*, Allinea* DDT

Libraries
• Open source: TBB(1), MPICH2, FFTW, NetCDF
• Commercial: NAG*, Intel® MKL, Intel® MPI, OpenMP* (in Intel compilers), Cilk™ Plus (in Intel compilers), Rogue Wave* IMSL, Intel® OpenCL* SDK

Profiling & Analysis Tools
• Commercial: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Rogue Wave ThreadSpotter*

Workload Scheduler
• Commercial: Altair* PBS Professional, Adaptive* Computing Moab

(1) Commercial support of TBB available from Intel.
*Other names and brands may be claimed as the property of others.

Knights Corner Coprocessor

[Figure: an Intel® Xeon® processor with system memory connects over PCIe x16 to one or more KNC cards. Each KNC card has > 50 cores and >= 8GB GDDR5 memory on multiple GDDR5 channels, runs Linux OS, and is reachable over TCP/IP.]

Knights Corner Micro-architecture

[Figure: cores, each paired with an L2 cache and a tag directory (TD), alongside GDDR memory controllers (GDDR MC) and PCIe client logic]

Knights Corner Core

• X86 specific logic < 2% of core + L2 area
• 4 threads, in-order (instruction pointers T0 IP, T1 IP, T2 IP, T3 IP)
• L1 TLB and 32KB code cache; L1 TLB and 32KB data cache
• L2 TLB and TLB miss handler
• Decode with uCode: 16B/cycle (2 IPC)
• Two pipes: Pipe 0 and Pipe 1
• Register files: X87 RF, scalar RF, VPU RF
• Execution units: X87, ALU 0, ALU 1, VPU 512b SIMD
• 512KB L2 cache with L2 Ctl and HWP, connected to the on-die interconnect
• Core pipeline stages: PPF PF D0 D1 D2 E WB

Vector Processing Unit

• Core pipeline stages: PPF PF D0 D1 D2 E WB
• Vector pipeline stages: D2 E VC1 VC2 V1-V4 WB
• DEC, VPU RF (3R, 1W), Mask RF
• Scatter Gather, LD, ST
• EMU
• Vector ALUs: 16 wide x 32 bit or 8 wide x 64 bit
• Fused Multiply Add
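To make the "16 wide x 32 bit" and "Mask RF" labels concrete, here is a minimal scalar sketch in C of what one masked fused multiply-add over a 512-bit register would compute, lane by lane. The function name `masked_fma16` and the constant `LANES_F32` are hypothetical names for illustration; this is plain C that models the arithmetic, not the coprocessor's instruction set or an intrinsic API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 512-bit vector register / 32-bit elements = 16 lanes (from the slide). */
enum { LANES_F32 = 16 };

/* Hypothetical helper: models one masked multiply-add across 16 lanes.
 * Lanes whose mask bit is set receive a*b + c; the others keep their
 * previous value in dst. A vectorizing compiler may map the loop body
 * onto a hardware fused multiply-add, but that is not guaranteed. */
void masked_fma16(float *dst, const float *a, const float *b,
                  const float *c, uint16_t mask)
{
    assert(dst != NULL && a != NULL && b != NULL && c != NULL);
    for (int lane = 0; lane < LANES_F32; lane++) {
        if (mask & (1u << lane)) {
            dst[lane] = a[lane] * b[lane] + c[lane];
        }
    }
}
```

With 64-bit elements the same register would hold 8 lanes, which is the "8 wide x 64 bit" case on the slide.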

Interconnect

[Figure: cores with L2 caches and tag directories (TD) attached to the interconnect, which carries the following in each direction]
• BL (64 bytes): Data
• AD: Command and Address
• AK: Coherence and Credits

Interleaved Memory Access

[Figure: GDDR memory controllers (GDDR MC) placed among the cores, L2 caches, and tag directories (TD)]

A picture can be worth a thousand words.

SMALL NUMBER OF THREADS IS UNINTERESTING


AT LOW PERFORMANCE LEVELS, MORE THREADS NEEDED FOR SAME PERFORMANCE


THE PAYOFF IS HIGHER ACHIEVABLE RESULTS ON CERTAIN WORKLOADS AND LOWER POWER USAGE

Over 100 threads?

    !$OMP PARALLEL DO PRIVATE(j,k)
    do i = 1, M
       ! each thread will work its own part of the problem
       do j = 1, N
          do k = 1, X
             ! calculations
          end do
       end do
    end do

Fortran do loop transformed to create many threads using an OpenMP directive
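The same pattern can be sketched in C. This is a minimal illustration, not code from the presentation: the function `fill_rows` and the stand-in computation `cell_value` are hypothetical, chosen only so the loop nest produces a checkable result. Built without OpenMP support the pragma is ignored and the loop runs serially with the same output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the "calculations" in the slide's loop nest. */
static double cell_value(int i, int j, int k)
{
    return (double)(i + j + k);
}

/* For each i in [0, m), stores the sum of cell_value(i, j, k) over
 * j in [0, n) and k in [0, x). The outer loop is split across threads;
 * j and k are private so each thread has its own inner-loop counters. */
void fill_rows(double *out, int m, int n, int x)
{
    int i, j, k;
    assert(out != NULL && m >= 0 && n >= 0 && x >= 0);

    #pragma omp parallel for private(j, k)
    for (i = 0; i < m; i++) {
        double sum = 0.0;  /* declared inside the loop, so private per iteration */
        for (j = 0; j < n; j++) {
            for (k = 0; k < x; k++) {
                sum += cell_value(i, j, k);
            }
        }
        out[i] = sum;      /* each iteration writes a distinct element */
    }
}
```

Because every iteration of the outer loop writes only its own `out[i]`, no synchronization is needed, which is what lets the directive scale to a large number of threads.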

Where does my program run?

1. On the CPU, with “offload” to the coprocessor (the model popular with GPUs)
2. All the cores (CPU or coprocessor) are just peers in a system (probably connected with MPI)

Your choice. Whatever works best for you.

On the CPU, with “offload” to the coprocessor (the model popular with GPUs)

Supported by:
1. Automatic use by Intel® Math Kernel Library (MKL)
2. Program control by compiler directives (C, C++, Fortran)
3. APIs available to build additional tools or low-level programs
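As an illustration of item 2, the sketch below marks a loop for offload using the `#pragma offload target(mic)` directive from the Intel compilers' Language Extensions for Offload. The function name `scale_offload` and the computation are hypothetical. Compilers that do not recognize the pragmas ignore them, so the same source also runs entirely on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dst[i] = factor * src[i] for i in [0, n).
 * The in/out clauses state which pointer data is copied to the
 * coprocessor before the block and copied back after it. */
void scale_offload(const float *src, float *dst, int n, float factor)
{
    assert(src != NULL && dst != NULL && n >= 0);

    #pragma offload target(mic) in(src : length(n)) out(dst : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            dst[i] = factor * src[i];
        }
    }
}
```

The single code base is the point of the design: the directive annotates ordinary C rather than moving the kernel into a separate language.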

Offload Directives and Standard Requirements

Columns: Feature, NVidia’s OpenACC, Intel’s LEO, Desired Standard

Support for C and C++, Fortran ✔ ✔ ✔
Support single code base of hetero-machine ✔ ✔ ✔
Overlap communication and computation ✔ ✔ ✔
Interoperate with MPI ✔ ✔ ✔
Interoperate with OpenMP* ✔ ✔
Offload to GPU ✔ ✔
Offload to MIC Coprocessor ✔ ✔
Ability to support all accelerators ✔
Ability to support all GPUs ✔
Ability to support all co-processors ✔
Proof of performance portability ✔
Support for nested parallelism ✔ ✔
User-managed memory consistency ✔ ✔ ✔
Multiple vendor support ✔ ✔
Restrict clause support ✔
Support for dynamic dispatch ✔ ✔
Parallel on/off separate from offload ✔ ✔
PGI*, CAPS* compiler support 2012 ✔
Cray* compiler support soon ✔
Intel® compiler support 2010* ✔
Broad standards body approval ✔

OpenMP* 4.0 (early 2013) planned

* public product in 2012

Two pre-standard approaches to directives to control “offload”

NVIDIA OpenACC
• Data parallelism only
• Optimized for SIMT GPU
• No general purpose threading
• Targets “GPU Computing”
• Closed spec

Intel Language Extensions for Offload (LEO)
• Broad range of parallelism
• Multicore, many-core CPU, GPU
• General purpose threading
• Supports Intel CPU, GPU & coprocessor
• Standards body with broad participation

Intel LEO supports diverse parallel programming models and is an ideal path to OpenMP 4.0

Other brands and names are the property of their respective owners.

OpenMP “omp target” Open, Standard, Supports Diverse Hardware

Intel will support the OpenMP/TR in our C/C++ and Fortran compilers
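For comparison, a minimal sketch of the `omp target` style in C follows. It uses the `target` construct and `map` clauses as they appear in the published OpenMP 4.0 specification, which may differ in detail from the technical report the slide refers to. The function name `saxpy_target` is hypothetical. Without a device, or without OpenMP enabled, the region simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * map(to: ...) copies x to the device; map(tofrom: ...) copies y
 * to the device and back to the host after the region. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL && n >= 0);

    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The structure mirrors the vendor-specific offload directives: one directive for where the code runs and what data moves, and a separate one for how the loop is parallelized.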

Where does your program RUN? Everywhere

More flexible possibility: consider the program to run on cores everywhere. This opens up many possibilities.

Peer cores or groups of cores can be organized in many ways.

Peers? Well, it is an SMP-on-a-chip running Linux.

As peers, a distributed program runs on processors and coprocessors, communicating with each other.

Many ways to think about this. Starts with MPI.

Intel Xeon Phi coprocessors stand out here because of how very flexible this model is. Limited only by imagination!

Where to Learn More

• HotChips presentation (architecture details)
• http://intel.com/software/mic

“This is a really great book… I've been dreaming for a while of a modern accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading as well as something that covers all the major current interesting implementations. Finally I have that book.” - Martin Watt, Principal Engineer, Dreamworks Animation

(c) 2012, publisher: Morgan Kaufmann


Available in early 2013. (limited partial “proof” version available at SC12 for reviewers)

Completely focused on Intel Xeon Phi coprocessors. Volume 1: essentials, ~350 pages of explanation of programming. It all comes down to PARALLEL PROGRAMMING! (applicable to processors and Intel® Xeon Phi™ coprocessor)

(c) 2013


http://tinyurl.com/inteljames

my blogs

Summary

Intel® Xeon Phi™ coprocessor provides:

• Performance and performance/watt for highly parallel HPC workloads with cores, threads, wide SIMD, caches, and memory BW, while maintaining the advantages of the Intel Architecture general purpose programming environment
• Advanced power management technology

It delivers programmability and performance/watt for highly parallel HPC.

parallel programming


Thank you. http://tinyurl.com/inteljames twitter @jamesreinders

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. Copyright © 2012 , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

http://www.intel.com/software/products

Today’s presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing for more information on the risk factors that could cause actual results to differ. If we use any non-GAAP financial measures during the presentations, you will find on our website, intc.com, the required reconciliation to the most directly comparable GAAP financial measure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Legal Information

Legal Information: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information. SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See http://www.sap.com/benchmark for more information.

Intel® Many Integrated Core (Intel MIC) Architecture

Targeted at highly parallel HPC workloads
• Physics, Chemistry, Biology, Financial Services

Power efficient cores, support for parallelism
• Cores: less speculation, threads, wider SIMD
• Scalability: high BW on-die interconnect and memory

General Purpose Programming Environment
• Runs Linux (full service, open source OS)
• Runs applications written in Fortran, C, C++, …
• Supports X86 memory model, IEEE 754
• x86 collateral (libraries, compilers, Intel® VTune™, debuggers, etc.)
