
Intel Technologies for High Performance Computing

Leo Borges

Intel Software Conference 2014 Brazil, May 2014

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Legal Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as

SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those

factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated

purchases, including the performance of that product when combined with other products.

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the

baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that

correlates with the performance improvements reported.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its

customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks

are accurate and reflect performance of systems available for purchase.

Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC

manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors

support HT Technology, visit http://www.intel.com/info/hyperthreading.

Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology

performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system

delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different

processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life

sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject

to change without notice

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s

current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice

Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo , Xeon Phi and Xeon Phi logo are trademarks or registered

trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only

and are subject to change without notice.

*Other names and brands may be claimed as the property of others.

2


Building Blocks

Many Product Families – Today’s talk: HPC Focus

Product families and building blocks targeting an array of Segments

Haswell-based Xeon families:
E3: E3-1200 v3
E5 (Efficient Performance): E5-1600 v3, E5-2600 v3 (E5-2400 v3 for Comms & Storage only), E5-4600 v3
E7: E7-2800 v3, E7-4800 v3, E7-8800 v3

Building blocks: Boards/PDKs, Software, SSDs, LAN, RAID, Co-processors

Segments: Channel, Enterprise, HPC, Mission Critical, Big Data, Public Cloud, Cloud, Storage, Networking

Note: For discussion purposes only (not intended to be interpreted as portfolio recommendations or guidance)


Objectives of this session

• Recall of a few basics for HPC
• What to expect from your code
• What to expect from the hardware
• Review Vectorization
• Xeon + Xeon Phi Example


Review of a few HPC basics for non-ninja programmers

5


How it works and where are the bottlenecks

[Diagram: two CPUs, each with L1, L2 and L3 caches and memory, connected to I/O and to the interconnect.]

Questions to ask at each level:
• Core count, size & perf?
• Cache size, BW & latency?
• Memory size, BW & latency?
• Intra / inter socket communications?
• Inter-node communication?

6


Unfortunately, you need to be aware

CPU

L 1 L 2 L 3memory

Bandwidth

Latency

Capacity

From the core to the I/O subsystem: L1, L2, L3, L4, L5, …, Ln

Levels: caches, eDRAM, MCDRAM, DDR, NVM, SSD, PCIe SSD, HDD, tapes

7


FLOPS and memory Bandwidth impact the efficiency & scalability

• Performing flops is not the issue

• Data movement is the issue (BW, latency, power)

Efficiency (= achieved flops / peak flops) won’t be high enough if stores and loads are not fast enough (GB/s)

First approximation: Only a matter of Frequency and Bandwidth

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i] * d[i];

Per iteration: 3 loads (a[i], b[i], d[i]), 1 store (c[i]), 1 mul, 1 add
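To make the load/store versus flop balance of the loop above concrete, here is a minimal sketch that counts flops per byte of memory traffic. The function names are illustrative (not from the slides), and the 8-byte element size assumes double-precision arrays, which the slide does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Flops per byte of memory traffic for one loop iteration.
 * flops, loads, stores are per-iteration counts; elem_size is in bytes. */
double flops_per_byte(int flops, int loads, int stores, size_t elem_size)
{
    assert(loads + stores > 0 && elem_size > 0);
    return (double)flops / ((double)(loads + stores) * (double)elem_size);
}

/* Upper bound on Gflop/s if the loop is limited only by memory bandwidth. */
double bandwidth_bound_gflops(double bw_gbytes_per_s, double intensity)
{
    return bw_gbytes_per_s * intensity;
}
```

For the loop above (2 flops, 3 loads, 1 store, assumed 8-byte elements) this gives 2 / 32 = 0.0625 flop/byte, so even a fast core is held back by how quickly data arrives.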

8


Performance expectation: upper bounds

CPU bound: “HPL” (Flops/s-demanding applications)

Memory bound: “Stream” (BW-demanding applications)

Real-world applications sit between these two bounds

Analyzing the flop/memory-access ratio gives a first guess for performance prediction

• Our performance metrics are Gflop/s and % of peak (efficiency)
• Elapsed time alone might not tell the whole story (how far from peak performance?)
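The "first guess" from the flop/memory-access ratio can be sketched as taking the smaller of the compute bound and the bandwidth bound. This is an illustrative helper under that simple two-bound assumption, not a tool referenced by the talk; the function names are hypothetical.

```c
#include <assert.h>

/* First-guess upper bound: the lower of the compute bound (peak Gflop/s)
 * and the bandwidth bound (GB/s times flops per byte). */
double attainable_gflops(double peak_gflops, double bw_gbytes_per_s,
                         double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bw_gbytes_per_s > 0.0 && flops_per_byte > 0.0);
    double bw_bound = bw_gbytes_per_s * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

/* Returns 1 when the bandwidth bound is the limiting one. */
int is_memory_bound(double peak_gflops, double bw_gbytes_per_s,
                    double flops_per_byte)
{
    return bw_gbytes_per_s * flops_per_byte < peak_gflops;
}
```

A kernel with a low flop/byte ratio lands on the "Stream" side, one with a high ratio on the "HPL" side.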

9



10

Memory Bound?

Compute Bound?


Glossary, “High performance computing”

• Peak = nb of floating-point operations per cycle * frequency

• Unit: “Flops/sec”

“Efficiency = % of the peak performance”

Same for bandwidth (but in GBytes/sec)

$\text{Peak}\ [\text{Flops/sec}] = (\text{Flops/cycle}) \times (\text{cycles/sec})$

By the way : What is the peak perf of your laptop ?
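The glossary formulas can be written down directly. The sketch below uses hypothetical function names and multiplies by the core count to get a whole-processor peak, as the talk does later for the demo laptop.

```c
#include <assert.h>

/* Peak Gflop/s = cores * frequency (GHz) * flops per cycle per core. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    assert(cores > 0 && ghz > 0.0 && flops_per_cycle > 0);
    return (double)cores * ghz * (double)flops_per_cycle;
}

/* Efficiency = achieved performance as a percentage of peak. */
double efficiency_percent(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved_gflops / peak;
}
```

For a 2-core, 1.8 GHz part with 16 single-precision flops per cycle, `peak_gflops(2, 1.8, 16)` evaluates the product 2 * 1.8 * 16.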

11


Anatomy of a Computer Platform

12


CPU: Core/Uncore - Designed For Modularity

[Diagram: several cores (Core) sharing an L3 cache; the Uncore holds the IMC (to DRAM), the QPI links, and power & clock control.]

Differentiation in the “Uncore”: # QPI links, # mem channels, size of cache, # cores, power management, type of memory, integrated graphics

QPI: Intel® QuickPath Interconnect


Romley EP/EN Platforms
Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families

[Diagram: two Intel® Xeon® E5-2400/2600 processors with QPI links between them, DDR3 channels and PCIe* 3.0 x8 ports on each processor, DMI2 and PCIe* 2.0 x4 links, and the Intel® C600 series chipset with 3 Gb/s SAS/SATA.]

Ivy Bridge CPUs
• Socket R: up to 12 cores / socket
• Socket B2: up to 10 cores / socket

QPI
• Socket R: 2 QPI links
• Socket B2: 1 QPI link

Memory
• DDR3 & DDR3L
• RDIMMs, UDIMMs, LR DIMMs
• Socket R: 4 channels per socket, up to 3 DPC; speeds up to DDR3-1866
• Socket B2: 3 channels per socket, up to 2 DPC; speeds up to DDR3-1600

PCI Express* 3.0
• Socket R: 40 lanes per socket
• Socket B2: 24 lanes per socket
• Extra Gen 2 x4 on 2nd CPU

Intel® C600 series chipset (Patsburg PCH)
• Optimized Server & WS PCH
• Integrated storage: up to 8 ports 3 Gb/s SAS; RAID 5 optional


IvyBridge (IVB) E5-2600 v2 family

The total benefit (at node level) is given by a combination of factors

[Die diagram: 12 cores (C), LLC cache, memory controller (MC) with four DDR3 channels, QPI and I/O blocks with two QPI links and PCIe Gen3 x16, x16 and x8.]

Feature: Xeon E5-2600 v2
Process technology: 22 nm
Cores/Threads: up to 12 cores / 24 threads
Last-level cache: up to 30 MB
Max memory speed (MHz): up to 1866
Max DIMM capacity: 12 slots/processor
PCIe* lanes / controllers / speed: 40 / 10 (PCIe* 3.0 at 8 GT/s)
TDP (W): 150 (workstation only), 130, 115, 95, 80, 70, 60

wstream.exe


Advanced

Standard

Workstation Only SKU

Segment Optimized

• 8.0 GT/s QPI
• DDR3-1866
• Intel® HT
• Intel® Turbo Boost

Low Power

Basic

Socket compatible with SNB-EP, top to bottom on the SKU stack

All SKUs, frequencies and features can change without notice

SKUs (cores, TDP, frequency, cache, model):

12C 130W 2.7GHz 30M E5-2697 v2
12C 115W 2.4GHz 30M E5-2695 v2
10C 130W 3.0GHz 25M E5-2690 v2
10C 115W 2.8GHz 25M E5-2680 v2
10C 115W 2.5GHz 25M E5-2670 v2
10C 95W 2.2GHz 25M E5-2660 v2
8C 95W 2.6GHz 20M E5-2650 v2
8C 95W 2.0GHz 20M E5-2640 v2
6C 80W 2.6GHz 15M E5-2630 v2
6C 80W 2.1GHz 15M E5-2620 v2
4C 80W 2.5GHz 10M E5-2609 v2
4C 80W 1.8GHz 10M E5-2603 v2
8C 130W 3.3GHz 25M E5-2667 v2
6C 130W 3.5GHz 25M E5-2643 v2
4C 130W 3.5GHz 15M E5-2637 v2
10C 70W 1.7GHz 25M E5-2650L v2
6C 60W 2.4GHz 15M E5-2630L v2
8C 150W 3.4GHz 20M E5-2687W v2

Feature groups shown next to the SKU tiers:

• 10C: 8.0 GT/s QPI; 6C: 7.2 GT/s QPI; DDR3-1600; Intel® HT; Intel® Turbo Boost
• 7.2 GT/s QPI; DDR3-1600; Intel® HT; Intel® Turbo Boost
• 8.0 GT/s QPI; DDR3-1866 (skt R); DDR3-1600 (skt B2); Intel® HT; Intel® Turbo Boost
• 6.4 GT/s QPI; DDR3-1333; no Intel® HT; no Intel® Turbo

E5-2600 v2 Product Family

16


Parallel Programming for Intel® Architecture (or, in general, for normal CPUs)

4 considerations when writing an efficient, unconstrained parallel program:

Cores: OpenMP, TBB, Cilk Plus; threads, locks
Vectors: vector loops, vector functions, array notations; intrinsics
Memory, caches: blocking algorithms
Data layout and alignment: AoS → SoA library, directives for alignment; manual layout, ugly code

Performance Analysis
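As an illustration of the data-layout item, the sketch below converts an array of structures (AoS) into a structure of arrays (SoA) by hand. The type and function names are hypothetical; the talk mentions an "AoS → SoA library" but does not show its API, so this is a manual stand-in rather than that library.

```c
#include <assert.h>
#include <stddef.h>

/* Array of Structures: x, y, z of one point are adjacent in memory. */
typedef struct { float x, y, z; } PointAoS;

/* Structure of Arrays: all x values are contiguous, then all y, then all z. */
typedef struct { float *x, *y, *z; } PointsSoA;

/* Copies n points from AoS layout into caller-provided SoA arrays. */
void aos_to_soa(const PointAoS *in, PointsSoA out, size_t n)
{
    assert(in && out.x && out.y && out.z);
    for (size_t i = 0; i < n; i++) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
}

/* Unit-stride reduction over x: consecutive iterations touch consecutive
 * floats, which is the access pattern vector loads prefer. */
float sum_x_soa(PointsSoA p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p.x[i];
    return s;
}
```

With the AoS layout the same reduction would read every third float (stride 3), which is why SoA is usually the friendlier layout for vector loops.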


“SIMDization”, so called Vectorization

Single Instruction Multiple Data (SIMD):

• Processing a vector with a single operation

• Provides data level parallelism (DLP)

Vector:

• Consists of more than one element

• Elements are of the same scalar data type (e.g. floats, integers, …)

[Figure: Scalar Processing computes one C = A + B per instruction; Vector Processing computes VL element-wise sums Ci = Ai + Bi per instruction, where VL is the vector length.]


Vectorization of Code

• Transform sequential code to exploit vector processing capabilities (SIMD)

– Manually by explicit syntax

– Automatically by tools like a compiler

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];

[Figure: scalar execution adds one pair a[i] + b[i] = c[i] per instruction; vector execution adds eight pairs, a[i..i+7] + b[i..i+7] = c[i..i+7], per instruction.]
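A minimal sketch of the "manually by explicit syntax" route for the loop above, processing eight floats per step as in the figure. It assumes a GCC- or Clang-compatible compiler, since it uses their `vector_size` extension rather than Intel intrinsics; the names `v8sf` and `vec_add` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension (assumption: one of these compilers):
 * 8 floats = 256 bits, the width of one AVX register. */
typedef float v8sf __attribute__((vector_size(32)));

/* c[i] = a[i] + b[i] for i in [0, n), eight elements per vector step. */
void vec_add(float *c, const float *a, const float *b, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8sf va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* unaligned vector load  */
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                    /* one element-wise add   */
        memcpy(c + i, &vc, sizeof vc);   /* unaligned vector store */
    }
    for (; i < n; i++)                   /* scalar remainder loop  */
        c[i] = a[i] + b[i];
}
```

The remainder loop handles trip counts that are not a multiple of eight, which is also what an auto-vectorizing compiler has to generate.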

19


Reminder about the peak flops

Intel® SandyBridge/Ivy Bridge micro-architecture

[Diagram: scheduler with six issue ports (port names as used by Intel® Architecture Code Analyzer). Ports 0, 1 and 5 feed the ALUs and the SSE/AVX units (FP MUL, FP ADD, shuffle, blend, boolean, vector integer MUL/ADD, divide); ports 2 and 3 handle loads and store addresses; port 4 handles store data. Vector registers span bits 0 to 255.]

6 instructions / cycle:
• 3 memory ops
• 3 computational operations

Nehalem/Westmere: two 128-bit SIMD operations per cycle
• 4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle
• 2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle

SandyBridge/Ivy Bridge: two 256-bit SIMD operations per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
• 4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle


In the Laptop We’ll be Using for Demo…

Processor: Intel Core i5-3427U (from ark.intel.com):

Processor Number: i5-3427U
# of Cores: 2
# of Threads: 4
Clock Speed: 1.8 GHz
Max Turbo Frequency: 2.8 GHz
Instruction Set Extensions: AVX

SandyBridge/Ivy Bridge: two 256-bit SIMD operations per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
• 4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle

2 (cores) * 1.8 GHz * 16 Flop/cycle = 57.6 Gflop/s (single precision)
2 (cores) * 1.8 GHz * 8 Flop/cycle = 28.8 Gflop/s (double precision)


Haswell-EP vs IvyBridge-EP

The total benefit (at node level) is given by a combination of factors

• Benefit from micro-architecture optimization (IPC): 25% IPC improvements
• Benefit from the number of cores: up to 1.16x (at constant frequency)
• Benefit from AVX2: up to 2x (FMA)
• Benefit from memory bandwidth: up to 1.14x (1866 MHz to 2133 MHz)

[Die diagram: 14 cores (C), LLC cache, memory controller (MC) with four DDR4 channels, QPI and I/O blocks with two QPI links and PCIe Gen3 x16, x16 and x8.]


Flops/s, AVX, AVX2 and AVX-512

[Roadmap figure, 2013 to 2016 in half-year steps: Ivy Bridge-EP, Haswell-EP, then two future generations labeled AVX-512.]


Haswell Execution Unit Overview

Intel® Microarchitecture (Haswell)

Unified Reservation Station, issuing to eight ports:

Port 0: Integer ALU & Shift; FMA / FP Multiply; Vector Int Multiply; Vector Logicals; Branch; Divide; Vector Shifts
Port 1: Integer ALU & LEA; FMA / FP Mult / FP Add; Vector Int ALU; Vector Logicals
Port 2: Load & Store Address
Port 3: Load & Store Address
Port 4: Store Data
Port 5: Integer ALU & LEA; Vector Shuffle; Vector Int ALU; Vector Logicals
Port 6: Integer ALU & Shift; Branch
Port 7: Store Address

• 2x FMA: doubles peak FLOPs; two FP multiplies benefit legacy code
• 4th ALU: great for integer workloads; frees Port 0 & 1 for vector
• New Branch Unit: reduces Port 0 conflicts; 2nd EU for high branch code
• New AGU for Stores: leaves Port 2 & 3 open for loads


Extends 128-bit integer vector instructions to 256-bit

Floating Point Fused Multiply Add: A*B + C

• Increased FLOPS potential

• Increased accuracy: only a single rounding

Enhanced vectorization with Gather, Shifts and powerful permutes

Intel® AVX2 uses same 256-bit YMM registers as Intel AVX

Floating-Point Performance (Peak) per Core, in Flops/Cycle:

SSE4 (Nehalem/Westmere): MUL (*) + ADD (+): 4 DP (8 SP)
256b AVX1 (SandyBridge/Ivy Bridge): MUL (*) + ADD (+): 8 DP (16 SP), 2x over SSE4
256b AVX2 (Haswell): FMA (*,+) + FMA (*,+): 16 DP (32 SP), 2x over AVX1
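The per-core numbers above follow from three factors: lanes per register, number of floating-point units issuing per cycle, and whether each unit performs a fused multiply-add (two flops per lane). A small sketch with a hypothetical function name:

```c
#include <assert.h>

/* Peak flops per cycle per core.
 * simd_bits: register width; elem_bits: 32 (SP) or 64 (DP);
 * fp_units:  FP units issuing per cycle (MUL + ADD, or 2 FMA units);
 * has_fma:   nonzero if each unit does a fused multiply-add. */
int peak_flops_per_cycle(int simd_bits, int elem_bits, int fp_units, int has_fma)
{
    assert(simd_bits > 0 && elem_bits > 0 && simd_bits % elem_bits == 0);
    assert(fp_units > 0);
    int lanes = simd_bits / elem_bits;
    return lanes * fp_units * (has_fma ? 2 : 1);
}
```

SSE4 corresponds to `(128, 64, 2, 0)`, AVX to `(256, 64, 2, 0)` and AVX2 with FMA to `(256, 64, 2, 1)` for double precision.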

25







Use math libs for best use of AVX1, AVX2 & AVX-512

• Use Intel® Math Kernel Library as much as possible
• Use intrinsics or assembly for specific kernels
• Use the compiler and Intel tools to optimize your source code

[Chart: speedup (axis 0.0 to 2.0), one-core basis comparison of application source code, assembly/intrinsics, the MKL FFT benchmark and the MKL DGEMM benchmark.]

27


Intel® Math Kernel Library: Optimized Mathematical Building Blocks

Linear Algebra

• BLAS

• LAPACK

• Sparse Solvers

• Iterative

• Pardiso*

• ScaLAPACK

Fast Fourier Transforms

• Multidimensional

• FFTW interfaces

• Cluster FFT

Vector Math

• Trigonometric

• Hyperbolic

• Exponential, Log

• Power / Root

Vector RNGs

• Congruential

• Wichmann-Hill

• Mersenne Twister

• Sobol

• Niederreiter

• Non-deterministic

Summary Statistics

• Kurtosis

• Variation coefficient

• Order statistics

• Min/max

• Variance-covariance

And More

• Splines

• Interpolation

• Trust Region

• Fast Poisson Solver

Intel® MKL is an integral part of Intel® Parallel Studio XE

28


Many Ways to Vectorize

From ease of use (top) to programmer control (bottom):

• Compiler: Auto-vectorization (no change of code)
• Compiler: Auto-vectorization hints (#pragma simd, …)
• Compiler: Intel® Cilk™ Plus Array Notation Extensions
• SIMD intrinsic class (e.g.: F32vec, F64vec, …)
• Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
• Assembler code (e.g.: [v]addps, [v]addss, …)

29


Control Vectorization !

Provides details on vectorization success & failure:

Linux*, Mac OS* X: -vec-report<n> , Windows*: /Qvec-report<n>

*: First available with Intel® Parallel Studio XE

n Diagnostic Messages

0 Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.

1 Tells the vectorizer to report on vectorized loops. [default if n is missing]

2 Tells the vectorizer to report on vectorized and non-vectorized loops.

3 Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.

4 Tells the vectorizer to report on non-vectorized loops.

5 Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.

6* Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.

30


Vectorization Report II

Note:

In case inter-procedural optimization (-ipo or /Qipo ) is activated and

compilation and linking are separate compiler invocations, the switch to enable

reporting needs to be added to the link step!

35: subroutine fd( y )
36:   integer :: i
37:   real, dimension(10), intent(inout) :: y
38:   do i=2,10
39:     y(i) = y(i-1) + 1
40:   end do
41: end subroutine fd

novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence
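The flow dependence reported above comes from each iteration reading the value written by the previous one. The sketch below shows a C analogue of the recurrence and one possible rewrite that removes the dependence; the rewrite is an illustration, not something the slides propose, and the function names are hypothetical. The two versions agree exactly only while the additions are exact in floating point (small integer-valued data, as in the test).

```c
#include <assert.h>

/* C analogue of the Fortran loop: y[i] depends on y[i-1] (flow dependence),
 * so iterations cannot run side by side in SIMD lanes. */
void fd_recurrence(float *y, int n)
{
    assert(y && n > 0);
    for (int i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Same result expressed without a loop-carried dependence:
 * every iteration only needs the loop-invariant y0 and its own index. */
void fd_closed_form(float *y, int n)
{
    assert(y && n > 0);
    const float y0 = y[0];
    for (int i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

Designing the algorithm so that iterations are independent is the general remedy the next slide recommends.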

31


Reasons for Vectorization Fails & How to Succeed

● Most frequent reason is Dependence:

Minimize dependencies among iterations by design!

● Alignment: Align your arrays/data structures

● Function calls in loop body: Use aggressive in-lining (IPO)

● Complex control flow/conditional branches:

Avoid them in loops by creating multiple versions of loops

● Unsupported loop structure: Use loop invariant expressions

● Not inner loop: Manual loop interchange possible?

● Mixed data types: Avoid type conversions

● Non-unit stride between elements: Possible to change algorithm to

allow linear/consecutive access?

● Loop body too complex reports: Try splitting up the loops!

● Vectorization seems inefficient reports: Enforce vectorization, benchmark!
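Two of the items above (non-unit stride, not inner loop) can often be addressed together by a manual loop interchange. The sketch below uses a hypothetical column-sum kernel on a row-major matrix to show the before and after; it is an illustration of the idea rather than code from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Column sums of a row-major rows x cols matrix.
 * Inner loop walks down a column: stride of `cols` elements. */
void col_sums_strided(const double *m, size_t rows, size_t cols, double *out)
{
    assert(m && out);
    for (size_t j = 0; j < cols; j++) {
        double s = 0.0;
        for (size_t i = 0; i < rows; i++)
            s += m[i * cols + j];
        out[j] = s;
    }
}

/* Same sums after loop interchange: the inner loop now walks along a row,
 * so consecutive iterations touch consecutive elements (unit stride). */
void col_sums_unit_stride(const double *m, size_t rows, size_t cols, double *out)
{
    assert(m && out);
    for (size_t j = 0; j < cols; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            out[j] += m[i * cols + j];
}
```

Both versions add the rows in the same order for each column, so they produce the same sums; only the memory access pattern of the inner loop changes.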


IVDEP vs. SIMD Pragma/Directives

Differences between IVDEP & SIMD pragmas/directives:

#pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran)

- Ignore vector dependencies (IVDEP): the compiler ignores assumed but not proven dependencies for a loop

- Example:

    void foo(int *a, int k, int c, int m)
    {
    #pragma ivdep
        for (int i = 0; i < m; i++)
            a[i] = a[i + k] * c;
    }

#pragma simd (C/C++) or !DIR$ SIMD (Fortran):

- Aggressive version of IVDEP: ignores all dependencies inside a loop

- It’s an imperative that forces the compiler to try everything to vectorize

- Efficiency heuristic is ignored

- Attention: This can break semantically correct code! However, it can legally vectorize code in some cases that wouldn’t be possible otherwise!
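Whether ignoring the assumed dependence in the `a[i] = a[i + k] * c` loop is safe depends on the value of `k`, which the compiler cannot see. The sketch below emulates vector execution order in plain C (load a whole chunk, then store it) and compares it with the scalar loop. The vector length `VL` of 4 and all function names are assumptions made for the illustration; a real compiler chooses its own vector length.

```c
#include <assert.h>
#include <string.h>

#define VL 4  /* hypothetical vector length (lanes) used by the emulation */

/* Reference: the scalar loop, a[i] = a[i + k] * c. */
void shift_scale_scalar(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

/* Emulates vector execution order: each chunk loads VL inputs first,
 * then stores VL results. */
void shift_scale_vector_order(int *a, int k, int c, int m)
{
    int lanes[VL];
    for (int i = 0; i < m; i += VL) {
        int len = (m - i < VL) ? (m - i) : VL;
        for (int l = 0; l < len; l++)      /* vector load + multiply */
            lanes[l] = a[i + l + k] * c;
        for (int l = 0; l < len; l++)      /* vector store */
            a[i + l] = lanes[l];
    }
}

/* Returns 1 if both execution orders give the same result for offset k. */
int orders_agree(int k)
{
    assert(k >= -8 && k <= 8);             /* keeps all indices inside s, v */
    int s[16], v[16];
    for (int i = 0; i < 16; i++)
        s[i] = v[i] = i + 1;
    int start = (k < 0) ? -k : 0;          /* so that a[i + k] stays in range */
    shift_scale_scalar(s + start, k, 2, 8);
    shift_scale_vector_order(v + start, k, 2, 8);
    return memcmp(s, v, sizeof s) == 0;
}
```

In this emulation `orders_agree(1)` holds (reads run ahead of writes), `orders_agree(-1)` does not (each element needs the value just written), and `orders_agree(-4)` holds again because the dependence distance is at least `VL`. This is the kind of knowledge the programmer asserts when adding the pragma.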


Memory Subsystem

34




Memory Bandwidth update

For Sandy Bridge EP platform: 4 channels, 2 sockets and 1600 MHz memory

8 * 1.600 * 4 * 2 = 102.4 GB/s peak (ST: 80 GB/s)

For Ivy Bridge EP platform: 4 channels, 2 sockets and 1866 MHz memory

8 * 1.866 * 4 * 2 = 119.42 GB/s peak (ST: ~98 GB/s)

For Haswell EP platform: 4 channels, 2 sockets and 2133 MHz memory

8 * 2.133 * 4 * 2 = 136.5 GB/s peak (ST: ~114 GB/s)

Basic rule for theoretical memory BW [Bytes / second]:

8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets
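The rule above is a straight product, sketched here with hypothetical function names. The second helper expresses a measured bandwidth (the "ST" figures quoted above) as a fraction of the theoretical peak.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s:
 * 8 bytes per channel per transfer * memory frequency (G transfers/s)
 * * number of channels * number of sockets. */
double peak_mem_bw_gbs(double mem_freq_g, int channels, int sockets)
{
    assert(mem_freq_g > 0.0 && channels > 0 && sockets > 0);
    return 8.0 * mem_freq_g * (double)channels * (double)sockets;
}

/* Measured bandwidth as a fraction of the theoretical peak. */
double bw_fraction_of_peak(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

For example, `peak_mem_bw_gbs(1.866, 4, 2)` evaluates the Ivy Bridge EP product 8 * 1.866 * 4 * 2.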

[Diagram: two HSW processors in Socket-R3 LGA, connected by 2 full-width QPI 1.1 links; each socket has 40 lanes of PCIe 3.0, DMI2 and four DDR3/4 memory channels.]


In the Laptop We’ll be Using for Demo…

Processor: Intel Core i5-3427U (from ark.intel.com):

Memory Types: DDR3/L/-RS 1333/1600
# of Memory Channels: 2
Max Memory Bandwidth: 25.6 GB/s

Basic rule for theoretical memory BW [Bytes / second]:

8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets

Platform: 2 channels, 1 socket and 1600 MHz memory

8 * 1.6 * 2 * 1 = 25.6 GB/s peak (ST: 20 GB/s)







Intel® Many Integrated Core Architecture

39


Up to 61 IA cores/1.2 GHz/ 244 Threads

Up to 16 GB memory with up to 352 GB/s bandwidth

512-bit SIMD instructions

Open Source Linux operating system

IP addressable

Standard programming languages, tools, clustering

22 nm process

Intel® Xeon Phi™ Product Family

Passive Card

Active Card

http://software.intel.com/en-us/mic-developer

40


Intel® Xeon Phi™ Coprocessor Product Lineup

3 Family: Outstanding Parallel Computing Solution; Performance/$ leadership
• 6GB GDDR5, 240GB/s, >1TF DP, 300W TDP
• 3120P (MM# 927501), 3120A (MM# 927500)

5 Family: Optimized for High Density Environments; Performance/Watt leadership
• 8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP
• 5110P (MM# 924044), 5120D (no thermal) (MM# 927503)

7 Family: Highest Performance, Most Memory; Performance leadership
• 16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP
• 7120P (MM# 927499), 7120X (No Thermal Solution) (MM# 927498), 7120A (MM# 934878), 7120D (Dense Form Factor) (MM# 932330)

Optional 3-year Warranty: extend to 3-year warranty on any Intel® Xeon Phi™ Coprocessor. Product Code: XPX100WRNTY, MM# 933057

For more information go to http://www.intel.com/performance


Core Architecture

[Block diagram: instruction decoder feeding a scalar unit (scalar registers) and a vector unit (vector registers); L1 cache (I & D); L2 cache; interprocessor network.]

• In order, 64-bit; 4 threads per core
• Vector unit: 512-wide
• VPU: integer, SP, DP; 3-operand, 16-instruction
• L1 cache (I & D): 32 KB per core
• L2 cache: 512 KB slice per core; L2 hardware prefetching; fully coherent


Spectrum of Execution Models (Offload / Native / Symmetric)

Offload:

Workload is run on host, and highly parallel phases on coprocessor

    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

[Figure: MPI example on host with offload to coprocessors]
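For C code, the offload model can be sketched with the Intel compiler's offload pragma in front of an OpenMP loop. This is a hedged illustration: the function name is hypothetical, the pragma is only honored by Intel compilers with coprocessor offload support, and other compilers ignore it (possibly with an unknown-pragma warning), in which case the loop simply runs on the host.

```c
#include <assert.h>

/* a[i] = b[i] * c[i], with the loop marked for offload to the coprocessor.
 * The in/out clauses describe which arrays are copied to and from the card;
 * length(n) gives the number of elements behind each pointer. */
void multiply_offload(float *a, const float *b, const float *c, int n)
{
    assert(a && b && c && n >= 0);
#pragma offload target(mic) in(b : length(n)) in(c : length(n)) out(a : length(n))
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

Because the host fallback computes the same result, the function can be developed and tested on a machine without a coprocessor and offloaded later.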

43


Spectrum of Execution Models (Offload / Native / Symmetric)

Native (Coprocessor-only model):

Workload is run solely on coprocessor

    icc -mmic … ./bin_mic

Then

    ssh mic0
    ./bin_mic

Or start it from host

    micnativeloadex ./bin_mic

[Figure: MPI example on coprocessor only]


Symmetric Mode Command Line

Arslan et al. 2013. Rice HPC Conf.

Workload runs on Host AND Coprocessors

45


Single Node Tests – HW and SW Configuration
Isotropic RTM FD Kernel

Hardware: 2x E5-2697v2 (12C, 2.7GHz) with 64GB DDR3-1866 MHz, and 4x 7120A (61 cores, 1.238 GHz, 16GB GDDR5)

MPI processes, each running OpenMP threads:
rank 0 in “mic0”: 244 threads
rank 1 in “mic1”: 244 threads
rank 2 in “cpu0”: 12 threads
rank 3 in “cpu1”: 12 threads
rank 4 in “mic2”: 244 threads
rank 5 in “mic3”: 244 threads

[Diagram: two sockets connected by QPI, each with an IOH*; peer-to-peer via DMA, with direct DMA transfers between devices.]

*Integrated in the processor


Single Node Tests – Performance & Scalability
Isotropic RTM FD Kernel

Scalability study with one to four Intel® Xeon Phi™ coprocessors

High performance and scalability with Intel® Xeon Phi™ coprocessor

[Chart: Scaling Based on Number of Coprocessors; left axis GCell/s (0.0 to 30.0), right axis TFlops (0.0 to 1.6)]

Configuration: GCell/s
2x Xeon1, semi-OPT: 1.1
2x Xeon1: 4.0
2x Xeon1 + 1x 7120A: 9.3
2x Xeon1 + 2x 7120A: 14.7
2x Xeon1 + 3x 7120A: 20.1
2x Xeon1 + 4x 7120A: 24.4

The two CUDA reference bars (CUDA K40c, CUDA K10) read 4.2 and 5.1 GCell/s.

• Scaling analysis with each Intel® Xeon Phi™ coprocessor solving a 14GB subdomain and the pair of Intel® Xeon® processors solving a 10GB subdomain

• 16th order 3D space and 2nd order time; 61 Flops per Cell

• 24.4 GCell/s total performance with 2 processors + 4 coprocessors

• The semi-OPT measurement is an OpenMP parallel version implemented with cache-blocking and compiler directives to improve vectorization. The remaining measurements are on code with additional optimizations such as loop unrolling, non-temporal stores, tiling on Y-Z, prefetch tuning, and balance between MULs and ADDs via intrinsics

• CUDA K40c and CUDA K10 are measurements on single devices using code that extended the FDTD3d sample in the CUDA SDK 5.5 to 16th order in space and was further optimized to increase register reuse

1. Xeon = Intel® Xeon® processor E5-2697v2. Source: Intel Measured Results as of April 2014

Config. Summary: IC 14.0 U1, MPI 4.1.1.036, MPSS 6720-15, ECC off, Turbo on (Xeon & 7120A), CUDA 5.5 (875MHz Boost Enabled)

See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
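The two axes of the scaling chart are related through the stated cost of 61 flops per cell. The sketch below, with hypothetical function names, converts a cell rate to TFlops and computes a speedup against a chosen baseline.

```c
#include <assert.h>

/* Converts a cell update rate (GCell/s) into TFlops,
 * given the number of flops spent per cell. */
double gcells_to_tflops(double gcells_per_s, double flops_per_cell)
{
    assert(gcells_per_s >= 0.0 && flops_per_cell > 0.0);
    return gcells_per_s * flops_per_cell / 1000.0;
}

/* Speedup of a measurement relative to a baseline measurement. */
double speedup(double value, double baseline)
{
    assert(baseline > 0.0);
    return value / baseline;
}
```

With the quoted 24.4 GCell/s and 61 flops per cell, `gcells_to_tflops(24.4, 61.0)` evaluates 24.4 * 61 / 1000, which falls inside the 0.0 to 1.6 range of the TFlops axis.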







3DFD comparison : E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A


Energy efficiency with multiple Intel® Xeon Phi cards

Note: 3 and 4 Xeon Phi power values are projections based on the data collected for 1 and 2 Xeon Phi.

Single Node Tests – Performance/Watt

High energy efficiency with Xeon Phi

This data was presented by Petrobras at SC13 and Rice 2014

Oil & Gas HPC Workshop

Source: Petrobras presentation at 2014 RICE Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf


Next Intel® Xeon Phi™ Product Family (Codenamed Knights Landing)

51

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

• “Knights Landing” is the code name for the 2nd generation Intel® Xeon Phi™ product

• Based on Intel’s 14 nanometer manufacturing process

• Standalone bootable processor (running the host OS) and a PCIe coprocessor (PCIe end-point device)

• Integrated on-package high-bandwidth memory

• Flexible memory modes for the on-package memory include: cache and flat

• Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

• 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single socket node

• Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor

Note that the code name above is not the product name


Programming Resources

• Intel® Xeon Phi™ Coprocessor Developer’s Quick Start Guide

• Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

• Access to webinar replays and over 50 training videos

• Beginning labs for the Intel® Xeon Phi™ Coprocessor

• Programming guides, tools, case studies, labs, code samples, forums & more

http://software.intel.com/mic-developer

Using a familiar programming model and tools means that developers don’t need to start from scratch. Many programming resources are available to further accelerate time to solution.


Questions?

Are you ready for Multicore and ManyCore?


Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.


Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

54
