Intel Confidential — Do Not Forward
Accelerating Scientific Computing on Intel
Architecture
Machine Evaluation Workshop 2014 (MEW 25), Ricoh Arena, Coventry
2nd December, 2014
Booth 13 – (John Swinburne, Ian Lloyd, Gabriele Paciucci, Gaurav Kaul)
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Intel Xeon, Intel Xeon Phi™ are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. Copyright © 2014, Intel Corporation. *Other brands and names may be claimed as the property of others.
Modernizing HPC Community Codes
[Word cloud of community codes: AVBP (Large Eddy), Blast, BUDE, CAM-5, CASTEP, CESM, CFSv2, CIRCAC, AMBER, CliPhi (COSMOS), COSA, Cosmos codes, DL-MESO, DL-Poly, ECHAM6, Elmer, FrontFlow/Blue Code, GADGET, GAMESS-US, GPAW, Gromacs, GS2, GTC, Harmonie, Ls1 Mardyn, MACPO, MPAS, NEMO5, NWChem, Openflow, OpenMP/MPI optimized integral, Quantum Espresso, R, ROTOR SIM, SeisSol, SG++, SU2, UTBENCH, VASP, VISIT, WRF]
Intel® Parallel Computing Centers
>40 Centers, 13 Countries, >70 Codes, 2 User Groups
https://software.intel.com/en-us/ipcc
Collaborating to accelerate the pace of discovery
Scientific Codes and Workloads optimized for Intel® Xeon® and Intel® Xeon Phi™
Mix of ISV and End User Development
Sub-Segments: Applications/Workloads
• Molecular Dynamics: HPL, HPCC, NPB, LAMMPS, QCD, GROMACS, NAMD, AMBER, VASP
• Energy (including Oil & Gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate Modeling and Weather Simulation: WRF, HOMME
• Life Sciences (Bioinformatics, Gene Sequencing, Bio-Chemistry): HMMER, MPI-HMMER, BLAST, BWA-mem, BWA-aln, Bowtie2, Cufflinks, GATK 3.1
• Manufacturing (CAD/CAM/CAE/CFD/EDA): PyFR, Implicit and Explicit Solvers
• Software Ecosystem: Tools, Middleware
Improvements in GATK* 3.0
• Pair HMM acceleration using Intel® AVX resulted in a 970x speedup
− Computation kernel and bottleneck in the GATK* Haplotype Caller
− AVX enables 8 floating point SIMD operations in parallel
*Some names and brands may be claimed as the property of others.
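To illustrate the 8-wide SIMD parallelism described above, here is a minimal sketch using 256-bit AVX compiler intrinsics. It is not the GATK* PairHMM kernel; the function, array names, and the multiply-add step are hypothetical stand-ins.

#include <immintrin.h>

/* Sketch only: process 8 single-precision values per iteration with 256-bit
 * AVX intrinsics. Illustrative, not the GATK* PairHMM computation. */
void scale_and_accumulate(const float *match, const float *transition,
                          float *likelihood, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 m = _mm256_loadu_ps(&match[i]);       /* 8 match probabilities  */
        __m256 t = _mm256_loadu_ps(&transition[i]);  /* 8 transition factors   */
        __m256 l = _mm256_loadu_ps(&likelihood[i]);  /* 8 running likelihoods  */
        l = _mm256_add_ps(l, _mm256_mul_ps(m, t));   /* 8 FP operations at once */
        _mm256_storeu_ps(&likelihood[i], l);
    }
    for (; i < n; ++i)                               /* scalar remainder loop  */
        likelihood[i] += match[i] * transition[i];
}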
Burrows-Wheeler Aligner (BWA-ALN)
Execution Mode: Hybrid MPI + OpenMP using symmetric mode
Code Optimisation:
• Replaced pthreads with OpenMP for better load balance
• Data location improvements – all data files remain on host
• Overlapped I/O and compute to improve thread utilisation
• Used Intel TBB memory allocator for improved efficiency
• Vectorisation of performance critical loop
• Implemented data pre-fetch intrinsics to reduce memory latency
Critically, output was identical to the unmodified BWA
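A hedged sketch of two of the techniques listed above: OpenMP worksharing in place of pthreads, plus a software-prefetch intrinsic. The types and function names are hypothetical and not taken from the BWA source.

#include <omp.h>
#include <xmmintrin.h>   /* _mm_prefetch */

typedef struct { const char *seq; int len; } read_t;     /* hypothetical type */
typedef struct { const unsigned char *bwt; } index_t;    /* hypothetical type */

static void align_one_read(read_t *r, const index_t *idx) { (void)r; (void)idx; }

/* Sketch: OpenMP worksharing over reads (replacing pthreads) with a software
 * prefetch of the next read's sequence to hide memory latency. */
void align_batch(read_t *reads, int n_reads, const index_t *idx)
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n_reads; ++i) {
        if (i + 1 < n_reads)
            _mm_prefetch(reads[i + 1].seq, _MM_HINT_T0);
        align_one_read(&reads[i], idx);   /* performance-critical alignment work */
    }
}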
TGen* RNA-Seq Pipeline
Before: Bowtie2* 2.0.0-beta7, TopHat* 2.0.4, Cufflinks* 2.0.2
After: Bowtie2* 2.1.0, TopHat* 2.0.8b, Cufflinks* 2.1.1.2
Intel® Xeon® E5-2687W / 3.1 GHz
*Some names and brands may be claimed as the property of others.
RNA-Seq from 7 days to less than 4 hours
• Challenge: Experiment processing takes 7 days with
existing infrastructure. Delays clinical treatment in very
aggressive pediatric cancers
• Solution: Dell* Genomics Data Analytics Platform
− Single Rack Solution with balanced HPC compute, fabric and
storage
− 9 Teraflops of Intel® Xeon® E5 v2 processors
− Intel® Enterprise Edition for Lustre*
− Intel® Cluster Studio XE
• Results:
− RNA-Seq pipeline reduced from 7 days to less than
4 hours
• Benefits: Rapid results enable sequencing multiple times during the course of treatment, monitoring patient response, adjusting the protocol, and improving outcomes
Intel Genomics & Health Analytics Appliances (Lustre)
[Rack diagram: 2U plenum, NSS-HA pair, NSS user data, HSS metadata pair, HSS OSS pair, HSS user data. Actual placement in racks may vary.]
Scale through independent solutions, each targeting a different segment & usage model
• Challenge: Can high-performance interconnect technology (InfiniBand) keep up with the increase in the number of processor cores?
• Workloads: VASP, WIEN2K
• Benchmarks: MVAPICH (MPI over InfiniBand),
IMPI (Intel MPI)
• Results:
- Scale-up research – 5 to 10x speed up when scaling
from single node to 16 nodes
- Intel® True Scale Fabric QDR-40 shows excellent
price/performance results
• Infrastructure and Data Characteristics:
- 1 head node + 16 compute nodes, dual Xeon® E5-2680 2.7 GHz per node
- 32 GB of RAM (1666 MHz) per node
- RHEL; compiler and MPI variations available
- Intel® Cluster Suite, Intel® Fabric Suite
High-Performance Interconnect (InfiniBand) Intel® True Scale Fabric
*Some names and brands may be claimed as the property of others.
True Scale Fabric
ASKAP*/tHogbomClean (Australian Square Kilometre Array Pathfinder)
Description: The tHogbomClean benchmark implements the kernel of the Hogbom CLEAN deconvolution algorithm. The benchmark is quite minimal and omits the final step, convolution of the model with the clean beam, but that step involves similar operations to the others as far as the CPU is concerned.
Availability:
https://github.com/ATNF/askap-benchmarks
Usage Model: Offload using OpenMP*; the host performs data initialization and transfers to the Intel® Xeon Phi™ coprocessor 7120A; all computing is performed by the Intel Xeon Phi coprocessor 7120A.
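The hot kernel of Hogbom CLEAN repeatedly locates the absolute peak of the residual image before subtracting a scaled point-spread function. Below is a minimal sketch of that peak search parallelised with OpenMP; it is not the ATNF benchmark code, and the coprocessor offload pragmas are omitted.

#include <math.h>
#include <stddef.h>

/* Sketch of the Hogbom CLEAN peak search: find the pixel with the largest
 * absolute value in the residual image. Each thread keeps a local maximum,
 * then the results are combined in a short critical section. */
void find_peak(const float *residual, size_t n, float *peak_val, size_t *peak_idx)
{
    float best = 0.0f;
    size_t best_i = 0;

    #pragma omp parallel
    {
        float local_best = 0.0f;
        size_t local_i = 0;

        #pragma omp for nowait
        for (long i = 0; i < (long)n; ++i) {
            float v = fabsf(residual[i]);
            if (v > local_best) { local_best = v; local_i = (size_t)i; }
        }

        #pragma omp critical
        if (local_best > best) { best = local_best; best_i = local_i; }
    }

    *peak_val = residual[best_i];
    *peak_idx = best_i;
}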
SOURCE: Intel measured results as of March 2014
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
For configuration details, go here.
[Chart: Comparative performance, cleaning rate (throughput per second) speed-up.
E5-2697 v2 Baseline: 1X; E5-2697 v2 Optimized: 1.03X; E5-2697 v2 + 7120A: 0.68X; E5-2697 v2 + 7120A Turbo Off Optimized: 1.74X; E5-2697 v2 + 7120A Turbo On Optimized: 1.78X; E5-2697 v2 + NVIDIA K40* Boost Off: 1.07X; E5-2697 v2 + NVIDIA K40* Boost On: 1.24X.
"E5-2697 v2" = Intel® Xeon® processor E5-2697 v2; "7120A" = Intel® Xeon Phi™ coprocessor 7120A.]
* Other names and brands may be claimed as the property of others.
1.78X speed-up with the Intel® Xeon Phi™ coprocessor
Modal (in 3D), COSMOS @ DiRAC, University of Cambridge
1. Laplacian stencil operation (radius 1):
$\nabla^2\phi_{ijk} \equiv \phi_{i+1,j,k} + \phi_{i-1,j,k} + \phi_{i,j+1,k} + \phi_{i,j-1,k} + \phi_{i,j,k+1} + \phi_{i,j,k-1} - 6\,\phi_{i,j,k}$
2. Time-stepping (leap-frog integration):
$\phi_{ijk}^{n+1} = \phi_{ijk}^{n} + \Delta\eta\,\dot{\phi}_{ijk}^{n+1/2}$, where
$\dot{\phi}_{ijk}^{n+1/2} = \dfrac{(1-\delta)\,\dot{\phi}_{ijk}^{n-1/2} + \Delta\eta\,\bigl(\nabla^2\phi_{ijk}^{n} - \partial V/\partial\phi_{ijk}^{n}\bigr)}{1+\delta}$
3. Calculate area of domain walls:
$\int \mathbf{n}\cdot d\mathbf{A} = \sum_{\text{links}} \Delta A\,\delta_{-+}\,\dfrac{|\nabla\phi|}{|\phi_x| + |\phi_y| + |\phi_z|}$
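For reference, equation (1) maps directly onto a radius-1 stencil kernel. The following is a minimal sketch that assumes a contiguous Nx x Ny x Nz array layout; it is illustrative only and not taken from the COSMOS source.

/* Sketch of equation (1): radius-1 Laplacian stencil at interior point (i,j,k)
 * of a contiguous Ny x Nz strided array (illustrative layout). */
static inline double laplacian(const double *phi, int i, int j, int k,
                               int Ny, int Nz)
{
    #define IDX(a, b, c) (((a) * Ny + (b)) * Nz + (c))
    return phi[IDX(i + 1, j, k)] + phi[IDX(i - 1, j, k)]
         + phi[IDX(i, j + 1, k)] + phi[IDX(i, j - 1, k)]
         + phi[IDX(i, j, k + 1)] + phi[IDX(i, j, k - 1)]
         - 6.0 * phi[IDX(i, j, k)];
    #undef IDX
}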
Optimization and Modernization
The Strategy:
• Use straightforward parallel tuning techniques: vectorize, scale, memory
• Use tools/compiler guidance features
• Maintain readability and platform portability
The Result:
• Significant performance improvements in ~3-4 weeks
• Single, clear, readable code base
• "Template" stencil code transferable to other simulations
Final Performance
[Chart: relative performance after successive optimisation steps (Baseline, Auto-vec, Memory Reduction, Halo (Inner), 3D Specialisation, Halo (All), Single Loop, Division Hoisting) for 2 x processor and 1 x coprocessor. Labelled ratios: 0.45x, 1.18x, 1.12x, 1.28x, 1.52x, 1.62x, 1.44x, 1.38x.]
PyFR (Python-based Flux Reconstruction), Imperial College London
Description: CFD code for turbulent flow on unstructured meshes, written in Python, which generates matrix multiplication templates in C.
Usage Model: The host performs C kernel initialization and transfers to the Intel® Xeon Phi™ coprocessor 7120P from Python; all computing is performed by the Intel Xeon Phi coprocessor 7120P.
SOURCE: Intel measured results as of March 2014
* Other names and brands may be claimed as the property of others.
1.5X speed-up with the Intel® Xeon Phi™ coprocessor over SNB (Sandy Bridge)
FINRA Audit Trail System
*Some names and brands may be claimed as the property of others.
Performance comparison of Lustre* and HDFS on a MapReduce implementation of an FSI workload, using an HDDP cluster hosted in the Intel Big Data Lab in Swindon (UK) and Intel® Enterprise Edition for Lustre*.
A query from the Audit Trail System, part of the FINRA security specifications (publicly available), is used as a representative application.
Performance metric: Average Query Execution Time
5.5X performance vs HDFS on 7 TB data size
1.5X performance vs HDFS on the same BoM
Fluent 15, Aircraft 2M Model
True Scale Fabric
High message rate and low, scalable latency characteristics of Intel® True Scale enable improved application scaling performance.
Performance Scaled Messaging (PSM) provides a 2X-5X message rate over IB Verbs on Intel® Xeon® E5-2600 v3.
25% performance advantage at 32 nodes
Test Configuration
Processor: 2.6 GHz E5-2697 v3 (14 cores)
Memory: 64 GB (2133 MHz)
OS: Red Hat Linux
MPI: Open MPI
On-Load/PSM: HCA – Intel QLE7340; IB switch – True Scale 12800; host library – PSM
Offload/Verbs: HCA – MLNX MCX353A-FCAT; switch – MLNX MSX6025F-1BR; host library – InfiniBand Verbs
Investing in a parallel future
Parallel Hardware
Parallel Programming Models
Intel® Technical Computing
What makes a great HPC cluster?
Performance for highly-parallel workloads while preserving software investment
Intel® Xeon Phi™ coprocessors: Leading performance for highly parallel workloads
Uses common Intel® Xeon® programming models to preserve software investment
Industry-leading workload performance
Intel® Xeon® processors: Ground-breaking real-world application performance
Industry-leading energy efficiency
Ease of deployment and maintenance
Intel® Cluster Ready
Fast, reliable access to data
Intel® Xeon® processor based storage
Intel® SSD/NVMe
Intel® Lustre
Fast, cost-effective data movement between nodes
Intel® True Scale fabric
Intel: The Architecture for Discovery
See us at Booth #13 for any questions and queries!
John Swinburne – True Scale Architect
Ian Lloyd – Technical Account Manager HPC
Gabriele Paciucci – Lustre Architect EMEA
Robert Maskell – Account Manager HPC
Gaurav Kaul – Solutions Architect
Backup
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Before:  int ip1 = (i+1) % Nx;
         remark: loop was not vectorized: operator unsuited for vectorization.
After:   int ip1 = (i < Nx-1) ? i+1 : 0;
         remark: LOOP WAS VECTORIZED
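In context, the change looks roughly like the following sketch of a periodic stencil loop; the array and bound names are illustrative, not the actual COSMOS code.

/* Illustrative 1D periodic stencil loop. The modulo form of ip1 defeats
 * auto-vectorization; the conditional form vectorizes cleanly. */
void stencil_row(const float *phi, float *out, int Nx)
{
    for (int i = 0; i < Nx; ++i) {
        /* int ip1 = (i + 1) % Nx;          -- not vectorized            */
        int ip1 = (i < Nx - 1) ? i + 1 : 0; /* vectorized                */
        int im1 = (i > 0) ? i - 1 : Nx - 1; /* same trick for i-1        */
        out[i] = phi[ip1] + phi[im1] - 2.0f * phi[i];
    }
}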
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Introduce halo regions, to improve cache behavior (and remove gathers)
int ip1 = i+1;
Data from i = 0 is replicated at i = Nx; all loads become contiguous
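A minimal sketch of the halo idea in one dimension: the periodic neighbour is copied into a spare cell once per sweep, so the stencil loop can use plain i+1 indexing. The layout is illustrative.

/* Sketch: after each update, copy the i = 0 value into the halo cell at
 * i = Nx so the stencil can read phi[i+1] for every interior i without
 * wrap-around logic. phi holds Nx + 1 entries (one halo cell). */
void fill_halo_1d(float *phi, int Nx)
{
    phi[Nx] = phi[0];   /* replicate the periodic neighbour into the halo */
}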
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Introduce halo regions, to improve cache behavior (and remove gathers)
Swap division by constants for multiplication by pre-computed reciprocals
Before:  P2[i][j][k][l] = … / (1-delta);    (one division per stencil point)
After:   P2[i][j][k][l] = … * i1mdelta;     (one division reused for all stencil points)
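As a sketch, the hoisting amounts to one division outside the loop and a multiply per point inside it; the names are illustrative and the arrays are flattened for brevity.

/* Sketch: hoist the division by (1 - delta) out of the loop and reuse a
 * precomputed reciprocal at every stencil point. */
void apply_update(double *P2, const double *rhs, long npoints, double delta)
{
    const double i1mdelta = 1.0 / (1.0 - delta);   /* one division, done once */
    for (long p = 0; p < npoints; ++p)
        P2[p] = rhs[p] * i1mdelta;                 /* multiply instead of divide */
}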
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Rewrite $\phi_{ijk}^{n+1} = \phi_{ijk}^{n} + \Delta\eta\,\dot{\phi}_{ijk}^{n+1/2}$ as $\dot{\phi}_{ijk}^{n+1/2} = \bigl[\phi_{ijk}^{n+1} - \phi_{ijk}^{n}\bigr]/\Delta\eta$;
no need to store the time derivative in addition to the solution at two timesteps
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Remove unnecessary 4D calculations and array lookups from 3D simulations
Before:  Lphi = P[i-1][j][k][l] + P[i+1][j][k][l] + P[i][j-1][k][l] + …
                - 8 * P[i][j][k][l];
After:   Lphi = P[i][j-1][k][l] + P[i][j+1][k][l] + …
                - 6 * P[i][j][k][l];
Saves two reads/additions per stencil point
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Remove unnecessary 4D calculations and array lookups from 3D simulations
Combine three algorithmic stages into a single loop
Before: solve(t), timestep(), area(t+1)
After: solve(t), area(t), timestep()
Allows for one pass through the data each timestep.
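A hedged sketch of the fused ordering follows, with hypothetical helper functions standing in for the stencil, wall-area, and potential terms; the time derivative is reconstructed from the two stored solutions, as in the rewrite above.

/* Hypothetical helpers standing in for the stencil and potential terms. */
double laplacian_at(const double *phi, long p);
double wall_area_at(const double *phi, long p);
double dVdphi(double phi);

/* Sketch: solve(t), area(t) and timestep() fused into one sweep, so each grid
 * point is touched once per timestep while its data is still hot in cache. */
void timestep_fused(const double *phi_old, const double *phi, double *phi_new,
                    long npoints, double deta, double delta, double *area)
{
    const double inv1pdelta = 1.0 / (1.0 + delta);
    double total_area = 0.0;

    for (long p = 0; p < npoints; ++p) {
        double lap    = laplacian_at(phi, p);           /* solve(t): stencil     */
        total_area   += wall_area_at(phi, p);           /* area(t): same pass    */
        double phidot = (phi[p] - phi_old[p]) / deta;   /* recovered derivative  */
        phidot = ((1.0 - delta) * phidot
                  + deta * (lap - dVdphi(phi[p]))) * inv1pdelta;
        phi_new[p] = phi[p] + deta * phidot;            /* timestep()            */
    }
    *area = total_area;
}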