Intel Confidential — Do Not Forward
Accelerating Scientific Computing on Intel
Architecture
Machine Evaluation Workshop 2014 (MEW 25), Ricoh Arena, Coventry
2nd December, 2014
Booth 13 – (John Swinburne, Ian Lloyd, Gabriele Paciucci, Gaurav Kaul)
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Intel Xeon, Intel Xeon Phi™ are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. Copyright © 2014, Intel Corporation. *Other brands and names may be claimed as the property of others.
Modernizing HPC Community Codes
[Word cloud of community codes: AVBP (Large Eddy), Blast, BUDE, CAM-5, CASTEP, CESM, CFSv2, CIRCAC, AMBER, CliPhi (COSMOS), COSA, Cosmos codes, DL-MESO, DL-Poly, ECHAM6, Elmer, FrontFlow/Blue Code, GADGET, GAMESS-US, GPAW, Gromacs, GS2, GTC, Harmonie, Ls1 Mardyn, MACPO, MPAS, NEMO5, NWChem, Openflow, OpenMP/MPI optimized integral, Quantum Espresso, R, ROTOR SIM, SeisSol, SG++, SU2, UTBENCH, VASP, VISIT, WRF]
Intel® Parallel Computing Centers
>40 Centers, 13 Countries, >70 Codes, 2 User Groups
https://software.intel.com/en-us/ipcc
Collaborating to accelerate the pace of discovery
Scientific Codes and Workloads optimized for Intel® Xeon® and Intel® Xeon Phi™
Mix of ISV and End User Development
Sub-Segments: Applications/Workloads
• Molecular Dynamics: HPL, HPCC, NPB, LAMMPS, QCD, GROMACS, NAMD, AMBER, VASP
• Energy (including Oil & Gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate Modeling and Weather Simulation: WRF, HOMME
• Life Sciences (Bioinformatics, Gene Sequencing, Bio-Chemistry): HMMER, MPI-HMMER, BLAST, BWA-mem, BWA-aln, Bowtie2, Cufflinks, GATK 3.1
• Manufacturing (CAD/CAM/CAE/CFD/EDA): PyFR, Implicit and Explicit Solvers
• Software Ecosystem: Tools, Middleware
Improvements in GATK* 3.0
• Pair HMM acceleration using Intel® AVX resulted in a 970x speedup
− Computation kernel and bottleneck in the GATK* Haplotype Caller
− AVX enables 8 floating point SIMD operations in parallel
*Some names and brands may be claimed as the property of others.
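To illustrate the 8-wide SIMD parallelism described above, here is a minimal sketch using 256-bit AVX compiler intrinsics. It is not the GATK* PairHMM kernel; the function, array names, and the multiply-add step are hypothetical stand-ins.

#include <immintrin.h>

/* Sketch only: process 8 single-precision values per iteration with 256-bit
 * AVX intrinsics. Illustrative, not the GATK* PairHMM computation. */
void scale_and_accumulate(const float *match, const float *transition,
                          float *likelihood, int n)
{
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 m = _mm256_loadu_ps(&match[i]);       /* 8 match probabilities  */
        __m256 t = _mm256_loadu_ps(&transition[i]);  /* 8 transition factors   */
        __m256 l = _mm256_loadu_ps(&likelihood[i]);  /* 8 running likelihoods  */
        l = _mm256_add_ps(l, _mm256_mul_ps(m, t));   /* 8 FP operations at once */
        _mm256_storeu_ps(&likelihood[i], l);
    }
    for (; i < n; ++i)                               /* scalar remainder loop  */
        likelihood[i] += match[i] * transition[i];
}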
Burrows-Wheeler Aligner (BWA-ALN)
Execution Mode: Hybrid MPI + OpenMP using symmetric mode
Code Optimisation:
• Replaced pthreads with OpenMP for better load balance
• Data location improvements – all data files remain on host
• Overlapped I/O and compute to improve thread utilisation
• Used Intel TBB memory allocator for improved efficiency
• Vectorisation of performance critical loop
• Implemented data pre-fetch intrinsics to reduce memory latency
Critically, output was identical to the unmodified BWA
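A hedged sketch of two of the techniques listed above: OpenMP worksharing in place of pthreads, plus a software-prefetch intrinsic. The types and function names are hypothetical and not taken from the BWA source.

#include <omp.h>
#include <xmmintrin.h>   /* _mm_prefetch */

typedef struct { const char *seq; int len; } read_t;     /* hypothetical type */
typedef struct { const unsigned char *bwt; } index_t;    /* hypothetical type */

static void align_one_read(read_t *r, const index_t *idx) { (void)r; (void)idx; }

/* Sketch: OpenMP worksharing over reads (replacing pthreads) with a software
 * prefetch of the next read's sequence to hide memory latency. */
void align_batch(read_t *reads, int n_reads, const index_t *idx)
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n_reads; ++i) {
        if (i + 1 < n_reads)
            _mm_prefetch(reads[i + 1].seq, _MM_HINT_T0);
        align_one_read(&reads[i], idx);   /* performance-critical alignment work */
    }
}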
TGen* RNA-Seq Pipeline
Before: Bowtie2* 2.0.0-beta7, TopHat* 2.0.4, Cufflinks* 2.0.2
After: Bowtie2* 2.1.0, TopHat* 2.0.8b, Cufflinks* 2.1.1.2
Intel® Xeon® E5-2687W / 3.1 GHz
*Some names and brands may be claimed as the property of others.
RNA-Seq from 7 days to less than 4 hours
• Challenge: Experiment processing takes 7 days with
existing infrastructure. Delays clinical treatment in very
aggressive pediatric cancers
• Solution: Dell* Genomics Data Analytics Platform
− Single Rack Solution with balanced HPC compute, fabric and
storage
− 9 Teraflops of Intel® Xeon® E5 v2 processors
− Intel® Enterprise Edition for Lustre*
− Intel® Cluster Studio XE
• Results:
− RNA-Seq pipeline reduced from 7 days to less than
4 hours
• Benefits: Rapid results enable sequencing multiple times during the course of treatment, monitoring patient response, adjusting the protocol, and improving outcomes
Intel Genomics & Health Analytics Appliances (Lustre)
[Rack diagram: 2U plenum, NSS-HA pair, NSS user data, HSS metadata pair, HSS OSS pair, HSS user data. Actual placement in racks may vary.]
Scale through independent solutions, each targeting a different segment & usage model
• Challenge: Can high-performance interconnect technology (InfiniBand) keep up with the increase in the number of processor cores?
• Workloads: VASP, WIEN2K
• Benchmarks: MVAPICH (MPI over InfiniBand),
IMPI (Intel MPI)
• Results:
- Scale-up research – 5 to 10x speed up when scaling
from single node to 16 nodes
- Intel® True Scale Fabric QDR-40 shows excellent
price/performance results
• Infrastructure and Data Characteristics:
- 1 head node + 16 compute nodes, dual Xeon® E5-2680 2.7 GHz per node
- 32 GB of RAM (1666 MHz) per node
- RHEL; compiler and MPI variations available
- Intel® Cluster Suite, Intel® Fabric Suite
High-Performance Interconnect (InfiniBand) Intel® True Scale Fabric
*Some names and brands may be claimed as the property of others.
True Scale Fabric
ASKAP*/tHogbomClean (Australian Square Kilometre Array Pathfinder)
Description: The tHogbomClean benchmark implements the kernel of the Hogbom CLEAN deconvolution algorithm. The benchmark is quite minimal and omits the final step, convolution of the model with the clean beam, but that step involves similar operations to the others as far as the CPU is concerned.
Availability:
https://github.com/ATNF/askap-benchmarks
Usage Model: Offload using OpenMP*; the host performs data initialization and transfers to the Intel® Xeon Phi™ coprocessor 7120A; all computing is performed by the Intel Xeon Phi coprocessor 7120A.
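The hot kernel of Hogbom CLEAN repeatedly locates the absolute peak of the residual image before subtracting a scaled point-spread function. Below is a minimal sketch of that peak search parallelised with OpenMP; it is not the ATNF benchmark code, and the coprocessor offload pragmas are omitted.

#include <math.h>
#include <stddef.h>

/* Sketch of the Hogbom CLEAN peak search: find the pixel with the largest
 * absolute value in the residual image. Each thread keeps a local maximum,
 * then the results are combined in a short critical section. */
void find_peak(const float *residual, size_t n, float *peak_val, size_t *peak_idx)
{
    float best = 0.0f;
    size_t best_i = 0;

    #pragma omp parallel
    {
        float local_best = 0.0f;
        size_t local_i = 0;

        #pragma omp for nowait
        for (long i = 0; i < (long)n; ++i) {
            float v = fabsf(residual[i]);
            if (v > local_best) { local_best = v; local_i = (size_t)i; }
        }

        #pragma omp critical
        if (local_best > best) { best = local_best; best_i = local_i; }
    }

    *peak_val = residual[best_i];
    *peak_idx = best_i;
}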
SOURCE: Intel measured results as of March 2014
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
For configuration details, go here.
[Chart: Comparative performance, cleaning rate (throughput per second) speed-up.
E5-2697 v2 Baseline: 1X; E5-2697 v2 Optimized: 1.03X; E5-2697 v2 + 7120A: 0.68X; E5-2697 v2 + 7120A Turbo Off Optimized: 1.74X; E5-2697 v2 + 7120A Turbo On Optimized: 1.78X; E5-2697 v2 + NVIDIA K40* Boost Off: 1.07X; E5-2697 v2 + NVIDIA K40* Boost On: 1.24X.
"E5-2697 v2" = Intel® Xeon® processor E5-2697 v2; "7120A" = Intel® Xeon Phi™ coprocessor 7120A.]
* Other names and brands may be claimed as the property of others.
1.78X speed-up with the Intel® Xeon Phi™ coprocessor
Modal (in 3D), COSMOS @ DiRAC, University of Cambridge
1. Laplacian stencil operation (radius 1):
$\nabla^2\phi_{ijk} \equiv \phi_{i+1,j,k} + \phi_{i-1,j,k} + \phi_{i,j+1,k} + \phi_{i,j-1,k} + \phi_{i,j,k+1} + \phi_{i,j,k-1} - 6\,\phi_{i,j,k}$
2. Time-stepping (leap-frog integration):
$\phi_{ijk}^{n+1} = \phi_{ijk}^{n} + \Delta\eta\,\dot{\phi}_{ijk}^{n+1/2}$, where
$\dot{\phi}_{ijk}^{n+1/2} = \dfrac{(1-\delta)\,\dot{\phi}_{ijk}^{n-1/2} + \Delta\eta\,\bigl(\nabla^2\phi_{ijk}^{n} - \partial V/\partial\phi_{ijk}^{n}\bigr)}{1+\delta}$
3. Calculate area of domain walls:
$\int \mathbf{n}\cdot d\mathbf{A} = \sum_{\text{links}} \Delta A\,\delta_{-+}\,\dfrac{|\nabla\phi|}{|\phi_x| + |\phi_y| + |\phi_z|}$
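For reference, equation (1) maps directly onto a radius-1 stencil kernel. The following is a minimal sketch that assumes a contiguous Nx x Ny x Nz array layout; it is illustrative only and not taken from the COSMOS source.

/* Sketch of equation (1): radius-1 Laplacian stencil at interior point (i,j,k)
 * of a contiguous Ny x Nz strided array (illustrative layout). */
static inline double laplacian(const double *phi, int i, int j, int k,
                               int Ny, int Nz)
{
    #define IDX(a, b, c) (((a) * Ny + (b)) * Nz + (c))
    return phi[IDX(i + 1, j, k)] + phi[IDX(i - 1, j, k)]
         + phi[IDX(i, j + 1, k)] + phi[IDX(i, j - 1, k)]
         + phi[IDX(i, j, k + 1)] + phi[IDX(i, j, k - 1)]
         - 6.0 * phi[IDX(i, j, k)];
    #undef IDX
}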
Optimization and Modernization
The Strategy:
• Use straightforward parallel tuning techniques: vectorize, scale, memory
• Use tools/compiler guidance features
• Maintain readability and platform portability
The Result:
• Significant performance improvements in ~3-4 weeks
• Single, clear, readable code base
• "Template" stencil code transferable to other simulations
Final Performance
[Chart: relative performance after successive optimisation steps (Baseline, Auto-vec, Memory Reduction, Halo (Inner), 3D Specialisation, Halo (All), Single Loop, Division Hoisting) for 2 x processor and 1 x coprocessor. Labelled ratios: 0.45x, 1.18x, 1.12x, 1.28x, 1.52x, 1.62x, 1.44x, 1.38x.]
PyFR (Python-based Flux Reconstruction), Imperial College London
Description: CFD code for turbulent flow on unstructured meshes, written in Python, which generates matrix multiplication templates in C.
Usage Model: The host performs C kernel initialization and transfers to the Intel® Xeon Phi™ coprocessor 7120P from Python; all computing is performed by the Intel Xeon Phi coprocessor 7120P.
SOURCE: Intel measured results as of March 2014
* Other names and brands may be claimed as the property of others.
1.5X speed-up with the Intel® Xeon Phi™ coprocessor over SNB (Sandy Bridge)
FINRA Audit Trail System
*Some names and brands may be claimed as the property of others.
Performance comparison of Lustre* and HDFS on a MapReduce implementation of an FSI workload, using an HDDP cluster hosted in the Intel Big Data Lab in Swindon (UK) and Intel® Enterprise Edition for Lustre*.
A query from the Audit Trail System, part of the FINRA security specifications (publicly available), is used as a representative application.
Performance metric: Average Query Execution Time
5.5X performance vs HDFS on 7 TB data size
1.5X performance vs HDFS on the same BoM
Fluent 15, Aircraft 2M Model
True Scale Fabric
High message rate and low, scalable latency characteristics of Intel® True Scale enable improved application scaling performance.
Performance Scaled Messaging (PSM) provides a 2X-5X message rate over IB Verbs on Intel® Xeon® E5-2600 v3.
25% performance advantage at 32 nodes
Test Configuration
Processor: 2.6 GHz E5-2697 v3 (14 cores)
Memory: 64 GB (2133 MHz)
OS: Red Hat Linux
MPI: Open MPI
On-Load/PSM: HCA – Intel QLE7340; IB switch – True Scale 12800; host library – PSM
Offload/Verbs: HCA – MLNX MCX353A-FCAT; switch – MLNX MSX6025F-1BR; host library – InfiniBand Verbs
Investing in a parallel future
Parallel Hardware
Parallel Programming Models
Intel® Technical Computing
What makes a great HPC cluster?
Performance for highly-parallel workloads while preserving software investment
Intel® Xeon Phi™ coprocessors: Leading performance for highly parallel workloads
Uses common Intel® Xeon® programming models to preserve software investment
Industry-leading workload performance
Intel® Xeon® processors: Ground-breaking real-world application performance
Industry-leading energy efficiency
Ease of deployment and maintenance
Intel® Cluster Ready
Fast, reliable access to data
Intel® Xeon® processor based storage
Intel® SSD/NVMe
Intel® Lustre
Fast, cost-effective data movement between nodes
Intel® True Scale fabric
Intel: The Architecture for Discovery
See us at Booth #13 for any questions and queries!
John Swinburne – True Scale Architect
Ian Lloyd – Technical Account Manager HPC
Gabriele Paciucci – Lustre Architect EMEA
Robert Maskell – Account Manager HPC
Gaurav Kaul – Solutions Architect
Backup
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Before:  int ip1 = (i+1) % Nx;
         remark: loop was not vectorized: operator unsuited for vectorization.
After:   int ip1 = (i < Nx-1) ? i+1 : 0;
         remark: LOOP WAS VECTORIZED
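In context, the change looks roughly like the following sketch of a periodic stencil loop; the array and bound names are illustrative, not the actual COSMOS code.

/* Illustrative 1D periodic stencil loop. The modulo form of ip1 defeats
 * auto-vectorization; the conditional form vectorizes cleanly. */
void stencil_row(const float *phi, float *out, int Nx)
{
    for (int i = 0; i < Nx; ++i) {
        /* int ip1 = (i + 1) % Nx;          -- not vectorized            */
        int ip1 = (i < Nx - 1) ? i + 1 : 0; /* vectorized                */
        int im1 = (i > 0) ? i - 1 : Nx - 1; /* same trick for i-1        */
        out[i] = phi[ip1] + phi[im1] - 2.0f * phi[i];
    }
}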
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Introduce halo regions, to improve cache behavior (and remove gathers)
int ip1 = i+1;
Data from i = 0 is replicated at i = Nx; all loads become contiguous
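A minimal sketch of the halo idea in one dimension: the periodic neighbour is copied into a spare cell once per sweep, so the stencil loop can use plain i+1 indexing. The layout is illustrative.

/* Sketch: after each update, copy the i = 0 value into the halo cell at
 * i = Nx so the stencil can read phi[i+1] for every interior i without
 * wrap-around logic. phi holds Nx + 1 entries (one halo cell). */
void fill_halo_1d(float *phi, int Nx)
{
    phi[Nx] = phi[0];   /* replicate the periodic neighbour into the halo */
}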
Optimization and Modernization (2/3)
Optimizations
Improve auto-vectorization (using -vec-report3 and -guide-vec)
Introduce halo regions, to improve cache behavior (and remove gathers)
Swap division by constants for multiplication by pre-computed reciprocals
Before:  P2[i][j][k][l] = … / (1-delta);    (one division per stencil point)
After:   P2[i][j][k][l] = … * i1mdelta;     (one division reused for all stencil points)
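As a sketch, the hoisting amounts to one division outside the loop and a multiply per point inside it; the names are illustrative and the arrays are flattened for brevity.

/* Sketch: hoist the division by (1 - delta) out of the loop and reuse a
 * precomputed reciprocal at every stencil point. */
void apply_update(double *P2, const double *rhs, long npoints, double delta)
{
    const double i1mdelta = 1.0 / (1.0 - delta);   /* one division, done once */
    for (long p = 0; p < npoints; ++p)
        P2[p] = rhs[p] * i1mdelta;                 /* multiply instead of divide */
}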
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Rewrite $\phi_{ijk}^{n+1} = \phi_{ijk}^{n} + \Delta\eta\,\dot{\phi}_{ijk}^{n+1/2}$ as $\dot{\phi}_{ijk}^{n+1/2} = \bigl[\phi_{ijk}^{n+1} - \phi_{ijk}^{n}\bigr]/\Delta\eta$;
no need to store the time derivative in addition to the solution at two timesteps
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Remove unnecessary 4D calculations and array lookups from 3D simulations
Before:  Lphi = P[i-1][j][k][l] + P[i+1][j][k][l] + P[i][j-1][k][l] + …
                - 8 * P[i][j][k][l];
After:   Lphi = P[i][j-1][k][l] + P[i][j+1][k][l] + …
                - 6 * P[i][j][k][l];
Saves two reads/additions per stencil point
Optimization and Modernization (3/3)
Modernizations
Reduce memory footprint (2x) by removing redundant arrays
Remove unnecessary 4D calculations and array lookups from 3D simulations
Combine three algorithmic stages into a single loop
Before: solve(t), timestep(), area(t+1)
After: solve(t), area(t), timestep()
Allows for one pass through the data each timestep.
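A hedged sketch of the fused ordering follows, with hypothetical helper functions standing in for the stencil, wall-area, and potential terms; the time derivative is reconstructed from the two stored solutions, as in the rewrite above.

/* Hypothetical helpers standing in for the stencil and potential terms. */
double laplacian_at(const double *phi, long p);
double wall_area_at(const double *phi, long p);
double dVdphi(double phi);

/* Sketch: solve(t), area(t) and timestep() fused into one sweep, so each grid
 * point is touched once per timestep while its data is still hot in cache. */
void timestep_fused(const double *phi_old, const double *phi, double *phi_new,
                    long npoints, double deta, double delta, double *area)
{
    const double inv1pdelta = 1.0 / (1.0 + delta);
    double total_area = 0.0;

    for (long p = 0; p < npoints; ++p) {
        double lap    = laplacian_at(phi, p);           /* solve(t): stencil     */
        total_area   += wall_area_at(phi, p);           /* area(t): same pass    */
        double phidot = (phi[p] - phi_old[p]) / deta;   /* recovered derivative  */
        phidot = ((1.0 - delta) * phidot
                  + deta * (lap - dVdphi(phi[p]))) * inv1pdelta;
        phi_new[p] = phi[p] + deta * phidot;            /* timestep()            */
    }
    *area = total_area;
}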