Powered by Intel® Math Kernel Library (Intel® MKL)

Shane Story, Intel MKL Technology Strategist

Victor Kostin, Intel MKL Dense Solvers team manager

“MKL is the best math library in the world… robust, accurate, and vastly faster than the competition.”

Robert Harrison, Director, Oak Ridge National Labs

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

All you ever wanted to know about Intel MKL ...

• What is the Intel Math Kernel Library (Intel MKL)?

• Who uses Intel MKL?

• Why is Intel MKL relevant to the world’s fastest computer systems?

• What is the most widely used math library in the world?

• How do I acquire Intel MKL?

You will be an Intel MKL evangelist 40 minutes from now!



Agenda

• Introduction

• Performance

– Benchmarks

– Top500

• Customers

• Looking forward



What is a math library?

• Start with your problem in a scientific discipline

– Geosciences & Geo-engineering

– Financial Analytics

– Science & Research

– Engineering Design

– Signal Processing

– Digital Content Creation

• These problems might involve mathematics

– Differential equations

– Linear algebra

– Fourier transforms

– Statistics

$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} + q\,u = f(x, y)$$


Solving these problems can be tough

• These problems often

– are impossible to solve by hand (no closed-form solution)

– are computationally intensive

– involve complex algorithms

• Math libraries can help ease the burden

– Pick your favorite programming language: C/C++/C#/Fortran/Java/Python

– IEEE floating point: single (32-bit) and double (64-bit), with value = (±1) × 2^exponent × 1.fraction

– Translate your mathematics into a software program
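The IEEE layout mentioned above can be inspected directly. The following Python sketch unpacks a 64-bit double into its sign, exponent, and fraction fields and rebuilds the value as (±1) × 2^exponent × 1.fraction. The helper names `decompose_double` and `recompose_double` are hypothetical and are not part of Intel MKL; the sketch assumes a normal number (not zero, subnormal, infinity, or NaN).

```python
import struct

def decompose_double(value):
    """Split a normal IEEE 754 double into (sign, unbiased exponent, fraction bits)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = -1 if bits >> 63 else 1
    biased_exponent = (bits >> 52) & 0x7FF   # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)        # 52 fraction bits
    if biased_exponent in (0, 0x7FF):
        raise ValueError("zero, subnormal, infinity, and NaN are not handled here")
    return sign, biased_exponent - 1023, fraction

def recompose_double(sign, exponent, fraction):
    """Rebuild the value as sign * 2**exponent * 1.fraction."""
    significand = 1 + fraction / 2**52       # implicit leading 1 plus the fraction
    return sign * 2.0**exponent * significand

sign, exponent, fraction = decompose_double(-6.5)
print(sign, exponent, fraction / 2**52)      # -6.5 = -1 * 2**2 * 1.625
```

Single precision follows the same pattern with 8 exponent bits and 23 fraction bits.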



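Matrix Multiplication Example

A * B = C

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} * \begin{pmatrix} 1 & 4 & 7 & 10 \\ 2 & 5 & 8 & 11 \\ 3 & 6 & 9 & 12 \end{pmatrix} = \begin{pmatrix} ? & ? & ? & ? \\ ? & ? & ? & ? \end{pmatrix}$$

Dimensions m, n, k: A is m × k (m=2, k=3), B is k × n (k=3, n=4), C is m × n (m=2, n=4)

Match the general Intel MKL format

C := alpha * op( A ) * op( B ) + beta * C

Make an Intel MKL call

DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)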
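As a cross-check of the 2 × 3 by 3 × 4 example, here is a minimal pure-Python sketch of the same update, alpha * A * B + beta * C, with alpha = 1, beta = 0, and no transposes. It is an illustration of what the call computes, not the Intel MKL implementation; the function name `gemm` and its list-of-lists interface are assumptions made for this sketch.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * A * B + beta * C for row-major lists of lists (no transposes)."""
    m, k, n = len(a), len(b), len(b[0])
    # Start from beta * C, then accumulate the alpha * A * B products.
    result = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            for l in range(k):
                result[i][j] += alpha * a[i][l] * b[l][j]
    return result

A = [[1, 2, 3],
     [4, 5, 6]]                    # m = 2, k = 3
B = [[1, 4, 7, 10],
     [2, 5, 8, 11],
     [3, 6, 9, 12]]                # k = 3, n = 4
C = [[0] * 4 for _ in range(2)]    # m = 2, n = 4

print(gemm(1, A, B, 0, C))         # [[14, 32, 50, 68], [32, 77, 122, 167]]
```

The printed matrix fills in the question marks on the slide.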



No need to reinvent the wheel - math libraries improve developer productivity

Program Sample

      ...
      ! hand-written triple loop: c = c + a*b
      do i = 1, m
        do j = 1, n
          do l = 1, k
            c(i,j) = c(i,j) + a(i,l)*b(l,j)
          end do
        end do
      end do
      ...

or

      ...
      ! the same update through a single Intel MKL call
      call DGEMM(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
      ...
      stop
      end

Compile and link against Intel MKL

On Windows, using the Intel Fortran or C/C++ compiler, simply

      ifort prog.f /Qmkl

or

      icl prog.c /Qmkl

For other compilers, linking to MKL may be tricky; see the Intel® Math Kernel Library Link Line Advisor at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/














Intel MKL supports a wide variety of mathematical computations

Linear Algebra

• BLAS

• LAPACK

• Sparse Solvers

• ScaLAPACK

Fast Fourier Transforms

• Multidimensional FFT

• FFTW interfaces

• Cluster FFT

Vector Math

• Trigonometric

• Hyperbolic

• Exponential, Logarithmic

• Power / Root

Vector Random Number Generators

• Congruential

• Wichmann-Hill

• Mersenne Twister

• Sobol

• Niederreiter

• Non-deterministic

Summary Statistics

• Kurtosis

• Variation coefficient

• Quantiles

• Order statistics

• Min/max

• Variance-covariance

Data Fitting

• Splines

• Interpolation

• Cell search



Let’s review these domains in more detail


BLAS – Basic Linear Algebra Subprograms

De facto standard APIs since the 1980s (FORTRAN 77)

Level 1 – vector-vector operations

Level 2 – matrix-vector operations

Level 3 – matrix-matrix operations

Precisions: single, double, single complex, double complex

Matrix Types: “dense” general, packed, triangular, banded

Operation        Intel MKL routine (D is for double)   Example            Computational complexity (work)
Vector-vector    DAXPY                                 y = y + αx         O(N)
Matrix-vector    DGEMV                                 y = αAx + βy       O(N²)
Matrix-matrix    DGEMM                                 C = αA * B + βC    O(N³)

Original BLAS available at http://netlib.org/blas/
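To make the Level 1 and Level 2 rows concrete, here is a small pure-Python sketch of the AXPY and GEMV operations. The function names mirror the BLAS routine names, but the signatures are simplified assumptions for illustration (real BLAS routines also take dimensions, strides, and leading dimensions, and update y in place).

```python
def axpy(alpha, x, y):
    """Level 1: y + alpha * x, about 2N floating point operations."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def gemv(alpha, a, x, beta, y):
    """Level 2: alpha * A x + beta * y, about 2N^2 operations for an N x N matrix."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]

x = [1.0, 2.0]
y = [10.0, 20.0]
A = [[1.0, 2.0],
     [3.0, 4.0]]

print(axpy(2.0, x, y))            # [12.0, 24.0]
print(gemv(1.0, A, x, 1.0, y))    # [15.0, 31.0]
```

Each level up adds one more loop over N, which is where the O(N), O(N²), O(N³) progression comes from.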



LAPACK – Linear Algebra PACKage

De facto standard API since the early 1990s

Thousands of linear algebra functions

Four floating point precisions supported

Breadth of coverage:

• Matrix factorizations: the 3 Amigos – LU, Cholesky, QR

• Solving systems of linear equations

• Condition number estimates

• Singular value decomposition

• Symmetric and non-symmetric eigenvalue problems

• And much, much more

Original LAPACK is available at: http://netlib.org/lapack/



VML - Vector Math Library

• Arithmetic

‒ add/sub/sqrt/ ...

• Exponential and log

‒ exp/pow/log/log10

• Trigonometric and hyperbolic

‒ sin/cos/sincos/tan(h)

‒ asin/acos/atan(h)

• Rounding

‒ ceil, floor, round ...

• And many more ...

• Real and complex

• Single/double precision

• 3 accuracy modes

‒ High accuracy (almost correctly rounded)

‒ Low accuracy (2 lowest bits in error)

‒ Enhanced performance (1/2 the bits correct)

Example: y(i) = e^x(i) for i = 1 to n

Vector-based elementary functions allow developers to balance accuracy with performance
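The accuracy-versus-performance idea can be illustrated with a generic sketch. The code below applies the exponential to every element of a vector, once with the standard library and once with a deliberately cheaper truncated Taylor series, then measures the relative error. This is not how Intel MKL implements its accuracy modes; the function names and the Taylor approach are assumptions chosen only to show the trade-off.

```python
import math

def vexp(x):
    """Reference: y[i] = e**x[i] using the standard library."""
    return [math.exp(v) for v in x]

def vexp_taylor(x, terms=8):
    """Cheaper, less accurate stand-in: Taylor series around 0, cut off after `terms` terms."""
    out = []
    for v in x:
        total, term = 1.0, 1.0
        for n in range(1, terms):
            term *= v / n          # term is now v**n / n!
            total += term
        out.append(total)
    return out

def max_relative_error(approx, exact):
    return max(abs(a - e) / abs(e) for a, e in zip(approx, exact))

x = [i / 10 for i in range(11)]    # 0.0, 0.1, ..., 1.0
print(max_relative_error(vexp_taylor(x, terms=8), vexp(x)))
print(max_relative_error(vexp_taylor(x, terms=12), vexp(x)))
```

Fewer terms mean less work per element and a larger error, which is the same kind of choice the three accuracy modes expose.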


Intel Instruction Set Architectures

• Intel® Streaming SIMD Extensions (Intel® SSE)

• Intel® Streaming SIMD Extensions 2 (Intel® SSE2)


• Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3)



• Intel® Advanced Vector Extensions (Intel® AVX)

• Intel® Advanced Vector Extensions 2 (Intel® AVX2)



The evolution of Intel® MKL

Over 18 years of strategic investment

[Timeline figure. Year ticks: 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2002, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012]

Releases and added functionality, in the order shown:

• Intel BLAS Library

• Intel Math Kernel Library 2.0: BLAS 3 threaded, 2D FFT

• 2.1: Sparse BLAS

• 3.0: LAPACK

• 4.0: Vector Math

• 6.0: DFTI & Vector Statistics

• 7.0: PARDISO* & ScaLAPACK

• 8.0: CDFT, ISS, F95

• 8.1, 9.0: Trig Transforms, Poisson

• 9.1: Trust Region, LINPACK

• 10.0/1: OOC PARDISO

• 10.2: LAPACK 3.2

• 10.3: Data Fitting

• 11.0: CNR

Hardware and instruction set labels on the figure: Pentium®, Itanium®, Intel Xeon Phi™, Intel® SSE, Intel® SSE2, Intel® SSE3, Intel® SSSE3, Intel® SSE4, Intel® AVX, Intel® AVX2


Job #1 is optimal performance

• We go to extremes to get the most “differentiated” performance out of the processor and system resources available

‒ Core: think register usage, prefetching, caches

‒ Multicore (processor/socket) level parallelization

‒ Multi-socket (node) level parallelization

‒ Clusters

‒ Data locality is key: keep your friends close but your data closer ...

Let’s visualize this!



Intel MKL is optimized for performance

Automatic scaling from the core, to multicore, to many core and beyond!

[Figure labels: Sequential Intel MKL runs on single core; Intel MKL + OpenMP; Intel MKL + MPI; Intel Xeon Phi Coprocessor (PCIe)]


How we measure performance

• FLOPS – floating point operations per second

‒ 1 FLOP is one floating point operation

‒ Addition +, subtraction -, multiplication *, division /

• An “old” 2.0 GHz Intel SSE2-based Xeon quad core can do:

‒ 2 additions + 2 multiplications per cycle (via SIMD instructions)

‒ (4 cores) * (2 GHz) * (4 FLOP per cycle) = 32 GigaFLOPS (Double)

• Intel AVX-based (2010) systems double these rates (2x)

‒ Double precision: 4 additions + 4 multiplications per cycle

‒ (4 cores) * (2 GHz) * (8 FLOP per cycle) = 64 GigaFLOPS (Double)
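The peak-rate arithmetic above is simple enough to script. This sketch reproduces the two figures from the slide; the function name `peak_gflops` is an assumption for illustration.

```python
def peak_gflops(cores, ghz, flop_per_cycle):
    """Theoretical peak = cores * clock rate (GHz) * FLOP per cycle per core."""
    return cores * ghz * flop_per_cycle

# SSE2 example: 2 additions + 2 multiplications per cycle in double precision.
sse2_peak = peak_gflops(cores=4, ghz=2.0, flop_per_cycle=4)
# AVX example: 4 additions + 4 multiplications per cycle in double precision.
avx_peak = peak_gflops(cores=4, ghz=2.0, flop_per_cycle=8)

print(sse2_peak, avx_peak)   # 32.0 64.0
```

These are theoretical ceilings; measured rates are lower because real code also has to move data.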


DGEMM performance (Intel MKL vs. ATLAS*)

Intel MKL strives to provide competitive performance

[Chart: Intel® Math Kernel Library versus ATLAS*, DGEMM on Intel® Xeon® E5-2600 Family Processor. Y axis: Performance (GFlops), 0 to 350. X axis: Matrix Size (M=N=K=64, 112, 128, .., 6000, 7000, 8000). Series: Intel MKL at 16, 8, and 1 threads; ATLAS at 16, 8, and 1 threads]

Intel® MKL offers significant performance boost over ATLAS*

Performance scales as number of CPU cores grows

Performance up to 87% of CPU peak!
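A figure such as "percent of peak" comes from dividing a measured rate by the theoretical peak. A matrix multiply with dimensions M, N, K performs about 2 * M * N * K floating point operations (one multiply and one add per inner-loop step). The sketch below shows the calculation; the timing and the peak value are made-up placeholders, not measurements from the chart, and the function names are assumptions.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved rate for an M x K by K x N multiply that took `seconds`."""
    flop = 2 * m * n * k           # one multiply + one add per inner-loop step
    return flop / seconds / 1e9

def efficiency(achieved_gflops, peak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return achieved_gflops / peak_gflops

# Hypothetical numbers for illustration only.
rate = gemm_gflops(8000, 8000, 8000, seconds=4.0)
print(rate)                        # 256.0
print(efficiency(rate, 320.0))     # 0.8 of a hypothetical 320 GFLOPS peak
```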


Why does performance matter?

• Grand Challenge problems (1980s/90s)

‒ Human genome

‒ Simulating nuclear explosions

‒ Speech and vision recognition

• Beyond GigaFLOPS (GFLOPS), we have

‒ TeraFLOPS (TFLOPS): GFLOPS x 1000 = 10^12

‒ PetaFLOPS (PFLOPS): TFLOPS x 1000 = 10^15

‒ ExaFLOPS (EFLOPS): PFLOPS x 1000 = 10^18 (ETA ~2018)

‒ ZettaFLOPS (ZFLOPS): EFLOPS x 1000 = 10^21

• More FLOPS (+ technology) will enable us to solve currently unsolvable problems



An insatiable need for FLOPS

Exascale and beyond will help address humanity’s challenges

[Chart: Y axis from 100 MFlops to 1 ZFlops in powers of ten. X axis years: 1993, 1999, 2005, 2012, 2017, 2023, 2029, with the later part marked Forecast. Annotations: Weather Prediction, Medical Imaging, Cellular Research. Source: www.top500.org]


The big iron

• The Top500 list - the 500 most powerful commercially available computer systems known

• Entrants are showcased (measured) using the High Performance Linpack (HPL) Benchmark

• List updated twice yearly at supercomputing conferences

• Intel MKL is key when it comes to extracting optimal performance on Intel-based systems

“The Intel® Math Kernel Library is indispensable ... a rich, highly optimized collection of math routines … Outstanding performance is achieved on both multicore and multiprocessor systems.”

Jack Dongarra (Father of Linpack Benchmark), Univ. of Tennessee

http://www.top500.org/


High Performance Linpack (HPL)

5x + 3y – 2z = 23
7x + 9y + 3z = 102
8x + 8y – 8z = 8

Generalize to Ax = b where A is huge!

Distribute blocks of A among cluster nodes

http://www.netlib.org/benchmark/hpl/
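The small 3 × 3 system on this slide can be solved with the same basic idea HPL applies at scale: elimination with partial pivoting. The sketch below is a serial, exact-arithmetic toy using Python's `fractions` module, not HPL or Intel MKL code; the function name `solve` is an assumption for illustration.

```python
from fractions import Fraction

def solve(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (exact arithmetic)."""
    n = len(b)
    # Augmented matrix [A | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot in this column and swap it up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, from the last row up.
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

A = [[5, 3, -2],
     [7, 9, 3],
     [8, 8, -8]]
b = [23, 102, 8]
x, y, z = solve(A, b)
print(x, y, z)   # 147/26 105/26 113/13
```

In floating point that is roughly x ≈ 5.65, y ≈ 4.04, z ≈ 8.69. The work grows as O(N³) with the matrix size, which is why the benchmark stresses both arithmetic throughput and data movement.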


Current Top500 list

[Chart: axis labels TFLOP, PFLOP, EFLOP. Annotations: 162 PFlops; Titan: #1, 17.6 PFLOPS!; Lomonosov: #26, 0.9 PFLOPS!]

“My current Core i5 laptop would have been the world’s fastest computer when I started at Intel in 1991”

Shane Story (Intel MKL Historian)


Intel MKL Customers

• Direct sales: strategic end user accounts from segments:

‒ Academic/Research

‒ Animation

‒ Life Sciences

‒ Financial Segment

‒ Energy

‒ Manufacturing

• Major ISVs

• Library Vendors

• Open Source Projects

‒ NumPy

‒ PETSc

‒ Trilinos

‒ Eigen

‒ Gromacs

‒ Octave

‒ WRF


Intel MKL Poisson solver

“We want solid building blocks that we know will be robust and have optimal performance, Intel MKL provides that. ...”

Ron Henderson, Senior R&D Manager, DreamWorks Animation*

*Quote from “A Very Good Kitty, Indeed”, Intel Visual Adrenaline 2012


Intel MKL is a key ingredient of the technical computing SW product offerings

It is also available standalone: http://software.intel.com/en-us/intel-math-kernel-library-purchase


Q & A


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Intel Inside, the Intel Inside logo, Itanium, Itanium Inside, Pentium, Pentium Inside, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice


Intel’s share of the Top500 systems

IA displaces RISC

