tutorial presentation 6
Post on 07-Jul-2018
221 Views
Preview:
TRANSCRIPT
-
8/19/2019 Tutorial Presentation 6
1/24
BLAS and Vectorization extensions.
Carlos Pachajoa
December 5, 2012
http://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
2/24
Contents
BLAS
Vectorization extensions for X86
GPGPU
http://find/
-
8/19/2019 Tutorial Presentation 6
3/24
BLAS
Stands for Basic Linear Algebra Subprograms
Is an interface for linear algebra operations. BLAS itself is aspecification for Fortran. The equivalent interface in C is CBLAS.
Use the local implementation with#include
http://find/
-
8/19/2019 Tutorial Presentation 6
4/24
BLAS levels
The operations are divided in three levels: Level 1: Vector operations like
y ← αx + y, x, y ∈ ZN
and also dot products and vector norms.
Level 2: Matrix-vector operations likey ← αAx + β y, x, y ∈ ZN , A ∈ ZM ×N
and solutions for triangular systems, among others.
Level 3: Matrix-matrix operations likeC ← αAB + β C, C ∈ ZM ×N , A ∈ ZM ×P , B ∈ ZP ×N
Different calls for different precisions and whether real or complexnumbers.
http://find/
-
8/19/2019 Tutorial Presentation 6
5/24
BLAS function naming conventions
The first letter specifies precision: S for real, single precision.
D for real, double precision.
C for complex, single precision.
Z for complex, double precision.
The first letter is followed by a function, for examplexAXPY is y ← αx + y from level one. Here, x represents a space
and a precision.Therefore, SAXPY will perform the operation using single precisionfloating point numbers.
http://find/
-
8/19/2019 Tutorial Presentation 6
6/24
CBLAS data representation
CBLAS receives data as contiguous positions in memory and a
size. Both matrices and vectors are stored in this way. To specify amatrix, one additionally has to provide a stride and define whetherit is row- or column- major.
{1,2,3,4,5,6,7,8,9} will be
1 2 34 5 6
7 8 9
in R.M. and
1 4 72 5 8
3 6 9
in C.M.
The ordering is given using this enumeration:enum CBLAS ORDER {CblasRowMajor=101, CblasColMajor=102};
http://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
7/24
A function signature
y ← αAx + β y, y ← αATx + β y
v o i d c bl a s s g em v (c o n s t enum CBLAS ORDER Ord er ,c o n s t enum CBLAS TRANSPOSE Tran sA ,c on s t i n t M,
c on st i n t N ,c on s t f l o a t a l p h a ,c o n st f l o a t ∗A ,c on st i n t l d a ,c o n st f l o a t ∗X ,c o n s t i n t incX ,c on st f l o a t b et a ,f l o a t ∗Y ,c on st i n t i n c Y
) ;
http://find/
-
8/19/2019 Tutorial Presentation 6
8/24
Enumeration types
enum CBLAS ORDER {C bl a sR o wM a jo r = 10 1 , /∗ row−m a j o r a r r a y s ∗/C b l a s C o l M a j o r = 1 0 2}; /∗ column−m aj o r a r r a y s ∗/
enum CBLAS TRANSPOSE { // Whether t o work w it h t r a n s p o s eC b l a s No T r a n s =111 , /∗ t r a n s = ’ N ’ ∗/C b l a s T r a n s =112 , /∗ t r a n s = ’ T ’ ∗/
C b l a s C o n j T r a n s = 1 1 3} ; /∗ t r a n s = ’ C ’ ∗/enum CBLAS UPLO { // The m at ri x i s up p e r o r l o w erC b l a s U p p e r =121 , /∗ u p l o = ’U ’ ∗/C b l a s L o w e r = 1 2 2}; /∗ u p l o =’L ’ ∗/
enum CBLAS DIAG { // The m at ri x i s u n i t r i a n g u l a rC b l a s N o n U n i t =131 , /∗ d i a g = ’N ’ ∗/
C b l a s U n i t = 13 2}; /∗ d i a g = ’U ’ ∗/enum CBLAS SIDE { // O r d er o f m at ri x m u l t i p l i c a t i o n
C b l a s L e f t =141 , /∗ s i d e =’L ’ ∗/C b l a s R i g h t = 14 2}; /∗ s i d e =’R ’ ∗/
http://find/
-
8/19/2019 Tutorial Presentation 6
9/24
Some CBLAS implementations
ATLAS (Automatically Tuned Linear Algebra Software)
MKL (Math Kernel Library) CUBLAS
http://find/
-
8/19/2019 Tutorial Presentation 6
10/24
Documents with CBLAS routines
http:
//math-atlas.sourceforge.net/psdoc/cblasqref.ps
https://developer.apple.com/library/mac/
documentation/Accelerate/Reference/BLAS_Ref/
Reference/reference.html
http://math-atlas.sourceforge.net/psdoc/cblasqref.pshttp://math-atlas.sourceforge.net/psdoc/cblasqref.pshttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttps://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.htmlhttp://math-atlas.sourceforge.net/psdoc/cblasqref.pshttp://math-atlas.sourceforge.net/psdoc/cblasqref.pshttp://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
11/24
SIMD
Single Instruction, Multiple Data
Taken from
http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif
http://archive.arstechnica.com/cpu/1q00/simd/figure6.gifhttp://archive.arstechnica.com/cpu/1q00/simd/figure6.gifhttp://find/
-
8/19/2019 Tutorial Presentation 6
12/24
SSE
Streaming SIMD Extensions.
Additional registers in the processor and operations in thearchitecture.
http://en.wikipedia.org/wiki/File:XMM_registers.svg
8 registers with 128 bits each. 4 single precision floating pointnumbers in each register.
http://en.wikipedia.org/wiki/File:XMM_registers.svghttp://en.wikipedia.org/wiki/File:XMM_registers.svghttp://find/
-
8/19/2019 Tutorial Presentation 6
13/24
Some instructions
; A l l i n s t r u c t i o n s end up i n S; p e nu l t i m at e l e t t e r d en ot es s c a l a r o r v e c t o r; S s t a n d s f o r s c a l a r , P f o r v e c t o r .; O pe ra nd s a r e XMM r e g i s t e r s
; Adds a l l e le me nt s o f a r r a y s op1 and op2 i n t o op1ADDPS op1 , op2
; Adds t he f i r s t e le me nt o f op1 and op2 i n t o; t h e f i r s t p o s i t i o n o f op2
ADDSS op1 , op2
http://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
14/24
Some SSE instructions
v e c r e s . x = v 1 . x + v 2 . x ;
v e c r e s . y = v 1 . y + v 2 . y ;v e c r e s . z = v 1 . z + v 2 . z ;v e c r e s . w = v1 . w + v2 . w ;
; xmm0 = v1 .w | v1 . z | v1 . y | v1 . xmovaps xmm0, [ v1 ]
; xmm0 = v1 . w+v2 . w | v1 . z+v2 . z | v1 . y+v2 . y | v1 . x+v2 . xaddps xmm0, [ v2 ]
m ov ap s [ v e c r e s ] , xmm0
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensionshttp://en.wikipedia.org/wiki/Streaming_SIMD_Extensionshttp://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
15/24
SSE upgrades
SSE2 Allows to represent multiple types of data fitting onthe vectors, including integers and characters, andperform the corresponding operations.AMD’s implementation also doubled the number of XMM arrays.
SSE3 Addition of horizontal operations within the XMMarrays, such as data reduction.
SSSE3 Additional instructions for SSE3.
SSE4 Introduction of Dword multiply, allowing to multiplytwo pairs of 32-bit integers to produce 2 64-bitnumbers. Vector dot products.
http://find/
-
8/19/2019 Tutorial Presentation 6
16/24
AVX
Intel’s extension to SSE for the Sandy Bridge microarchitecture,
introduced in 2011. Also available in AMD’s Bulldozer.
http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg
http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svghttp://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svghttp://find/
-
8/19/2019 Tutorial Presentation 6
17/24
Automatic vectorization
The Intel compiler can, under certain conditions, vectorize loops inthe code.
This can be activated by using the -vec option.
for(i=0; i
-
8/19/2019 Tutorial Presentation 6
18/24
Obstacles to vectorization
Non-contiguous memory access
for(i=0; i < SIZE; i+=stride) A[i] = B[i] + C[i];
Data dependencies
for(i=0; i < SIZE; i++) A[i] = A[i-1] + B[i];
G idi ICC i i
http://find/
-
8/19/2019 Tutorial Presentation 6
19/24
Guiding ICC vectorization
Pragmas For example, #pragma ivdep, among others tocontrol when to vectorize a loop.
Keywords, such as restrict .
Switches passed to the compiler as optimization levels.
Look at the ICC automatic vectorization documentation[11].
GPGPU
http://find/http://goback/
-
8/19/2019 Tutorial Presentation 6
20/24
GPGPU
http://blogs.nvidia.com/2009/12/
whats-the-difference-between-a-cpu-and-a-gpu/
GPGPU d CPU
http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/http://find/
-
8/19/2019 Tutorial Presentation 6
21/24
GPGPU and CPU
CPU
General purpose
Pipelines
Few threads
Lots of cache (Correlation)
GPGPU
Specialized for local vector operations
Many cores and threads Little cache
Smaller consumption relative to a CPU
CUDA
http://find/
-
8/19/2019 Tutorial Presentation 6
22/24
CUDA
Stands for Compute Unified Device Architecture .
Effectively, a programming model to access and control GPUsusing a virtual instruction set, in a simmilar manner as a CPU.
Only supported on NVIDIA cards.
Uses the NVIDIA compiler, and can be programmed using CUDAC/C++, languages based on C/C++.
O CL
http://find/
-
8/19/2019 Tutorial Presentation 6
23/24
OpenCL
Stands for Open Computing Language
It also provides access to the GPU.
It’s an open standard, supported by NVIDIA and AMD,among others.
Provides a language based on C99.
Functionality provided by a driver. Compilation handled by linking to the correct library.
References
http://find/
-
8/19/2019 Tutorial Presentation 6
24/24
References
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://www.netlib.org/blas/
http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html
http://math-atlas.sourceforge.net/faq.html
http://software.intel.com/sites/products/documentation/hpc/mkl/
mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm
https://developer.nvidia.com/cublas
http://www.godevtool.com/TestbugHelp/XMMfpins.htm
software.intel.com/en-us/avx
http://www.khronos.org/opencl/
http://developer.download.nvidia.com/CUDA/training/GTC_Express_
Sarah_Tariq_June2011.pdf
http:
//software.intel.com/sites/products/documentation/hpc/composerxe/
en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensionshttp://www.netlib.org/blas/http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.htmlhttp://math-atlas.sourceforge.net/faq.htmlhttp://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htmhttp://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htmhttps://developer.nvidia.com/cublashttp://www.godevtool.com/TestbugHelp/XMMfpins.htmhttp://localhost/var/www/apps/conversion/tmp/scratch_5/software.intel.com/en-us/avxhttp://www.khronos.org/opencl/http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdfhttp://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdfhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htmhttp://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdfhttp://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdfhttp://www.khronos.org/opencl/http://localhost/var/www/apps/conversion/tmp/scratch_5/software.intel.com/en-us/avxhttp://www.godevtool.com/TestbugHelp/XMMfpins.htmhttps://developer.nvidia.com/cublashttp://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htmhttp://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htmhttp://math-atlas.sourceforge.net/faq.htmlhttp://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.htmlhttp://www.netlib.org/blas/http://en.wikipedia.org/wiki/Streaming_SIMD_Extensionshttp://find/
top related