Panini: A GPU-Aware Array Class (GPU Technology Conference)
TRANSCRIPT
Panini: A GPU-Aware Array Class, Dr. Santosh Ansumali (JNCASR) & Priyanka Sah (NVIDIA)
Heterogeneous Computing
CPU
— Multicore
— Multiprocessor
— Cluster of Multicore
GPU
— CPU + GPU
MIC
Background
Programming Efficiency
Performance
— Scalar
— Parallel
MATLAB / C++ STL
Merit: easy to write code
Demerit: performance issues; the code does not scale
Array class (C/C++): a smarter way of writing code
Combine the object-oriented feel of MATLAB with C++ template metaprogramming
Background
Template Metaprogramming
— Blitz++: Fast, Accurate Numerical Computing in C++
Advanced features (vector, array, matrix)
FORTRAN
MATLAB
Scalable
Disadvantage: large code size
POOMA (MPI-based)
Panini: part MATLAB, part Blitz++
Vector initialization, MATLAB style: a = 1, 2, 3
Expression evaluation: a = αb + βc
MATLAB Blitz way
T1[i]=βc[i] lazy evaluation
T2[i]=αb[i] a[i]= α*b[i] +β*c[i]
T3[i]=t1[i]+t2[i] A=t3[i]
— Small array (double a[20]): complete loop unrolling
— Large array (double *a): full loop unrolling would cause register spilling
a = αb + βc + µd requires type checking of the expression
CRTP: avoids type checking without resorting to virtual functions
bsum(a,b) stores { &a[i], &b[i] }; a + b + c becomes bsum of X + X -> tuple sum
Vector Initialization
Type Checking
template<class lhs, typename dataType>
class scalarMultR : public baseET<dataType, scalarMultR<lhs, dataType> > {
public:
    __device__ scalarMultR(const lhs &l, const dataType &r) : lhs_(l), rhs_(r) {}
    __device__ dataType value(int i) const { return lhs_.value(i) * rhs_; }
private:
    const lhs &lhs_;
    const dataType rhs_;
};
template<class T, int len>
class commaHelper {
public:
    __device__ commaHelper() : vPtr(0) {}
    __device__ commaHelper(T *ptr) : vPtr(ptr) {}
    __device__ commaHelper &operator,(T val) { *vPtr++ = val; return *this; }
private:
    T *vPtr;
};
__device__ commaHelper<dataType, N> operator=(dataType val) {
    for (int i = 0; i < numELEM; i++) data[i] = val;   // broadcast the first value
    return commaHelper<dataType, N>(&data[1]);         // subsequent commas fill the rest
}
Objectives
Programmer productivity
— Rapidly develop complex applications
— Leverage parallel primitives
Encourage generic programming
High performance
— With minimal programmer effort
Interoperability
— Integrates with CUDA C/C++ code
Panini Library
A generic parallel array class built on advanced generic programming methodologies, where the details of parallelization are hidden inside the array class itself.
Allows a user to work with high-level physical abstractions for scientific computation.
Expression templates: the technique behind high-performance numerical libraries, providing abstract mathematical notation via operator overloading in C++.
Efficiently parallelizable for large-scale scientific code.
The expression-template mechanism is implemented using the "Curiously Recurring Template Pattern" (CRTP) in C++.
What is Panini Library ?
C++ template library for CUDA
Supported data structures:
— 1D, 2D and complex vectors on CUDA
— 1D, 2D grids with multidimensional data on CUDA
SoA as well as AoS data structures
Templates & operator overloading
Loop unrolling for small-size vectors
Lazy evaluation
Common sub-expression elimination
Containers/Objects Supported by Panini
Small-size vectors on the device
Large vectors on the device
Complex arrays, 1D and 2D grids
using namespace Panini;

int main() {
    int nX = 200;
    int nY = 200;
    vectET<double> coordX(nX, 0.0);
    vectET<double> coordY(nY, 0.0);
    gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
    gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
    gridFlow2D<1> pressureM(nX, nY, 1, 1);
    gridFlow2D<1> pressureN(nX, nY, 1, 1);
    gridFlow2D<1> pressure(nX, nY, 1, 1);
    gridFlow2D<1> potentialO(nX, nY, 1, 1);
    // ...
}
Basic Feature of vectTiny/vectET Class
Direct assignment
— vectTiny<dataType, N> array;
— array = 1.2, 3.5, 5.6;
Binary operations
Scalar arithmetic operations
Math operations on vectTiny objects
Type checking not required
Supports single- and double-precision floating-point values, complex numbers, Booleans, and 32-bit signed and unsigned integers
Supports manipulating vectors, matrices, and N-dimensional arrays
Best Practice
Structure of Arrays
— Ensure memory coalescing
Array of Structure
Implicit Sequences
— Avoid explicitly storing and accessing regular patterns
— Eliminate memory accesses and storage
DataType
vectTiny: small fixed-size arrays, where the user knows the size in advance
— vectTiny<float, 100> a;
vectET: large arrays, or grids that are 1D or N-dimensional
— e.g. a grid of size 100 where every point carries a fixed array of size 3:
— vectET<vectTiny<myReal, 3>> a(100);
gridFlow2D: 2D or N-dimensional grids
— gridFlow2D<T, FLOW_FIELD_2D>** myGrid;
Allowed Operations
Three modes of initialization are provided:
— vectTiny<double, 3> a = 2;
— vectTiny<double, 3> a; a = 1, 2, 3;
— vectTiny<double, 3> b; b = a;
All math operations, binary operations, and scalar operations:
— vectTiny<double, 3> a = 0.1, b, c; b = sin(a); c = a + b; c = 0.5*c;
All vector operations:
— vectTiny<double, 3> a, b, c, d; b = a + sin(c) + 0.3*cos(d);
Vector operations rely on the following optimizations: loop unrolling (by hand) and lazy evaluation.
References to the operand objects are kept until the final evaluation loop.
Allowed Operations…
Lazy evaluation
— vectTiny<double, 3> a, b, c, d;
— b = sin(c) + 0.3*cos(d);
A typical operator-overloading + virtual-function approach would evaluate it in the following sequence:
— for(i=1,N) tmp[i] = cos(d[i]);
— for(i=1,N) tmp1[i] = 0.3*tmp[i];
— for(i=1,N) tmp3[i] = sin(c[i]);
— for(i=1,N) b[i] = tmp3[i] + tmp1[i];
Panini instead generates optimized Fortran-style code:
— for(i=1,N)
—   b[i] = sin(c[i]) + 0.3*cos(d[i]);
Structure of Arrays
Coalescing improves memory efficiency
Accesses to arrays of arbitrary structures won't coalesce
Reordering into a structure of arrays ensures coalescing
— struct float3 { float x; float y; float z; }; float3 *aos; ... aos[i].x = 1.0f;
— struct float3_soa { float *x; float *y; float *z; }; float3_soa soa; ... soa.x[i] = 1.0f;
Array of structures:
struct Velocity
{
    int ux;
    int uy;
};
Velocity<FLOW_FIELD_2D> obj_vel(nX, nY, 1, 1);
Structure of arrays (best practice):
struct Pressure
{
    float *pressure;
    float *pressureM;
    float *pressureN;
};
gridFlow2D<1> pressureM(nX, nY, 1, 1);
gridFlow2D<1> pressureN(nX, nY, 1, 1);
gridFlow2D<1> pressure(nX, nY, 1, 1);
PlaceHolder Object
Implicit Sequences
— placeHolder IX(nX)
Often we need ranges following a sequential pattern
Constant ranges
[1, 1, 1, 1, ...]
Incrementing ranges
[0, 1, 2, 3, ...]
How is Panini different from ArrayFire? Static resolution.
The approach used in ArrayFire makes the library easier to develop, but it carries a performance penalty.
Panini is still at a very early stage.
Navier-Stokes example: how easy it is to write scientific code using Panini data structures
Initial Condition
vectET<double> coordX(nX, 0.0);
vectET<double> coordY(nY, 0.0);
gridFlow2D<FLOW_FIELD_2D> myGridN(nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridO(nX, nY, 1, 1);
gridFlow2D<FLOW_FIELD_2D> myGridM(nX, nY, 1, 1);
gridFlow2D<1> pressureM(nX, nY, 1, 1);
gridFlow2D<1> pressureN(nX, nY, 1, 1);
gridFlow2D<1> pressure(nX, nY, 1, 1);
gridFlow2D<1> potentialO(nX, nY, 1, 1);
myGridN(iX,iY).value(UX) = -2.0*M_PI*kY*phi*cos(coordX[iX]*kX)*sin(coordY[iY]*kY);
myGridN(iX,iY).value(UY) =  2.0*M_PI*kX*phi*sin(coordX[iX]*kX)*cos(coordY[iY]*kY);
pressure(iX,iY) = -M_PI*M_PI*phi*phi*(kY*kY*cos(2.0*coordX[iX]*kX)
                                    + kX*kX*cos(2.0*coordY[iY]*kY));
pressureM(iX,iY) -= 0.5*(myGridN(iX,iY).value(UX)*myGridN(iX,iY).value(UX)
                       + myGridN(iX,iY).value(UY)*myGridN(iX,iY).value(UY));
Laplacian Equation
template<int N>
void getLaplacian(gridFlow2D<N> gridVar, gridFlow2D<N> &gridLap,
                  double c3, double c4, int iX, int iY)
{
    gridLap(iX,iY) = gridVar(iX,iY)
                   + c4*(gridVar(iX,iY+1) - 2.0*gridVar(iX,iY) + gridVar(iX,iY-1))
                   + c3*(gridVar(iX+1,iY) - 2.0*gridVar(iX,iY) + gridVar(iX-1,iY));
}
Serial Code – CPU Timing

No. of Grid Points    Time for 100 iterations (sec)    Time for 200 iterations (sec)
1.00E+04              0.00126063                       0.00126172
4.00E+04              0.00625661                       0.00624585
1.60E+05              0.0439781                        0.044118
2.50E+05              0.0781446                        0.0785149
CPU Timing – MPI Version
No. of Processors 100 iterations 200 iterations
1 0.0743643 0.0755658
2 0.0580707 0.0579703
4 0.054078 0.0507001
5 0.0447405 0.0420167
10 0.0382128 0.0365341
16 0.0379704 0.0372657
20 0.0367649 0.0390902
25 0.0472415 0.0589682
30 0.0645379 0.0627601
MPI Version of Panini Code
CPU Timing – MPI Version
No. of Processors 100 iterations 200 iterations
1 0.00741906 0.00716424
2 0.00583018 0.00561896
4 0.00725028 0.0067555
5 0.00634362 0.0071692
10 0.00856 0.010103
16 0.0102652 0.0102073
20 0.0114238 0.0103288
25 0.0104092 0.0103344
30 0.011299 0.0118635
CPU Timing vs GPU Timing

No. of Grid Points   CPU Time (sec, 100 iterations)   GPU Time (sec, 100 iterations)   Speedup
100 x 100            0.001260                         0.000441                         2.72x
200 x 200            0.006256                         0.001279                         4.89x
400 x 400            0.043978                         0.004311                         10.09x
Curiously recurring template pattern
namespace Panini {

template <typename dataType, class input>
class baseET {
public:
    typedef const input& inputRef;

    // Return a reference to the derived object
    inline operator inputRef() const {
        return *static_cast<const input*>(this);
    }

    inputRef getInputRef() const {
        return static_cast<inputRef>(*this);
    }

    // Every derived class must provide a member value(i)
    __device__ dataType value(const int i) const {
        return static_cast<inputRef>(*this).value(i);
    }
};

} // namespace Panini
This is the core of the vector design.
The basic idea is that every input class derives from this templated base class, with the derived class itself supplied as the template argument (the Curiously Recurring Template Pattern).