an m-script api for parallel out-of-core ... - hpca.uji.es file{rubio,quintana}@icc.uji.es high...
TRANSCRIPT
An M-Script API forParallel Out-of-Core Linear Algebra Operations
Rafael Rubio Enrique S. Quintana-Ortı{rubio,quintana}@icc.uji.es
High Performance Computing & Architectures GroupUniversidad Jaime I de Castellon (Spain)
API for Parallel OOC Linear Algebra 1 PARA 2008
Motivation
Very large-scale dense linear algebra problems occur in practice!
Space geodesy (e.g., computation of the gravity field)
Electromagnetism (as arising from BEM)
. . .
→
Problems are too large to fit in-core and use of virtual memory isinefficient
API for Parallel OOC Linear Algebra 2 PARA 2008
Motivation
As a result, message-passing dense linear algebra libraries exist:
SOLAR
OOC-ScaLAPACK
POOCLAPACK
Problems?
Message-passing may not be the best solution for multicore(many-core?)
Users (scientists and engineers) like high-level environments asmatlab (or clones Scilab, Octave), Labview, etc.
API for Parallel OOC Linear Algebra 3 PARA 2008
Motivation
Our solution:
Use FLAME@lab, the M-script API to FLAME
Extend FLASH to OOC
Extract parallelism from multithreaded BLAS (for now)
API for Parallel OOC Linear Algebra 4 PARA 2008
Outline
1 Motivation
2 FLAME@lab
3 FLASH for OOC
4 Experimental results
5 Concluding remarks
API for Parallel OOC Linear Algebra 5 PARA 2008
FLAME@lab
Outline
1 Motivation
2 FLAME@lab
Cholesky factorization (Overview of FLAME)
3 FLASH for OOC
4 Experimental results
5 Concluding remarks
API for Parallel OOC Linear Algebra 6 PARA 2008
FLAME@lab
The Cholesky Factorization
Definition
Given A → n× n symmetric positive definite, compute
A = L · LT ,
with L → n× n lower triangular
API for Parallel OOC Linear Algebra 7 PARA 2008
FLAME@lab
The Cholesky Factorization: Whiteboard Presentation
done
done
done
A(partially
updated)
?
α11 ?
a21 A22
?
α11:=√α11
?
a21:=a21/α11
A22:=
A22−a21aT21
?
done
done
done
A(partially
updated)
API for Parallel OOC Linear Algebra 8 PARA 2008
FLAME@lab
FLAME Notation
done
done
done
A(partially
updated)
?
α11 aT12
a21 A22
Repartition„ATL ATR
ABL ABR
«
→
0BB@A00 a01 A02
aT10 α11 aT
12
A20 a21 A22
1CCAwhere α11 is a scalar
API for Parallel OOC Linear Algebra 9 PARA 2008
FLAME@lab
FLAME Notation
Algorithm: [A] := Chol unb(A)
Partition A→“
AT L AT R
ABL ABR
”where ATL is 0× 0
while n(ABR) 6= 0 do
Repartition„ATL ATR
ABL ABR
«→
0@ A00 a01 A02
aT10 α11 aT
12
A20 a21 A22
1Awhere α11 is a scalar
α11 :=√
α11
a21 := a21/α11
A22 := A22 − a21aT21
Continue with„ATL ATR
ABL ABR
«←
0@ A00 a01 A02
aT10 α11 aT
12
A20 a21 A22
1Aendwhile
API for Parallel OOC Linear Algebra 10 PARA 2008
FLAME@lab
FLAME@lab Code
From algorithm to code. . .
FLAME notationRepartition„
ATL ATR
ABL ABR
«→
0@ A00 a01 A02
aT10 α11 aT
12
A20 a21 A22
1Awhere α11 is a scalar
FLAME@lab code[ A00, a01, A02, ...
a10t, alpha11, a12t, ...
A20, a21, A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR, ...
ABL, ABR, ...
1, 1, ’FLA_BR’ );
API for Parallel OOC Linear Algebra 11 PARA 2008
FLAME@lab
(Unblocked) FLAME@lab Code
function [ A_out ] = FLA_Cholesky_unb_var1( A )
% ... FLA_Part_2x2( ); ...
while ( size( ABR, 2 ) ~= 0 )
[ A00, a01, A02, ...
a10t, alpha11, a12t, ...
A20, a21, A22 ]
= FLA_Repart_2x2_to_3x3( ATL, ATR, ...
ABL, ABR, ...
1, 1, ’FLA_BR’ );
%----------------------------------------------------%
alpha11 = sqrt( alpha11 );
a21 = a21 / alpha11;
A22 = A22 - tril(a21 * a21’);
%----------------------------------------------------%
% ... FLA_Cont_with_3x3_to_2x2( ); ...
end
A_out = ATL;
return
API for Parallel OOC Linear Algebra 12 PARA 2008
FLAME@lab
Blocked FLAME@lab Code
function [ A_out ] = FLA_Cholesky_blk_var1( A, nb_alg )
% ... FLA_Part_2x2( ); ...
while ( size( ATL, 2 ) ~= 0 )
b = min( size( ABR, 1 ), nb_alg );
[ A00, A01, A02, ...
A10, A11, A12, ...
A20, A21, A22 ]
= FLA_Repart_2x2_to_3x3( ATL, ATR, ...
ABL, ABR, ...
b, b, ’FLA_BR’ );
%----------------------------------------------------%
FLA_Cholesky_unb( A11 );
A21 = A21 * inv( tril( A11 ) )’;
A22 = A22 - tril( A21 * A21’ );
%----------------------------------------------------%
% ... FLA_Cont_with_3x3_to_2x2( ); ...
end
A_out = ATL;
return
API for Parallel OOC Linear Algebra 13 PARA 2008
FLAME@lab
FLAME Code
Visit http://www.cs.utexas.edu/users/flame/Spark/. . .
M-script code formatlab: FLAME@lab
C code: FLAME/C
Other APIs:
FLATEXFortran-77LabViewMessage-passingparallel:PLAPACKFLAG: GPUs
API for Parallel OOC Linear Algebra 14 PARA 2008
FLASH
Outline
1 Motivation
2 FLAME@lab
3 FLASH for OOC
Overview of FLASHTiled OOC algorithm
4 Experimental results
5 Concluding remarks
API for Parallel OOC Linear Algebra 15 PARA 2008
FLASH
FLASH
Storage-by-blocks:
In general, storage-by-blocks enhaces data locality and,therefore, performance. . .
at the cost of more complex programming
API for Parallel OOC Linear Algebra 16 PARA 2008
FLASH
FLASH
Extending FLAME for hierarchical storage:
Natural extension: elements of matrices can be themselvesmatrices (hierarchical/recursive structure)
FLAME code is independent of actual storage (details hiddenin the objects)
OOC data structures are just one level in this hierarchy, withdata stored on disk
API for Parallel OOC Linear Algebra 17 PARA 2008
FLASH
OOC API
Driver:
% Create matrix
A = FLAOOC_Obj_create( ’FLA_DOUBLE’, rows, cols, tile_size,
file_name );
% Fill-in: A( i:i+m-1, j:j+n-1 ) = alpha*B+A( i:i+m-1, j:j+n-1 );
% B is m x n
FLAOOC_Axpy_matrix_to_object( alpha, B, A, i, j );
% Cholesky factorization
A = FLAOOC_Cholesky_var2( A );
% Free matrix
FLAOOC_Obj_free( A );
OOC to in-core and vice-versa:FLAOOC OOC to INC
FLAOOC INC to OOC
API for Parallel OOC Linear Algebra 18 PARA 2008
FLASH
OOC API
function [ A_out ] = FLA_Cholesky_blk_var2( A, nb_alg )
% ... FLA_Part_2x2( ); ...
while ( size( ABR, 2 ) ~= 0 )
b = min( size( ABR, 1 ), nb_alg );
[ A00, A01, A02, ...
A10, A11, A12, ...
A20, A21, A22 ]
= FLA_Repart_2x2_to_3x3( ATL, ATR, ...
ABL, ABR, ...
b, b, ’FLA_BR’ );
%----------------------------------------------------%
A10 = A10 \ tril( A00 )’;
A11 = A11 - tril( A10 * A10’ );
A11 = FLA_Cholesky_unb( A11 );
%----------------------------------------------------%
% ... FLA_Cont_with_3x3_to_2x2( ); ...
end
A_out = ATL;
return
API for Parallel OOC Linear Algebra 19 PARA 2008
FLASH
OOC API
function [ A_out ] = FLAOOC_Cholesky_var2( A )
% ... FLAOOC_Part_2x2( ); ...
while ( FLAOOC_Obj_width( ABR ) ~= 0 )
% ... FLAOOC_Repart_2x2_to_3x3( );
%----------------------------------------------------%
A10 = FLAOOC_Trsm( % Mode arguments
1, A00,
A10 ); % A10 := A10 * TRIL(A00)^-T
AINC = FLAOOC_OOC_to_INC( A11 );% Copy A11 in-core
AINC = FLAOOC_Syrk( % Mode arguments
-1, A10,
1, AINC ); % A11 := A11 - A10 * A10^T
AINC = chol(AINC)’; % A11 := chol( A11 )
FLAOOC_INC_to_OOC( AINC, A11 ); % Store A11 out-of-core
%----------------------------------------------------%
% ... FLAOOC_Cont_with_3x3_to_2x2( ); ...
end
A_out = ATL;
return
API for Parallel OOC Linear Algebra 20 PARA 2008
FLASH
Parallelization on Multithreaded Architectures
Traditional approach
Link with multithreaded BLAS
Advanced approach (future)
Extract parallelism at higher level: SuperMatrix runtimesystem
API for Parallel OOC Linear Algebra 21 PARA 2008
FLASH
Outline
1 Motivation
2 FLAME@lab
3 FLASH
4 OOC API
5 Experimental results
6 Concluding remarks
API for Parallel OOC Linear Algebra 22 PARA 2008
Results
Experimental Results
General
Platform Specs.
matserv SMP with two Intel [email protected] GHzand 1 GB of RAM
cat SMP with four Intel [email protected] GHzand 4 GB of RAM
Implementations
Matlab/Octave routine chol linked to multithreaded MKL(7.0 or higher)
FLA Cholesky blk linked to multithreaded MKL
API for Parallel OOC Linear Algebra 23 PARA 2008
Results
Experimental Results: matserv
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
GF
LOP
s
Matrix dimension
Cholesky factorization on 1 processor
FLAOOC_Chol + MKLMATLAB chol + MKL
API for Parallel OOC Linear Algebra 24 PARA 2008
Results
Experimental Results: cat
0
1
2
3
4
5
6
0 10000 20000 30000 40000 50000 60000
GF
LOP
s
Matrix dimension
Cholesky factorization on 1 processor
FLAOOC_Chol + MKLOCTAVE chol + MKL
API for Parallel OOC Linear Algebra 25 PARA 2008
Results
Experimental Results: matserv
0
1
2
3
4
5
6
7
8
9
0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000
GF
LOP
s
Matrix dimension
Cholesky factorization on 2 processors
FLAOOC_Chol + MKLMATLAB chol + MKL
API for Parallel OOC Linear Algebra 26 PARA 2008
Results
Experimental Results: cat
0
5
10
15
20
0 10000 20000 30000 40000 50000 60000
GF
LOP
s
Matrix dimension
Cholesky factorization on 4 processors
FLAOOC_Chol + MKLOCTAVE Chol + MKL
API for Parallel OOC Linear Algebra 27 PARA 2008
Remarks
Concluding Remarks
FLAOOC@lab is an easy-to-use tool to develop OOC codes fordense linear algebra operations
The new API allows the solution of large linear systems fromMatlab/Octave
Focus is on programmability but performance will come with asimilar C API
API for Parallel OOC Linear Algebra 28 PARA 2008
Remarks
Thanks for your attention!
For more information. . .
Visit http://www.cs.utexas.edu/users/flame
Support. . .
National Science Foundation awards CCF-0702714 andCCF-0540926 (ongoing till 2010)
Spanish CICYT project TIN2005-09037-C02-02
API for Parallel OOC Linear Algebra 29 PARA 2008