
Sparse Matrix Dense Vector Multiplication

by Pedro A. Escallon

Parallel Processing Class

Florida Institute of Technology

April 2002

The Problem

• Improve the speed of sparse matrix-dense vector multiplication using MPI on a Beowulf parallel computer.

What To Improve

• Current algorithms use excessive indirect addressing

• Current optimizations depend on the structure of the matrix (distribution of the nonzero elements)

Sparse Matrix Representations

• Coordinate format

• Compressed Sparse Row (CSR)

• Compressed Sparse Column (CSC)

• Modified Sparse Row (MSR)

Compressed Sparse Row (CSR)

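    0    A01  A02  0
    0    A11  0    A13
    A20  0    0    0

rS  = 0 2 4 5                (position in val where each row starts; the last entry is the number of nonzeros)
ndx = 1 2 1 3 0              (column index of each stored value)
val = A01 A02 A11 A13 A20    (nonzero values, row by row)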
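The three arrays rS, ndx and val can be produced from a dense matrix in a single pass. The following is a minimal sketch of that construction; the function name dense_to_csr and the row-major dense layout are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): convert a dense, row-major
 * m x n matrix into the CSR arrays rS, ndx and val.
 * rS must hold m+1 entries; ndx and val must hold one entry per nonzero.
 * Returns the number of nonzeros stored. */
int dense_to_csr(int m, int n, const double *dense,
                 int *rS, int *ndx, double *val)
{
    int i, j, nnz = 0;
    assert(m >= 0 && n >= 0);
    for (i = 0; i < m; i++) {
        rS[i] = nnz;                      /* first stored entry of row i */
        for (j = 0; j < n; j++) {
            double a = dense[(size_t)i * n + j];
            if (a != 0.0) {
                ndx[nnz] = j;             /* column of this nonzero */
                val[nnz] = a;
                nnz++;
            }
        }
    }
    rS[m] = nnz;                          /* one past the last stored entry */
    return nnz;
}
```

Calling it on the 3 x 4 example, with numbers substituted for the symbolic entries, stores five values and fills rS with four row boundaries.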

CSR Code

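void sparseMul(int m, double *val, int *ndx, int *rS, double *x, double *y)
{
    int i, j;
    /* y must be zeroed by the caller; products are accumulated into it */
    for (i = 0; i < m; i++) {
        /* entries rS[i] .. rS[i+1]-1 belong to row i */
        for (j = rS[i]; j < rS[i+1]; j++) {
            y[i] += (*val++) * x[*ndx++];   /* x is addressed indirectly through ndx */
        }
    }
}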
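As a usage sketch, the CSR product can be exercised on the 3 x 4 example matrix. The version below uses explicit indexing rather than post-incremented pointers and zeroes y itself. The names csr_matvec and csr_example, and the numeric values standing in for A01 ... A20 and for x, are made up for illustration.

```c
#include <assert.h>

/* CSR product y = A*x with explicit indexing; y is zeroed here. */
void csr_matvec(int m, const double *val, const int *ndx, const int *rS,
                const double *x, double *y)
{
    int i, j;
    assert(m >= 0);
    for (i = 0; i < m; i++) {
        y[i] = 0.0;
        for (j = rS[i]; j < rS[i + 1]; j++)
            y[i] += val[j] * x[ndx[j]];   /* indirect access into x */
    }
}

/* Example data: made-up numbers standing in for A01, A02, A11, A13, A20. */
void csr_example(double y[3])
{
    static const int    rS[4]  = {0, 2, 4, 5};
    static const int    ndx[5] = {1, 2, 1, 3, 0};
    static const double val[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
    static const double x[4]   = {1.0, 10.0, 100.0, 1000.0};
    csr_matvec(3, val, ndx, rS, x, y);
}
```

Every inner-loop iteration reads ndx[j] before it can read x, which is the indirect addressing the proposal aims to reduce.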

Goals

• Eliminate indirect addressing

• Remove the dependency on the distribution of the nonzero elements

• Further compress the matrix storage

• Most of all, to speed up the operation

Proposed Solution

A =
    0    A01  A02  0
    0    A11  0    A13
    A20  0    0    0

{0,0} {1,A01} {2,A02} {-1,0} {1,A11} {3,A13} {-2,A20}
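Reading the example, each row appears to begin with an entry whose rCol is the negated row number and whose val is the column-0 value (stored even when it is zero), followed by one entry per remaining nonzero with its positive column index. The sketch below encodes a dense matrix under that reading. The encoding rule is inferred from the example rather than stated on the slides, and the name dense_to_stream is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Hypothetical encoder (inferred from the slide's example, not given there).
 * dense is row-major, m x n. out must have room for m row markers plus one
 * entry per nonzero outside column 0. Returns the number of entries written. */
int dense_to_stream(int m, int n, const double *dense, dSparS_t *out)
{
    int i, j, k = 0;
    assert(m >= 0 && n >= 1);
    for (i = 0; i < m; i++) {
        /* Row marker: rCol = -i, val = column-0 value (kept even if zero). */
        out[k].rCol = -i;
        out[k].val  = dense[(size_t)i * n];
        k++;
        for (j = 1; j < n; j++) {
            double a = dense[(size_t)i * n + j];
            if (a != 0.0) {
                out[k].rCol = j;          /* positive rCol: column index */
                out[k].val  = a;
                k++;
            }
        }
    }
    return k;
}
```

For the 3 x 4 example this yields seven entries, matching the count on the slide. No separate row-start array is needed because the row boundaries travel inside the stream.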

Data Structure

typedef struct {
    int rCol;
    double val;
} dSparS_t;

{rCol,val}

Process

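[Figure: the array A of hdr.size elements is divided into one block of local_size elements per process, followed by a residual block]

local_size = hdr.size / p

residual = hdr.size % p

residual < p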
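The block sizes follow directly from integer division. The sketch below computes them and the starting offset of each process's block; the type partition_t and the functions make_partition and chunk_offset are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical helper types and functions (not from the slides). */
typedef struct { int local_size; int residual; } partition_t;

/* Split `size` elements evenly over p processes. */
partition_t make_partition(int size, int p)
{
    partition_t part;
    assert(p > 0 && size >= 0);
    part.local_size = size / p;   /* elements assigned to every process */
    part.residual   = size % p;   /* leftover elements, always fewer than p */
    return part;
}

/* Offset in A of the first element of the block owned by `rank`. */
int chunk_offset(partition_t part, int rank)
{
    assert(rank >= 0);
    return rank * part.local_size;
}
```

For instance, 23 elements over p = 4 processes gives local_size = 5 and residual = 3, with the residual starting at offset 20.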

Scatter

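[Figure: A is split into blocks of local_size elements; each block is scattered to one of the processes 0, 1, 2, …, p, where it is stored as local_A]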

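Multiplication Code

/* First element: a positive rCol is a column index (the block starts in the
   middle of a row); otherwise the element starts a row and multiplies X[0]. */
if ((index = local_A[0].rCol) > 0)
    local_Y[0].val = local_A[0].val * X[index];
else
    local_Y[0].val = local_A[0].val * X[0];
local_Y[0].rCol = -1;
k = 1; h = 0;
while (k < local_size) {
    /* Accumulate the rest of the current row; k is checked before local_A[k] is read. */
    while ((k < local_size) && (0 < (index = local_A[k].rCol)))
        local_Y[h].val += local_A[k++].val * X[index];
    if (k < local_size) {
        /* A new row starts at k: record the index of the row just finished. */
        local_Y[h++].rCol = -index - 1;
        local_Y[h].val = local_A[k++].val * X[0];
    }
}
/* The last local row is one past the previous one (requires h > 0). */
local_Y[h].rCol = local_Y[h - 1].rCol + 1;
h++;
while (h < stride) local_Y[h++].rCol = -1;   /* mark unused entries */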
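To make the loop easier to experiment with, it can be wrapped in a function and run on a small block. The sketch below follows the slide's logic with three changes that are assumptions of this sketch rather than the author's code: the bounds test precedes the read of local_A[k], the final row-index assignment is split into two statements, and that assignment is skipped when the block contains no row start after its first element. The function name local_multiply and its return value are hypothetical.

```c
#include <assert.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Multiply one block of the stream by X. local_Y must hold `stride`
 * entries, which the caller sizes to fit every row touched by the block.
 * Returns the number of local_Y entries that carry a partial result. */
int local_multiply(const dSparS_t *local_A, int local_size,
                   const double *X, dSparS_t *local_Y, int stride)
{
    int index, used, k = 1, h = 0;
    assert(local_size >= 1 && stride >= 1);

    /* First element: column index if positive, otherwise a row start. */
    index = local_A[0].rCol;
    local_Y[0].val  = local_A[0].val * X[index > 0 ? index : 0];
    local_Y[0].rCol = -1;

    while (k < local_size) {
        /* Remaining elements of the current row. */
        while (k < local_size && (index = local_A[k].rCol) > 0) {
            local_Y[h].val += local_A[k].val * X[index];
            k++;
        }
        if (k < local_size) {
            local_Y[h].rCol = -index - 1;     /* row that just finished */
            h++;
            local_Y[h].val = local_A[k].val * X[0];
            k++;
        }
    }
    if (h > 0)                                /* guard added in this sketch */
        local_Y[h].rCol = local_Y[h - 1].rCol + 1;
    used = ++h;
    while (h < stride)
        local_Y[h++].rCol = -1;               /* mark unused entries */
    return used;
}
```

Run on the full seven-element example stream, this produces one {row, sum} pair per matrix row. Run on a block that starts mid-row, the first pair holds only a partial sum for that row, which is why split rows have to be consolidated after the gather.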

Multiplication

[Figure: local_A (local_size elements) * X (domain) = local_Y (range, stride elements)]

Algorithm

local_A      X        Y.val             Y.rCol
{r0,v0}_0    X[0]     = X[0]*v00        -
{c1,v1}_0    X[c01]   += X[c01]*v01     -
{c2,v2}_0    X[c02]   += X[c02]*v02     -r1-1
..           ..
{r1,v0}_1    X[0]     = X[0]*v10        -
{c1,v1}_1    X[c11]   += X[c11]*v11     -

Gather

[Figure: the local_Y blocks of processes 0, 1, 2, …, p and the residual are gathered into gatherBuffer; labels: split element, stride, range]

Consolidation of Split Rows

[Figure: the entries of gatherBuffer and the residual are accumulated (+=) into Y; labels: residual, Y, nCols, gatherBuffer]
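One way to picture the consolidation is as a scatter-add over the gathered pairs: every entry with a non-negative rCol adds its val to the matching element of Y, so the partial sums of a row split across two processes end up combined. The sketch below assumes the gather buffer holds p blocks of stride entries with unused entries marked by rCol = -1. It does not model the residual elements, and the name consolidate and its parameters are hypothetical.

```c
#include <assert.h>

typedef struct { int rCol; double val; } dSparS_t;

/* Hypothetical consolidation step (a sketch, not the author's code).
 * gatherBuffer holds p blocks of `stride` {row, partial sum} pairs;
 * Y has nRows elements and is zeroed here before accumulation. */
void consolidate(const dSparS_t *gatherBuffer, int p, int stride,
                 double *Y, int nRows)
{
    int i, r;
    for (r = 0; r < nRows; r++)
        Y[r] = 0.0;
    for (i = 0; i < p * stride; i++) {
        r = gatherBuffer[i].rCol;
        if (r < 0)
            continue;                 /* unused (padding) entry */
        assert(r < nRows);
        Y[r] += gatherBuffer[i].val;  /* split rows are summed here */
    }
}
```

In the test data used with this sketch, row 1 is split between two processes (partial sums 30 and 4000), and the consolidated Y[1] is their sum.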

Results (vavasis3)

vavasis3.rua - Total non-zero values: 1,683,902 - p = 10

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.103930         2.380285       0.096051      0.012123
P1    0.107588         0.457140       0.012000      0.011504
P2    0.107667         0.706087       0.012022      0.011642
P3    0.103155         0.951814       0.011971      0.011560
P4    0.107644         1.206376       0.012210      0.011536
P5    0.109243         1.452563       0.012032      0.011506
P6    0.108477         1.702571       0.012044      0.011506
P7    0.109446         1.948481       0.012004      0.011658
P8    0.055822         2.208924       0.012079      0.011540
P9    0.059023         2.459900       0.012009      0.011438

vavasis3.rua - Total non-zero values: 1,683,902 - p = 8

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.089478         2.264316       0.121741      0.014860
P1    0.093083         0.569091       1.711789      0.014105
P2    0.093217         0.866460       1.429352      0.014227
P3    0.091012         1.160591       1.146954      0.014457
P4    0.081719         1.462335       0.865520      0.014365
P5    0.085375         1.756941       0.582353      0.014341
P6    0.085418         2.055651       0.299847      0.014362
P7    0.089087         2.350998       0.017813      0.014728

vavasis3.rua - Total non-zero values: 1,683,902 - p = 1

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.000002         1.412774       0.033015      0.112132

vavasis3.rua - Total non-zero values: 1,683,902 - p = 4

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.051980         3.026846       0.217574      0.028587
P1    0.055605         1.725272       1.027928      0.028258
P2    0.055703         2.319343       0.451021      0.028141
P3    0.056422         3.212518       0.018073      0.027988

vavasis3.rua - Total non-zero values: 1,683,902 - p = 2

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.233200         5.810814       0.426097      0.056334
P1    0.236864         6.521328       0.032125      0.055866

Results (vavasis3)

vavasis3.rua - Calculated Results

P     Computation   Speedup    E_p        Gather     C_p
1     0.112132      ---        ---        0.033015   1.294430
2     0.056334      1.990485   0.995243   0.426097   8.563763
4     0.028587      3.922482   0.980621   1.027928   36.957883
8     0.014860      7.545895   0.943237   1.711789   116.194415
10    0.012123      9.249526   0.924953   0.096051   8.923039
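The slides do not define the derived columns, but the numbers are consistent with the usual definitions of speedup (computation time on one process divided by computation time on p processes) and efficiency (speedup divided by p), and with C_p = 1 + gather time / computation time. The sketch below encodes those formulas. The formula for C_p is inferred from the tabulated values, and the function names are hypothetical.

```c
#include <assert.h>

/* Speedup: single-process computation time over p-process computation time. */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Efficiency E_p: speedup divided by the number of processes. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}

/* C_p as inferred from the table: (computation + gather) / computation. */
double comm_ratio(double t_comp, double t_gather)
{
    assert(t_comp > 0.0);
    return 1.0 + t_gather / t_comp;
}
```

With the p = 2 row of the vavasis3 table (computation 0.056334, gather 0.426097) and the p = 1 computation time 0.112132, these functions reproduce the tabulated 1.990485, 0.995243 and 8.563763 to the printed precision.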

Results (bayer02)

bayer02.rua - Total non-zero values: 63,679 - p = 10

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.046136         0.093143       0.011733      0.000926
P1    0.048824         0.018207       0.001567      0.000423
P2    0.048627         0.027146       0.002054      0.000456
P3    0.044416         0.034386       0.002440      0.000445
P4    0.048214         0.046365       0.002457      0.000397
P5    0.048481         0.053511       0.001978      0.000425
P6    0.045666         0.063204       0.002015      0.000467
P7    0.048173         0.070167       0.002440      0.000419
P8    0.033947         0.088532       0.002323      0.000395
P9    0.032110         0.097866       0.001959      0.000479

bayer02.rua - Total non-zero values: 63,679 - p = 8

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.040159         0.103422       0.011810      0.001020
P1    0.042743         0.023353       0.001728      0.000549
P2    0.042709         0.035670       0.001777      0.000607
P3    0.039322         0.047141       0.001738      0.000599
P4    0.041584         0.064024       0.001724      0.000702
P5    0.039229         0.075528       0.001725      0.000568
P6    0.037206         0.089757       0.001733      0.000565
P7    0.039912         0.101267       0.002111      0.000541

bayer02.rua - Total non-zero values: 63,679 - p = 1

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.000003         0.063824       0.010975      0.006090

bayer02.rua - Total non-zero values: 63,679 - p = 4

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.049680         0.096930       0.018308      0.001888
P1    0.052379         0.048924       0.003765      0.001555
P2    0.051944         0.076405       0.003609      0.001561
P3    0.046413         0.101871       0.003636      0.001528

bayer02.rua - Total non-zero values: 63,679 - p = 2

      Broadcast Time   Scatter Time   Gather Time   Computation Time
P0    0.025494         0.520611       0.008192      0.003445
P1    0.028157         0.504081       0.032848      0.003121

Results (bayer02)

bayer02.rua - Calculated Results

P     Computation   Speedup    E_p        Gather     C_p
1     0.006090      ---        ---        0.010975   2.802135
2     0.003445      1.767779   0.883890   0.032848   10.534978
4     0.001888      3.225636   0.806409   0.018308   10.697034
8     0.001020      5.970588   0.746324   0.011810   12.578431
10    0.000926      6.576674   0.657667   0.011733   13.670626

Conclusions

• The proposed representation speeds up the matrix calculation

• The handling of the data mismatch before the gather should be improved

• There seems to be a communication penalty for moving structured data

