Chained Matrix Multiplication


Post on 29-Apr-2015


TRANSCRIPT

Page 1: Chained Matrix Multiplication

In The Name Of God

Page 2: Chained Matrix Multiplication

Chained Matrix Multiplication

Alireza Nikseresht

Fall 2004

Page 3: Chained Matrix Multiplication

Multiplying unequal matrices

Suppose we want to multiply two matrices that do not have the same number of rows and columns. We can multiply two matrices A1 and A2 only if the number of columns of A1 is equal to the number of rows of A2.

Example: we want to multiply a 2 X 3 matrix by a 3 X 4 matrix. The result has 4 terms in the top row and 4 in the bottom; each term is the result of 3 multiplications, so the total number of multiplications is 2*3*4 = 24.

Generalizing: if we want to multiply an N X M matrix by an M X P matrix, it will take N*M*P multiplications.
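This generalization is easy to sanity-check in code. The sketch below (mine, not from the slides) multiplies an N X M matrix of ones by an M X P matrix of ones and counts the scalar multiplications actually performed:

```cpp
#include <cassert>
#include <vector>

// Multiply an N x M matrix by an M x P matrix and count the scalar
// multiplications performed; the count should come out to N*M*P.
long multCount(int N, int M, int P) {
    std::vector<std::vector<int>> A(N, std::vector<int>(M, 1));
    std::vector<std::vector<int>> B(M, std::vector<int>(P, 1));
    std::vector<std::vector<int>> C(N, std::vector<int>(P, 0));
    long count = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++)
            for (int k = 0; k < M; k++) {
                C[i][j] += A[i][k] * B[k][j];   // one scalar multiplication
                count++;
            }
    return count;
}
```

For the example above, multCount(2, 3, 4) performs 24 multiplications, i.e. 2*3*4.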

Page 4: Chained Matrix Multiplication

Chained Matrix Multiplication

We are given a sequence (chain) A1, A2, …, An of n matrices, and we wish to find the product.

The way we parenthesize a chain of matrices can have a dramatic impact on the cost of evaluating the product.

The problem is to determine the best way to parenthesize the matrices to minimize the number of multiplications.

Page 5: Chained Matrix Multiplication

Example:
A1 5 X 3
A2 3 X 4
A3 4 X 6
A4 6 X 5

The problem: what is the best order to multiply them?
(A1((A2A3)A4)) takes 237 multiplications
(A1(A2(A3A4))) takes 255 multiplications
((A1A2)(A3A4)) takes 280 multiplications
(((A1A2)A3)A4) takes 330 multiplications
((A1(A2A3))A4) takes 312 multiplications
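These counts can be verified mechanically. The following sketch (not from the slides; the name chainCost is mine) recursively tries every split point and returns the cheapest cost, where d is the dimension vector and Ai is d[i-1] X d[i]:

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Minimum multiplications for the chain Ai..Aj, where Ak is d[k-1] x d[k]
// (matrices 1-indexed). Plain recursion over every split point k;
// exponential time, but fine for tiny chains.
int chainCost(const std::vector<int>& d, int i, int j) {
    if (i == j) return 0;                     // single matrix: no work
    int best = INT_MAX;
    for (int k = i; k < j; k++) {
        int c = chainCost(d, i, k) + chainCost(d, k + 1, j)
              + d[i - 1] * d[k] * d[j];       // cost of the final product
        if (c < best) best = c;
    }
    return best;
}
```

For d = {5, 3, 4, 6, 5}, chainCost(d, 1, 4) returns 237, matching the best order above.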

Page 6: Chained Matrix Multiplication

How to parenthesize the matrices

In the case of four matrices, there are only five ways to order the multiplications.

But with n matrices, the number of ways to parenthesize them grows exponentially (on the order of 4^n / n^(3/2)), so we do not want to look at all the possibilities.

Dividing the problem into subproblems: we use the principle of optimality, which is said to apply if an optimal solution to an instance of a problem always contains optimal solutions to all subinstances.

If A1((((A2A3)A4)A5)A6) is the optimal order, then we know that (A2A3)A4 is the optimal order for A2A3A4.

Page 7: Chained Matrix Multiplication

The matrix-chain problem

Suppose we have the matrix chain A1 .. An. We divide this into subproblems A1..Ak and Ak+1..An.

The problem is that we do not know what k should be. We find k by looking at the optimal solutions of each of the subproblems. This means looking at all the values for k.

Page 8: Chained Matrix Multiplication

Data Structure

The 2-D array called N:

1. N[i][j] will hold the number of multiplications to multiply from Ai to Aj.

2. N[i][i] is of course zero, since it is a chain of length one.

3. N[i][j] = min{ N[i][k] + N[k+1][j] + d[i-1]*d[k]*d[j] } where i <= k < j

Page 9: Chained Matrix Multiplication

Another Example

A1 30 X 35
A2 35 X 15
A3 15 X 5
A4 5 X 10
A5 10 X 20
A6 20 X 25

Page 10: Chained Matrix Multiplication

N[i][j] = min{ N[i][k] + N[k+1][j] + d[i-1]*d[k]*d[j] } where i <= k < j

     1    2        3       4       5        6
1    0    15,750   7,875   9,375   11,875   15,125
2         0        2,625   4,375    7,125   10,500
3                  0         750    2,500    5,375
4                          0        1,000    3,500
5                                   0        5,000
6                                            0
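The table can be reproduced directly from the recurrence. The sketch below (mine, not from the slides; the name chainTable is an assumption) fills the table diagonal by diagonal for d = {30, 35, 15, 5, 10, 20, 25}:

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Fill N[i][j] diagonal by diagonal for a chain of n matrices,
// where Ai is d[i-1] x d[i] (d is 0-indexed, matrices are 1-indexed).
std::vector<std::vector<int>> chainTable(const std::vector<int>& d) {
    int n = (int)d.size() - 1;
    std::vector<std::vector<int>> N(n + 1, std::vector<int>(n + 1, 0));
    for (int dia = 1; dia <= n - 1; dia++)          // chain length - 1
        for (int i = 1; i <= n - dia; i++) {
            int j = i + dia;
            N[i][j] = INT_MAX;
            for (int k = i; k < j; k++) {           // try every split point
                int c = N[i][k] + N[k + 1][j] + d[i - 1] * d[k] * d[j];
                if (c < N[i][j]) N[i][j] = c;
            }
        }
    return N;
}
```

For these dimensions, chainTable(...)[1][6] is 15,125, the minimum for the whole chain, and the other entries match the table above.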

Page 11: Chained Matrix Multiplication

Sequential Code

int minMult(int n, int[] d, index[][] P)
{
    index i, j, k, dia;
    int[][] M = new int[1..n][1..n];
    for (i = 1; i <= n; i++)
        M[i][i] = 0;                                       // initialize the diagonal
    for (dia = 1; dia <= n-1; dia++)
        for (i = 1; i <= n-dia; i++)
        {
            j = i + dia;
            M[i][j] = min { M[i][k] + M[k+1][j] + d[i-1]*d[k]*d[j] };   // i <= k <= j-1
        }
    return M[1][n];
}
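The pseudocode accepts a split-point table P but never fills it. A hedged sketch of that missing piece (the C++ names minMult, M, P, and printOrder here are mine, not from the slides): record the best split point k for each subchain, then rebuild the optimal parenthesization recursively:

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <vector>

// Same DP as the pseudocode, but also record in P[i][j] the split point k
// that achieves the minimum cost M[i][j] for the subchain Ai..Aj.
void minMult(const std::vector<int>& d,
             std::vector<std::vector<int>>& M,
             std::vector<std::vector<int>>& P) {
    int n = (int)d.size() - 1;
    M.assign(n + 1, std::vector<int>(n + 1, 0));
    P.assign(n + 1, std::vector<int>(n + 1, 0));
    for (int dia = 1; dia <= n - 1; dia++)
        for (int i = 1; i <= n - dia; i++) {
            int j = i + dia;
            M[i][j] = INT_MAX;
            for (int k = i; k < j; k++) {
                int c = M[i][k] + M[k + 1][j] + d[i - 1] * d[k] * d[j];
                if (c < M[i][j]) { M[i][j] = c; P[i][j] = k; }
            }
        }
}

// Rebuild the optimal parenthesization from the split-point table.
std::string printOrder(const std::vector<std::vector<int>>& P, int i, int j) {
    if (i == j) return "A" + std::to_string(i);
    int k = P[i][j];
    return "(" + printOrder(P, i, k) + printOrder(P, k + 1, j) + ")";
}
```

For d = {5, 3, 4, 6, 5} this yields cost 237 and the order (A1((A2A3)A4)), matching the example on page 5.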

Page 12: Chained Matrix Multiplication

Now

How to Parallelize This Program?

Page 13: Chained Matrix Multiplication

     1    2        3       4       5        6
1    0    15,750   7,875   9,375   11,875   15,125
2         0        2,625   4,375    7,125   10,500
3                  0         750    2,500    5,375
4                          0        1,000    3,500
5                                   0        5,000
6                                            0

N[i][j] = min{ N[i][k] + N[k+1][j] + d[i-1]*d[k]*d[j] } where i <= k < j

Page 14: Chained Matrix Multiplication

To calculate diagonal 1 we need no data. To calculate diagonal 2 we need the diagonal 1 elements, and so on. So we implement each diagonal calculation in one step, or one processor:

Step (or processor) 2 needs data from step 1. Step (or processor) 3 needs data from steps 1 and 2. …

Page 15: Chained Matrix Multiplication

Pipeline Design

We know that a pipeline approach can provide increased speed under the following 3 types of computation:

1. If more than one instance of the complete problem is to be executed.

2. If a series of data items must be processed, each requiring multiple operations.

3. If information to start the next process can be passed forward before the process has completed all its internal operations.

Page 16: Chained Matrix Multiplication

If we look at the previous table, we can see that step 2 can start after step 1 has calculated its first 2 elements. In this order, each step can start calculating after the previous step has generated its first 2 elements.

[Figure: pipeline of processes P1 → P2 → P3 → …; P2 receives (n-1) messages, P3 receives (n-1) + (n-2) messages, and so on.]
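These message counts follow from the dependency structure: process p consumes every entry of diagonals 1 through p-1, and diagonal q has n-q entries. A small sketch (mine, not from the slides; the name messagesTo is an assumption):

```cpp
#include <cassert>

// Number of table entries process p must receive in the pipeline:
// all entries of diagonals 1..p-1, where diagonal q has n - q entries.
long messagesTo(int n, int p) {
    long total = 0;
    for (int q = 1; q < p; q++)
        total += n - q;
    return total;
}
```

For n = 6, process 2 receives n-1 = 5 entries and process 3 receives (n-1) + (n-2) = 9.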

Page 17: Chained Matrix Multiplication

Centralized Work Pool

In this implementation we divide the problem into n-1 steps (when we have n matrices).

In each step we calculate the elements of that step's diagonal; that is, we calculate 1 diagonal in each step.

We have one server and n (n > 0) clients. All clients and the server know the step number. Clients request a job from the server, and the server sends a job to that client.

Page 18: Chained Matrix Multiplication

[Figure: centralized work pool. The server is connected to Client 1, Client 2, …, Client n. Each client sends (1) a job request; the server replies with (2) a step number (or termination) and the job data; the client returns the result.]

Page 19: Chained Matrix Multiplication

Dreams

Suppose that MPI were event driven! What would happen?

We could implement our program very simply and efficiently. The size of the message passing would be very low, because we could use on-demand requests; that is, we could request data from any processor whenever we need it.

Page 20: Chained Matrix Multiplication

If we go back and look at the pipeline implementation of chained matrix multiplication, we can see that the number of messages that pass between processes is very high, and some of them are not necessary.

Now we suppose that MPI is event driven and write the pipeline program.

In this implementation each process calculates a diagonal (process p calculates diagonal p).

Page 21: Chained Matrix Multiplication

void Main()
{
    // Process number P, which should calculate diagonal P
    for (i = 1; i <= n-P; i++)
    {
        j = P + i;
        N[i][j] = min( GetNij(i, k) + GetNij(k+1, j) +
                       d[i-1]*d[k]*d[j] );            // i <= k < j
    }
}

Page 22: Chained Matrix Multiplication

int GetNij(int i, int j)
{
    if ( this process does not have N[i][j] )
    {
        To = j - i;              // the process that owns diagonal j-i
        send(request, i, j, To);
        recv(data);
        N[i][j] = data;
    }
    return N[i][j];
}

Page 23: Chained Matrix Multiplication

bool MPIEVENTS(MPIEventType Event)
{
    bool Handled;
    switch ( Event )
    {
        case recv:
            MPI_Recv( message, From );
            if ( request for data )
                MPI_Send( data, From );
            Handled = true;
            break;
        default:
            Handled = false;
            break;
    }
    return Handled;
}

Page 24: Chained Matrix Multiplication

Now we write code for real MPI. We want to implement and write code for the centralized work pool.

In this implementation we have 4 functions:
1. void Server();
2. void Client();
3. calculate( i, j, Value[] );
4. map( i, j, im, jm );

Page 25: Chained Matrix Multiplication

#include "stdafx.h"
#include "iostream.h"
#include "stdlib.h"
#include "stdio.h"
#include <mpi.h>

#define n 6  //4
#define request 0
#define value 1
#define CONTINUE 1
#define STOP 0
#define infinit 99999

void Server(MPI_Comm comm, int processors);
void Client(int my_rank, MPI_Comm comm);
int Calculate(int I, int J, int Nv[n*2]);
int map(int i, int j, int I, int J, int Nv[n*2]);
void fill_Nv(int x, int y, int *Nv);

int N[n+1][n+1] = {0};
int d[n+1] = {30,35,15,5,10,20,25};  //{5,3,4,6,5};  //{5,2,3,4,6,7,8};

Page 26: Chained Matrix Multiplication

int main(int argc, char* argv[])
{
    int my_rank;
    int processors;
    MPI_Comm io_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &processors);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_dup(MPI_COMM_WORLD, &io_comm);

    if (my_rank == 0)
        if (processors < 2)
        {
            cout << "the number of processes should be greater than 1" << endl;
            MPI_Finalize();
            exit(0);
        }

    MPI_Bcast(d, n+1, MPI_INT, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        Server(MPI_COMM_WORLD, processors);
    else
        Client(my_rank, MPI_COMM_WORLD);

    MPI_Finalize();

    return 0;
}

Page 27: Chained Matrix Multiplication

void Server(MPI_Comm comm, int processors)
{
    int Step = 1;
    int x[3];
    int To = 0;
    int Nv[n*2];
    int Count = 0;
    MPI_Status st;

    while ( Step < n )
    {
        for ( int i = 0; i < n-Step; i++ )
        {
            MPI_Recv(&x, 3, MPI_INT, MPI_ANY_SOURCE, request, comm, &st);
            To = x[0];                 // requesting client's rank
            x[0] = Step;
            x[1] = i + 1;              // I
            x[2] = Step + 1 + i;       // J
            fill_Nv(x[1], x[2], Nv);   // pack the table entries the client needs
            MPI_Send(&x, 3, MPI_INT, To, 0, comm);
            Count = (Step-1) * 2;
            MPI_Send(&Nv, Count, MPI_INT, To, 1, comm);
            MPI_Recv(&x, 3, MPI_INT, MPI_ANY_SOURCE, value, comm, &st);
            N[x[0]][x[1]] = x[2];      // store the computed N[I][J]
        }
        Step++;
    }

Page 28: Chained Matrix Multiplication

    for ( int i = 1; i < processors; i++ )
    {
        MPI_Recv(&x, 3, MPI_INT, MPI_ANY_SOURCE, request, comm, &st);
        To = x[0];
        x[0] = STOP;                   // tell this client to terminate
        MPI_Send(&x, 3, MPI_INT, To, 0, comm);
    }

    cout << " N is :" << endl << endl;
    for ( int i = 1; i <= n; i++ )
    {
        for ( int j = 1; j <= n; j++ )
            cout << N[i][j] << " ";
        cout << endl;
    }
    cout << endl << endl << "minimum multiplication count is : " << N[1][n];
}

Page 29: Chained Matrix Multiplication

void Client( int my_rank, MPI_Comm comm )
{
    int x[3];
    int I, J, Step, Count;
    int Val;
    int Nv[n*2] = {0};
    MPI_Status st;

    while ( true )
    {
        x[0] = my_rank;
        MPI_Send(&x, 3, MPI_INT, 0, request, comm);      // ask for a job
        MPI_Recv(&x, 3, MPI_INT, 0, 0, comm, &st);
        if ( x[0] == STOP ) break;                       // no more work
        Step = x[0];
        I = x[1];
        J = x[2];
        Count = (Step-1) * 2;
        MPI_Recv(&Nv, Count, MPI_INT, 0, 1, comm, &st);  // needed table entries
        Val = Calculate(I, J, Nv);
        x[0] = I;
        x[1] = J;
        x[2] = Val;
        MPI_Send(&x, 3, MPI_INT, 0, value, comm);        // return the result
    }
}

Page 30: Chained Matrix Multiplication

int Calculate( int I, int J, int Nv[n*2] )
{
    int minval = infinit;
    int val;
    int k = 0;
    for ( k = I; k < J; k++ )
    {
        val = map(I, k, I, J, Nv) + map(k+1, J, I, J, Nv) + d[I-1]*d[k]*d[J];
        if ( minval > val ) minval = val;
    }
    return minval;
}

Page 31: Chained Matrix Multiplication

int map( int i, int j, int I, int J, int Nv[n*2] )
{
    if ( i == j ) return 0;
    if ( j == J )
        // N[i][J] entries are packed after the J-I-1 row entries, so the
        // index must advance with i. (The expression (i-I)+(j-i-1) used on
        // the original slide is constant in i and reads the wrong slot
        // whenever J-I > 2.)
        return Nv[( J - I - 1 ) + ( i - I - 1 )];
    else
        return Nv[( J - j ) - 1 ];
}

Page 32: Chained Matrix Multiplication

void fill_Nv( int x, int y, int *Nv )
{
    int i = x + 1;
    int j = y - 1;
    int k = 0;
    while ( j > x )          // pack N[x][y-1], ..., N[x][x+1]
    {
        Nv[k] = N[x][j];
        k++;
        j--;
    }
    while ( i < y )          // then pack N[x+1][y], ..., N[y-1][y]
    {
        Nv[k] = N[i][y];
        k++;
        i++;
    }
}
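The packing order fill_Nv produces can be checked in isolation. This standalone sketch (mine, not from the slides; the name packNv is an assumption) reimplements the same packing with std::vector, which makes the layout easy to assert: the row entries N[x][y-1..x+1] come first, followed by the column entries N[x+1..y-1][y]:

```cpp
#include <cassert>
#include <vector>

// Standalone version of fill_Nv's packing for the subchain (x, y):
// first the row entries N[x][y-1], ..., N[x][x+1] (in that order),
// then the column entries N[x+1][y], ..., N[y-1][y].
std::vector<int> packNv(const std::vector<std::vector<int>>& N, int x, int y) {
    std::vector<int> Nv;
    for (int j = y - 1; j > x; j--) Nv.push_back(N[x][j]);   // row segment
    for (int i = x + 1; i < y; i++) Nv.push_back(N[i][y]);   // column segment
    return Nv;
}
```

Packing a table with distinct entries (e.g. N[i][j] = 10*i + j) and inspecting the result confirms the two-segment layout that map() must index into.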