matrix and graph matrix binary matrix sparse matrix operations for vectors/matrices graph and...

Matrix and Graph

• Matrix

• Binary Matrix

• Sparse Matrix

• Operations for Vectors/Matrices

• Graph and Adjacency Matrix

• Adjacency List

Matrix and Graph

• Matrix is a 2-dimensional structure

• Used in wide areas from physical simulations to customer management

• Graphs are also used in many areas, to represent the relations and flows between data

• Several data structures have been devised for handling matrices and graphs: updating, storing, searching, and operating on them

2-Dimensional Structure of Matrix

• An n×m matrix has n×m numbers

  can be stored in an array of size n×m

  [i,j] element corresponds to the (i*m+j)-th cell of the array

• This naïve design already works, but there is more to consider
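The index formula above can be sketched in C as follows. This is only a minimal illustration; the type FlatMatrix and the functions FLAT_init, FLAT_get, and FLAT_set are hypothetical names that do not appear in the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; the slides only give the formula i*m + j. */
typedef struct {
    int n, m;     /* number of rows and columns */
    int *cell;    /* n*m values, stored row after row */
} FlatMatrix;

/* Allocates a zero-filled matrix. Returns 0 on success, -1 on failure. */
int FLAT_init(FlatMatrix *M, int n, int m) {
    M->n = n;
    M->m = m;
    M->cell = calloc((size_t)n * (size_t)m, sizeof(int));
    return M->cell ? 0 : -1;
}

/* [i,j] element is the (i*m+j)-th cell of the array. */
int FLAT_get(const FlatMatrix *M, int i, int j) {
    assert(0 <= i && i < M->n && 0 <= j && j < M->m);
    return M->cell[i * M->m + j];
}

void FLAT_set(FlatMatrix *M, int i, int j, int v) {
    assert(0 <= i && i < M->n && 0 <= j && j < M->m);
    M->cell[i * M->m + j] = v;
}
```

With this layout, one call to free releases the whole matrix, and consecutive cells of a row are adjacent in memory.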

2-Dimensional Array

• There is a way to make a 2-dimensional array, instead of the usual 1-dimensional array

• Prepare an array of n pointers

• Prepare n arrays of size m, and write the address of the first cell of the i-th array to the i-th cell of the pointer array

• [i,j] element of matrix a is accessed by a[i][j] (in C)

　　

O(nm) memory space

Simple structure

Allocate a 2-Dimensional Array

#include <stdlib.h>

/* allocate an n×m matrix; returns NULL if memory is not available */
int **MATRIX_alloc ( int n, int m ){
  int i, **a;
  a = malloc ( sizeof(int *)*n );
  if ( a == NULL ) return (NULL);
  for ( i=0 ; i<n ; i++ ){
    a[i] = malloc ( sizeof(int)*m );
    if ( a[i] == NULL ){  /* release the rows allocated so far */
      while ( i > 0 ) free ( a[--i] );
      free ( a );
      return (NULL);
    }
  }
  return (a);
}

/* free the n rows, and then the pointer array */
void MATRIX_free ( int **a, int n ){
  int i;
  for ( i=0 ; i<n ; i++ ) free ( a[i] );
  free ( a );
}

Binary Matrix

• A binary matrix is a matrix whose cells are all either 0 or 1

  + each cell is either ○ or ×

  + adjacency matrix of a graph, shown later

• Using one integer for each 0/1 value (1 bit) wastes space

  this motivates us to compress the matrix

(Figure: an example 0/1 matrix)

Representation by Bits

• A row composed of 0/1 values can be considered as one big integer

  by chopping it into integers of 32 bits (or 64 bits), it becomes tractable

  ⌈m/32⌉ integers are sufficient to store a row of m cells

  (space efficiency increases, and so does cache efficiency)

• [i,j] element can be accessed by looking at the (j%32)-th bit of the (j/32)-th integer in the i-th row

Handling Bit Access

• [i,j] element can be accessed by looking at the (j%32)-th bit of the (j/32)-th integer in the i-th row

  … writing this code every time is bothersome

• Prepare two arrays of masks

  BIT_MASK[] = {1,2,4,8,16,…}

  BIT_MASK_[] = {0xfffffffe, 0xfffffffd, 0xfffffffb, …}

  + read value: a[i][j/32] & BIT_MASK[j%32] (non-zero if the cell is 1)

  + set to 1 : a[i][j/32] = a[i][j/32] | BIT_MASK[j%32]

  + set to 0 : a[i][j/32] = a[i][j/32] & BIT_MASK_[j%32]
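A minimal sketch of these three accesses for a single row is given below. It is an illustration under assumptions: the function names (BITROW_words, BITROW_alloc, BITROW_get, BITROW_set1, BITROW_set0) are hypothetical, and the masks are computed by shifts instead of being looked up in the mask arrays, which gives the same values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A row of m bits needs (m + 31) / 32 words of 32 bits (m/32 rounded up). */
static size_t BITROW_words(int m) { return ((size_t)m + 31) / 32; }

/* Allocates a zero-filled row of m bits; returns NULL on failure. */
uint32_t *BITROW_alloc(int m) {
    return calloc(BITROW_words(m), sizeof(uint32_t));
}

/* Reads column j: returns 0 or 1. */
int BITROW_get(const uint32_t *row, int j) {
    assert(j >= 0);
    return (int)((row[j / 32] >> (j % 32)) & 1u);
}

/* Sets column j to 1: OR with a mask having only bit j%32 set. */
void BITROW_set1(uint32_t *row, int j) {
    row[j / 32] |= (uint32_t)1 << (j % 32);
}

/* Sets column j to 0: AND with a mask having every bit except j%32 set. */
void BITROW_set0(uint32_t *row, int j) {
    row[j / 32] &= ~((uint32_t)1 << (j % 32));
}
```

A full binary matrix would keep one such row per matrix row, for example in an array of n pointers as in the 2-dimensional array slide.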

Sparse Matrix

• That’s all for the structures for simple matrices

• Their space efficiency is in some sense optimal

• But in applications, they are often not sufficient/efficient

  for example, if the matrix is sparse, many parts are redundant

• A sparse matrix has the same value in most cells (usually 0)

• A sparse matrix should be stored by memorizing only the places with non-zero values

Storing Sparse Matrix

• Let’s begin with a binary matrix, for simplicity

  almost all cells are 0, and only a few are 1

• A simple idea is to make a list of the places of the cells being 1

• That is, memorize (x1,y1),(x2,y2),(x3,y3),…, i.e., store the row ID and column ID of each cell being 1

• The memory requirement is “twice the number of 1’s”; this is very efficient if there are few 1’s (sparse)

• But accessibility is bad; to read a cell, we have to scan the whole list (a binary tree or hash can be used to speed this up)

Store Row-wise

• Let’s have a structure to improve the accessibility

• Classify the places of 1’s according to their row ID

  prepare n arrays, and store the column IDs of the 1’s in the i-th row in the i-th array

• We need n pointers to the n arrays, but we don’t have to store the row IDs, thus memory efficiency increases

• The memory requirement is “# 1’s ＋ #rows×2” (can be “# 1’s ＋ #rows”)

• Accessibility is good; if the IDs in each row array are sorted, binary search works (linear scan is enough if there are few column IDs)
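Reading a cell of a row-wise stored binary matrix can be sketched as follows, assuming the column IDs in each row array are sorted in increasing order. The type SparseRow and the function SROW_get are hypothetical names used only for this illustration.

```c
#include <assert.h>

/* Row i of a sparse 0/1 matrix: the sorted column IDs of its 1's. */
typedef struct {
    int size;        /* number of 1's in the row */
    const int *col;  /* column IDs in increasing order */
} SparseRow;

/* Returns 1 if column j is stored in the row, 0 otherwise (binary search). */
int SROW_get(const SparseRow *r, int j) {
    int lo = 0, hi = r->size - 1;
    assert(r->size >= 0);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (r->col[mid] == j) return 1;
        if (r->col[mid] < j) lo = mid + 1;
        else hi = mid - 1;
    }
    return 0;
}
```

Each row stores a pointer and a size, which corresponds to the “#rows×2” term in the memory requirement above.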

Structure in Each Row

• In sparse cases, the efficiency is increased. However, updates involving insertions/deletions are not efficient

• This is the same situation as with stacks and queues

• So, according to the purpose, we use lists, buckets, hashes, or binary trees as the structure for each row

  (having n arrays is equivalent to having buckets)

Real World Data

• The characteristics of sparse matrices in practice are:

  + Matrix representing a mesh network (structural calculation)

    in the geometrical sense, only a few meshes are adjacent to each mesh, thus there are not so many non-zeros per row

    an array is sufficient as the structure of each row

  + Road network data (adjacency of crossing points + distance)

    almost the same, but updates occur sometimes

    (it would be sufficient to (re-)allocate slightly larger memory)

Real World Data (2)

Ex) A matrix in which each row is a text, each column is a word, and a cell is 1 if the word is included in the text, is usually sparse

  (POS data, Web links, Web surfing, etc.)

  + on average, the number of 1’s in a row/column is constant, but some rows/columns have very many (texts having many words, words included in many texts)

  + the distribution of 1’s follows the so-called power law (Zipf’s law), or is scale free; the number of items of size D is proportional to 1/D^Δ for a constant Δ

    this is often seen in real world data (≠ geometric distribution)

  + Such data needs algorithms designed so that the dense parts do not become the bottleneck of the computation

Non-binary Sparse Matrix

• Usual matrices are of course non-binary, so it is not sufficient to remember the places having non-zero values

  remember (place, value)

• In the case of using an array, store (place, value), (place, value), (place, value),…, or place1, place2,…, value1, value2,…

• In the case of lists or binary trees, assign (place, value) to each cell/node

  or simply prepare two of them

Exercise

• Make data representing the following matrix in a sparse way

0,0,1,4,0

0,1,0,0,5

2,0,0,0,0

1,2,5,0,2

0,0,0,0,0

Column: Memory Saving for Matrix

• Each bucket, or each row of a sparse matrix, needs two data

  (pointer to the first cell, and the size ki)

• We decrease these from two to one

• First, prepare an array of size equal to the number of non-zero cells. Then,

  + 0th row uses the cells of the array ranging from 0 to k0-1

  + 1st row uses from k0 to k0+k1-1 …

  + i-th row uses from k0+…+k(i-1) to (k0+…+ki)-1, where k(i-1) is the size of row i-1, and we remember only the start positions of the rows

• The size of the i-th row can be obtained by (start position of row i+1) - (start position of row i)
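The start positions can be computed from the row sizes by a running sum, as in the sketch below. The function names ROWSTART_build and ROWSTART_size are hypothetical, and the sketch assumes one extra entry start[n] holding the total number of non-zero cells, so that the size of the last row is obtained by the same subtraction.

```c
#include <assert.h>

/* Fills start[0..n] from the row sizes k[0..n-1].
   start[n] is the total number of non-zero cells (an extra entry assumed
   here, so the last row is handled like the others). */
void ROWSTART_build(const int *k, int n, int *start) {
    int i;
    assert(n >= 0);
    start[0] = 0;
    for (i = 0; i < n; i++) start[i + 1] = start[i] + k[i];
}

/* Size of row i, recovered from two consecutive start positions. */
int ROWSTART_size(const int *start, int i) {
    return start[i + 1] - start[i];
}
```

For the 5×5 matrix of the exercise, the row sizes are 2, 2, 1, 4, 0, so the start positions would be 0, 2, 4, 5, 9 followed by the total 9.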


Matrix Operation

• Basic matrix operations are addition and multiplication (inner product of vectors is a special case)

  further, AND and OR for binary matrices

• Algorithms for the operations are trivial if the matrices are in the form of 2-dimensional arrays

  however, they are not so clear if the matrices are in sparse forms

• Further, there are several structures that have advantages for matrix operations

Addition of Matrix

• For the addition, it is sufficient to have an algorithm for the addition of each row

  (so, operations on vectors are sufficient)

• First, we look at the inner product of sparse vectors

Inner Product

• For computing the inner product of two sparse vectors, the difficulty is that, for each cell, we have to find the corresponding cell in the other vector

• Sort the cells in each vector according to their column ID

• Scan the two vectors simultaneously, from smaller indices

  “simultaneously” means that we iteratively pick up the smallest column ID among the two vectors

• When we find a column ID at which both vectors have non-zero values, accumulate the product of the cells

Ex) 1 5 5 1 7 3

    1 1 3 3 5 4

    (each vector is a sequence of pairs: column ID, value)

A Code for Sparse Inner Product

/* va, vb: sparse vectors stored as (column ID, value) pairs sorted by column ID;
   ta, tb: the numbers of pairs in va and vb */
int SVECTOR_innerpro ( int *va, int ta, int *vb, int tb ){
  int ia=0, ib=0, c=0;
  while ( ia<ta && ib<tb ){
    if ( va[ia*2] < vb[ib*2] ) ia++;
    else if ( va[ia*2] > vb[ib*2] ) ib++;
    else { c = c + va[ia*2+1]*vb[ib*2+1]; ia++; ib++; }
  }
  return ( c );
}

Ex) 1 5 5 1 7 3

    2 1 3 3 5 4

Addition of Two Vectors

• The addition can be done in a similar way

• Sort the cells in each vector according to their column ID

• Scan the two vectors simultaneously, from smaller indices

• The positions of non-zero values in the resulting vector are those having a non-zero value in at least one of the two vectors, thus they can be easily identified by the scan

Ex) 1 5 5 1 7 3

    2 1 3 3 5 4

A Code for Addition

/* writes va + vb to vc as (column ID, value) pairs; returns the number of pairs */
int SVECTOR_add ( int *vc, int *va, int ta, int *vb, int tb ){
  int ia=0, ib=0, ic=0, c, cc;
  while ( ia<ta || ib<tb ){
    if ( ia == ta ){ c = vb[ib*2+1]; cc = vb[ib*2]; ib++; }
    else if ( ib == tb ){ c = va[ia*2+1]; cc = va[ia*2]; ia++; }
    else if ( va[ia*2] > vb[ib*2] ){ c = vb[ib*2+1]; cc = vb[ib*2]; ib++; }
    else if ( va[ia*2] < vb[ib*2] ){ c = va[ia*2+1]; cc = va[ia*2]; ia++; }
    else { c = va[ia*2+1] + vb[ib*2+1]; cc = vb[ib*2]; ia++; ib++; }
    vc[ic*2] = cc; vc[ic*2+1] = c; ic++;
  }
  return ( ic );
}

Ex) 1 5 5 1 7 3

    2 1 3 3 5 4

Column: Endmarks do a Good Job!

• Compared to the inner product, the code for addition is relatively long, because we have exceptions at the ends of the arrays

• So, we are motivated to simplify the code by using an “endmark”

  (an endmark is a symbol that represents the end of the array, or something else representing the end)

• 0, -1 or a very large value is used as an endmark

• We prepare an additional cell next to the end of each array, and put an endmark in that cell


Column: Endmarks do a Good Job! (2)

/* ENDMARK is a very large value (e.g. INT_MAX), larger than any column ID,
   so a vector that has reached its end is never chosen by the comparisons;
   the lengths ta and tb are no longer needed */
int SVECTOR_add ( int *vc, int *va, int *vb ){
  int ia=0, ib=0, ic=0, c, cc;
  while ( va[ia*2] != ENDMARK || vb[ib*2] != ENDMARK ){
    if ( va[ia*2] > vb[ib*2] ){ c = vb[ib*2+1]; cc = vb[ib*2]; ib++; }
    else if ( va[ia*2] < vb[ib*2] ){ c = va[ia*2+1]; cc = va[ia*2]; ia++; }
    else { c = va[ia*2+1] + vb[ib*2+1]; cc = vb[ib*2]; ia++; ib++; }
    vc[ic*2] = cc; vc[ic*2+1] = c; ic++;
  }
  vc[ic*2] = ENDMARK;
  return ( ic );
}

Ex) 1 5 5 1 7 3 ■

    2 1 3 3 5 4 ■

    (■ is the endmark)

Matrix Multiplication

• For sparse matrix multiplication, compute the inner products of all the pairs of a row of the first matrix and a column of the second

• However, a sparse matrix has row representations but not column representations, so getting column vectors is hard

• A simple solution is to use the transposing algorithm explained in the section on buckets; then we have the column representation

• On the other hand, some data structures are designed so that columns can also be traced
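The transposing algorithm itself is given in another section and is not reproduced here. As a rough idea of how a column representation can be obtained, the following sketch uses counting, in the spirit of buckets; it is an assumption of this example, not necessarily the algorithm referred to above. The function name SMAT_transpose is hypothetical, and the matrix is assumed to be binary and stored with start positions (row i occupies col[start[i]] to col[start[i+1]-1]).

```c
#include <assert.h>

/* Input : n rows, m columns; row i occupies col[start[i] .. start[i+1]-1].
   Output: column j occupies trow[tstart[j] .. tstart[j+1]-1], holding row
           IDs in increasing order.
   tstart needs m+1 cells, trow needs as many cells as col. */
void SMAT_transpose(const int *start, const int *col, int n, int m,
                    int *tstart, int *trow) {
    int i, j, p;
    assert(n >= 0 && m >= 0);
    for (j = 0; j <= m; j++) tstart[j] = 0;
    /* count the 1's of each column j in tstart[j+1] */
    for (p = 0; p < start[n]; p++) tstart[col[p] + 1]++;
    /* prefix sums turn the counts into start positions */
    for (j = 0; j < m; j++) tstart[j + 1] += tstart[j];
    /* place each cell; tstart[j] serves as the write cursor of column j */
    for (i = 0; i < n; i++)
        for (p = start[i]; p < start[i + 1]; p++)
            trow[tstart[col[p]]++] = i;
    /* each cursor now sits at the start of the next column: shift back */
    for (j = m; j > 0; j--) tstart[j] = tstart[j - 1];
    tstart[0] = 0;
}
```

The running time is proportional to the number of rows, columns, and non-zero cells, so transposing costs no more than scanning the matrix once.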

Four-Direction List

• Lists are good at storing sparse vectors, for tracing

• However, a collection of lists isn’t good at tracing column vectors, because the cells are not connected vertically

• …so, let’s have a list connected in both the row direction and the column direction

• Each cell has four arms that point to the neighboring cells in the directions (←, →, ↑, ↓)

(Figure: cells with values 7, 4, and 2 linked in four directions)

Pointing the Neighbors

• Links in four directions seem to form a mesh network, but they do not

• …since the links can cross

• In other words, this structure can be seen as a superimposition of two kinds of lists, horizontal and vertical, in which identical cells are unified into one

(Figure: a four-direction list whose links cross)

Having Lists of 2-Directions

• If we have both lists of row vectors and lists of column vectors, we have the same accessibility, but insertions/deletions are not the same

• For example, when we want to delete a cell in a row vector, it takes a long time to find the corresponding cell in the column lists

  in four-direction lists, they are already unified

(Figure: example cells with values 7 and 4)
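The advantage for deletion can be illustrated by a minimal sketch of a cell with four links. The type Cell4, its field names, and the function CELL4_unlink are hypothetical; the sketch assumes NULL marks the end of a list and leaves the update of row/column head pointers to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* A cell of a four-direction list (field names are illustrative). */
typedef struct Cell4 {
    int row, col, value;
    struct Cell4 *left, *right;  /* neighbors in the same row */
    struct Cell4 *up, *down;     /* neighbors in the same column */
} Cell4;

/* Unlinks c from both its row list and its column list in constant time.
   No search in the column list is needed, since the cell is shared. */
void CELL4_unlink(Cell4 *c) {
    assert(c != NULL);
    if (c->left)  c->left->right = c->right;
    if (c->right) c->right->left = c->left;
    if (c->up)    c->up->down = c->down;
    if (c->down)  c->down->up = c->up;
    c->left = c->right = c->up = c->down = NULL;
}
```

With two separate sets of lists, the same deletion would first have to locate the cell in the column list, which takes time proportional to the length of that list.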

Graph Structure

• A graph is a structure composed of a set of vertices and a set of edges (an edge is a pair of vertices)

• It is formed by sets, so information such as positions, shapes, and crossings of edges does not matter when it is drawn as a picture

  (a graph with shape/position information is called “graph visualization” or “embedded graph”)

• When edges have directions (from one vertex to another), the graph is called directed

very popular structure

Examples of Graph Data

• Adjacency relation, hierarchy in an organization, similarity relation

• Web network, human network, SNS friend network,…

Graph Terminology

• Edge e is said to be incident to u and v, and vice versa, if e = (u,v)

  also, u and v are said to be adjacent

• The number of edges incident to v is the degree of v

• A graph having an edge for every pair of vertices is a complete graph

• When there are two or more edges connecting the same two vertices, the edges are called multiple edges

• If the vertices can be partitioned into two groups so that every edge connects a vertex in one group and a vertex in the other, the graph is called a bipartite graph

Storing a Graph

• n vertices can be seen as numbers 0,…,n-1

• Then, an edge is a pair of numbers

  can be stored by writing the pairs in arrays, lists, etc.

• Further, we need something for the accessibility

  for example, we often visit a vertex and go to a neighboring vertex, and so we need to scan all edges incident to the vertex

Using Matrix

• The set of edges can be represented by a matrix as follows

① j-th row/column corresponds to vertex j, and the ij-cell is 1 if there is an edge (i, j) (called adjacency matrix)

  + efficient for dense graphs having many edges

  + multiplicity of edges can be represented by the value of a cell

② i-th row corresponds to vertex i, and each column corresponds to an edge; when edge j is incident to vertex i, the ij-cell is 1 (called incidence matrix)

  + multiple edges are represented easily

• Sparse matrix representation has an advantage for incidence matrices and sparse graphs
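As an illustration of representation ①, the sketch below builds an adjacency matrix from a list of undirected edges. The function name ADJ_build and the edge layout (edge e joins edge[2e] and edge[2e+1]) are assumptions of this example; the matrix is stored row by row in one array, and cell values count the multiplicity of edges.

```c
#include <assert.h>
#include <stdlib.h>

/* Builds an n×n adjacency matrix from m undirected edges.
   Cell (u,v) is a[u*n + v]; its value is the number of edges between
   u and v. Returns NULL if memory could not be allocated. */
int *ADJ_build(int n, int m, const int *edge) {
    int e;
    int *a = calloc((size_t)n * (size_t)n, sizeof(int));
    if (a == NULL) return NULL;
    for (e = 0; e < m; e++) {
        int u = edge[2 * e], v = edge[2 * e + 1];
        assert(0 <= u && u < n && 0 <= v && v < n);
        a[u * n + v]++;
        if (u != v) a[v * n + u]++;  /* undirected: the matrix is symmetric */
    }
    return a;
}
```

Checking whether two given vertices are joined by an edge is a single array access, while listing the neighbors of a vertex requires scanning a whole row of n cells.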

In Practice

• A 2-dimensional array is sufficient when the matrix size is small

  the cost is small, and the redundancy is small

• For a sparse matrix, such as a 100 by 100 matrix with 10 non-zero elements in each row, a sparse representation will be efficient

  (approximately, when the density is less than 10%)

  + When we often want to scan non-zero elements, such as tracing all vertices adjacent to a vertex, a sparse representation is useful

  + If we want to check whether there is an edge between two specified vertices, a 2-dimensional array has an advantage

Incidence Matrix

• An incidence matrix represents the incidence relation between vertices and edges

• Assign indices 0,…,n-1 to vertices, and 0,…,m-1 to edges

  + store the edges incident to a vertex in the corresponding row

  = store the vertices incident to an edge in the corresponding column

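(Figure: a graph with vertices 0 to 5 and edges 0 to 8)

Adjacent vertices of each vertex
0: 1,3
1: 0,2,4,5
2: 1,3,4,5
3: 0,2
4: 1,2,5
5: 1,2,4

Edges incident to each vertex (rows)
0: 0,2
1: 0,1,3,4
2: 4,6,7,8
3: 2,7
4: 3,5,8
5: 1,5,6

Vertices incident to each edge (columns)
0: 0,1
1: 1,5
2: 0,3
3: 1,4
4: 1,2
5: 4,5
6: 2,5
7: 2,3
8: 2,4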
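The row representation (edges incident to each vertex) can be built from the column representation (the two end vertices of each edge), as in the following sketch. The names INC_build and INC_other are hypothetical, the graph is assumed to have no self-loops, and the result is stored with start positions as in the memory-saving column.

```c
#include <assert.h>

/* edge[2e] and edge[2e+1] are the two end vertices of edge e (0 <= e < m).
   Output: the edges incident to vertex v are inc[start[v] .. start[v+1]-1],
   in increasing order of edge ID.
   start needs n+1 cells, inc needs 2m cells. */
void INC_build(int n, int m, const int *edge, int *start, int *inc) {
    int v, p;
    assert(n >= 0 && m >= 0);
    for (v = 0; v <= n; v++) start[v] = 0;
    /* degree of each vertex v, counted in start[v+1] */
    for (p = 0; p < 2 * m; p++) start[edge[p] + 1]++;
    for (v = 0; v < n; v++) start[v + 1] += start[v];
    /* p/2 is the ID of the edge that cell p belongs to */
    for (p = 0; p < 2 * m; p++) inc[start[edge[p]]++] = p / 2;
    /* the cursors have advanced to the next vertex: shift them back */
    for (v = n; v > 0; v--) start[v] = start[v - 1];
    start[0] = 0;
}

/* The end vertex of edge e other than v. */
int INC_other(const int *edge, int e, int v) {
    return edge[2 * e] == v ? edge[2 * e + 1] : edge[2 * e];
}
```

To scan the neighbors of a vertex v, one can go through the edges inc[start[v]] to inc[start[v+1]-1] and call INC_other on each of them; since every edge has an ID, data attached to edges can be kept in a separate array of size m.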

Advantage of Incidence Matrix

• In the case of an incidence matrix, each edge has an ID

  so it is easy to handle the information attached to each edge

  it is sufficient to allocate an array of size m

• In the case of an adjacency matrix, an edge doesn’t have an ID, thus it is not easy to manage the correspondence between an edge and its data

• Multiple edges are also easy to handle


Allocate Memory for Cells

• An incidence matrix can be realized by list cells having four links, like a sparse matrix

  (two for the vertices of the edge, and two for the edges at the vertex)

  disadvantages of arrays are eliminated

• It can also be made of two array lists

• or, prepare an array in which edge i corresponds to cells 2i and 2i+1, to represent the four links


Exercise

• Make an adjacency matrix of the following graph, and represent it as a sparse incidence matrix

(Figure: a graph with vertices 0 to 6)

Bipartite Graph

• A bipartite graph is often seen as a representation of a (binary) (sparse) matrix

  associate the vertices of one group with rows, and the others with columns

  connect two vertices by an edge if the corresponding cell has a non-zero value

• A representation in a different style

  0: 4,6
  1: 4,5
  2: 5,6
  3: 5,6

(Figure: a bipartite graph with vertices 0 to 3 on one side and 4 to 6 on the other)

Column: Store Huge Graph

• A graph needs two pointers (or integers) per edge

  weights, etc. need more

• 64 bits per edge are required on a 32-bit CPU

• However, Web graphs have billions of vertices, and 20 billion edges

  160GB is necessary in this way

• This is too much. Can we reduce the storage size?


Column: Store Huge Graph (2)

① Only a few vertices have large degrees

  vertices are mainly adjacent to these few vertices

  assign indices so that large-degree vertices have small indices, and represent small indices by a small number of bits, and large indices by many bits

Ex.)

• If the bit sequence representing a number begins with “0”, the following 7 bits represent [0-127]

• If “10”, the following 14 bits represent 128+[0-16383]

• If “11”, the following 30 bits represent 16384+128,…
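The example code above can be sketched with whole bytes: the “0” case takes 1 byte, the “10” case 2 bytes, and the “11” case 4 bytes. The function names VLC_encode and VLC_decode, the byte order inside a code, and the assumption that the 30-bit case starts at offset 16384+128 are choices made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Writes index x to out[0..]; returns the number of bytes used (1, 2 or 4),
   or 0 if x does not fit in the 30-bit case. */
int VLC_encode(uint32_t x, uint8_t *out) {
    if (x < 128u) {                       /* 0xxxxxxx */
        out[0] = (uint8_t)x;
        return 1;
    }
    x -= 128u;
    if (x < 16384u) {                     /* 10xxxxxx xxxxxxxx */
        out[0] = (uint8_t)(0x80u | (x >> 8));
        out[1] = (uint8_t)(x & 0xFFu);
        return 2;
    }
    x -= 16384u;
    if (x >= (1u << 30)) return 0;
    out[0] = (uint8_t)(0xC0u | (x >> 24)); /* 11xxxxxx followed by 3 bytes */
    out[1] = (uint8_t)((x >> 16) & 0xFFu);
    out[2] = (uint8_t)((x >> 8) & 0xFFu);
    out[3] = (uint8_t)(x & 0xFFu);
    return 4;
}

/* Reads one index from in[0..] into *x; returns the bytes consumed. */
int VLC_decode(const uint8_t *in, uint32_t *x) {
    assert(in != 0 && x != 0);
    if ((in[0] & 0x80u) == 0) {           /* leading bit 0 */
        *x = in[0];
        return 1;
    }
    if ((in[0] & 0xC0u) == 0x80u) {       /* leading bits 10 */
        *x = 128u + (((uint32_t)(in[0] & 0x3Fu) << 8) | in[1]);
        return 2;
    }
    *x = 128u + 16384u                    /* leading bits 11 */
       + (((uint32_t)(in[0] & 0x3Fu) << 24) | ((uint32_t)in[1] << 16)
          | ((uint32_t)in[2] << 8) | in[3]);
    return 4;
}
```

If most edges point to the few large-degree vertices, which have small indices, most end points are stored in one or two bytes instead of four.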


Column: Store Huge Graph (3)

② Sort the sites in dictionary order of their URLs

  links usually point to nearby sites, thus the differences of IDs become small

• The differences can be recorded in the same way, to reduce the space

• Using these, one edge needs just 10 bits. Further, we can reduce it to 5 bits

  the storage will be 20GB, thus it can fit in recent computers


Summary

• Data structures for matrices

• Structures for sparse matrices, and four-direction lists

• Structures for graphs:

  adjacency matrix and incidence matrix

  adjacency list
