

Applied Mathematics Letters 22 (2009) 242–244


Optimizing matrix operations in $\mathbb{Z}_2$ by word packing

Rafael Álvarez*, M. Jesús Castel, Leandro Tortosa, Antonio Zamora

Departamento de Ciencia de la Computación e Inteligencia Artificial, Universidad de Alicante, Campus de San Vicente, Ap. Correos 99, E–03080, Alicante, Spain

Article info

Article history:
Received 2 July 2007
Received in revised form 29 February 2008
Accepted 24 March 2008

Keywords: Binary matrix; Finite field; Optimization; Binary word; Matrix packing

Abstract

We propose a new storage scheme (word packing) for matrices with elements in $\mathbb{Z}_2$ that enables improved performance. This scheme is based on utilizing the full register length of modern microprocessors to perform multiple $\mathbb{Z}_2$ operations in parallel. We analyze several operations over word packed matrices and compare them with their conventional equivalents.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Working in $\mathbb{Z}_2$ has many advantages in multiple fields like cryptography, steganography, coding theory, correcting codes, digital telecommunication or computer networking, where digital data can be considered as a sequence of elements in $\mathbb{Z}_2$.

It is much more efficient to manipulate single $\mathbb{Z}_2$ elements (bits) in hardware than it is in software, since microprocessors are designed to work with groups of bits and require additional operations to extract and operate on single bits from these groups. Nevertheless, if bits are grouped into convenient sizes, software implementations can also be very efficient, exploiting the full parallelism capabilities of modern microprocessors; the case of binary matrices is especially well suited for this software approach.

While most matrix packing schemes proposed previously are designed for sparse matrices [1,2], we propose a new general method for grouping the elements of binary matrices so that the usual operations (addition, product or transposition) can be carried out much more efficiently. We call these kinds of matrices word packed matrices; they loosely resemble block matrices but have some peculiarities that must be considered.

The motivation for this packing scheme is the optimization of a previously proposed matricial pseudorandom number generator (see [3] for detailed information) that uses the powers of a block upper triangular matrix, $M$, with elements in $\mathbb{Z}_p$:

$$M^h = \begin{bmatrix} A^h & X^{(h)} \\ O & B^h \end{bmatrix}.$$

In this pseudorandom generator, each subsequent $X^{(h)}$ block is processed, extracting some bits to produce the output sequence, and the first $X$ block acts as the seed of the pseudorandom sequence. In this case, employing $p = 2$, suitable matrix sizes and storing the $A$, $B$ and $X$ blocks as word packed matrices, we have achieved [4] performance gains of over an order of magnitude. In the same way, employing word packed matrices in other applications can be very beneficial.

Partially supported by the Spanish grant GV06/018.

* Corresponding author.

doi:10.1016/j.aml.2008.03.018


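Table 1. Performance results (cycles) with conventional and word packed matrices

| Size | Addition (conv.) | Addition (packed) | Product (conv.) | Product (packed) | Transposition (conv.) | Transposition (packed) |
|---|---|---|---|---|---|---|
| 128 × 128 | 732 | 32.5 | 97307 | 1158 | 192 | 6.4 |
| 256 × 256 | 1077 | 42.5 | 380372 | 6484 | 764 | 24.0 |
| 512 × 512 | 4662 | 159.7 | 3237601 | 42968 | 3647 | 114.0 |
| 1024 × 1024 | 17337 | 629.6 | 29986311 | 341342 | 20180 | 564.1 |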

2. Description

Packed matrices allow highly efficient addition, multiplication and transposition of matrices with elements in $\mathbb{Z}_2$ using binary operations between processor registers. A matrix with elements in $\mathbb{Z}_2$ is a word packed matrix if one of its dimensions (rows or columns) is stored as word sized groups of bits. A word usually corresponds to the register size. Therefore, if its rows are stored as words it is a row packed matrix, and if its columns are stored as words it is a column packed matrix, denoted by the subscripts $r$ and $c$ respectively.
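The packing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `pack_vector`, `pack_rows` and `pack_cols`, the choice of placing bit $k$ of each group at bit position $k$ of the word, and the zero padding of a final partial word are all hypothetical choices made here. Python integers stand in for machine registers, with `WORD` playing the role of the register width.

```python
WORD = 32  # assumed register width in bits (hypothetical default)

def pack_vector(bits, word=WORD):
    """Pack a sequence of 0/1 values into a list of word-sized integers."""
    words = []
    for start in range(0, len(bits), word):
        w = 0
        for offset, bit in enumerate(bits[start:start + word]):
            w |= (bit & 1) << offset  # bit k of the group goes to bit position k
        words.append(w)               # a short final group is zero padded
    return words

def pack_rows(matrix, word=WORD):
    """Row packed form: one list of words per matrix row."""
    return [pack_vector(row, word) for row in matrix]

def pack_cols(matrix, word=WORD):
    """Column packed form: one list of words per matrix column."""
    return [pack_vector(col, word) for col in zip(*matrix)]

bits = [[1, 0, 1, 1],
        [0, 1, 0, 0]]
print(pack_rows(bits, word=4))  # [[13], [2]]
print(pack_cols(bits, word=4))  # [[1], [2], [1], [1]]
```

With a 4-bit word, each row of the example fits in a single word, while each column only fills two of the four bit positions.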

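Addition. The addition of packed matrices must be done between matrices of the same type, whether packed by rows or by columns. Although they could be unpacked and operated normally, the optimal way is to perform an XOR operation word by word. In the case of two packed matrices consisting of three columns of two words each we have

$$\begin{bmatrix} \omega_{1,1} & \omega_{1,2} & \omega_{1,3} \\ \omega_{2,1} & \omega_{2,2} & \omega_{2,3} \end{bmatrix}_c + \begin{bmatrix} \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\ \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \end{bmatrix}_c = \begin{bmatrix} \omega_{1,1} \oplus \gamma_{1,1} & \omega_{1,2} \oplus \gamma_{1,2} & \omega_{1,3} \oplus \gamma_{1,3} \\ \omega_{2,1} \oplus \gamma_{2,1} & \omega_{2,2} \oplus \gamma_{2,2} & \omega_{2,3} \oplus \gamma_{2,3} \end{bmatrix}_c.$$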
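A minimal sketch of word-wise addition follows, assuming each packed matrix is held as a grid (list of lists) of Python integers, one integer per word. The function name `add_packed` and the shape check are illustrative choices, not part of the original paper.

```python
def add_packed(a, b):
    """Add two packed matrices of the same packing type over Z2.

    Each word pair is combined with a single XOR, which adds all the
    bits held in those words at once.
    """
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("packed matrices must have the same shape")
    return [[wa ^ wb for wa, wb in zip(x, y)] for x, y in zip(a, b)]

# Two grids of 2 x 3 words, as in the equation above.
omega = [[0b1100, 0b0101, 0b1111],
         [0b0001, 0b0000, 0b1010]]
gamma = [[0b1010, 0b0101, 0b0000],
         [0b0001, 0b0110, 0b1010]]
total = add_packed(omega, gamma)
```

Since XOR is its own inverse, adding `gamma` to `total` again recovers `omega`, which is a convenient sanity check for an implementation.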

Product. The product operation between packed matrices is a little more complex than the addition. The product must be done between matrices of different types and with compatible sizes. The multiplicand has to be a row packed matrix, while the matrix corresponding to the multiplier must be packed by columns.

The result of each row by column operation should be a single bit and not a whole word. Therefore, it is necessary to perform the sum of all the bits contained in that word in order to obtain the final bit.

The inner sum function, denoted by $\phi$, computes the sum of all the elements contained in a word. Let $\omega$ be a binary word of size $n$ and let $\omega(k)$ denote its $k$-th bit; then, using the XOR operation, $\phi(\omega)$ can be defined as $\phi(\omega) = \omega(0) \oplus \omega(1) \oplus \cdots \oplus \omega(n-1)$. It can also be defined as the parity of the number of 1 bits of $\omega$.

Using the inner sum, the product of two word packed matrices can be expressed as

$$\begin{bmatrix} \omega_{1,1} & \omega_{1,2} & \omega_{1,3} \end{bmatrix}_r \begin{bmatrix} \gamma_{1,1} \\ \gamma_{2,1} \\ \gamma_{3,1} \end{bmatrix}_c = \begin{bmatrix} \phi\big((\omega_{1,1} \wedge \gamma_{1,1}) \oplus (\omega_{1,2} \wedge \gamma_{2,1}) \oplus (\omega_{1,3} \wedge \gamma_{3,1})\big) \end{bmatrix}.$$
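The inner sum and the row by column product can be sketched as follows. The layout is an assumption made for this example: the multiplicand is a list of packed rows and the multiplier a list of packed columns, each a list of Python integers of equal length. The names `inner_sum` and `mul_packed` are hypothetical, and the result is returned as plain bits, so repacking it is left as a separate step.

```python
def inner_sum(word):
    """phi: XOR of all bits of the word, i.e. the parity of its 1 bits."""
    return bin(word).count("1") & 1

def mul_packed(a_rows, b_cols):
    """Multiply a row packed matrix by a column packed matrix over Z2.

    Returns the product as an unpacked list of rows of 0/1 values.
    """
    result = []
    for row in a_rows:
        out = []
        for col in b_cols:
            if len(row) != len(col):
                raise ValueError("incompatible sizes")
            acc = 0
            for wr, wc in zip(row, col):
                acc ^= wr & wc          # AND multiplies bitwise, XOR accumulates
            out.append(inner_sum(acc))  # collapse the word to a single bit
        result.append(out)
    return result

# A = [[1, 1], [0, 1]] packed by rows, B = [[1, 0], [1, 1]] packed by columns
# (bit k of each row or column stored at bit position k of its word).
a_rows = [[0b11], [0b10]]
b_cols = [[0b11], [0b10]]
product = mul_packed(a_rows, b_cols)  # [[0, 1], [1, 1]]
```

Accumulating the ANDed words with XOR first and applying `inner_sum` once per output bit matches the expression above, where $\phi$ is applied to the XOR of all the word products.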

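Transposition. The matrix transposition operation changes the type of matrix packing (rows to columns and vice versa) and can be carried out very efficiently because it requires only moving words. In the case of transposing a row packed matrix we have

$$\begin{bmatrix} \omega_{1,1} & \omega_{1,2} \\ \omega_{2,1} & \omega_{2,2} \\ \omega_{3,1} & \omega_{3,2} \end{bmatrix}_r^T = \begin{bmatrix} \omega_{1,1} & \omega_{2,1} & \omega_{3,1} \\ \omega_{1,2} & \omega_{2,2} & \omega_{3,2} \end{bmatrix}_c.$$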
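As a small illustration, assuming the packed matrix is held as the same two-dimensional grid of words that appears in the equation above, transposition amounts to transposing that grid while leaving every word untouched. The function name `transpose_packed` is hypothetical.

```python
def transpose_packed(grid):
    """Transpose a packed matrix by moving whole words.

    `grid` is the 2-D arrangement of words shown in the equation above.
    A row packed input yields a column packed output and vice versa;
    no individual bit is extracted or shifted.
    """
    return [list(words) for words in zip(*grid)]

# Three rows of two words each (row packed) ...
row_packed = [[0xA1, 0xA2],
              [0xB1, 0xB2],
              [0xC1, 0xC2]]
# ... become three columns of two words each (column packed).
col_packed = transpose_packed(row_packed)
```

In a layout that instead keeps one list of words per packed row or column, the same lists can simply be reinterpreted as columns of the transpose, so even the word moves may be avoidable; that variant is a design option rather than something stated in the paper.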

3. Results

The performance gain achieved with packed matrices is very significant, as shown in Table 1. Moreover, when the matrices are bigger, such as in the case of 1024 × 1024, the difference is even more pronounced. Word packing permits the CPU to perform several parallel operations with a single instruction and reduces memory bandwidth requirements, resulting in fewer memory accesses. Even in the case of the product operation, the benefits are big enough to compensate for the extra cycles required to repack the resulting matrix. The logarithmic comparison graph for the product operation (see Fig. 1) shows that the difference increases with matrix size, resulting in almost two orders of magnitude for 1024 × 1024 matrices; performance for other operations is similar.

The results obtained show that word packed matrices increase performance significantly in those applications requiring matricial operations in $\mathbb{Z}_2$, even if that means working with bigger matrices. It should be remarked that the word packing technique can take advantage directly of future CPU architectures with wider registers (like the recent 64 bit microprocessors), improving performance even further.
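The speedup factors implied by Table 1 can be checked with a short calculation. The cycle counts below are copied from Table 1; the dictionary layout and the function name `speedups` are illustrative choices made for this sketch.

```python
# Cycle counts from Table 1: (conventional, packed) for each operation.
table1 = {
    128:  {"addition": (732, 32.5),    "product": (97307, 1158),
           "transposition": (192, 6.4)},
    256:  {"addition": (1077, 42.5),   "product": (380372, 6484),
           "transposition": (764, 24.0)},
    512:  {"addition": (4662, 159.7),  "product": (3237601, 42968),
           "transposition": (3647, 114.0)},
    1024: {"addition": (17337, 629.6), "product": (29986311, 341342),
           "transposition": (20180, 564.1)},
}

def speedups(table):
    """Return conventional/packed cycle ratios for every size and operation."""
    return {size: {op: conv / packed for op, (conv, packed) in ops.items()}
            for size, ops in table.items()}

ratios = speedups(table1)
for size, ops in ratios.items():
    print(size, {op: round(r, 1) for op, r in ops.items()})
```

Every ratio in the table lies between 20 and 100, and the product at 1024 × 1024 comes out just under 88, which is consistent with the "almost two orders of magnitude" quoted for that case.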


Fig. 1. Performance comparison (logarithmic) for the product operation.

4. Conclusions

We have proposed a new storage scheme for matrices with elements in $\mathbb{Z}_2$ based on word packing that significantly improves performance for common operations such as addition, product and transposition. Utilizing the full register length, this scheme allows microprocessors to perform multiple parallel operations in $\mathbb{Z}_2$ with a single instruction and to reduce memory bandwidth requirements, resulting in a considerable performance gain. Furthermore, it can take direct advantage of more powerful architectures with wider registers and data buses.

Word packed matrices can be very useful in several fields where digital data can be considered as a stream of $\mathbb{Z}_2$ elements, such as cryptography, coding theory, telecommunication or networking. The improved performance makes software solutions feasible for certain applications instead of requiring expensive dedicated hardware. The performance gain over conventional implementations is between one and two orders of magnitude in the measurements of Table 1, making word packing a very attractive alternative for suitable applications.

References

[1] M.D. Brain, A.L. Tharp, Perfect hashing using sparse matrix packing, Inform. Syst. 15 (3) (1990) 281–290.

[2] A. Pinar, M. Heath, Improving performance of sparse matrix–vector multiplication, Proc. ACM/IEEE Conf. Supercomput. art. 30 (1999).

[3] R. Alvarez, J.-J. Climent, L. Tortosa, A. Zamora, An efficient binary sequence generator with cryptographic applications, Appl. Math. Comput. 167 (1) (2005) 16–27.

[4] R. Alvarez, Aplicaciones de las matrices por bloques a los criptosistemas de cifrado en flujo, Ph.D. Thesis, Dpt. of Computer Science and Artificial Intelligence, University of Alicante (2005).