
Volume 12, number 2 INFORMATION PROCESSING LETTERS 13 April 1981

A LAYOUT FOR THE SHUFFLE-EXCHANGE NETWORK WITH $\Theta(N^2/\log N)$ AREA

David STEINBERG Department of Applied Mathematics, The Weizmann Institute, Rehovot, Israel

Michael RODEH * IBM Research, San Jose Laboratory, San Jose, CA 95193, U.S.A.

Received May 1980; revised version received November 1980

Interconnection networks, permutation network, shuffle-exchange, layout, bisection, VLSI

1. Introduction

During the past few years the shuffle-exchange network has emerged as one of the most important parallel processor interconnection schemata. Stone, Lang, Lawrie, Nassimi and Sahni, and Schwartz [1-5] have shown that the network can perform a rich variety of data permutations useful for parallel processing on SIMD machines. Parker [6] has proven that the network can perform arbitrary permutations in no more than $3n$ shuffle-exchange steps, where $N = 2^n$ is the number of inputs (and outputs) to the network.

A problem of considerable practical and theoretical interest is that of finding a layout of the shuffle-exchange network with minimal area. The model for the layout which we shall adopt is due to Thompson [7,8]; it is based on the limitations imposed by VLSI technology. The wires of the network are constrained to paths on the planar grid and the nodes are restricted to the points of intersection of the grid lines. Wires are allowed to cross one another but they may not share line segments. The area of the layout is defined as the area of the smallest rectangle containing all the nodes and wires, assuming that the distance between neighboring parallel lines is one.

* On sabbatical from IBM Israel Scientific Center, Technion City, Haifa, Israel.

The problem of finding a layout of the shuffle-exchange network which has minimal area has attracted considerable attention recently:

(1) Thompson [7] was the first to propose a non-trivial layout; his construction requires $O(N^2/\sqrt{n})$ area.

(2) Steinberg and Rodeh [9] derived a $\Theta(N^2/n)$ layout by considering elementary locality-preserving properties of the shuffle-exchange network.

(3) Hoey and Leiserson [10] discovered a very elegant technique for laying out the network in $\Theta(N^2/n)$ area; however, their method was restricted to $n$ of the form $n = 2^k$.

(4) Steinberg and Rodeh [11] as well as Hoey and Leiserson [12] independently found a new layout with area $\Theta(N^2/n)$. The result is based on [10] and is valid for all $n$.

(5) Steinberg and Rodeh [11] refined the layout referred to in (4) and proved that its area could be improved to $\Theta(N^2/n^{3/2})$. This result is the best known to date. It has recently come to the authors' attention that Leighton, Lepley and Miller have obtained an identical construction [13].

In this paper we present the $\Theta(N^2/n)$ construction described in [9]. Although the result is not the best known (it was when the paper was first sent for publication), it is worthy of attention because it is based on a technique very different from that described in [11], and a judicious combination of the two methods might lead to an improvement over the $\Theta(N^2/n^{3/2})$ layout. Such an improvement may be possible since, as we shall see in the next section, the best known lower bound on the area of the shuffle-exchange network is $\Omega(N^2/n^2)$.

0020-0190/81/0000-0000/$02.50 © North-Holland Publishing Company

2. The shuffle-exchange network and a lower bound on its area

The shuffle-exchange network consists of $N = 2^n$ nodes, numbered by distinct elements of $A = \{0, 1, \ldots, N-1\}$. For $\alpha \in A$, let $\alpha_{n-1}\alpha_{n-2}\cdots\alpha_0$ be the $n$-bit binary representation of $\alpha$; then a node $\alpha_{n-1}\cdots\alpha_0$ is connected by an edge to $s(\alpha_{n-1}\alpha_{n-2}\cdots\alpha_0) = \alpha_{n-2}\cdots\alpha_0\alpha_{n-1}$ (shuffle edge) and to $e(\alpha_{n-1}\cdots\alpha_1\alpha_0) = \alpha_{n-1}\cdots\alpha_1\bar{\alpha}_0$ (exchange edge). The shuffle connections are uni-directional while the exchange connections are bi-directional. (For our purposes, we shall neglect the difference between connecting a pair of nodes with a single bi-directional line or with two uni-directional lines of opposite direction, since this can only affect the area by a constant factor.) A shuffle-exchange network for $N = 8$ is illustrated in Fig. 1.
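The two edge types defined above can be sketched in a few lines of Python. This is only an illustration: the function names `shuffle`, `exchange` and `shuffle_exchange_edges` are hypothetical and do not come from the paper, and nodes are assumed to be represented as plain integers in the range 0 to N - 1.

```python
def shuffle(alpha: int, n: int) -> int:
    """Rotate the n-bit representation of alpha left by one position."""
    mask = (1 << n) - 1
    return ((alpha << 1) & mask) | (alpha >> (n - 1))


def exchange(alpha: int) -> int:
    """Complement the least significant bit of alpha."""
    return alpha ^ 1


def shuffle_exchange_edges(n: int):
    """Return (shuffle_edges, exchange_edges) for the network on 2**n nodes.

    Shuffle edges are directed pairs (source, target); each exchange edge is
    listed once, as the pair (even node, odd node).
    """
    nodes = range(1 << n)
    shuffle_edges = [(a, shuffle(a, n)) for a in nodes]
    exchange_edges = [(a, exchange(a)) for a in nodes if a < exchange(a)]
    return shuffle_edges, exchange_edges


if __name__ == "__main__":
    s_edges, e_edges = shuffle_exchange_edges(3)
    print("shuffle:", s_edges)
    print("exchange:", e_edges)
```

For N = 8 the sketch lists eight shuffle edges (the nodes 0 and 7 map to themselves) and four exchange edges.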

Lower bounds on the area of the layout of a given abstractly defined network may be derived by exploiting certain area-time tradeoffs. Thompson used this method to obtain an $\Omega(N^2/n^2)$ lower bound on the area of the shuffle-exchange network by considering the capability of the network to compute the FFT [7]. A somewhat simpler proof follows from the network's ability to perform arbitrary permutations in $O(n)$ time. Suppose each node of the network contains an element of data and suppose that the data separated by the minimal bisection are to be exchanged. This requires the transfer of $N$ pieces of data across the edges of the bisection in time $O(n)$. Denoting the minimal bisection width by $w$ we get $w = \Omega(N/n)$. Since the area of the graph is always bounded from below by $w^2/4$ (see [8]), it follows that the area is $\Omega(N^2/n^2)$.

Fig. 1. The shuffle-exchange network for $N = 8$ (solid lines represent shuffle connections and dotted lines represent exchange connections).

It should be mentioned that the CCC network proposed by Preparata and Vuillemin [14] performs arbitrary permutations in $O(n)$ time and can be laid out in $O(N^2/n^2)$ area, so in terms of the area it is optimal. However, as Hoey and Leiserson pointed out in [10], the CCC is less attractive mathematically than the shuffle-exchange network.

3. A layout for the shuffle-exchange network with $\Theta(N^2/n)$ area

The layout with $\Theta(N^2/n)$ area is based on a partitioning of $A$ into subsets such that all shuffle and exchange edges interconnecting the subsets are of a local nature. In order to describe the locality properties, we introduce the following definition:

Definition. Let $\alpha_{n-1}\alpha_{n-2}\cdots\alpha_0$ be the binary representation of the number $\alpha$, $0 \le \alpha < 2^n$. The alternation number of $\alpha$, $\mathrm{Alt}(\alpha)$, is defined as the number of times adjacent bits of $\alpha$ have different values (the least and most significant bits of $\alpha$ are considered here to be adjacent).

Thus, $\mathrm{Alt}(011010) = 4$ and $\mathrm{Alt}(000111) = 2$. Note that the alternation number is always even.
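The definition can be checked mechanically. The following is a minimal sketch, assuming numbers are stored as Python integers together with an explicit bit length `n`; the name `alternation_number` is hypothetical. It compares each bit with its cyclic neighbour by XOR-ing the number with its one-position rotation.

```python
def alternation_number(alpha: int, n: int) -> int:
    """Count the cyclically adjacent bit pairs of the n-bit number alpha that differ."""
    mask = (1 << n) - 1
    # Rotating left by one aligns every bit with its cyclic neighbour.
    rotated = ((alpha << 1) | (alpha >> (n - 1))) & mask
    return bin(alpha ^ rotated).count("1")


if __name__ == "__main__":
    print(alternation_number(0b011010, 6))
    print(alternation_number(0b000111, 6))
```

Both examples from the text, and the evenness of the alternation number, can be confirmed this way for small bit lengths.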

Now, let

$$A_i = \{\alpha \mid \alpha \text{ is a binary number with } n \text{ bits and exactly } i \text{ ones}\},$$

$$A_i^j = \{\alpha \in A_i \mid \mathrm{Alt}(\alpha) = 2j\}.$$

Fig. 2. The connections between the $A_i^j$ for $n = 8$ (rows: number of 1's; columns: alternation number/2).


The proposed layout has matrix form: the $i$th row contains the elements of $A_i$ arranged according to increasing alternation number so that the $j$th column contains $A_i^j$ (see Fig. 2). Note that $A_i^j$ is empty whenever $j > \min(i, n-i)$ or $0 = j < i < n$.
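As an illustration of the matrix arrangement, the sketch below enumerates all $n$-bit numbers and groups them by the pair (number of ones, alternation number/2). The helper names `alternation_number` and `classes` are hypothetical, and brute-force enumeration is assumed to be acceptable because only small $n$ are of interest here.

```python
from collections import defaultdict


def alternation_number(alpha: int, n: int) -> int:
    """Count the cyclically adjacent bit pairs of the n-bit number alpha that differ."""
    mask = (1 << n) - 1
    rotated = ((alpha << 1) | (alpha >> (n - 1))) & mask
    return bin(alpha ^ rotated).count("1")


def classes(n: int) -> dict:
    """Map (i, j) to the sorted list of n-bit numbers with i ones and 2j alternations."""
    table = defaultdict(list)
    for alpha in range(1 << n):
        i = bin(alpha).count("1")
        j = alternation_number(alpha, n) // 2
        table[(i, j)].append(alpha)
    return dict(table)


if __name__ == "__main__":
    n = 4
    table = classes(n)
    for i in range(n + 1):
        row = [len(table.get((i, j), [])) for j in range(n // 2 + 1)]
        print(f"i = {i}: class sizes by j = {row}")
```

The printed rows correspond to the rows of the matrix, and the zero entries are exactly the empty classes described above.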

The locality of the connections is due to the fact that the shuffle edges can only connect two nodes belonging to the same $A_i^j$, while the exchange edges can only connect $\alpha_{n-1}\cdots\alpha_1 0 \in A_i$ with $\alpha_{n-1}\cdots\alpha_1 1 \in A_{i+1}$ and $\alpha_{n-1}\cdots\alpha_1 1 \in A_i$ with $\alpha_{n-1}\cdots\alpha_1 0 \in A_{i-1}$, such that $\mathrm{Alt}(\alpha_{n-1}\cdots\alpha_1\alpha_0)$ and $\mathrm{Alt}(\alpha_{n-1}\cdots\alpha_1\bar{\alpha}_0)$ differ by only $-2$, $0$ or $2$. This suggests the following layout:
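The locality property can be verified exhaustively for small $n$. The sketch below is such a check under the same integer representation as before; `check_locality` and its helpers are hypothetical names. It tests that a shuffle keeps a node in its class and that an exchange moves it to a neighbouring row and to the same or an adjacent column.

```python
def alternation_number(alpha: int, n: int) -> int:
    """Count the cyclically adjacent bit pairs of the n-bit number alpha that differ."""
    mask = (1 << n) - 1
    rotated = ((alpha << 1) | (alpha >> (n - 1))) & mask
    return bin(alpha ^ rotated).count("1")


def class_of(alpha: int, n: int) -> tuple:
    """Return (number of ones, alternation number / 2) for the n-bit number alpha."""
    return bin(alpha).count("1"), alternation_number(alpha, n) // 2


def check_locality(n: int) -> bool:
    """True if every shuffle edge stays in its class and every exchange edge is local."""
    mask = (1 << n) - 1
    for alpha in range(1 << n):
        i, j = class_of(alpha, n)
        shuffled = ((alpha << 1) & mask) | (alpha >> (n - 1))
        if class_of(shuffled, n) != (i, j):
            return False
        i2, j2 = class_of(alpha ^ 1, n)
        if abs(i2 - i) != 1 or abs(j2 - j) > 1:
            return False
    return True


if __name__ == "__main__":
    print(all(check_locality(n) for n in range(2, 11)))
```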

(i) To $A_i^j$ allocate a rectangle of width $2(1 + \max_i |A_i^j|)$ and height $1 + 3\max_j |A_i^j|$. The lowest horizontal line is used for shuffle edges (see (ii) below) while the other horizontal lines are partitioned into three groups, each containing one third of the lines. (This explains the factor 3 used to compute the height, see (iii).) Notice that all $A_i^j$ with the same value of $i$ are allocated rectangles with the same height, while if $j$ is held fixed then the width is the same.

(ii) For nodes in $A_{2i}^j$ ($A_{2i+1}^j$), allocate even (odd) locations along the lower face of the corresponding rectangle such that an element $\alpha$ and its powers under the shuffle operation are allocated consecutive positions, as illustrated below:

[Illustration: an element and its powers under the shuffle operation placed at consecutive positions along the lower face of a rectangle.]

(iii) To connect $A_{3k}^j$ to $A_{3k+1}^{j-1}$, $A_{3k+1}^{j}$ and $A_{3k+1}^{j+1}$ use the lowest $\max_j |A_{3k}^j|$ horizontal lines (which form one third of the height of the rectangle allocated to $A_{3k}^j$). This is always possible: for the $m$th node of $A_{3k}^j$, go up $m$ units, then turn left or right until reaching the vertical line on which the target is located, and then go straight up. To connect $A_{3k+1}^j$ to $A_{3k+2}^{j-1}$, $A_{3k+2}^{j}$ and $A_{3k+2}^{j+1}$ use the middle third, and to connect $A_{3k+2}^j$ to $A_{3k+3}^{j-1}$, $A_{3k+3}^{j}$ and $A_{3k+3}^{j+1}$ use the upper third. Notice that neither horizontal nor vertical segments are used more than once.
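The rectangle sizes of step (i) determine the overall bounding box. The sketch below computes it for small $n$ under one stated assumption: since the layout has matrix form, the total width is taken to be the sum of the column widths and the total height the sum of the row heights. The function names are hypothetical, and class sizes are obtained by brute-force enumeration.

```python
from collections import Counter


def alternation_number(alpha: int, n: int) -> int:
    """Count the cyclically adjacent bit pairs of the n-bit number alpha that differ."""
    mask = (1 << n) - 1
    rotated = ((alpha << 1) | (alpha >> (n - 1))) & mask
    return bin(alpha ^ rotated).count("1")


def class_sizes(n: int) -> Counter:
    """Map (i, j) to the number of n-bit numbers with i ones and 2j alternations."""
    sizes = Counter()
    for alpha in range(1 << n):
        i = bin(alpha).count("1")
        j = alternation_number(alpha, n) // 2
        sizes[(i, j)] += 1
    return sizes


def layout_dimensions(n: int) -> tuple:
    """Return (total width, total height) implied by the rectangles of step (i).

    Assumption: column widths and row heights are simply added up.
    """
    sizes = class_sizes(n)
    rows = range(n + 1)
    cols = range(n // 2 + 1)
    col_widths = [2 * (1 + max(sizes[(i, j)] for i in rows)) for j in cols]
    row_heights = [1 + 3 * max(sizes[(i, j)] for j in cols) for i in rows]
    return sum(col_widths), sum(row_heights)


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        width, height = layout_dimensions(n)
        print(n, width, height, width * height, (2 ** n) ** 2 / n)
```

The last printed column is $N^2/n$, included only so that the growth of the area can be compared by eye.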

The above layout shows that the (exchange) edges connecting elements of $A_i$ to elements of $A_{i+1}$ may all be realized in $\Theta(\max_j |A_i^j|)$ horizontal lines. Hence the height of the layout is no more than $O(\sum_i \max_j |A_i^j|)$. Since the width (i.e. the number of vertical lines required) is of the order of $\max_i \sum_j |A_i^j|$, the area of the layout is:

$$O\Bigl(\Bigl(\sum_i \max_j |A_i^j|\Bigr)\Bigl(\max_i \sum_j |A_i^j|\Bigr)\Bigr). \qquad (1)$$

We shall show that both factors are of the order of $N/\sqrt{n}$ and thereby prove the following theorem:

Theorem. The shuffle-exchange network with $N = 2^n$ nodes can be laid out in an area of $\Theta(N^2/n)$.

Let us consider $\sum_j |A_i^j|$ first. The sets $A_i^0, A_i^1, \ldots, A_i^i$ form a decomposition of $A_i$. Thus

$$\sum_j |A_i^j| = |A_i|.$$

$|A_i|$ counts the number of $n$-bit numbers containing $i$ ones, thus:

$$|A_i| = \binom{n}{i}.$$

The function $\binom{n}{i}$ receives its maximum when $i = n/2$. Approximating by Stirling's formula shows that

$$\binom{n}{n/2} = \Theta(2^n/\sqrt{n}).$$

Thus we have:

Lemma 1. $\max_i \sum_j |A_i^j| = \Theta(N/\sqrt{n})$.
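Lemma 1 can be illustrated numerically. The sketch below, with hypothetical function names, compares the largest row size $\max_i \binom{n}{i}$ with $2^n/\sqrt{n}$; by Stirling's formula the ratio should approach $\sqrt{2/\pi}$ as $n$ grows. Floating-point division is used, so the sketch is intended for moderate $n$ (a few hundred at most).

```python
import math


def max_row_size(n: int) -> int:
    """Largest |A_i| over all i, i.e. the largest binomial coefficient C(n, i)."""
    return max(math.comb(n, i) for i in range(n + 1))


def lemma1_ratio(n: int) -> float:
    """Ratio of max_i |A_i| to 2**n / sqrt(n)."""
    return max_row_size(n) / (2 ** n / math.sqrt(n))


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, lemma1_ratio(n), math.sqrt(2 / math.pi))
```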

The second lemma which we need is:

Lemma 2. $\sum_i \max_j |A_i^j| = \Theta(N/\sqrt{n})$.

To prove this lemma an exact expression for $|A_i^j|$ is first obtained and then $\max_j |A_i^j|$ is derived.

Lemma 3. For $j \ge 1$,

$$|A_i^j| = \frac{n}{i}\binom{i}{j}\binom{n-i-1}{j-1}.$$

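Proof. Let $\bar{A}_i^j(m)$ be the set of $m$-bit numbers with $i$ ones and $2j$ alternations whose least and most significant bits are both one. Partition the elements of $A_i^j$ into classes according to the total number $k$ of leading and trailing zeros in their binary representation (e.g. 0110100 has 3 such zeros), then:

$$|A_i^j| = |\bar{A}_i^j(n)| + \sum_{k \ge 1} (k+1)\,|\bar{A}_i^{j-1}(n-k)|. \qquad (2)$$

The factor $k + 1$ in the sum is due to the fact that $k \ge 1$ zeros can be distributed at the ends of the number in $k + 1$ ways. (Removing these zeros merges two runs of ones cyclically, which is why the remaining $(n-k)$-bit number has only $2(j-1)$ alternations.) The number $|\bar{A}_i^j(m)|$ is evaluated by counting the number of ways to distribute $j$ holes (to be filled with a total of $m - i$ zeros) in a string of $i$ ones. There are

$$\binom{i-1}{j}$$

ways of choosing the holes and

$$\binom{m-i-1}{j-1}$$

ways to distribute the $m - i$ zeros in any given set of holes. Therefore

$$|\bar{A}_i^j(m)| = \binom{i-1}{j}\binom{m-i-1}{j-1}. \qquad (3)$$

Substituting (3) in (2) yields:

$$|A_i^j| = \binom{i-1}{j}\binom{n-i-1}{j-1} + \binom{i-1}{j-1}\sum_{k \ge 1}(k+1)\binom{n-i-k-1}{j-2}. \qquad (4)$$

Replacing $n - i - 2$ by $m$, the $\Sigma$ expression in (4) may be written as:

$$\sum_{k \ge 1}(k+1)\binom{m+1-k}{j-2} = \sum_{l=j-2}^{m}\binom{l}{j-2} + \sum_{t=j-2}^{m}\;\sum_{l=j-2}^{t}\binom{l}{j-2}.$$

We now evaluate the inner sums and use the identity

$$\sum_{l=q}^{p}\binom{l}{q} = \binom{p+1}{q+1}$$

to obtain that this sum equals

$$\binom{m+1}{j-1} + \binom{m+2}{j} = \binom{n-i-1}{j-1} + \binom{n-i}{j}.$$

Substituting this result back in (4) yields the lemma:

$$|A_i^j| = \frac{n}{i}\binom{i}{j}\binom{n-i-1}{j-1}.$$

Since $\binom{i}{j} = (i/j)\binom{i-1}{j-1}$, this may also be written as $(n/j)\binom{i-1}{j-1}\binom{n-i-1}{j-1}$. Note that the expression

$$\binom{i-1}{j-1}\binom{n-i-1}{j-1}$$

is symmetric in $i$ (it is unchanged when $i$ is replaced by $n - i$) and therefore $|A_i^j| = |A_{n-i}^j|$. This fact is used in the sequel.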
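The closed form can be compared against a direct count. The sketch below is a brute-force check for small $n$, with hypothetical function names; it uses the equivalent form $(n/j)\binom{i-1}{j-1}\binom{n-i-1}{j-1}$ in exact integer arithmetic and treats the two classes with $j = 0$ (the all-zeros and all-ones numbers) separately, since the formula is stated only for $j \ge 1$.

```python
from collections import Counter
from math import comb


def alternation_number(alpha: int, n: int) -> int:
    """Count the cyclically adjacent bit pairs of the n-bit number alpha that differ."""
    mask = (1 << n) - 1
    rotated = ((alpha << 1) | (alpha >> (n - 1))) & mask
    return bin(alpha ^ rotated).count("1")


def class_size(n: int, i: int, j: int) -> int:
    """Size of the class with i ones and 2j alternations, via the closed form."""
    if j == 0:
        return 1 if i in (0, n) else 0
    if j > min(i, n - i):
        return 0
    return n * comb(i - 1, j - 1) * comb(n - i - 1, j - 1) // j


def brute_force_sizes(n: int) -> Counter:
    """Count every class by enumerating all n-bit numbers."""
    sizes = Counter()
    for alpha in range(1 << n):
        sizes[(bin(alpha).count("1"), alternation_number(alpha, n) // 2)] += 1
    return sizes


def formula_matches(n: int) -> bool:
    """True if the closed form agrees with the enumeration for every (i, j)."""
    sizes = brute_force_sizes(n)
    return all(
        class_size(n, i, j) == sizes[(i, j)]
        for i in range(n + 1)
        for j in range(n // 2 + 1)
    )


if __name__ == "__main__":
    print(all(formula_matches(n) for n in range(1, 13)))
```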

Lemma 4. For $1 \le j \le i \le n/2$, $\max_j |A_i^j|$ is obtained at $j = \lceil i(n-i)/(n+1) \rceil$.

Proof. Let $f(i,j) = |A_i^j|/|A_i^{j+1}|$ for $1 \le j < i \le n/2$ (restricting $i$ and $j$ in this way guarantees $|A_i^{j+1}| \ne 0$). By Lemma 3,

$$f(i,j) = [j(j+1)]/[(i-j)(n-i-j)].$$

Observe that if we fix $i$ and view $f$ as a continuous function in $j$, then $f$ is strictly increasing in $j$ and the equation $f(i, X) = 1$ has a unique solution, namely $X = i(n-i)/(n+1)$. This implies that the sequence $|A_i^1|, |A_i^2|, \ldots, |A_i^i|$ has a unique maximum. Furthermore, it is easily seen that the maximum is attained for $j \in [X, X+1)$, i.e. for $j = \lceil i(n-i)/(n+1) \rceil$.
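The location of the maximum can be confirmed numerically. The following sketch, with hypothetical function names, evaluates the class sizes through the closed form $(n/j)\binom{i-1}{j-1}\binom{n-i-1}{j-1}$ and checks, for every $i$ with $1 \le i \le n/2$, that the value at $j = \lceil i(n-i)/(n+1) \rceil$ equals the largest class size in row $i$.

```python
from math import comb


def class_size(n: int, i: int, j: int) -> int:
    """Closed-form class size, valid for 1 <= j <= min(i, n - i)."""
    return n * comb(i - 1, j - 1) * comb(n - i - 1, j - 1) // j


def argmax_formula(n: int, i: int) -> int:
    """ceil(i * (n - i) / (n + 1)), computed in exact integer arithmetic."""
    return -(-(i * (n - i)) // (n + 1))


def check_lemma4(n: int) -> bool:
    """True if the formula hits the row maximum for every 1 <= i <= n / 2."""
    for i in range(1, n // 2 + 1):
        sizes = {j: class_size(n, i, j) for j in range(1, i + 1)}
        if sizes[argmax_formula(n, i)] != max(sizes.values()):
            return False
    return True


if __name__ == "__main__":
    print(all(check_lemma4(n) for n in range(2, 101)))
```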


We proceed with the proof of Lemma 2.

Proof of Lemma 2. Let $b_i = \max_j |A_i^j|$; then by Lemmas 3 and 4,

$$b_i = \frac{n}{i}\binom{i}{d_i}\binom{n-i-1}{d_i-1},$$

where

$$d_i = \lceil i(n-i)/(n+1) \rceil.$$

Since $|A_i^j|$ is symmetric in $i$ around $i = n/2$, it suffices to bound $\sum_{i \le \lfloor n/2 \rfloor} b_i$. In order to simplify the computation we assume that $n$ has the form $n = 4s^2$ for some integer $s$. We now decompose the interval $I = [1, n/2]$ into subintervals $I_k$ ($k = 0, 1, \ldots, \sqrt{n}/2 - 1$) such that

$$I_k = [n/2 - (k+1)\sqrt{n} + 1,\; n/2 - k\sqrt{n}].$$
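To make the decomposition concrete, the sketch below lists the subintervals for a given $n$ of the required form. The function name `intervals` is hypothetical; each interval is returned as a pair of inclusive integer endpoints.

```python
import math


def intervals(n: int) -> list:
    """Return the subintervals I_k of [1, n/2] as (low, high) pairs, k = 0, 1, ...

    Requires n = 4 * s**2 for some positive integer s, so that sqrt(n) is an
    even integer.
    """
    root = math.isqrt(n)
    if n <= 0 or root * root != n or root % 2 != 0:
        raise ValueError("n must be of the form 4 * s**2")
    half = n // 2
    return [(half - (k + 1) * root + 1, half - k * root) for k in range(root // 2)]


if __name__ == "__main__":
    print(intervals(16))
    print(intervals(36))
```

For $n = 16$ this gives the two intervals $[5, 8]$ and $[1, 4]$, each containing $\sqrt{n} = 4$ integers.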

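A simple check shows that $|A_i^j|$ is an increasing function of $i$, for $i \le n/2$; thus $b_i$ must also be increasing in $i$, for $i \le n/2$. Since each $I_k$ contains $\sqrt{n}$ integers and $b_i$ is largest at its right endpoint, we therefore have

$$\sum_{i \in I} b_i \le \sqrt{n} \sum_{0 \le k < \sqrt{n}/2} b_{n/2 - k\sqrt{n}}. \qquad (5)$$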

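Substituting $i = n/2 - k\sqrt{n}$ into the formula for $b_i$ we obtain:

$$b_{n/2 - k\sqrt{n}} = \frac{n}{n/2 - k\sqrt{n}} \binom{n/2 - k\sqrt{n}}{\lceil (n^2/4 - k^2 n)/(n+1) \rceil} \binom{n/2 + k\sqrt{n} - 1}{\lceil (n^2/4 - k^2 n)/(n+1) \rceil - 1}. \qquad (6)$$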

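The third factor in (6) obeys (note that $\lceil (n^2/4 - k^2 n)/(n+1) \rceil = n/4 - k^2$ for $0 \le k < \sqrt{n}/2$):

$$\binom{n/2 + k\sqrt{n} - 1}{\lceil (n^2/4 - k^2 n)/(n+1) \rceil - 1} < \binom{n/2 + k\sqrt{n}}{n/4 - k^2}. \qquad (7)$$

But the right-hand side of (7) can be compared to

$$\binom{n/2 + k\sqrt{n}}{n/4 + k\sqrt{n}/2},$$

yielding, for $k \ge 1$ (for $k = 0$ the ratio equals 1):

$$\binom{n/2 + k\sqrt{n}}{n/4 - k^2} \Big/ \binom{n/2 + k\sqrt{n}}{n/4 + k\sqrt{n}/2}$$

$$= (n/4 + k\sqrt{n}/2)(n/4 + k\sqrt{n}/2 - 1) \times \cdots \times (n/4 - k^2 + 1)$$

$$\times\, [(n/4 + k^2 + k\sqrt{n})(n/4 + k^2 + k\sqrt{n} - 1) \times \cdots \times (n/4 + k\sqrt{n}/2 + 1)]^{-1}$$

$$\le \left(\frac{n/4 + k\sqrt{n}/2}{n/4 + k^2 + k\sqrt{n}}\right)^{k\sqrt{n}/2 + k^2}$$

$$= \left\{[1 - 1/(1 + \sqrt{n}/2k)]^{\sqrt{n}/2k + 1}\right\}^{k^2}$$

$$< e^{-k^2}.$$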

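Therefore, the third factor in (6) is bounded by

$$e^{-k^2} \binom{n/2 + k\sqrt{n}}{n/4 + k\sqrt{n}/2}.$$

The second factor in (6) is clearly bounded by

$$\binom{n/2 - k\sqrt{n}}{n/4 - k\sqrt{n}/2},$$

while the first factor obeys:

$$\frac{n}{n/2 - k\sqrt{n}} = \frac{2\sqrt{n}}{\sqrt{n} - 2k} \cdot \frac{\sqrt{n} + 2k}{\sqrt{n} + 2k} < \frac{4n}{n - 4k^2},$$

since $k < \sqrt{n}/2$. Thus,

$$b_{n/2 - k\sqrt{n}} < \frac{4n}{n - 4k^2} \binom{n/2 - k\sqrt{n}}{n/4 - k\sqrt{n}/2} \times e^{-k^2} \binom{n/2 + k\sqrt{n}}{n/4 + k\sqrt{n}/2}.$$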

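Using Stirling's approximation twice shows that:

$$b_{n/2 - k\sqrt{n}} < \frac{4n}{n - 4k^2} \cdot \frac{2^{n/2 - k\sqrt{n}}}{\sqrt{(\pi/2)(n/2 - k\sqrt{n})}} \cdot e^{-k^2} \cdot \frac{2^{n/2 + k\sqrt{n}}}{\sqrt{(\pi/2)(n/2 + k\sqrt{n})}}$$

$$= (16/\pi)\,\sqrt{n}\; 2^n\, e^{-k^2} (n - 4k^2)^{-3/2}.$$

Substituting in (5) yields:

$$\sum_{i \in I} b_i < (16/\pi)\; n\, 2^n \sum_{0 \le k < \sqrt{n}/2} e^{-k^2} (n - 4k^2)^{-3/2}. \qquad (8)$$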

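Let $a_k = e^{-k^2}(n - 4k^2)^{-3/2}$. For $1 \le k \le \sqrt{n}/2 - 2$ we obtain:

$$\frac{a_{k+1}}{a_k} = \frac{e^{-(k+1)^2}}{e^{-k^2}} \left(\frac{n - 4k^2}{n - 4(k+1)^2}\right)^{3/2}$$

$$= e^{-2k-1} \left(1 + \frac{8k + 4}{n - 4(k+1)^2}\right)^{3/2}$$

$$< e^{-2k-1} \left(1 + \frac{4\sqrt{n} - 12}{4\sqrt{n} - 4}\right)^{3/2}$$

$$< \sqrt{8}\, e^{-2k-1}.$$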


Therefore, the sequence $a_1, a_2, \ldots, a_{\sqrt{n}/2 - 2}$ converges faster than geometrically, independent of $n$. Thus $\sum_{1 \le k \le \sqrt{n}/2 - 2} a_k = O(a_1) = O(n^{-3/2})$. A direct inspection of $a_0$ shows that it is of the order of $n^{-3/2}$, while $a_{\sqrt{n}/2 - 1}$ decreases to zero much faster than $n^{-3/2}$. Substituting in (8) yields:

$$\sum_{i \in I} b_i = O(2^n/\sqrt{n}).$$

Since $b_0 = 1$, Lemma 2 is proven.
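The behaviour stated in Lemma 2 can be observed numerically for small $n$. The sketch below, with hypothetical function names, computes $\sum_i \max_j |A_i^j|$ exactly from the closed form $(n/j)\binom{i-1}{j-1}\binom{n-i-1}{j-1}$ and divides it by $2^n/\sqrt{n}$. It only illustrates that the ratio stays bounded for the tested values and is not a substitute for the proof.

```python
from math import comb, sqrt


def class_size(n: int, i: int, j: int) -> int:
    """Closed-form class size, valid for 1 <= j <= min(i, n - i)."""
    return n * comb(i - 1, j - 1) * comb(n - i - 1, j - 1) // j


def row_max(n: int, i: int) -> int:
    """Largest class in row i; rows 0 and n hold a single number each."""
    if i in (0, n):
        return 1
    return max(class_size(n, i, j) for j in range(1, min(i, n - i) + 1))


def sum_of_row_maxima(n: int) -> int:
    """Exact value of the sum over i of max_j |A_i^j|."""
    return sum(row_max(n, i) for i in range(n + 1))


def lemma2_ratio(n: int) -> float:
    """Ratio of the sum of row maxima to 2**n / sqrt(n)."""
    return sum_of_row_maxima(n) / (2 ** n / sqrt(n))


if __name__ == "__main__":
    for n in (4, 16, 36, 64, 100, 256):
        print(n, lemma2_ratio(n))
```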

It is interesting to note that the only property of the shuffle permutation exploited in the above layout was that it maps the sets $A_i$ onto themselves. The layout is therefore applicable to any network where the shuffle is replaced by an arbitrary index-bit permutation (i.e. permutations defined by $\alpha_{n-1}\alpha_{n-2}\cdots\alpha_0 \to \alpha_{\pi(n-1)}\alpha_{\pi(n-2)}\cdots\alpha_{\pi(0)}$, where $\pi$ is a permutation of $\{0, 1, \ldots, n-1\}$).
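An index-bit permutation is easy to state in code. In the sketch below the permutation $\pi$ is assumed to be given as a Python list `pi` with `pi[t]` equal to $\pi(t)$; the function names are hypothetical. The shuffle is recovered as the special case $\pi(t) = (t - 1) \bmod n$, and the number of ones is preserved for every choice of $\pi$, which is the property the remark relies on.

```python
from itertools import permutations


def index_bit_permutation(alpha: int, pi: list) -> int:
    """Return the number whose bit t equals bit pi[t] of alpha."""
    result = 0
    for t, source in enumerate(pi):
        result |= ((alpha >> source) & 1) << t
    return result


def shuffle_as_permutation(n: int) -> list:
    """The index-bit permutation that reproduces the shuffle on n bits."""
    return [(t - 1) % n for t in range(n)]


def preserves_ones(n: int) -> bool:
    """True if every index-bit permutation on n bits keeps the number of ones."""
    return all(
        bin(index_bit_permutation(alpha, list(pi))).count("1") == bin(alpha).count("1")
        for pi in permutations(range(n))
        for alpha in range(1 << n)
    )


if __name__ == "__main__":
    print(index_bit_permutation(0b011, shuffle_as_permutation(3)) == 0b110)
    print(preserves_ones(4))
```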

References

[1] H.S. Stone, Parallel processing with the perfect shuffle, IEEE Trans. Comput. C-20 (1971) 153-161.

[2] T. Lang, Interconnections between processing and memory modules using the shuffle-exchange network, IEEE Trans. Comput. C-25 (1976) 55-66.

[3] D. Lawrie, Access and alignment of data in an array processor, IEEE Trans. Comput. C-24 (1975) 1145-1155.

[4] D. Nassimi and S. Sahni, A self-routing Benes network and parallel permutation algorithms, University of Minnesota, TR-79-13 (1979).

[5] J.T. Schwartz, Ultracomputers, ACM Trans. Programming Languages and Systems 2 (4) (1980) 484-521.

[6] D.S. Parker, Notes on shuffle/exchange-type networks, IEEE Trans. Comput. C-29 (1980) 213-222.

[7] C.D. Thompson, A complexity theory for VLSI, Ph.D. Thesis, Carnegie-Mellon University, Comput. Sci. Dept. (1980).

[8] C.D. Thompson, Area-time complexity for VLSI, Proc. 11th Annual ACM Symposium on Theory of Computing (1979) 81-88.

[9] D. Steinberg and M. Rodeh, An $O(N^2/\log N)$ layout of the shuffle-exchange network, Dept. of Applied Math., Weizmann Inst., Rehovot, Israel, TR-80-2 (1980).

[10] D. Hoey and C.E. Leiserson, A layout for the shuffle-exchange network, Proc. 1980 Internat. Conference on Parallel Processing (1980) 329-336.

[11] D. Steinberg and M. Rodeh, A layout for the shuffle-exchange network with $\Theta(N^2/\log^{3/2} N)$ area, Dept. of Applied Math., Weizmann Inst., Rehovot, Israel, TR-80-3 (1980).

[12] D. Hoey and C.E. Leiserson, Addendum to CMU-CS-80-139, Carnegie-Mellon University (1980).

[13] T. Leighton, M.A. Lepley and G. Miller, personal communication (1980).

[14] F.P. Preparata and J. Vuillemin, The cube connected cycles: a versatile network for parallel computation, Proc. 20th Annual Symposium on Foundations of Computer Science, IEEE Computer Society (1979).
