
Algorithms for Discrete Fourier Transform and Convolution

Second Edition

Springer New York Berlin Heidelberg Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo


Signal Processing and Digital Filtering

Synthetic Aperture Radar J.P. Fitch

Multiplicative Complexity, Convolution and the DFT M.T. Heideman

Array Signal Processing S.U. Pillai

Maximum Likelihood Deconvolution J.M. Mendel

Algorithms for Discrete Fourier Transform and Convolution Second Edition T. Tolimieri, M. An, and C. Lu

Algebraic Methods for Signal Processing and Communications Coding R.E. Blahut

Electromagnetic Devices for Motion Control and Signal Processing Y.M. Pulyer

Mathematics of Multidimensional Fourier Transform Algorithms Second Edition R. Tolimieri, M. An, and C. Lu

Lectures on Discrete Time Filtering R.S. Bucy

Distributed Detection and Data Fusion P.K. Varshney


Richard Tolimieri Myoung An Chao Lu

Algorithms for Discrete Fourier Transform

and Convolution Second Edition

C.S. Burrus, Consulting Editor


Springer


Richard Tolimieri
Department of Electrical Engineering
City College of CUNY
New York, NY 10037, USA

Myoung An
A.J. Devaney Associates
52 Ashford Street
Allston, MA 02134, USA

Chao Lu
Department of Computer and Information Sciences
Towson State University
Towson, MD 21204, USA

Consulting Editor Signal Processing and Digital Filtering

C.S. Burrus
Professor and Chairman
Department of Electrical and Computer Engineering
Rice University
Houston, TX 77251-1892, USA

Library of Congress Cataloging-in-Publication Data Tolimieri, Richard, 1940-

Algorithms for discrete Fourier transform and convolution / Richard Tolimieri, Myoung An, Chao Lu.

p. cm. — (Signal processing and digital filtering) Includes bibliographical references (p. — ) and index. ISBN 0-387-98261-2 (alk. paper) 1. Fourier transformations—Data processing. 2. Convolutions

(Mathematics)— Data processing. 3. Digital filters (Mathematics) I. An, Myoung. II. Lu, Chao. III. Title. IV. Series. QA403.5.T65 1997 515 '.723 —dc21 97-16667

Printed on acid-free paper.

© 1997 Springer-Verlag New York, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Anthony Battle; manufacturing supervised by Johanna Tschebull. Photocomposed pages prepared from the authors' LaTeX files. Printed and bound by Braun-Brumfield, Inc., Ann Arbor, MI. Printed in the United States of America.

9 8 7 6 5 4 3 2 1

ISBN 0-387-98261-2 Springer-Verlag New York Berlin Heidelberg SPIN 10629424


Preface

This book is based on several courses taught during the years 1985-1989 at the City College of the City University of New York and at Fudan University, Shanghai, China, in the summer of 1986. It was originally our intention to present to a mixed audience of electrical engineers, mathematicians and computer scientists at the graduate level a collection of algorithms that would serve to represent the vast array of algorithms designed over the last twenty years for computing the finite Fourier transform (FFT) and finite convolution. However, it was soon apparent that the scope of the course had to be greatly expanded. For researchers interested in the design of new algorithms, a deeper understanding of the basic mathematical concepts underlying algorithm design was essential. At the same time, a large gap remained between the statement of an algorithm and the implementation of the algorithm. The main goal of this text is to describe tools that can serve both of these needs. In fact, it is our belief that certain mathematical ideas provide a natural language and culture for understanding, unifying and implementing a wide range of digital signal processing (DSP) algorithms. This belief is reinforced by the complex and time-consuming effort required to write code for recently available parallel and vector machines. A significant part of this text is devoted to establishing rules and procedures that reduce and at times automate this task.

In chapter 1, a survey is given of basic algebra. It is assumed that much of this material is not new; in any case, the facts are easily described. The tensor product is introduced in chapter 2. The importance of the tensor product will be a recurring theme throughout this text. Tensor product factors have a direct interpretation as machine instructions on many vector


and parallel computers. Tensor product identities provide linguistic rules for manipulating algorithms to match specific machine characteristics. Inherent in these rules are certain permutations, called stride permutations, which can be implemented on many machines by a single instruction. The tedious effort of readdressing, required in many DSP algorithms, is greatly reduced. Also, the data flow is highlighted, which is especially important on supercomputers where the data flow is usually the major factor that determines the efficiency of the computation.

The design of fast FFT algorithms can be dated back historically to the time of Gauss (1805) [1]. The collected works of Gauss, including some unpublished manuscripts, contain the essentials of the Cooley-Tukey FFT algorithm, but they did not attract much attention. In 1965, when Cooley and Tukey published their paper [2], known as the fast Fourier transform algorithm, computing science entered a revolutionary era. Since then, many variants of the Cooley-Tukey FFT algorithm have been developed. In chapters 3 and 4 of this book, the Cooley-Tukey FFT algorithm, along with its many variants, are unified under the banner of tensor product. From one point of view, these algorithms depend on mapping a one-dimensional array of data onto a multidimensional array of data (depending on the degree of compositeness of the transform size). Using tensor product we need to derive only the simplest case of mapping into a two-dimensional array. Tensor product identities can then be used to derive the general case. An explicit description of the data flow is automatically given along with rules for varying this data flow, if necessary. In chapter 5, the Good-Thomas prime factor algorithm is reformulated by tensor product.

In chapters 6 and 7, various linear and cyclic convolution algorithms are described. The Chinese Remainder Theorem (CRT) for polynomials is the major tool. Matrix and tensor product formulations are used wherever possible. Results of Cook and Toom and Winograd are emphasized. The integer CRT is applied in chapter 7 to build a large convolution algorithm from efficient small convolution algorithms (Agarwal-Cooley).

The scene changes in chapter 8. Various multiplicative FFT algorithms (depending on the ring structure of the indexing set) are described. The prime size algorithms are due to Rader. Winograd generalized Rader's method to composite transform size. We emphasize a variant of the Rader-Winograd method. Tensor product language is used throughout, and tensor product identities serve as powerful tools for obtaining several variants that offer arithmetic and data flow options.

In chapter 13, we consider the duality between periodic and decimated data established by the Fourier transform. This duality is applied to the computation of the Fourier transform of odd prime power transform sizes, say p^k. The ring structure of the indexing set has an especially simple ideal structure (local ring). The main result decomposes the computation of the Fourier transform into two pieces. The first is a Fourier transform of transform size p^(k-2). The description of the second is the main objective


of chapters 14 and 15, where we introduce the theory of multiplicative characters and derive formulas for computing the Fourier transform of multiplicative characters.

The authors are indebted to the patience and knowledge of L. Auslander, J. Cooley and S. Winograd, who, over the years at IBM, Yorktown Heights, and at the City University of New York, have taken time to explain their works and ideas. The authors wish to thank C. S. Burrus, who read the manuscript of the book and suggested many improvements. We wish to thank DARPA for its support during the formative years of writing this book, and AFOSR for its support during the last two years, in which time the ideas of this book have been tested and refined in applications to electromagnetics, multispectral imaging, and imaging through turbulence. Whatever improvements appear in this revision are due to the routines written in these applications.

Richard Tolimieri Myoung An

Chao Lu


Contents

Preface

1 Review of Applied Algebra 1
  1.1 Introduction 1
  1.2 The Ring of Integers 2
  1.3 The Ring Z/N 5
  1.4 Chinese Remainder Theorem (CRT) 7
  1.5 Unit Groups 11
  1.6 Polynomial Rings 13
  1.7 Field Extension 17
  1.8 The Ring F[x]/f(x) 18
  1.9 CRT for Polynomial Rings 21
  References 23
  Problems 23

2 Tensor Product and Stride Permutation 27
  2.1 Introduction 27
  2.2 Tensor Product 28
  2.3 Stride Permutations 33
  2.4 Multidimensional Tensor Products 40
  2.5 Vector Implementation 44
  2.6 Parallel Implementation 50
  References 53
  Problems 53

3 Cooley-Tukey FFT Algorithms 55
  3.1 Introduction 55
  3.2 Basic Properties of FT Matrix 56
  3.3 An Example of an FT Algorithm 57
  3.4 Cooley-Tukey FFT for N = 2M 59
  3.5 Twiddle Factors 61
  3.6 FT Factors 63
  3.7 Variants of the Cooley-Tukey FFT 64
  3.8 Cooley-Tukey FFT for N = ML 66
  3.9 Arithmetic Cost 68
  References 69
  Problems 70

4 Variants of FT Algorithms and Implementations 71
  4.1 Introduction 71
  4.2 Radix-2 Cooley-Tukey FFT Algorithm 72
  4.3 Pease FFT Algorithm 76
  4.4 Auto-Sorting FT Algorithm 79
  4.5 Mixed Radix Cooley-Tukey FFT 81
  4.6 Mixed Radix Agarwal-Cooley FFT 84
  4.7 Mixed Radix Auto-Sorting FFT 85
  4.8 Summary 87
  References 89
  Problems 90

5 Good-Thomas PFA 91
  5.1 Introduction 91
  5.2 Indexing by the CRT 92
  5.3 An Example, N = 15 93
  5.4 Good-Thomas PFA for the General Case 96
  5.5 Self-Sorting PFA 98
  References 99
  Problems 100

6 Linear and Cyclic Convolutions 101
  6.1 Definitions 101
  6.2 Convolution Theorem 107
  6.3 Cook-Toom Algorithm 111
  6.4 Winograd Small Convolution Algorithm 119
  6.5 Linear and Cyclic Convolutions 125
  6.6 Digital Filters 131
  References 133
  Problems 134

7 Agarwal-Cooley Convolution Algorithm 137
  7.1 Two-Dimensional Cyclic Convolution 137
  7.2 Agarwal-Cooley Algorithm 142
  References 145
  Problems 145

8 Multiplicative Fourier Transform Algorithm 147
  References 153

9 MFTA: The Prime Case 155
  9.1 The Field Z/p 155
  9.2 The Fundamental Factorization 157
  9.3 Rader's Algorithm 162
  9.4 Reducing Additions 163
  9.5 Winograd Small FT Algorithm 167
  9.6 Summary 169
  References 171
  Problems 171

10 MFTA: Product of Two Distinct Primes 173
  10.1 Basic Algebra 173
  10.2 Transform Size: 15 175
  10.3 Fundamental Factorization: 15 176
  10.4 Variants: 15 178
  10.5 Transform Size: pq 181
  10.6 Fundamental Factorization: pq 183
  10.7 Variants 185
  10.8 Summary 189
  References 190
  Problems 191

11 MFTA: Composite Size 193
  11.1 Introduction 193
  11.2 Main Theorem 193
  11.3 Product of Three Distinct Primes 196
  11.4 Variants 197
  11.5 Transform Size: 12 198
  11.6 Transform Size: 4p, p odd prime 198
  11.7 Transform Size: 60 199
  References 202
  Problems 202

12 MFTA: p^2 203
  12.1 Introduction 203
  12.2 An Example: 9 203
  12.3 The General Case: p^2 206
  12.4 An Example: 33 212
  References 214
  Problems 215

13 Periodization and Decimation 217
  13.1 Introduction 217
  13.2 Periodic and Decimated Data 220
  13.3 FT of Periodic and Decimated Data 223
  13.4 The Ring Z/p^m 225
  Problems 227

14 Multiplicative Characters and the FT 229
  14.1 Introduction 229
  14.2 Periodicity 232
    14.2.1 Periodic Multiplicative Characters 232
    14.2.2 Periodization and Decimation 235
  14.3 F(p) of Multiplicative Characters 237
  14.4 F(r) of Multiplicative Characters 239
    14.4.1 Primitive Multiplicative Characters 239
    14.4.2 Nonprimitive Multiplicative Characters 240
  14.5 Orthogonal Basis Diagonalizing F(p) 242
  14.6 Orthogonal Basis Diagonalizing F(p^m) 245
    14.6.1 Orthogonal Basis of W 245
    14.6.2 Orthogonal Diagonalizing Basis 246
  References 247
  Problems 248

15 Rationality 249
  15.1 An Example: 7 250
  15.2 Prime Case 252
  15.3 An Example: 32 254
  15.4 Transform Size: p^2 256
  15.5 Exponential Basis 260
  15.6 Polynomial Product Modulo a Polynomial 260
  15.7 An Example: 33 262
  References 264

Index 265


1 Review of Applied Algebra

1.1 Introduction

In this chapter we will give a brief account of several important results from applied algebra necessary to develop the algorithms in this text. In particular, we will describe the main properties of the following rings:

• Ring of integers Z.

• Quotient ring Z/N of integers modulo N.

• Ring of polynomials F[x] over the field F.

• Quotient ring F[x]/f(x) of polynomials.

The Chinese Remainder Theorem for the ring of integers and the ring of polynomials will be treated in detail with special emphasis on the use of complete systems of idempotents to define the Chinese Remainder ring-isomorphism. This ring-isomorphism diagonalizes convolutional operations, in a sense to be described in later chapters.

In the next chapter, we will introduce the tensor or Kronecker product of matrices, a subject in applied linear algebra, and develop the algebra of tensor products, especially the commutation theorem of tensor products. This algebra, along with the algebra of stride permutations, will provide powerful tools for modeling a wide range of algorithms and for constructing large classes of algorithmic variants with well-defined parameters quantifying computational and communication characteristics.


1.2 The Ring of Integers

The ring of integers Z satisfies the following important condition:

Divisibility Condition. If a and b are integers with b > 0, then we can write

a = bq + r, 0 ≤ r < b, (1.1)

where q and r are uniquely determined integers. The integer q is called the quotient of the division of a by b, and it is the largest integer satisfying bq ≤ a.

The integer r is called the remainder of the division of a by b and is given by the formula

r = a − bq.

If r = 0 in (1.1), then a = bq,

and we say that b divides a or that a is a multiple of b, and we write

b I a.

An integer p > 1 is called a prime if its only divisors are ±1 and ±p. An integer c is called a common divisor of integers a and b if c | a and c | b. The integers 1 and −1 are common divisors of any two integers. If 1 and −1 are the only common divisors of a and b, we say that a and b are relatively prime. There are only a finite number of common divisors of two integers a and b as long as a ≠ 0 or b ≠ 0. Denote the largest common divisor of integers a and b by

(a, b).

We call (a, b) the greatest common divisor of a and b. If b = p, a prime, then either (a, p) = 1 and a and p are relatively prime, or (a, p) = p and p | a.

Fix an integer N, and set

(N) = NZ = {Nk : k ∈ Z},

the set of all multiples of N. The set NZ is an ideal of the ring Z in the sense that it is closed under addition,

Nk + Nl = N(k + l),

and closed under multiplication by Z,

m(Nk) = (mN)k = N(mk).

We will now prove a fundamental property of the ring Z.


Lemma 1.1 Every ideal of Z has the form NZ, for some integer N ≥ 0.

Proof Suppose that M is an ideal in Z. If M = (0) = 0Z, we are done. Otherwise, M contains positive integers and a smallest positive integer, say N. Take any c ∈ M and write, using (1.1),

c = Nq + r, 0 ≤ r < N.

Since c and N are in M,

r = c − Nq

is also in M. However, 0 ≤ r < N, which contradicts the definition of N unless r = 0. Thus c = Nq and M = NZ, proving the lemma.

We see from the proof of lemma 1.1 that any ideal M ≠ (0) in Z can be written as NZ, where N is the smallest positive integer in M. We will use lemma 1.1 to give an important description of the greatest common divisor.

Lemma 1.2 For nonzero integers a and b,

(a, b) = ax0 + by0,

for some integers x0 and y0.

Proof The set

M = {ax + by : x, y ∈ Z}

is an ideal of Z. By lemma 1.1,

M = dZ,

where d is the smallest positive integer in M. In particular, since a and b are in M, d is a common divisor of a and b. Now write

d = ax0 + by0,

and observe that any common divisor of a and b must divide d, proving the lemma.
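The integers x0 and y0 of lemma 1.2 can be produced by the extended Euclidean algorithm. The following Python sketch is ours, not the text's; the function name ext_gcd is an assumption made for illustration.

    def ext_gcd(a, b):
        """Return (d, x, y) with d = (a, b) and d = a*x + b*y."""
        old_r, r = a, b
        old_x, x = 1, 0
        old_y, y = 0, 1
        while r != 0:
            q = old_r // r
            old_r, r = r, old_r - q * r
            old_x, x = x, old_x - q * x
            old_y, y = y, old_y - q * y
        return old_r, old_x, old_y

    d, x0, y0 = ext_gcd(84, 30)
    print(d, x0, y0)                  # 6 -1 3, since 84*(-1) + 30*3 = 6
    assert 84 * x0 + 30 * y0 == d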

From the proof of lemma 1.2, we have that every common divisor of a and b divides (a, b), which can also be characterized as the smallest positive integer in the set

M = {ax + by : x, y ∈ Z}.

As a consequence, (a, b) = 1 if and only if ax0 + by0 = 1, for some integers x0 and y0. We will use this to prove the following result.

Lemma 1.3 If a | bc and (a, b) = 1, then a | c. In particular, if p is prime and p | bc, then p | b or p | c.


Proof Since (a, b) = 1,

ax0 + by0 = 1,

for some integers x0 and y0. Then

c = cax0 + cby0.

Since a | cax0 and, by assumption, a | bc, we have that a | c. To prove the second statement, we observe that if p does not divide b, then (p, b) = 1. Applying the first part completes the proof.

We have all of the ingredients needed to prove the fundamental prime factorization theorem.

Theorem 1.1 If N > 1 is an integer, then N can be written uniquely, up to ordering of the factors, as

N = p1^a1 p2^a2 · · · pr^ar,

where p1, ..., pr are distinct primes and a1 > 0, ..., ar > 0 are integers.

Proof We first prove the existence of such a factorization. If N is prime, we are done. Otherwise, write N = N1N2, where 1 < N1, N2 < N. By mathematical induction, assume that N1 and N2 have factorizations of the given form. Then their product N = N1N2 can be written as a product of primes. Collecting like primes, N has a factorization of the given form. Suppose that

N = q1^b1 q2^b2 · · · qs^bs

is a second factorization of the same form. Then q1 | N. If q1 ≠ pj, 1 ≤ j ≤ r, then (q1, pj) = 1. Lemma 1.3 now implies that q1 does not divide pj^aj. In this case, a second application of lemma 1.3 implies that q1 does not divide N, a contradiction. It follows that q1 = pj for some 1 ≤ j ≤ r. Continuing in this way, and reordering the factors if necessary, we have s ≤ r and

qk = pk, 1 ≤ k ≤ s.

Reversing the roles of the prime factors, we have s = r and

qk = pk, 1 ≤ k ≤ r.

Suppose that a1 < b1. Applying the above argument to the integer

m = N / p1^a1 = p2^a2 · · · pr^ar = p1^(b1-a1) p2^b2 · · · pr^br,

we have b1 = a1. Continuing in this way, uniqueness follows, completing the proof of the theorem.

Take an integer N > 1 and a prime p dividing N. Suppose that a is the largest positive integer satisfying

p^a | N.

By the proof of Theorem 1.1, p^a appears in the prime factorization of N. If q is another prime divisor of N, and b is the largest positive integer satisfying

q^b | N,

then p^a q^b appears in the prime factorization of N. This discussion leads to the following corollary of Theorem 1.1.

Corollary 1.1 If a | c, b | c and (a, b) = 1, then

ab | c.

Proof Since (a, b) = 1, the prime factors of a and b are distinct. Repeated application of the above discussion proves the corollary.

1.3 The Ring Z/N

Fix an integer N > 1. For any integer a, set

a mod N

equal to the remainder of the division of N into a. In particular,

0 ≤ (a mod N) < N.

Set Z/N = {0, 1, 2, ..., N − 1}.

Define addition in Z/N by

(a + b) mod N, a, b ∈ Z/N,

and multiplication in Z/N by

(ab) mod N, a, b ∈ Z/N.

Straightforward computation shows that Z/N becomes a commutative ring with identity 1 under these operations.

Consider the mapping

η : Z → Z/N,

defined by

η(a) = a mod N.


The mapping η is a ring-homomorphism in the sense that

η(a + b) = (η(a) + η(b)) mod N,

η(ab) = (η(a)η(b)) mod N.

Two integers a and b are said to be congruent mod N if η(a) = η(b) or, equivalently,

N | (a − b).

In this case we write

a ≡ b mod N.

The unit group of Z/N, denoted by U(N), consists of all elements a ∈ Z/N that have multiplicative inverses b ∈ Z/N:

1 = (ab) mod N.

To show that a ∈ U(N), it suffices to find an element b ∈ Z such that

1 ≡ ab mod N, (1.2)

since it then follows that

1 = (a(b mod N)) mod N.

Straightforward verification shows that U(N) is a group under the ring-multiplication in Z/N.

Example 1.1 Take N = 9. Then

U(9) = {1, 2, 4, 5, 7, 8}.

The group table of U(9) under multiplication mod 9 is as follows:

Table 1.1 Multiplication table of U(9).

      1  2  4  5  7  8
  1   1  2  4  5  7  8
  2   2  4  8  1  5  7
  4   4  8  7  2  1  5
  5   5  1  2  7  8  4
  7   7  5  1  8  4  2
  8   8  7  5  4  2  1

Lemma 1.4 U(N) = {a ∈ Z/N : (a, N) = 1}.

Proof By the remarks following lemma 1.2, (a, N) = 1 if and only if

ax0 + Ny0 = 1,

for some integers x0 and y0. Equivalently, (a, N) = 1 if and only if

ax0 ≡ 1 mod N,

which, by (1.2), implies that x0 mod N is the multiplicative inverse of a in Z/N, proving the lemma.

Example 1.2 Take N = 15. Then

U(15) = {1, 2, 4, 7, 8, 11, 13, 14}.

As a special case of lemma 1.4, if p is a prime, then

U(p) = {1, 2, ..., p − 1},

and every nonzero element in Z/p has a multiplicative inverse. Since Z/p is a commutative ring with identity, it follows that Z/p is a finite field.

Lemma 1.5 Z/p is a finite field if and only if p is a prime.

Proof We have shown that if p is a prime, then Z/p is a field. Suppose that N is not a prime, and write N = N1N2, where

1 < N1, N2 < N.

By lemma 1.4, since (N1, N) = N1 ≠ 1, N1 does not have a multiplicative inverse in Z/N, and Z/N is not a field, completing the proof.

1.4 Chinese Remainder Theorem (CRT)

Suppose that N = N1N2, where (N1, N2) = 1. Form the ring direct product

Z/N1 × Z/N2. (1.3)

A typical element in (1.3) is an ordered pair

(a1, a2), a1 ∈ Z/N1, a2 ∈ Z/N2.

Addition and multiplication in (1.3) are taken as componentwise addition

(a1, a2) + (b1, b2) = ((a1 + b1) mod N1, (a2 + b2) mod N2)

and componentwise multiplication

(a1, a2)(b1, b2) = ((a1b1) mod N1, (a2b2) mod N2).


The CRT constructs a ring-isomorphism

Z/N ≅ Z/N1 × Z/N2. (1.4)

We will construct the ring-isomorphism using idempotents. Since N1 and N2 are relatively prime,

N1 f1 + N2 f2 = 1, f1, f2 ∈ Z. (1.5)

Set

e1 = (N2 f2) mod N, (1.6)

e2 = (N1 f1) mod N. (1.7)

Rewrite (1.6) as

e1 = N2 f2 + Nq, q ∈ Z. (1.8)

We see from (1.5) that

e1 ≡ 1 mod N1 (1.9)

and from (1.8) that

e1 ≡ 0 mod N2. (1.10)

In the same way,

e2 ≡ 0 mod N1, (1.11)

e2 ≡ 1 mod N2. (1.12)

The element e1 ∈ Z/N is uniquely determined by conditions (1.9) and (1.10). Suppose that a second element g1 ∈ Z/N can be found satisfying

g1 ≡ 1 mod N1, g1 ≡ 0 mod N2.

Then g1 ≡ e1 mod N1 and g1 ≡ e1 mod N2, implying that

N1 | g1 − e1, N2 | g1 − e1.

Since (N1, N2) = 1, corollary 1.1 implies that

N = N1N2 | g1 − e1. (1.13)

Without loss of generality, we assume that g1 − e1 ≥ 0. We have

0 ≤ g1 − e1 < N,

which, in light of (1.13), implies that g1 = e1. This same argument shows that if a and b are elements in Z/N satisfying

a ≡ b mod N1, a ≡ b mod N2,

where N = N1N2 and (N1, N2) = 1, then

a = b.

Conditions (1.9)-(1.12) uniquely determine the set

{e1, e2}, (1.14)

which is called the system of idempotents for the ring Z/N corresponding to the factorization N = N1N2, (N1, N2) = 1.


Example 1.3

Table 1.2 Examples of idempotents.

  N    N1   N2   e1   e2
  6     2    3    3    4
 10     2    5    5    6
 12     3    4    4    9
 15     3    5   10    6
 21     3    7    7   15
 28     4    7   21    8
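Table 1.2 can be reproduced mechanically from (1.5)-(1.7). The Python sketch below is ours (the helper names are assumptions): it solves N1 f1 + N2 f2 = 1 by the extended Euclidean algorithm and forms e1 = (N2 f2) mod N, e2 = (N1 f1) mod N.

    def ext_gcd(a, b):
        # returns (d, x, y) with d = gcd(a, b) = a*x + b*y
        if b == 0:
            return a, 1, 0
        d, x, y = ext_gcd(b, a % b)
        return d, y, x - (a // b) * y

    def idempotents(n1, n2):
        """System of idempotents {e1, e2} for Z/N, N = n1*n2, (n1, n2) = 1."""
        d, f1, f2 = ext_gcd(n1, n2)     # n1*f1 + n2*f2 = 1
        assert d == 1, "n1 and n2 must be relatively prime"
        n = n1 * n2
        e1 = (n2 * f2) % n              # e1 = 1 mod n1, e1 = 0 mod n2
        e2 = (n1 * f1) % n              # e2 = 0 mod n1, e2 = 1 mod n2
        return e1, e2

    for n1, n2 in [(2, 3), (2, 5), (3, 4), (3, 5), (3, 7), (4, 7)]:
        print(n1 * n2, n1, n2, idempotents(n1, n2))
    # The rows match Table 1.2; for example, N = 15 gives e1 = 10, e2 = 6.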

The system of idempotents given in (1.14) will be used to define a ring-isomorphism (1.4). First we need the following result.

Lemma 1.6 If {e1, e2} is the system of idempotents for Z/N corresponding to the factorization N = N1N2, (N1, N2) = 1, then

e1^2 ≡ e1 mod N, e2^2 ≡ e2 mod N, (1.15)

e1e2 ≡ 0 mod N, (1.16)

e1 + e2 ≡ 1 mod N. (1.17)

Proof By (1.9) and (1.10), N1 | (e1 − 1) and N2 | e1. Thus

N1 | e1(e1 − 1), N2 | e1(e1 − 1).

Since (N1, N2) = 1 and e1^2 − e1 = e1(e1 − 1),

N = N1N2 | (e1^2 − e1),

proving

e1^2 ≡ e1 mod N.

In the same way, e2^2 ≡ e2 mod N, proving (1.15). Since N2 | e1 and N1 | e2, (N1, N2) = 1 implies that N = N1N2 | e1e2 and that

e1e2 ≡ 0 mod N.

Finally, N1 | (e1 − 1) and N1 | e2, implying that N1 | (e1 + e2 − 1). In the same way, N2 | (e1 + e2 − 1). Again, (N1, N2) = 1 implies that N | (e1 + e2 − 1) and that

e1 + e2 ≡ 1 mod N,

completing the proof.

Define the mapping

φ : Z/N1 × Z/N2 → Z/N

by the formula

φ(a1, a2) = (a1e1 + a2e2) mod N, a1 ∈ Z/N1, a2 ∈ Z/N2.


Theorem 1.2 φ is a ring-isomorphism of Z/N1 × Z/N2 onto Z/N.

Proof Take

a = (a1, a2), b = (b1, b2) ∈ Z/N1 × Z/N2.

We will write addition and multiplication in Z/N1 × Z/N2 by a + b and ab. Straightforward computation shows that

φ(a + b) = (φ(a) + φ(b)) mod N.

Lemma 1.6 implies that

φ(ab) = (φ(a)φ(b)) mod N.

By (1.15) and (1.16),

φ(a)φ(b) ≡ a1b1e1^2 + (a1b2 + a2b1)e1e2 + a2b2e2^2 ≡ a1b1e1 + a2b2e2 mod N.

Formula (1.17) implies that

φ(1, 1) ≡ 1 mod N,

proving that φ is a ring-homomorphism. To prove that φ is onto, take any k ∈ Z/N and observe that

k ≡ ((k mod N1) e1 + (k mod N2) e2) mod N.

Since Z/N1 × Z/N2 and Z/N have the same number of elements, this proves that φ is onto, completing the proof of the theorem.

From the above proof, we see that the inverse φ⁻¹ of φ is given by

φ⁻¹(k) = (k mod N1, k mod N2), k ∈ Z/N.

This implies that every k ∈ Z/N can be written uniquely as

k ≡ k1e1 + k2e2 mod N,

where k1 ∈ Z/N1 and k2 ∈ Z/N2. This fact will be used later.
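A brief numerical check of theorem 1.2, in Python (ours, not the text's); here the idempotents are found by direct search rather than from (1.6) and (1.7).

    def idempotents(n1, n2):
        # e1 = 1 mod n1, 0 mod n2;  e2 = 0 mod n1, 1 mod n2
        n = n1 * n2
        e1 = next(e for e in range(n) if e % n1 == 1 and e % n2 == 0)
        e2 = next(e for e in range(n) if e % n1 == 0 and e % n2 == 1)
        return e1, e2

    n1, n2 = 3, 5
    n = n1 * n2
    e1, e2 = idempotents(n1, n2)          # (10, 6) for N = 15

    phi = lambda a1, a2: (a1 * e1 + a2 * e2) % n
    phi_inv = lambda k: (k % n1, k % n2)

    for k in range(n):
        assert phi(*phi_inv(k)) == k       # phi inverts phi_inv
    for a1 in range(n1):
        for a2 in range(n2):
            for b1 in range(n1):
                for b2 in range(n2):
                    lhs = phi((a1 * b1) % n1, (a2 * b2) % n2)
                    rhs = (phi(a1, a2) * phi(b1, b2)) % n
                    assert lhs == rhs      # phi preserves multiplication
    print("CRT isomorphism verified for N =", n)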

Example 1.4 Take N = 15 with N1 = 3 and N2 = 5. Then e1 = 10, e2 = 6 and φ is given in table 1.3.

Table 1.3 Isomorphic mapping between Z/3 × Z/5 and Z/15.

  Z/3 × Z/5   Z/15
  (0, 0)        0
  (0, 1)        6
  (0, 2)       12
  (0, 3)        3
  (0, 4)        9
  (1, 0)       10
  (1, 1)        1
  (1, 2)        7
  (1, 3)       13
  (1, 4)        4
  (2, 0)        5
  (2, 1)       11
  (2, 2)        2
  (2, 3)        8
  (2, 4)       14

Consider the direct product of unit groups

U(N1) × U(N2),

with componentwise multiplication. U(N1) × U(N2) is the unit group of the ring Z/N1 × Z/N2. In general, any ring-isomorphism maps the unit group isomorphically onto the unit group. Thus, we have the following theorem.

Theorem 1.3 The ring-isomorphism φ restricts to a group-isomorphism of U(N1) × U(N2) onto U(N).

The extension of Theorem 1.3 to the factorization

N = N1N2 · · · Nr,

where the factors N1, N2, ..., Nr are pairwise relatively prime, will be given in problems 8 to 12.

1.5 Unit Groups

Properties of unit groups play a major role in algorithm design. In this section, we will state, at times without proof, several important results that will be used repeatedly throughout the text.

Denote the number of elements in a set S by o(S). o(S) is called the order of S. In section 3, we proved that

o(U(p)) = p — 1, for a prime p. (1.18)


The same argument, using lemma 1.4, proves for a prime p,

o(U(p^a)) = p^a − p^(a-1) = p^(a-1)(p − 1).

CRT, especially Theorem 1.3, can be used to extend these results to the general case. Suppose that

N = p1^a1 · · · pr^ar

is the prime factorization of N. Then

U(N) ≅ U(p1^a1) × · · · × U(pr^ar).

It follows that

o(U(N)) = p1^(a1-1) · · · pr^(ar-1) (p1 − 1) · · · (pr − 1).

The function o(U(N)), N > 1, is called the Euler quotient function.

Table 1.4 Values of the Euler quotient function.

  N         5    5^2   5^3    7    7^2   7^3    35
  o(U(N))   4    20    100    6    42    294    24
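Table 1.4 can be checked by comparing a direct count of U(N) with the product formula above. A small Python sketch (function names ours):

    from math import gcd

    def unit_group_order(n):
        """o(U(N)) counted directly from the definition of U(N)."""
        return sum(1 for a in range(1, n) if gcd(a, n) == 1)

    def unit_group_order_formula(factorization):
        """Product formula over the prime factorization [(p, a), ...]."""
        out = 1
        for p, a in factorization:
            out *= p ** (a - 1) * (p - 1)
        return out

    print(unit_group_order(125), unit_group_order_formula([(5, 3)]))          # 100 100
    print(unit_group_order(343), unit_group_order_formula([(7, 3)]))          # 294 294
    print(unit_group_order(35), unit_group_order_formula([(5, 1), (7, 1)]))   # 24 24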

We require the following results from general group theory.

Theorem 1.4 If G is a finite group of order m with composition law written multiplicatively, then, for all x ∈ G,

x^m = 1.

Applying this result to the unit group of the finite field Z/p, we have from (1.18)

x^(p-1) ≡ 1 mod p, x ∈ U(p).

Equivalently,

x^p ≡ x mod p, x ∈ Z/p.

Similar results hold in the unit groups U(p^a) and U(N). The next two results are deeper and will be presented without proof.

Theorem 1.5 For an odd prime p and integer a ≥ 1, the unit group U(p^a) is a cyclic group.

This important result is proved in many number theory books, for instance [1]. As a consequence of Theorem 1.5, an element z ∈ U(p^a), called a generator, can be found such that

U(p^a) = {z^k : 0 ≤ k < o(U(p^a))}.

The corresponding result for p = 2 is slightly more complicated. The unit group

U(2^2) = {1, 3}

is cyclic, but U(2^a), a > 2, is never cyclic. The exact result follows.


Theorem 1.6 The group

U(2^a), a ≥ 3,

is the direct product of two cyclic groups, one of order 2 and the other of order 2^(a-2). In fact, if

G1 = {1, −1}, G2 = {5^k : 0 ≤ k < 2^(a-2)},

then U(2^a) = G1 × G2.

Example 1.5 For p = 3 and a = 2,

U(3^2) = {2^k : 0 ≤ k < 6}.

Example 1.6 Take p = 2 and a = 3. Then

U(2^3) = {1, 3, 5, 7} = {1, 7} × {1, 5}.

Take a = 4. Then

U(2^4) = {1, 15} × {1, 5, 9, 13}.
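Theorems 1.5 and 1.6 can be illustrated numerically. The Python sketch below is ours: it searches for a generator of U(n) by brute force (practical only for small moduli) and confirms examples 1.5 and 1.6.

    from math import gcd

    def unit_group(n):
        return [a for a in range(1, n) if gcd(a, n) == 1]

    def find_generator(n):
        """Return z with U(n) = {z^k mod n}, or None if U(n) is not cyclic."""
        units = set(unit_group(n))
        order = len(units)
        for z in sorted(units):
            if {pow(z, k, n) for k in range(order)} == units:
                return z
        return None

    print(find_generator(9))     # 2, matching U(3^2) = {2^k : 0 <= k < 6}
    print(find_generator(25))    # a generator of the cyclic group U(5^2)
    print(find_generator(8))     # None: U(2^3) is not cyclic
    print(sorted({(s * t) % 8 for s in (1, 7) for t in (1, 5)}))   # [1, 3, 5, 7]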

1.6 Polynomial Rings

Consider a field F and denote by F[x] the ring of polynomials in the indeterminate x having coefficients in F. A typical element in F[x] is a formal expression

f(x) = Σ_{k=0}^{r} f_k x^k, f_k ∈ F. (1.19)

If f_r ≠ 0 in (1.19), we say that the degree of f(x) is r and write

deg f(x) = r.

The elements of F can be viewed as polynomials over F. The nonzero elements in F can be identified with the polynomials over F of degree 0. The zero polynomial, denoted by 0, has by convention degree −∞. Then we have the important result

deg(f(x)g(x)) = deg f(x) + deg g(x).

The integer ring Z and polynomial rings over fields have many properties in common. The reason for this is that the following divisibility condition holds in F[x].


Divisibility Condition. If f(x) and g(x) ≠ 0 are polynomials in F[x], then there is a unique pair of polynomials q(x) and r(x) in F[x] satisfying

f(x) = q(x)g(x) + r(x) (1.20)

and

deg r(x) < deg g(x). (1.21)

The polynomial q(x) is called the quotient of the division of g(x) into f(x). In practice, we compute q(x) by long division of polynomials. The polynomial r(x) is called the remainder of the division of f(x) by g(x).
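Polynomial long division can be written out directly. The sketch below is ours, not the text's: it works over the rationals, represents a polynomial as a list of coefficients with the lowest degree first, and returns the quotient and remainder of (1.20).

    from fractions import Fraction

    def poly_divmod(f, g):
        """Long division in F[x]: return (q, r) with f = q*g + r, deg r < deg g."""
        f = [Fraction(c) for c in f]
        g = [Fraction(c) for c in g]
        q = [Fraction(0)] * max(len(f) - len(g) + 1, 1)
        r = f[:]
        while len(r) >= len(g) and any(r):
            shift = len(r) - len(g)
            c = r[-1] / g[-1]           # leading coefficient of the quotient term
            q[shift] = c
            for i, gc in enumerate(g):
                r[i + shift] -= c * gc
            while r and r[-1] == 0:     # drop the cancelled leading term
                r.pop()
        return q, r

    # x^3 + 2x + 1 divided by x^2 + 1: quotient x, remainder x + 1
    q, r = poly_divmod([1, 2, 0, 1], [1, 0, 1])
    print(q, r)    # [0, 1] (i.e. x) and [1, 1] (i.e. 1 + x)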

If r(x) = 0 in (1.20), we have

f(x) = q(x)g(x), q(x) ∈ F[x],

and we say that g(x) divides f(x) or f(x) is a multiple of g(x) over F, and we write

g(x) | f(x).

The elements of F viewed as polynomials are called constant polynomials. Nonzero constant polynomials divide every polynomial. A nonconstant polynomial p(x) in F[x] is said to be irreducible or prime over F if the only divisors of p(x) in F[x] are constants or a constant times p(x). Constant polynomials play the same role in F[x] that the integers 1 and −1 play in Z. To force uniqueness in the statements below, we require the notion of a monic polynomial, which is defined by the property that if f_r ≠ 0 in (1.19), then f_r = 1.

The division relation satisfies the following properties:

(D1) If h(x) | g(x) and g(x) | f(x), then h(x) | f(x).

(D2) If h(x) | f(x) and h(x) | g(x), then h(x) | (a(x)f(x) + b(x)g(x)), for all polynomials a(x) and b(x).

(D3) If f(x) | g(x) and g(x) | f(x), then f(x) = ag(x), a ∈ F.

Consider polynomials f(x) and g(x) over F. A polynomial h(x) over F is called a common divisor of f(x) and g(x) over F if

h(x) | f(x) and h(x) | g(x).

We say that f(x) and g(x) are relatively prime over F if the only divisors of both f(x) and g(x) over F are the constant polynomials.

A subset J of F[x] is called an ideal if J satisfies the following two properties:

(I1) If f(x), g(x) ∈ J, then f(x) + g(x) ∈ J.

(I2) If f(x) ∈ J and a(x) ∈ F[x], then a(x)f(x) ∈ J.


Equivalently, J is an ideal if, for any two polynomials f(x) and g(x) in J,

a(x)f(x) + b(x)g(x) ∈ J,

for all polynomials a(x) and b(x) in F[x]. The set

(f(x)) = {a(x)f(x) : a(x) ∈ F[x]}

is an ideal of F[x]. The divisibility condition will be used to show that all ideals J in F[x] are of this form. The proof is the same as that in section 2, where we now use the divisibility condition for polynomials. First note that if an ideal J contains nonzero constants, then

J = (1) = F[x],

since, by (I2), if a ≠ 0 is in J, then

f(x) = (a⁻¹ f(x)) a

is in J for arbitrary f(x) in F[x].

Lemma 1.7 If J is an ideal in F[x] other than (0) or F[x], then

J = (d(x)),

where d(x) is uniquely determined as the monic polynomial of lowest positive degree in J.

Proof By (I2), J contains a monic polynomial of lowest positive degree, say d(x). Take any f(x) in J and write

f(x) = q(x)d(x) + r(x),

where deg r(x) < deg d(x). By (1.20),

r(x) = f(x) − q(x)d(x)

is in J. Since deg r(x) < deg d(x), we must have

deg r(x) = −∞ or 0.

But J contains no nonzero constants, implying that r(x) = 0. Since f(x) is arbitrary,

J = (d(x)),

and all polynomials in J are multiples of d(x). By (D3), d(x) is uniquely determined as the lowest positive degree monic polynomial in J, proving the lemma.


Take any two polynomials f(x) and g(x) over F. The set

J = {a(x)f(x) + b(x)g(x) : a(x), b(x) ∈ F[x]}

is an ideal in F[x]. By lemma 1.7,

J = (d(x)),

where d(x) is the monic polynomial of lowest degree in J. In particular, d(x) is a common divisor of f(x) and g(x). Write

d(x) = a0(x)f(x) + b0(x)g(x), a0(x), b0(x) ∈ F[x].

By (D2), every common divisor of f(x) and g(x) divides d(x). We have proved the following result.

Lemma 1.8 If f(x) and g(x) are polynomials over F, then there exists a unique monic polynomial d(x) over F satisfying:

I. d(x) is a common divisor of f(x) and g(x).
II. Every divisor of f(x) and g(x) in F[x] divides d(x).

Equivalently, d(x) is the unique monic polynomial over F which is a common divisor of f(x) and g(x) of maximal degree. We call d(x) the greatest common divisor of f(x) and g(x) over F and write

d(x) = (f(x), g(x)).

By the divisibility condition above,

(f(x), g(x)) = a(x)f(x) + b(x)g(x),

where a(x) and b(x) are polynomials over F. In particular, if f(x) and g(x) are relatively prime over F, then

1 = a0(x)f(x) + b0(x)g(x), (1.22)

for some polynomials a0(x) and b0(x) over F. Arguing as in section 2, we have the following corresponding results.

Lemma 1.9 If f(x) | g(x)h(x) and (f(x), g(x)) = 1, then f(x) | h(x).

Theorem 1.7 (Unique Factorization) If f(x) is a polynomial over F, then f(x) can be written uniquely, up to an ordering of factors, as

f(x) = a p1^a1(x) · · · pr^ar(x),

where a ∈ F, p1(x), ..., pr(x) are monic irreducible polynomials over F and a1 > 0, ..., ar > 0 are integers.

Corollary 1.2 For polynomials over F, if f(x) | g(x), h(x) | g(x) and (f(x), h(x)) = 1, then

f(x)h(x) | g(x).
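The polynomials a0(x) and b0(x) of (1.22) can be produced by the extended Euclidean algorithm for polynomials. As a brief illustration (ours, not the text's), assuming SymPy's gcdex helper, which implements that algorithm:

    from sympy import symbols, gcdex, expand

    x = symbols('x')
    f = x**2 + 1
    g = x + 1

    # gcdex returns (a0, b0, d) with a0*f + b0*g = d, d the monic gcd of f and g
    a0, b0, d = gcdex(f, g, x)
    print(a0, b0, d)                      # d = 1 since f and g are relatively prime
    assert expand(a0*f + b0*g - d) == 0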


1.7 Field Extension

Suppose that K is a field and F is a subfield of K in the sense that F is a subset of K containing 1, and it is closed under the addition and multiplication in K . For example, the rational field Q is a subfield of the real field R, which is a subfield of the complex field C. We also say that K is an extension of F. Observe that

F[x] ⊂ K[x].

A polynomial p(x) in F[x] can be irreducible over F without being irreducible as a polynomial in K[x]. For example, the polynomial

x^2 + 1

is irreducible over the real field, but over the complex field

x^2 + 1 = (x + i)(x − i).

Thus, the field of definition is necessary when referring to irreducibility. Consider now the greatest common divisor d(x) over F of two polynomials f(x) and g(x) over F. View f(x) and g(x) as polynomials over K and denote the greatest common divisor of f(x) and g(x) over K by e(x). By lemma 1.8,

d(x) | e(x),

meaning that

e(x) = h(x)d(x), h(x) ∈ K[x].

Write

d(x) = a(x)f(x) + b(x)g(x), a(x), b(x) ∈ F[x].

Since e(x) is a common divisor of f(x) and g(x), by (D2) of section 6,

e(x) | d(x).

Applying (D3) of section 6,

e(x) = d(x).

Thus, the greatest common divisor d(x) of two polynomials f(x) and g(x) over F does not change when we go to an extension K of F. In particular, f(x) and g(x) are relatively prime over F if and only if they are relatively prime over any extension K of F.

Consider a polynomial f(x) over F and suppose that K is an extension of F. The polynomial f(x) can be evaluated at any element a of the field K by replacing the indeterminate x by a. The result,

f(a) = Σ_{k=0}^{r} f_k a^k,

is an element in K. We say that a is a root or zero of f(x) if f(a) = 0. The main reason to consider extensions K of a field F is to find roots of polynomials f(x). For instance, x^2 + 1 has no roots in the real field, but in the complex field i and −i are roots.

Lemma 1.10 A nonzero polynomial f(x) over F has a root a in an extension field K if and only if

f(x) = (x − a)g(x),

for some polynomial g(x) over K. In any extension field K of F, a polynomial f(x) over F has at most n roots, n = deg f(x).

Proof Applying the divisibility condition in K[x],

f(x) = (x − a)g(x) + r(x),

where g(x), r(x) ∈ K[x] and deg r(x) < 1. Since f(a) = 0, we have r(a) = 0, implying that r(x) = 0. Suppose that a1, ..., am are distinct roots of f(x) in K. Then (x − aj) | f(x), 1 ≤ j ≤ m. The polynomials

x − a1, ..., x − am

are pairwise relatively prime. By the extension of corollary 1.2 to several factors, we have

(x − a1) · · · (x − am) | f(x).

Thus

m ≤ deg f(x),

completing the proof.

1.8 The Ring F[x]/f(x)

Fix a polynomial f(x) over F of degree n. Set

F[x]/f(x)

equal to the set of all polynomials g(x) over F satisfying

deg g(x) < n.

Every polynomial g(x) in F[x]/f(x) can be written as

g(x) = Σ_{k=0}^{n-1} g_k x^k, g_k ∈ F,

and we can regard F[x]/f(x) as an n-dimensional vector space over F having basis

1, x, ..., x^(n-1).

We place a ring-multiplication on F[x]/f(x) as follows. For any g(x) ∈ F[x], denote by

g(x) mod f(x)

the remainder of the division of g(x) by f(x). Then

g(x) mod f(x) ∈ F[x]/f(x).

Define multiplication in F[x]/f(x) by

(g(x)h(x)) mod f(x), g(x), h(x) ∈ F[x]/f(x). (1.23)

Direct computation shows that the vector space F[x]/f(x) becomes an algebra over F with the multiplication (1.23).
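Multiplication (1.23) is ordinary polynomial multiplication followed by reduction mod f(x). A minimal Python sketch over F = Z/p (representation and names are ours), for monic f(x) of degree n:

    def poly_mul_mod(g, h, f, p):
        """(g(x)h(x)) mod f(x) in (Z/p)[x]/f(x); coefficient lists, lowest degree
        first, f monic of degree n, deg g and deg h both less than n."""
        n = len(f) - 1
        prod = [0] * (len(g) + len(h) - 1)
        for i, gc in enumerate(g):
            for j, hc in enumerate(h):
                prod[i + j] = (prod[i + j] + gc * hc) % p
        # reduce, using x^n = -(f_0 + f_1 x + ... + f_{n-1} x^{n-1}) mod f(x)
        for k in range(len(prod) - 1, n - 1, -1):
            c = prod[k]
            if c:
                for i in range(n):
                    prod[k - n + i] = (prod[k - n + i] - c * f[i]) % p
                prod[k] = 0
        return prod[:n]

    # In (Z/2)[x]/(x^2 + x + 1): x * (x + 1) = x^2 + x = 1
    print(poly_mul_mod([0, 1], [1, 1], [1, 1, 1], 2))   # [1, 0]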

Two polynomials g(x) and h(x) over F are said to be congruent mod f(x), and we write

g(x) ≡ h(x) mod f(x), (1.24)

if g(x) mod f(x) = h(x) mod f(x). Equivalently, (1.24) holds if

f(x) | (g(x) − h(x)).

Define the mapping

η : F[x] → F[x]/f(x)

by the formula

η(g(x)) = g(x) mod f(x).

Straightforward computation shows that η is a ring-homomorphism of F[x] onto F[x]/f(x) whose kernel

{g(x) ∈ F[x] : η(g(x)) = 0}

is the ideal (f(x)). In lemma 1.5, we gave a method of constructing a finite field of order p, for a prime p. We will now construct fields using the rings F[x]/f(x).

Lemma 1.11 The ring F[x]/f(x) is a field if and only if f(x) is irreducible over F.

Proof Suppose that f(x) is irreducible. Take any nonzero polynomial g(x) in F[x]/f(x). By (1.22),

1 = a0(x)g(x) + b0(x)f(x),


where a0(x) and b0(x) are polynomials over F. Then

1 ≡ a0(x)g(x) mod f(x),

so a0(x) mod f(x) is the multiplicative inverse of g(x) in F[x]/f(x). Since g(x) is an arbitrary nonzero polynomial in F[x]/f(x), the commutative ring F[x]/f(x) is a field. Conversely, suppose that f(x) is not irreducible. Then

f(x) = f1(x)f2(x),

where

0 < deg fk(x) < deg f(x), k = 1, 2.

Then f1(x) and f2(x) are in F[x]/f(x) and

0 = (f1(x)f2(x)) mod f(x).

If f1(x) has a multiplicative inverse, then

0 ≡ f2(x) mod f(x),

a contradiction, completing the proof of the converse of the lemma.

More generally, we have the next result, which we give without proof.

Lemma 1.12 The unit group U of F[x]/f(x), consisting of all polynomials g(x) in F[x]/f(x) having a multiplicative inverse, is

U = {h(x) ∈ F[x]/f(x) : (h(x), f(x)) = 1}.

Identifying F with the constant polynomials in F[x]/f(x), we have

F ⊂ F[x]/f(x).

If p(x) is an irreducible polynomial of degree n, then

K = F[x]/p(x) (1.25)

is a field extension of F that also can be viewed as a vector space of dimension n over F. Suppose that

F = Z/p.

Then K is a finite field of order p^n. We state the next result without proof.

Lemma 1.13 If K is a finite field, then K has order p^n for some prime p and integer n ≥ 1. Two finite fields of the same order are isomorphic.

In addition, every finite field K can be constructed as in (1.25).
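As an illustration of (1.25), the following sketch (ours) realizes the field of order 9 as (Z/3)[x]/(x^2 + 1), where x^2 + 1 is irreducible over Z/3, and checks that every nonzero element is invertible.

    # Elements are pairs (c0, c1) meaning c0 + c1*x, with x^2 = -1 mod 3.
    def mul(a, b):
        c0 = (a[0] * b[0] - a[1] * b[1]) % 3      # x^2 -> -1
        c1 = (a[0] * b[1] + a[1] * b[0]) % 3
        return (c0, c1)

    elements = [(c0, c1) for c0 in range(3) for c1 in range(3)]
    nonzero = [e for e in elements if e != (0, 0)]
    assert all(any(mul(a, b) == (1, 0) for b in nonzero) for a in nonzero)
    print("every nonzero element of the order-9 ring is invertible: it is a field")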


1.9 CRT for Polynomial Rings

Consider a polynomial f(x) over F, and suppose that

f(x) = f1(x)f2(x), (f1(x), f2(x)) = 1. (1.26)

We will define a ring-isomorphism

F[x]/f1(x) × F[x]/f2(x) ≅ F[x]/f(x), (1.27)

where the ring-direct product in (1.27) is taken with respect to componentwise addition and multiplication. First, following section 4, we define the idempotents. Since f1(x) and f2(x) are relatively prime, we can write

1 = a1(x)f1(x) + a2(x)f2(x),

with polynomials a1(x) and a2(x) over F. Set

e1(x) = (a2(x)f2(x)) mod f(x),

e2(x) = (a1(x)f1(x)) mod f(x).

Arguing as in section 4,

e1(x) ≡ 1 mod f1(x), e1(x) ≡ 0 mod f2(x), (1.28)

e2(x) ≡ 0 mod f1(x), e2(x) ≡ 1 mod f2(x). (1.29)

Conditions (1.28) and (1.29) uniquely determine the polynomials e1(x) and e2(x) in F[x]/f(x). The set

{e1(x), e2(x)}

is called the system of idempotents corresponding to factorization (1.26). Arguing as in lemma 1.6, we have the next result.

Lemma 1.14 The system of idempotents {e1(x), e2(x)} satisfies

e_k(x)^2 ≡ e_k(x) mod f(x), k = 1, 2,

e1(x)e2(x) ≡ 0 mod f(x),

e1(x) + e2(x) ≡ 1 mod f(x).

Define

φ(g1(x), g2(x)) = (g1(x)e1(x) + g2(x)e2(x)) mod f(x),

g_k(x) ∈ F[x]/f_k(x), k = 1, 2. As in Theorem 1.2, the next result follows.


Theorem 1.8 φ is a ring-isomorphism of the ring-direct product F[x]/f1(x) × F[x]/f2(x) onto F[x]/f(x) having inverse φ⁻¹ given by the formula

φ⁻¹(g(x)) = (g(x) mod f1(x), g(x) mod f2(x)),

for g(x) ∈ F[x]/f(x).

In particular, every g(x) in F[x]/f(x) can be written uniquely as

g(x) ≡ (g1(x)e1(x) + g2(x)e2(x)) mod f(x),

where g_k(x) ∈ F[x]/f_k(x), k = 1, 2. The extension of these results to factorizations of the form

f(x) = f1(x)f2(x) · · · fr(x), (1.30)

where the factors f_k(x), 1 ≤ k ≤ r, are pairwise relatively prime, is straightforward. To construct the system of idempotents,

{e_k(x) : 1 ≤ k ≤ r}, (1.31)

corresponding to the factorization (1.30), we reason as follows. First,

(f1(x), f2(x) · · · fr(x)) = 1,

and we can apply the above discussion to find a unique polynomial e1(x) in F[x]/f(x) satisfying

e1(x) ≡ 1 mod f1(x), (1.32)

e1(x) ≡ 0 mod f2(x) · · · fr(x). (1.33)

Condition (1.33) implies that

e1(x) ≡ 0 mod f_k(x), 1 ≤ k ≤ r, k ≠ 1.

Continuing in this way, we find polynomials e_k(x) in F[x]/f(x), 1 ≤ k ≤ r, satisfying

e_k(x) ≡ 1 mod f_k(x), 1 ≤ k ≤ r,

e_k(x) ≡ 0 mod f_l(x), 1 ≤ k, l ≤ r, k ≠ l.

These conditions uniquely determine the set (1.31). As before, we have the next result.

Lemma 1.15 The system of idempotents (1.31) satisfies the properties

e_k(x)^2 ≡ e_k(x) mod f(x), 1 ≤ k ≤ r,

e_k(x)e_l(x) ≡ 0 mod f(x), 1 ≤ k, l ≤ r, k ≠ l,

Σ_{k=1}^{r} e_k(x) ≡ 1 mod f(x).


From lemma 1.14, we have the ring-isomorphism φ of the direct product

Π_{k=1}^{r} F[x]/f_k(x)

onto F[x]/f(x) given by the formula

φ(g1(x), ..., gr(x)) = ( Σ_{k=1}^{r} g_k(x) e_k(x) ) mod f(x).

The inverse φ⁻¹ of φ is given by the formula

φ⁻¹(g(x)) = (g(x) mod f1(x), ..., g(x) mod fr(x)),

and every g(x) in F[x]/f(x) can be written uniquely as

g(x) ≡ Σ_{k=1}^{r} g_k(x) e_k(x) mod f(x),

where g_k(x) is in F[x]/f_k(x), 1 ≤ k ≤ r.
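For a concrete instance of the polynomial CRT (compare problem 21 below), take f(x) = (x − 1)(x + 1) over Q. From 1 = (−1/2)(x − 1) + (1/2)(x + 1), the idempotents are e1(x) = (x + 1)/2 and e2(x) = (1 − x)/2. The SymPy check below is ours, not the text's:

    from sympy import symbols, Rational, expand, rem

    x = symbols('x')
    f1, f2 = x - 1, x + 1
    f = expand(f1 * f2)                       # x^2 - 1

    e1 = Rational(1, 2) * (x + 1)             # e1 = a2(x) f2(x) with a2 = 1/2
    e2 = Rational(-1, 2) * (x - 1)            # e2 = a1(x) f1(x) with a1 = -1/2

    assert rem(e1 - 1, f1, x) == 0 and rem(e1, f2, x) == 0   # e1 = 1 mod f1, 0 mod f2
    assert rem(e2, f1, x) == 0 and rem(e2 - 1, f2, x) == 0   # e2 = 0 mod f1, 1 mod f2
    assert expand(e1 + e2) == 1                              # e1 + e2 = 1
    assert rem(expand(e1 * e2), f, x) == 0                   # e1 e2 = 0 mod f
    print("idempotents for f(x) = (x - 1)(x + 1):", e1, "and", e2)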

References

[1] Ireland, K. and Rosen, M. A Classical Introduction to Modern Number Theory, Springer-Verlag, 1980.

[2] Halmos, P. R. Finite-Dimensional Vector Spaces, Springer-Verlag, 1974.

[3] Herstein, I. N. Topics in Algebra, XEROX College Publishing, 1964.

Problems

1. Show that

(a + (b + c) mod N) mod N = ((a + b) mod N + c) mod N.

This is the associative law for addition mod N.

2. Show that

(a · ((b + c) mod N)) mod N

= ((a · b) mod N + (a · c) mod N) mod N.

This is the distributive law in the ring Z/N.


3. Describe the unit group U(N) of Z/N explicitly for N = 12, N = 21, N = 44 and N = 105.

4. Give the table for addition and multiplication in the field Z/11.

5. Give the table for addition and multiplication in the ring Z/21.

6. Find the system of idempotents corresponding to the factorizations N = 3 × 7, N = 4 × 5, N = 2 × 7 and N = 7 × 11.

7. Give the table for the ring-isomorphism φ of the CRT corresponding to the factorizations N = 4 × 5 and N = 2 × 7.

8. Suppose that N = N1N2 · · · Nr, where the factors N1, N2, ..., Nr are relatively prime in pairs. Show that

(Nk, N/Nk) = 1, 1 ≤ k ≤ r.

9. Continuing the notation and using the result of problem 8, define integers e1, e2, ..., er satisfying

0 ≤ ek < N, 1 ≤ k ≤ r,

ek ≡ 1 mod Nk, 1 ≤ k ≤ r,

ek ≡ 0 mod Nl, 1 ≤ k, l ≤ r, k ≠ l.

These integers are uniquely determined by the above conditions and form the system of idempotents corresponding to the factorization given in problem 8.

10. Continuing the notation of problems 8 and 9, prove the analog of lemma 1.6:

ek^2 ≡ ek mod N, 1 ≤ k ≤ r,

ek el ≡ 0 mod N, 1 ≤ k, l ≤ r, k ≠ l,

Σ_{k=1}^{r} ek ≡ 1 mod N.

11. Define the CRT ring-isomorphism φ of the direct product Z/N1 × Z/N2 × · · · × Z/Nr onto Z/N and describe its inverse φ⁻¹.

12. Extend Theorem 1.3 to the case of several factors given in the above problems.

13. Find a generator of the unit group U(N) of Z/N where N = 5, N = 25, N = 125.

14. Show that U(21) is not a cyclic group. Use Theorem 1.3 to find generators of U(21).


15. For N = p1 p2 · · · pr, where the factors p1, p2, ..., pr are distinct primes, show that the unit group U(N) of Z/N is group-isomorphic to the direct sum

Z/(p1 − 1) ⊕ Z/(p2 − 1) ⊕ · · · ⊕ Z/(pr − 1).

16. Prove that

deg(f(x)g(x)) = deg f(x) + deg g(x).

17. Write out the divisibility condition for the polynomials

g(x) = x^10 + 4x^8 + 2x^2 + 3,

f(x) = 4x^20 + 2x^10 + 1.

18. For any two polynomials f(x) and g(x) in Q[x], show that the following set is an ideal:

J = {a(x)f(x) + b(x)g(x) : a(x), b(x) ∈ Q[x]}.

19. Let F be a finite field and form the set

L = {1, 1 + 1, 1 + 1 + 1, ...}.

Show that L has order p for some prime p and that L is a subfield of F isomorphic to the field Z/p. The prime p is called the characteristic of the finite field F.

20. Show that every finite field K has order p^n for some prime p and integer n ≥ 1.

21. For the polynomial f(x) over Q,

f(x) = (x − 1)(x + 1),

find the idempotents corresponding to this factorization and describe the table giving the CRT ring-isomorphism.

22. Find the idempotents corresponding to the factorization

f(x) = (x − a1)(x − a2) · · · (x − ar),

where a1, ..., ar are distinct elements in some field F. Describe the corresponding CRT ring-isomorphism φ and its inverse φ⁻¹.


2 Tensor Product and Stride Permutation

2.1 Introduction

Tensor product offers a natural language for expressing digital signal processing (DSP) algorithms in terms of matrix factorizations. In this chapter, we define the tensor product and derive several important tensor product identities.

Closely associated with tensor product is a class of permutations, the stride permutations. These permutations govern the addressing between the stages of tensor product decompositions of DSP algorithms. As we will see in the following chapters, these permutations distinguish the variants of the Cooley-Tukey FFT algorithms and other DSP algorithms.

Tensor product formulation of DSP algorithms also offers the convenience of modifying the algorithms to adapt to specific computer architectures. Tensor product identities can be used in the process of automating the implementation of the algorithms on these architectures. The formalism of tensor product notation can be used to keep track of the complicated index calculation needed in implementing FT algorithms. In [1], the implementation of tensor product actions on the CRAY X-MP was carried out in detail.


2.2 Tensor Product

In this section, we present some of the basic properties of tensor products which are encountered in the algorithms that we will describe in future chapters of this work. Tensor product algebra is an important tool for presenting mathematical formulations of DSP algorithms so that these algorithms may be studied and analyzed in a unified format. We first define the tensor product of vectors and present some of its properties. We then define the tensor product of matrices and describe additional properties. These properties will be very useful in manipulating the factorizations of discrete FT matrices.

Let C^M denote the M-dimensional vector space of M-tuples of complex numbers. A typical point a ∈ C^M is a column vector,

a = [a_0, a_1, ..., a_{M-1}]^t.

We say that a has size M. If the size of a ∈ C^M is important, we write a as a_M.

The tensor product of two vectors a ∈ C^M and b ∈ C^L is the vector a ⊗ b ∈ C^N, N = ML, defined as the column vector whose consecutive segments of size L are

a_0 b, a_1 b, ..., a_{M-1} b.

Example 2.1

[a_0, a_1, a_2]^t ⊗ [b_0, b_1]^t = [a_0 b_0, a_0 b_1, a_1 b_0, a_1 b_1, a_2 b_0, a_2 b_1]^t.

The tensor product is bilinear in the following sense. For vectors a, b, c of appropriate sizes,

(a + b) ⊗ c = a ⊗ c + b ⊗ c, (2.1)

a ⊗ (b + c) = a ⊗ b + a ⊗ c, (2.2)

but it is not commutative. In general, a ⊗ b ≠ b ⊗ a.
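The tensor product of vectors coincides with numpy's Kronecker product, which gives a quick way to experiment with these identities. A short sketch (ours, not the text's):

    import numpy as np

    a = np.array([1, 2, 3])        # size M = 3
    b = np.array([10, 20])         # size L = 2

    ab = np.kron(a, b)             # a tensor b, size N = ML = 6
    ba = np.kron(b, a)             # b tensor a

    print(ab)                      # [10 20 20 40 30 60]
    print(ba)                      # [10 20 30 20 40 60]
    print(np.array_equal(ab, ba))  # False: the tensor product is not commutative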


Tensor product constructions often involve the following relationship between linear arrays and multidimensional arrays. An M × L array

X = [x_{m,l}], 0 ≤ m < M, 0 ≤ l < L,

can be identified with the vector x of size N, N = ML, formed by running down the columns of X in consecutive order. Conversely, given a vector x of size N, we denote by Mat_{M×L}(x) the M × L array formed by first segmenting x into L vectors of size M,

X_0 = [x_0, x_1, ..., x_{M-1}]^t, X_1 = [x_M, x_{M+1}, ..., x_{2M-1}]^t, ..., X_{L-1} = [x_{(L-1)M}, ..., x_{ML-1}]^t,

and then placing these vectors in L consecutive columns,

Mat_{M×L}(x) = [X_0 X_1 · · · X_{L-1}].

Consider the tensor products a ⊗ b and b ⊗ a with a ∈ C^M, b ∈ C^L. Identify a ⊗ b with the L × M array

Mat_{L×M}(a ⊗ b) = [a_0 b a_1 b · · · a_{M-1} b],

and b ⊗ a with the M × L array

Mat_{M×L}(b ⊗ a) = [b_0 a b_1 a · · · b_{L-1} a].

We see that

Mat_{M×L}(b ⊗ a) = (Mat_{L×M}(a ⊗ b))^t.

Thus, interchanging order in the tensor product corresponds to matrix transpose. In example 2.1, a ⊗ b corresponds to the 2 × 3 array

  a_0 b_0   a_1 b_0   a_2 b_0
  a_0 b_1   a_1 b_1   a_2 b_1

while the vector b ⊗ a corresponds to the 3 × 2 array

  b_0 a_0   b_1 a_0
  b_0 a_1   b_1 a_1
  b_0 a_2   b_1 a_2

In general, we can describe matrix transposition in terms of a permutation of an indexing set. Consider first the L × M array

Y = [y_{l,m}], 0 ≤ l < L, 0 ≤ m < M.


For any 0 ≤ r, s < N, we can write uniquely

r = l + mL, 0 ≤ l < L, 0 ≤ m < M,

s = m + lM, 0 ≤ m < M, 0 ≤ l < L.

The vector y formed from the array Y has components given by

y_r = y_{l,m}, r = l + mL, 0 ≤ m < M, 0 ≤ l < L,

while the vector z formed from Y^t has components given by

z_s = y_{l,m}, s = m + lM, 0 ≤ m < M, 0 ≤ l < L.

This corresponds to the permutation of the indexing set,

π(l + mL) = m + lM, 0 ≤ m < M, 0 ≤ l < L, (2.3)

and we have

y_r = z_{π(r)}, 0 ≤ r < N. (2.4)

Example 2.2 Taking M = 2 and L = 3, we have

π = ( 0 1 2 3 4 5 )
    ( 0 2 4 1 3 5 )

and

y = [z_0, z_2, z_4, z_1, z_3, z_5]^t.

To form y from z, we 'stride' through z with length two.

In general, to form a ⊗ b from b ⊗ a, we first initialize at b_0 a_0, the 0-th component of b ⊗ a, and then stride through b ⊗ a with length the size of a. After a pass through b ⊗ a, we reinitialize at b_0 a_1, the first component of b ⊗ a, and then stride through b ⊗ a with length the size of a. This permutation of data continues until we form a ⊗ b. This procedure is an example of the important notion of a stride permutation. Stride permutations will be discussed in great detail beginning in the next section.
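The procedure just described can be written as a short function. The sketch below is ours (the name stride_permute is an assumption); it reproduces example 2.2 by forming a ⊗ b from b ⊗ a with stride equal to the size of a.

    import numpy as np

    def stride_permute(x, stride):
        """Read x at the given stride: x[0], x[stride], x[2*stride], ...,
        then restart at x[1], and so on (a stride permutation of x)."""
        x = np.asarray(x)
        return np.concatenate([x[start::stride] for start in range(stride)])

    a = np.array([1, 2])          # size M = 2
    b = np.array([7, 8, 9])       # size L = 3

    ba = np.kron(b, a)            # b tensor a
    ab = np.kron(a, b)            # a tensor b
    print(np.array_equal(stride_permute(ba, a.size), ab))   # True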

Denote by e_m^M, 0 ≤ m < M, the vector of size M with 1 in the m-th place and 0 elsewhere. The set of vectors

{e_m^M : 0 ≤ m < M}

is a basis of C^M called the standard basis. Set N = ML and form the tensor products e_m^M ⊗ e_l^L, 0 ≤ m < M, 0 ≤ l < L. Since

e_m^M ⊗ e_l^L = e_{l+mL}^N, 0 ≤ m < M, 0 ≤ l < L,

Page 45: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.2 Tensor Product 31

the set lemmOet : 0<m<M,0</<L1

is the standard basis of CN. In particular, as a runs over all vectors of size M and b runs over all

vectors of size L, the tensor products a0b span the spa.ce CN (see problems 4 and 5). To prove that the actions of two matrices on CN are equal, we only need to prove that they are equal on tensor products of the form am 0 bL.

The tensor product of an R x S matrix A with an M x L matrix B is the RM x SL matrix, A 0 B, given by

420,0B ao,03 • • • ao,s_iB

_aR—LoB " • aR-1,s—iB

Setting C = A 0 B, the coefficients of C are given by

Cm±rM,1-1-sL = ar,s brn,1-

It is natural to view the tensor product A0B as being formed from blocks of scalar multiples of B. The relationship between tensor products of matrices and vectors is contained in the next result.

Theorem 2.1 If A is an R x S matrix and B is an M x L matrix, then

(A 0 B)(a b) = Aa Bb,

for any vectors a and b of sizes S and L.

Proof The vector a 0 b, aaobb

as_ib

can be viewed as consisting of consecutive segments,

aob, alb , , as_ib,

each of size L. Since the M x SL matrix formed from the M rows of A0 B is

[a0,013 ao,iB • - • ao,L-1.81,

we have that the vector of size M formed from the first M components of (A B)(a b) is

(ao,oao aojai + • • + ao,s_ias_i) Bb.

Continuing in this way proves the theorem.

Page 46: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

32 2. Tensor Product and Stride Permutation

Theorem 2.2 If A and C are M x M matrices and B and D are L x L matt-ices, then

(A 0 B)(C D) = AC 0 BD.

Proof Take vectors a and b of sizes M and L. By Theorem 2.1,

(A 0 B)(C D)(a = (A 0 B)(Ca Db)

= ACa BDb,

proving the theorem, in light of the preceding discussion.

More generally, the tensor products

a0b0c= a® (b0c)= (a® b) 0c, a E Cm, b E CL, c E CK,

span the space CN, N = MLK, and the observation about matrix expressions can be extended.

An important special case of Theorem 2.2 is the following decomposition. Denote by /L, the L x L identity matrix. Then

A 0 B = (Im B)(A 0 IL) = 0 .ELY-rm B), (2.5)

where A is an M x M matrix and B is an L x L matrix. In order to better understand the computation (A 0 B)x, we need to examine the factors I'm 0 B and A 0 /L.

im B is the direct sum of M copies of B,

Im B — 0

0

Bi'

and its action on x is the action of B on the M consecutive segments of x of size L. We call im B a parallel operation. For a vector x E CN, N = ML, we have

MatLxm((/m B)x) = BMatLxm(x).

Write MatL,,m(x) = [X0 - • Xm-i],

where X„., is the m-th column of MatLx m(x). The computation (A 0 /L)x can be interpreted as a vector operation of

A on the vectors X0, Xi, , Xm_i by

a0,0X0 + • • • + ao,m_iXm-i

(A 0 /L)x = (2.6)

_am_i,o)Co + • • • + am-i,m-iXm-i

Page 47: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.3 Stride Permutations 33

Operations involved in (2.6) are scalar-vector multiplication and vector addition.

Factorization (2.5) decomposes the operation (A 0 B) into the parallel operation IM B followed by the vector operation A 0 /L. For a vector x E CN,

MatLx m((A IL )x) = (MatLxm(x ) )At

and

MatLxmail B)x)= BMatLxm(x)At.

2.3 Stride Permutations

In this section, we discuss the stride permutations that govern the data flow required to parallelize or vectorize a tensor product computation. Stride permutations play a crucial role in the implementation of FT computations. On some machines, the action of a stride permutation can be implemented as elements of the input vector being loaded from the main memory into registers. For architectures where this is the case, considerable savings can be obtained by performing these permutations when loading the input into the registers.

Take N = 2M. The tensor products a 0 b where, a E C2 and b Cm span CN. We define the N -point stride M permutation matrix P(N,M) by the rule

P(N, M)(a b) = b 0 a, a E C2, b E Cm. (2.7)

More generally, if x is a vector of size N, then y = P(N, M)x satisfies

Mat2x m (Y) = (Matm x 2 (x))t.

The matrix P(N, M) is usually called the perfect shuffle. It strides through x with stride of length M.

Example 2.3 Take N = 4. Then

1000 xe

0010 x2

P(4, 2) = [0 1 0 0 P(4' 2)x = xi •

0001 13

Page 48: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

34 2. Tensor Product and Stride Permutation

Example 2.4 Take N = 8. Then

'1000000°- 00001000 01000000 00000100

P(8,4) = 00100000' 00000010 00010000 _00000001_

xo x4

P(8, 4)x = x5 X2

X6

X3

_X7

Example 2.5

P(6, 3)

Take

=

N = 6. Then

-1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0000 0 0 1 0 0

_ 0 0 0 0 0

0 0 0 o 0 1_

P(6, 3)x =

x3

x1 X4

X2

_X5 _

Suppose now that N = ML. The N-point stride L perm,utation matrix P(N, L) is defined by

P(N,L) (a b) = b 0 a, (2.8)

where a and b are arbitrary vectors of sizes M and L. More generally, if x CN, then y = P(N, L)x if and only if

MatmxLY = (Mat', mx)t

To compute P(N, L)x, we stride through x with stride L. Formula (2.8) will be used repeatedly to derive matrix identities involving stride permutations.

The algebra of stride permutations has an important impact on the design of tensor product algorithms.

Theorem 2.3 If N = MLK, then

P(N, LK) = P(N, L)P(N, K).

Proof Take vectors a E Cm, b E CL and c E CK. Then

P(N, LK)(a b c) .=_13 c 0 a, -

and

P(N, L)P(N, K)(a 01: c) = P(N,L)(c a® b) =b0c0a,

proving the theorem.

Page 49: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.3 Stride Permutations 35

In particular, from Theorem 2.3,

P(NM,M)-1 = P(NM, N). (2.9)

Example 2.6 Take N = 4 x 2.

- X0-

X2

X4

P(8,2)X= X6 Xi

X3

X5

_X7 _

Then

, P(8,2)2X=

- X0-

X4

Xi

X5 X2

16

X3

_X7 _ from which we see that

P(8, 2)2 = P(8, 4), P(8, 2)3 =

In general, an N x N permutation matrix can be given by a permutation of Z/N. Let ir be a permutation of Z/N. We represent ir using the following notation:

7r = (7r(0),7r(1),...,7r(N - 1)).

Define the N x N permutation matrix P(7), by the condition

P(7r)x = y,

where yi = 0 < 3 < N.

- Example 2.7 Take N = 8 and

7r = (0,4,1,5,2,6,3,7).

Then - X0-

X 4

Xi

P (70X = 5 , X2

X6

X3

_X7 _

and we see that P(ir) = P(8,4).

Page 50: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

36 2. Tensor Product and Stride Permutation

Example 2.8 Take N = 12 and

7r = (0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11).

Then P(7r) = P(12,3).

Direct computation shows that the mapping

7r P(7r)

satisfies

P(7r2 71-1) = P(70P(71-2),

P(7r-1) -= P(70 - 1.

Example 2.9 Take N = 8, M = 2 and L -= 4. Then

7r = (0, 2,4,6,1,3,5,7)

and P(7r) = P(8, 2).

Example 2.10 Take N = 12, L = 3 and M = 4. Then

ir = (0,3,6,9,1,4,7,10,2,5,8,11)

and Pew) = P(12,3).

In general, we have the next result.

Theorem 2.4 If N = ML and 7r is the perm,utation of ZIN defined by

7r(a bM) = b aL, 0 < a < M, 0 < b < L,

then P(70= P(N,L).

Consider the set of N x N permutation matrices

{P(N,L) LI NI . (2.10)

We will describe the permutation matrices in this set in terms of the unit group U(N - 1) of Z/(N - 1). The unit group U(N -1) is given by

U (N - 1) = {0 < T < N - 1 : (T - 1) = 1} .

If T E U(N - 1), then multiplication by T mod (N - 1) is a bijection of the set

{0,1,..., N - 21.

Page 51: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.3 Stride Permutations 37

Define the permutation 71-7-, of Z/N by the two rules

72-(k) kT mod (N —1), 0 < k < N —1,

71-2-(N —1) = (N —1).

Observe that, if T I N, then (T,N —1) = 1 and we can define rr-

Example 2.11 Take N = 12 and T = 4. Then

= (0,4,8,1,5,9,2,6,10,3,7,11).

We see that

74(a + 3b) = b + 4a, 0 < a < 3, 0 < b < 4,

and P(71-4) = P(12,4).

Theorem 2.5 If N = ML, then

Pen L) = P(N, L).

Proof We must show that

irL(a + bM) = b+aL, 0 < a < M, 0 < b < L.

N — 1=- ML —1=Omod (N —1).

irL(a + bM) aL + bN

aL + b + b(N — 1) mod (N —1)

=aL+b.

Consider the set of N x N permutation matrices

{PerT) : T E U(N —1)} .

This set is group-isomorphic to U(N — 1). In fact,

P(7rR)P(7rs) = P(7ru), U RS mod (N —1)

P(1-R)-1 = P(z-R-1), R-1 taken mod (N —1).

Theorem 2.6 {P(2m,2m) : 0 < m < MI

is the cyclic group generated by P(2m , 2). In fact,

ppm ,2777,)p(2m ,2k) _ ppm ,2m-f-k),

where m + k is taken mod M.

Page 52: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

38 2. Tensor Product and Stride Permutation

Proof Consider integers 0 < k < M. If rn + k < M, then there is nothing to prove. Suppose that M < nt+ k. Set / = m + k — M. We have 0 < / < M and / 7/1+ k mod M:

2m+k 21+1 I ,

2"i±lc — 2/ = 2/±m — 21 -= 21(2m — 1) =- 0 mod (2m — 1).

It follows that

2m2k mod (2m — 1),

proving the theorem.

More generally, we have the following result, which we state without proof.

Theorem 2.7 If p is a prime, then the set

fP(Pm,PL) (L, m) = 11

is a cyclic group of order M generated by P(pm ,p).

It is sometimes useful to represent permutations and general computa-tions by diagrams that give a picture of data flow.

Example 2.12 The permutation P(4, 2) can be represented by

10. .X0

Xi. -X2

X2. .X1

X3. .X3

Example 2.13 represented by

The permutation /2 P(4,2) and P(4, 2) 0 /2 can be

XO. .X0 X0 .X0

Xl• .X2 Xi .XI

X2. „

.X1

X3. .X3 X3 .X5

X4. -X4 X4 -X2

X5. .X6 X5 .X3

X6. .X5 X5 .X6

X7. .X7 X7 .X7

Page 53: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.3 Stride Permutations 39

Example 2.14 The permutations P(8, 2) and P(8, 4) can be represented by

Xo. .X0 lo. .X0

Xl• .X2 Xi. •X4

X2. •X4 X2. •Xi

X3. .X6 13. .X5

X4. .X1 X4. .X2

X5. .X3 X5. .X6

X6. .X5 X6. .X3

X7. .X7 X7. .X7

We see from the diagram of example 2.12 that 12 0 P(4, 2) consists of two parallel copies of P(4, 2). To compute the action of P(4, 2) oh, we can first form the vectors

x(0) = , x(1) = [x2] , x(2) = [x41 , x(3) = [x61 X1 X3 X5 X7

and compute the vector operation of P(4, 2) on these four vectors,

[x(0)

P(4, 2) x(1) x(2) x(3)

In the preceding section, we discussed how certain tensor product ex-pressions could be viewed as vector operations, parallel operations, or as a combination of vector and parallel operations. An important tool for inter-changing the operations in a given algorithm is the commutation theorem.

Theorem 2.8 If A is an M x M matrix and B is an L x L matrix, then

P(N, L)(A 0 B) = (B A)P(N, L), N = M L.

Proof Set z = x y, where x has size M and y has size L. Then, by definition,

(A 0 B)(x y) = Ax 0 By,

P(N, L)(A B)z = By 0 Ax.

Arguing in the same way,

(B A)P(N, L)z = By Ax,

proving the theorem.

COrollary 2.1

P(N, L)(Im B) = (B Im)P(N,L), N = M L.

Page 54: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

40 2. Tensor Product and Stride Permutation

As an important application of the commutation theorem, we observe that

A0 B = (A 0 IL)P(N,M)(B Im)P(N,M)-1, (2.11)

A0 B = P(N,M)(IL, A)P(N,M)-1(Im B). (2.12)

Factorization (2.11) decomposes A 0 B into a sequence of vector oper-ations; the first operates on vectors of size M while the second operates on vectors of size L. The intervening stride permutations provide a math-ematical language for describing the readdressing between stages of the computation. In the same way, we interpret (2.12) as a sequence of parallel operations.

2.4 Multidimensional Tensor Products

Tensor product identities will be used to obtain factorizations of multidi-mensional tensor products, which then can be applied to implementation problems. The rules of implementation established in this section will have important consequences in the rest of this text. The first application will be to the various Cooley-Tukey FFT algorithms in the next chapter. The stride permutations appearing in these factorizations will make explicit the readdressing needed to carry out computations. We begin with an example. Take positive integers N2 and N3. Set N = NiN2N3. AN denotes any N x N matrix. The product rule implies that

AN, 0 AN, 0 AN3 (2.13)

= (AN, 0 IN2N3)(INI AN2 IN3)(INiN2 ® AN3 )•

The factor AN, 0 I N2N, is a vector operation while the factor /NiN, AN3 is a parallel operation. The middle factor, IN„ AN,0 N,, is of mixed type involving NI copies of the vector operation AN, 0 IN3. There are several ways of modifying these factors, using the commutation theorem. Since

AN, IN2N3 P(N, Ni)(IN2N3 ANi)P(N, N2N3), (2.14)

IN, 0 AN2 /N2 = P(N,NiN2)(iNIN, AN2)P(N, N3) (2.15)

and P(N, N2N3)P(N, NiN2)_= P(N, N2),

we can rewrite (2.13) as

AN, 0 AN, A N3 = P(N, Ni)(IN2N, ANi) (2.16)

P(N, N2)(INi N3 0 AN2)P(NI N3)(IN1N2 AN3)•

Page 55: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.4 Multidimensional Tensor Products 41

A second parallelization comes from replacing the middle factor by

IN, 0 AN2 IN3 Q(INiN3 AN2)C2 1, (2.17)

where Q = P(N2N3, N2). These two parallel factorizations differ in data flow. In the first, the readdressing between the computational stages is given by P(N,N3), P(N,N2) and P(N, NO while in the sec-ond the readdressing is given by Q-1, P(N,N2N3)Q and P(N, NO. Each will have advantages and disadvantages that can be made explicit when implementing on a specific computer.

In general, the permutations that arise from commuting terms in a multi-dimensional tensor product are built up from products of terms of the form / P /, where / denotes an identity matrix and P denotes a stride per-mutation. In particular, IN, 0 P(N2N3, N3) iS NI copies of the permutation P(N2N3, N3). As such, it performs a stride permutation on Ni segments of the input vector beginning at different offsets. It can be implemented as a loop of stride permutations, where the same permutation is performed, but the initial offset is incremented by N2N3 at each iteration. The second type of permutation can be thought of as permuting blocics of the input vector. Thus, P(N2N3, N3)0 /N1 permutes segments of length Ni at stride N3. This can be implemented by loading blocks of Ni consecutive elements, beginning at offsets given by the permutation P(N2N3, N3).

If M = = N2 = N3 and A, B and C are M x M matrices, the factorization becomes

A0 B 0C = P(Im2 A)P(Im2 B)P(Im2 0C),

where P = P(M3,M). In this case, the readdressing between each of the stages of the computation is the same and given by P.

Factorizations of the permutation occurring in computing terms in mul-tidimensional tensor products offer programming options that can be used to match the algorithm computing the action of these multidimensional tensor products to the specific machine architecture. Depending on the machine parameters such as maximal vector length, minimal vector length, number of processors and communication network, a full parallelization or vectorization may not be desired. The rules established can be modified to partially parallelize or vectorize. The next result describes a factorization that is especially useful.

Theorem 2.9 ././ N = NiN2N3, then

P(N, N3) = (P(NiN3, N3) 0 INMINi P(N2N3, N3))•

Prroof Take a E CN1, b E CN2 and c E CN3. Then

(P(NIN3, N3)0 ./N2)(iNi P(N2N3, N3))(a b c)

Page 56: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

42 2. Tensor Product and Stride Permutation

= (P(NiN3, N3) 0 /N2)(a c b)

=c0a013

= P(N, N3)(a b c),

proving the theorem.

Let N = NiN2 • • • NK. For M x M matrices X2, • • - Xl•C

H Xk = X1X2 • • • XK k=1

is the sequence of matrbc actions beginning with XK and ending with Xi. Denote an arbitrary Nk X Nk matrix by ANk. Set N(k) = NiN2 • • Nk and N(0) = 1. The product rule implies that

AN, ' • ' ANK = H IN(k-1) ANk IN/N(k)• (2.18) k=1

Set Pk = P(N, N(k)). Since

/N(k_i) AN,, -/N/N(k) — Pk(1N/Nk ANk)Pk 1, (2.19)

we can parallelize (2.32) by the factorization

AN, 0 ' ' ANK = H Pk(IN/Nk ANk)Pk 1* (2.20) k=1

The description of the intervening permutation can be simplified by combining permutations. Since

= P(N, NI N(k))P(N, N(k +1)) = P(N, Nk+i),

we have the next result.

Theorem 2.10 (Parallel I)

AN, 0 • • • 0 ANK = H P(N, Nk)(IN/Nk ANk)• k=1

As in the above example, if Ni = N2 = • • • = NK, then the readdressing

between computational stages is exactly the same. A second parallelization can be obtained from the identity

/N(k_i) ANk ININ(k) = Ch(ININk ANk)Qkl, - (2.21)

where Qk = 0 P(NIN(k — 1), Nk). This leads to the next result,

AN, 0 • • • ® ANK = H Qk (iNiNk (8) ANk )(27, 11 k=1

Page 57: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.4 Multidimensional Tensor Products 43

where

Qk = 0 P(NIN(k — 1), Nk)•

Combining permutations, we have the next result.

Theorem 2.11 (Parallel II)

ANi AN2 ' "0 ANK Qk_iicik(iN,N„ ANk), k=1

where Qo = I.

The permutation Qk—lQk+i has the interpretation that it maps the multidimensional tensor product

Nk Nk-1-2 aN IC 0 aNk+i a 0•--0a ca). a

into aNi aNk_i aNk+1 ell< aNk

interchanging the k-th and k 1-th positions. Since

ININk ANk = P(N, NINk)(ANk IN/NJP(N, NINk)-1 ,

we obtain the vector factorization analogs of the preceding two theorems.

Theorem 2.12 (Vector I)

ANi • • ANK = ( A Nk /N/Nk )P(N,Nk)• k=1

- Theorem 2.13 (Vector II)

ANi ANK = H (A NI, ININJRk, k=1

where Rk = P(N,Nk)QkiCh+iP(N,NINk+1).

If ANk, < k < K, is symmetric, then applying the transpose operation to both sizes of the factorizations in the preceding theorems produces ad-ditional factorizations which, although similar in form, can have significant data flow differences. We will use

(A 0 B)t = At 0 Bt ,

Pt = P-1, P a permutation matrix.

Page 58: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

44 2. Tensor Product and Stride Permutation

Theorem 2.14 (Parallel III) If AN,,, 1 < k < K, is symmetric, then

AN, 0 AN2 0 • " A NK =

(ININK 0 ANK)P(N,NINK)- • (INN, 0 AN,)P(N,NINO•

Theorem 2.15 (Parallel IV) If ANk, 1 < k < K, is symmetric, then

AN, 0 AN2 ' • ' ANK =

(INNK ANKY 2K-1 UNIN2 AN2)W ° 1 -1,-IN/N, 0 A ) •

In Parallel I, a typical stage

P(N,Nk)( IN/Nk 0 A Nk

can be implemented by the parallel operation

ININk ANk

followed by the stride permutation P(N,Nk), while in Parallel III the parallel operation

iNiNk ANk

follows the stride permutation P(N, NINk). If there are many small factors Nk, < k < K, in the factorization of N, these two parallel forms can be distinguished by the small strides in the first as compared with the large strides in the second.

2.5 Vector Implementation

In this section, tensor product identities will be used to design algorithms computing tensor product operations on a sample vector processor. Our model of a vector processor includes the main memory, vector registers and a communication network between the main memory and vector reg-isters, which will be described in detail as required. Vector operations are performed on vectors located in vector registers. Some standard vector operations are vector addition, subtraction, scalar-vector and vector multi-plications. To take advantage of the high-speed computational rate offered by vector operation, it is essential to keep memory transfers to a minimum and to perform vector operations on vectors residing in vector registers as much as possible. Also, the transfer of data between the main memory and vector registers, on many processors, is especially suited to implement the stride permutations arising from tensor product operations.

There are several key ma,chine parameters that must be kept in mind when designing algorithms. First, vector registers have a maximum size,

Page 59: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.5 Vector Implementation 45

which limits the size of vectors that can be used on vector instructions. Also, due to 'start-up costs', there is usually a lower bound on the size of vectors that can efficiently be operated on by vector operations. If a computation requires operations on larger vectors than allowed, then the computation must be segmented and several vector instructions combined to perform the computation. The language of tensor products is ideally suited to design algorithms that satisfy this key design parameter. Memory transfer can also be performed with vector operations. These vector operations correspond to stride permutations. A vector of elements in the main memory can be loaded into a vector register with the following instruction:

oVi X, L

The vector of elements in memory having the initial address X is loaded into the vector register V/ at stride L. A special register called the vector length register V L determines the number of elements that are loaded. For example, if

Xo -

Xi

X = S2 , V L = 3, X3

X4

_ X5 _

then

WO X, 2

loads the vector register VO with the elements of X beginning at xo with stride 2,

X0

VO = [X21 •

X4

The result of the load instruction

.1/1 X +1, 2

iS

Xi

111 = [X3] .

X5

The memory transfer operation that takes a vector in memory of size M N and fills at stride N , N vector registers with vectors of size M will be

Page 60: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

46 2. Tensor Product and Stride Permutation

denoted by LATIN. Thus,

- X0 -

Xi X0 1 [ X1

X2 14 14 ,-.., P(6, 2) ---* [ X2 X31 .

13

X4 15 X4

_ X5 _ mainmemory VO V1

The contents of a vector register can be stored into the main memory with the instruction

• ,Y, L VK.

The contents of the vector register VK are stored in memory having the initial address Y at stride L. For example, if

VO= [x°1 , xi

then the result of the vector instruction

0, Y, 3 VO

is Y = X0 Xi

If

V1= [x2] ,V2 = [x41 , X3 s5

then the result of the sequence of store instructions,

• Y, 3 VO go, Y + 1, 3 V1 ▪ Y+2, 3 V2

is the sequence of stores

Y = X0

Y = X0 X2 _ Xi X3 ___

Y = X0 X2 X4 Xi "" X3 _X5 . "-

The memory transfer operation that takes the contents of N vector regis- ters with vector size M and stores them in memory with stride N will be denoted by LiAri N :

Lir I Lir

Page 61: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.5 Vector Implementation 47

-X0 -

X2

[xi [ [X41 xo x4 (Lgyi = Lg = P(6, 2) X5 Xi .

X3

_ X5 _

VO V1 V2 memory

A load-stride followed by a store-stride can carry out a stride permuta-tion. The stride permutation P(6,2) can be implemented with a sequence of operations. Take VL = 2:

• VO X, 1 • V1 X + 2, 1 load at stride 1 • V2 X + 4, 1

41, Y, 3 VO • ,Y + 1, 3 V1 store at stride 3. • ,Y +2, 3 V2

Tensor product operations of the form A 0 IN can be implemented di-rectly with vector instructions as long as N is less than or equal to the maximum vector register length. For example, if

y = (A 0 /3)s,

where

[1 1 A =

1 —11 ' then

xo + X3-

Xi +

X2 ± X5 Y = — X3

Xi - X4

-X2 - X5_

If X0 X3

VO = [Xi , = [X41 ,

X2 X5

then y is computed by the vector instructions

• V2 VO+ V1

• V3 VO — V1

Page 62: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

48 2. Tensor Product and Stride Permutation

The first instruction is the vector addition of VO and V1 placed in the vector register V2. The vector Y in memory is obtained by storing V2 followed by V3 back in memory. If Y is the location of the output vector, we obtain this with the following instructions:

Y, 1 V2 0,Y +3, 1 V3

The computation y = (A 0 /3)P(6, 2)x offers a more complicated ex-ample. The first step is to load x into two vector registers at stride 2.

so xi

VO = x21 , -= [x31 . X4 X5

The next step is

V2 = VO + V1 = [xo + xi x2 + X3 X4 ± X5 7

V3 = VO — V1 = [xo — xi x2 — x3 x4 — x5 .

Finally, the contents of V2 are stored at stride 1 beginning at Y and the contents of V3 are stored at stride 1 beginning at Y + 3. In effect, the stride permutation P(6, 2) is implemented for free since, in order to program A g ./3 as a vector operation, we must load the input vectors and store the results even in the absence of an input permutation.

The operation y = P(6, 3)(A 0 /3)x can be performed in the same way. Implementing a tensor product operation becomes significantly more dif-

ficult if segmentation is required, i.e., vector operations are required on vectors that do not fit inside vector registers. To be concrete, assume that the maximum size of vector registers is 64, and we would like to evaluate A /128. This acts naturally on vectors of size 128. The maximum size of the vector registers is 64, and we would like to replace A 0 /us by a vector operation on vectors of size 64. Since

A 0 /128 = P(256, 128)(1Z 0 A 0 /64)P(256, 2),

the computation of A /128 is equivalent to /2 0 A 0 /64 up to input and output permutations. Two copies of a vector instruction on vectors of size 64 are required. But P(256, 2) naturally forms vectors of size 128,

_ X0 X1

X2 X3

_ X254 X255

This problem can be solved by the factorization,

P(256,2) = (P(4,2) 0 /64X/2 0 P(128, 2))•

Page 63: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.5 Vector Implementation 49

The first factor, /2 P(128, 2), decomposes the input vector of size 256 into two consecutive segments of size 128, and performs the stride permutation P(128, 2) on each of the segments:

X0 X4 X128 X126

VO = X.2 7 V1 = X,3 , V2 = [x1.3° , V3 — X1.31 .

X126 X127 X254 X255

Setting VL = 64, these vectors can be loaded by the instructions,

• VO X, 2 • V1 X + 1, 2 • V2 X + 128, 2 • V3 X + 129, 2

The second factor, P(4, 2) /64, can be thought of as a permutation of the segments, giving

VO V2 V1 V3

These two steps can be combined by carrying out the load-strides as be-fore, and by changing the initial offsets of the load-strides to perform the permutations of the segments.

The vector operation A /64 is performed on (VO, V2) and on (V1, V3):

V4 = VO + V2

V5 = VO — V2

V6 = V1 + V3

V7 =V1— V3

To complete the computation, the vectors V4, V5, V6 and V7 must be stored back in the memory in the order given by P(256, 128). This can be done with the store instructions by first permuting the segments to

V4 V6 V5 V7

and then storing the results with the store instructions

• Y, 2 V4 0, Y + 1, 2 V6 • Y + 128, 2 V5 ▪ Y + 129, 2 V7

This corresponds to the factorization

P(256,128) = (/2 0 P(128, 64))(P(4, 2) /64).

Page 64: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

50 2. Tensor Product and Stride Permutation

2.6 Parallel Implementation

Tensor product identities provide powerful tools for matching tensor prod-uct factor computations to specific machine characteristics such as locality and granularity. Consider the tensor product factor IN 0 A, where A is taken as in section 5. In the simplest case, N separate processors are avail-able for the computation, and ea.ch processor has access to a shared memory containing the input vector X and the output vector Y. (In this section, we will use capital letters as variable names of the data used in codes.) Number the processors

0, 1, , N — 1.

Define the action A by

Y(0) = X(0) + X(1),

Y(1) = X(0) — X(1).

Assign to each processor the code

A(2, Y, X).

The n-th processor acts by this code on the components X(2n), X(2n + 1) of the input vector X and places the results in memory as the components Y(2n), Y(2n + 1) in the output vector Y. In the same fashion, A 0 IN

is computed by having the n-th processor act on the components X (n), X (n N). The results are placed in memory as the components Y(n), Y(n + N) in the output vector Y.

If the number M of processors is less than N , then the problem is more complicated. Suppose that N = M L. Using the identity

IN 0 A = /Ai (/"L 0 A),

we assign the code //, 0 A to each processor to perform the computation as above with M replacing N in the discussion. In the same way, the identity

A® = P(2N,21,)(.1 m 0 (A 0 I L))P(2N, M)

suggests that each processor be assigned the code for A0h with addressing determined by the input and output stride permutations. In particular, the m-th processor, 0 < rn < M, performs A 0 IL on the 2L components

.

X (rn), X (nr + M), , X (m, e2L — 1)M),

and places the result in memory as the 2L components

Y(m), Y(rn + M), , Y (m (2L — 1)M).

Page 65: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

2.6 Parallel Implementation 51

Alternatively, we can use the identity

A 0 = (P(2M, 2) 0 /L)(im 0 A® h)(P(2M, M) 0 IL).

Consider the factor im 0 A0 IN. As above, we can implement the action by M parallel computations of A0 IN. If MN processors are available, we can use the identity

im 0 A 0 = P(2MN,2M)(1114N 0 A)P(2MN,N)

or the identity

im 0 A 0 = (Im P(2N,2))(ImN 0 A)(1m P(2N, N))

to compute Im A0 IN as MN paxallel computations of A. In this way, we naturally control the granularity of the parallel computation and fit the computation to granularity and to the number of available processors. The stride permutations give an automatic addressing to the processors.

Theses ideas can be used to compute the tensor product of, say T, factors of A in parallel. By the fundamental factorization,

A 0 • A = H(/2T-t 0 A0 I2t-.), t=i

we decompose the computation into a sequence of computations which, at the t-th stage, is given by

-r2T-t 0 A 0 ht-i.

To carry out the computation in this way requires a barrier synchroniza-tion to guarantee that the input to the next stage is correct. The natural interpretation of ea,ch stage leads to a different degree of parallelism at each stage. The factorization must be modified by different addressing, and

-,•hence different programming at each stage is required to get a consistent degree of parallelism. We turn to the factorization,

A0 A = P(2T, 2)(12T-1 A), t_-1

given in section 2. The addressing is the same at each stage, and the natural interpretation has the maximal degree of parallelism at each stage. For example,

3

A0A0A=HP(8,2)(/40A), t=i

arid at each stage of the computation we compute

y = P(8,2)(I4 A)x. (2.22)

Page 66: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

52 2. Tensor Product and Stride Permutation

We compute this as Y(0)

Y(4)

Y(1)

Y(5)

Y(2)

Y(6)

Y(3)

Y(7)

=

=

=

=

=

=

=

=

X(0)

X(0)

X(2)

X(2)

X(4)

X(4)

X(6)

X(6)

+ X(1)

— X(1)

+ X(3)

— X(3)

+ X(5)

— X(5)

+ X(7)

— X(7).

(2.23)

If four processors are available, the rn-th processor, 0 < rn < 4, computes

Y(m) = X (2m) + X (27rt + 1),

Y (m + 4) = X(2m) — X(2rn + 1).

Suppose that we have two parallel processors. Then we rewrite (2.22) as

y = P(8, 2)(/2 (/2 0 A))

and compute the first four lines of (2.23) on the 0-th processor and the second four lines on the first processor. Thus, on the rrt-th processor, 0 < m < 2, we compute

for n = 0, 1,

Y(2rn + n) = X(4771, + 2n) + X(4rn + 2n + 1),

Y(2m + n + 4) = X (4rre + 2n) — X(4rn + 2n + 1).

Suppose that we wish to compute

(DA = P (21°, 2) ( /29 0 A), t=1 t=1

with eight processors. Writing

P(219, 2)(/29 0 A) = P(219, 2)0-23 0 (/26 A)),

at each stage the m-th processor, 0 < < 7, computes —

for n = 0, , 63,

Y(27rn + = X(27m + 2n) + X(27in + 2n + 1)

Y(27rn + n + 29) = X(27rn + 2n) — X(27rn + 2n + 1).

Page 67: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 53

In this example, ten passes are required. After each computational stage, the results are stored back to the main (shared) memory. It may be advan-tageous to do more computations before doing the memory operation. We can do this by the factorization

5 0 A = H p(210, 22 )(/25 (h5 A 0 A))•

t=i

There are only five computational stages that reduce transfers to the main memory. However, the granularity has been increased since A has been replaced by A 0 A.

In section 3, we discussed the importance of stride permutation factorizations in implementation. For example,

P(21°, 2) -= (P(24, 2) 0 /26)(/23 0 P(27, 2))•

In the case of eight processors, /23 0 P(2', 2) is carried out by permuting elements in local memory for each of the processors by P(27, 2). The results then can be transferred to the main memory in segments of length 26 permuted by P(24, 2). In this way, the transfer to the main memory given by P(21°, 2) is replaced by decomposing this transfer into a collection of local permutations followed by a global block permutation.

References

[1] Johnson, J., Johnson, R., Rodriguez, D. and Tolimieri, R. "A Method-ology for Designing, Modifying, and Implementing Fourier Transform Algorithms on Various Architectures", IEEE Trans. Circuits Sys., 9(4), 1990.

[2] Hoffman, K. and Kunze, R. Linear Algebra, Second Ed., Prentice-Hall., 1971.

[3] Nering, E. D. Linear Algebra and Matrix Theory, John Wiley & Sons., 1970

Problems

11. Show that the tensor product of vectors is bilinear.

2. Show that the tensor product of matrices is bilinear.

Page 68: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

54 2. Tensor Product and Stride Permutation

3. Compute (A 0 B) (a b) for

B [0 1 01 [2 1

A =

1 ' 2 1 1 0 1 1

a [11

0 b = [231 • 1

4. For vectors a and b running over all vectors of sizes 2 and 3, respectively, show that the tensor products a 0 b span C6.

5. Show the general result: For vectors a and b running over all vectors of sizes L and M, respectively, the tensor products a 0 b span CLm.

6. The canonical basis of CL is the set of vectors given by the vectors of size L,

-1- - 0- - 0 0 1 0

(L) 0 (L) 0 e(L) e„, — e —

0 0 _0_ _1 _ _

Show that the LM tensor products

e.,L) e(8m) 0 < r < L, 0 < s < M

describe the canonical basis of CLm. (Explicitly derive eV') e(sm).)

7. Describe P(27,3), P(27,9) and show that

P(27, 3)P(27, 9) = /27.

8. Compute the matrix product P(12, 2)P(12, 3).

9. Show that the set {P(81,38) : 0 < s < 4} is a cyclic group. List the generators.

10. Compute P(8, 2)(A B)P(8, 4), where

01 00 01 00 [1 1 1

A = B-= 1 0 0 0 ' 1 —1 ' 0001

and show that it is equal to B 0 A.

Page 69: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3 Cooley-Tukey FFT Algorithms

3.1 Introduction

In the following two chapters, we will concentrate on algorithms for com-puting the Fourier transform (FT) of a size that is a composite number N. The main idea is to use the additive structure of the indexing set Z/N to define mappings of input and output data vectors into two-dimensional arrays. Algorithms are then designed, transforming two-dimensional arrays which, when combined with these input/output mappings, compute the N-point FT. The stride permutations of chapter 2 play a major role.

The first additive fast Fourier transform (FFT) algorithm is described in the fundamental work of J. W. Cooley and J. W. Tukey [2] in 1965. Straight-forward computation of the N-point FT requires a number of arithmetic operations proportional to N2. In scientific and technological applications, the transform size N is commonly too large for direct digital computer im-plementation. The Cooley-Tukey FFT algorithm significantly reduces the computational cost for many transform sizes N to an operational count proportional to N log N. This result set the stage for widespread advances in digital hardware, and is one of the main reasons that digital computation has become the overwhelmingly preferred method for computing the FT in most scientific and technological applications.

The years following publication of the Cooley-Tukey FFT saw vari-mls implementations of the algorithm on sequential machines[1]. Recently, hciwever, as vector and parallel computer architectures began to play in-creasingly important roles in scientific computations, the adaptation of the

Page 70: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

56 3. Cooley-Tukey FFT Algorithms

Cooley-Tukey FFT and its variants to these new architectures have become a major research effort. The tensor product provides the key mathematical language in which to describe and analyze, in a unified format, similarities and differences among these algorithms. An account of these variants not using this language can be found in [8]. In 1968, M. Pease [5] utilized the language of the tensor products to formulate a variant of the algorithm that is suitable for implementation on a special purpose parallel computer. In 1983, C. Temperton [9] provided tensor product formulations of the most conunonly known variants.

One of the advantages of using tensor product language to describe FT algorithms is that this mathematical language may be used as an analytic tool for the study of algorithmic structures for machine hardware and soft-ware implementations as well as the identification of new algorithms. For instance, an inherent part in the study of computer implementation of FT algorithms is the analysis of the data communication aspects of the algo-rithms that manifest themselves during implementation procedures. These data communication aspects can be best studied, in turn, through the anal-ysis of the permutation matrices, the stride permutation, which appear in our tensor product formulation of the FT algorithms.

We present, in tensor product form, the description of FT algorithms with the following objective in mind: To provide the user of these al-gorithms with guidelines that will enable him to effectively study their implementation on either special purpose or general purpose computers. By "effectively studying their implementation," we mean to be able to pro-duce algorithms that best conform to the inherent constraints on any given machine hardware architecture.

In this chapter, we consider the Cooley-Tukey FFT algorithm corre-sponding to the decomposition of the transform size N into the product of two factors. The convention introduced in chapter 2 relating two-dimensional arrays to one-dimensional arrays will still be enforced: If X is an M x L matrix, then we associate to X the ML-tuple x formed by reading, in order, down the columns of X. In the sections that follow, al-gorithms will be designed by using the additive structure of the indexing set to associate a two-dimensional array to a one-dimensional array.

3.2 Basic Properties of FT Matrix

The FT matrix of order N, denoted by F(N), is defined as

F(N) = [w3k ] , w = exp(27rillV), i =

The conjugate of w, denoted by w*, is

w* = exp(-27rilN) =

Page 71: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.3 An Example of an FT Algorithm 57

and (wk)* _ w—k mod N = W N—k

Direct computation shows that

F(N)F(N)* N IN.

The inverse FT matrix is

F(N)-1 =- IT7.1 F(N)* ,

and F(N) is symmetric, i.e.,

F(N)t = F(N).

3.3 An Example of an FT Algorithm

The eight-point FT is given by the formula

7

= E W 1kXk, 0 < 1 < 8, w = exp (27ri/8). (3.1)

k—o

Associate to the input vector x the 4 x 2 array p X41

X = Xi X5

X2 X6

X3 X7

and set ,,, v-t [xo xi x2 x3] Xi = f 1 = .

X4 X5 X6 X7

The vector xi corresponding to Xi can be obtained by

xi = P(8, 4)x,

where P(8,4) is the eight-point stride 4 permutation. Associate to the output vector y the 2 x 4 array

y YO Y2 Y4 Y6

Y1 Y3 Ys Y7

*e will rewrite (3.1) in terms of the arrays Xi and Y. First,

Xi(ki,k2) = x(k2 +4ki), 0 < < 2, 0 < k2 < 4,

Page 72: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

58 3. Cooley-Tukey FFT Algorithms

Y(11,12) = + 212), 0 < ti < 2, 0 < /2 < 4.

Placing these formulas into (3.1), we have

3 1 w(k2+4ki)(h+212)]

Y(ii /2) = E [E xicki,k2) (3.2) k2=0 ki=0

Set v = w2 -= i and u = w4 = —1. Then

w (k2+4ki)(ii+212) = ukiiivk2/2wk2ii

We can rewrite (3.2) as

3 1 wk211 vk2/2.

Y(11,12) = E 1E xi(ki, k2)uk1111 (3.3) k2=o ki=o

We can decompose (3.3) into a sequence of operations as follows. First we compute the inner sum

k2) = E 0 < /, < 2, 0 < k2 < 4. (3.4) ki=o

We see from (3.4) that the array Yi is computed by taking the two-point FT of each column of the array Xi. In tensor product notation,

yi =- (/4 F(2))xi,

where yi is the vector corresponding to The next stage of the computation,

Y2(ii,k2) = Yi(ii,k2)wk2/1, (3.5)

introduces the twiddle factor. In matrix notation,

Y2 = TY17

where y2 is the vector corresponding to Y2 and T is the diagonal matrix

T = diag(1, 1, 1, w , 1, w2 , 1, w3)

We complete the computation of Y from (3.4) by

3

Y(/1, /2) = E y201,kovk212,, 0 < < 2, 0 < /2 <A, k2=o

which is given by the four-point FT of each row of the array Y2. In tensor product notation, (3.5) is written

Y = (F(4) 0 /2)Y2-

Page 73: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.4 Cooley-Tukey FFT for N = 2M 59

Combining these formulas, we have

y = (F(4) 0 /2)T(Li F(2))P(8, 4)x.

This leads to the factorization

F(8) = (F(4) 0 /2)T(/4 F(2))P(8, 4).

3.4 Cooley-Tukey FFT for N = 2M

The N-point FT is given by the formula

N-1 ik Yi =- E w xk, 0 < < N, w = exp(2rilN). (3.6)

k=0

Associate to the N-point input data x the M x 2 array

X: :mm4

X =

X m SN _1

and set

Xi = — [ 0X Xi • • • X M-1

Xm Xm+1 • • • XN-1

The corresponding N-tuple xi is given by

xi = P(N, M)x.

t Associate to the N-point output data y the 2 x M array

y YO Y2 • • • YN-2

Y1 Y3 • YN-1

We can rewrite (3.6) in terms of the two-dimensional arrays Xi and Y. First,

(ki , k2) = x(k2 + Mki), 0 < ki < 2, 0 < k2 < M, (3.7)

and Y(/i, /2) = + 2/2), 0 < /1 < 2, 0 < /2 < M. (3.8)

Using (3.7) and (3.8), we have

m—i Y(/1,/2) = E (E xi(ki,k2).,11)wk2iivk212, (3.9)

k2=o ki=o

Page 74: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

60 3. Cooley-Tukey FFT Algorithms

where v = w2, u = wm = —1 and wN = 1. The inner sum,

Yi(zi,k2) = E )(lc/4, k2)ukill,

ki=0

computes, for each 0 < k2 < M, the two-point FT of the corresponding column of Xi. Let xi be the vector associated with the two-dimensional array Xi. To compute (3.9), we partition xi into M vectors each of length 2 given by the columns of Xi, and compute the two-point FT of these vectors. The ouput is placed in the columns of In tensor product notation,

-= (Im 0 F(2))P(N,M)x.

There are two remaining steps in the computation. First, we compute

Y2(11, k2) = k2)wk211,

which can be described by the diagonal matrix multiplication

Y2 - TY1,

where T = diag(11 1 w 1 wm-1)

The final computation,

m_i

Ycii,/2) = E Y20,,k2),k212,

k2=0

computes the M-point FT of the rows of Y2 that can be written as

y = (F(M) 12)Y2.

This discussion leads to the following theorem.

Theorem 3.1 Let N = 2M. Then

F(N) = (F(M) 0 12)T(Im F(2))P(N,M),

T = diag(11 lw ... 1 wm-1).

The permutation P(N, M) naturally forms vectors of size 2 on which the action of im 0 F(2) can be computed in parallel. The twiddle faqtor T can be thought of as a block diagonal matrix consiaing of M diagonal blocks of size 2. Each of the M blocics acts, in parallel, on the two-dimensional vector resulting from the action of /A4 0 F(2). The computation is completed by the vector FT, F(M) /2, the vector FT F(M) acting on the set of M two-dimensional vectors.

Page 75: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.5 Twiddle Factors 61

3.5 Twiddle Factors

In this section, we consider the twiddle fa.ctors or the diagonal matrices appearing in Cooley-Tukey FFT algorithms.

For N > 1, set

- 1

D(N) = , w = exp(27ri N),

wN-1

and for 1 < M < N

_[1

D m (N) —

wm —1

Example 3.1 [01 01] D(2) =

Example 3.2

1

D(4) = i _i , D2(4) = [

—i

[01 Oi ] ,

and we have D(4) = D(2) 0 D2(4).

- Assume that N = 2M. With w = exp(27ri/N) and w2 = exp(27ri I M),

[ D2 (N) w2 D2(N)

D(N) = .

w2(m — 0 D2(N)

By the definition of the tensor product,

D(N) = D(M) 0 D2(N).

The general result will be stated as a theorem.

iheorem 3.2 If N = ML, then

D(N) = D(L) 0 Dm(N).

Page 76: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

62 3. Cooley-Tulcey FFT Algorithms

Proof Since

w M = e (2wi I L)

we can write

D m (N)

D(N) = wm D m (N)

wm (L-1) Dm(N)

which by definition proves the theorem.

Let N = M L. Define the matrix direct sum

[im L-1 D m (N)

Tm (N) = sED Dim(N) = t=o

The matrix Tm (N) can be viewed as a block diagonal matrbc consisting of L diagonal blocks of size M. In this way, it naturally acts, in parallel, on L vectors of size M. The diagonal matrix T of Theorem 3.1 is T2(N).

Stride permutations act on these diagonal matrices as follows.

Theorem 3.3 If N = M L, then

m—i P(N, M)Tm(N)P(N, L) = (131 (N) = TL(N).

rre=0

Proof The matrix on the left-hand side is the diagonal matrix formed from the product of P(N, M) with the vector formed from the diagonal components of Tm(N). Listing the diagonal elements of Tm (N) as a row,

... 1; W ... WM-1; ... ; WL-1 W(L-1)(M-1)

and striding through the row with stride M, we have

... 1; W . . . WL-1; ; WM-1 ... W(L-1)(M-1),

proving the theorem.

As expected, the natural block-like structure of Tm(N) is transformed into M diagonal blocks of size L by conjugation by P(N,M). The judi-cious application of this result provides a means of keeping consistency throughout the computation.

Page 77: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.6 FT Factors 63

3.6 FT Factors

In general, we will need to compute the actions of

Im FM, FM Im.

Factors of these types were studied in chapter 2. Although the arithmetic cost of both actions is the same, the efficiency of implementation can vary on different types of machine architecture. Diagrams can be helpful. The action of F(2) can be represented by the butterfly diagram

Xo. .X0 ±

Xi. .so —

Example 3.3 The action of 12 0 F(2) iS represented by

Xo• .so si X i. .X0 —

X2. .X2 ± Xo

So. .X2 — X3

which we see consists of two parallel two-point FTs.

In general, the action of im F(L) can be computed by M parallel L-point FTs.

Example 3.4 The action of F(2) h is represented by

X0. .X0 ± X2

Xi. .X1 + X3

X2. •X0 — X2

Xo. .X1 — X3

In the previous chapter, we saw that this is a vector operation. Associate to x the equivalent two-dimensional array

X [ X° X2 . Xi X3

Form the two vectors of length 2 from the columns of X,

x(0) = [x° , x(1) = [x2] .

xi 1 s3

We say that these vectors are formed with stride 1. The action of F(2) h is. given by the two-point vector FT,

iy(0)1 = Ft21 [x(0)1

[y(1)] [x(1)]

Page 78: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

64 3. Cooley-Tukey FFT Algorithms

We read out the vector y = (F(2) 0 /2)x in natural order. As in chapter 2, the commutation theorem can be used to interchange

parallel and vector operations by the formula

P(N, L)(/m F(L))P(N, M) = F(L) Im, N = ML.

As an example, observe that the following diagram also computes the action F(2) 0 /2 (which is a vector operation) 8.s a parallel operation using the commutation theorem:

Example 3.5 F(2) 0 h = P(4, 2)(/2 0 F(2))P(4, 2):

xo • •XO •XO ± X2 •XO ± X2

Xl• •X2 •Xo - X2 •Xi -I- X3

X2. •Xl .XI ± X3 •XO - X2

X3. .X3 .X1 - X3 •Xl - X3

Example 3.6 y (F(2) 0 /3)x. Then

Yo = xo X3/ Y3 = XO X3/

Y1 = xl ± x49 Y4 = X1 - X49

Y2 = X2 ± X5/ Y5 = X2 - X59

which can be represented by

Xo. .X0 ± X3

Xi. .XI + X4

X2. •X2 ± X5

X3. •Xo - X3

X4. •Xl - X4

X5. •X2 - X5

This computation can be carried out in three stages, as indicated by the diagram

Xo. .X0 •Xo ± X3 .X0 ± X3

Xi. .X3 .X0 - X3 .X1 ± X4

X2. .X1 .XI ± X4 .X2 ± X5

X3. .X4 .X1 - X4 •XO - X3

X4. .X2 .X2 ± X5 .X1 - X4

X5. .X5 .X2 - X5 .X2 - X5

which corresponds to the factorization

F(2) o h = P(6, 2)(/3 F(2))P(6, 3).

3.7 Variants of the Cooley-Tukey FFT

In the notation of the preceding section,

F(N) = (F(M) 12)T2(N)(Im F(2))P(N,M), N = 2M. (3.10)

Page 79: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.7 Variants of the Cooley-Tukey FFT 65

We will now derive other factorizations of F (N). These variants are dis-tinguished by the flow of the data through the computation. They make available to the algorithm designer several possibilities for computing N-point FT, all having the same arithmetic, but differing in the storage and gathering of data.

By the commutation theorem,

P(N, M)(I2 F(M))P(N, 2) = F(M) h.

Using this formula in (3.10), we have

F(N) = P(N, M)(I2 F(M))P(N,2)T2(N)(Im F(2))P(N, M).

Both of the FT factors are parallel factors; the first is M copies of F(2) and the second is two copies of F(M). The first part of the computation,

T2 (N)(I m F(2))P(N, M),

is naturally thought of as a parallel action on vectors of size 2, while the second part,

P(N, M)(I2 F(M))P(N, 2),

is a parallel action on vectors of size M, where the permutation matrices describe data readdressing.

Applying the commutation theorem in the form

P(N, 2)(Im F(2))P(N, M) = F(2) 0 m

leads to the factorization

F(N) (F(M) I2)T2(N)P(N, M)(F (2) 0 m).

The FT factors are now vector factors; the first acting on vectors of size M and the second acting on vectors of size 2.

A second technique for manipulating factorizations comes from the transpose. Taking the transpose on both sides of (3.10) and using the formulas

F(N)t = F(N), pt p-1,

(A 0 .13)t = 0 .13t ,

we have F(N) = P(N,2)(Im F(2))T2(N)(F(M) /2).

';%/k permutation of output data is now required. Applying transpose and the commutation theorem, other factorizations

can be derived. There are several features that distinguish between these

Page 80: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

66 3. Cooley-Tukey FFT Algoritluns

factorizations: input permutation, output permutation and internal per-mutation, the type of lower order FT factors and their placement in the computation. We single out the four factorizations derived in this section for future reference. Set P = P(N, M) and T = T2(N), N = 2M. Using Theorem 3.1, we have

Case N = 2M:

(al)F(N) = (F(M) I2)T(Im F(2))P,

(b1)F(N) = P-1(Im F(2))T(F(M) 0 12),

(c1)F(N) = P(I2 F(M))13-1T(Im F(2))P,

(d1)F(N) = (F(M) I2)T P(F(2) 0 m).

3.8 Cooley-Tukey FFT for N = ML

Let N = ML and consider the N-point FT

N-1 yk = E wicn.„, 0 < k < N, w = exp (27i/N). (3.11)

n=0

We will derive a Cooley-Tukey algorithm computing the N-point FT. Associate to the N-point input vector x the L x M array

xxoi x7+1 xxol(m-1)1L)L+1

X =

xL-1 X2L-1 XN-1

and set

[ X0 X1 • • - XL-1

XL XL+1 • • • X2L-1 XI = Xt = .

• .

X(M-1)L X(M-1)L-1-1 • • • XN - 1

The corresponding N-tuple xi is given by applying the N-point stride-L permutation P(N,L) to x,

xi = P(N, L)x.

Associate to the output vector y the M Ax L array

Yo Ym • • • Y(L—i.)m

yi Ym+1 • • • Y(L-1)M+1 Y =

_Ym-1 Y2M-1 YN-1

Page 81: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

3.8 Cooley-Tukey FFT for N ML 67

We can write

k2) = s(k2 + kiL), 0 < < M, 0 < k2 < L,

Y(11,12) = Y(11 +12M), 0 <11 < M, 0 <12 < L.

Formula (3.11) can be rewritten as

L-1 M-1

Y(11, /2) = E E x,(kl,kow(k24-ki..0(h±/2m) • k2=0k,=0

on)

Now, (k2 kiL)(11 + /2M) -= k2/1-1- ki/iL + k2/2M mod N.

Set u = wL and v = wm. Since wN = 1, we can rewrite (3.12) as

L-1 M-1 )wk2/ivk212.

Y(/1, /2) = E E (3.13)

k2=0 ki=o

The argument proceeds as in section 2. First observe that the inner sum,

m—i

= E k2).kiti ,

k2=o

computes, for each 0 < k2 < L, the M-point FT of the k2-th column of Xi and places the result in the k2-th column of Let be the vector formed by reading, in order, down the columns of Then

Yi = 0 F(M))xi = (./1 0 F(M))P(N,L)x.

The next stage of the computation,

Y2(4,k2) = Yi(ii,k2)iuk2/1,

can be given by the diagonal matrix multiplication

Y2 = Tm(N)Yi-

We complete the computation by

L-1

Y(11, /2) = E Y2(ii,k2)vk212,

k2=0

airhich computes the L-point FT of the rows of 172)

Y = (F(L) Im)Y2.

Page 82: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

68 3. Cooley-Tukey FFT Algorithms

Theorem 3.4 If N = M L, then

F(N) = (F(L) 0 m)Tm(N)(IL F(M))P(N,L).

The first part of the computation,

Tm(N)(IL F(M))P(N,L),

can be viewed as a parallel action on vectors of size M, while the second part, F(L) /m, is the vector FT on the resulting L vectors of size M.

The transpose and the commutation theorem can be used to derive other factorizations as in section 7. We single out the following list for future reference:

Case N = L:

(a2)F(N) = (F(L) 0 I m)Tm (N)(I L F(M))P(N, L).

(b2)F(N) = P(N,L)(Im 0 F(L))P(N, M)Tm(N)(IL 0 F(M))P(N, L).

(c2)F(N) = (F(L) 0 I m)Tm (N)P(N, L)(F(M) 0 IL).

(d2)F(N) = P(N, M)(IL 0 F(M))Tm(N)(F(L) 0 I m).

3.9 Arithmetic Cost

The number of arithmetic operations required to carry out a computation is an important part of the cost of the computation and has traditionally occupied the most attention. On modern machines, a large part of the computation time can be spent on data communication; but, there is as yet little general theory measuring this aspect of the overall computation. We gave some general guidelines in the previous sections, but much more, especially on specific architecture, remains unanswered. Arithmetic cost is much easier to estimate.

In the class of algorithms listed in section 7, each algorithm has the same arithmetic cost if we ignore the underlying arithmetic involved in ad-dressing. Consider factorization (c2). We require an input permutation at no arithmetic cost. Then the L M-point FT must be computed, followed by a diagonal matrix multiplication. In the last stage, since F(L) /m and /m F(L) are the same up to data permutation, the equivalent of M L-point FTs must be computed. If we have some algorithm comput-ing M-point FT with in(M) multiplications and a(M) additions, then the algorithms of section 8 compute N-point"*FT

Ma(L) + La(M) (3.14)

additions and Mm(L) + N + Lm(M) (3.15)

Page 83: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Rderences 69

multiplications. The N in (3.15) comes from the diagonal matrix multipli-cation. Since many of the diagonal entries are 1 in practice, we can reduce this cost.

If we take a(M) = M(M — 1), (3.16)

rn(M) = M2 , (3.17)

then (3.15) becomes N(M±L±1), (3.18)

which should be compared to m(N) = N2.

References

[1] Cochran, W. T., et al. "What is the Fast Fourier Transform?," IEEE Trans. Audio Electroacoust. 15 45-55, 1967.

[2] Cooley, J. W. and Tukey, J. W. "An Algorithm for the ma.chine Calculation of complex Fourier Series," Math. Comp. 19 1965, 297-301.

[3] Gentleman, W. M., and Sande, G. "Fast Fourier Transform for Fun and Profit," Proc. AFIPS, Joint Computer Conference 9 563-578, 1966.

[4] Korn, D. J., and Lambiotte, J. J. "Computing the Fast Fourier Transform on a vector computer," Math. Comp. 33 977-992, 1979.

[5] Pea.se, M. C. "An Adaptation of the Fast Fourier Transform for Parallel Processing," J. ACM18 843-846, 1971.

'3‘

[6] Burrus, C. S. "Bit Reverse Unscrambling for a Radix 2' FFT," IEEE Trans. Acoust., Speech and Signal Proc. 36 July, 1988.

[7] Singleton, R. C. "An Algorithm for Computing the Mixed-Radix Fast Fourier Transform," IEEE Trans. Audio Electroacoust. 17 93-103, 1969.

[8] Swartztrauber, P. N. "FFT algorithms for vector computers," Parallel Computing, 1 45-63, North Holland, 1984.

[9] Temperton, C. "Self-Sorting Mixed-Radix Fast Fourier Transforms," J. Comput.Phys., 52(1) 198-204, 1983.

[10] Burrus, C. S. and Park, T. W. DFT/FFT and Convolution Algo-rithms, John Wiley and Sons, 1985.

Page 84: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

70 3. Cooley-Tukey FFT Algorithms

[11] Oppenheim, A. V. and Schafer, R. W. Digital Signal Processing, Prentice-Hall, 1975.

[12J Nussbaumer, H. J. Fast Fourier Transform and Convolution Algo-rithms, Springer-Verlag, 1981.

Problems

1. Show that F(N)F(N)* = N IN .

2. Compute T4(16), T4(64) and T4(128).

3. Compute T3(9), T3(27) and T3(81).

4. Show directly that

P(12,4)T4(12)P(12,3)= T3(12).

5. Diagram the computation of F(3) 0 14 using the identity

F(3) 0 = P(12,3)(14 0 F(3))P(12,4).

6. Derive directly the factorization

F(27) = (F(9) 0 /3)T3(27)(/9 F(3))P(27,9).

7. From Theorem 3.3, use the transpose to derive the factorization

F(N) = P(N, S)(IR 0 F(S))P(N, R)TR(N)(Is 0 F(R))P(N, S).

8. Prove that (A 0 B)t = Bt

9. Show directly, without using Theorem 3.4, that the factorization

F(N) = P(N,2)(Im F(2))T2(N)(F(M) I2)

implies the factorization

F(N) = (F(2) 0 m)Tm(N)(I2 F(M))P(N,2).

10. Give an example showing that die arithmetic count foy.the com- putation y = F(12)x depends on the factorization taken for 12.

Page 85: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4 Variants of FT Algorithms and Implementations

4.1 Introduction

In chapter 3, additive FT algorithms were derived corresponding to the fac-torization of the transform size N into a product of two factors. Analogous algorithms will now be designed corresponding to transform sizes given as a product of three or more factors. In general, as the number of factors increases, the number of possible algorithms increases.

In this chapter, we derive the Cooley-Tukey [3] and Gentleman-Sande [4] FT algorithms. They are related by matrix transpose and distinguished by whether bit-reversal is applied at input or output. In both cases, FT factors of mixed-type,

im F(K) h (4.1)

appear in the factorization as discussed in chapter 2. This factor can be viewed as M concurrent FTs on vectors of length L. Applying the commutation theorem, this factor can be replaced by the 'vector' factor

F(K) IML) (4.2)

which can be viewed as the vector K-point FT on vectors of length ML. In theory, a vectorization of the Cooley-Tukey FFT algorithm is produced by systematically replacing all mixed-type FT factors by their corresponding v,ector factors. However, implementing a vector factor on a specific vec-WI- computer cannot, in general, be accomplished without breaking up the computation into pieces that can be fit into the vector registers. This par-titioning of the cornputation introduces concurrency back into the factor,

Page 86: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

72 4. Variants of FT Algorithms and Implementations

and is one of the main difficulties in matching algorithm to architecture. This problem was discussed in chapter 2.

Parallel algorithms and vector algorithms are easily related by the commutation theorem. The commutation theorem introduces explicit permutation matrices into the factorization. Not surprisingly, these per-mutation matrices are built from the stride permutations. Variants of the Cooley-Tukey FFT algorithms, to a large extent, depend on the permuta-tion matrices that are used to bring about vectorization or parallelization. For example, for N = MLK, we have the two formulas

F(K) = P(N, LK)(Im F(K) IL)P(N, LK)-1 (4.3)

and

F(K) 0 IML

(4.4)

= (P(MK, K) 0 IL)(Im F(K) IL)(P(MK,K)-1 ).

The variants derived by Pease [6], Korn and Lambiotte [5], and Agarwal and Cooley [1] depend on factorization (4.3) while the auto-sort variant derived by Stockham in [9] depends on factorization (4.4). The Korn-Lambiotte FFT algorithm is the vector analogue of the parallel FFT algorithm of Pease. Two features distinguish these algorithms. First, the main compu-tational stages are the same in all of these algorithms and are given by the vector FT factor (4.2). In the Korn-Lambiotte FFT algorithm and the Agarwal-Cooley FFT algorithm, bit-reversal is required at input or output, which can be a time-consuming step on many vector computers. However, the internal permutations introduced by the commutation theorem, as seen in (4.3), have uniform structures throughout the different stages of the computation, and can be implemented by the stride load memory feature of many vector computers on vectors of maximal lengths. The auto-sort variant does not require bit-reversal at input or output. It accomplishes this savings by distributing bit-reversal throughout the computation. From (4.4), we see that the internal permutations are tensor products and can be viewed as vector stride permutations; but, the vector lengths are not maximal and change throughout the computation.

In section 2, we derive the radix-2 Cooley-Tukey FFT algorithm and the radix-2 Gentleman-Sande FFT algorithm. Bit-reversal is defined. In section 3, the Pease FFT algorithm is derived, and its vector form due to Korn and Lambiotte is discussed. The auto-sort FFT algorithm is derived in section 4. In the final three sections, the mixed radix generalizations of these algorithms are given.

4.2 Radix-2 Cooley-Tukey FFT Algorithm

We derive a Cooley-Tukey FFT algorithm for transform size N = 2k. The algorithm decomposes the computation of an N-point FT into a se-

Page 87: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.2 Radix-2 Cooley-Tukey FFT Algorithm 73

quence of k operations each requiring two-point FTs followed by an output permutation. Our derivation is based on the factorization

F(N) = P(N)(12 0 F(M))T(N)(F(2) I m), M = N12, (4.5)

where P(N) = P(N, M) and T(N) = Tm(N). We begin with two examples.

Example 4.1 F(4) = P(4)(/2 F(2))T(4)(F(2) 12).

Example 4.2 F(8) = P(8)(/2 F(4))T(8)(F(2) /4).

The operation 1-2 OF(4) can be factored using example 4.1 and the tensor product identity,

/ 0 (BC) = (I B)(I C), (4.6)

with the result

12 ® F(4) — (1.2 0 P(4))(/4 F(2))(/2 T(4))(/2 0 F2 0 /2). (4.7)

Placing (4.7) into example 4.2, we have

F 8 = P ( 8 ) (1"2 P ( 4 ) (I4 ® F ( 2 (4.8)

(12 0 T(4))(1-2 0 F(2) 0 /2)T(8)(F(2)' 0 /4)•

We organize the computation into stages by setting

= /4 0 F(2),

X2 = (12 0 T(4))(I2 0 F(2) 12),

X3 = T(8)(F(2) 0 14), Q = P(8)(I2 0 P(4)),

and writing F(8) = QX1X2X3. Each computational stage is carried out by using a two-point FT, but readdressing is necessary between stages. The loops that implement these stages were discussed in chapter 2.

Direct computation shows that Q satisfies the condition

Q(Xi 0 x2 0 x3) = x3 0 x2 0 xi, (4.9)

where xi, x2, x3 are two-dimensional vectors. Since these tensor products span C8, condition (4.9) uniquely defines Q. We call Q the eight-point bit-reversal for the following reason. Each integer 0 < n < 8 can be uniquely written as

n = ao 2ai 4a2, 0 < ao, ai, a2 < 2 (4.10)

aid we call the ordered triple,

(ao,ai,a2),

Page 88: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

74 4. Variants of FT Algorithms and Implementations

the binary bit representation of n. Consider the permutation of the indexing set,

7r(ao,ai,a2) = (a2,ai, ao), 0 < ao, ai,a2 < 2,

given by reversing the bits. The following table describes 7r.

l'able 4.1 Bit-Reversal ir:

000 000 001 100 010 010 011 110 100 001 101 101 110 011 111 111

Direct computation shows that the permutation matrix corresponding to 7r is Q.

More generally, if N = 2k, the N-point bit-reversal is the permutation Q uniquely defined by the condition,

cl(xi 0 ... 0 xk) = xk 0 ... 0 xi, (4.11)

where xi is a two-dimensional vector. Arguing as above, Q corresponds to the indexing set permutation 7r given by bit-reversal. Explicitly, each integer 0 < n < N can be uniquely written as

n = ao + 2ai + • • + 0 < an < 2,

and we call the ordered k-tuple

(ao,a1,...,ak —1),

the binary bit representation of n. Define the permutation

7r(ao, ,ak_i) = (ak_1,... ,ai, ao).

The corresponding permutation matrix satisfies (4.10) and is N-point bit-reversal.

Denote N-point bit-reversal by Q(N), N = 2k . We will show that

C2(N) = P(N)(12 0 P(NI2)-)- - (IN/4 P(4)) (4.12)

= P(N)(I2 Q(NI2)).

Define the sequence of permutations

Qi = P(N),

Page 89: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.2 Radix-2 Cooley-Tukey FFT Algorithm 75

Q2 = /2 0 P(N/2),

Qk-1 = /N/4 0 P(4) = /2 0 (/N/8 P(4))•

By (4.6),

Qz. • • Qk-i = 120 Q',

where = P(N12)(I2 P(NI4)) • • (IN18 0 P(4)),

which we see is the factorization on the right-hand side of (4.12) corre-sponding to N/2 = 2k-1. By induction, we can assume that Q' -= Q(NI2) is N/2-point bit-reversal. Definition (4.11) implies that

(Q2 " Qk-1)(xi 0 X2 • • • 0 Xk) = Xi 0 (Xk 0 • • • 0 x2)-

We complete the induction step using

P(N) (xi 0 x) = 0 xi,

where x is an N/2-dimensional vector. In the same way, we can show that

C2(N) -= (P(4) 0 iN/4) • • • (P(NI2) 0 12)P(N).

Throughout we will set T(21) = T21-1(21). By (4.5),

F(2k+1) = P(2k+1)(/2 0 P(2k))T(2k+1)(P(2) 0 ik)•

Arguing as in example 4.2, which is an induction on transform size, i.e., on k, we have the following result of Gentleman and Sande [41.

Theorem 4.1

F(2k) = Q(2k) H(/2k-i 0 T(21))(1.2k-t 0 F(2) 0 /2/-1). /=1

We continue the convention that, for matrices Xi, • , Xk,

H = Xi " • Xk • 1=1

Setting Xi = T(21))(i2k_i F(2)0 /21-i), we can write

F(2k) = Q(2k)X1X2 • • • Xk •

Observe that the first stage computation

X = T(2k)(F(2) I2k- 1)

Page 90: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

76 4. Variants of FT Algorithms and Implementations

is a vector operation while the last stage computation

= 12k- 0 F(2)

is a parallel operation. In general, the i-th stage computation, computes 2/-1 copies of the two-point FT on the vectors of size 2k-1 fol-lowed by 21-1 copies of the twiddle factor T(2k-1+1). It can be viewed as a parallel action on vectors of size 2k-/. Vector length varies through the computation from 2k-1 to 1.

Taking the transpose yields the Cooley-Tukey radix-2 FFT algorithm [3].

Theorem 4.2

F(2k) = [11(1-2E-1 0 F(2) 0 I2k-i)((.121-1 Mk-1+1)1(2(2k). 1=1

Setting IT/ = XL_/±1, we can write F(2k) = YiY2 • • • YkQ(2k). The first stage computation

Yk = 12k-i 0 F(2)

is now a parallel operation while the last stage computation

= (F(2) 0 /2k_f.)T(2k)

is a vector operation. The Cooley-Tukey FFT has bit-reversal at input (decimation in time),

while the Gentleman-Sande FFT has bit-reversal at output (decimation in frequency). As written, the Gentleman-Sande FFT performs an FT followed by a twiddle factor at every stage, but regrouping reverses the order. It is standard to combine these steps in code.

4.3 Pease FFT Algorithm

In [6], Pease designed a variation of the Cooley-Tukey FFT which he asserts is 'better adapted to parallel processing in a special purpose machine'. A few examples will show what he had in mind.

Set P(N)= P(N,M) and T(N)= Tm(N) with N = 2k and M = 2k-1.

Example 4.3 Consider the factorization

F(4) = P(4)(/2 F(2))T(4)(F(2) 12).

In section 2, we described the data flow 'orthe corresponding computation. One of the main features of the Pease FFT is that we have constant

data flow in all stages of the computation. To accomplish this, we use the commutation theorem in the form

F(2) 0 12 = P(4)(I2 F(2))P(4).

Page 91: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.3 Pease FFT Algorithm 77

We have

F(4) = P(4)(/2 0 F(2))T(4)P(4)(/2 0 F(2))P(4).

The smaller size FT factors are all the same. This factor, as discussed in chapter 2, is especially suited for parallel processing. The data flow is now explicitly part of the factorization and is constant throughout the compu-tation. As envisioned by Pease, a single hardwired device can implement the action of P(4).

Example 4.4 We will derive a variation of the factorization (4.8) where each stage of the computation has same data flow. Set

P = P(8, 2).

From chapter 2, P2 = P(8, 4) = P-1, P3 = 18.

By the commutation theorem,

F(2) 014 = P(I4 0 F(2))P-1,

/2 0 F(2) g /2 = P2(/4 0 F(2))/3- 2

Placing these identities in the factorization in example 4.2, we have

F(8) = Q(8)(/4 0 F(2))(/2 0 T(4))

P2 (14 0 F(2))P-2T(8)P(14 0 F(2))13-1 . (4.13)

Diagonal matrices remain diagonal matrices upon conjugating by permu-tation matrices. This idea will be used repeatedly to change data flow at the cost of changing twiddle factors. Setting

T2 = P(I2 0 T(4))/3-1,

T3 = P-1T(8)P,

we can rewrite (4.13) as

F(8) = Q(8)(/4 0 F(2))/3-1T2(/4 0 F(2))P-1T3(/4 0 F(2))/3-1

Since P-1 = P(8,4), we have the Pease eight-point FT.

Theorem 4.3

F(8) = C1(8)(140F(2))P(8, 4)

T2(14 0 F(2))P(8,4)T3(/4 0 F(2))P(8,4),

where T3 = T2 ( 8) and T2 = P(8, 2)T(4)P(8, 4).

Page 92: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

78 4. Variants of FT Algorithms and Implementations

We distinguish three stages in the computation:

= Ti(/4 0 F(2))P(8, 4), / = 1, 2, 3, Ti = 181

and we can write F(8) = Q(8)XiX2X3. Each stage begins with the same readdressing P(8, 4) followed by the same FT computation /4 0 F(2). Only the twiddle factors vary from stage to stage.

To derive the general case, we consider the factorization of Theorem 4.1. The goal is to design an algorithm that has the same data flow in each stage of the computation. Set

P -= P(2k,2).

In chapter 2, we proved that

pl p (2k 21), pk = N

As in the preceding section, set

= (/2k_i T(21))(/2k-t 0 F(2) 0 /21-1).

By the commutation theorem,

p1-1 xipk-1+1 =

where Ti is the diagonal matrix

= P1-1(/2k_i T(2/))Pk-1+1.

Observe that Xi = /2k_i OF(2). From the Gentleman-Sande FT algorithm, we have

F(2k) = Q(2k)xip-ipx2p-zpzx3p-3... pk-lxk

Q(2k)xip-1(px2p-1)p-1(p2x3p-2)p-1 (pk-1 xkp)p-1

= Q(2k),C1P-1T2X1P-1T3X1P-1 • • • TkX1/3-1,

proving the generalized Pease FFT.

Theorem 4.4

F(2k) = (.1(.2 ) Ti(i2k-i 0 F(2))P(2k, 2k-1), 1=1

= P1-1(12k-1 T(21))Pk-1+1, 1 <1 < k.

Page 93: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.4 Auto-Sorting FT Algoritlun 79

Each stage of the Pease FFT,

Ti(I2k-i. 0 F(2))P(2k, 2k- 1 ),

begins with the same readdressing P(2k, 2k-1) followed by the same parallel FT computation /2k-i 0 F(2). Only the twiddle factors vary through the stages.

A vector version of the Pease FFT was presented by Korn and Lambiotte [5] for implementation on the STAR 100. Setting

Z = F(2)0 /2k-i,

we have Z = PX1P-1, P = P(2k, 2)

and can rewrite the Pease factorization as

p(2k) C2(2k)p-1ZT2p-1Z Tkp-1Z.

Setting = PTIP-1 = T(21) 0 I2k--1,

we can write

F(2k ) = Q(2k ) 13-1 Z (P-1t2Z)P- 1 • (13- 111 Z),

proving the Korn-Lambiotte FFT.

Theorem 4.5

F(2k) = Q(2k) p(2k,2k-ixT(2/) /2k_0(F(2) /2k_i). i=i

The Korn-Lambiotte FFT has complete vectorization and constant data flow. Only the twiddle factor varies at different stages of the computation.

Factorizations in Theorems 4.4 and 4.5 are decimation-in-frequency since the output is bit-reversed. Taking transpose results in decimation-in-time, since now the input is bit-reversed. This form is due to Singleton [7].

4.4 Auto-Sorting FT Algorithm

The cost of performing N-point bit-reversal on either output or input data can be an important part of the overall cost of an FT computation on many mitchines. In Cochran et al. [2], an FFT algorithm attributed to Stockham is designed which computes the FT in proper order without requiring per-mutation either after or before the computational stages. We call such an

Page 94: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

80 4. Variants of FT Algorithms and Implementations

algorithm an auto-sort algorithm. Temperton examines, in detail, the im-plementation of the Stockham FFT and mixed radix generalizations on the CRAY-1 in a series of papers [10, 11, 12].

The main idea underlying the Stockham auto-sort FFT is to distribute the N-point bit-reversal throughout the different stages of the computation. At the same time, the FT computations are unchanged. However, there is a trade-off. First, while the data flow in each stage of the Pease FFT is the same, in the Stockham FFT the data flow varies from stage to stage. Also, the data permutation required in each of the computational stages of the Pease FFT can be effectively implemented using generally available features of vector machines. In particular, in the radix-2 case, the perfect shuffle

P-1 = P(N, NI2), N = 2k

is matched to the vector instruction

stride by N/2

applied to the N-point data. In the Stockham FFT, the corresponding data permutations can also be implemented using the vector instruction 'stride', but it operates on data of varying sizes analogous to the changing data flow or vector lengths in the Cooley-Tukey FFT.

Example 4.5 Denote four-point bit-reversal by Q(4) and eight-point bit-reversal by Q(8). Then

Q(8)(1-4 0 F(2))Q(8)-1 = F(2) 0 /4,

(Q(4) 0 /2)(1-2 0 F(2) 0 /2)(Q(4)-1 0 /2) = F(2) 0 1.4.

To derive the eight-point auto-sort FFT, first vectorize the factorization of Theorem 4.3 using bit-reversals:

F(8) = (F(2) 0 1.4)(2(8)(/2 0 T(4))

(Q(4) 0 .12)(F(2) 0 14)(Q(4) I2)T(8) (F(2) 0 /4)•

Again we conjugate the twiddle factors by permutations and obtain

F(8) = (F(2) 0 /4)Q(8)

(Q(4)-1 0 /2)T2 (F(2) 0 /4)(Q(4) 0 /2)T(8)(F(2) 14),

where T2 = (Q(4) 0 12)(12 0 T(4))(Q(4)'..-1 0 12). Since

Q(8)(Q(4) 0 /2) = P(8, 2),

Q(4) = P(4, 2),

we have the eight-point Stockham FFT.

Page 95: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.5 Mixed Radix Cooley-Tukey FFT 81

Theorem 4.6

F(8) = (F(2) 0 14)P(8,2)T2(F(2) 0 /4)(P(4, 2) 0 h)Ti (F(2) 0 14),

where 71 = T4(8) and T2 = (P(4,2) 0 12)(12 T2(4))(P (4,2) 0 /2)-

The Stockham FFT is a complete vector FT in which bit-reversal has been distributed through the computation. Data flow is no longer constant between stages.

Denote the 2i-point bit-reversal by Q(2i) and set

= Q(21) , < < k.

By the definition of bit-reversal,

Qk-i-Fi(I2k-, 0 F(2) 012 1 -0 = (F ( 2 ) 0 I2k-i

Set Z = F(2) 0 /2k-i . Since

1 t-Fi = Z

= P(2/, 2) 0 /2k_i,

where = Q k - 1 -1- 1 (1.2k -t 0 T(2 1 ) ) Q- 1

the Gentleman-Sande FFT can be rewritten as

F (21') = (C2 kX1C kl)Q kC2 kli'''Xk

= ZQ 172ZC2k-1(2;_i 2 • • • Tit,Z,

and we have the Stockham auto-sorting FFT.

- Theorem 4.7

F(2k) = TAF(2) 0 1.2k-i)(P(2k-i+1 ,2) 0 121-i), 1=].

= (2k-1+1U-2k-I T(2I))Q k-1 l+1'

4.5 Mixed Radix Cooley-Tukey FFT

Radix-2 algorithms were the first to be designed and dominated on serial machines. On vector machines, where arithmetic processing is very fast, the cost of data transfer becomes a significantly more important ratio of the overall cost. Radix-4 algorithms reduce this data transfer cost without

Page 96: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

82 4. Variants of FT Algorithms and Implementations

appreciably increasing the arithmetic cost. Agarwal and Cooley have de-signed radix-4 FFT algorithms for implementation on the IBM 3090 vector facility. Mixed radix FFTs offer additional tools for utilizing high-speed processing without being hampered by data transfer problems. The theory underlying mixed radix FFT algorithms has been developed in several pa-pers and is completely analogous to the theory developed in the preceding sections. (See [8] for an early account.) In this section, we generalize the radix-2 FFT to the mixed radix case.

We begin with the factorization

F(N) P(N,L)(Im ® F(L))71(N)(F(M) 0 N = ML. (4.14)

Suppose that N = NiN2N3. With M = N2 and L = N3, we have

F(N2N3) = P(N2N3, N3)(iN2 F(N3))TN3 (N2N3)(F(N2) iN3 ).

Taking M = Ni and L = N2N3, we have

F(N) = P(N, N2N3)(iNI 0 P(N2N3))TN2N3(N)(P(Ni) 0 -T.N2N3)•

Using the general tensor product identity

I (BC) = (I B)(I C),

we have the three-factor mixed radix FFT. .

Theorem 4.8 If N = NiN2N3, then

F(N) Q(IN,N2 F(N3))T2(IN, F(N2) IN,)Ti(F(Ni) IN,iv3),

where Q is the permutation matrix

Q = P(N , N2N3)(I P(N2N3, N3)),

and Ti and T2 are the diagonal matrices

= TN2N3(N), T2 = 0 TN,(N2N3).

F(Ni) IN,N, is the vector NI-point FT on vectors of length N2N3, F(N2) IN, is an Ni independent vector N2-point FT on vectors

of length N3, and IN, N2 0 F(N3) is an NiN2 independent vector N3-point FTs. In particular, vector length varies as follows:

N2N3, N3-;` L

The general mixed radix factorization is proved by induction on the num-ber of fa.ctors. Suppose throughout that N = NiN2 • • • NK . Set N(0) =- 1, N'(K) = 1 and

N(k) = • • Nk,

Page 97: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.5 Mixed Radix Cooley-Tukey FFT 83

Ni(k) = NIN(k) = Nk±i. - NK•

Define the N x N permutation matrix Q(N) by

Q(N)(x1 0 • • • 0 xx) = xi( 0 • • • 0 xi , xk E C Nk

Q(N) is the mixed radix analog of bit-reversal as defined in section 4.2 and is called the N-point data transposition corresponding to the ordered factorization N = Ni • • • NK. A direct computation shows that

Q(N) = P(N,N(1))(1N1 Cl(Ar'(1)))-

Set

T(N/ (k — 1)) = TN,(k)N'(k — 1).

By (4.14),

F(N) = P(N,N(1))(IN, F(N'(1))T(N)(F(Ni) 0 IN,0.)))-

Arguing by induction, we have the next result.

Theorem 4.9

F(N)

Q(N)11(IN(K-00T(N'(K — k)))(1- r(x_k) F(NK —k+i) IN'(K-k+1))• k=1

Setting

Xk = (4(K-k) T(N' (K — k)))(IN(R- _k) 0 F(NK—k+i) 0 I iv, (K--k+1.)),

we can write

F(N) = Q(N)XiX2 • • • XK

The first computational stage,

XK = T(N)(F(Ni) IN'(1)),

is a vector operation and the last computational stage,

X1 = IN(K_1) F(NK),

is a parallel operation.

Page 98: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

84 4. Variants of FT Algorithms and Implementations

4.6 Mixed Radix Agarwal-Cooley FFT

A generalization of the radix-2 Pease FFT to mixed radix was designed by R.C. Agarwal and J.W. Cooley [1] for implementation on the IBM 3090 Vector computer. The goal, as stated, is to produce a fully vector-ized mixed radix FFT algorithm requiring all of the loads/stores with only small strides.

Consider N = NiN2N3. By the transpose of the factorization given in Theorem 4.8, we have

F(N) = (F(Ni)01N2N3)

Ti(IN, 0 F(N2) 0 IN3)T2(I NiN2 F(N3))Q 1, (4.15)

where Ti = TN2N3(N) and T2 = TN3(N2N3). Set Pi = P(N, NO. By the commutation theorem, we can rewrite (4.15) as

F(N) = (F(Ni) 0 hv2N3 )TiPi(F(N2) hviN3)

(Pi 1T2P1)(Pi. 1P3 1)(F(N3) iNiN2)P3Q-1-

Applying the commutation theorem once more,

1T2P1 = 1(-fiv, 0 TN3(N2N3))Pi = TN3(N2N3) •

Since P2 = /Y1PV, we have proved the three factor Agarwal-Cooley algorithm.

Theorem 4.10 If N = NiN2N3, then

F(N) -=

(F(Ni) IN2N3)TiPi(F(N2) IN1 N3 )71132 (F (NO INiN2)P3Q,

where 71 = TN,N2(N), = TN3(N2N3) and Pi = P(N, 1 < < 3.

Setting = (F(Ni) 0 IN2N3)TiPi,

X2 = (F(N2) 0 INiN3)T2P25

X3 = (F(N3) IN,N2)P39

we can write F(N) = XiX2X3(el.

The form of each factor Xi, X2, X3 is the same (7-' = I), beginning with a stride permutation P(N,Nk) followed by a twiddle factor 71 and completed by a vector FT F(Nk) 0 /N/Ark •

Page 99: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.7 Mixed Radix Auto-Sorting FFT 85

This factorization offers complete vectorization and data flow given by

small strides P(N, = 1, 2, 3. To prove the general case, N = NiN2 • • • NK, we take the vector form of

Theorem 4.9 and regroup the terms using the identity

P(N, (k))P(N, N (k ± 1)) = P(N, Nk-F1)•

We now have the general Agarwal -Cooley FFT.

Theorem 4.11 For N Ni • NE

F(N) =[11 ((F(Nk) 0 IN IN3731,P(N, Nk))]Q(N)-11 k=1

where Tk is the diagonal matrix

= T(N' (k — 1)) 0 IN(k—i)•

From

((F(Nk) 0 I N INkgrc) P(N, Nk) = P(N, Nk)(IN /iv, 0 F(Nk))71' ,

where Tic' is the diagonal matrix

= P(N, (0)(1 N(k_i) 0 T(N' (k — 1))P(N, N(k)),

we have the parallel version of the preceding theorem.

Theorem 4.12 For N = Ni • • NK

F(N) = F(Nk),Tini Q(N)-1, k=1

= p(N,rr(k))(IN (k_i) T( N /(k -1))P(N,N(k)).

Talcing matrix transpose in the preceding two theorems, we arrive at two additional factorizations which, although similar in form, transfer between decimation-in-time and decimation-in-frequency and change the interstage data readdressing from P(N, Nk) to P(N, NI Nk). In this way we can vary the place where data transposition is taken and the sizes of the strides required in the data readdressing stages.

4.7 Mixed Radix Auto-Sorting FFT

In the auto-sorting FFT, data transposition is distributed throughout the computation, removing the need for data transposition at input or

Page 100: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

86 4. Variants of FT Algorithms and Implementations

output. The trade-off is that interstage data readdressing becomes more complicated. Consider N = NiN2N3 and the factorization of Theorem 4.8:

F(N) = Q(1N,N2 F(N3))

T2(IN, F(N2) IN3)71.(F(Ni) 0 1N2N3)• (4.16)

Q is data transposition corresponding to the ordered factorization

N = NiN2N3.

It follows that

Q(1-NIN2 P(N3)) — (F(N3) iNiN2)Q•

Let Q2 denote data transposition corresponding to the ordered factor-ization

N/N3 = NiN2,

and set R2 = Q2 0 IN3•

Observe that Q2 = P(N/N3, N2).

Since R2(1-NI 0 F(N2) /N3)R2 1 — F(N2) IN/N2

we can rewrite (4.16) as

F(N) = (F(N3) 0 INiN2)QR21

(VF(N2) 0 IN1 N3 )R2 )(Ti. (F(N1) IN2N3 ))7

where is the diagonal matrix

= R2T21=q1.

Direct computation shows that

Q1c1 = P(N, N3),

proving the three factor mixed radix auto-sorting FFT algorithm.

Theorem 4.13 If N = NiN2N3, then

F(N) = (F(N3) INiN2)P(N, N3)TRE(N2) 0 INiNa)

(P(Ni N2, N2) 0 IN3)71(F(N1) 0 IN2N3))

where T1 = = TN2N3(N) and

= (Q2 0 IN3)(IN, 0 TN3(N2N3))(Q21 IN3)-

Page 101: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

4.8 Summary 87

The general mixed radix auto-sort FFT is derived using the same arguments. Define

(2k(xNi xNk xNk xNi

Setting RK = Q(N) and

Rk = Qk 0 I isp(k), 1 < k < K ,

we have

F(Nk) ININk = Rk(IN( k- 1) F(Nk) INI(k))Rk 1

and Rk_iRk-1 = P(N(k), N (k - 1)) 0 IN, (k)•

Arguing as above, from Theorem 4.9, by regrouping interstage permuta-tions, we have the general auto-sorting FFT.

Theorem 4.14

F(N) = 11 (F(Nk) 0 I NiNkTr (P(N(k), N(k - 1)) IN, (0), k=1

where Tr is the diagonal matrix

= Rk(I N(k_i) 0 T(N' (k - 1)))1c1

Three additional auto-sorting FFTs can be derived using transpose and the commutation theorem.

4.8 Summary

I. N = 2k T(21+1) = T2/ (21+1).

Q(2k)(ai 0 • • 0 ak) = ak 0 ak-i 0 • • al , 6., ak E C2.

Gentleman-Sande

F(2k) = Q(2k)11(12k-1 0 T(21))(12k-1 0 F(2) 0 /2/-1). i=

Cooley-Tukey

F(2k) =[11(I2i-i F(2)0 I2k-t)((.£21-1 T(2"+1))1C2(2k)

1=1

Page 102: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

88 4. Variants of FT Algorithms and Implementations

Pease

F(2k) = Q(2k) HTiv2k_I F(2))p(2k,2k-i),

where 1 < 1 < k, is the diagonal matrix

= P(2k ,21-k)(12k-1 T(21))P(2k , 21-1)-1

Korn-Lambiotte

F (2k = Q (2k) H P(2k , 2k-1)(T(2/) 0 /2k-1 ) (F(2) 0 /2k-i). t=1

Auto-Sort

F(2k) = H 7)(F(2) /2k-i)(P(2"+1, 2) 0 /21-1), 1=1

where

= (Q(2k-11-1) I2,1)(.12k_i T(21))(Q(2k-1±1) 01'21_0-1

II. Mixed Radix

N =- • • • NK •

N(k) = Ni • • • Nk•

N' (k) = N (k).

T (NI (k — 1)) = p ( k)N1 (k — 1).

Qk(xi 0 • • 0 xk) xk 0 • • 0 xi.

Rk = Qk IN,(k)•

Gentleman-Sande

F(N) =

Q(N) 11(iN(K_ T(Ni (IC - k)))(IN(K-k) F(NK-k-F1) IlsP(K—k+1))• k=1

Agarwal-Cooley

F(N) = [11 ((F (Nk) 0 I N INk)T4F(N , Nk))]Q(N)-1,

k=1

Page 103: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

References 89

where is the diagonal matrix

= T(Isr(k - 1)) 0 IN(k - 1).

Auto-Sort

F(N) = H(Fork, .4„,,,Tnp(N(k),N(k _ 1)) I N'(k)), k=1

where Tic" is the diagonal matrix

= Rk (1-N(k_ i) T(Nt(k - 1))).1 1.

References

[1] Agarwal, R. C. and Cooley, J. W. "Vectorized Mixed Rada Discrete Fourier Transform Algorithms", IBM Report., March 1986.

[2] Cochran, W. T., et al. "What is the Fast Fourier Transform?", IEEE Mans. Audio Electroacoust. 15, 1967, 45-55.

[3] Cooley, J. W. and Tukey, J. W. "An Algorithm for the Machine Cakulation of Complex Fourier Series", Math. Comp. 19, 1965, 297- 301.

[4] Gentleman, W. M. and Sande, G. "Fast Fourier Transform for Fun and Profit", Proc. AFIPS, Joint Computer Conference 29, 1966, 563- 578.

[5] Korn, D. G. and Lambiotte, J. J. "Computing the Fast Fourier Transform on a Vector Computer", Math. Comp. 33, 1979, 977-992.

[6] Pease, M. C. "An Adaptation of the Fast Fourier Transform for Parallel Processing", J. ACM 15, 1968, 252-265.

[7] Singleton, R. C. "On Computing the Fast Fourier Transform", J. ACM 10, 1967, 647-654.

[8] Singleton, R. C. "An Algorithm for Computing the Mixed Radix Fast Fourier Transform", IEEE Trans. Audio Electroacoust. 17, 1969, 93-103.

[9] Temperton, C. "Self-Sorting Mixed Radix Fast Fourier Transforms", J. Comput Phys. 52, 1983, 1-23.

Page 104: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

90 4. Variants of FT Algorithms and Implementations

[10] Temperton, C. "Fast Fourier Transforms and Poisson-Solvers on Cray-1", Supercomputers, Infotech State of the Art Report, Jesshope C. R. and Hockney R. W. eds., Infotech International Ltd., 1979, 359-379.

[11] Temperton, C. "Implementation of a Self-Sorting In-Place Prime Factor FFT algorithm", J. Comput. Phys. 58(3), 1985, 283-299.

[12] Temperton, C. "A Note on a Prime Factor FFT", J. Comput. Phys. 52(1), 1983, 198-204.

[13] Heideman, M. T. and Burrus, C. S. "Multiply/add Trade-off in Length-2n FFT Algorithms", ICASSP'85, 780-783.

[14] Duhamel, P. "Implementation of Split-Radix FFT Algorithms for Complex, Real, and Real-Symmetric Data", IEEE Trans. on Acoust., Speech and Signal Proc. 34(2), April 1986.

[15] Vetterli, M. and Duhamel, P. "Split-Radix Algorithms for Length pr" DFT's", ICASSP'88, 1415-1418.

Problems

1. Write a code implementing each stage Xi, 1 < / < k, of the Gentleman-Sande algorithm.

2. Write a code implementing bit-reversal.

3. For N= 8, 16, 32 and 64, describe the twiddle factor Ti in the Pease algorithm.

4. Derive the general form of the twiddle factor in the Pease factoriza-tion.

5. From the Pease algorithm, design an algorithm reversing the order of permutations and the twiddle factors.

6. From the Pease algorithm, design an algorithm having bit-reversal at output.

7. Determine the general form of the twiddle factors in the Stockham factorization.

8. Describe the twiddle factors in thelaixed radix Agarwal-Cooley FFT algorithm.

9. Describe the twiddle factors in the mixed radix auto-sort FFT algorithm.

Page 105: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

5 Good-Thomas PFA

5.1 Introduction

The additive FFT algorithms of the preceding two chapters make no ex-plicit use of the multiplicative structure of the indexing set. We will see how the multiplicative structure can be applied, in the case of transform size N = RS, where R and S are relatively prime, to design an FT al-gorithm that is similar in structure to these additive algorithms but no longer requires the twiddle factor multiplication. The idea is due to Good [2] in 1958 and Thomas [8] in 1963, and the resulting algorithm is called the Good-Thomas Prime Factor algorithm (PFA).

If the transform size N = NiN2, then one form of an additive algorithm can be expressed by the factorization

F(N) = (F(Ni) 0 IN,)T(IN, 0 F(N2))P, (5.1)

where P is a permutation matrix and T is a diagonal matrix or twiddle factor. Corresponding to a decomposition of the transform size N of the form N = RS, where R and S are relatively prime, one form of the Good-Thomas PFA is given by the factorization

F(N) = (F(R) 0 is)(iR 0 F(S))Q2, (5.2)

where Qi and Q2 are permutation matrices. We can rewrite (1.2) as

F(N) = Qi(F(R) F(S))Q2. (5.3)

Page 106: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

92 5. Good-Thomas PFA

An obvious advantage of (1.3) is that the multiplications required in the twiddle factor stage of (1.1) are no longer necessary. Burrus and Es-chenbacher [1] and Temperton [4] point out that a variant of (1.3) can be implemented in such a way that it is simultaneously self-sorting and in-place. In the preceding chapter, these properties served to distinguish the data flow of the different additive FFT algorithms, but in no case were both present. We will discuss some of Temperton's ideas below.

5.2 Indexing by the CRT

The main tool in the indexing of input and output data for the Good-Thomas PFA is given by the CRT. Suppose that

N = RS, (R, S) = 1. (5.4)

The CRT asserts the existence of a ring isomorphism

ch : ZIR x ZIS ZIN,

where Z/R x Z/S denotes a ring-direct product with componentwise addi-tion and multiplication. We will take (b to be the specific ring-isomorphism given by the complete system of idempotents relative to the decomposition (5.4) as described in chapter 1. Explicitly, elements el and ez in ZIN can be found such that

ei 1 mod R, ei 0 mod S, (5.5)

ez =- 0 mod R, e2 1 mod S. (5.6)

Then e? ei mod N, e2 mod N, (5.7)

eiez 0 mod N, (5.8)

ei + ez 1 mod N. (5.9)

Using these properties, we can prove that the mapping

0(ai,a2) ale]. a2e2 mod N, 0 < al < R, 0 < a2 < S (5.10)

is a ring-isomorphism of Z/Rx Z1S onto Z/N. We will use (5.10) to define a permutation 7r of Z/N. First, by (5.10),

each a E ZIN can be written uniquely as, -

a E aiei + azez mod N, 0 < ai < R, 0 < az < S.

Since we aLso have that a E ZIN can be written uniquely as

a = az + aiS, 0 < ai < R, 0 < az < S,

Page 107: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

5,3 An Example, N = 15 93

a permutation ir of Z/N can be defined by the formula

7r(a2 + aiS) ale]. + a2e2 mod N, 0 < < R, 0 < a2 < S. (5.11)

Order the indexing set ZIN by 7r:

0, e2, (S - 1)e2,

+ e2, • • • , + (S-1)e2,

(R - 1)ei, (R - 1)ei + e2, -1)ei + (S-1)e2,

and denote the corresponding permutation matrix by Q. Then the matrix

F„ = QF(N)Q-1, (5.12)

is given by Fir = [e(a)7r(b)] 0<a b<IV'

v = e21-z/N (5.13) . _ , We will now explicitly describe Fr,. First an example will be considered.

5.3 An Example, N = 15

Talce R = 3 and S = 5. The idempotents are

ei = 10, e2 = 6.

The permutation 7r of (5.11) is given as

7r -= {0, 6, 12, 3, 9; 10, 1, 7, 13, 4; 5, 11, 2, 8, 141.

,We distinguish three blocks by the following notations,

A (0, 6, 12, 3, 9),

B = (10, 1, 7, 13, 4),

C = (5, 11, 2, 8, 14).

Each block begins with a different multiple of el. Corresponding to the nine Cartesian products of these blocks, we have nine submatrices of F„. Consider first the submatrix corresponding to A x A:

1 1 1 1 1 1 w6 w12 w3 w9

w12 w9 w6 3 w e(2/ri/15) (5.14) w , w3 w6 w9 w12

w9 w3 w12 w6

Page 108: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

94 5. Good-Thoma.s PFA

Setting u = e(2"/5), we can rewrite (5.14) a.s

[1 1 1 1 1

1 U2 U4 U 1L3 1 U4 u3 u2 u , u _ e(27ri/5), (5.15)

1 U U2 U3 U4

1 U3 U U4 U2

We denote (5.15) by F5 and, following Temperton [51, call it a rotated five-point FT. We can relate F5 to the five-point FT matrix,

[11 1 1 1 1 u u2 u3 u4

F(5) = 1 u2 u4 u u3 , (5.16) 1 u3 u u4 u2 1 u4 u3 u2 u

in two ways. First, if in F(5) we replace u by we2 = u2, F(5) becomes F5.

An algorithm computing the action of F(5) can be modified to compute the action of F5 by determining the consequences of this replacement through the different stages of the algorithm. Second,

F5 -= PF(5), (5.17)

where '10000 00100

P= 00001. 01000 _00010

Since both F5 and F(5) are symmetric matrices and 12' = P-1, taking transpose on both sides of (5.17) gives

F5 = F(5)P-1.

Consider the submatrix corresponding to the Cartesian product B x B. Direct computation from (5.13) shows that this submatrix is

io io io io io wio w W 7 W13 W 4 [ W

W 10 w w4 W

1 0 W173 W :VW4 W 13 • W Ww wW W W7

w10 4 13

W7 -4. W

-

(5.18)

Factoring out wl°, we can rewrite (5.18) as

v2F5,

Page 109: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

5.3 An Example, N = 15 95

where v = e(21"/3) = w5. Continuing in this way, we have

F5 F5 F5

= [F5 v2 F5 v F51, F5 V F5 V 2 F5

which we can rewrite as F, = F3 0 F5 , (5.19)

where 1 1 1

F3 = 1 V 2 V .

1 V V 2

Since

F3 = 131 F(3), (5.20)

where 1 0 0

P' = [0 0 1], 0 1 0

F3 can be formed from F(3) by replacing v in F(3) by = v2. Putting (5.19) into (5.12), we have

F(15) = Q-1(F3 F5)Q, (5.21)

where the permutation matrix Q is given by (5.11). Matrix Q, although not strictly a stride permutation, has a circular structure. We begin by stride-6 mod 15 from the index 0,

0, 6, 12, 3, 9,

but in the second stage, instead of beginning at the index 1, we begin at the index 10 and stride-6 mod 15,

10, 1, 7, 13, 4.

In the last sweep, we begin at 5 and stride-6 mod 15,

5, 11, 2, 8, 14.

In [5] and [6], Temperton discusses the direct implementation of rotated FTs.

We can use (5.17) and (5.20) to rewrite (5.21) as

F(15) = Q-1 (P' F(3) 0 PF(5))Q,

F(15) = Q-1(P' P)(F(3) F(5))Q.

Page 110: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

96 5. Good-Thomas PFA

Setting (2/ = Q-1(pl ID),

we have that F(15) = Qi(F(3) F(5))Q. (5.22)

From (5.22), additional factorizations of F(15) can be provided. In all cases, after an initial input permutation, we compute the action of F(3) 0 F(5), then permute the output to obtain the natural order. Generally, input and output permutations are not the same and are more complicated than those discussed above for direct implementation of (5.21). We notice, however, that no twiddle factor appears in (5.22).

5.4 Good-Thomas PFA for the General Case

Returning to the permutation 7r of section 2, we set

Ao = {0, c2, • • • , (S 1)€21, (5.23)

- {el , el e2, • • • , el + (S — 1)e2}, (5.24)

AR_i = 'RR — 1)ei, (R - 1)ei + e2, • • • , (R — 1)ei + (S - 1)e2}. (5.25)

We now can write 7r = (A0, Ai, ..., AR-1)•

Consider the submatrix of Fr corresponding to the Cartesian product

Aj X Ak.

A typical component in this submatrix is given by forming first the product mod N,

+/e2)(kei + me2), 0 < /, m < S,

which by (5.7)-(5.9) can be written as

jkei ±lme2, 0 <1, m < S.

The (/, in) coefficient of this submatrix is

(wei)ikove2yrn, 0 /, TT/ < S.

Set

Fs = [(we2)1m]o<2,.<s, w e2r/IN.

The submatrix corresponding to A, x Ak iS

[welijk Fs.

Page 111: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

5.4 Good-Thomas PFA for the General Case 97

Continuing in this way,

= FR 0 Fs, (5.26)

where FR = RW61)310<3,k<R•

The matrices FR and Fs are called rotated FTs by Temperton. By (5.5) and (5.6), w" is a primitive R-th root of unity and we2 is a

primitive S-th root of unity. To see this, set

v = e(27ri/ 11) , u = e (271-i/S)

Since ei = S, for some fi with (fi, R) = 1, and

wel = vh,

w" is a primitive R-th root of unity. The corresponding result for we2 is proved in the same way. It follows that we can write

FR = PiF(R), (5.27)

where Pi is an R x R permutation matrix, and

Fs = P2F(S), (5.28)

where P2 is an S x S permutation matrix. Placing (5.27) and (5.28) in (5.26), we have

F, = (Pi 0 P2)(F(R) F(S)),

and, by (5.12),

F(N) = Cri (Pi P2)(F(R) o F(S))Q- (5.29)

It follows that, to compute the action of F(N), we begin with the input permutation Q, compute the action of the tensor product F(R)0F(S) and complete the computation by arranging the output in its natural order by the permutation Cr l(Pi 0 P2). Other factorizations can be obtained by taking the transpose on both sides of (5.27),

FR = F(R)Pi-1, (5.30)

and placing (5.30) rather than (5.27) in (5.26). If modules implementing the tensor product FR Fs exist, then the

data flow of the computation of F(N) is given by the permutation Q. The permutation Q can be viewed as follows. We begin at the index point 0 and stride by e2. This process continues, where at each new stage we begin at the index point given by a multiple of ei.

,Since F(R) F(S) can be viewed as S actions of F(R) followed by R actions of F(S), the arithmetic of (5.29) is given by the formulas

a(N) = Sa(R) + Ra(S),

Page 112: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

98 5. Good-Thomas PFA

m(N) = Sm(R)+ Rm(S),

where algorithms computing R-point FT are taken which require a(R) additions and m(R) multiplications. The multiplications required by the twiddle factor in the additive algorithms are no longer necessary.

Formula (5.29) can be generalized to several factors. Suppose that N = nin2 = mon2n2, where ni = mim2, (mi, m2) = 1. We can write

F(ni) = le(F(mi) F(m2))R, (5.31)

where R and R' are permutation matrices. Placing (5.31) into (5.29), we have

F(n) = Rm(F(mi) 0 F(m2) F(n2))R", (5.32)

where R" and Ri" are permutation matrices. In subsequent chapters, we will combine the Good-Thomas PFA with

multiplicative FT algorithms to produce several FT algorithms having distinct data flow and arithmetic. (See [3].)

5.5 Self-Sorting PFA

Burrus and Eschenbacher [1] point out that the Good-Thomas PFA can be computed in-place and in-order. Temperton [4, 5, 6] discusses the im-plementation of PFAs on different computer architectures, especially on CRAY. He shows that the indexing required for the PFA was actually simpler than that for the conventional Cooley-Tukey FFT algorithm.

Temperton implemented the RS-point FT using FR and Fs directly. The indexing for input and output data in this case is the same.

Consider the following example. For N = 42 = 6 7, the corresponding system of idempotents is {el, e21 = 17,361. The mapping given in (5.23)— (5.25) can be described by the two-dimensional array,

- 0 7 14 21 28 35 36 1 8 15 22 29 30 37 2 9 16 23 24 31 38 3 10 17 18 25 32 39 4 11 12 19 26 33 40 5

_ 6 13 20 27 34 41_

We can implement this by the simple code,

INTEGER /(R) DATA I/0,7,14,21,28,35/

Updating the indexing for each subsequent transform is achieved by simple auto-increment addressing mode,

Page 113: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

References 99

DO 100 L=1, S J=I(R) +1 DO 200 K=R, 2, —1

I(K)=I(K —1) +1 200 CONTINUE

I(1)=J 100 CONTINUE

The code requires no IF statements or address computation mod N. Temperton [5] describes in detail the minimum-add rotated discrete Fourier transform modules for sizes 2, 3, 4, 5, 7, 8, 9 and 16.

References

[1] Burrus, C. S. and Eschenbacher, P. W. "An In-place In-order Prime Factor FFT Algorithm", IEEE Trans. Acoust., Speech and Signal Proc., 29, 1981, pp. 806-817.

[2] Good, I. J. "The Interaction Algorithm and Practical Fourier Analysis", J. Royal Statist. Soc., Ser. B20, 1958, pp. 361-375.

[3] Kolba, D. P. and Parks, T. W. "A Prime Factor FFT Algorithm Using high-speed Convolution", IEEE Trans. Acoust., Speech and Signal Proc., 25, 1977.

[4] Temperton, C. "A Note on Prime Factor FFT Algorithms", J. Comput. Phys., 52, 1983, pp. 198-204.

[5] Temperton, C. "A New Set of Minimum-add Small-n Rotated DFT Modules", to appear in J. Comput. Phys.

[6] Temperton, C. "Implementation of A Prime Factor FFT Algorithm on CRAY-1", to appear in Parallel Computing.

[7] Temperton, C. "A Self-Sorting In-place Prime Factor Real/half -complex FFT Algorithm", to appear in J. Comput. Phys.

[8] Thomas, L. H. "Using a Computer to Solve Problems in Physics", in Applications of Digital Computers, Ginn and Co., 1963.

[9] Chu, S. and Burrus, C. S. "A Prime Factor FFT Algorithm Using Dis-tributed Arithmetic", IEEE Trans. Acoust., Speech and Signal Proc., 30(2), April 1982, pp. 217-227.

Page 114: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

100 5. Good-Thomas PFA

Problems

1. Find the system of idempotents of N = 2 • 3, and define the permutation matrix Q as in section 2.

2. Find the system of idempotents of N = 4 • 5, and define the permutation matrix Q as in section 2.

3. Find F2 and F3 for a six-point Good-Thomas PFA based on the idempotents of problem 1.

4. Find F4 and F5 for a 20-point Good-Thomas PFA based on the idempotents of problem 2.

5. Give arithmetic counts for problems 3 and 4 by direct computation of F2 , F3 , F4 and F5 .

6. Give arithmetic counts for 6-point and 20-point Cooley-Tukey FFT algorithms, where F(2), F(3), F(4) and F(5) are carried out by direct computation. Compare with those of problem 5.

7. Derive a Good-Thomas PFA for N = 75, and give F3 and F25 •

8. Derive a Good-Thomas PFA for N = 100, and give F4 and F25 . Derive the Cooley-Tukey FFT algorithm with a factorization of 100 = 10-10. Compare the arithmetic counts of these two algorithms.

9. Give the self-sorting indexing table for N = 40 = 5 . 8 as in section 5.

Page 115: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6 Linear and Cyclic Convolutions

Linear convolution is one of the most frequent computations carried out in digital signal processing (DSP). The standard method for computing a linear convolution is to zero-tap, turning the linear convolution into a cyclic convolution, and to use the convolution theorem, which replaces the cyclic convolution by an FT of the corresponding size. In the last ten years, theo-retically better convolution algorithms have been developed. The Winograd Small Convolution algorithm [1] is the most efficient as measured by the number of multiplications.

First, we derive the convolution theorem by two different methods. The second method is based on the CRT for polynomials. A special case of the

,-CRT then is applied in a more general setting to derive the Cook-Toom [2] algorithm. The generalized (polynomial) CRT then is used to derive the Winograd Small Convolution algorithm. We emphasize the interplay between linear and cyclic convolution computations.

6.1 Definitions

Consider vectors h and g of sizes M and N. The linear convolution of h and g is the vector s of size L = M N — 1 defined by

Sk E hk_ngn, 0 < k < L, n=0

where we take 11, = 0 if m > M and gri = 0 if n > N.

Page 116: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

102 6. Linear and Cyclic Convolutions

Example 6.1 The linear convolution s of a vector h of size 2 and a vector g of size 3 is given by

SO = h0g05

Si = higo + hogi,

s2 = higi + hog2,

s3 = 1/02.

Associate the polynomial h(x) of degree M — 1 to the vector h of size M,

h(s) -= ho + hix + • • +

Direct computation shows that formula (6.1) is equivalent to the polynomial product

s(x) = h(s)g(x).

The representation of linear convolution by polynomial product permits the application of results in polynomial rings, especially the CRT.

Example 6.2 Consider the linear convolution s of a vector h of size 3 and a vector g of size 4. By definition,

so = hogo,

si = higo + hogi,

s2 = h2go + + hog2,

S3 = h2gi h1g2 h0g35

S4 = h2g2 h1933

S5 = h2g3.

The linear convolution can be described by matrix multiplication:

s

-Ito hi h2 0 0

_ 0

0 ho hi h2 0 0

0 0

ho hi h2 0

0 - 0 0

ho hi h2 _

g•

In general, if s is the linear convolution of a vector h of size M and a vector g of size N, we can write

s H g ,

Page 117: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.1 Definitions 103

where H is the L x N, L = M + N - 1, matrix

ho hi

0 ho

0 -

0

• •

H = hm-i

0 •

hm_i 0

• -

-

ho hi

_ 0 0

Consider two vectors a and b of size N. The cyclic convolution c of a and b, denoted by a* b, is the vector of size N defined by the formula

N-1

Ck = E ak_nbn, 0 < n < N. n=0

The indices of the vectors are taken mod N.

Example 6.3 The cyclic convolution c of two vectors a and b of size 3 is given by

co = aobo + a2bi + aib27

ci = aibo + aobi + a2b2,

c2 =- a2b0 + aibi + aob2.

Observe that a_i = a2 and a_2 = ai.

In chapter 1, we discussed the quotient polynomial ring

C[x]/(xN - 1) (6.1)

consisting of the set of all polynomials of degree less than N, where addition A and multiplication are taken as polynomial addition and multiplication mod

(xN - 1).

Example 6.4 Consider two polynomials

a(x) = ao + aix + a2x2,

b(x) = bo +bix +b2x2.

The product is

a(x)b(x) = aobo + (aibo + aobi)x

+ (a2b0 + aibi + a0b2)/2 + (a2bi + aib2)x3 + a2b2x4.

This is the linear convolution. The product

c(x) a(x)b(x) mod (x3 - 1)

Page 118: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

104 6. Linear and Cyclic Convolutions

is formed by setting x3 = 1 in the expansion of the product a(x)b(x). We find that the coefficients of c(x) are given by

co = aobo + a2bi aib2,

= aibo aobi a2b2,

C2 = a2b0 + aibi + a0b2.

Thus, multiplication in the ring

c[x]/(x3 - 1)

computes the 3 x 3 cyclic convolution. In general, multiplication in the ring (6.1) computes the N x N cyclic

convolution. To see this, consider polynomials a(x) and b(x) of degree less than N , and compute the product a(x)b(x),

2N-2 ( n

a(x)b(x) = E E an_kbk xn , (6.2) n=0 k=0

where a„ = by, = 0 whenever n > N . Setting XN = 1 in (6.2), we have

N-1

C(X) = E cans- a(x)b(x) mod (xN — 1), (6.3) n=0

where

n+N

Cn = E an—kbk + E an+N_kbk (6.4) k=0 k=0

N-1

= Ean_kbk + E an-FN — kbk k=0 k=n+1

N-1

= E an_kbk• k=0

In (6.4), the indices are taken mod N . By definition, we see that (6.3) com-putes the N x N cyclic convolution. An important outcome of the discussion is that N x N cyclic convolution can be computed by first computing linear convolution as a polynomial product and then setting XN = 1.,

As with linear convolution, cyclic convolution also can be eximessed by matrix multiplication.

Example 6.5 Returning to example 6.3, we can write

c = Cb,

Page 119: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.1 Definitions 105

where ao a2 ai

C = [al ao ad • a2 al ao

The matrix C is an example of a circulant matrix, which we will define below. If S denotes the cyclic shift matrix

0 0 1 S = [1 0 01 ,

0 1 0

then 0 1 0

S2 = [0 0 11 , S3 = 13.

1 0 0

We can write the matrix C in the form

C = ao/3 + (LIS + a2S2.

The N x N cyclic shift matrix S is defined by the rule,

XN-1

X0 SX =

-XN-2

Observe that SN = IN. By an N x N circulant matrix we mean any matrix of the form

N-1

C = E anSn. (6.5) = 0

At times, we will denote the dependence of C on a by writing C(a):

ao aN _1 al al d a2

C(a) = al • • aN_i aN_ 2 • • ao

Example 6.6 The 4 x 4 cyclic shift matrix is

0 0 0 1

S = 01 01 00 00

0 0 1 0

Page 120: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

106 6. Linear and Cyclic Convolutions

Notice that

[0 00 01 Oil [0 01 01 00

s2 = ° s3 = ° 1 0 0 0 ' 0 0 0 11 ' 0 1 0 0 1 0 0 0

and S4 = Li. The 4 x 4 circulant matrix,

C = a0/4 + aiS + a2S2 + a3S3,

is ro a3 a2 ail

C = al ao a3 a2 .

a2 al ao a3 a3 a2 al ao

As we read from left to right, the columns of C are cyclically shifted:

C = [a Sa S2a S3a1.

Exarnple 6.7 Denote by en, 0 < n < N, the vector of size N consisting of all zeros except for a 1 at the n-th place. Observe that

eo *b = b,

* b = Sb,

•,

eN_i * b =- SN—lb.

Consider the N x N cyclic convolution c = a* b. Writing

N-1

a = E anen,

n=0

we have N-1

a * b = E age, * b), n=0

which by example 6.7 can be rewritten as

N-1

a* b = E anS"b. n=0

By (6.5), a* b = C(a)b.

Page 121: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.2 Convolution Theorem 107

The N xN cyclic convolution c = a*b can be computed by multiplication in the quotient polynomial ring C[x]/(xN — 1),

c(x) a(x)b(x) mod (xN — 1),

or by circulant matrix multiplication,

c = C(a)b. (6.6)

Direct computation from (6.6) shows that

C(a* b) = C(a)C(b).

More generally, we can prove that the set of all N x N circulant matrices is a ring under matrix addition and multiplication and is isomorphic to the quotient polynomial ring C [x]/(xN — 1).

6.2 Convolution Theorem

The N x N cyclic convolution can be computed using N-point FTs. This is especially convenient when efficient algorithms for N-point FTs are avail-able. The result that permits this interchange is the Convolution Theorem. We will give two proofs. The first depends on the representation of cyclic convolution as a matrix product by a circulant matrix. We will soon see that the FT matrix diagonalizes every circulant matrix. The second proof uses the representation of cyclic convolution as multiplication in the quotient polynomial ring C [x]/(xN — 1).

Example 6.8 Set F = F(3). Denote by S the 3 x 3 cyclic shift matrix and by D the matrix

1 0 0 D = [0 v 01, v = e271-2/ 3

0 0 v2

Then 1 1 1

FS =[v v2 11= DF, v2 v 1

which implies that FSF-1 = D

and F diagonalizes S. In addition,

FS2F-1 = (FSF-1)2 = D2,

and F diagonalizes S2.

Page 122: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

108 6. Linear and Cyclic Convolutions

An arbitrary 3 x 3 circulant matrix is of the form

C(a) = aoh +aiS+ a2S2.

Since F diagonalizes each term of this sum, it diagonalizes C(a),

FC(a)F-1 = ao/3+ aiD +a2D2.

Writing this out, we have

ao az ai go 0 0 F [al ao ad F-1= [ 0 gi 0

az al ao 0 0 gz

where

go = ao + ai + az,

= ao + vai + v2a2,

g2 = ao v2ai vaz.

We see that FC(a)F-1 = diag(g),

where g = F(3)a.

We can extend this argument to prove that

F(N)SF(N)-1 = D,

where S is the N x N cyclic shift matrix and D is the diagonal matrix,

1

D — 1, v = epirifiv).

V1V-1

It follows that F(N)SkF(N)-1 = Dk.

Thus, F(N)C(a)F(N)-1 = diag(g), (6.7)

where g = F(N)-a.

In words, the N-point FT matrbc F(N) diagonalizes every N x N circulant matrix C(a).

Set

G(a) = diag(Fa),

Page 123: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.2 Convolution Theorem 109

where F = F(N). We can rewrite (6.7) as

C(a) = F-1G(a)F.

Since a* b = C(a)b,

we have the following theorem.

Theorem 6.1 For vectors a and b of size N,

a* b = F(N)-1G(a)F(N)b,

where G(a) = diag(F(N)a).

Theorem 6.1 determines an algorithm computing the cyclic convolution a* b:

1. Compute F(N)b.

2. Compute F(N)a.

3. Compute the componentwise product (F(N)a)(F(N)b).

4. Compute F(N)-1(F(N)aF(N)b).

The componentwise product can be described by

(F(N)a)(F(N)b)= G(a)F(N)(b),

where G(a) = diag(F(N)a). The nonsymmetric role of a and b in this computation should be em-

phasized. In standard applications to digital filters, we fix the vector a (the elements of a linear system) and then compute the cyclic convolution a* b

'for many input vectors b. As a consequence, the diagonal matrix G(a) can be precomputed and does not enter into the arithmetic cost of the process.

The second proof uses the CRT to `diagonalize' multiplication in the ring C[x]/(xN - 1). Consider the factorization

N-1 xN = H (x _ v.), v = e(2/1-i/N) (6.8)

n=0

Applying the CRT to (6.8), we have that the mapping

a(1)

a(x) a(,v)

a(vN-1)

Page 124: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

110 6. Linear and Cyclic Convolutions

establishes a ring-isomorphism,

N-1

C [X1/(XN - 1) c,

where iiTT.N=-01 C denotes the ring-direct product of N copies of C with com-ponentwise addition and multiplication. In particular, a polynomial a(x) of degree < N is uniquely determined by the values a(vn), 0 < n < N.

Consider two polynomials a(x) and b(x) of degree < N, and set c(x) a(x)b(x) mod (xN —1). The polynomial c(x) is uniquely determined by the values c(vn), 0 < n < N. Since (vn)N = 1,

c(vn) = a(vn)b(vn), 0 < n < N. (6.9)

We see that multiplication mod (xN —1) can be computed by N complex multiplications (6.9) along with some mechanism that translates between C [x] /(xN — 1) and ILN-01 C. In fact, this mechanism is the N-point FT, and (6.9) is a disguised form of the convolution theorem. To see this, observe that

N-1 a(1) = E an,

n=-0

N-1

a(v) = E n„ V un,

n=0

N-1

a(VN-1) = E Vn(N-1)an,

n=0

which implies that a(1) a(v)

= F(N)a, (6.10)

a(vN-1)

where a is the vector of length N associated to the polynomial a(x). Placing (6.10) into (6.9) we have

F(N)c = (F(N)-a)(F,(N)b), (6.11)

where the right-hand side is a componentwise product.

Page 125: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.3 Cook-Toom Algorithm 111

6.3 Cook-Toom Algorithm

The derivation of the convolution theorem using the CRT admits important generalizations that can be used to design algorithms for computing linear and cyclic convolutions. The simplest is the Cook-Toom algorithm, which we discuss in this section.

Take N distinct complex numbers

{ (107 all • • • aN-1},

and form a polynomial

m(x) = (x - ao)(x - al) • (x - aN-1)•

We will begin by designing algorithms to compute polynomial multipli-cation mod m(x) or, equivalently, multiplication in the quotient polynomial ring

C [x] /m(x). (6.12)

Applying the CRT as in the preceding section, the mapping

p(a0)

a(ai) a(x) ->

a(aN-1)

establishes a ring-isomorphism,

N-1

C [X7M(X) c, n=0

with the result that a polynomial a(x) of degree < N is uniquely determined by the values a(ar,), 0 < n < N. To see how to recover a(x) from these values, write

a(ao) = ao + aoai + • + aLv laN-15

, N-1 a(aN_I) = ao aN — iai + • • • 1- aN_iaN-1,

where a(x) = V'N-1 a rn. In matrix form, this becomes L-dn=0 n

a(ao) a(ai)

= W a,

a(aN-1)

Page 126: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

112 6. Linear and Cyclic Convolutions

where a is the vector of components of a(x) and W is the Vandermonde matrix ,

-1 Cto • - • cr9vN-1

1 al • • cti -1 W = . •

_1 aN_i • aNN:1

Since the elements an, 0 < n < N, are distinct, the matrix W is invertible, so that we can recover a(x) from

- a(a0) - a(ai)

a = W-1

- a(aN-i)-

Consider two polynomials a(x) and b(x) in the quotient polynomial ring (6.12). Set

c(x) a(x)b(x) mod m(x).

By the CRT ring-isomorphism

c(cin) = a(an)b(an), 0 < n < N,

we have c = W-1((Wa)(Wb)), (6.13)

where (Wa)(Wb) denotes componentwise multiplication. Equation (6.13) generalizes the convolution theorem.

Example 6.9 Talce m(x) = x(x + 1). Then

[1 0 w = w-i _ 1 -1]

Example 6.10 Take m(x) = (x - 1)(x ± 1). Then

W = F(2).

Example 6.11 Take m(x) = x(x - 1)(x ± 1). Then

1 0 0 W = [1 1 11 .

1 -1 1

Example 6.12 Take m(x) = x(x -

[1. 0 1111

W =

1 -1

1 2

0

1 4

+.1)(x - 2). Then

0

-11 8

Page 127: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.3 Cook-Toom Algorithm 113

Equation (6.13) can be modified to design an algorithm for computing the linear convolution. Consider polynomials g(x) and h(s) of degrees N —1 and M — 1, respectively. The linear convolution

s(x) = h(x)g(x)

has degree L — 1, where L = M + N — 1. Denote by g, h and s the vectors of sizes N, M and L, corresponding to the polynomials g(x), h(x) and s(x). Take L distinct complex numbers cri, 0 < 1 < L, and form the polynomial of degree L

L-1

rn(x) = ll(x _

We call rn(x) a reducing polynomial. The design of an efficient algorithm depends, to a large extent, on the choice of a 'good' reducing polynomial. Define submatrices of W by

-1 ao • • • anm-1

1 ai

Wm =- . •

_1 aL-1 ' a Lm--1.1

and - 1 ao • • • anN-1

cd`v-i 1 al

WN = . • •

- cEL-1 • • '

Since deg(s(x)) = L — 1 < deg(rn(x)) = L,

we have s(x) = s(x) mod m(x),

which means that we can compute the linear convolution s(x) by computing the product h(s)g(x) in C[x]/m(x). In fact, (6.13) can be applied. The vectors g and h can be identified with vectors in CL by placing zeros in the last L — N and L — M coordinates. Since

WNg = Wg, Wmh = Wh,

we have s = W-1((Wmh)(WNg)).

an the examples that follow, we assume that

E Z, 0 < / < L,

Page 128: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

114 6. Linear and Cyclic Convolutions

and W, W-1 are rational matrices. Computing the actions of W and W-1 requires multiplications by rational numbers only. Complex number multiplications occur in forming the componentwise product

(Wmh)(WNg),

which we can rewrite as a diagonal matrix multiplication

D(h)WNg,

where D(h) diag(Wmh).

In practice, the vector h is fixed over many computations, and the di-agonal matrix D(h) is precomputed and is not counted in the arithmetic cost. In this case, computing the linear convolution can be carried out in the following three steps:

1. Compute WNg. This part requires (N - 1)L additions and LN multiplications by rational numbers.

2. Compute D(h)WNg. This part requires L multiplications.

3. Compute W-1(D(h)WNg). This part requires L(L-1) additions and L2 multiplications by rational numbers.

In summary, computing linear convolution with h fixed and Wm(h) precomputed requires

(L N - 2)Ladditions,

(L + N)Lmultiplications by rationals,

Lmultiplications.

This should be compared with the straightforward computation of M x N linear convolution, which requires

(N -1)(M - 1)additions,

NMmultiplications.

The arithmetic described here is for the general case. Significant reduction occurs if the numbers ai, 0 < 1 < L, are carefully chosen.

Example 6.13 Consider the 2 x 2 linear convolution, and take rn(x) = x(x -1)(x +1). Then

s I 01 0 01 [1 0 ( 1 0 11h llg .

I -1 7

-1 -1

We can write this out in the following sequence of steps:

Page 129: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.3 Cook-Toom Algorithm 115

O. Precompute

Ho = ho, = ho + H2 = h0 hl•

1. Compute

Go = go, Gi = go +Th., G2 — go — gi.

2. Compute

= HoGo, =Gi, S2 = H2G2.

3. Compute

1, „ 1 So = SO, Si = —

2 — S2), s2 = + —

2(Si + S2).

If we compute 11/1 and 0.2 in the precomputation stage, then multi-plication by in step 3 can be eliminated. We see that five additions and three multiplications are required to carry out the computation, compared to one addition and four multiplications by straightforward methods. As is typical, multiplications are reduced at the expense of additions.

A better algorithm can be produced using the following modification. Consider again the linear convolution, s(x) = h(x)g(x), of polynomials h(x) and g(x) of degrees M-1 and N — 1, respectively. Take L — 1 distinct numbers

0 < 1 < L — 1,

and form the polynomial

m'(x) = (x — cto)(x — cti) " • (x — ceL-2)-

Cqmpute c(x) h(s)g(x) mod mi(x).

c(x) has degree L — 2. Since the difference between s(x) and

c(x) + hm_igN_imi(x)

is a polynomial of degree L — 2 having L — 1 roots, we can recover s(x) from c(x) by the formula

s(x) = c(x)+hm_igN_imi(x). (6.14)

Now we compute s(x) in two stages:

1. 'Compute c(x) h(x)g(x) mod m'(x).

2. Compute s(x) by (6.14).

Page 130: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

116 6. Linear and Cyclic Convolutions

The modification above reduces the required additions without any change in the required multiplications. Computing s(x) by the above two stages is denoted by

s(x) h(x)g(x) mod tri(x)(x — co).

Example 6.14 Consider again the 2 x 2 linear convolution by taking m(x) = x(x + 1)(x — oo) and computing

s(x) h(x)g(x) mod m(s).

First compute c(x) h(s)g(x) mod x(x + 1)

by c _ [11 011 (( [ 01] h) [ _01] g))

Writing this out, we have the following sequence of steps:

O.Ho = ho, Hi = ho — hi,

1.Go = go, Gi = — 91, 2.,50 = HoGo, =Gi,

3.co = So, ci = So —

We assume that step 0 is precomputed. Steps 1-3 require two additions and two multiplications. We complete the computation by the following sequence of operations:

4.s2 = higi,

5.so = co, = + 52.

In the above algorithm, three additions and three multiplications are re-quired, reducing the arithmetic by two additions as compared to example 6.10. The whole computation can be written in one matrix equation as follows:

s ill 01 0 [ 01 1 h) ( 011

[0 0 10 1 [0 1

Example 6.15 Consider the 2 x 3 linear convolution and let

m(x) = x(x — 1)(x +1)(x — co).

The computation of linear convolution

s(x) = h(x)g(x),

Page 131: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.3 Cook-Toom Algorithm 117

where deg(h(x)) = 1 and deg(g(x)) = 2, can be carried out by first computing

c(x) h(x)g(x) mod x(x — 1)(x + 1)

and then using the formula

s(x) = c(x) + hig2x(x —1)(x + 1).

The first part is given by

c 01 0 0 [ 01 01 00 (

—1 1 1 [0

01 ( [1

0 1

—1

0 11 1

We carry out this computation as follows:

ho + hi ho — hi 0.Ho = ho, = , 112 = (precarnputed),

2 2 1.G0 = go, Gi = go + gi + 92 G2 = 90 — 91 + 92, 2.S0 = HoGo, = HiGi, S2 = H2 G2

3.00 — SO, el = S1 — S2, C2 = —SO Si ± S2.

This part requires six additions and three multiplications (8,s before, multiplications by 1 have been placed in precomputation stage).

We complete the computation by

4.s3 = hig2.

5.so = co, si = — s3, sz = cz.

In one matrix equation,

[10 0 01[10001 0 1-1-1 0100

s = —111 0 0010 0 0 0 1 0001

[11 1

(0

(11 —1

1

[11 1 0

? —1 0

1 g)

11 • The small size linear convolutions described in the above examples can

be efficiently computed by the Cook-Toom algorithm. The factors of the reducing polynomials have roots 0, ±1 with the result that the matrices Wm and WN have coefficients 0, ±1. The matrix W-1 is more complicated, but the rational multiplications can be carried out in the precomputation stage. This is a general result that will be discussed in section 5. As the size

Page 132: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

118 6. Linear and Cyclic Convolutions

of the problem grows, the roots of the reducing polynomials must contain large integers that appear along with their powers in the matrices Wm and WN. If the large integer multiplications are carried out by additions, then as the size of the problem grows, the number of required additions grows too large for practical implementation. In any case, the computation becomes less stable as the size grows [3]. In the next section, we will present efficient larger size algorithms using a generalization of the Cook-Toom algorithm.

The linear convolution can be used to compute multiplication in quotient polynomial rings. In section 4, we will use this approach to present cyclic convolution algorithms [5].

Example 6.16 We want to compute

ci(x) h(x)g(x) mod mi(x),

where

mi(x) = x2 ,

Tri2(x) = x2 + 1,

m3(x) = x2 — 1,

in,4(x) = x2 + x + 1,

771,5(x) =- x2 — x + 1.

Computing first the linear convolution s(x) -= h(x)g(x) by the algorithm designed in example 6.11, we have

s 01 ( h) (Ili 01 1 g))

[0 0 [0 1 _I 0 1

The operation, mod mi(x), can be viewed as matrix multiplication. Set

A =

We have

1 0

ci = [ _1

Continuing in this way, we have

e2 [ 01

c_ [1

[1 —1

1 [1

0

0 1

1

0 —11 . 1

"h)(Ag))'

((Ah) (Ag)),

((Ah)(Ag)),

Page 133: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.4 Winograd Small Convolution Algorithm 119

1 0 —1] c4 = [ i 0 ((Ah) (Ag)) ,

ri 0 ((Ah) (Ag)) .

C5 = [ 1 --1 2

6.4 Winograd Small Convolution Algorithm

The Cook-Toom algorithm uses the CRT relative to a reducing polynomial m(x) constructed from linear factors. In the examples, the roots of these linear fa.ctors are restricted to be integers, since by doing so non-rational multiplications are kept to a minimum. However, additions grow rapidly as the size of the computation increases. A major part of these additions is needed to carry out the rational multiplications coming from the integer coefficients of the linear factors. Of major importance is the numerical stability of the computation [3].

By applying the CRT more generally, Winograd designed algorithms that could efficiently handle a larger collection of small size convolutions. The growth in the number of required additions is not as rapid as in the Cook-Toom algorithm, while the number of required multiplications increases modestly.

Consider a reducing polynomial

rn(x) = mi (x)rn2(x) • • rar(x),

where mi(x), 0 < / < r, are relatively prime. We do not require that these polynomials be linear. This leads to the possibility of building reducing polynomials rn(x) of higher degrees than before and still having factors with small integer coefficients. As we saw in the preceding section, the coefficients of the factors of the reducing polynomials become multipliers in the corresponding algorithm. If these coefficients are small integers, then these multiplications can be carried out by a small number of additions. As the size of these integers grows, the number of required additions grows.

Suppose that in(x) = mi(x)m2(x), where mi (x) and m2 (x) are relatively prime. The extension to more factors follows easily. We want to compute multiplication in the polynomial ring

C [x] /m(x). (6.15)

By the CRT, we can carry out this computation as follows. Suppose that

deg(m(x)) = N, deg(mk(x)) = Nkl k = 1,2.

Take polynomials h(x) and g(x) of degree < N . We want to compute

c(x) -a- h(x)g(x) mod m(x).

Page 134: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

120 6. Linear and Cyclic Convolutions

1. Compute the reduced polynomials

h(k)(x) h(x) mod mk(x), k = 1, 2,

g(k)(x) g(x) mod mk(x), k = 1, 2.

2. Compute the products

C(k) (X) h(k) (1)g(k) (X) mod rnk(x), k = 1, 2.

The CRT g-uarantees that c(x) is uniquely determined by the polynomials c(k)(x), k = 1, 2, and prescribes a method for its computation. The unique system of idempotents

{el (x), e2(x)},

corresponding to the factorization rn(x) = mi(x)m2(x), satisfies

ek(s) 1 mod mk(x), k = 1, 2,

ei(x) 0 mod rnk(x), 1, k = 1, 2, k.

Then 1 =_ ei (x) + e2(x) mod m(s),

and c(x) c(1)(x)ei(x) + c(2)(x)e2(x). (6.16)

To complete the computation of c(x) by (6.16), we require the following steps:

3. Compute the products

Ck(S) C(k) (s)e k (x) mod m(x), k = 1, 2.

4. Compute the sum

c(x) = ci(x) + c2(x).

In the first stage of the algorithm, we compute multiplications in the polynomial rings

C[x]Imk(x), k = 1, 2. (6.17)

In part, multiplications in the polynomial ring (6.15) have been replaced by multiplications in the polynomial rings (6.17). Efficient small size al- gorithms computing multiplications in (6.17) provide buildink blocks for

computing multiplications in (6.15). In the previous section, we designed algorithms for computing linear

convolution and multiplication in quotient polynomial rings in the form

c = C ((Ah) (Bg)) ,

Page 135: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.4 Winograd Small Convolution Algorithm 121

where h, g are input vectors, c is an output vector and A, B and C are matrices corresponding to Wm, WN and W-1. Such algorithms are bilin-ear algorithms. We will now see how to piece together bilinear algorithnas computing multiplication in the polynomial ring C[x]im(x). This per-mits efficient small algorithms to be used as building blocks in larger size algorithms.

Suppose that bilinear algorithms compute the products,

C(k) (X) h(k) (X)g(k) (X) mod mk(x), k = 1, 2,

c(k) = ck ((Akh(k)) (Bkg(k))) (6.18)

We assume that Ak and Bk are N x Nk matrices and that Ck is an Nk X N matrix. The operation mod mk(x) can be computed by matrix multiplication,

h(k) = Mkh,

g(k) _ mkg,

where Mk is an Nk X N matrix having coefficients determined by mk(x). Set

A= [AA11 , B [B1 m = 2 B2 M2

and set

Mi _ 0 1 Am = [A2m21 , Bm =

.L.,2L ■u2 C 0 C2 ]

Then [c(1)] c(2) = C(Amh)(Bmg).

The vectors c(1) and c(2) determine the polynomials c(1)(x) and c(2)(x), 'which now must be put together using the idempotents. Multiplication by ek(s) mod m(x) can be described by an N x Nk matrix Ek, k = 1, 2. We have

ck = Ekc(k), k = 1, 2

and c = EC((Amh)(Bmg)), (6.19)

where E = E2]•

The efficiency of (6.19) depends on two factors: the efficiency of the small bilinear algorithms (6.18) and the efficiency of how these building blocks are put together. We assume throughout that the factors mi(x) and m2(x) contain only small integer coefficients. Then M has only small integer coefficients. Although the matrix E has rational coefficients, as we will see

Page 136: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

122 6. Linear and Cyclic Convolutions

in section 6 that frequently its action can be computed in a precomputation stage.

Example 6.17 Take m(x) = x(x2 + 1).

We will use (6.19) to compute

c(x) a- h(x)g(x) mod m(x),

where the bilinear algorithm computing multiplication mod (x2+1) is taken from example 6.16. First, with

mi(x) = x, m2(x) = x2 + 1,

we have [1 0 —1

Mi = [ 1 0 0 ] , M2 -= ] .

We can see directly that

Ai = = = [ 1 .

Then

10 0 0 C = [0 1 0 —11.

01-11

The idempotents are given by

ei(x) = x2 + 1, e2(x) = —x2 ,

from which it follows that

1 0 0 Ei. = [ 01 , E2 = [ 0 1 I .

1 - 1 0 '--- ,...

Direct computation shows that

10 0 0 EC = [0 1 —1 11.

1 —1 0 1

From example 6.16,

A2 -= B2 = 1 [1

0

0 - 1 1

1 0 0

, C2 = [ 1 0 - 1 1 1 - 1 1 i '

1 Am = B m =

[1

1 0

0 0

—1 1

0 —1 —11' 0

Page 137: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.4 Winograd Small Convolution Algorithm 123

Thus

00 01 00 01 1 0 0 01

c = 0 1 —1 1 1 —1 —1 1 —1 —1

[1 —1 0 1

(0 1 0 0 1 01

Example 6.18 Take

m(x) = (x + 1)(x2 + x + 1),

where mi(x) -= x + 1, m2(x) = x2 + + 1.

Then 0 _I. m1=0. 11, M2 = 1 -11

Directly, Al = Bi = Ci = [ 1] .

By example 6.13, we can take

[1 0 A2 = B2 = 1 -1 , G = [

li °I. 01 ' 0 1

The idempotents are given by

ei(x) = x2 +x +1, e2(x) = — (x2 + x),

implying that 1 0 1

= E2 = [-1 21 . 1 -1 1

Putting this all together in (6.19), we have

c_ [1. 1 —1 01 [11 —01 11 11 —01 11

1 1 —2 1 1 0 —1 1

1 —1 0 1 —1 0 ( 0 1 —1 0 1 —11 g)

Example 6.19 We design an algorithm computing multiplication mod m(x) where

m(x) = mi.(x)1712(x),

m,i(x) = x(x2 + 1), m2(x) = (x + 1)(x2 + x +1).

Page 138: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

124 6. Linear and Cyclic Convolutions

From the preceding two examples, we can compute multiplication mod mk(x), k = 1,2, by taking

[1 0 0 1 0 0 0

1 0 —1 Ai = = = [0 1 —1 11 ,

1 —1 —11 ' 1 —1 0 1

0 1 0

Directly,

1 - 1 1 1 0 - 1

A2 = B2 = 1 - 1 0 0 1 - 1

1 1 - 1 0 C2 = 1 1 —2 11 .

1 0 —1 1

10 0 0 0 . 0 0 0 0 0 0 _10

10 0 _1 2 —2 M2 = 0 1 0 —2 3 —21 .

0 0 1 —2 2 —1

Then -1 0 0 0 0 0- 1 0 —1 0 1 0 1 —1 —1 1 1 —1 0 1 0 —1 0 1

Am = Bm = 1 —1 1 —1 1 —1 1 0 —1 1 0 —1 1 —1 0 1 —1 0

_O 1 —1 0 1 — 1_

The idempotents are given by

1 ei (x) = —

2 (3x5 5x4 6x3 + 5/2 ± 3x ± 2),

1 e2 (x) = — — (3x5 + 5x4 + 6x3 + 5x2 + 3x),

2

and since in(x) = x6 + 2x5 + 3x4 3x3 2x2 x,

we have that

-2 0 0 - -0 0 0 .7

3 —1 1 3 —3 1 1 5 —3 1 1 5 —3 —1 '`-'1 2 6 —4 0 E2 = 6 —4 0

5 —3 —1 5 —3 —1 _3 —1 —1_ _3 —1 —1_

Page 139: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.5 Linear and Cyclic Convolutions 125

Direct computation shows that C' = EC is given by

2 0 0 0 0 0 0 0 4 —2 1 0 1 0 2 —2

ci = —1 2

6 6

—4 —4

3 4

—2 —4

1 2

2 2

0 2

—4 —4

4 —2 3 —4 1 2 0 —4 2 0 1 —2 1 2 0 —2

Then, by (6.19), c CVAmh)(Amg)).

6.5 Linear and Cyclic Convolutions

The methods of section 4 decompose the computation of polynomial multiplication mod m(x) into small size computations of polynomial mul-tiplications mod mi(x) and rn2(x), where m(x) = mi(x)m.2(x), ini(x) and m2(x) are relatively prime. We can apply these ideas to linear and cyclic convolutions to decompose a large size problem into several small size problems. As we will see, algorithms computing linear convolution, cyclic convolution and multiplication modulo a polynomial can be used as a part of other algorithms of the same type. This permits large size problems to be successively decomposed into smaller and smaller problems.

Example 6.20 Consider the 2 x 2 linear convolution. Using the algo-rithm given by (6.19) for the reducing polynomial rn(x) = x(x2 — 1) leads to the same algorithm as that designed in section 3.

Example 6.21 Consider the 2 x 3 linear convolution

s(x) = g(x)h(x),

where g(x) = go + h(s) ho + hix + h2x2

First compute the product

c(x) g(x)h(x) mod x(x2 +1)

by the algorithm of example 6.14. Then we compute s(x) by

s(x) = c(x) + gili2x(x2 +1).

In the next few examples, two 3 x 3 linear convolution algorithms will be derived. The first is based on the Cook-Toom algorithm while the second follows from the methods of section 4.

Page 140: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

126 6. Linear and Cyclic Convolutions

Example 6.22 Consider the 3 x 3 linear convolution

s(x) = g(x)h(x)

of polynomials g(x) and h(s) of degree 2. Compute

c(x) g(x)h(x) mod x(x — 1)(x +1)(x — 2)

by the Cook-Toom algorithm. Then

s(x) = c(x) + g2h2x(x — 1)(x + 1)(x — 2).

Working this out, we have

1 s = —

4C((Ag)(Ah)),

where

1 0 0 40000 1 1 1 0 2 —2 0 8

A = 1 —1 1 , C= —7 5 3 —1 —4 1 2 4 3 —3 —1 1 —8 0 0 1 0 0 0 0 4

c(x) g(x)h(x) mod (x4 — 1).

Using the factorization

x4 - = (x2 + 1) ( x2 - 1 )

in (6.19), we have

r 0 01 r 010 mi= [0 0 j, m2=1_0 0 j-

Compute multiplication mod (x2 1) and multiplication mod, (x2 — 1) by example 6.16. Then

0 Ai = A2 = Bi = B2 = -1 ,

0 1

.

Before designing the second 3 x 3 linear convolution algorithm, we will design a four-point cyclic convolution algorithm that will then be used to design a 3 x 3 linear convolution algorithm having slightly more multipli-cations but significantly fewer additions. We also note that the convolution theorem can be used to efficiently compute a four-point cyclic convolution.

Example 6.23 Consider the four-point cyclic convolution

Page 141: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.5 Linear and Cyclic Convolutions 127

o -11 cl= -1 J

The idempotents are given by

1 (x) = (x2 — 1),

and we have

p. 0 C2 = -

E2(x) = (x2 +1),

1 0 1 0 _ 1 0 1 1 0 1

el — —1 0 ' e2 1 0 0 —1 0 1

Putting all of these together,

1 c = —

2C((Ag)(Ah)),

where 1 0-11 01 1 —1 1 1 —1 1

C = —1 0 11 01 ' —1 1 —1 1 —1 1

1 0 —1 0 1 —1 —1 1 0 1 0 —1

A = 1 0 1 0 1 —1 1 —1 0 1 0 1

-

Example 6.24 Consider the 3 x 3 linear convolution

s(x) = g(x)h(x).

'First compute the four-point cyclic convolution

c(x) g(x)h(x) mod (x4 — 1)

by the algorithm designed in example 6.20. We note that since the degree of g(x) and h(x) is equal to two, we can rewrite example 6.20 as

1 c = —C((A'g)(A111)),

where 1 0 —1 1 —1 —1 0 1 0

A' = 1 0 1 1 —1 1 0 1 0

Page 142: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

128 6. Linear and Cyclic Convolutions

We can now compute s(x) by

s(x) = c(x) + g2h2(x4 — 1).

We see that now all of the coefficients are 0, 11 while in example 6.19, 'large' integers appear in the matrices.

Example 6.25 Consider the 4 x 4 linear convolution

s(x) = g(x)h(x).

First compute c(x) g(x)h(x) mod m(x),

where the reducing polynomial is

m(x) = x(x +1)(x2 +1 )(x2 + x +1).

Then s(x) = c(x)+ g3h3m(x).

To compute c(x), we use the algorithm designed in example 6.16. Since

deg(g(x)) = deg(h(x)) = 3,

we have c C'((kg)(A'h)),

where 1 0 0 0 1 0 —1 0 1 —1 —1 1 0 1 0 —1

A' = 1 —1 1 —1 ' 1 0 —1 1 1 —1 0 1 0 1 —1 0

and C' is as given in example 6.16. Consider the N-point cyclic convolution,

c(x) g(x)h(x) mod (xN — 1).

If an efficient N-point FT is available, then the convolution theorem is usually the best approach for computing an. N-point cyclic :Convolution. The algorithms of section 4 also can be called upon. For instance, take the factorization,

K-1

XN -1 = ok(x) , (6.20) k=0

Page 143: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

6.5 Linear and Cyclic Convolutions 129

where the polynomials Ok (x), 0 < k < K , are the prime factors of XN - 1 over the rational field Q. These polynomiaLs are usually called cyclotomic polynomials. If

g(k) =- g(x) mod q5k(x),

h(k) h(x) mod Ok(x), 0 k < K,

then the cyclic convolution c(x) can be found from the products,

c(k)(x) g(k)(x)h(k)(x) mod Ok(x), 0 < k < K, (6.21)

by the formula, K-1

C(X) E 0(k)(X)Ek(S),

k=0

where {Ek(x) : 0 < k <

is the unique system of idempotents corresponding to the factorization (6.20).

As discussed in section 4, choosing a factorization over the rational field Q implies that the only multiplications required to carry out the algorithm are those given in (6.21). We continue to assume that the factorization is over Q, but point out that factorization over other fields can lead to efficient algorithms. This will be the case if multiplication by elements from the field can be efficiently implemented. For example, the field Q(i) of Gaussian numbers consisting of all complex numbers a + ib, a and b rational, is frequently taken. The value of 'extending' the field of the factorization is that the prime fa,ctors are of smaller degrees.

In the following examples, we will work out algorithms following the above approach. The multiplications in (6.21) will be computed by first passing through linear convolution.

Example 6.26 Consider the three-point cyclic convolution,

c(x) g(x)h(x) mod (x3 — 1).

The factorization (6.20) is given by

X3 - = (X - 1)(X2 + X +1).

Example 6.16 provides the bilinear algorithm for computing multiplica-tion mod x2 + x + 1. We have

[i 0 -11 = [1. M2 =

0 1 -1

= = = [ 1 ,

Page 144: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

130 6. Linear and Cyclic Convolutions

1 0 A2 = B2 = 1 - 1 C [

2 - 0 1

The idempotents are given by

1 n

el (x) = (x' + x + 1), e2(x) =

1 1

1

0 -1 1

-1 0 •

(x2 + x — 2),

and .1 1 2 —1

= [11 , E2 = -1 2 I . 3 1 — 1 — 1

By (6.19), we have the bilinear algorithm

c = —1C1((kg)(A'h)),

3

where

1 1 1 - 1 1 1 —2

1 0 —1 A' = C' = [1 1 — 2 1 I .

1 —1 0 ' 1 —2 1 1

0 1 —1

Example 6.27 Consider the five-point cyclic convolution

c(x) g(x)h(x) mod (x5 — 1).

Factorization (6.20) is

x5 — 1 = (x — 1)(x4 + x3 + x2 + x + 1).

Then

=[ 1 1 1

Directly,

Multiplication mod (x4+x3+x2+x+1) can be computed by first computing the 4 x 4 linear convolution by the algorithm designed in eximple 6.22. Using the notation of example 6.22,

A2 = B2 = A',

C2 = CI .

1 0

0 1

0 0

0 0

—1 —1

1 1] , M2 = 0 0 1 0 —1 •

0 0 0 1 —1

Ai = = = [ 1 ].

Page 145: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Direct computation shows that

6.6 Digital Filters 131

1 1 1 1 0 0 0 —1 0 —1 0 0

—1 —1 1 0 Am = 1 0 —1 0 = Bm.

—1 1 —1 0 0 —1 1 —1

—1 0 1 —1 1 —1 0 0

To complete the ingredients needed for (6.19), we observe that the idempotents are

ei (x) = —3

(1 + x + X2 ± X3 + X4),

1 e2 (x) = — —

3(-4 ± ± X3 ± X4),

which can be used to compute

-1 - —4 1 1 1

1 1 1 1 —4 1 1

Ei = ,7 1 „ E 2 =- - - 1 1 —4 1 . j 1 3 1 1 1 —4

_1 _ 1 1 1 1

6.6 Digital Filters

The bilinear algorithms computing convolution, developed from the CRT, have the form

s = C((Bh)(Ag)), (6.22)

where C is usually more complicated than A or B. For application to digital filtering, we typically have one of the inputs, say h fixed, at least over many occurrences, and g varies. We will now discuss the concept of the transpose of (6.1), which permits s to be computed by the formula

s _ nteth)00), (6.23)

where is the matrix determined by reversing the columns of B and o is the matrix determined by reversing the rows of C.

Since h can be viewed as fixed, the computation Gill can be made once and for all. This precomputation stage, once made, does not enter into the overall efficiency of the algorithm, which now depends on the matrices A

Page 146: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

132 6. Linear and Cyclic Convolutions

and B. In the examples of section 5, the entries of A and B were always 0, 1 and —1, and that makes very obvious the advantage of precomputing Cth.

In [3], the implications of this discussion to the stability of the computation were studied.

We turn now to a proof of (6.2). The result depends on the following observation about Toeplitz matrices.

Let T be a Toeplitz matrix that admits the factorization

T = CDB,

and let R denote a matrix of the same size of T given by

0 0 • • 0 1 0 0 • • • 1 0

0 1 - • • 0 0 1 0 • • 0 0

Then

which proves that

Consider now

We can write

= RTR = (RC)D(BR)= aD13,

T = (6.24)

s(x) = g(x)h(x) mod (xN — 1).

s = C(h)g,

where C(h) is the circulant and, hence, the Toeplitz matrix with the vector h as its first column. Suppose that we have a bilinear algorithm

s = C((Ag)(Bh))

computing s. Let D be the diagonal matrix

D = diag(Ag),

Then we write

g = CDBh,

and C(h) = CDB.

By (6.3),

C(h) = /31/30,

Page 147: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

References 133

from which it follows that

s = nt((Ag)(oth)).

We have additional results that can be proved from (6.3). For example, if m(x) = x4 - 6, where 6 is a constant, then

s(x) g(x)h(x) mod m(x)

can be computed by s = T(h)g,

where li(h) is the Toeplitz matrix

ho

h.1

ehN-i •

ho •

• •

• •

6hi Viz

Tc(h) = [ .

hN_i hN-2 ho

Arguing as above, if s = C((Ag)(Bh)) is a bilinear algorithm, we have

s = fit((Ag)(ath)).

Since (Ag)(011)

is componentwise multiplication, the order can be changed, and

s = nt((e-th)(Ag)), (6.25)

where vector h represents the system elements and g is the input vector.

References

[1] Winograd, S. "Some Bilinear Forms Whose Multiplicative Complex-ity Depends on the Field of Constants", Math. Syst. Theor., 10, 1977, pp. 169-180.

[2] Agarwal, R. C. and Cooley, J. W. "New Algorithms for Digital Con-volution", IEEE Trans. Acoust. Speech and Signal Proc., 25, 1977, pp. 392-410.

[3] Auslander, L., Cooley, J. W. and Silberger, A. J. "Number Stability of Fast Convolution Algorithms for Digital Filtering", in VLSI Signal Proc., IEEE Press, 1984, pp. 172-213.

[4] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 3 and 7.

Page 148: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

134 6. Linear and Cyclic Convolutions

[5] Nussbaumer, H. J. Fast Fourier Transform and Convolution Al-gorithms, Second Edition, Springer-Verlag, 1981, Chapters 3 and 6.

[6] Burrus, C. S. and Parks, T. W . DFT/FFT and Convolution Algorithms, John Wiley and Sons, 1985.

[7] Oppenheim, A. V. and Schafer, R. W. Digital Signal Processing, Prentice-Hall, 1975.

Problems

1. For two vectors h = [2, 3, 4, 5] and g = [6, 7, 8, 1], compute their linear convolution by

a. Convolution summation.

b. Polynomial multiplication.

c. Matrix multiplication.

2. For two vectors a = [2, 3, 4, 5] and b = [6, 7, 8, 1], compute their cyclic convolution by

a. Convolution summation.

b. Polynomial multiplication.

c. Matrix multiplication.

3. Write the cyclic shift matrices S5 and S6 . Prove that

= S8 = h .

4. For a = [1, 2, 3, 4, 5], write the circulant matrix C(a), and represent C(a) by the cyclic shift matrix S5 .

5. Compute the four-point cyclic convolution of problem 2 by the con-volution theorem. Show that the results are the same as the direct computation.

6. Diagonalize the matrix

2 A =

[4

1 8

8 4 2 1

1 8 4— 2

2 1 8 •

4

7. Show that F(5)S5 = D5F(5), where D5 is a diagonal matrix, and give the diagonal matrix.

Page 149: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

[e rn(x) = x(x — 1)(x ± 1)(x ± 2)(x — 2)(x - Cook-Toom algorithm for a 3 x 4 lineal

d the arithmetic counts of problem 8.

Page 150: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 135

- oo), and derive a - convolution.

Page 151: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)
Page 152: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

7 Agarwal-Cooley Convolution Algorithm

The cyclic convolution algorithms of chapter 6 are efficient for special small block lengths, but as the size of the block length increases, other methods are required. First, as discussed in chapter 6, these algorithms keep the number of required multiplications small, but they can require many ad-ditions. Also, each size requires a different algorithm. There is no uniform structure that can be repeatedly called upon. In this chapter, a technique similar to the Good-Thomas PFA will be developed to decompose a large size cyclic convolution into several small size cyclic convolutions that in turn can be evaluated using the Winograd cyclic convolution algorithm. These ideas were introduced by Agarwal and Cooley [1] in 1977. As in the Good-Thomas PFA, the CRT is used to define an indexing of data. This in-dexing changes a one-dimensional cyclic convolution into a two-dimensional one. We will see how to compute a two-dimensional cyclic convolution by 'nesting' a fast algorithm for a one-dimensional case inside another fast algorithm for a one-dimensional cyclic convolution. There are several two-dimensional cyclic convolution algorithms that, although important, will not be discussed. These can be found in [2].

7.1 Two-Dimensional Cyclic Convolution.

Consider two M x N matrices

g = [g (m, n)]0<m<M, 0<n<N)

h = [h(m, n)]0<m<m, o<n<N•

Page 153: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

138 7. Agarwal-Cooley Convolution Algorithm

We will define the two-dimensional cyclic convolution

s=h*g.

Associate to g and h the polynomials in two variables,

M-1 N-1

G(x , y) = E E g(rn,n)sm , (7.1) m=0 n=0

M-1 N-1

H (x , y) = E E h(k,l)xk yl (7.2) k=0 i=0

Form the polynomial

S (x , y) = H (x, y)G(x , y) mod (xm — 1) mod (yN — 1),

we first form the polynomial product

H (x , y)G(x , y)

and then reduce mod (xm — 1) by setting xm = 1; in the same way, we reduce mod (yN — 1) by setting yN = 1. We can write

M-1 N-1

S (x , y) = E E s(m,n)si yn (7.3) m=0 n=0

We call the M x N matrix

S = 1S(M, 71)]0<m<M, 0<n<N

the cyclic convolution of h and g. We can compute s by the following nesting procedure. First, by accumu-

lating all the terms that have the same power of x, we can rewrite (7.1) as

m —1

G(x,y) = E gy.,(y)xr , m=0

where N-1

gm(y) = E g(m, n)yn , 0 < m < M. n=0

In the same way, we can rewrite (7.2) and (7.3) as

m—i H (x, y) E hm(y)xm ,

m=0

Page 154: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

7.1 Two-Dimensional Cyclic Convolution 139

m—i

S(x, y) = E sm(y)xm

Then

m- si(y) E.-- E hi_m(y)g,,,(y) mod (yN - 1), 0 < / < M, (7.4)

m=0

which can be viewed as cyclic convolution mod M where the data are no longer talcen from the complex field but are taken from the ring C[y]/(yN - 1).

Main Idea The cyclic convolution algorithms of chapter 6, designed for complex data, hold equally well for data taken from any ring containing the complex field. In particular, they can be used to compute (7.4). In this case, multiplication and addition mean multiplication and addition in C[y1/(yN - 1). The multiplication is cyclic convolution mod N.

Suppose that cyclic convolution mod M is computed by an algorithm using a(M) additions and m(M) multiplications, with similar notation for cyclic convolution mod N. Then the M polynomials of (7.4),

st(y), 0 5_1 < M,

are computed using m(M) N-point cyclic convolutions and a(M) additions in C[y]/(yN - 1). Since each N-point cyclic convolution is computed using a(N) additions and m(N) multiplications, we have that

rn(M)ni,(N)

multiplications and Na(M) + ni(M)a(N)

additions are needed to compute s. The order of the operations above can be interchanged by reversing the

roles of M and N. This has no effect on the number of multiplications but does affect additions.

We will now translate this discussion into matrix language. The M polynomial computations given in (7.4) can be rewritten as

so (Y) ho(Y) si (y) hi(y)

hm -1(Y) _ sm-i.(Y) _

hm-i(g) • ho(Y) •

hm -2 (Y) •

114) - h2(Y)

go(Y) gi(g)

ho(Y) _ .gm-1(Y)

mod (yN - 1). (7.5)

Page 155: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

140 7. Agarwal-Cooley Convolution Algorithm

The matrix

- ho(Y) hm—i(Y) • • • hi(Y) hi (Y) ho (Y) • • • h2 (Y) H(y) =-

_ hm—i(Y) hm-2(Y) • • • ho(Y)

is a circulant matrix having coefficients in C[y]/(yN — 1). Set

- g(rn,0) g (m, 1)

= , 0 < m < M,

_ g(m, N — 1) _

and observe that gni is the vector formed from the m-th row of the matrix g. In the same way, define hi, 0 < / < M and sk, 0 < k < M. Let Hi denote the circulant matrix having hi as the 0-th column. We can rewrite (7.5) as

so go si gi

= H

sm—i _ gm—i _

where H is the block circulant matrix with circulant blocks -

Ho Hm_i • • Hi Ho • •

H = (7.6)

Hm—i

and (7.6) is the matrix description of a two-dimensional cyclic convolution.

In chapter 6, bilinear cyclic convolution algorithms were designed as matrix factorizations of circulant matrices. We will extend these one-dimensional algorithms to the two-dimensional computation given by H. Matrices A and B define a bilinear N-point cyclic convolution algorithm bilinear algorithm/cyclic convolution if, for any N x N circulant matrix C, a diagonal matrix G can be found satisfying

C = BGA. (7.7)

A class of algorithms of this kind has been given in chapter 5. In the con-volution theorem, we have A = B-1 = F(N). In the Winograd algorithms, the matrices A and B are matrices of small integers but are no longer square matrices.

Page 156: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

7.1 Two-Dimensional Cyclic Convolution 141

First consider the special case

H = C CI, (7.8)

where C is an M x M circulant matrix and C' is an N x N circulant matrix. H has the form (7.6). Take bilinear algorithms computing M-point and N-point cyclic convolutions

C = BGA, (7.9)

C' = B'G' A' , (7.10)

where G and G' are diagonal matrices. Placing (7.9) and (7.10) in (7.8), we can write

H = (B B')(G G')(A 0 A'),

where G G' is a diagonal matrix. Consider again the matrix H given in (7.6). By (7.10), we can write

Ht = , 0 < 1 < M, (7.11)

where the diagonal matrix GI is determined by Hi. Placing (7.11) in (7.7), we can rewrite H as

H = (I m B')D1(Im A'), (7.12)

where D' is the block circulant matrix having diagonal matrix blocks

G'm_i • G'0

D' =

G' _ m -1 G10

Suppose that the size of each diagonal matrix q, 0 < < M, is K . Then A' is an K x N matrix and B' is an N x K matrix.

The matrix P(MK,K)D'P(MK,M) (7.13)

is a block diagonal matrix consisting of KMxM circulant blocks. By (7.7), we can write (7.13) as the matrix direct sum

E EBBGkA, (7.14) k=0

where Gk, 0 < k < K , is a diagonal matrix. This implies that D' can be written as

(BO/K)D(A®/K) (7.15)

Page 157: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

142 7. Agarwal-Cooley Convolution Algorithm

for some diagonal matrix D. Placing (7.15) in (7.12),

H = (B B')D (A 0 A'). (7.16)

We see that the bilinear algorithms (7.9) and (7.10) can be used to com-pute the two-dimensional cyclic convolution given by H. In particular, the convolution theorem,

H = (F(M) F(N))-1D(F(M) F(N)), (7.17)

is the two-dimensional convolution theorem.

7.2 Agarwal-Cooley Algorithm

The CRT will be used to turn a one-dimensional N-point cyclic convolution where

N = NiN2, (NI., N2) = 1,

into a two-dimensional x N2 cyclic convolution. By the results of section 1, we then can carry out the computation by nesting an Ni-point cyclic convolution algorithm inside an N2-point cyclic convolution. Formula (7.17) is the explicit form of this nesting.

Choose idempotents el and e2 satisfying

ei 1 mod NI, ei 0 mod N2

e2 0 mod NI., e2 1 mod N2 .

Each n, 0 < n < N, can be uniquely written as

n niei + n2e2 mod N, 0 < ni < 0 < n2 < N2.

Consider the N-point cyclic convolution

s = h * g,

which we can rewrite in the form

where H is the circulant matrix

s = Hg,

h(0) h(N — 1) 7 • h(1) h(1) h(0) - h(2)

H =

h(N — 1) h(N — 2) • h(0)

Page 158: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

7.2 Agarwal-Cooley Algorithm 143

We will show that a permutation matrbc P can be found such that

Ps= (PHP-1)Pg, (7.18)

where PH13-1 is a block circulant matrix with circulant blocks. As a consequence, formula (7.18) computes Ps as the two-dimensional cyclic convolution in the sense of formula (7.6).

Example 7.1 Take N = 6 with Ni = 2 and N2 = 3. The idempotents are

el = 3, e2 = 4.

Consider the six-point cyclic convolution

s = Hg,

where H is a 6 x 6 circulant matrix. Define the permutation ir of Z/6 by

7r = (0,4,2; 3,1,5),

and denote by P the corresponding permutation matrix. Then

-

-100 000

000 010

001 000 P =

000 100 '

010 000

_000 001_

Direct computation shows that

h(0) h(2) h(4) h(3) h(5) h(1)

h(4) h(0) h(2) h(1) h(3) h(5) h(2) h(4) h(0) h(5) h(1) h(3)

PHP-1= h(3) h(5) h(1) h(0) h(2) h(4)

h(1) h(3) h(5) h(4) h(0) h(2)

h(5) h(1) h(3) h(2) h(4) h(0)

which is a 2 x 2 block circulant matrix having 3 x 3 circulant blocks. The input and output vectors are

g(0) s(0) -

g(4) s(4) g(2) s(2)

Pg g(3) ' s(3) '

g(1) s(1)

g(5) s(5)

Page 159: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

h(n2 + N2ni,k2+ N2ki) = hani - ki)ei -I- (n2 - k2)e2), (7.19)

where 0 < ni, ki < Ni, 0 < n2, k2 < N2 . From (7.19), we have that PHP-1 is an NI x Ni block circulant matrix having N2 X N2 circulant blocks:

,

_

where Hi is the circulant matrix having 0-th column:

[

h(tei) h(lei +e2)

.

h(lei + (N2 - 1)„e2) -,,

Ho Hivi—i • • 111 Hi Ho ' • •

PHP-1 =

.

_ H-Ni-i • • - Ho

144 7. Agarwal-Cooley Convolution Algorithm

and

Ps = (PHP-1)Pg

is a two-dimensional 2 x 3 cyclic convolution. Consider the general case N = AriAr2, where Ni and N2 are relatively

prime. As in the Good-Thomas PFA, we define the permutation ir of ZIN by the formula

7r(n2 Nzni) niet -1-n2e2 mod N, 0 < ni < 0 < nz < N2

Denote the corresponding permutation matrix by P. Pg is formed by reading across the rows of the matrix

g(0) g(e2) • g((N2 - 1)e2)

g(et) g(ei+ e2) • • g(ei + (N2- 1)e2)

g((Ni- 1)ei) g((Ni - 1)ei -I- (N2 - 1)e2)

Write

PHP-1 = [h(1,k)], 0 <1,k < N.

Then

As a result,

Ps = (PHP-1)Pg (7.20)

is a two-dimensional x N2 cyclic convolution.

Page 160: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 145

Fast algorithms computing two-dimensional cyclic convolution can now be applied to compute (7.20) and in this way the N-point cyclic convolu-tion. If Ni-point cyclic convolution is computed using a(Ni) additions and ni(Ni) multiplications, then

m(Ni)m(N2)

multiplications are needed to compute the N-point cyclic convolution s by (7.20) and

N2a(Ni) rrt(Ni)a(N2)

additions are required.

References

[1] Agarwal, R. C. and Cooley, J. W. "New Algorithms for Digital Con-volution", IEEE Trans. Acoust., Speech and Signal Proc., 25, 1977, pp. 392-410.

[2] Blahut, R. E. Fast Algorithm,s for Digital Signal Processing, Addison-Wesley, 1985 , Chapter 7.

[3] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algo-rithms, Second Edition, Springer-Verlag, 1981, Chapter 6.

[4] Arambepola, B. and Rayner, P. J. "Efficient Transforms for Multidi-mensional Convolutions", Electron. Lett., 15, 1979, pp. 189-190.

Problems

1. Derive a three-point Winograd cyclic convolution algorithm.

2. Derive a five-point Winograd cyclic convolution algorithm.

3. Derive a 15-point Agarwal-Cooley convolution algorithm using the results of problems 1 and 2.

4. Derive a four-point Winograd cyclic convolution algorithm.

5. Derive a 12-point Agarwal-Cooley convolution algorithm using the results of problems 1 and 4.

6. Write a 2 x 2 cyclic convolution algorithm

S (x, y) = H (x, y)G (x , y), (mod X2 - 1)(mod y2 — 1).

7. Write a 2 x 2 polynomial product

S (x, y) -= H (x, y)G(x, y), (mod X2 ± 1)(mod y2 + 1).

Page 161: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)
Page 162: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

8 Multiplicative Fourier Transform Algorithm

The Cooley-Tukey FFT algorithm and its variants depend upon the exis-tence of nontrivial divisors of the transform size N . These algorithms are called additive algorithms since they rely on the subgroups of the addi-tive group structure of the indexing set. A second approach to the design of FT algorithms depends on the multiplicative structure of the indexing set. We applied the multiplicative structure previously, in chapter 5, in the derivation of the Good-Thomas PFA.

In the following chapters, a more extensive application of multiplicative structure will be required. The first breakthrough was due to Rader [1] in 1968, who observed that a p-point Fourier transform could be computed by a (p — 1)—point cyclic convolution. Winograd [2, 3] generalized Rader's results to include the case of transform size N = , p a prime. Combined with Winograd's cyclic convolution algorithms, these methods lead to the Winograd Small FT Algorithm which we will derive in detail in chapter 9.

In tables 8.1 and 8.2, taken from Temperton [4], we compare the number of real additions and real multiplications required by conventional methods and by the Winograd methods. For the transform sizes included in tables 8.1 and 8.2, Winograd's algorithm requires substantially fewer multiplica-tions at the cost of a few extra additions. However, as the transform size increases, although the Winograd algorithm continues to maintain its ad-vantage in multiplications, the price in additions becomes higher. This is to be expected since these algorithms depend on cyclic convolution algo-rithms. Standing alone, the Winograd small FT algorithms (WSFTA) are practical only for a collection of small size transforms. In tandem with the Good-Thomas PFA, the Winograd algorithms can be effectively used to

Page 163: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

148 8. Multiplicative Fourier Transform Algorithm

handle medium and some large size transforms. The Winograd Large FT algorithm, [5] is based on a method of nesting the WSFTAs in the Good-Thomas PFA. This results in algorithms that minimize multiplications. This nesting technique can be described using tensor product formula-tion. Suppose that N = RS, where R and S are relatively prime. By the Good-Thomas algoritlun,

F(N) = P(F(R) F(S))Q, (8.1)

where P and Q are permutation matrices defined by the Good-Thomas method. A WSFTA for an R-point FT has the form

F(R) = (8.2)

In the cases under consideration, Ai and Ci are matrices of zeros and ones, and Bi is a diagonal matrix whose entries are either real or purely imaginary. In the same way, we can write

F(S) = C2B2A2.

The dimension of Bi in (8.2) is in general greater than R, and consequently Ai and Ci are not square matrices. Using the tensor product formula,

(A 0 B)(C D) = (AC) 0 (BD), (8.3)

we can place (8.2) and (8.3) in (8.1) and write the N-point Fourier transform matrix as

F(N)= PCBAQ, (8.4)

where C and A are matrices of zeros and ones:

c = ® c2,

A = Ai 0 A2;

B is a diagonal matrix with real or purely imaginary entries on its diagonal

B = Bi 0 B2,

and P and Q are permutation matrices. The number of multiplications required to compute an R-point FT by

(8.2) is the dimension m(R) of the diagonal matrix Bi. It follpys that the number of multiplications required to compffte an N-point FT is

m,(N) = m,(R)m(S),

the dimension of the matrix B.

Page 164: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

8. Multiplicative Fourier Transform Algorithm 149

If a(R) denotes the number of additions required to compute an R-point FT by (8.2), then the number of additions required to compute the N-point FT by (8.4) is

a(N) = Ra(S)+ m(S)a(R).

Kolba and Parlcs [6] implemented the Good-Thomas algorithm by direct computation of each small FT using the Winograd FFT without nesting. In this case, the number of multiplications and additions are given respectively by

m(N) = Sm(R)+ Rm(S),

a(N) = Sa(R)+ Ra(S).

In general, R < m(R) and S < m(S),

which imply that the Kolba-Parks approach has the advantage when it comes to additions. However, in most cases, Winograd's approach has the advantage when measured by multiplications. Tables 8.3-8.6, also taken from Temperton [4], compare the conventional approach, the Good-Thomas approach where conventional methods are used on the factors, the Kolba-Parks approach and the Winograd approach. As can be seen from tables 8.4-8.6, Winograd's technique offers substantial savings in multi-plications relative to all other methods. However, it is the least efficient with respect to additions. In all cases, we see that additions dominate multiplications. Temperton [7] argues for the Good-Thomas approach with conventional computation on factors on computers such as CRAY, where additions and multiplications are performed simultaneously. On these com-puters multiplications are 'free' in the sense that they are carried out while the more numerous additions are being performed. Other implementation considerations are discussed in [8, 9].

In the following chapters, we will present a class of algorithms [10-15], that combine features of all of these algorithms. Tensor product rules are used throughout. The fundamental factorization has the form

F(N) -= PCAP-1, (8.5)

where P is a permutation matrix, A is a preadditions matrix with all of its entries being 0, 1 or —1 and C is a block-diagonal matrix having skew-circulant blocks (rotated Winograd cores) and a tensor product of such blocks.

We have implemented these algorithms and their variants on the Micro VAX II. For a large collection of transform sizes, these algorithms require roughly half the run-time of comparable programs in Digital's Scientific Library (LabStar Version 1.1). We will see this in tables 8.7 and 8.8.

Although the fundamental factorization and its variants are highly struc-tured and uniform, a direct attack on the programming of matrix A is

Page 165: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

150 8. Multiplicative Fourier Transform Algorithm

complicated. There is no apparent looping structure. However, as discussed in [14], the programming effort can be greatly simplified and automated by the use of macros and production rules that take advantage of the 'local' structure of the preadditions.

For all of the transform sizes listed in tables 8.7 and 8.8, we used the Variant 1 form of the fundamental factorization, as described in the follow-ing chapters. The main programs are written in Fortran, and they call the small DFT subroutines written in assembly.

From tables 8.7 and 8.8, we can see that many transform sizes not in-cluded in Lab Star have been programmed. Timings for most sizes are significantly better than the nearest size Cooley-Tukey algorithm.

Tables of Arithmetic Counts and Timing

R.A.— the number of real additions. R.M.— the number of real multiplications.

Table 8.1 Conventional methods.

Sizes R.A. R.M. 2 4 0 3 12 4 4 16 0 5 32 12 7 60 36 8 52 4 9 80 40

16 144 24

Table 8.2 Winograd.

Sizes R.A. R.M. 2 4 0 3 12 4 4 16 0 5 34 10 7 72 16 8 52 4 9 88 — 20

16 148 20

Page 166: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Tables of Arithmetic Counts and Timing

'able 8 3 Conventional methods.

7,es R.A. R.M. )5 2272 1492 )8 2018 1012 12 2162 1188 20 2302 1116 26 2684 1672 28 2242 900 10 5322 2708 52 5954 2500 56 5122 2050 15 8492 5728 20 7202 3396

Table 8.4 Good-Thomas

7.es R.A. R.M. )5 1992 932 12 1968 744 20 2028 508 26 2452 1208 10 4656 1256 52 5408 2416 15 7516 3776

Table 8.5 Kolba-Parks.

zes R.A. R.M. )5 2214 590 12 2188 396 20 2076 460 26 2780 568 10 4812 1100 52 6064 1136 15 8462 2050

Page 167: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

152 8. Multiplicative Fourier Transform Algorithm

Table 8.6 Winograd.

Sizes R.A. R.M. 105 2418 322 112 2332 308 120 2076 276 126 3068 392 240 5016 632 252 6640 784 315 10406 1186

Table 8.7 Timing comparisons (pq and pqr cases).

8

15

16

23

3 x 5

24

1.13 ms.

1.49 ms.

2.87 ms.

21 3 x 7 2.23 ins.

32 25 5.78 ms.

33 3 x 11 4.35 ins.

35 5 x 7 4.22 ms.

39 3 x 13 4.78 ms.

51 3 x 17 6.97 ms.

64 26 12.49 ms.

93 3 x 31 16.03 ms.

105 3 x 5 x 7 16.01 ms.

128 27 27.80 ms.

ms.= 10-3 second.

Size Factors pq(pqr) Dec Lab Star

--_,

Page 168: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

References 153

Table 8.8 Timing Comparisons (4p, 4pq and p2q).

Size Factors 4p(4pq) Dec Lab Star

8 23 1.49 ms.

12

16

4 x 3

24

0.415 ms.

2.87 ms.

20 4 x 5 0.985 ms.

28 4 x 7 2.92 ms.

32 25 5.78 ms.

44 4 x 11 5.86 ms.

45 32 x 5 6.52 ms.

52 4 x 13 6.53 ms. ,

60 4 x 3 x 5 6.39 ins.

64 26 12.49 ms.

68 4 x 17 9.27 ms.

124 4 x 31 20.42 ms.

128 27 27.80 ms.

ms.= 10-3 second.

References

[1] Rader, C. M. "Discrete Fourier Transforms When the Number of Data Samples Is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.

[2] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA., 73(4), April 1976, pp. 1005-1006.

[3] Winograd, S. "On Computing the Discrete Fourier Transform", Math. of Computation, 32(141), Jan. 1978, pp. 175-199.

Page 169: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

154 Chapter 8. Multiplicative Fourier Transform Algorithm

[4] Temperton, C. "A Note on Prime Factor FFT Algorithms", J. Comp. Phys., 52,1983, pp. 198-204.

[5] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapter 8.

[6] Kolba, D. P. and Parks, T. W. "Prime Factor FFT Algorithm Using High Speed Convolution", IEEE Trans. Acoust., Speech and Signal Proc., 25,1977, pp. 281-294.

[7] Temperton, C. "Implementation of Prime Fa.ctor FFT Algorithm on Cray-1", to be published.

[8] Agarwal, R. C. and Cooley, J. W. "Fourier Transform and Convo-lution Subroutines for the IBM 3090 Vector Facility", IBM J. Res. Devel., 30, Mar. 1986, pp. 145-162.

[9] Agarwal, R. C. and Cooley, J. W. "Vectorized Mixed Radix Discrete Fourier Transform Algorithms", IEEE Proc., 75(9), Sep. 1987.

[10] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988.

[11] Lu, C. Fast Fourier Transform Algorithms For Special N's and The Implementations On VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.

[12] Tolimieri, R., Lu, C. and Johnson, W. R. "Modified Winograd FFT Algorithm and Its Variants for Transform Size N = pi' and Their Implementations", accepted for publication by Advances in Applied Mathematics.

[13] Lu, C. and Tolimieri, R. "Extension of Winograd Multiplicative Algo-rithm to Transform Size N = p2q, p2qr and Their Implementation", Proc. ICASSP 89, Scotland, May 22-26.

[14] Gertner, I. "A New Efficient Algorithm to Compute the Two- Di-mensional Discrete Fourier Transform", IEEE Trans. Acoust., Speech and Signal Proc., 36(7), July 1988.

We.

Page 170: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

MFTA: The Prime Case

9.1 The Field Zlp

For transform size p, p a prime, Rader [1] developed an FT algorithm based on the multiplicative structure of the indexing set. The main idea is as follows. For a prime p, Zlp is a field and the unit group U(p) is cyclic. Reordering input and output data relative to a generator of U(p), the p-point FT becomes essentially a (p —1) x (p —1) skew-circulant matrix action. We require 2(p — 1) additions to make this change. Rader com-putes this skew-circulant action by the convolution theorem that returns the computation to an FT computation. Since the size (p — 1) is a com-

`posite number, the (p — 1)-point FT can be implemented by Cooley-Tukey FFT algorithms. The Winograd algorithm for small convolutions also can be applied to the skew-circulant action. (See problems 3, 4 and 5 for basic properties of skew-circulant matrices.)

Example 9.1 If p = 3, then U(3) has the unique generator 2.

Example 9.2 If p =-. 5, then U(5) has two generators, 2 and 3. We can order U(5) by consecutive powers of 2,

1, 2, 4, 3,

or by consecutive powers of 3,

1, 3, 4, 2.

Page 171: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

156 9. MFTA: The Prime Case

In the following table, we give generators z corresponding to several odd primes.

Table 9.1 Generator of U(p).

Size I 3 I 5 7 I 11 13 [ 17 Generator I 212 31 2 2 I 6

Choose a generator z of U (p). The order of U (p) is p - 1. We will reorder the indexing set according to successive powers of the generator z as follows:

0, 1, Z, Z2, ... ZP-2,

with zP-1 =- 1 mod p. We call this ordering the exponential ordering based on z. The relationship between the canonical ordering and the exponential ordering is given by the indexing set permutation

(k) = {13;c_i k = z , 1 < k < p,

which we call the exponential permutation based on z.

Example 9.3 Relative to the generator 2 of U(5), the exponential permutation is

Ir -= ( 0 1 2 4 3 )

Exarnple 9.4 Relative to the generator 3 of U(7), the exponential permutation is

7r = ( 0 1 3 2 6 4 5 ).

A useful fact in building algorithms is

(p-1)/2 = -1,

which implies that the exponential ordering based on z has the form

1)/2 , _1 , -z , , - z(P- ') /2 0 , 1 , z , . . . , z(P-

In general, if 7r is a permutation of Z/N and P(7r) is the corresponding N x N permutation matrix

xir(o) x.„(i)

P(r)x =

X.ff (N-1) °:

then the matrix F, satisfying F(p) = .13-1(7r)F,P(7r) is given by

= [w7r(k)ir(i)] w = e276/N 0<j, k<N

Page 172: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.2 The Fundamental Factorization 157

9.2 The Fundamental Factorization

We assume throughout this chapter that a generator z of U(p) has been specified, and that 7r is the exponential permutation based on z. We will design algorithms computing p-point FT based on reordering input and out-put data by the exponential ordering. In matrix formulation this amounts to permuting input and output data by the permutation matrix P cor- responding to the exponential permutation Explicitly, P is the p x p permutation matrix defined by the formula

Px = y,

where yo = so and

Yk = Xyk-1, 1 < k < p.

Consider the matrix F„ defined by F(p) = P-1F,P. Set v = e21"/P Since

1, /=Oork=0, v

71-(1)7r(k)

= vz

1+k-1, 1 < 1, k < p,

we have _11

= . : C (p)

_1

where C(p) is the (p — 1) x (p — 1) skew-circulant matrix

CO)) [Vzi+k ] 0<l,k<p-1"

We call F, the FT matrix and C(p) the Winograd core based on the gener-ator z of U(p). Unless otherwise specified, we will assume throughout that a generator has been chosen and suppress the dependence of Fir and C(p) on the choice of generator.

Example 9.5 The Winograd core based on the generator 2 of U(3) is

c,(3) = v., v21 , v = e27,i/3

Lvz v

Example 9.6 The Winograd core based on the generator 2 of U(5) is r V2 V4 V3

V2 V4 V3 V C(5) = 21, v = 6

27ri /5 V4 V3 V V

V3 V V2 V4

Page 173: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

158 9. MFTA: The Prime Case

For any vector x of size p, denote by x' the vector of size p — 1 formed by deleting the 0-th component of x. The action of Fir can be computed by the following two formulas. If y = Firx, then

p—i

Yo = E x,n, m=0

= C(p)x'

where lp_i is the vector of size p — 1 of all ones.

Example 9.7 Based on the generator 2 of U(5), we can compute y = Firx by the formulas

4

Yo = E xn„ m=0

2 4 3 yy vV2 Vv 4 Vv 3 Vv Xx 0

o V e21-z/5

• Y3 V4 V3 V V2 X3 X0

Y4 V3 V V2 V4 X4 X0

Up to the permutation of input and output data by the permutation ma-trix P, the action of F, computes the action of F(p). Since the permutation matrix P has the form

1 0 P = [ 0 QI'

where Q is a permutation matrix, we have

F(p) = 11. 1 [

1 Q-1C(P)Q 1-

_.

Set p' = p — 1 and define the p x p matrix

A(p) -=-

Observe that A(p)

Example 9.8

1 1 1 —1

• I ,

—1

does not depend on

1 1 A(3) = —1 1

—1 0

1 lpt, =-

1.2,,

r.

1 01 . 1

Page 174: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.2 The Fundamental Factorization 159

Example 9.9 -11111

—11000 A(5) = —1 0 1 0 0 .

—10010 _-10001

Example 9.10 Consider the Winograd core C(5) of example 9.6. Since

± v3 ± v4 = 0, v = e27ri/5, ± V ± V2

we have C(5)14 = — 14,

implying that 10 0 00- 0

F, =[0 C (5) A(5). 0 0 _

Using the matrix direct sum notation, we can rewrite this as

= (1ED C(5))A(5),

which leads to the computation of y =- Firx by the following steps:

• Compute

ao = xo + xi + x2 + x3 + x4,

• = — xo,

a2 = x2 — xo,

a3 = x3 — xo,

• = X4 — Xo.

• Compute Yo = ao,

Y2 = C(5) a21 a3

YY43 a4

The results of example 9.10 hold in general. The main fact we need is that, for v = e2"t/P , we have

p-2

Vz = 0, (9.1) k=0

Page 175: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

160 9. MFTA: The Prime Case

which implies that C(p)lp, = -1p, (9.2)

and the following theorem.

Theorem 9.1 F, = e C(p))A(p),

where C(p) is the Winograd core and

r 1 11; , 1. A(p) -

1P'

- I ,

The factorization given in the theorem is called the fundamental factorization. It computes the action of F, in two stages:

• An additive stage describe by the matrix A(p).

• A multiplicative stage described by the Winograd core C(p).

Table 9.2 F, = [1 ED C(p)1A(p) direct method.

Factor R.A. R.M.

A(p) 4(p - 1) 0

C (p) 2(p -1)(2p - 3) 4(p - 1)2

F, 2(p -1)(2p - 1) 4(p - 1)2

Table 9.3 Arithmetic count of C(p): Direct method.

Factor R.A. R.M.

5 56 64

7 132 144

11 380 400

13 552 576

17 992 1024

R.A. - the number of real additions. R.M. - the number of real multiplications.

Page 176: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.2 The Fundamental Factorization 161

The additive stage requires 2p' additions. In the next section, several implementations of the multiplicative stage will be given that have vari-ous arithmetic counts. We have called this stage the multiplicative stage since, by the convolution theorem, the skew-circulant matrix C (p) can be diagonalized. (See problems 4 and 5.)

Table 9.4 Arithmetic count of C(p): Convolution theorem.

Factor R.A. R.M.

5 38 12

7 82 36

11 202 76

13 214 76

17 326 100

R.A. - the number of real additions. R.M. - the number of real multiplications.

Taking transpose on both sides of the fiindamental factorization, we have

= At (p)(1 ED C(p)). (9.3)

The multiplicative stage now c,omes before the additive stage. A second algorithm computing

y = F,x

will now be given. Set E(p) = lp,

the p x p matrix of all ones. Form the matrix

- E(p),

which can be written as

F, - E(p) = 0 ED W(p),

where W(p) is the pi x skew-circulant matrix given by

w (P) = (P) - E(P').

The computation becomes

y = (F, - E(p))x E(p)x.

Page 177: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

162 9. MFTA: The Prime Case

We see that E(p)x = yol,

can be computed using p' additions. The computation is arranged in two stages:

p-1 • YO = j=0 X j.

• y' = W(p)x'

As in the preceding approach, we require 2p' additions and the action of the p' x p' skew-circulant matrix W (p).

Example 9.11 Based on the generator 2 of U(5),

V — 1 V2 — 1 V4 — 1 V3 — 1 2 1 4 i

V — i V — V -- 3 1 V — 1

W(P) = V4 — 1 V3 — 1 V — 1 V2 — 1 ' [

3 1 1 V2 — 1 V4 — 1

V — V —

V = e272/5

The computation of y = F, x can be carried out by

4

YO = E Xm rn=0

Y1 — YO X1

Y2 — YO = W(P) [X2

Y3 — Yo X3

Y4 — YO X4

9.3 Rader's Algorithm

For a prime p, consider the fundamental factorization

F, = (1 (131 C(p))A(p).

Throughout this section, set p' = p-1. Unless otherwise specified, additions and multiplications mean complex additions and complex multiplications. Every addition is equivalent to two real additions. There are several ways of computing multiplications. We will assume that direct methods are used, so that every multiplication requires two real additions and four real mul-tiplications. In this section and the next, we will design variants of the fundamental factorization that reduce multiplications or change the balance between the multiplications and additions.

By the convolution theorem, C(p) can be diagtmalized by

D(p) = F(p')-1C(p)F(p')-1

Placing this result into the fundamental factorization, we have the following result.

Page 178: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.4 Reducing Additions 163

Theorem 9.2 (Rader FFT I)

= e F(p'))(1 G D(p))(1 F(p'))A(p).

Up to the preaddition stage, A(p), and the diagonal multiplication stage, 1 ED D(p), we can compute F, and F(p) by two pi-point FTs.

9.4 Reducing Additions

Diagonalizing the Winograd core C(p) reduces the number of multiplica-tions needed to carry out the computation. This is an important, but not the only, consideration even when small computers that have fast adders and slow multipliers are used. Data flow can have a great effect on the actual time cost of carrying out a computation. However, measuring the efficiency of a given data flow is extremely machine-dependent, and beyond the scope of this work.

On some larger machines, the speed of additions is nearly equal to that of multiplications. In this case, algorithms that balance between the number of additions and the number of multiplications should be most efficient.

Throughout this section, set pi = p — 1 and denote by e the vector of size p' having 1 in the 0-th component and 0 in all other components. Define the p x p matrix B(p) by

B(p)(1@ F(p')) = (1 ED F(0)A(p).

Since F(p')e =

we have B(29) _= r et 1

I_ —11 e j Theorem 9.3 (Ftader FFT II)

= (1 ED F(p'))(1 e D(p))B(p)(1 e F(p')),

where 1 et B(P) = [ _pie 1 •

Rader FFT II is symmetric in the sense that 1 ED F(p') initiates and completes the computation of F,.

Example 9.12 11000

—41000 B(5)= 00100 .

00010 _ 00001_

Page 179: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

164 9. MFTA: The Prime Case

Computing the action of B(p) requires two additions and an integer multiplication by —p', which should be compared with the 2p' additions required for the action of A(p). The additions have been dramatically re-duced with one extra multiplication as the trade-off. Another important fact is that the arithmetic of B(p) is independent of p.

The second approach for reducing additions comes from the special form of the Winograd core C(p).

Example 9.13 Take p = 7. The matrix C(7) has the form

X(7) X* (7)1. C(7) = [x.(7) x(7)

Based on the generator 3, we have

[ V V3 V2

X(7) = v3 v2 v , v = e21'417 V2 V V3

A straightforward computation of the action of C(7) requires 132 real additions and 144 real multiplications.

A partial diagonalization method or block diagonalization m,ethod will be used to replace complex arithmetic by purely real arithmetic. Set

Y(7) = (F(2) 0 /3)-1C(7)(F(2) 0 /3)-1

Y(7) is a block diagonal matrix consisting of two blocks, one with purely real coefficients and one with purely imaginary coefficients. A direct computation shows that

y(7) = [ X(7) + X* (7) 0 2 I_ 0 X(7) — X* (7)] •

Placing this result into the fundamental factorization, we have

Fir = (1ED (F(2) 0 /3))(1 Y(7))(1 e (F(2) 0 /3))A(7).

The matrix X(7) + X(7)

has only real entries. Multiplication of a real number and a complex number requires no real additions and two real multiplications. It follows that the action of X(7)+X* (7) requires 18 real multipliotions and 12 reatadditions. The matrix

X(7) — X* (7)

has only purely imaginary entries. We assume that multiplication by i requires no addition or multiplication. The arithmetic of X(7) — X* (7) is

Page 180: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.4 Reducing Additions 165

equivalent to that of X(7)+X* (7). It follows that the action of X(7)—X*(7) requires 18 real multiplications and 12 additions.

The computation of the action of Fir is decomposed into a preaddition stage given by A(7), a two-point FT stage given by 1 ED (F(2) 0 h), an essentially real multiplication stage given by lED Y(7) and a final two-point FT stage given by 1 ED (F(2) 0 /3). Computing C(7) by this method should be compared to the p = 7 case of table 9.4.

Table 9.5 C(7) = (F(2)

Factor

I3)Y (7)(F(2) h).

R.A. R.M.

X + X* 12 18

X — X* 12 18

Y 24 36

F(2) 0 h 12 0

C 48 36

R.A. — the number of real additions. R.M. — the number of real multiplications.

The general case follows in the same way. C(p) has the form

r x(p) x.(p)i (9.4) c(p) = pc.(p) x(P)

The partial diagonalization method leads to the next result.

Theorem 9.4 (Rader FFT III)

F, = e (F(2) 0 /p72))(1 e Y(p))(1 ED (F(2) 0 /p72))A(p),

where Y(p) is the block diagonal matrix

1 Y(P) [(X(P) + X* (A) (X(P) — X* (PM.

As in the example, X(p) + X* (p) has only real entries and X(p)— X*(p) has only imaginary entries. The action of Y(p) requires only real arithmetic.

Denote by f the vector of size pi whose first p'/2 components are 1 and second p'/2 components are O. Define the p x p matrix Bi (p) by

(p)(1 e (F(2) 0 Ipy2)) = (1 6 (F(2) 0 Ipv2))A(p).

A direct computation shows that

r ft BI(P)— I_ —2f rp‘ •

Page 181: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

166 9. MFTA: The Prime Case

In general, the action of Bi(p) requires p' additions.

Table 9.6 Rader FFT III.

Factor R.A. R.M.

X(p) + X*(P) (p - 1)(p - 3)/2 (p - 1)2 12

X (p) - X* (p) (p - 1)(p - 3)/2 (p - 1)2 12

(P) (P - 1)(P - 3) (p - 1)2

F(2) 0 /p72 2(p - 1) 0

C(p) (p - 1)(p + 1) (p - 1)2

(P 1)(23 + 5) (p - 1)2

Table 9.7 Rader FFT III.

Factor R.A. R.M.

5 40 16

7 72 36

11 160 100

13 216 144

17 352 256

19 432 324

R.A. - the number of real additions. R.M. - the number of real multiplications.

Theorem 9.5 (Rader FFT IV)

F, = (1 e (F(2)0 ipy2))(1 e Y(p))Bi(p)(1 e (F(2) ® Ipv2)),

where r ft 1 Bl(P) = I_ -2f •

As in Rader FFT II, the factorization is symmetric.

Page 182: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.5 Winograd Small FT Algorithm

167

Table 9.8 Rader FFT IV.

Factor R.A. R.M.

Bi (P) 2(p - 1) (p - 1)

F, (p - 1)(p + 3) P(P - 1)

Table 9.9 Rader FFT IV.

Fa.ctor R.A. R.M.

5 32 20

7 60 42

11 140 110

13 192 156

17 320 272

19 396 342

R.A. - the number of real additions. R.M. - the number of real multiplications.

9.5 Winograd Small FT Algorithm

The action of the Winograd core C(p) in the fundamental factorization also can be computed by the Winograd small convolution algorithm. (See problems 4 and 5.) Recall that the Winograd factorization for a skew-circulant matrix C(p) has the form

C(p) =-- LGM, (9.5)

where L and M are matrices of small integers and G is a diagonal matrix. The matrices L and M are generally not square matrices.

Example 9.14 Consider the five-point FT matrix factorization.

[1 0 0 0 0 0

F„ = 0 C(5) A(5), 0 0

Page 183: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

168 9. MFTA: The Prime Case

where [ v v2 V4 V3

c(5) = v24 v34 v3 v2 ,

V = e(21"/5). V V V V

V3 V V2 V4

Several Winograd factorizations of C(5) have been derived in chapter 5. For example, we have

C(5) = LGM,

where 1 0-11 01

1 —1 1 1 —1 1 L =

—1 0 11 01' —1 1 —1 1 —1 1

- 1 0 —1 0 -

1 —1 —1 1

0 1 0 —1 M =

1 0 1 0 '

1 —1 1 —1

_ 0 1 0 1 _

and G = diag(g), where

V — V4

V V — V4 — (V3 — V2)

1 V3 1 3 2 V — V

g = M v4 = —2 • v v4 V2 V + V4 — (V3 ± V2)

V3 + V2

Then 1 0] [1 0] [1 0 A(\

Flr—[OL OG 0/1/`"'

In general, if (9.5) is the Winograd factorization of the Winograd core C(p), then we have the factorization

F, = M' ,

where G' is the diagonal matrix

• = G,

and L' and M' are matrices of small integers,

• = 1 e L,

/1/1' = (1 M)A(p).

Page 184: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

9.6 Summary 169

Since F(p) P-1F,P = P-1L'G'111P,

by setting A = Af' P,

B = G',

and c = P-1L',

we have F(p) = CBA, (9.6)

which has the same form as the Winograd small FFT algorithm. This form was given in chapter 8. The computation of (9.6) can be carried out in three stages: The first stage is the preaddition stage given by matrix A, the second stage is the multiplication stage given by the diagonal matrix B, and the last stage is the postaddition given by matrix C.

The Winograd algorithm increases the number of additions but greatly reduces the number of multiplications.

9.6 Summary

For a generator z of U(p), define the matrix F, by

F(p) P-1F„P,

where 7r is the exponential permutation based on z and P is the permutation matrix corresponding to 71". F„ describes the p-point FT on input and output data ordered exponentially by z.

Set pi = p — 1 and v = e27"/P. Define the preaddition matrices

1 lt

A(p) =[ I

[ et B(p)=

—pie /p, '

Bi(p) ftl —2f '

where e and f are vectors of size pi, defined in section 4. Define the Winograd core C(p) as the skew-circulant matrix with the

0-th row p-2

V, VZ, VZ

Page 185: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

170 9. MFTA: The Prime Case

and observe that C(p) has the form

r x(p) x.(p) 1. c(P) = x.(p) x(p)

Define the diagonal and block diagonal matrices

D(p) = F (0-.1 C(p)F (0-1

1 [ X(p) + X* (P) 0 YU)) = -2 0 X (p) - X* (p)]

Fundamental factorization

F, = ( 1 ED C(p))A(p).

Rader FFT I

• = (1 F(p'))(1 D(p))(1@ F(p'))A(p).

Rader FFT II

• = (1 e F(p'))(1 ED D(p))B(p)(1 ED F (0).

Rader FFT III

• = (1 031 (F(2) 0 I p72))(1 ED Y(p))(1 ED (F(2) 0 2))A(p).

Rader FFT IV

• = e (F(2) ® ipy2))(1 e Y(P))B4)(1 e (F(2) ® 472)).

Winograd Algorithm By the Winograd small convolution algorithm, we have

C(p)= LGM,

where L and M are matrices of small integers and G is a diagonal matrix, and we have

= (1 ED L)(1 ED G)(1 ED M)A(p).

The implementation of these algorithms has been carried out on the Micro VAX II. A major issue, apart from run-time, is simplicity of code generation. In particular, for the Micro VAX, Wader FFTs I and II appear to have the best structure for programming. However, for implementation on computers such as the CRAY X-MP and IBM 3090, where multiplications are tied to additions, an arithmetically balanced algorithm is preferred. In this case, Rader FFT IV is the best.

Page 186: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 171

References

[1] Rader, C. M. "Discrete Fourier Transforms When the Number of Data Samples Is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.

[2] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA, 73(4), April 1976, pp. 1005-1006.

[3] Winograd, S. "On Computing the Discrete Fourier Transform" , Math. Cornput., 32, Jan. 1978, pp. 175-199.

[4] Blahut, R. Fast Algorithms for Digital Signal Processing, Addison-Wesley Pub. Co., 1985, Chapter 4.

[5] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988, Chapter 5.

Problems

1. Show that 3, 2, 2 and 6 are generators of the unit group U(7), U(11), U(13) and U(17), respectively.

2. Find generators for the unit group U(7), U(11), U(13) and U(17) that are different from the ones given in problem 1.

An N x N matrix S is skew-circulant if S,,j = for i + j = k + l mod N.

3. Show that for a skew-circulant matrix S, = S.

4. Show that C is a circulant matrix if and only if S = CR is a skew-circulant matrix, where R is the time-reversal matrix.

5. Use problem 3 to show that FSF is a diagonal matrix whenever S is a skew-circulant matrix and F is the FT matrix.

6. Order ZI7, Z/11, Z/13 and Z/17 by the powers of the generators given in problem 1.

7. Write the Winograd core corresponding to the generator z = 3

for U(7).

8. Write the Winograd core corresponding to the generator z = 5 for U(7), and observe the difference with problem 7.

Page 187: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

172 9. MFTA: The Prime Case

9. F(7) can be written as

1 1

Q -1 C(7)Q

Determine the permutation matrix Q.

10. Verify table 9.4.

11. Derive the Rader algorithm for an 11-point Fourier transform.

12. Derive the Good-Thomas algorithm for F(10) in terms of F(2) and F(5). Using this derivation, compute the C(11) Winograd core.

13. If a given computer has the CPUTIME ratio as one real multiplication per ten real additions, what is the threshold size p for which we would choose Rader FFT II to compute F(p) instead of Rader FFT I.

14. Prove the formulas given in table 9.3.

15. What is the basic difference between the Winograd algorithm and the algorithms derived in sections 2, 3 and 4?

16. Give the arithmetic counts for the five-point Winograd algorithm.

17. Derive the Winograd algorithm for F(3).

Page 188: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10 MFTA: Product of Two Distinct Primes

10.1 Basic Algebra

The results of chapter 9 will now be extended to the case of a transform size that is a product of two distinct primes. As mentioned in the general introduction to multiplicative FT algorithms, several approaches exist for combining small size algorithms into medium or large size FT algorithms by the Good-Thomas FFT. The advantage of using the Good-Thomas FFT is that tensor product rules directly construct multiplicative FT al-gorithms for appropriate composite size ca.ses. The method is completely algebraic and results in composite size algorithms whose factors contain tensor products of prime size fa,ctors. However, these results are not totally appealing since complex permutations appear. A related problem is that tensor products are taken over direct sum factors.

In the following two chapters, we will derive multiplicative composite size FT algorithms based directly on the CRT ring-isomorphism. This approach will necessarily repeat some of the constructions used in deriving the Good-Thomas FFT, but will result in multiplicative composite size FT algorithms that more naturally extend the prime size cases.

Our approach emphasizes and is motivated by the results of chapter 9. By employing tensor product rules, we derive the fundamental factorization

F, = CA,

Page 189: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

174 10. MFTA: Product of Two Distinct Primes

where C is a block diagonal matrix having skew-circulant blocks (rotated Winograd cores) and tensor products of these skew-circulant blocks, and A is a matrix of preadditions. Variants will then be derived.

Talce N = pq, where p and q are distinct primes, and consider the ring ZIN. Throughout this section, set p' = p — 1 and q' = q— 1. The unit group of ZIN,

U(N)= {a E ZIN:(a,n)=1},

is not a cyclic group. To determine the structure of U(N), we will use the CRT. Throughout we assume that generators zi and z2 of U(p) and U(q) have been specified and suppress the dependence of the Winograd cores C(p) and C(q) on these generators.

Consider the complete system of idempotents lei, e2} for the factoriza-tion N = pq. In chapter 5, in the derivation of the Good-Thomas PFA, we defined the ring-isomorphism

cb:ZIpx Zlq'-' ZIN

by the formula

0(ai,a2) = aiei + a2e2 mod N, 0 < ai < p,0 < a2 < q.

The ring-direct product ZIpx Zig is taken with respect to coordinatewise addition and multiplication. The ring-isomorphism 0 restricts to a group-isomorphism

q5:U(p) x U(q) '--' U(N).

Since U(p) is a cyclic group, U(N) is the direct product of cyclic groups of order p' and q', and every u E U(N) can be written uniquely as

u a- ziei + 4e2 mod N, 0 < l < pi , 0 < k < q' .

We order U(N) by taking k to be the faster running parameter. An element a E ZiN that is not a unit is called a zero divisor. The set

eiU(N) = fziiei : 0 < / < Pi}

consists of all elements in ZIN that are divisible by q but not p, and the set

e2U(N) = {4e2:0 < k < ql

consists of all elements in Z/N that are divisible by p but not q. The ordering of U(N) induces an ordering of tbe sets eiU(N) and e2U(N). Order the set Z IN by the permutation

7r = (0, eiU(N), e2U(N), U(N)).

We call 7r the exponential perm,utation based on the factorization N -= pq.

Page 190: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.2 Transform Size: 15 175

10.2 Transform Size: 15

Take p = 3, q = 5 and N = 15. In this case, the idempotents are

el = 10, e2 = 6,

and every element a E Z/15 can be written uniquely as

a al • 10 + a2 • 6 bmod 15, 0 < al < 3,0 < a2 < 5.

Take generators 2 of U(3) and 2 of U(5). Every element u E U(15) can be written as

u = 2/ • 10 + 2k 6, 0 < / < 2, 0 < k < 4.

U(15) is ordered as 1, 7, 4, 13, 11, 2, 14, 8,

and we have

10U(15) = {10,51, 6U(15) =16,12,9,31.

The exponential permutation based on 15 = 3 x 5 is'

7r = (0; 10, 5; 6, 12, 9, 3; 1, 7, 4, 13, 11, 2, 14, 8) .

We now proceed to describe the matrix

F, = [u7(3)71 , 0 < j , k < 15, w = e27"/15

Set U = W5 = e27"/3

V = W3 = e2wi/5)

and set v2 v4 v3 V

2 [ U2 U2 I 5 e5 V4 V3 V2 V4 C3 =

U U V3 V V V

V V2 V4 V3

The matrices C3 and C5 are rotated Winograd cores. C3 is formed by

replacing u by u2 in C(3) and C5 is formed by replacing v by v2 in C(5). We also can write

-13 o o

G= [o 0 oio 10] C(3), C5 =

0 0 0 1 ‘-'‘").

_10 0 0

Page 191: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

176 10. MFTA: Product of Two Distinct Primes

Direct computation shows that the bottom right-hand 8 x 8 submatrix of F„ is the tensor product

[U2C5 UC5 C3 0 C5 =

UC5 U2C5

Denote by 1, the vector of size m of all l's and by E(m, n) = 1,, 0 11 the m x n matrix of all l's. Set E(m) = E(m, m). We can rewrite F.„ as

1 lt4

12 C3 E(2, 4) C3 0 1t4 F =-

" 14 E(4, 2) C5 q 0 C5

18 C3 0 14 12 0 C5 C3 G

The highly structured form of F„, especially the repetition of C3 and C5 throughout the matrix, results from controlling data flow by the idempotents.

10.3 Fundamental Factorization: 15

We will now derive for F„ a factorization of the form

F.„ = C A, (10.1)

where A is a matrix of additions and C is a block diagonal matrix having skew-circulant blocks. This will generalize the prime transform size fac-torization of the same form derived in the preceding chapter. As in the preceding chapter, factorization (10.1) will be a spring board for several algorithms distinguished by arithmetic and data flow.

First,

C312 = —12, (10.2)

C514 — —14. (10.3)

The tensor product formula

(A B)(C D) = (AC) 0 (BD)

implies that C3(/2 0 lt4) = C3 0 1t4,

C5(1 014) = 0 C5.

Using (10.2) and (10.3) along with

E(m,n) = 1,, 0 ltn,

Page 192: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.3 Fundamental Factorization: 15 177

we have

C3E(2, 4) = -E(2, 4),

C5E(4, 2) = -E(4, 2),

(C3 0 C5)18 = 18,

and we can write F, = CA,

where C is the block diagonal matrix

C = 1 e C3 e C3 e (C3 ® C3),

and 1 l& lt 4 11

[ -12 h -E(2, 4) /2 0 1'4 A --

-14 -E(4,2) 1.4 1. 0 /4 .

18 -1.2 0 14 -12 0 /4 /8

We can relate the matrix A to the matrix of additions in chapter 9. Recall that

A(p) _ r 1 4, 1 L -1 , r , i . P P

We can rewrite A as

A = 1 A(3) A(3) 0 lt4 1 [ -A(3) 0 14 A(3) 0 /4 i '

Now let Q0 = P(12, 4), as in chapter 2, and Q = 13 ED Q0. Straightforward computation shows that

C2o(A(3) 0 /4)Q0 1 = /4 0 A(3),

(A(3) 0 lt4)Q(71 = lt4 0 A(3),

Ch(A(3) 0 14) = 14 0 A(3),

and we have QAQ-1 = A(5) 0 A(3),

proving the following result.

Theorem 10.1

F, = (i. e C3 e C5 e (c3 ® C5))Q-1(A(5) 0 A(3))Q,

where Q = /3 ED P(12, 4), and C3 and C5 are the rotated Winograd cores

U2 U 21-i/3 G = [ U U2 i ) U = e ,

Page 193: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

178 10. MFTA: Product of Two Distinct Prirnes

V2 V4 V3 V 4 3 2 e5 = V3 V V2 V4 , v = e2,0

[

V V V V

V V2 V4 V3

Table 10.1 Fir = CA, direct method (N = 15).

Factor R.A. R.M.

A 88 122}

228 272

F„ 316 272+122}

R.A. - the number of real additions M.A. - the number of real multiplication

10.4 Variants: 15

Variants of the above factorization will be designed in the spirit of chapter 9. First, by the convolution theorem,

D3 = F(2)-1C3F(2)-1,

D5 = F(4)-105F(4)-1

are diagonal matrices. Setting

F = 1 ® F(2) ED F(4) ED (F(2) F(4)),

D =1ED D3 IED D5 ED (D3 0 D5),

we have that D is a diagonal matrix and we can write

C = FDF,

proving the next result.

Theorem 10.2

= F(10) D3 ED D5 (D3 0 D5))FQ-1(A(5) 0 A(3))Q ,

where F is the FT factor

F = 1 ED F(2) ED F(4) ED (F(2) F(4))

and Q = /3 ED P(12,4).

Page 194: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.4 Variants: 15 179

As in chapter 9, variants can be designed to reduce additions. First,

F, = FDBF,

where B = FAF-1.

By interchanging the order of computation, the action of A is replaced by the action of B, which, as we will now show, reduces the additions at the expense of a few rational multiplications. Define, as in chapter 9,

11000- -41000

1 , B(5) --- 0 0 1 0 0 . 00010 00001_

For the purpose of this discussion, set

X = 1 031 F(2),

and observe that F = X IED (X 0 F(4)).

A straightforward computation shows that

XA(3) = B(3)X,

which is what we need to prove that

B = 1 B(3) B(3) 0 et 1 L -4B(3) 0 e B(3) 0 14 i '

where et=[1000].

We see that 40 real additions and 8 multiplications by small integers are needed to compute the action of B.

The action of B can also be computed, without changing the arithmetic, by the factorization

B = Q-1 (B(5) 0 B(3))Q,

where Q = 13 031 P(12, 4). We then have the next result.

Theorem 10.3

F7,- = F(1 ED D(3) ED D(5) ED (D(3) 0 D(5))Q-1(B(5) 0 B(3))QF,

where Q = 13 ED P(12,4) and F is the FT factor defined in the previous theorem.

B(3) = [ 1

-2 0

1 1 0

0 0 1

Page 195: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

180 10. MFTA: Product of Two Distinct Primes

C3 and C5 can be written in the form

C3 = x3 A;

,c; X3

X5 Xg` C5 = [ jq, x5 .

Setting 1 1

Y3 = -2(X3 + X3') -

2(X3 - X; ),

1 Y5 = -

2(X5 + X'5*) e -1 (X5 - XD,

2

we have the next result by Rader FFT III.

Theorem 10.4

F, = H(1 e Y3 ED Y5 Ef) (Y3 0 Y5))HQ-1(A(5) A(3))Q,

where Q = /3 ED P(12,4) and H is the FT factor

H = 1 ED F(2) ED (F(2) 0 /2) ED (F(2) 0 F(2) 0 /2).

The factor Y contains all of the multiplications. Reasoning as in the previous chapter, these are all real multiplications. Computing the action of Fir in this way requires 200 real additions and 68 real multiplications.

Table 10.2 C = HY H (N = 15).

Factor R.A. R.M.

H 44 0

Y 24 68

C 112 0

R.A. - the number of real additions M.A. - the number of real multiplication

The cost of additions can be reduced by computing F,

F„ = HY Bill,

where Bi = HAH - 1.

Page 196: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.5 Transform Size: pq 181

From chapter 9, recall the definitions

1 1 0 Bi(3) = [ —2 1 0 1

0 0 1 -

1 1 1 —2 1 0

Bi (5) = —2 0 1 0 0 0

_ 0 0 0

,

0 -

0 0 0 0 0 1 0 0 1_

By the usual tensor product manipulations, we have

Bi = Q-1 (Bi (5) 0 Bi (3))Q,

with Q = /3 ,E9 P(12,4).

Theorem 10.5

F, = Ho. e Y3 e Y5 ED (Y3 0 Y5))Q 101(5) 0 B1 (3))QH,

where Q =- /3 ED P(12,4) and H is the FT factor given in the previous theorem.

Table 10.3. F„ = HY 1511H (N = 15).

Factor R.A. R.M.

Bi 44 {111

F, 156 68 ±{111

R.A. — the number of real additions M.A. — the number of real multiplication

Multiplication by integers has been placed in brackets.

10.5 Transform Size: pq

In this section, algorithms for N -= 15 will be generalized to the case of N = pq, where p and q are distinct primes. Set U = U(N) and v = e27i/N . Throughout, zi and z2 are generators of U (p) and U(q), and we suppress the dependence of the Winograd cores on these generators. Set p' = p — 1 and q` = q— 1.

Denote by fel, e2}- the complete system of idempotents for the factoriza-tion N = pq. Partition the indexing set ZIN by the sets

0 U = { 0 } ,

Page 197: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

182 10. MFTA: Product of Two Distinct Primes

= {ztei I 0 < k <111,

e2U = {4e2 I 0 < < gib

U = {ztei + 4e2 I 0 < k < p',0 < g').

The permutation 7r = (0; eiU; e2U; U)

is the exponential permutation based on the factorization N = pg. Al-though the definition of 7r contains the idempotents ei and e2, since the idempotents are uniquely determined by the factorization, the definition of the exponential permutation is unambiguous once the generators zi and z2 are specified.

Consider the submatrix corresponding to the Cartesian product

eiU x eiU.

Since (eizn(eizri) = eIzik±r eizik±r mod N,

the submatrix of F, corresponding to ei U x ei U is the skew-circulant matrix

Cp= [(vel)4+r ], 0 < k, r < pi.

v" is a primitive p-th root of unity. In general, if u is any primitive p-th root of unity, then the matrix Cp(u) formed by replacing evi/P in C(p) by u is called the rotated Winograd core based on u. The matrix Cp is the rotated Winograd core based on v".

In the same way, the submatrix of F, corresponding to the Cartesian product e2U x e2U is the rotated Winograd core based on the primitive q-th root of unity ve2,

Cq = [ (v€2)4+' I, 0 < s < g'

Consider now the submatrix of F, corresponding to the Cartesian prod-uct U x U. A typical entry in this submatrix is given by raising v to the power

\ k+r i+s

(ztei + z2e2)(4ei + z2e2) = ei + z2 e2 mod N,

(v").44-'(ve2)4+'

Since / and s are faster running parameters, this submatrix can be decomposed into submatrices,

(Vel ) 1 CD 0 < k,r < ,

and the submatrix of F, corresponding to U x U is the tensor product

Cp ® Cq.

Page 198: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.6 Fundamental Factorization: pq 183

Similar arguments apply to the remaining submatrices. We summarize the results in the following description of F„:

1 ltp, lt

[

ltr, q' F = lp, Cp E(p' , q') Cp 0 ltq,

1,, E(q' ,p') C, l'p, 0 C, ' 1r, Cp 0 1,, lp, 0 C, Cp 0 Cq

where r' = p'q' . The highly structured form of the matrix is due in large part to the use of idempotents.

If we set 1 lt,

F,(p) = [ , P i , lp, Cp

then F7, = r F,(p) F„(p) 0 itg, 1

I_ F,(p) 01,, F„(p) 0 C, i '

leading to the next result.

Theorem 10.6 Suppose that 7r is the exponential permutation of Z I N based on the factorization N = pq. Then

1 F„(p) F„(p) 0 l'q, 1 F„ =

[ F,(p) 0 1,, F,(p) 0 C, i '

where [ 1 ltp, 1

F,(p) =- 1,, Cp i

with Cp the rotated Winograd core based on vel and C, the rotated Winograd core based on ve2 , v = e27ri/N .

Tensor product manipulation shows that

F, = (Ip e P(pq" ,p))(F,(p) 0 Fir (q))(Ip ED P(pq' ,p))-1

10.6 Fundamental Factorization: pq

The goal is to produce a factorization of the form

Fir = C A,

where A is a matrix of additions and C is a block diagonal matrix having skew-circulant blocks. The main ideas were given in the 15-point example. First, since the sum of the m-th roots of unity equals zero for any integer m > 1, we have

Cplyy = —1,,,

Page 199: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

184 10. MFTA: Product of Two Distinct Primes

C, =

As in chapter 9, we can write

F„(p) = ( 1 ED Cp)A(p),

F.„(q) = ( 1 ED COA(q).

Setting

and observing

we have

It follows that

C = 1 e Cp Cq e (Cp CO

Cq (Cp Cq) = ( 1 ED Cp ) ,

F,(p) ltq, = (1 ED Cp)(A(p) itq,)

F,(p) = —(C, (C, 0 C,))(A(p) lq,)

F„(p) Cq = (Cq e (cp 0 Cq))(A(p) 4).

F, = (1ED C, Cq ED (Cp Cq)) [ _A(pA(P) AA((pl 1.1 qtq: . (10.4)

The ring structure has naturally pointed the way to the highly structured form of the factorization of F,. As discussed in section 2, tensor products of the corresponding p-point and q-point algorithms directly lead to an arithmetically equivalent algorithm that has a different data flow. This is a constant theme throughout this section and the next.

Denote the matrix on the right-hand side of (10.4) by A. We can implement A as a tensor product,

A = Q-1(A(q) A(p))Q

where Q = Ip ED P(pq' , q'),

proving the next result.

Theorem 10.7 (Fundamental Factorization) If fel,e21 is the complete system of idempotents for the factorization

N = pq, then

F, (1 e cp c, e (cp o cq))Q-1(A(q) A(p))Q,

where Q = Ip ED P(pq' , q'), and C, and Cq are the rotated Winograd cores based on vel and ve2 v =

Page 200: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.7 Variants 185

We see that F, can be built from the corresponding p-point and q-point algorithms desigmed in chapter 9 by tensor products. A general observation will be useful. If X is an m x m matrix, Y is an n x n matrbc, and if an algorithm computes the actions of X and Y using A(x) and A(y) additions, respectively, then the action of the tensor product X Y requires

nA(x) + mA(y)

additions.

Table 10.4 F., = C A, direct method (N =- pq).

Factor R.A. R.M.

A 4(p' q + pq') 0

C 2(p' (2p - 3)q + q' (2q - 3)p) 4(112 q + q/2

Table 10.5 Fir = CA, direct method.

Size R.A. R.M.

15 316 272

21 608 544

35 1284 1168

55 2892 2704

R.A. - the number of real additions M.A. - the number of real multiplication

10.7 Variants

The methods of chapter 9 will be applied to design several variants of the factorization providing options for arithmetic and data flow that can be matched to a variety of computer architectures.

Define the diagonal matrices Dp and D, by

= F (p' )- 1 C,F(p')-1 ,

= F(q')-1CqF(qi)-1.

Theorem 10.8

F, =- F (1 e Dp e D q ED (Dp DO)FQ-1(A(q) A(p))Q ,

Page 201: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

186 10. MFTA: Product of Two Distinct Primes

where F is the FT factor

F = 1 ED F(p') F(q1) e (F(p') F(q')).

The cost of additions in the previous theorem can be reduced by interchanging the order of the operations. Write

F, = FD(FAF-1)F

and set B = FAF-1.

Arithmetically, the action of A is replaced by the action B, which we will see requires fewer additions. To see this, we need to describe B. Recall that e(m) is the vector of size m having the 0-th component 1 and all of the others 0, and

B(P) = [ 1

1 et

—p e Ip‘] ' e = e(P')

.

Direct computation shows that

( 1 ED F(p'))A(p) = B(p)(1 F(p')),

which is what we need to show that

B(p) et 1 B = [ ,B(P) e = e(g')

—p B(p) e B(p) '

Arguing as in the preceding section, with

Q = Ip P(pq',q'),

we have B = Q-1 (B(q) B(p))Q.

Theorem 10.9

F, = F(1 ED Dp Dq (Dp Dq))Q-1(B(q) B (p))Q F,

where Q = Ip ED P(pq'q') and F is the FT factor

F = e F(p1) e F(q') (F(p') F(0).

The arithmetic of B is given as follows:

4(p + q) R.A.

+ q} R.M.

In brackets, we have placed the number of multiplications by integers.

Page 202: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.7 Variants 187

Table 10.6 Real additions.

Size A B

15 88 32

21 128 40

35 232 48

55 376 64

To reduce the cost of additions required to perform the complex mul-tiplications coming from the action of C, we note that Cp has the form

Xp Xp* Cp = [ x; xp .

Direct computation shows that

Cp = (F(2)0 lpy2)Y,(F(2) Ip'/2),

where = 1/2(Xp + X;)e 1/2(Xp — Xp*)

Arguing as before, we have the next result.

Theorem 10.10

Fir = H (1 Yp e Yp 11,))H Q-1 (A(q) A(p))Q,

where Q = lp ED P(pq' ,q') and H is the FT factor

H = le (F(2) ® 12) e (F(2) ® 472) e (F(2) ® ipy2 ® F(2) ® 4,/2)-

Table 10.7 uses the fact that 1/2(Xp + Xp*) has only real entries and 1/2(Xp — Xp*) has only imaginary entries.

Table 10.7 Arithmetic counts of H and Y.

Factor R.A. R.M

H 2(p — 1)q + (q — 1)p 0

y (p2 1)g (g2 1)p (p 1)2 q (q 1)2p

R.A. — the number of real additions M.A. — the number of real multiplication

Page 203: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

188 10. MFTA: Product of Two Distinct Primes

Setting

Y = 1 e IP Yq (Yp Yq),

the factorization in the preceding theorem can be rewritten as F, = HY 131H, where

= HQ-1 (A(q) A(p))Q

Fewer additions are required to compute the action of Bi as compared with computing the action of A. To see this, we must describe Bi.

For even n, define the vector f (Th) of size n by

f(n) [ 1 1 0 ® 1n/2.

Recall the definition

r e t

Bi(P) = —2f 4, f = f(p')

.

Then (F(2) 0 4,f2)1p, = 2f ('I).

This leads to the following description of Bi:

[ Bi(p) (p) ft f = f(V) B1 = —2B1(p) f Bi(P) 1-q'

The usual tensor product manipulations show that

= Q-1(-131(4) 131.(P))(

where Q = ED P(pq' ,q').

Theorem 10.11

F, = H(1 ED Yp ED Yq e ( Yp YO)Q-1(131(q) 0 Bi(p))Q H

where Q = /I, ED P(pq' , q') and

H = 1 ED (F(2) I py2) e (F(2) 0 4/2) e (F(2) ® rpy2 ® F(2) ® 4/2).

Table 10.8 Real additions for computing C.

Size Direct Method C = HY H

15 228 112

21 480 200

35 1052 408

55 2516 864

Page 204: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

10.8 Summary 189

10.8 Summary

Suppose that ir is the exponential permutation of Z/N corresponding to the factorization N = pq. Denote by lei, e2} the complete system of idem-potents for the factorization. Suppose that zi and z2 are generators of U(p) and U(q). Denote by F, the FT matrix corresponding to 7r. Set Q = Ip ED P(pq' , q').

In the following discussion we will suppress dependence on the choice of generators. Denote by Cp and Cq the rotated Winograd cores based on ye, and ve2, v = e27"/N. Define the multiplicative factors

C(p,q) = le cp cq e (cp cg).

Cp and Cq have the form

X X* C = [ P P P X; Xp

with a similar formula for Cq. Define the diagonal factors

D(p, q) = e Dp e e (Dp Dq),

where Dp = F(p')-1CpF(p')-1 with a similar formula for Dq. Define the block diagonal factors

Y(p,q) =1EDYp E13)Yq ED (Yp 01'0,

where 1 [ Xp* 0 1 =

0 X — X* '

P p

with a similar formula for Yq. Define the FT factors

F (p, q) = 1 ED F (p') ED F (q') ED (F (p') F (g'))

and

H(p, q) = le (F(2)0 /p72) ED (F(2) 0 4 /2) e (F(2) ® ipy2 ® F(2) o/v/2).

Define the preaddition factors

A(p, q) = Q'(A(q) A(p))Q,

B(p, q) = Q-1 (B(q) B(p))Q,

Bi(P, 4) = (Bi(q) Bi(p))Q,

where Q = Ip ED P(pq' , q').

Page 205: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

190 10. MFTA: Product of Two Distinct Primes

• Fundamental Factorization

F„. = C(p, q)A(p, q).

• Rader FFT I

F, = F(p, q)D (p, q)F(p,q)A(p, q).

• Rader FFT II

• = F(p, q)D(p, q)B(p, q)F (p, q).

• Rader FFT III

F, = H (p, q)Y (p, q)H (p, q)A(p, q).

• Rader FFT IV

= 11(P, Oir 4)B1(1), OH (11,4).

References

[1] Good, I. J."The Interaction Algorithm and Practical Fourier Analy-sis", J. R. Statist. Soc. B, 20(2), 1958, pp. 361-372.

[2] Thomas, L. H. Using a Computer to Solve Problems in Physics, Application of Digital Computers, Ginn and Co., 1963.

[3] Burrus, C. S. and Eschenbacher, P. W. "An In-place In-order Prime Factor FFT Algorithm", IEEE Runs Acoust. Speech and Signal Proc., 29(4), Aug. 1981, pp. 806-817.

[4] Chu, S. and Burrus, C. S. "A Prime Factor FFT Algorithm Using Dis-tributed Arithmetic", IEEE Trans. Acoust. Speech and Signal Proc., 30(2), April 1982, pp. 217-227.

[5] Kolba, D. P. and Parks, T. W. "A Prime Factor FFT Algorithm Using High-speed Convolution", IEEE Trans. Acoust. Speech and Signal Proc., 25(4), Aug. 1977, pp. 281-294.

[6] Blahut, R. E. Fast Algorithms for DigitarSignal Processing, Addison-Wesley, 1985, Chapters 4 and 8.

[7] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algo-rithms, Second Edition, Springer-Verlag, 1982, Chapter 7.

Page 206: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 191

Problems

1. For p = 3 and q = 7, find the system of idempotents fel, e21.

2. Find the unit group U(21). List all of the elements by the ordering defined by the exponential ordering.

3. Take generator 2 for U(3) and 3 for U(5). Order the indexing set Z/15, and write a complete Fourier transform matrix corresponding to this ordering. Compare with the results in section 2.

4. Verify the tables in section 3, the arithmetic counts of F,(15).

5. Find the system of idempotents for Z/33, Z/35 and Z/39. Reorder the indexing set by the idempotents.

6. Write C3 and CH in F7,433).

7. Write C5 and C7 in F7,-(35).

8. Write C3 and Ci3 in F7,439).

9. Verify the arithmetic counts given in tables 10.4 and 10.7.

Page 207: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

,.

Page 208: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

11 MFTA: Composite Size

11.1 Introduction

In this chapter, we extend the methods introduced in the preceding two chapters to include the case of a transform size that is a product of three or more distinct primes. In fact, we will give a procedure for designing algo-rithms for transform size N = Mr, r a prime not dividing M, whenever an algorithm for transform size M is given. We will also include FT algorithms for transform size 4M, where M is a product of distinct odd primes.

11.2 Main Theorem

Let N = Mr, where r is a prime not dividing M. Throughout this section, r' = r — 1, v = e'riim and w = Fbc a permutation 7r of Z/M and a generator z of U(r). Denote by F, the FT matrix corresponding to ir

F, = [27(a)ir 0<a, b<M

F, computes the M-point FT on data reindexed by 7r. We will develop a procedure for constructing an N-point FT algorithm with F, embedded in the computation. Any algorithm computing the action of F, can be used to compute the N-point FT.

Page 209: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

194 11. MFTA: Composite Size

Consider the complete system of idempotents 4} for Mr. By the CRT, every a E ZIN can be written uniquely as

a 7r(ai)ei + a2e2, 0 < < M, 0 < a2 < r.

Partition Z/N into the sets

S = fir(ai)e'l : 0 < al < MI,

T = fr(ai)e'l + zke2: 0 < al < M, 0 < k <

We order S by the parameter al and T by the parameters ai and k, with k as the faster running parameter. Define the permutation p of ZIN by

p= (S, T),

and consider the FT matrix Fp corresponding to p

Fp = [wP(1)P(01 0.<1 k<N*

Fp computes the N-point FT on the data reindexed by p. We decompose Fp into the four submatrices corresponding to the Carte-

sian products S x S, S x T,T x S and T x T. Consider first S x S. The corresponding submatrix is

= Rwely(a1)7 011 0<ai GM

Since we'i is a primitive M-th root of unity, F,' is formed by replacing e2"iim in F, by wel and we can write F„' = PF, for some permutation matrix P.

The submatrix corresponding to T x T,

[wir(ai)r(191)4-1-z'+ke'21

0<ai, bi GM , 0<1,k<r'

can be written as

[(we'i y(ai)/r(bi) (we )zi+k 1 7

0<al, bi<M, 0<1,kGr'

which, by the ordering on T, is

F„' 0 Cr, —

where Cr is the rotated Winograd core based on the primitive r-th root of unity we'2. Continuing in this way,

F — F:r ® P F.; r®Cr j•

Page 210: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

11.2 Main Theorem 195

Since

Crir, = -lr',

we have

Im 0 4,1 F, -= ED (F,', 0 Cr)) [_.im

Moreover,

p(mr',111) (Ir, ® Im ) =

implies that

ltr'l Q-1 (A(r) 0 I m)Q = [_ m 1.7., mr, J

where Q = e P(Mr', r'), proving the next result.

Theorem 11.1 If 71- is a permutation of Z I M and {el, e'21 is the complete system of idempotents for the factorization N = Mr, r a prime not dividing M , then there exists a permutation p of ZI N such that

Fp = (F; e (F; Cr))Q-1 (A, 0 I m)Q

with Q -= Im ED P(Mr' ,r'). F; is the matrix formed by replacing e'ilm in F, by we'i and Cr is the rotated Winograd core based on we' 2 , w =

The permutation p has been explicitly described in the preceding discussion.

Every factorization of F; into a product of two M x M matrices F; = CA produces a factorization of Fp by the tensor product manipulations

Fp = (C (C Cr))(A e (A 0 Ir,))Q-1(A(r) Im)Q

= (C e (c ® cr))(2-1(ir A)(Ar Im)Q

= (C e (c ® cr))Q-1(A(T) ® A)Q

which is summarized in the next corollary.

Corollary 11.1 If F.; = CA, then

Fp (C ED (C 0 Cr))Q-1(A(r) 0 A)Q.

In many applications, A is a matrix whose coefficients are talcen from {0, 1, -1} and C is the tensor product of rotated Winograd cores.

Page 211: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

196 11. MFTA: Composite Size

11.3 Product of Three Distinct Primes

We will now apply the results of the preceding section to design multiplica-tive FT algorithms for transform size N = pqr based on the multiplicative FT algorithm for M = pq developed in the preceding chapter, where p, q and r are distinct primes. Throughout we will use the notational con-ventions established in section 2. Set p' = p - 1, q' = q - 1, r' r - 1, w = e'IN and v = e27"im. Consider the complete system of idempotents

f2, f31 for the factorization N = pqr:

1 mod p, 0 mod 4, fi 0 mod r,

h 0 mod p, h 1 mod q, h 0 mod r, f3 0 mod p, f3 0 mod q, f3 1 mod r.

The set fel, e21 with ei = fi mod M and e2 = f2 mod M is the complete system of idempotents for the factorization M pq, and the set {el, e'21 with ec = f2 and e'2 = f3 is the complete system of idempotents for the factorization N = Mr. Throughout the discussion we suppress dependence on the choice of generators for U(p), U(q) and U(r).

Choose the exponential permutation 7r of Z/M corresponding to the factorization M = pq. By the fundamental factorization,

F, = (1 e cp e (c, ® Cq))Q-1(A(q) A(p))Q,

where 6', and Cq are rotated Winograd cores based on v" and ve2 and Q -= ip ED P(pq1,q/). Then

= (1 e c; e c,' e (c; ® C ))Q-1 (A(q) A(p))Q ,

where Cp' is the rotated Winograd core based on w"''2 = wfl and Cq' is

the rotated Winograd core, based on we'le2 wf2 . Since Cr' is the rotated Winograd core based on we2 = wf3, we have the next result by the corollary of section 2.

Theorem 11.2 Suppose that { f2, f31 is the complete system of idem-potents for the factorization N = pqr for distinct primes p, q and r. Then there exists a permutation p of ZIN such that

Fp = C pQ-1(A(r) A(q) A(p))Q,

where

Cp=1A)CpeC,e(Cp0C0eCre(Cp0C,)03(C,OCr)ED(Cp®Cq®C,-)

with rotated Winograd cores Cp, Cq and Cr based on 'Loh , wf2 and wh , w = e27"/N and Q = (Ir e P(pgl,q1)))(41 P(Mr',r')). ,

The permutation p is constructed from 7r and {el, as described in the preceding section.

Variants can be produced by the same tensor product manipulations described in the preceding chapters. We will state results without proofs.

Page 212: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

11.4 Variants 197

11.4 Variants

Denote by ffi, f2, f3} the complete system of idempotents for the factor-ization N = pqr for distinct primes p, q and r. Denote by Cp, Cq and Cr the rotated Winograd cores based on wfl, wh and wf3, w =

Define the multiplicative factor

C(p,q,r) = C(p, q) ED (C(p,q) 0 Cr),

where C(p, q) is based on Cp and Cq as defined in the summary of chapter 10.

Define the diagonal factor

D(p, q,r) = D(p, q) e (D(p, q) D(r))

and the block diagonal factor

Y (p, q, r) = Y (p, q) ED (Y (p, q) Y (r)),

where D(p, q) and Y (p, q) are defined in the summary of chapter 10. Define the FT factors

F(p,q,r) = F(p, q) (F(p, q) F(r))

H(p,q,r) = H(p, q) e (H (p, q) 0 (F(2) 0 Iry 2)),

where F(p, q) and H (p, q) are defined in the summary of chapter 10. Define the preaddition factors

A(p, q, r) Q-1 (A(r) A(q) A(p))Q,

B(p, q, r) = Q' (B(r) B(q) B(p))Q,

Bi(p,q,r) = Q-1 (Bi(r) Bi(q) 0 Bi(p))Q,

where A(p), B(p) and Bi(p) are defined in chapter 9 and Q = (I, 0 (Ip ED P(pq' , q'))(I m ED P(Mr' , r')).

Fundamental Factorization

Fp = C(p,q,r)A(p,q,r).

Rader FFT I

Fp = F(p,q,r)D(p, q,r)F(p,q,r)A(p,q,r).

Rader FFT II

Fp = F(p,q,r)D(p,q,r)B(p,q,r)F(p,q,r).

Page 213: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

198 11. MFTA: Composite Size

Rader FFT III

= H(p, q,r)Y (p, q,r)H(p,q,r)A(p, q, r).

Rader FFT IV

Fp = H (p, q, r)Y (p, q,r)Bi(p, q,r)H(p,q, r).

These results easily generalize to an integer N equal to the product of an arbitrary number of distinct prime factors. The only change is that the multiplicative matrices are rotated Winograd cores based on raising w = e271-2/N to powers given by the complete system of idempotents of the factorization of N into this product.

11.5 Transform Size: 12

Consider the permutation 7r of Z/4,

ir = (0, 2, 1, 3).

F, admits the factorization F, = C A,

where

[100 0 [I. 1 1 1 010 0 1 1 —1 —1

C = A = 0 0 1 i ' 1 —1 0 0 0 0 1 —i 0 0 1 —1

{9,4} is the complete system of idempotents for the factorization 12 = 4 x 3. Set w = e2"1112. Since w9 = —i and w4 = e27"/3, we have, in the fundamental factorization for 12 = 4 x 3, F", = C* A and C3 = C(3). By the corollary of section 2,

= (C* (C* C(3)))Q-1(A(3) A)Q,

where Q = ED P(8, 2). The permutation p of Z/12 is

p = (0, 6, 9, 3; 4, 8, 10, 2; 1, 5, 7 , 11).

11.6 Transform Size: 4p, p odd prime

Choose ir = (0, 2, 1, 3) and F„ = C A as in the preceding section. Denote by e'21 the complete system of idempotents for the factorization N = 4p.

Page 214: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

11.7 Transform Size: 60 199

Set w = eri N . In the fundamental factorization for N = 4p, if we'i then = F, while if we'i = —i, then F.„' = C* A. We have, by the corollary of section 2,

(c e (c Cp))Q-1(A(p) A)Q , we' = FP

= (C* 133, (C* Cp))Q-1(A(p) A)Q, wel =

where Cp is the rotated Winograd core based on w4 and Q = 14 ED13(4p1, p'). The permutation p = (S,T) is given by

S = {7r(ai)e'l : 0 _< < 4} = (0,2e11,

T = (To,T2771)T3)

with Tj = {jel + zkei2 : 0 < k < p'1,

where z is a generator of LI (p).

11.7 Transform Size: 60

Choose the permutation 71" of Z/12 given in section 4,

= 06934810215711).

As shown in section 4,

F, = (C* e (c* C(3)))Q-1(A(3) A)Q

with C, A and Q as defined in section 4. {25,36} is the complete system of idempotents for the factorization 60 =

12 x 5. Set w = e2'/6°. Then

w25 = e27riA, w36 = e27rit

F,' is formed by replacing e'/12 in F, by e27riA. Since

e2iri _

ezn-i/3 _ = (eza-in )*,

C* remains unchanged while C(3) is changed to C*(3). We have

= (C* e (c* c*(3)))Q-1(A(3) 0 A)Q.

Since C5 is the rotated Winograd core based on e2'it,

C5 = SC(5),

Page 215: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

200 11. MFTA: Composite Size

where 0 0 0 1 1 0 0 0

S 0 1 0 0 • 0 0 1 0

By the corollary of section 2,

Fp = (C' e (C' ® SC(5)))W1 (A(5) 0 A(3) 0 A)Qi,

where C' = C* (C* 0 C* (3)) and Qi --= (/5 Q)(/12 ED P(48, 4)). {45, 40,36} is the complete system of idempotents for the factorization

60 = 4 x 3 x 5. Since

40 27ri a 36 27ri

W45 = - i W = e 3 W = e 5 1

we have

C4 =- C* (4), C3 -= C* (3)1 C5 = SC(5),

which, by the theorem of section 3, implies that

F„ = (C'' e (C" SC(5)))Q-1 (A(5) 0 A(3) A(4)))Q,

where C" = 1 ED C*(4) C*(3) ED (C*(4) 0 C* (3)) and Q = (15 0 (14 ED P(8, 2))) (Ii2 ED P(48, 4)).

Tables of Arithmetic Counts

Table 11.1 Fp CpAp.

Factor R.A. R.M.

Cp 60 64

Ap 68 0

Fp 128 64

Table 11.2 F„ = CpAp.

Factor R.A. R.M.

Cp 4(p — 1)(4p — 5) + 4 16(p — 1)2

Ap 4(7p — 4) 0

Fp 4(4p2 — 2p + 2) 16(p — 1)2

Page 216: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

R.A. R.M.

20p - 8

P(P+ 5)

14(p - 1)1

4(p - 1)2 + f4(p - 1)}

Tables of Arithmetic Counts

Table 11.3 Fp = CpAp.

. R.A. R.M.

128 64

368 256

736 576

1856 1600

Le 11.4 C = HY H (N = 4p).

R.A. R.M.

4(p2 - 4p + 6) 4(p - 1)2

8(P - 1) 0

4(p2 + 2) 4(p - 1)2

able 11.5 F, = HY Bill.

['able 11.6 F, = HY BiH.

R.A. R.M.

96 16+0}

200 64+06}

336 144+04}

704 400+1401

- the number of real additions. ,he number of real multiplications.

Page 217: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

202 11. MFTA: Composite Size

References

[1] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 6 and 8.

[2] Lu, C. Fast Fourier Transform Algorithms for Special N's and the Implementations on VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.

[3] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algo-rithms, Second Edition, Springer-Verlag, 1982, Chapter 7.

Problems

1. For p =- 3, q = 5 and r = 7, find the complete system of idempotents for N = pqr.

2. Find the ordering of Z/105 by the idempotents of problem 1.

3. Define the matrices Cs, Cs and C7 in F(105).

4. Derive the 4p algorithm for p -= 5 in detail as in the example of N = 12 given in section 5.

5. Derive four variants of the N = 20 algorithm.

6. Find prime p with the property that the Fir' has to be written as F,', = C*(4)A(4).

7. Prove the formulas given in tables 11.2, 11.4 and 11.5.

Page 218: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12 MFTA: p2

12.1 Introduction

Multiplicative prime power FT algorithms will be derived. Although mul-tiplicative indexing will play a major role as in the preceding chapters, the multiplicative structure of the underlying indexing ring is significantly more complex, and this increased complexity will be reflected in the resulting algorithms.

Two different algorithms are given for the case p2 and examples are presented in detail. In section 2, we start with the example of 9. The general case p2 will be given in detail in section 3. An extension to the case pk is given in section 4 by the example of 27.

12.2 An Example: 9

Z/9 is not a field, but the unit group U(9) is a cyclic group of order 6 having generator 2:

U(9) = {2k : 0 < k < = (1,2,4,8,7,5).

Order U(9) by powers of 2 as shown.

Page 219: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

204 12. MFTA: p2

Set w = 0'0 and

W W2 W4 W8 W7 W5 -

W2 W4 W8 W7 W5 W

C(9) = W4

W8 W8

W7 W7

W5 W5

W

W

W2 W2

W4 •

W7 W5 W W2 W4 W8

W5 W W2 W4 W8 W7

C(9) is a 6 x 6 skew-circulant matrix having the form

x(9) X*(9) 1 C(9) = [ x*(9) X(9) J

where w W2 W4

X(9) -= w2 w4 ws I . W4 W8 W7

We call C(9) the Winograd core based on the generator 2 of U(9). Consider the sets Do, Di and D2 defined by

Do = U(9),

= 3U(9) = {3, 6},

D2 = {0},

ordered as shown. The collection {Do, Di, D2} is a partition of Z/9. The permutation

= (D2, Di, Do) = (0; 3, 6; 1, 2, 4, 8, 7, 5)

is called the ezponential permutation of Z/9 based on the generator 2 of U(9).

Denote by 1,, the vector of size rn of all l's, and by E(rn,n) = 1,, 0 the m x n matrix of all l's. Set E(rn) -=- E(m,m,). The FT matrix F, is given by

1 F, = [12 E(2) l& 0 C(3)1,

16 13 0 C(3) C(9)

where C(3) is the Winograd core based on the generator 2 of U(3). Computing the action of F, requires computing four C(3) a,ctions and

one C(9) action. The goal is to reduce the mimber of multiplications by distributing these actions across a preaddition stage. However, direct coni-putation shows that C(9)16 = 06, a result we will prove in general in the next section. Since C(9) cannot be factored across 4, we must handle C(9) separately.

Page 220: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12.2 An Example: 9 205

Denote by 0,, the vector of size 7n of all O's and 0(m, n) = 0 Oti, the m x n matrix of all O's. Set 0(m) = 0(m, m). Define

1 = [12 E(2) 0 C(3) .

06 0(6, 2) C(9)

Algorithm I for computing F,x:

• Compute u =

• Compute

v = C(3) [ xl . X2

• Compute

F,x = u 03 ± [ 03 1

I_ x016 13 0 v

Algorithms for computing F: will be discussed below. The cost of the final two stages is 14 additions and 2 complex multiplications.

F: admits a factorization into the product of a preaddition matrix and a multiplication matrix.

Theorem 12.1 = (1 C (3) e C(9))A(9),

where 1 lt

A(9) = - 12 -E(2) /2 . 06 0(6, 2) /6

Variants of the factorization can be derived by the usual tensor product arguments. Define the diagonal matrices

D(3) = F(2)-1C(3)F(2)-1,

D(9) = F(6)-1C(9)F(6)-1

Since

lt6F(6)-1 = [ 1 0 0 0 0 0

1000001 F(2)(4 /2)F(6)-1 = [

L000100j'

we have the next result.

Page 221: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

206 12. MFTA: p2

Theorem 12.2 Suppose that

F = e F(2) e F(6).

Then = F(1 ED D(3) ED D(9))B(9)F,

where 1 1010 0000

-2-2010 0000 B(9) 0 0000 0100 .

06 0(6, 2) 16

The block diagonalization method extends. Since

(1'3 0 F(2))(F(2)-1 0 /3) = [ 01 01 01 01 _ 01. 01 i ,

we have the next result.

Theorem 12.3 Suppose that

H = 1 ED F(2) e (F(2) 0 /3).

Then

= H-1(1 e (-1) ED a e (X(9) + X*(9))

isB(X(9) - X*(9))Bi (9)H,

where a = v - v2 , v = e2"4/3 and

- 1 10 1110 0 0 -2 -20 1110 00

Bi(9) = 0 0 0 0 0 0 1 —11 .

06 0(6, 2) /6

12.3 The General Case: p2

Fix an odd prime p. Set w = e21"/P2 and p' = p - 1. Z/p2 is no longer a field, but the unit group U(p2) is a cyclic group of order s = pp'. Choose a generator z of U(p2) throughout this section. Then

U(p2) = tzk : 0 < k < sl.

Page 222: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12.3 The General Case: p2 207

Order U(p2) by the parameter k. The s x s skew-circulant matrix

w w. . . . wzs-1 _i wz w.2 . . we w

C(p2)—

— .:-' . we-2 W W

is called the Winograd core based on the generator z of U(p2). Since zs/2 —1 mod p2, C(p2) has the form

C(p2 ) = [ X(P2) X* (P2) x*(p2) x(p2) ] •

Define the subsets Do, Di and D2 of Z/p2 by

Do = U(P2),

Di = pU(P2) = {Pzk : 0 _< k < pi},

D2 = {0}.

The collection {Do, Di, D2} is a partition of Z/p2. The permutation 7r of Z/p2 given by

7r = (D2, Di, Do)

is called the exponential permutation of Z/p2 based on the generator z of U(p2). The FT matrix F„ is given by

1 lt P' its

F„ = 1p, E(p') lpt 0 C(p) , [

is ip 0 C(p) C(p2) (12.1)

where C(p) is the Winograd core based on the generator z mod p of U(p). A direct computation of F„ requires p + 1 C(p) computations and C(p2)

computations. The goal is to reduce the number of multiplications by dis-tributing Winograd cores across a preaddition stage. However, the following result shows that this cannot be completely accomplished.

Theorem 12.4 If w = e21rilP2 , then

E wk _ E wk2 _ o.

kEU(p2) kEU(p2)

Proof Since k EU(p2) if and only if k + p E U(p2), we have

E wk _ wp E wk,

kEU(p2) kEU(p2)

E wk2 _ w2p E wk2,

,u(p2) kEU(p2)

which can be the case only if the theorem holds.

Page 223: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

208 12. MFTA: p2

Corollary 12.1

C(p2)1, = Os C(p2)(1, =

From the corollary C(p2) cannot be factored across 1, and 1p 0 C(p) and must be handled separately. We will describe two methods.

Define ipt,

=[1,/ E(p') 11; C(p)i. Os 0(s, p') C (p2)

Algorithm II for computing F,x:

• Compute u =-

• Compute xi

v = C(p)

L:1

• Compute 0

F,x = F„-Ex + [ [ v

The algorithm computes the action of Fir by computing the a,ctions of C(p) and F: plus 2s additions. Algorithms for computing F: will be given below. The computation of v can be carried out by the methods of the previous chapter using the fact that C(p) is a skew-circulant matrix and is a block 2 x 2 skew-circulant matrix.

Algorithms for computing F: can be derived using the methods of the previous section. Since C(p)1,,, = —1p,, we have the following result.

Theorem 12.5 = (1 ED C(p) C(p2))A(p2),

where 1

A(p2) = —E(p') 14,0 4,1. Os 0(s pi) is

Set F(p' , s) = 1 ED F(p') ED F(s).

Since C(p) and C(p2) are skew-circulant matrices, we have

C(p) = F(p')D(p)F(p'),

Page 224: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12.3 The General Case: p2 209

C(p2) = F(s)D(p2)F(s),

where D(p) and D(p2) are diagonal matrices. For the discussion, set

F = F(p' , s).

Then = F(1 e D(p) e D(p2))B(p2)F,

where B(p2) = FA(p2)F-1.

Denote by e(n) the vector of size n having the 0-th component equal to 1 and all other components equal to O. Set e(m,n) = e(m) (e(n))t and e(rn) = e(tn, rn).

Theorem 12.6

F(p')(lpt Ip,)F(s)-1 = 0 et ,

where e = e(P)

Proof By the Cooley-Tukey factorization for s = pp' ,

F(s)-1 = (F(p)-1 1-,,,)T(Ip F(p')-1)P(s,p),

where T is a diagonal matrix whose first p' diagonal elements are all equal to 1. Then

F(p')(4 Ip,)F(s)-1 = (4 F(p'))F(s)-1

= (et 0 F(p'))T(Ip F(p')-1)P(s,p)

= (F(II) e 0“P')2)(ip F(P')-1)P(s,P) = e 0(01)2)P(S,P)

= 0 et .

From the preceding theorem, we have the next result.

Theorem 12.7

= F(p' , s)(1 ED D(p) D(p2))B(p2)F(p', s),

where 1 et

I. et

2 B(P2) = [ —P'el —P'e(P') ip' 0 et 1,

Os 0(s, p') Is

where ei = e(P'), e2 = e(8) and e = e(P).

Page 225: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

210 12. MFTA: p2

The action of B (p2) requires three additions and one rational multiplication by —p'. Up to the computation of the two F(p' , s) factors, the action of F: requires three additions, one rational multiplication and p2 — 1 complex multiplications.

A related approach begins by defining

1 4, = [1p, E(p') 4 C (p)1.

1, lp C (p) 0(s)

Algorithm III for computing Fn-x:

• Compute u = F:+x.

• Compute x,

v = c(p2)

X.5_1

• Compute

F,x = u + [ °vP -

The action of C(p2) has been separated out. Since C (p2) is skew-circulant and 2 x 2 block skew-circulant its action can be computed by the diagonalization and block diagonalization methods of the preceding chapter.

The computation of the action of F:± is handled in much the same way as that of F:

Theorem 12.8

F.;1-+ = (1 ED C(p) (/, C(p))Ai(p2),

where 1 lt its

A' (p2) = —1,, —E(p') 4 0 p, . —1, 1, 0 Ip, 0(s)

Theorem 12.9 Set

(p' , s) = F(p') (/, F(p')).

Then F:+ = F' (p' , s)(1 D(p) D(p2)).131 (p2)F' (p' , s),

Page 226: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12.3 The General Case: p2

211

where

1 el lpt 0 el g(p2) = —p' e(p') 0 1,

—p lp ei lp /p, 0(s)

where el -= e(P').

The action of BV2) requires 3p additions. The block diagonalization method can also be applied. For n even, set

L.(n) [

I — 0 Q9 In/2

and f (n) = f(n) (f(n))t. Denote by in, the vector of size m whose r-th component is (-1)r and set

-t(n) [ 0 „ 7n/2.

'

Since

(F(2) 0 /2)(12; 0 Ily)(F(2) 142)

= ((lpt F(2))(F (2)-1 0 Ip)) Ipy2

= [ ]0 I

ft p'/2

with f =- f2P and i= PP, we have the next result.

Theorem 12.10 Set

H(p' , s) = 1 (33 (F(2) 0 Ipy2) ED (F(2) 0 /s/2)•

Then

= 1-1(P1 , s)-1 X (P2 ,P)131(P2)11(P' s),

where

X(p2 , p) = le(X (p)+X*(p))e(X(p)— X*(p))

(13,(X(p2) X*(p2)) e (x(p2)- x*(p2))

and 1

[

fl ct '2

Bi(P2) = —2fi. —2f (p') [-fftt]® 472

03 0(s, p') .18

with fi = f (ii ) , f2 = f (8) and f = f(2P).

Page 227: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

212 12. MFTA: p2

12.4 An Example: 33

Take 2 as a generator of the unit group U(27) of Z/27,

U(27) = {2k : 0 < k < 181.

The indexing set Z/27 is partitioned by the sets

Do = U(27),

= 3U(27) = {3, 6, 12, 24, 21, 151,

D2 = 9U(27) = {9, 181,

D3 = {0}.

Order U(27) by the parameter k. The permutation 7T of Z/27 defined by

7r = (D3, D2, Di, Do)

is the exponential permutation of Z/27 based on 2. The FT matrix Fir can be written as

7, _ [ 1 11 1A it 18

12 E(2) E(2, 6) A 0 C(3) F —

16 E(6, 2) E(3) 0 C(3) 4 0 C(9) '

118 lo 0 C(3) 13 0 C(9) C(27)

where C(3) and C(9) are the Winograd cores based on the generators 2 mod 3 and 2 mod 9 of U(3) and U(9), and

w = e2iri/27 C(27) = [wzi+k],

is called the Winograd core based on the generator 2 of U(27). Define

_

1 1 1 lti8

F++ 12 E(2) E(2, 6) 4 0 C(3) 71" — 16 E(6, 2) E(3) 0 C(3) 0(6, 18) •

_ 118 lo 0 C(3) 0(18, 6) 0(18)

Algorithm IV for computing F,x:

• Compute u Fj-±x.

• Compute

v = 0(9, 3) 13 0 C(9)

S3

11 0 C(9) 1 s.4 C(27)

x26

Page 228: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

12.4 An Example: 33

F,x = F:+x [ °v3

1 e C(3) e (13 ® C(3)) e (19 ® C (3)) A,

1 -12 -16 118

—E(2) —E(6, 2) 16 0 12

—E(2, 6) E(3) 12

0(18, 6)

118 11 0 /2 0(6, 18) 0(18)

le preceding section can be used to derive add

metic Counts

hble 12.1 Algorithm I F„(9).

:tor R.A. R.M.

C 144 160

A 36 0

(9) 180 160

(9) 204 176

Lble 12.2 Algorithm IV F, (9).

:tor R.A. R.M.

H 16 0

T 0 8

(9) 56 36+0}

(9) 88 44+181

Page 229: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

214 12. MFTA:

Table 12.3 F, = HDBH (N =27).

Factor R.A. R.M.

H 52 0

D 0 26

B 450 414

F„ 554 440

R.A.— the number of real additions. R.M.— the number of real multiplications.

Referen.ces

[1] Blahut, R. E. Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1985, Chapters 4 and 8.

[2] Heideman, M. T. Multiplicative Complexity, Convolution, and the DFT, Springer-Verlag, 1988, Chapter 5.

[3] Lu, C. Fast Fourier Transform Algorithms For Special N's and The Implementations On VAX, Ph.D. Dissertation, The City University of New York, Jan. 1988.

[4] Lu, C. and Tolimieri, R. "Extension of Winograd Multiplicative Al-gorithm to Transform Size N=p2g, p'gr and Their Implementation", Proc. ICASSP 89, 19(D.3), Scotland.

[5] Nussbaumer, H. J. Fast Fourier Transform and Convolution Algo-rithm,s, Second Edition, Springer-Verlag, 1982.

[6] Tolimieri, R., Lu, C. and Johnson, W. R. "Modified Winograd FFT Algorithm and Its Variants for Transform Size N=pn and Their Im-plementations," Advances in Applied Mathem,atics, 10 pp. 228-251, 1989.

[7] Winograd, S. "On Computing the Discrete Fourier Transform", Proc. Nat. Acad. Sci. USA, 73(4), April 1976, pp. 1005-1006.

[8] Winograd, S. "On Computing the Discrete Fourier Transform", Math. of Computation, 32, Jan. 1978, pp. 175-199.

Page 230: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 215

Problems

1. List all of the elements of the unit group U(25) of Z/25.

2. Reindex Z/25 by its orbits Do, Di and D2.

3. Write the explicit matrix of FB.(25) and derive algorithm I and its variants.

4. Find the arithmetic counts for each of the variants derived in problem 3.

5. Derive the algorithm II for F,(25) and find its arithmetic counts.

6. Find a generator for the unit group of U(125) = U(53).

7. Find the system of idempotents lei, e2} for Z/75 = Z/523.

8. Order the unit group U(75) using the idempotents found in problem 9.

9. Derive in detail the algorithm for F, (75) following the procedures of section 4.

10. Derive the Good-Thomas algorithm for a 75-point Fourier transform.

11. Compare the arithmetic counts of the results of problems 9 and 10.

Page 231: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)
Page 232: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

13 Periodization and Decimation

13.1 Introduction

The ring structure of Z/N provides important tools for gaining deep in-sights into algorithm design. The fundamental partition of the indexing set Z/pm, a major step in the Rader-Winograd FT algorithm of the preceding chapter, was based on the unit group U(pm).

We will now adopt the point of view that N-point data is a complex-valued function having the set Z/N as domain of definition. Denote by

L(ZIN)

the set of all complex-valued functions on Z/N and regard L(ZIN) as a complex vector space under the following rules of addition and scalar multiplication. For f, g E L(ZIN) and a E C,

(f + g)(j) = f (j) + g(j),

(af)(j)= a(f(j)), 0 5 j < N.

The ideas of this chapter are best described on the function theoretic level and constitute a part of abelian harmonic analysis. In this section, we will redefine the Fourier transform as a linear operator of the vector space L(Z/N) whose matrix relative to the standard basis is the N-point FT F(N).

In the next section, subspaces of L(ZIN) will be introduced correspond-ing to subgroups of Z/N. For a subgroup B, we define the subspace

Page 233: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

218 13. Periodization and Decimation

of B-periodic functions and the subspace of B-decimated functions. The main theorem, proved in section 3, establishes an important duality be-tween these subspaces determined by the FT. This duality plays a role in both the Cooley-Tukey algorithms and the Rader-Winograd algorithms, and provides the key to understanding the global structure of many one-dimensional and multidimensional FT algorithms.

Denote the linear span of a subset S of a vector space V by Ln(S). For simplicity, we introduce the following convention. Suppose that S is a subset of a vector space V and determines a basis of Ln(S). If Y is a linear transformation of V such that Ln(S) is Y-invariant, then we say that M is the m,atrix of Y with respect to S whenever M is the matrix of the restriction of Y to Ln(S) with respect to S.

The set of functions : 0 < / < NI

with 1, k =1,

ei(k) = { 0, k

0 k < N

is a basis of L(ZIN) called the standard basis. If f E L(ZIN), we can write

N-1

f = E f Met, (13.1) t=o

and we call the vector of size N

f (0)

f =[ f(ISI —1)

the standard representation of f. An inner product on L(Z/N) is defined by the formula

N-1

(f '9) = E f(i)g*U), (13.2) i=o

where * denotes complex conjugation. The standard basis is an orthonorrnal basis relative to this inner product in the sense that

1 j — k (ei,ek) =10: j 0 j,k < N.

The space L(ZIN) viewed as an inner product spa.ce with inner product (13.2) is denoted by L2 (Z/N).

We will define another basis of L(Z/N) bearing some relationship to the ring structure of Z/N.

Page 234: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

13.1 Introduction 219

A function

X Z/N Cx,

where Cx denotes the multiplicative group of nonzero complex numbers, is called an additive character if the following condition holds:

x(/ + k)= x(l)x(k), 0 < k < N,

with addition / + k taken mod N. In mathematical language, an addi-tive character is a homomorphism from the additive group Z/N into the multiplicative group Cx.

The additive characters on Z/N can be described as follows. Denote the subgroup of Cx consisting of all N-th roots of unity by UN •

Theorem 13.1 An additive character x of ZIN is a homomorphism of the additive group Z/N into the multiplicative group UN and is uniquely determined by x(1) by the formula

x(j) = x(1)i , 0 < j < N.

The group UN is a cyclic group that has the element

w = e2iritIV

as a generator. For each k,0 < k < N, we define the mapping

Xk Z/N UN

by setting

Xk(i) = wk3, 0 < j < N. (13.3)

By theorem 13.1, the set

{Xk : 0 < k < N} (13.4)

is the set of additive characters on Z/N.

Theorem 13.2 The set of additive characters on ZIN is an orthogonal basis of L2(Z/N), and for any additive character x on ZIN we have

iiXii2 = (X, X) =

Proof If v is an N-th root of unity, then

N-i E _ 0, , 1, N, v =1.

1=o

Take 0 < , k < N . By (13.2) and (13.3), we have that

N-1

(X1)Xk) = E Wr") - 1°N 11 kk7 r=0

Page 235: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

220 13. Periodization and Decimation

This implies that the set (13.4) is an orthogonal subset of L(Z/N). Since there are N distinct elements in (13.4), the same as the dimension of L(ZIN), the set (13.4) is a basis of L(ZIN), completing the proof of the theorem.

The FT of Z/N can now be defined as the linear operator F(N) of L(Z/N) satisfying the following condition:

F(N)(e3) = x3, 0 < j < N.

By (13.1), we have that

N-1

F(N)(f)= E f f E L(ZIN). i=o

Setting g =F(N)(f), we have

g = F(N)f,

which implies that the matrix of the FT of Z/N relative to the standard basis is F(N).

There are useful properties of the linear operator F(N) that will be needed throughout this chapter. They can be proved directly or by using the corresponding properties of the matrix F(N). We list them without proof:

F(N)4 = N2/, (13.5)

1- the identity operator on L(Z/N), and

F(N)2(e3) = eN_,, 0 j < N (13.6)

(f,g)=11N(F(N)(f),F(N)(g)), f,g E L(ZIN). (13.7)

13.2 Periodic and Decimated Data

Set w = eri/N. Since every subgroup B of Z/N has the form

B = rZIN,

where r is a divisor of N:

B = Irk :0 < k < N = rs,

we have BZIN c B.

Page 236: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

13.2 Periodic and Decimated Data 221

The ring structure of Z/N gives rise to the bilinear pairing

k) = wlk, 0 < l,k < N. (13.8)

The product, lk, can be taken either mod N or in Z since wN = 1. Direct computation shows that the-bilinear pairing (13.8) satisfies the following three properties. For 0 < k, m < N,

• (l + k, m) = (l,m)(k,m),

• (lk,m) = (l,m)k,

• (l,k) = (k,l).

The dual B± of a subgroup B of Z/N is defined by

= 1/ E Z/N : (l,k) = 1, for all k E B}.

is a subgroup of ZIN.

Example 13.1 Take N = 6 and B = 2Z/6. Then

B L = 3Z/6.

Example 13.2 Take N = 9 and B = 3Z/9. Then Bi = B.

Theorem 13.3 If B = rZIN, then .13-1- = sZIN, N =rs.

Proof Since N =rs and wN = 1, we have that

B-L sZIN.

Conversely, if k E B±, then Wkr = 1, which implies that s divides k and

Bi c sZIN,

completing the proof of the theorem.

Corollary 13.1 (B±)± = B.

Suppose that B = rZ/N, N = rs throughout this section. A function f E L(A) is called B-periodic if the following condition is satisfied:

f(a + b) = f(a), a E A,b E B.

A B-periodic function f is uniquely determined by the vector of values

g=

f (0) 1

f(r —1)

Page 237: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

222 13. Periodization and Decimation

by the formula

f (l + rk) = 0 < < r, 0 < k < s.

In vector notation, f = 19 g.

Define 7r : Z/N —> Z/r

by setting 7r(x) = x', where x x' mod r and 0 < x' < r. The mapping ir is a ring-homomorphism of Z/N onto Z/r. For g E L(ZIr), define

f = ir*(g) E L(ZIN)

by the formula

7r*(g)a = g(r(a)), a E Z/N.

Observe that f is B-periodic. Every B-periodic function is of this form, and we have the following result.

Theorem 13.4 The mapping

7r* : L(Z/r) —> L(ZIN)

is a linear isomorphism of L(ZIr) onto the space of B-periodic functions in L(ZIN), B = rZ/N,.

A function f E L(ZIN) is called B-decimated if we have, for a E Z/N

f(a) = 0, a B.

If f is B-decimated, then

f (0) -

f (r) f = e(r).

_ f((s — 1)r)

Take h E L(ZIs) and define

f = cr* (h) E L(Z I N)

by setting f (kr) = h(k) 0 k < s,

and f equal to 0 otherwise.

Theorem 13.5

o-* : L(ZI s) L(ZIN)

is a linear isomorphism of L(ZIs) onto the space of B-decimated functions in L(ZIN).

Page 238: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

13.3 FT of Periodic and Decimated Data 223

13.3 FT of Periodic and Decimated Data

We come to the first major result of this chapter: the duality between spaces of periodic functions and spaces of decimated functions determined by the FT. This duality is tbe function theoretic analog of theorem 13.1. Throughout this section, we set w = e2"i/N and

(1, = wik 0 < 1. k < N.

Take B =rZIN,n=rs. Suppose that f E L(ZIN) is B-periodic. Since every k, 0 < k < N, can be written uniquely in the form,

k = k' + b, 0 < < r, b E B,

we have r-1

F(N)f (1) = E E f +b)(k',1)(b,1), k'=0 bEB

implying by B-periodicity that

r-1

F(N)f(1) E f(k)(k,l)E(b,1), 0 <1 < N.

k=0 bEB

We want to compute

-y(/) =- E(b,/), 0 < / < N. bEri

There are two cases to consider. If / E then (b,/) = 1, b E B and we have

7(0 =- s, / E

proving that T-1

F(N)f(1) = s E f(k)(k,1), 1 E B-L. k=0

If / B-L, then there exists c E B such that (c, /) 0 1. Since

(c,1)7(1) =- E(c + b,l) = E 0,1) = 7(1), bEB bEB

we have -y(/) = 0, / B-L, proving that

F(N)f (1) = 0, 1 B-

By the corollary to theorem 13.3, Bi = sZ/N, which implies that

r-1

F(N)f (1s) = s E f(k)(k,l)s, 0 <1 < r. k=0

Page 239: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

224 13. Periodization and Decimation

Since (k,1)8 = e2wilk/r, we have

= s E fooe2,riikir,

F(N)f(ts) r-1

k=0

proving the next result.

Theorem 13.6 Suppose that B = rZIN, N = rs and f is B-periodic.

Then F(N)f is Bi—decimated and, on , is given by

F(N)f (0) f (0) F(N)f (s) f(1)

= sF(r) . .

_ F(N)f ((' r — 1)s) f(r —1)

Observe that computing the n-point FT of B-decimated data can be carried out using one r-point FT.

Since the space of B-periodic functions and the space of B-L-decimated functions have the same dimension r, theorem 13.6 implies the next result.

Corollary 13.2 The FT of ZIN maps the space of B-periodic functions isornorphically onto the space of .8i-decimated functions.

Consider a B-decimated function g E L(Z/N). We are still assuming that B=rZIN. By definition,

F(N)g(1) = E g(k)(k,1), 0 <1 < N. kEB Since / E B1 implies that (k, /) = 1 for all k E B, replacing / by / + s in the equation, we have

F(N)g(1± s)=F(N)g(1), 0 <1 < N,

implying that F(N)g is /31-periodic and

F(N)g(1)= E gfrk)e2„iikis, < < 8, k=0

proving the next result.

Theorem 13.7 Suppose that B = rZIN, N = rs and g is B-decirnated. Then F(N)g is Bi-periodic and is given by the matrix formula

F(N)g(0) g(0) F(N)g(1)

— F(s)[ g(r) 1-

F(N)g.(s —1) g((s —1)r)

Page 240: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

13.4 The Ring Zip"i 225

Corollary 13.3 The FT of ZIN maps the space of B-decimated functions isornorphically onto the space of Bi-periodic functions.

The preceding two theorems express the duality on the function spaces deterrnined by the FT. They will serve to give the global structure of sev-eral 'multiplicative' FT algorfthms. We take up this topic in detail in the following chapters.

These theorems can be viewed as the first step in the N-point Cooley-Tukey algorithm corresponding to the divisor r.

13.4 The Ring Zip"'

The results of the preceding section will be applied to the special case of N = , m > 1 and p an odd prime. The group Z pin has a unique maximal subgroup pZIpm and every subgroup of Zip"' has the form pkZ/pm, 0 < k < m. We have

(0) = pniZIpm Cpm-1Z1prn c • •• c pZIpm c Zip'.

For / < k, denote by

L(pk,p1), 0 < k,1 < m,

the subspace of all f E L(ZIpm) satisfying the following two conditions:

• f is pkZ/pm-decimated,

• f is p/Z/pm-periodic.

L(p,pm) is the subspace of all pZipm-decimated functions and L(1,p) is the subspace of all pZ/pm-periodic functions.

By theorem 13.3, the dual of pkZIpm is prn—kZ/pm. Applying theorems 13.6 and 13.7, we have the following result.

Theorem 13.8 The FT of ZIpm maps the subspace L(pk,pi) onto the subspace L(pn",pr"), 0 < k <1 < m.

The subspace L(p, pm-1) is especially important. By theorem 13.8, it is invariant under the FT of Z/pm in the sense that

F(N)L(p,r-1) = L(p,pm-1)

N-1/2F(N) is a unitary operator of L2(Z/pm). Denote the orthogonal complement of L(p,pm-1) in .0(Z/prn) by W:

W = {f E L2(ZIpm): (f,g) = 0, for all g E L(p,pm-1)}.

W satisfies the following two properties:

Page 241: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

226 13. Periodization and Decimation

• L2(ZIpm) = w e L(p,pm-1).

• F(N)W = W.

These properties imply that we can study the action of the FT of L(ZIpm) by independently studying the action on L(p,pm-1) and on W.

The set of functions in L(ZIpm)

{Ek : 0 < k < pm-1},

defined by Eo eo

= (11; 0 ipm-1) e.1 ,

_1 eN-1

is a basis of L(1,pm-1) and the subset

lEpk : 0 < k < Pm-21

is a basis of L(p,pm-1).

Theorem 13.9 If N = pm, m > 2 and p an odd prime, then the matrix of F(pm) with respect to the basis

{Epk : 0 < k < Prn-21

is pF(p').

Proof The function Epk, 0 < k < pm-2, is equal to 1 on the set

S = {pk + rpm-1 : 0 < r < p}

and vanishes otherwise. Set w = e2m/P- and v = e2mi/P- 2 . Since L(p,pm-1) is F(pm)-invariant, F(pm)Ekp vanishes off of pZ/pm. On pZIpm we have

F(pm)Ekp(tp spm-1) = F(pm)Ekp(/p)

p-1

= E r=0 pv/k, 0 < 1,k < pm-2,

implying that

pm — 2

F (pr/b)Ekp = p E v lk v =_ e27rilpm-2

1=o

proving the theorem.

Page 242: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

Problems 227

Problems

1. Prove that if x is a,n additive character of Z/N, then x(1) is an N-th root of unity (without relying on theorem 13.1).

2. Prove that the set of additive characters of Z/N is a group under the product rule:

(XXi)(k) = X(k)X1(k), k E Z/N,

where x and x' are additive characters of Z/N.

3. Prove formula (13.5).

4. Prove formula (13.6) and describe the matrix of F2 relative to the standard basis.

5. Prove that the dual of a subgroup is a subgroup.

6. Write down the collection of all B-periodic functions on Z/N, where N = 6 and B = 2ZI6.

7. Repeat problem 6 with n = 27 and B = 3Z/27.

8. Write down the collection of all B-decimated functions on Z/N, where N = 6 and B = 3Z/6.

9. Repeat problem 8 with n = 27 and B = 9Z/27.

10. Verify F(p3)L(p,p2) = L(p,p2).

Page 243: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)
Page 244: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14 Multiplicative Characters and the FT -

14.1 Introduction

Fix an odd prime p throughout this chapter. For m > 1, consider the subspace

L(p,pm-1) c L(Z Ipm).

In the preceding chapter, we proved that L(p,pm-1) is FV)-invariant and described the action of F(pm) on L(p, pm-1). Denoting the orthogonal complement of L(p, pm-1) in L2(Z/ pm) by W, we have

L(Z/pm) = W ED 14,pm - 1),

and W is F(r)-invariant. We will describe the action of F(pin) on W. Suppose that m > 1 and set U(pm) = U(ZIpm). A multiplicative

character x of Z/pm is a homomorphism

x : U(pm) Cx

from the multiplicative group U(pm) into the multiplicative group C of nonzero complex numbers:

x(ab) x(a)x(b), a, b E U(pm)

The product ab is taken in U(r) and is multiplication mod pm .

Example 14.1 U(7) is a cyclic group of order 6. Set u = e27"16. A multiplicative character x of Z/7 is completely determined by its value on

Page 245: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

230 14. Multiplicative Characters and the FT

the generator 3 of U(7) by the formula

x(3k) = x(3)k, 0 < k < 6.

Since 36 1 mod 7, x(3) is a 6-th root of unity. There exist exactly six multiplicative characters of Z/7 defined by the following table.

Table 14.1 Multiplicative characters of Z/7.

1 3 2 6 4 5 xo 1 1 1 1 1 1

Xi 1 U U2 U3 U4 U5

X2 1 u2 u4 u6 u8 u10

X3 1 U3 u6 u9 u12 u15

X4 1 U4 //8 u12 u16 u20

X5 1 u5 u10 u15 u20 u25

In general, U(p) is a cyclic group of order t = p — 1. Set u = e27"it. A multiplicative character x of Z/p is completely determined by its value on a generator z of U(p) by the formula

x(zk) = x(z)k, 0 < k < t.

Since zt 1 mod p, x(z) is a t-th root of unity. For 0 < / < t, define the multiplicative character xi of Z/p by setting

xi(z) = u .

There exist exactly t multiplicative characters on Z/p given by the following table:

Table 14.2 Multiplicative characters of Z/p.

1 z z2 • - zt-1 xo 1 1 1 1 xi 1 u u2 Ut- 1

xtli I i ut--1 u2(t — i ) u(t — i)(t — i)

Denote by 0(p) the set of multiplicative characters of Z/p. Extend the domain of definition of a multiplicative character x of Z/p to all Z/p by setting x(0) = O. A group multiplication is placed on U(p) by the rule

(sx')(a) = x(a)x1(a), a E Z/p, x, E 0(p).

Since skxi = xic±i with k+1 taken mod p, each generator of U(p) determines a group-isomorphism from U(p) onto U(p).

Page 246: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.1 Introduction 231

el(p) is called the multiplicative character group of Zip. The identity element of U(p) is xo, which is usually called the principal multiplicative character of Z/p.

Denote by ti(p) and ex (p) the vectors of functions of size t given by

The el xi ez

ii(p) -= . , (p) = .

xt—i ezt-i

By table 14.1, ii(p) = F(t)ex (p),

which is the multiplicative analog of the result that F(N) maps the standard basis of L(ZIN) onto the basis (Z/Nr of additive characters.

Suppose that m > 1. U(pm) is a cyclic group of order t = pm-1(p-1). Set u = e'i/t. A multiplicative character x of Zip' is completely determined by its value on a generator z of U(pm) by

x(zk) -= x(z)k , 0 < k < t.

Since zt 1 mod pm, x(z) is a t-th root of unity. There are exactly t multiplicative characters xi, 0 < / < t, defined by

Xi (Z) = U .

Table 14.3 Multiplicative characters of Zip'.

1 z Z2 Zt-1

Xci 1 1 1 1

Xi 1 U U2 Ut-1

Xt-1 1 Ut-1 U2(t-1) u(t-1)(t-1)

Denote by I-1(pm) the set of multiplicative characters on Z/pm. Extend the domain of definition of a multiplicative character x of Zip' by setting x(a) = 0 whenever a 0 U(pm). A group multiplication is placed on U(pm) by the rule

(xx')(a) = x(a)x'(a), a E Zlpm , x, x' E 0-(pm)

Each generator of U(pm) determines a group-isomorphism between CI(pm) and U(pm). U(pm) is called the multiplicative character group of Z/pm. The identity element of U (pm) is xo, which is usually called the principal multiplicative character of Z/p"1.

Page 247: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

232 14. Multiplicative Characters and the FT

The definitions of ii(p) and ex (p) easily extend to definitions of ii(pm) and ex (pm), and we have

ii(pm) = F(t)ex (pm)

There are several important formulas involving multiplicative characters that will be used repeatedly throughout this work.

Theorem 14.1 For x E (I(pm),

E x(k) = {t' x = x°' 0, otherwise.

kEU (pm)

Proof The case x = xo is trivial. Suppose that x xo and take ko in U(pm) such that

x(ko) 1.

As k runs over U(prn), kok runs over U(prn) and we have

E x(ko k) = E x(k). IcCEJ (pm) kEU (pm)

Since x(kok) x(ko)x(k), we also have

E x(kok) =x(ko) E x(k), lecU(pm) kEU(pm)

and the theorem follows.

Denote by (f,g) the inner product of two functions f, g E L(ZIpm),

(f , g) = E f (a)g* (a).

aEZ/pm

Corollary 14.1 For two distinct multiplicative characters x, y of Zip',

(s,y) = O.

(x,x) = t.

Corollary 14.2 0-(pm) is an orthogonal subset of L2(Z/pm)

14.2 Periodicity

14.2.1 Periodic Multiplicative Characters

Subspaces of L(ZIpm) will be constructed from sets of multiplicative characters satisfying certain periodicity conditions. Z/p has only trivial

Page 248: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.2 Periodicity 233

subgroups, so periodicity plays no role. However, we will distinguish two subsets of U(p). Set

'17.4) = Ixol,

7(p) = (I(p)— fro} (set difference).

x E 177(p) is called a primitive multiplicative character of Z/p. Suppose that rn > 1. For 1 < k < m, define

0-(pm,pk) = Ix E (I(pm): x is pkZIpm — periodic}.

Cf(pm,pk) is a subgroup of (1(pm), and we have

ue,p) c u(prn,p2) c • c pnl) = (viz)

Form the set differences,

1:74m,P) = Nprn,p) — Ixo f/(prn pk) = Nprn. pk) _ 6-(pm, pk —1), 2 < k < m.

(I(prn) is the disjoint union

Nprn) = xo} uln=i c/(prn,pk)

The set

1-7(Pm) = f/(Pm,Pm)

is called the set of primitive multiplicative characters on ZIpm. 174"1) is the set of multiplicative characters that are not periodic with respect to any subgroup of Z/pm.

Example 14.2 Order U(9) exponentially relative to the generator 2,

1, 2, 4, 8, 7, 5,

and form the 3 x 2 array 1 2 4 8 . 7 5

Two elements of U(9) are equal mod 3 if and only if they Hein the same column of the array. A multiplicative character x of Z/9 is in U(9, 3) if and only if x is constant on the columns of the array. This will be the case if and only if x(4) = 1. Consider xi E U(9) relative to the generator 2. Since

xi (4) = e2wit/3,

xi is in (/(9, 3) if and only if 3 / and we have

0-(9,3) = Ixo,x3}.

Page 249: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

234 14. Multiplicative Characters and the FT

Example 14.3 Order U(27) exponentially relative to the generator 2 and form the 9 x 2 array

.

Two elements in U(27) are equal mod 3 if and only if they lie in the same column of the array. x is in U(27,3) if and only if x is constant on the columns of the array. This will be the case if and only if x(4) = 1. Consider

E U(27) relative to the generator 2. Since

xi (4) = e27`i119,

xi is in 0(27,3) if and only if 9 I / and we have

0 (27, 3) = {xo, xs }.

Form the 3 x 6 array,

1 2 4 8 16 5 10 20 13 26 25 23 .

19 11 22 17 7 14

Two elements in U(27) are equal mod 9 if and only if they lie in the same column of the array. x is in U(27,9) if and only if x is constant on the columns of the array. This will be the case if and only if x(10) = 1. Consider

E U(27) relative to the generator 2. Since

xi (10 ) = e i / 3

xi is in (/(27, 9) if and only if 3 I / and we have

0(27,9) = x3, xs, xs, xi2, xis/.

Suppose that m > 1. Throughout this section, set t = pm-1(p — 1), the order of U (pm). Fix a generator z of U(pm). Denote by xi, 0 < / < t, the multiplicative characters of Zip' based on z. We will describe U(pm , pk).

Set tk — pk-1(p 1), the order of U(pk). Since z mod pk generates U(pk),

tk — 1 z = mod pk

with tk the smallest positive power of z having this property.

1 2 4 8

16 5 10 20 13 26 25 23 19 11 22 17

7 -

14

Page 250: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.2 Periodicity 235

Form the Pm- k X tk array,

z 1 Z • Ztk —1

Ztk

k Zt-1

with r = pm-k - 1. Two elements in U(prn) are equal mod pk if and only if they lie in the same column of the array. x is in U(pm,pk) if and only if x is constant on the columnsof the array. This will be the case if and only if x(z") = 1. Consider xi E U(pm). Since

= e27Tiiipm-k Xi (Ztk )

xi is in (1(Pm Pk) if and only if Pm- k I 1, proving the following result.

Theorem 14.2

1-1(pm,pk) = {xi :0 </<tandpm-k1/1.

Corollary 14.3

1'74m) =- {x/ : 0 < / < t and (1,p1n) =

cl(pm,pk) = {xi : 0 < / < t and (/,pm) =

Example 14.4 Relative to any generator U(81),

0(81,3) = IXO, X27}

Cr (81 ,9) = {X0, X9, X18, X271 X36, X 451

(81 , 27) = X3, X6, X9, X12, • • - X51}

and

fr(81) = {xi, X2) X 4, - • , X52, X53}

fr(81 , 27) = { X0, /3, X6, • • • X39, X42}

fir' (81, = {X13, X9, X18, X27, X36, X45}

T7(81,3) = ISO, X27} •

14.2.2 Periodization and Decimation

Consider the ring-homomorphism

7r(pm,pk) : Z/pm -+ Z/pk

defined by 7r(pm,pk)a = a mod pk, a E Z/pm. For f E L(Z/pk), define

7*(Pm,Pk)f C L(Z/Pm)

Page 251: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

236 14. Multiplicative Characters and the FT

by 7* (pm , pk) f (a) = f (a mod pk), a E Zip'.

The periodization map

ir* (pm , pk) : L(ZIpk) L(ZIpm)

is a linear isomorphism of L(ZIpk) onto the subspace of all pk Z / pm-periodic functions in L(Z/pm). Since r(pm,pk) restricts to a group-homomorphism of U(pm) onto U(pk), we have the next result.

Theorem 14.3 The periodization map 71-* (pm ,pk) restricts to a group isomorphism

7r.i. (pm, pk Npk) ErT(pm ,,pk)

and to a bijection from 1-'7(pk) onto I- 7(pm ,pk).

= e27rok, Throughout we fix a generator z of U(pm) and reference all dependent

constructions with respect to z. Set u = e'zit and uk where

t = pm-1(p — 1) and tk _= pk-1(p _._ 1). Denote by 4Pk), 0 < / < tk, the multiplicative characters of U(pk) relative to the generator z mod pk of

U(pk). Set xt = 4P—).

Example 14.5

71*(27, 3)

7*(27, 9)

49)

X(9) 1

X(9) 2

X 3(9)

X4(9)

X(9) 5

Xg

X3

X6

Xg

X12

''' ....1.5

(3) X 0

X(3) 1

Xg

'^ ...,6

Example 14.6

7*(81,3)

7*(81, 9) 7*(81, 27) X0(9)

x(9) i (9)

X 2

X(9) 3

X(9) 4 (9)

X5

X0

X6

X 18

'" •,27

X36

X45

(27) X0

X(27) 1

X2(27)

• •

(27) X 17

X0

X3

X6

.

.

X51

X0(3)

(3) X i

XO

X27

Since 7* (pm,pk)4Pk)(z) = =

Uipm-k

1

we have the next result.

Page 252: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.3 F(p) of Multiplicative Characters 237

Theorem 14.4

rrt k\ (Pk) ,p = xipm-k, 0 <1 < tk•

Consider the group-homomorphism -

cr(Pm,Pk) Z/Pk Z/Prn

given by o-(prn,pk)a = pni-ka, a E Zlpk. For f E L(ZIpk), define

u*(Pm ,Pk)f c L(z/r)

by setting o-* (pm , pk) f = 0 off of Pm-kZ/pm and

o-* (pm , pk) f (pm- k a) = f (a), a c pk

The decimation map

o-* (pm ,pk) : L(ZIpk) —> L(ZIpm)

is a linear isomorphism of L(ZIpk) onto the subspace of all pn" Z/prn-decimated functions in L(ZIpm). The decimation map will be used to describe the FT of multiplicative characters in the following sections.

For 1 < k < rn,, the set 1 + pkZIpm

is a subgroup of U(pn1). In fact,

1 +pkZIpm = {a EU(pm): 71-(pm,pk)a =1}.

If x E 1.7(pm,pk), then x(1 + pkZIpm) = 1. The converse also holds.

Theorem 14.5 X E 0(pm,pk) if and only if x(1+ pkZIpm) = 1.

Proof Suppose that x(l+pkZIpm)= 1. For a E U(prn) and b E pkZ/pm,

x(a + b) = x(a(1+ a-lb))

= x(a)x(1+ a-lb)

= x(a),

proving the theorem.

14.3 F(p) of Multiplicative Characters

Throughout this section, fix a generator z of U(p) and set v = e27"779 and u e27ri/(p-1) .

Page 253: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

238 14. Multiplicative Characters and the FT

Theorem 14.6

F(p)xo(a) = {P 1' a = °' -1, otherwise.

Proof By definition,

p— 1

F(p)xo(a) =- E Vak

k=

If a = 0, then the sum on the right is p - 1. If 0 < a < p, then, since

p-1 E vak = 05

k=0

F(p)so(a) = -1, completing the proof.

Theorem 14.7 If x E c7(p), then

F(p)x = Gp(x)x*

where p-1

G p(x) = F(p)x(1) = E Z(k)IJk.

k=

Proof By definition,

p-1

F(p)x(a) = E x(k)vak = E x(k)vak.

k=1 xEU(p)

If a = 0, then

F(p)x(0) = E x(k) = O.

kEU(p)

Suppose now that 0 < a < p. Since a is invertible mod p, we have

E x(k)vak = E x(c1-1 k)vk kEU(p) kEU(p)

= x(a-1) E X(k)Vk

kEU (p)

= Gp(x)x* (a),

using x* (a) = x(a').

G p(x) is called the Gauss sum of the multiplicative character x of Z p. Since p1/2F(p) is unitary, we have

(F(p)x,F(p)x) = p(x , x) = IGp(x)12 (x* , x*) ,

proving the next corollary.

Page 254: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.4 F(r) of Multiplicative Characters 239

Corollary 14.4 If x E 1-/.(p), then IGp(x)I2 = p.

For 0 < / < p — 1,

p-2

Gp(Xi) = E xi(zk)v-k k=0

p-1 lk zk = E 'IL V ,

k=0

proving the next result.

Theorem 14.8

vz Cp(xl) F(p —1)

VZP-2 Gp(xp_2)

Recall the definition of the Winograd core C(p) as the (p — 1) x (p — 1) skew-circulant matrix having 0-th row,

v,vz,...

Since IGp(x)I2 = p whenever x E 1-7(p), we have the next result.

Corollary 14.5 C(p) is invertible and

- —1 -

Gp(xi) F(p —1)C(p)F(p —1) = diag

_ Gp(xp_2) _

Example 14.7 G3(xi) = w _ w2, w = eri/3

Example 14.8

G5 (x i ) = w — w4 +02 _ w3 ) ,

G5 (X2 ) = w w4 — (w2 w3)

G 5(X 3) = W — w4 — j(w2 w3) w = e27ri / 5

14.4 F(pm) of Multiplicative Characters

44.1 Primitive Multiplicative Characters

Suppose that rn > 1. Throughout this section we fix a generator z of U(pm) and set t = pm-1(p— 1), w = 62'/P— and u = e2"/t. First we compute the FT of primitive multiplicative characters of Zip'.

Page 255: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

240 14. Multiplicative Characters and the FT

Theorem 14.9 If x E V(pm), then

F(r)x = Gpm(x)x* ,

where Gy.(x) = F(pm)x(1).

Proof Suppose that a E U(r). Since a in invertible mod pm, we have

F(pm)x(a) = E X (k)Wak

kEU (pm)

— E x(a-lk)wk IcCU (pm)

= x(a-1) E X(k)Wk

kEU (pm)

= Gpm(x)x* (a).

Suppose that a ce U(r) and write

a = pa', 0 < a' < pm-1.

Since x cl 0-(pm ,r-1), x(c) 1 for some c E 1 + pm-1ZIpm. From pc ...—_- p mod pm, we have 01' = wpa'b _ wpca'b _ wabc, b E U(pm) and

F(r)a = E x(b)wabc

b.u(,)

— E s(c_iowab

bEU (p"9

= x(c-1)F(pm)(a).

Since x(c) 1 implies that

F(pm)(a) = 0,

the proof is complete.

Gpm(x) is called the Gauss sum of the primitive multiplicative character x of Zip'. Arguing as above, we have the following corollary.

Corollary 14.6 If x E 1-7(pm), then 1Gpm(x)12 = Pm

14.4.2 Nonprimitive Multiplicative Characters

Consider x E V(r,pk), 0 < k < m. By theorem 14.3, the periodization map 71-*(pm,pk) defines a bijection from V(pk) onto V(pm,pk) and we can write

x = r*(Pm,Pk)Y, Y E 17(Pk).

Page 256: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.4 F(pm) of Multiplicative Characters 241

The decimation map ce(pm,pk) isomorphically maps L(ZIpk) onto the spa,ce of pn"Zipm-decimated functions. Define gx L(ZIpm) by

= cf*(pm,pk)g.

gx vanishes off of pm-kU(pm) and on pm- k U (pm),

g x(pm- k a) = x(a), a E U (pm).

Theorem 14.10 If x E 1-7(pm,pk), 1 < k < m, then

F(pm)x = pm-kGpk(g)gx.,

where x = 7r* ,Pk)Y, Y E fl(Pk)•

Proof Since x is pkZ/pm-periodic, F(pm)x is pm-kZ/pm-decimated. By theorem 13.6,

F(pm)x(pni-kl) = pn"F(pk)y(1), 0 <1 < pk

From theorem 14.9,

F(pk)y(1) = G pk (y)y* (1), 0 < / < .

Since x(/) = y(/), 0 < / < pk , we have

F(r)x(prn-kl) = pm-kGrk(g)gx.(prn-k1),

completing the proof.

In general, for x E 0(pm), define

Gpm(x)=F(r)x(1).

For 0 < / < t,

Gpm (xi) = E Ulk Wzk

k=0

implying that [ Gpm (x0) 1 w

[ w.' I . Gp,,,(xi) = F(t)

Gpm(st-i) wzt-i

By theorem 14.10 and corollary 14.6, we have

Gpm (xi) 0, p yt

Gpm(st)= 0, pll.

Page 257: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

242 14. Multiplicative Characters and the FT

Since the Winograd core C (pm) is the skew-circulant matrix having 0-th TOW,

—1 w7 wz7 7 7 7 7 wzt

F(t)C(pm)F(t) = diag(Gpm(xt))o<t<t•

Consider the principle multiplicative character so of Zip'. Since ro is pZ/pm-periodic, F(pm)xo is pm-1 ZIpm-decimated. Define the pm-1Z/pm-decimated function go by setting

go(ipm—i) _ —(P — 1), / = 0, 1, 1 < / < p.

Theorem 14.11 F(pm)xo = —Prn lgo •

Proof Since ez7r4b/p F(r)x0(/pm-1) = E

bEU (pm)

the theorem follows.

14.5 Orthogonal Basis Diagonalizing F (p)

The formulas for the FT of the multiplicative characters will be used to de-compose W into the orthogonal direct sum of two-dimensional FT-invariant subspaces on which FT is simple to describe.

Fix a generator z of U(p). Set t = p — 1 and write

0-(P) — {xo, xi, • • • , xt—i}, t = p— 1,

with respect to z. Since the function eo E L(ZIp) taking the value 1 at the point 0 and vanishing on U(p) is orthogonal to 0(p), the set

is orthogonal and by dimension is a basis of L2(ZIp). Denote by W(0) the subspace spanned by

b(0) = leo, t-1/2sol.

Since F(p)eo = eo + xo

and F(P)xo = teo — xo,

W(0) is F(p)-invariant and the matrix of F(p) with respect to the basis b(0) is given by

[ 1 t1/2 Pi/2m(0) t112 --1

Page 258: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.5 Orthogonal Ba.sis Diagonalizing F(p) 243

For 1 < k < t/2, denote by W(k) the subspa.ce spanned by

b(k) = It-1/2xk,t-1/2p-1/2F(P)xkl.

Since F(p)xk is a constant multiple of ek` =- xp_k, b(k) is orthogonal. In general,

F2(p)f (a) = pf (-a), f E L(ZIp), a E Zlp,

implying that

F2(P)xk = PXk(-1)Xk = P(-1)k X k •

The subspace W(k) is F(p)-invariant and the matrix of the F(p) with respect to b(k) is p1/2M(k), where

M(k) = [ ? (-01)k ] .

Denote by W(t/2) the one-dimensional subspace spanned by

b(t/2)= ft-1/2st/2}.

Since 4/2 = xt/2, W(t/2) is F(p)-invariant and F(p) acts on W(t/2) by the scalar multiple

[Gp(st/2)] -

Gp(xt/2) is called the Legendre symbol mod p and is given by

p1/2 p = 1 mod 4,

P1/2M(t/2) = GP(xt/2) = { p1/2i, p 3 mod 4,

implying that 1 p 1 mod 4

M(t12)= { i,' p -=-- 3 mod 4.

Since leo, xi,— , xt} is an orthogonal basis of L2(Z/p), the set

b(0) U b(1) U - - - U b(t/2)

is an orthonormal basis of L2(Z/p) and the spaces W(k), 0 < k < t, are pairwise orthogonal.

Theorem 14.12 L2(Z/p) is the orthogonal direct sum

ti2

L2 (Z lp) = E eW (k) k=0

of F(p)-invariant subspaces W(k). The rnatrix of F(p) relative to the orthonorrnal basis

b(0) U b(1) U • • • Ub(t/2)

Page 259: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

244 14. Multiplicative Characters and the FT

is the matrix direct sum t/2

p1/2 E em(r),

k=0

where 1 t1/2 1

M(0) = P-1/2 t1/2 _1 _I

r (-1>k M(k) = 1 0

, 1 < k < t/2,

1 p 1 mod 4, M(t/2)=

p 3 mod 4.

We will diagonalize the matrices M(k), 0 < k < t/2, by orthogonal matrices in the sense given by the formulas below. Set

0+=_ ri

L -1 '

1 [ 1 i 0 - —

i 1

0+ and 0- are orthogonal matrices. For 1 < k < t/2, direct computation shows that

0+ M(k)(0±)-1 = [ 01 _Oil, k even,

M(k)(0-)-1 = [ k odd.

The orthogonal matrix diagonalization of M(0) is more difficult to write. Set

1 a = (-) 2 ,

b = )I

We can rewrite M(0) as

M(0) = [ab ba]

Set [-N./E a v±?,---pa

Oo = ÷,a. 0+ a] •

Page 260: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

14.6 Orthogonal Basis Diagonalizing F(r) 245

By direct computation,

oom(o)cV = [ 1 0 0 —1

Form the matrix direct sum

t/2

0 = EED0k, k=0

where, for 1 < k < tI2,

{0+ , k even, Ok = 0-1, k odd,

Ot12 = 11.

By the preceding discussion, 0 is an orthogonal matrix diagonalizing the matrix given in theorem 14.12. Applying 0 to the basis

b(0) U b(1) U • • U b(tI2),

we construct an orthonormal basis diagonalizing F(p).

14.6 Orthogonal Basis Diagonalizing F(pnl)

14.6.1 Orthogonal Basis of W

Suppose that m > 1. Fix a generator z for U(pm) throughout this section. Set t = pm-1(p — 1) and u = e2'ilt. Observe that U(pm) is the disjoint union

O(Pm) = qPm) U O(Prn,Pni-1).

We require the following result.

Theorem 14.13 For x E fi(pm) and y E 0(pr n,pm-1), we have x x* and x y*.

Proof Write x = xi and y = sk relative to z. Since x E 17(pm) and y E U(pm , pm- 1), we have p yt and plk. If x = x*, then U2/ = 1, implying pll, a contradiction. If x = y*, then Iti+k = 1 and pl(1 + k), which, since plk, implies that pl/, a contradiction.

Denote by W the orthogonal complement of L(p,prn-1) in L(ZIpm). Since U(pm) is supported on U(pm) and L(p,prn-1) is supported on pZipm,

(1(prn) c W.

The F(r)-invariance of W implies that

F(pm)0(r) = {F(r)x : x (r)} C W.

Page 261: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

246 14. Multiplicative Characters and the FT

Theorem 14.14 (/(pm) UF(pm)Npm, pm-1) is an orthogonal basis of W . The linear span of V (pm) and the linear span of

f (Pm Pm-1) U F (Pm )0 (Pm Pm —1)

are F(pm)-invariant.

Proof Since (pm , pm-1) is orthogonal, F(pm)(/(pm,pm-1) is orthog-onal. (1(pm) and F(pm)0(pm,pm-1) have disjoint supports, implying that

O(Pm) U F(Pm)0(PmIr-1)

is orthogonal. Since the order of this set is

prri 1 (p 1) ± pni 2(p 1) pin prn

the same as the dimension of W, the first part of the theorem is proved. By theorem 14.9, Ln(V(pm)) is F(pm)-invariant. Since

F(N)2 f (a) = N f (—a)

and

x(—a)= x(-1)x(a), x E (I (pm) ,

we have that the subspace spanned by

Nr,Pm-1) u CP171)0(Pm,Pm-1)

is F(pm)-invariant, completing the proof of the theorem.

14.6.2 Orthogonal Diagonalizing Basis

For x E Cf(pm), denote by W(x) the linear span of the set

b(x) = fx,p—ml2F(pm)x}.

Theorem 14.15 For x E (I(pm), b(x) is an orthogonal basis of the F(prn)- invariant subspace W(x) and the matrix of F(pm) with respect to b(x) is

M (x ) = pm/2 [ 0 x(- 1)

1 0 •

Proof Suppose that x E fi(pm). Since x x*, the orthogonality of Npm) implies b(x) is orthogonal.

Suppose that x E U(pm,pm-1). The support of x is disjoint from the support of F(pm)x and again b(x) is orthogonal.

The proof follows from F2(pm)x = pmx(-1)x.

Page 262: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

References 247

Theorem 14.16 For distinct x, y (1.(pm), W (x) and W(y) are orthogonal unless x E V (pm) and y = x*.

Proof Since x and y are orthogonal, F(pm)x and F(pm)y are orthogonal. Suppose that x E V (pm). F(pm)x is a multiple of x* which implies that y is orthogonal to F(pm)x-unless y = x*. If y E V(pm), then, as just argued, x is orthogonal to F(pm)y unless y = x*. If y E U(pm,pm-1), then x is orthogonal to F(pm)y since their supports are disjoint. The theorem holds for x E V(pm).

Suppose that x, y e 0.(pm,pm-1). Then x is orthogonal to F(pnly and y is orthogonal to F(pm)x since these functions have disjoint supports, completing the proof.

Since x E fl(pm) implies that x x*, we can select a subset of i/‘.±(prn) of V(pm) such that:

• x E 1-7+(pm) implies that x* 4;1 I-4(pm).

• x E f./(pm) implies that x E 1^7±(pm) or x* E 14(pm).

Set

B + = 1-T+ (Pm ) U NPin Pm -1).

Since the order of ITT(pm, pm-1) is pm-2(p - 1) and the order of f/".+(pm) is pm-2(p _ 1)2 / z we have that the order of B+ is (pm - pm-2)/2. The dimension of the linear span of UxEB+b(s) is pin - pm-2, which is the same as the dimension of W, proving the next result.

Theorem 14.17 W is the orthogonal direct sum,,

w= E ew(x),

and the matrix of F(pm) with respect to the basis UxEB+b(x) is the matrix direct sum,

prn/2 E e [ x(-1) o

sEB+

We can argue as before to find an orthonormal basis diagonalizing the restriction of F(pm) to W. Since the restriction of F(pm) to L(p, pm-1) has a matrix representation pF(P'), we can use an induction argument to find an orthonormal basis of L(Z/pm) diagonalizing F(pm) [5].

References

[1] Tolimieri, R. "Multiplicative Characters and the Discrete Fourier Transform", Adv. in Appl. Math. ,7, 1986, pp. 344-380.

Page 263: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

248 14. Multiplicative Characters and the FT

[2] Auslander, L., Feig, E. and Winograd, S. "The Multiplicative Com-plexity of the Discrete Fourier Transform", Adv. in Appl. Math. , 5, 1984, pp. 31-55.

[3] Rader, C. "Discrete Fourier Transforms When the Number of Data Samples is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.

[4] Winograd, S. Arithmetic Cornplexity of Computations, CBMS Re-gional Conf. Ser. in Math., Vol. 33, Soc. Indus. Appl. Math., Philadelphia, 1980.

[5] Tolimieri, R. "The Construction of Orthogonal Basis Diagonalizing the Discrete Fourier Transform", Adv. in Appl. Math. , 5, 1984, pp. 56-86.

Problems

Find a generator z for the unit group U of Z/11, and order U exponentially relative to z.

2. Define the ten multiplicative characters of Z/11.

3. Find a generator z for the unit group U of Z/13, and order U exponentially relative to z.

4. Define the 12 multiplicative characters of Z/13.

5. Find a generator z for the unit group U of Z/32, and order U ex-ponentially relative to z. Define the six multiplicative characters of Z/32.

6. Find a generator z for the unit group U of Z/52, and order U ex-ponentially relative to z. Define the 20 multiplicative chara.cters of Z/52.

7. Determine the Gauss sums Ck = Gi(xk), k = 1, 2, 3, 4, 5 for p

Page 264: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

15 Rationality

Multiplicative character theory provides a natural setting for developing the complexity results of Auslander, Feig and Winograd [1]. The first reason for this is the simplicity of the formulas describing the action of FT on multiplicative characters. We will now discuss a second important property of multiplicative characters. In a sense defined below, the spaces spanned by certain subsets of multiplicative characters are rational subspaces. As a consequence, we will be able to rationally manipulate the FT matrix F(pm) into block diagonal matrices where each block action corresponds to some polynomial multiplication modulo a rational polynomial of a special kind. This is the main result in the work of Auslander, Feig and Winograd. Details from the point of view of multiplicative character theory can be found in [2].

Although these results proceed in a straightforward fashion, the notation becomes complicated. After some preliminary general definitions we will derive in detail some examples. A function f E L(Z IN) is called a rational vector if the standard representation of f is an N-tuple of rational numbers. A subspace X of L(Z /N) is called a 'rational subspace if X has a basis consisting solely of rational vectors. Such a basis is called a rational basis of X.

Page 265: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

250 15. Rationality

Throughout this chapter we will use the following notation. For a vector a of size N, we set

0 0 - • aN_i 0

skew-diag(a) =

0 al ao • • • 0

15.1 An Example: 7

Set u = e2"i/6. Relative to the generator 3 of U(7), the set of primitive multiplicative characters

fi(7) = {xi, X2, X3, X4, X51

is given by the following table:

0 1 3 2 6 4 5

xi 0 1 u u u' u x2 0 1 u2 U4 u6 u8 u10

X3 0 1 U3 U6 u9 u12 u15

X4 0 1 U4 U8 u12 u16 u20

X5 0 1 U5 u10 u15 u20 u25

A rational basis will be constructed for Ln(c (7)). Define the set of functions

R(7) = rk : 0 < k < 5}

by

=- es, — es. (15.1)

These functions are given by the following table:

0 1 3 2 6 4 5 r0010000-1 ri001000-1 r2000100-1 r3000010-1 r4000001-1

At the point 0, each of these functions takes on the value O. We will now show that, for each x E V(7),

4

= E x(33)r). (15.2) 3=0

Page 266: Algorithms for Discrete Fourier Transform and Convolution (Signal Processing and Digital Filtering)

15.1 An Example: 7 251

By definition, the formula holds at the points

1, 3, 2, 6, 4.

At the point 5 35 mod 7, we have

4 4

E x(3-1)7.3(5) = - Ex(3j). i=o i=o

Since x xo, 5

Ex(3j) = 0, i=o

implying that (15.2) holds at the point 5. Using (15.2) and xk(33) = uki, we have

xi -

[

-ro x2 ri

ci(7) = x3 = X(7) r2 = X(7)R(7), x4 r3 X5 _ - r4

where [1 U U2 U3 U4

1 U2 u4 u6 u8

X(7) = 1 u3 u6 u9 u12 . 1 U4 u8 u12 u16

1 U5 u10 u15 u20

The matrix X(7) is a Vandermonde matrix and is nonsingular. It follows that R(7) is a rational basis of Ln(V(7)) and X(7) is the change of basis matrix.

Consider the restriction of $F(7)$ to the span of $\hat V(7)$. Since
$$F(7)\chi_j = G_7(\chi_j)\,\bar\chi_j, \qquad 1 \le j \le 5,$$
the matrix of $F(7)$ with respect to $\hat V(7)$ is the skew-diagonal matrix
$$G(7) = \operatorname{skew-diag}\begin{bmatrix}G_7(\chi_1)\\ G_7(\chi_2)\\ G_7(\chi_3)\\ G_7(\chi_4)\\ G_7(\chi_5)\end{bmatrix}$$
and the matrix of $F(7)$ with respect to $R(7)$ is
$$X(7)G(7)X(7)^{-1}.$$


Completing $R(7)$ to the rational basis
$$e_0,\quad e_0 + \chi_0,\quad R(7)$$
of $L(\mathbb{Z}/7)$, the matrix of $F(7)$ relative to this rational basis is the matrix direct sum
$$\begin{bmatrix}0 & 7\\ 1 & 0\end{bmatrix} \oplus \left[X(7)G(7)X(7)^{-1}\right]. \tag{15.3}$$

Two matrices $A$ and $B$ are called rationally related if
$$A = Q_1 B Q_2,$$
where $Q_1$ and $Q_2$ are rational nonsingular matrices. In classical complexity theory, rational multiplications are free and rationally related matrices have the same multiplicative complexity. In particular, $F(7)$ is rationally related to (15.3) and the multiplicative complexity of $F(7)$ is equal to the multiplicative complexity of $X(7)G(7)X(7)^{-1}$.
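The rational relation can be illustrated numerically. The sketch below (not from the text) assumes the unnormalized transform $F(7)[a,b] = w^{ab}$ with $w = e^{2\pi i/7}$, which is consistent with $F(7)\chi_j = G_7(\chi_j)\bar\chi_j$; it forms the completed rational basis and confirms the direct-sum structure of (15.3).

```python
# Sketch: F(7) in the completed rational basis {e_0, e_0 + chi_0, R(7)}
# (assumed conventions: F(7)[a, b] = w^{ab}, w = e^{2 pi i/7}, generator z = 3).
import numpy as np

p, z, w = 7, 3, np.exp(2j * np.pi / 7)
F = np.array([[w ** (a * b) for b in range(p)] for a in range(p)])

e0 = np.eye(p)[0]
ones = np.ones(p)                                  # e_0 + chi_0 (the constant function 1)
pts = [pow(z, j, p) for j in range(p - 1)]         # 1, 3, 2, 6, 4, 5
R = []
for k in range(p - 2):
    r = np.zeros(p); r[pts[k]] = 1; r[pts[-1]] = -1
    R.append(r)

B = np.column_stack([e0, ones] + R)                # rational basis as columns
M = np.linalg.inv(B) @ F @ B                       # matrix of F(7) in this basis
print(np.round(M[:2, :2].real, 6))                 # [[0, 7], [1, 0]]
print(np.allclose(M[2:, :2], 0), np.allclose(M[:2, 2:], 0))   # True True: direct sum with a 5 x 5 block
```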

15.2 Prime Case

The methods used in the example extend in a straightforward manner to any prime $p$. Take a generator $z$ of $U(p)$. Set $t = p - 1$ and $u = e^{2\pi i/t}$. Consider the set of primitive multiplicative characters
$$\hat V(p) = \{\chi_k : 1 \le k < t\}.$$
Define
$$R(p) = \{r_k : 0 \le k \le t-2\}$$
by setting
$$r_k = e_{z^k} - e_{z^{t-1}}, \qquad 0 \le k \le t-2.$$

Theorem 15.1 $R(p)$ is a rational basis of the span of $\hat V(p)$.

Proof We claim that, for $\chi \in \hat V(p)$,
$$\chi = \sum_{j=0}^{t-2}\chi(z^j)\, r_j.$$
The expansion holds at $z^k$, $0 \le k \le t-2$, by definition. We must show that
$$\chi(z^{t-1}) = \sum_{j=0}^{t-2}\chi(z^j)\, r_j(z^{t-1}) = -\sum_{j=0}^{t-2}\chi(z^j).$$
Since
$$\sum_{j=0}^{t-1}\chi(z^j) = 0,$$
the claim is proved. Placing $\chi = \chi_k$, $1 \le k < t$, in the expansion, we have
$$\chi_k = \sum_{j=0}^{t-2}u^{kj}\, r_j,$$
which in matrix form can be written as
$$\hat V(p) = X(p)R(p),$$
where
$$X(p) = \begin{bmatrix}1 & u & \cdots & u^{t-2}\\ 1 & u^2 & \cdots & u^{2(t-2)}\\ \vdots & & & \vdots\\ 1 & u^{t-1} & \cdots & u^{(t-1)(t-2)}\end{bmatrix}.$$
Since the Vandermonde matrix $X(p)$ is nonsingular, $R(p)$ is a rational basis of the span of $\hat V(p)$.

The matrix $X(p)$ is the change of basis matrix between the rational basis $R(p)$ and $\hat V(p)$. Since
$$G(p) = \operatorname{skew-diag}\left[G_p(\chi_k)\right]_{1 \le k < t}$$
is the matrix of $F(p)$ with respect to $\hat V(p)$, we have the next result.

Theorem 15.2 The matrix of $F(p)$ with respect to $R(p)$ is $X(p)G(p)X(p)^{-1}$.

Completing $R(p)$ to the rational basis
$$R = \{e_0,\ e_0 + \chi_0,\ R(p)\},$$

we have from Theorem 14.3 the next result.

Theorem 15.3 The matrix of $F(p)$ relative to the rational basis $R$ is the matrix direct sum
$$\begin{bmatrix}0 & p\\ 1 & 0\end{bmatrix} \oplus \left[X(p)G(p)X(p)^{-1}\right].$$

It follows that $F(p)$ is rationally related to the matrix in Theorem 15.3 and has multiplicative complexity equal to the multiplicative complexity of $X(p)G(p)X(p)^{-1}$.

A linear isomorphism $P$ of $L(\mathbb{Z}/N)$ is called a permutation if, relative to the standard basis, $P$ is a permutation matrix. The skew-diagonal matrix


$G(p)$ can be replaced by a diagonal matrix by introducing the permutation transformation $P$ defined by
$$P(e_0) = e_0,$$
$$P(e_{z^k}) = e_{z^{-k}}, \qquad 0 \le k < t = p-1.$$
Since
$$P\chi_k = \bar\chi_k, \qquad 0 \le k < t,$$
we have the next result.

Theorem 15.4 The matrix of $PF(p)$ relative to the rational basis $R$ is the matrix direct sum
$$\begin{bmatrix}0 & p\\ 1 & 0\end{bmatrix} \oplus \left[X(p)D(p)X(p)^{-1}\right],$$
where $D(p) = \operatorname{diag}\left[G_p(\chi_k)\right]_{1 \le k < t}$.
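A small numerical check of the eigenrelation behind Theorem 15.4 (a sketch, not from the text; the transform convention $F(p)[a,b] = w^{ab}$ is assumed, and the generator is found by brute force): after applying $P$, every nonprincipal character becomes an eigenvector of $PF(p)$ with eigenvalue the corresponding Gauss sum.

```python
# Sketch: P F(p) chi_k = G_p(chi_k) chi_k for p = 11
# (assumed conventions: P e_0 = e_0, P e_{z^k} = e_{z^{-k}}, F(p)[a, b] = w^{ab}).
import numpy as np

p, w = 11, np.exp(2j * np.pi / 11)
z = next(g for g in range(2, p) if len({pow(g, j, p) for j in range(p - 1)}) == p - 1)
t = p - 1
u = np.exp(2j * np.pi / t)

F = np.array([[w ** (a * b) for b in range(p)] for a in range(p)])
P = np.zeros((p, p))
P[0, 0] = 1
for k in range(t):
    P[pow(z, -k, p), pow(z, k, p)] = 1             # P e_{z^k} = e_{z^{-k}}

ok = True
for k in range(1, t):
    chi = np.zeros(p, dtype=complex)
    for j in range(t):
        chi[pow(z, j, p)] = u ** (k * j)
    gauss = sum(chi[a] * w ** a for a in range(p))  # G_p(chi_k)
    ok &= np.allclose(P @ F @ chi, gauss * chi)     # P F(p) chi_k = G_p(chi_k) chi_k
print(ok)                                           # True
```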

15.3 An Example: $3^2$

Set $u = e^{2\pi i/6}$. Relative to the generator 2 of $U(9)$, consider the set of primitive multiplicative characters
$$\hat V(9) = \{\chi_1,\ \chi_2,\ \chi_4,\ \chi_5\}.$$

Define the set of functions
$$R(9) = \{r_0,\ r_1,\ r_2,\ r_3\}$$
by
$$r_0 = e_1 - e_7, \qquad r_1 = e_2 - e_5, \qquad r_2 = e_4 - e_7, \qquad r_3 = e_8 - e_5.$$

We claim that
$$\chi_k = \sum_{l=0}^{3}u^{kl}\, r_l, \qquad k = 1, 2, 4, 5. \tag{15.4}$$

By definition, (15.4) holds at the points
$$0,\ 3,\ 6,\ 1,\ 2,\ 4,\ 8.$$


At the point $7 \equiv 2^4 \bmod 9$,
$$\chi_k(7) = u^{4k},$$
$$\sum_{l=0}^{3}u^{kl}\, r_l(7) = -\left(1 + u^{2k}\right), \qquad k = 1, 2, 4, 5.$$
Since $u^{2k} \ne 1$,
$$1 + u^{2k} + u^{4k} = 0, \qquad k = 1, 2, 4, 5,$$
implying that (15.4) holds at the point 7. The same argument shows that (15.4) holds at the point 5, completing the proof of the claim.

Set
$$X(9) = \begin{bmatrix}1 & u & u^2 & u^3\\ 1 & u^2 & u^4 & u^6\\ 1 & u^4 & u^8 & u^{12}\\ 1 & u^5 & u^{10} & u^{15}\end{bmatrix}.$$

Then
$$\hat V(9) = \begin{bmatrix}\chi_1\\ \chi_2\\ \chi_4\\ \chi_5\end{bmatrix} = X(9)\begin{bmatrix}r_0\\ r_1\\ r_2\\ r_3\end{bmatrix} = X(9)R(9).$$
Since $X(9)$ is nonsingular, $R(9)$ is a rational basis of the span of $\hat V(9)$. Complete $R(9)$ to the rational basis of $L(\mathbb{Z}/9)$
$$R = \{f,\ \chi_0,\ g_{\chi_0},\ \chi_3,\ g_{\chi_3},\ R(9)\},$$
where $f = e_0 + e_3 + e_6$:

              0    3    6    1    2    4    8    7    5
 f            1    1    1    0    0    0    0    0    0
 χ_0          0    0    0    1    1    1    1    1    1
 g_{χ_0}     -2    1    1    0    0    0    0    0    0
 χ_3          0    0    0    1   -1    1   -1    1   -1
 g_{χ_3}      0    1   -1    0    0    0    0    0    0

Since the matrix of $F(9)$ with respect to $\hat V(9)$ is the skew-diagonal matrix
$$G(9) = \operatorname{skew-diag}\begin{bmatrix}G_9(\chi_1)\\ G_9(\chi_2)\\ G_9(\chi_4)\\ G_9(\chi_5)\end{bmatrix},$$
the matrix of $F(9)$ with respect to $R(9)$ is $X(9)G(9)X(9)^{-1}$. Since $F(9)f = 3f$, results from Theorem 14.3 prove that the matrix of $F(9)$ relative to the rational basis $R$ is the matrix direct sum
$$[3] \oplus (-3)\begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix} \oplus \sqrt{-3}\begin{bmatrix}0 & 1\\ 3 & 0\end{bmatrix} \oplus \left[X(9)G(9)X(9)^{-1}\right].$$
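The block structure can be confirmed numerically. The sketch below (not from the text) assumes $F(9)[a,b] = w^{ab}$ with $w = e^{2\pi i/9}$ and the generator $z = 2$, builds the rational basis $R$ above, and checks that $F(9)$ becomes block diagonal with blocks of sizes 1, 2, 2, 4.

```python
# Sketch: F(9) becomes block diagonal in the rational basis of this section
# (assumed conventions: F(9)[a, b] = w^{ab}, w = e^{2 pi i/9}, generator z = 2).
import numpy as np

N, z, w = 9, 2, np.exp(2j * np.pi / 9)
F = np.array([[w ** (a * b) for b in range(N)] for a in range(N)])

def vec(vals):                        # {point: value} -> 9-vector
    v = np.zeros(N, dtype=complex)
    for a, c in vals.items():
        v[a] = c
    return v

units = [pow(z, j, N) for j in range(6)]              # 1, 2, 4, 8, 7, 5
f    = vec({0: 1, 3: 1, 6: 1})
chi0 = vec({a: 1 for a in units})
g0   = vec({0: -2, 3: 1, 6: 1})
chi3 = vec({a: (-1) ** j for j, a in enumerate(units)})
g3   = vec({3: 1, 6: -1})
R9   = [vec({units[k]: 1, units[5]: -1}) for k in range(4)]   # r_k = e_{2^k} - e_5

B = np.column_stack([f, chi0, g0, chi3, g3] + R9)     # rational basis as columns
M = np.linalg.inv(B) @ F @ B                          # matrix of F(9) in this basis

blocks = [(0, 1), (1, 3), (3, 5), (5, 9)]             # block sizes 1, 2, 2, 4
mask = np.zeros((N, N), dtype=bool)
for i, j in blocks:
    mask[i:j, i:j] = True
print(np.allclose(M[~mask], 0))                       # True: block diagonal
print(np.round(M[:3, :3].real, 6))                    # first blocks: [3] and (-3)[[0,1],[1,0]]
```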

15.4 Transform Size: $p^2$

Set $t = p(p-1)$ and $u = e^{2\pi i/t}$. Fix a generator $z$ of $U(p^2)$. Throughout this section, all constructions will be based on the generator $z$ and on the generator $z \bmod p$ of $U(p)$.

Denote the multiplicative characters on $\mathbb{Z}/p^2$ by $\chi_k$, $0 \le k < t$, and the multiplicative characters on $\mathbb{Z}/p$ by $y_k$, $0 \le k < p-1$.

Consider the set of primitive multiplicative characters
$$\hat V(p^2) = \{\chi_k : 0 \le k < t,\ (p,k) = 1\}.$$
The dimension of the span of $\hat V(p^2)$ is $s = (p-1)^2$. Define the collection of functions
$$R(p^2) = \{r_k^{(2)} : 0 \le k < s\}$$
by setting, for $0 \le l < s$,
$$r_l^{(2)} = e_{z^l} - e_{z^{s+l'}}, \qquad 0 \le l' < p-1,\quad l' \equiv l \bmod (p-1).$$

Theorem 15.5 $R(p^2)$ is a rational basis for the span of $\hat V(p^2)$.

Proof We claim that, for all $\chi \in \hat V(p^2)$,
$$\chi = \sum_{l=0}^{s-1}\chi(z^l)\, r_l^{(2)}.$$
Since both sides vanish off of $U(p^2)$, we must show that the expansion holds at all points of $U(p^2)$. The expansion holds at $z^l$, $0 \le l < s$, by definition. At $z^s$ we must show that
$$\chi(z^s) = \sum_{l=0}^{s-1}\chi(z^l)\, r_l^{(2)}(z^s) = -\sum_{l=0}^{p-2}\chi(z^{l(p-1)}),$$
which is the case since $\chi(z^{p-1})$ is a nontrivial $p$-th root of unity. The same argument shows that the expansion holds at all points in $U(p^2)$, proving the claim.


Placing $\chi = \chi_k$, $0 \le k < t$, $(p,k) = 1$, in the expansion, we have
$$\chi_k = \sum_{l=0}^{s-1}u^{kl}\, r_l^{(2)},$$
which in matrix form can be written as
$$\hat V(p^2) = X(p^2)R(p^2),$$
where
$$X(p^2) = \left[u^{kl}\right]_{0 \le k < t,\ (p,k)=1,\ 0 \le l < s}.$$
Since the Vandermonde matrix $X(p^2)$ is nonsingular, the proof is complete.

The matrix $X(p^2)$ is the change of basis matrix between the rational basis $R(p^2)$ and $\hat V(p^2)$. Since
$$G(p^2) = \operatorname{skew-diag}\left[G_{p^2}(\chi_k)\right]_{0 \le k < t,\ (p,k)=1}$$
is the matrix of $F(p^2)$ with respect to the basis $\hat V(p^2)$, we have the next result.

Theorem 15.6 The matrix of $F(p^2)$ with respect to $R(p^2)$ is
$$X(p^2)G(p^2)X(p^2)^{-1}.$$

Consider
$$\hat V(p^2,p) = \{\chi_{pk} : 1 \le k < p-1\}.$$
The periodization map $\pi^*(p^2,p)$ bijectively maps $\hat V(p)$ onto $\hat V(p^2,p)$. Consider the rational basis $R(p)$ of the span of $\hat V(p)$ and define
$$R(p^2,p) = \pi^*(p^2,p)R(p).$$
Setting $R(p) = \{r_k : 0 \le k \le p-3\}$ as defined in (15.1), we have
$$R(p^2,p) = \{r_k^{(1)} : 0 \le k \le p-3\},$$
where $r_k^{(1)} = \pi^*(p^2,p)r_k$. The functions $r_k^{(1)}$, $0 \le k \le p-3$, vanish off of $U(p^2)$, are $p\mathbb{Z}/p^2$-periodic, and are defined by the values given in the following table.

                1    z    z^2   ...   z^{p-3}   z^{p-2}
 r_0^{(1)}      1    0    0     ...   0         -1
 r_1^{(1)}      0    1    0     ...   0         -1
 ...
 r_{p-3}^{(1)}  0    0    0     ...   1         -1


Since $\hat V(p) = X(p)R(p)$, we have
$$\hat V(p^2,p) = X(p)R(p^2,p),$$
proving the next result.

Theorem 15.7 $R(p^2,p)$ is a rational basis of the span of $\hat V(p^2,p)$ with $X(p)$ the change of basis matrix between $R(p^2,p)$ and $\hat V(p^2,p)$.

The decimation map $\sigma^*(p^2,p)$ maps $L(\mathbb{Z}/p)$ isomorphically onto the space of $p\mathbb{Z}/p^2$-decimated functions in $L(\mathbb{Z}/p^2)$. Define
$$F\hat V(p^2,p) = \sigma^*(p^2,p)\hat V(p)$$
and
$$S(p^2,p) = \sigma^*(p^2,p)R(p).$$
Then
$$F\hat V(p^2,p) = \{g_{pk} : 1 \le k < p-1\},$$
where $g_{pk}$ is $p\mathbb{Z}/p^2$-decimated,
$$g_{pk}(pa) = \chi_{pk}(a), \qquad a \in U(p^2),$$
and
$$S(p^2,p) = \{s_k^{(1)} : 0 \le k \le p-3\},$$
where
$$s_k^{(1)} = e_{pz^k} - e_{pz^{p-2}}, \qquad 0 \le k \le p-3.$$

Since $\hat V(p) = X(p)R(p)$, we have
$$F\hat V(p^2,p) = X(p)S(p^2,p),$$
proving the next result.

Theorem 15.8 $S(p^2,p)$ is a rational basis of the span of $F\hat V(p^2,p)$ with $X(p)$ the change of basis matrix between $S(p^2,p)$ and $F\hat V(p^2,p)$.

The set
$$W(p^2,p) = \hat V(p^2,p) \cup F\hat V(p^2,p)$$
is a basis of the subspace it spans. Since
$$F(p^2)\chi_{pk} = pG_p(y_k)\,\bar g_{pk},$$
$$F(p^2)g_{pk} = G_p(y_k)\,\bar\chi_{pk},$$
the span of $W(p^2,p)$ is $F(p^2)$-invariant and the matrix of $F(p^2)$ with respect to $W(p^2,p)$ is
$$G(p^2,p) = \begin{bmatrix}0 & G(p)\\ pG(p) & 0\end{bmatrix}.$$
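As a quick numerical sanity check of these formulas (a sketch, not from the text), one can verify the action of $F(25)$ on the periodized and decimated characters. The unnormalized transform $F(N)[a,b] = w^{ab}$, $w = e^{2\pi i/N}$, is assumed, consistent with $F(p)\chi = G_p(\chi)\bar\chi$; the conjugates reflect the skew-diagonal blocks of $G(p)$.

```python
# Sketch: action of F(25) on the periodized characters chi_{5k} and the decimated g_{5k}
# (assumed conventions: F[a, b] = w^{ab}, w = e^{2 pi i/25}; y_k are the characters of Z/5).
import numpy as np

p = 5
N, w = p * p, np.exp(2j * np.pi / (p * p))
F = np.array([[w ** (a * b) for b in range(N)] for a in range(N)])

zp = 2                                             # generator of U(5)
v = np.exp(2j * np.pi / (p - 1))
y = [{pow(zp, j, p): v ** (k * j) for j in range(p - 1)} for k in range(p - 1)]   # y_k on U(5)

ok = True
for k in range(1, p - 1):
    chi_pk = np.array([y[k].get(a % p, 0) for a in range(N)], dtype=complex)      # periodization
    g_pk = np.zeros(N, dtype=complex)
    for a in range(1, p):
        g_pk[p * a] = y[k][a]                                                      # decimation
    G = sum(y[k][a] * np.exp(2j * np.pi * a / p) for a in range(1, p))             # G_p(y_k)
    ok &= np.allclose(F @ chi_pk, p * G * np.conj(g_pk))
    ok &= np.allclose(F @ g_pk, G * np.conj(chi_pk))
print(ok)                                          # True
```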


Theorem 15.9 $R(p^2,p) \cup S(p^2,p)$ is a rational basis of the $F(p^2)$-invariant subspace spanned by $W(p^2,p)$, and the matrix of $F(p^2)$ with respect to this rational basis is
$$X(p^2,p)G(p^2,p)X(p^2,p)^{-1},$$
where $X(p^2,p) = X(p) \oplus X(p)$.

Completing $R(p^2,p) \cup S(p^2,p) \cup R(p^2)$ to the rational basis
$$R = \{f,\ \chi_0,\ g_{\chi_0},\ R(p^2,p),\ S(p^2,p),\ R(p^2)\}$$
of $L(\mathbb{Z}/p^2)$, where $f = \sum_{i=0}^{p-1}e_{ip}$, we have the next result.

Theorem 15.10 The matrix of $F(p^2)$ with respect to the rational basis $R$ is the matrix direct sum
$$[p] \oplus (-p)\begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix} \oplus \left[X(p^2,p)G(p^2,p)X(p^2,p)^{-1}\right] \oplus \left[X(p^2)G(p^2)X(p^2)^{-1}\right],$$
where
$$X(p^2,p) = X(p) \oplus X(p).$$

Define the permutation matrix $P$ by the formulas
$$P(e_0) = e_0, \qquad P(e_{z^k}) = e_{z^{-k}},\quad 0 \le k < t,$$
$$P(e_{pz^k}) = e_{pz^{-k}}, \qquad 0 \le k < p-1.$$

For $0 \le k < t$ and $(p,k) = 1$, we have
$$PF(p^2)\chi_k = G_{p^2}(\chi_k)\,\chi_k.$$
For $1 \le k < p-1$,
$$PF(p^2)\chi_{pk} = pG_p(y_k)\,g_{pk},$$
$$PF(p^2)g_{pk} = G_p(y_k)\,\chi_{pk}.$$

Also, $P(\chi_0) = \chi_0$, $P(g_{\chi_0}) = g_{\chi_0}$, and $P(f) = f$, proving the next result.

Theorem 15.11 The matrix of $PF(p^2)$ relative to the rational basis $R$ is the matrix direct sum
$$[p] \oplus (-p)\begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix} \oplus \left[X(p^2,p)D(p^2,p)X(p^2,p)^{-1}\right] \oplus \left[X(p^2)D(p^2)X(p^2)^{-1}\right],$$
where
$$D(p^2,p) = \begin{bmatrix}0 & D(p)\\ pD(p) & 0\end{bmatrix}, \qquad D(p^2) = \operatorname{diag}\left(G_{p^2}(\chi_k)\right)_{0 \le k < t,\ (p,k)=1}.$$


15.5 Exponential Basis

The constructions of this chapter can be restated in terms of what we will refer to as the exponential basis. First take $z$ as a generator of $U(p)$ and set
$$Ex(p) = \{e_{z^k} : 0 \le k < p-1\}$$
and
$$L(p) = \left[I_{p-2}\ \ -\mathbf{1}_{p-2}\right],$$
where $\mathbf{1}_{p-2}$ denotes the column of $p-2$ ones. Then
$$R(p) = L(p)Ex(p).$$

For a generator $z$ of $U(p^2)$, set
$$Ex(p^2) = \{e_{z^k} : 0 \le k < t\}, \qquad t = p(p-1),$$
and
$$L(p^2) = \left[I_s\ \ -\mathbf{1}_{p-1}\otimes I_{p-1}\right], \qquad s = (p-1)^2.$$
Then
$$R(p^2) = L(p^2)Ex(p^2)$$

and
$$R(p^2,p) = \left(\mathbf{1}_p^t \otimes L(p)\right)Ex(p^2).$$
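A small sketch (not from the text) of the matrices $L(p)$ and $L(p^2)$ that carve the rational bases out of the exponential basis; the helper names are hypothetical, and only the conventions above are assumed.

```python
# Sketch (assumed conventions): the selection matrices L(p) and L(p^2)
# with R(p) = L(p) Ex(p) and R(p^2) = L(p^2) Ex(p^2).
import numpy as np

def L_prime(p):
    """(p-2) x (p-1) matrix [ I_{p-2}  -1 ]."""
    return np.hstack([np.eye(p - 2), -np.ones((p - 2, 1))])

def L_prime_squared(p):
    """s x t matrix [ I_s  -(1_{p-1} kron I_{p-1}) ], s = (p-1)^2, t = p(p-1)."""
    s = (p - 1) ** 2
    block = np.kron(np.ones((p - 1, 1)), np.eye(p - 1))
    return np.hstack([np.eye(s), -block])

# Example: p = 3.  Ex(3) = (e_{z^0}, e_{z^1}) and R(3) = {e_{z^0} - e_{z^1}}.
print(L_prime(3))            # [[ 1. -1.]]
print(L_prime_squared(3))    # 4 x 6 matrix selecting the r_l^(2) from Ex(9)
```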

15.6 Polynomial Product Modulo a Polynomial

Choose distinct complex numbers $u_k$, $1 \le k \le n$, and form the polynomial
$$g(x) = \prod_{k=1}^{n}(x - u_k).$$

The quotient polynomial ring $\mathbb{C}[x]/g(x)$ is an $n$-dimensional vector space with basis
$$\{x^l : 0 \le l < n\}.$$

For each $\phi \in \mathbb{C}[x]/g(x)$, define the linear transformation $\tau(\phi)$ of $\mathbb{C}[x]/g(x)$ by
$$\tau(\phi)\psi = \phi\psi, \qquad \psi \in \mathbb{C}[x]/g(x).$$

By the polynomial CRT there exists a basis of idempotents $\{E_k : 1 \le k \le n\}$ for $\mathbb{C}[x]/g(x)$ satisfying the following two properties:

• For $\phi \in \mathbb{C}[x]/g(x)$, we can write
$$\phi = \sum_{k=1}^{n}\phi(u_k)E_k.$$


• For $\phi, \psi \in \mathbb{C}[x]/g(x)$, we have
$$\phi\psi = \sum_{k=1}^{n}\phi(u_k)\psi(u_k)E_k.$$

These two properties imply the next result.

Theorem 15.12 For $\phi \in \mathbb{C}[x]/g(x)$, the matrix of $\tau(\phi)$ relative to the idempotent basis is the diagonal matrix
$$\operatorname{diag}\left(\phi(u_k)\right)_{1 \le k \le n}.$$

Since $x^l = \sum_{k=1}^{n}u_k^l E_k$, we have
$$\begin{bmatrix}1\\ x\\ \vdots\\ x^{n-1}\end{bmatrix} = Z\begin{bmatrix}E_1\\ E_2\\ \vdots\\ E_n\end{bmatrix}$$
with
$$Z = \begin{bmatrix}1 & 1 & \cdots & 1\\ u_1 & u_2 & \cdots & u_n\\ \vdots & & & \vdots\\ u_1^{n-1} & u_2^{n-1} & \cdots & u_n^{n-1}\end{bmatrix}$$
the change of basis matrix.

Theorem 15.13 For $\phi \in \mathbb{C}[x]/g(x)$, the matrix of $\tau(\phi)$ relative to the basis $\{x^l : 0 \le l < n\}$ is
$$Z^{-1}\operatorname{diag}\left(\phi(u_k)\right)_{1 \le k \le n}\,Z.$$

The preceding theorems will be used to interpret $X(p)D(p)X(p)^{-1}$ and $X(p^2)D(p^2)X(p^2)^{-1}$ as polynomial multiplications in a quotient polynomial ring.
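Before specializing to the Gauss polynomials, the diagonalization of multiplication modulo $g(x)$ can be illustrated numerically. The sketch below (not from the text) uses illustrative nodes and an arbitrary $\phi$; it checks that the multiplication-mod-$g$ matrix is similar to $\operatorname{diag}(\phi(u_k))$ via the Vandermonde matrix of the nodes, which is the content of Theorems 15.12 and 15.13 up to the row/column conventions used for $Z$.

```python
# Sketch: multiplication by phi(x) modulo g(x) = prod_k (x - u_k) is diagonalized
# by the Vandermonde matrix of the nodes.  Illustrative nodes and phi only.
import numpy as np

u_nodes = np.exp(2j * np.pi * np.arange(1, 6) / 6)     # u^1, ..., u^5 with u = e^{2 pi i/6}
n = len(u_nodes)
g = np.poly(u_nodes)                                    # g(x) = 1 + x + ... + x^5 (high degree first, up to rounding)
phi = np.array([1.0, 2.0, 0.0, -1.0, 0.0])              # phi(x) = 1 + 2x - x^3 (low degree first)

def mult_mod_matrix(phi_low, g_high):
    """Standard matrix (columns = images) of psi -> phi*psi mod g in the basis {1, x, ..., x^{n-1}}."""
    n = len(g_high) - 1
    M = np.zeros((n, n), dtype=complex)
    for l in range(n):
        x_l = np.zeros(l + 1); x_l[0] = 1.0             # x^l, high degree first
        _, rem = np.polydiv(np.polymul(phi_low[::-1], x_l), g_high)
        rem = rem[::-1]                                  # low degree first
        M[:len(rem), l] = rem
    return M

V = np.vander(u_nodes, n, increasing=True)              # V[k, l] = u_k^l
D = np.diag(np.polyval(phi[::-1], u_nodes))             # diag(phi(u_k))
M = mult_mod_matrix(phi, g)
print(np.allclose(M, np.linalg.inv(V) @ D @ V))         # True
```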

Fix a generator $z$ of $U(p)$ and denote by $\chi_l$, $0 \le l < p-1$, the multiplicative characters of $\mathbb{Z}/p$ with respect to $z$. Set $u = e^{2\pi i/(p-1)}$. Take $u_k = u^k$, $1 \le k \le p-2$, in the discussion above. Then
$$g_1(x) = \prod_{k=1}^{p-2}(x - u^k) = 1 + x + \cdots + x^{p-2}$$
and $Z = X(p)^t$.



Define the Gauss polynomial $\phi_p$ by
$$\phi_p(x) = \sum_{k=0}^{p-2}w^{z^k}x^k, \qquad w = e^{2\pi i/p}.$$
Since $\phi_p(u^k) = G_p(\chi_k)$, $1 \le k \le p-2$, we have the next result.

Theorem 15.14 The matrix of the linear transformation $\tau(\phi_p)$ of $\mathbb{C}[x]/g_1(x)$ relative to the basis $\{x^l : 0 \le l < p-2\}$ is
$$\left(X(p)D(p)X(p)^{-1}\right)^t.$$
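The identity $\phi_p(u^k) = G_p(\chi_k)$ driving Theorem 15.14 is easy to check numerically; the sketch below (not from the text) does so for $p = 7$ under the stated conventions ($z = 3$, $u = e^{2\pi i/6}$, $w = e^{2\pi i/7}$, $G_p(\chi) = \sum_a \chi(a)w^a$).

```python
# Sketch: check phi_p(u^k) = G_p(chi_k) for p = 7.
import cmath

p, z = 7, 3
t = p - 1
u, w = cmath.exp(2j * cmath.pi / t), cmath.exp(2j * cmath.pi / p)

def phi_p(x):
    """Gauss polynomial phi_p(x) = sum_{k=0}^{p-2} w^{z^k} x^k."""
    return sum(w ** pow(z, k, p) * x ** k for k in range(p - 1))

for k in range(1, p - 1):
    chi = {pow(z, j, p): u ** (k * j) for j in range(t)}      # chi_k on the units of Z/7
    gauss = sum(chi[a] * w ** a for a in range(1, p))         # G_p(chi_k)
    print(k, abs(phi_p(u ** k) - gauss) < 1e-9)               # True for each k
```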

Fix a generator $z$ of $U(p^2)$ and denote by $\chi_l$, $0 \le l < t = p(p-1)$, the multiplicative characters of $\mathbb{Z}/p^2$ with respect to $z$. Set $u = e^{2\pi i/t}$ and take
$$g_2(x) = \prod_{0 \le k < t,\ (p,k)=1}(x - u^k)$$
in the discussion above. Then
$$Z = X(p^2)^t.$$

Define the Gauss polynomial
$$\phi_{p^2}(x) = \sum_{k=0}^{t-1}w^{z^k}x^k, \qquad w = e^{2\pi i/p^2}.$$
Since $\phi_{p^2}(u^k) = G_{p^2}(\chi_k)$, $0 \le k < t$, $(p,k)=1$, we have the next result.

Theorem 15.15 The matrix of the linear transformation $\tau(\phi_{p^2})$ of $\mathbb{C}[x]/g_2(x)$ relative to the basis $\{x^l : 0 \le l < s\}$ is
$$\left(X(p^2)D(p^2)X(p^2)^{-1}\right)^t.$$

15.7 An Example: $3^3$

Set $u = e^{2\pi i/18}$. Take 2 as a generator of $U(27)$. Consider the set of primitive multiplicative characters
$$\hat V(27) = \{\chi_k : 0 \le k < 18,\ (3,k) = 1\}.$$
Set
$$Ex(27) = \{e_{z^k} : 0 \le k < 18\}.$$
A rational basis
$$R(27) = \{r_l^{(2)} : 0 \le l < 12\}$$
for the span of $\hat V(27)$ is defined by the matrix formula
$$R(27) = L(27)Ex(27),$$
where
$$L(27) = \left[I_{12}\ \ -\mathbf{1}_2\otimes I_6\right].$$
Direct computation shows that
$$\hat V(27) = X(27)R(27),$$
where
$$X(27) = \left[u^{kl}\right]_{0 \le k < 18,\ (3,k)=1,\ 0 \le l < 12}.$$

Consider the sets
$$\hat V(27,9) = \{\chi_3,\ \chi_6,\ \chi_{12},\ \chi_{15}\},$$
$$F\hat V(27,9) = \{g_{\chi_3},\ g_{\chi_6},\ g_{\chi_{12}},\ g_{\chi_{15}}\}.$$
To be explicit, we repeat the tables describing $\hat V(27,9)$ and $F\hat V(27,9)$:

(here $u = e^{2\pi i/6}$, as in Section 15.3)

            1      2      4      8      16     5
 χ_3        1      u      u^2    u^3    u^4    u^5
 χ_6        1      u^2    u^4    u^6    u^8    u^10
 χ_12       1      u^4    u^8    u^12   u^16   u^20
 χ_15       1      u^5    u^10   u^15   u^20   u^25

            3      6      12     24     21     15
 g_{χ_3}    1      u      u^2    u^3    u^4    u^5
 g_{χ_6}    1      u^2    u^4    u^6    u^8    u^10
 g_{χ_12}   1      u^4    u^8    u^12   u^16   u^20
 g_{χ_15}   1      u^5    u^10   u^15   u^20   u^25

We see from the first table that $\hat V(27,9)$ is constructed by periodizing $\hat V(9)$ mod 9 in $\mathbb{Z}/27$, and from the second table that $F\hat V(27,9)$ is constructed by decimating $\hat V(9)$ to $3U(27)$ in $\mathbb{Z}/27$. As a result, the set $R(27,9)$, formed by periodizing $R(9)$ mod 9, is a rational basis of the span of $\hat V(27,9)$. The set $S(27,9)$, formed by decimating $R(9)$ to $3U(27)$, is a rational basis of the span of $F\hat V(27,9)$. Then
$$\hat V(27,9) = X(9)R(27,9),$$
$$F\hat V(27,9) = X(9)S(27,9).$$

Form the rational basis of $L(\mathbb{Z}/27)$,
$$R = \{f_0,\ f_1,\ f_2,\ \chi_0,\ g_{\chi_0},\ \chi_9,\ g_{\chi_9},\ R(27,9),\ S(27,9),\ R(27)\},$$
where
$$f_0 = e_0 + e_9 + e_{18}, \qquad f_1 = e_3 + e_{12} + e_{21}, \qquad f_2 = e_6 + e_{15} + e_{24}.$$

Theorem 15.16 The matrix of $F(27)$ relative to the rational basis $R$ is given by
$$3F(3) \oplus (-3)\begin{bmatrix}0 & 1\\ 3 & 0\end{bmatrix} \oplus \sqrt{-3}\begin{bmatrix}0 & 1\\ 9 & 0\end{bmatrix} \oplus \left[X(27,9)G(27,9)X(27,9)^{-1}\right] \oplus \left[X(27)G(27)X(27)^{-1}\right],$$
where
$$X(27,9) = X(9) \oplus X(9), \qquad G(27,9) = \begin{bmatrix}0 & G(9)\\ 3G(9) & 0\end{bmatrix}.$$
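The first three blocks of Theorem 15.16 can be checked numerically. The sketch below (not from the text) assumes $F(27)[a,b] = w^{ab}$ with $w = e^{2\pi i/27}$ and verifies the relations $F(27)f_j = 3\sum_m \zeta^{jm}f_m$, $F(27)\chi_0 = -9g_{\chi_0}$, $F(27)g_{\chi_0} = -3\chi_0$, $F(27)\chi_9 = 9\sqrt{-3}\,g_{\chi_9}$, and $F(27)g_{\chi_9} = \sqrt{-3}\,\chi_9$ that underlie those blocks.

```python
# Sketch: check the first blocks of Theorem 15.16 (assumed conventions:
# F(27)[a, b] = w^{ab}, w = e^{2 pi i/27}; chi_0, g_{chi_0}, chi_9, g_{chi_9} as in the text).
import numpy as np

N = 27
w = np.exp(2j * np.pi / N)
F = np.array([[w ** (a * b) for b in range(N)] for a in range(N)])

def vec(vals):
    v = np.zeros(N, dtype=complex)
    for a, c in vals.items():
        v[a] = c
    return v

f = [vec({3 * j + 9 * i: 1 for i in range(3)}) for j in range(3)]   # f_0, f_1, f_2
units = [a for a in range(N) if a % 3 != 0]
chi0 = vec({a: 1 for a in units})
g0   = vec({0: -2, 9: 1, 18: 1})
chi9 = vec({a: 1 if a % 3 == 1 else -1 for a in units})             # quadratic character, periodized
g9   = vec({9: 1, 18: -1})

zeta = np.exp(2j * np.pi / 3)
ok = all(np.allclose(F @ f[j], 3 * sum(zeta ** (j * m) * f[m] for m in range(3))) for j in range(3))
ok &= np.allclose(F @ chi0, -9 * g0) and np.allclose(F @ g0, -3 * chi0)
rt = 1j * np.sqrt(3)                                                 # G_3(y_1) = sqrt(-3)
ok &= np.allclose(F @ chi9, 9 * rt * g9) and np.allclose(F @ g9, rt * chi9)
print(ok)   # True: the 3F(3), (-3)[[0,1],[3,0]] and sqrt(-3)[[0,1],[9,0]] blocks
```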

References

[1] Auslander, L., Feig, E. and Winograd, S. "The Multiplicative Complexity of the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 31-55.

[2] Tolimieri, R. "Multiplicative Characters and the Discrete Fourier Transform", Adv. in Appl. Math., 7, 1986, pp. 344-380.

[3] Rader, C. "Discrete Fourier Transforms When the Number of Data Samples is Prime", Proc. IEEE, 56, 1968, pp. 1107-1108.

[4] Winograd, S. "Arithmetic Complexity of Computations", CBMS Regional Conf. Ser. in Math. SIAM, 33, Philadelphia, 1980.

[5] Tolimieri, R. "The Construction of Orthogonal Basis Diagonalizing the Discrete Fourier Transform", Adv. in Appl. Math., 5, 1984, pp. 56-86.


Index

Additive
  algorithms, 147
  character, 219
  stage, 160
Agarwal-Cooley algorithm, 84
algebra, 19
auto-sort algorithm, 80, 86

Basis
  canonical, 54
  of idempotents, 260
  orthonormal, 218
  rational, 249
  standard, 30, 218
bilinear, 28
  algorithms, 121
binary bit representation, 74
bit-reversal, 73, 74
block
  diagonal, 170
  diagonalization method, 164
  factor, 197
  factors, 189

Canonical basis, 54
character
  additive, 219
  multiplicative, 229
  multiplicative, group, 231
characteristic field, 25
Chinese remainder theorem, 1

circulant matrix, 105
common divisor, 2
  greatest, 2, 16
  polynomial, 14
commutation theorem, 1, 39
complete systems of idempotents, 1
congruent, 6
conjugate, 56
constant polynomial, 14
convolution
  cyclic, 103
  linear, 101
  theorem, 101, 107
  two-dimensional, 142
Cooley-Tukey radix-2 FFT algorithm, 76


core, Winograd, 204, 207, 212
cyclic convolution, 103, 138, 139
  two-dimensional, 138
  two-dimensional, matrix description, 140
cyclic shift matrix, 105
cyclotomic polynomials, 129

Data transposition, 83
decimated, 222
decimation
  in frequency, 76
  in time, 76
  map, 237
degree, 13
diagonal
  block, 170
  factor, 197
  factors, 189
diagonalization method
  block, 164
  partial, 164
direct product
  group, 11
  ring, 7, 21
divide, 2, 14
divisibility condition, 2
divisor
  common, 2
  zero, 174
dual, 221

Euler quotient function, 12
exponential
  ordering, 156
  permutation, 156, 174, 182, 204, 207, 212
extension field, 17

Field characteristic, 25
FT factor, 189, 197
fundamental factorization, 149, 160

Gauss
  polynomial, 262
  sum of primitive multiplicative character, 240

Gauss sum, 238
generator, 12
Good-Thomas Prime Factor algorithm, 91
greatest common divisor, 2
group direct product, 11

Homomorphism, ring, 6

Ideal, 2
idempotent, 8, 21
  basis of, 260
  system of, 8, 21
inner product, 218
inverse matrix, 57
irreducible, 14

Legendre symbol, 243
linear convolution, 101

Matrix
  inverse, 57
  symmetric, 57
mixed radix
  auto-sorting FFT algorithm, 86
  factorization, 82
  FFT, 82
monic polynomial, 14
multiple, 2, 14
multiplication, ring, 19
multiplicative
  character, 229, 231
    group, 231
    primitive, 233
    principal, 231
  characters, set of, 233
  factor, 197
  factors, 189
  stage, 160

Order, 11
  of matrix, 56
ordering, exponential, 156
orthonormal basis, 218

Parallel operation, 32


partial diagonalization method, 164
Pease FFT, 78
Pease FT, 77
perfect shuffle, 33
periodic, 221
periodization map, 236
permutation, 253
  exponential, 156, 174, 182, 204, 207, 212
  stride, 33
preaddition
  factor, 189, 197
  matrix, 149, 169
prime
  factorization theorem, 4
  polynomial, 14
prime number, 2
primitive multiplicative character, 233
  Gauss sum of, 240
principal multiplicative character, 231

Quotient, 2, 14

Rational basis, 249
reducing polynomial, 113
register, vector length, 45
relatively prime, 2
  polynomial, 14
remainder, 2, 14
ring
  direct product, 7, 21
  homomorphism, 6
  multiplication, 19


rotated
  core, 149
  FT, 94, 95, 97
rotated Winograd core, 175, 182

Segmenting, 29
skew-circulant, 155
skew-diagonal matrix, 251
standard
  basis, 30, 218
  representation, 218
Stockham FFT, 80
stride permutation, 33
subfield, 17
symmetric matrix, 57
system of idempotents, 8, 21

Tensor product, 28
  of matrices, 31
twiddle factor, 58

Unit group, 6

Vandermonde matrix, 112, 251
vector
  length register, 45
  operation, 32

Winograd
  core, 157, 169, 204, 207, 212
  core, rotated, 149, 175, 182
  large FT algorithm, 148
  small FT algorithm, 147

Zero divisor, 174