Page 1: High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations

High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations

J. Greg Nash

www.centar.net

[email protected]

ICNC 2014

Outline

• Motivation for new FFT designs in wireless applications?

• Review of FFT architectures

• New systolic FFT architecture

• FPGA circuit performance comparisons

– LTE SC-FDMA

– Fixed-size power-of-two transforms

– Variable transforms (LTE, WiMAX)

• Conclusions

Future Drivers for Wireless FFT Design

• Algorithmic (OFDM)

– Large transform sizes (LTE: 2048 points; DVB: 32K points)

– Run-time scalable OFDMA (LTE: 128 to 2048 points)

– Non-power-of-two transform sizes (LTE SC-FDMA: 35 sizes, 12 to 1296 points)

– High performance (LTE advanced)

• BW = 100MHz with 8 MIMO streams (<1.0sec for 2K FFT)

• Critical system requirements

– Power

– Cost

FFT Architecture Review (1): Pipelined

\[ W = e^{-2\pi I/N} \]

[Figure: signal flow graph (8-point DFT), collapsed onto pipelined hardware blocks to give the block diagram]

Features

• Fast

• Hardware Intensive

• Non-programmable

FFT Architecture Review (2): Memory Based

Traditional

Features

• Programmable

• Compact

• Typically slow

Proposed Systolic Array

Features

• Programmable

• Faster than pipelined FFT

• Scalable

• Higher SQNR

Matrix Form DFT (16-Point DFT)

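\[ Z = C X, \qquad W = e^{-2\pi I/N} \quad (N=16) \]

\[
\left[\begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_{16} \end{array}\right]
=
\left[\begin{array}{cccccccccccccccc}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & W & W^2 & W^3 & W^4 & W^5 & W^6 & W^7 & W^8 & W^9 & W^{10} & W^{11} & W^{12} & W^{13} & W^{14} & W^{15} \\
1 & W^2 & W^4 & W^6 & W^8 & W^{10} & W^{12} & W^{14} & 1 & W^2 & W^4 & W^6 & W^8 & W^{10} & W^{12} & W^{14} \\
1 & W^3 & W^6 & W^9 & W^{12} & W^{15} & W^2 & W^5 & W^8 & W^{11} & W^{14} & W & W^4 & W^7 & W^{10} & W^{13} \\
1 & W^4 & W^8 & W^{12} & 1 & W^4 & W^8 & W^{12} & 1 & W^4 & W^8 & W^{12} & 1 & W^4 & W^8 & W^{12} \\
1 & W^5 & W^{10} & W^{15} & W^4 & W^9 & W^{14} & W^3 & W^8 & W^{13} & W^2 & W^7 & W^{12} & W & W^6 & W^{11} \\
1 & W^6 & W^{12} & W^2 & W^8 & W^{14} & W^4 & W^{10} & 1 & W^6 & W^{12} & W^2 & W^8 & W^{14} & W^4 & W^{10} \\
1 & W^7 & W^{14} & W^5 & W^{12} & W^3 & W^{10} & W & W^8 & W^{15} & W^6 & W^{13} & W^4 & W^{11} & W^2 & W^9 \\
1 & W^8 & 1 & W^8 & 1 & W^8 & 1 & W^8 & 1 & W^8 & 1 & W^8 & 1 & W^8 & 1 & W^8 \\
1 & W^9 & W^2 & W^{11} & W^4 & W^{13} & W^6 & W^{15} & W^8 & W & W^{10} & W^3 & W^{12} & W^5 & W^{14} & W^7 \\
1 & W^{10} & W^4 & W^{14} & W^8 & W^2 & W^{12} & W^6 & 1 & W^{10} & W^4 & W^{14} & W^8 & W^2 & W^{12} & W^6 \\
1 & W^{11} & W^6 & W & W^{12} & W^7 & W^2 & W^{13} & W^8 & W^3 & W^{14} & W^9 & W^4 & W^{15} & W^{10} & W^5 \\
1 & W^{12} & W^8 & W^4 & 1 & W^{12} & W^8 & W^4 & 1 & W^{12} & W^8 & W^4 & 1 & W^{12} & W^8 & W^4 \\
1 & W^{13} & W^{10} & W^7 & W^4 & W & W^{14} & W^{11} & W^8 & W^5 & W^2 & W^{15} & W^{12} & W^9 & W^6 & W^3 \\
1 & W^{14} & W^{12} & W^{10} & W^8 & W^6 & W^4 & W^2 & 1 & W^{14} & W^{12} & W^{10} & W^8 & W^6 & W^4 & W^2 \\
1 & W^{15} & W^{14} & W^{13} & W^{12} & W^{11} & W^{10} & W^9 & W^8 & W^7 & W^6 & W^5 & W^4 & W^3 & W^2 & W
\end{array}\right]
\left[\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_{16} \end{array}\right]
\]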
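
The matrix form of the 16-point DFT can be checked numerically. The following is a minimal Python sketch, not taken from the slides: the helper names `dft_matrix` and `matvec` are illustrative assumptions, and indices are 0-based whereas the slide numbers x and z from 1 to 16.

```python
import cmath

def dft_matrix(n):
    """Return (exps, C) where C[j][k] = W**exps[j][k] and W = exp(-2*pi*I/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    # Exponents are reduced mod n because W**n == 1.
    exps = [[(j * k) % n for k in range(n)] for j in range(n)]
    c = [[w ** e for e in row] for row in exps]
    return exps, c

def matvec(c, x):
    """Plain matrix-vector product Z = C X."""
    return [sum(cjk * xk for cjk, xk in zip(row, x)) for row in c]

exps, C = dft_matrix(16)
print(exps[3])  # exponents of W along the fourth row of the matrix

# A unit impulse at the second input selects the second column of C.
x = [0.0] * 16
x[1] = 1.0
Z = matvec(C, x)
```

Each output costs N multiply-accumulates in this form, so the full product needs N² operations; the factorizations on the following slides reduce that count.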

Inputs X and Outputs Z in Bit-reversed Form (N=16)

\[
C_b =
\left[\begin{array}{cccc}
d_1 R_1 & d_2 R_1 & d_3 R_1 & d_4 R_1 \\
d_1 R_2 & W d_2 R_2 & W^2 d_3 R_2 & W^3 d_4 R_2 \\
d_1 R_3 & W^2 d_2 R_3 & W^4 d_3 R_3 & W^6 d_4 R_3 \\
d_1 R_4 & W^3 d_2 R_4 & W^6 d_3 R_4 & W^9 d_4 R_4
\end{array}\right]
\]

where the $d_q$ are diagonal matrices,

\[
d_1 = \mathrm{diag}(1, 1, 1, 1), \quad
d_2 = \mathrm{diag}(1, -I, -1, I), \quad
d_3 = \mathrm{diag}(1, -1, 1, -1), \quad
d_4 = \mathrm{diag}(1, I, -1, -I),
\]

and each $R_p$ is a 4×4 matrix whose four rows are identical:

\[
R_1 = \left[\begin{array}{cccc} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{array}\right], \quad
R_2 = \left[\begin{array}{cccc} 1 & -I & -1 & I \\ 1 & -I & -1 & I \\ 1 & -I & -1 & I \\ 1 & -I & -1 & I \end{array}\right], \quad
R_3 = \left[\begin{array}{cccc} 1 & -1 & 1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & 1 & -1 \end{array}\right], \quad
R_4 = \left[\begin{array}{cccc} 1 & I & -1 & -I \\ 1 & I & -1 & -I \\ 1 & I & -1 & -I \\ 1 & I & -1 & -I \end{array}\right]
\]

$Z = CX$ becomes:

\[
Y =
\left[\begin{array}{cccc}
1 & 1 & 1 & 1 \\
1 & W & W^2 & W^3 \\
1 & W^2 & W^4 & W^6 \\
1 & W^3 & W^6 & W^9
\end{array}\right]
\bullet
\left[\begin{array}{cccc}
1 & 1 & 1 & 1 \\
1 & -I & -1 & I \\
1 & -1 & 1 & -1 \\
1 & I & -1 & -I
\end{array}\right]
\left[\begin{array}{cccc}
x_1 & x_2 & x_3 & x_4 \\
x_5 & x_6 & x_7 & x_8 \\
x_9 & x_{10} & x_{11} & x_{12} \\
x_{13} & x_{14} & x_{15} & x_{16}
\end{array}\right]
\]

\[
Z_b =
\left[\begin{array}{cccc}
z_1 & z_2 & z_3 & z_4 \\
z_5 & z_6 & z_7 & z_8 \\
z_9 & z_{10} & z_{11} & z_{12} \\
z_{13} & z_{14} & z_{15} & z_{16}
\end{array}\right]
=
\left[\begin{array}{cccc}
1 & 1 & 1 & 1 \\
1 & -I & -1 & I \\
1 & -1 & 1 & -1 \\
1 & I & -1 & -I
\end{array}\right]
Y^t
\]

"$\bullet$" = element by element multiply
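
The block structure of the reordered 16-point coefficient matrix can be verified numerically. The sketch below is an illustration under stated assumptions rather than the author's code: it assumes the reordering places index p + 4u at position 4p + u (order 0, 4, 8, 12, 1, 5, ...) for both rows and columns, and the names `perm`, `Cb`, and `block_entry` are hypothetical. It checks that every entry of block (p, q) equals a power of W times a diagonal factor (depending only on the row within the block) times a row pattern (depending only on the column within the block).

```python
import cmath

N, b = 16, 4
W = cmath.exp(-2j * cmath.pi / N)   # W = e^{-2 pi I / N}
W4 = W ** (N // b)                  # fourth root of unity, equal to -I

# Natural-order DFT matrix with 0-based indices: C[j][k] = W^(j*k mod N).
C = [[W ** ((j * k) % N) for k in range(N)] for j in range(N)]

# Assumed reordering: index p + 4u moves to position 4p + u.
perm = [p + b * u for p in range(b) for u in range(b)]
Cb = [[C[perm[i]][perm[j]] for j in range(N)] for i in range(N)]

def block_entry(p, q, u, v):
    """Entry (u, v) of block (p, q), all indices 0-based."""
    d = W4 ** (u * q)   # diagonal factor: varies with the row u inside the block
    r = W4 ** (p * v)   # row pattern: varies with the column v inside the block
    return W ** (p * q) * d * r

max_err = max(
    abs(Cb[b * p + u][b * q + v] - block_entry(p, q, u, v))
    for p in range(b) for q in range(b) for u in range(b) for v in range(b)
)
print(max_err)  # tiny (rounding only) if the block structure holds
```

Because the diagonal factor and the row pattern separate, each block needs only a 4-point DFT row and a scalar twiddle, which is what allows the element-by-element multiply form.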

New FFT Matrix Form

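\[ Y = W_M \bullet C_{M1} X_b \]

\[ Z_b = C_{M2} Y^t \]

where

\[
C_{M1} = \left[\, C_B^t \mid C_B^t \mid \ldots \,\right]^t, \qquad
C_{M2} = \left[\, C_B \mid C_B \mid \ldots \,\right], \qquad
C_B =
\left[\begin{array}{cccc}
1 & 1 & 1 & 1 \\
1 & -I & -1 & I \\
1 & -1 & 1 & -1 \\
1 & I & -1 & -I
\end{array}\right]
\quad \text{(for } b=4\text{)}
\]

"$\bullet$" = element by element multiply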
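
The two matrix equations can be exercised on a small case. The following Python sketch is one possible reading of them, not the author's implementation. It assumes M = N/b with b dividing M, that X is the input written row by row into a b × M matrix, that W_M holds W^(r·c) at row r and column c with W = e^{-2πI/N}, and that the product of C_M1 with X is formed before the element-by-element multiply. The function names are hypothetical.

```python
import cmath

def dft_direct(x):
    """Reference N-point DFT computed straight from the definition."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** ((m * k) % n) for m in range(n)) for k in range(n)]

def base_b_dft(x, b=4):
    """DFT via Y = W_M . (C_M1 X) and Z = C_M2 Y^t (assumed layout, see lead-in)."""
    n = len(x)
    m = n // b                                   # M = N / b
    assert n == b * m and m % b == 0, "sketch assumes b divides N/b"
    w = cmath.exp(-2j * cmath.pi / n)
    wb = cmath.exp(-2j * cmath.pi / b)
    c_b = [[wb ** ((i * j) % b) for j in range(b)] for i in range(b)]

    # X: b x M, input written row by row.
    X = [[x[m * n1 + n2] for n2 in range(m)] for n1 in range(b)]

    # C_M1 stacks copies of C_B vertically, so its row r is row (r mod b) of C_B.
    # Y (M x M) is that product scaled element by element by W^(r*c).
    Y = [[w ** ((r * c) % n) * sum(c_b[r % b][n1] * X[n1][c] for n1 in range(b))
          for c in range(m)] for r in range(m)]

    # C_M2 places copies of C_B side by side, so its column c is column (c mod b).
    # Z = C_M2 Y^t is b x M; reading it row by row gives z in natural order.
    Z = [[sum(c_b[k][c % b] * Y[r][c] for c in range(m)) for r in range(m)]
         for k in range(b)]
    return [Z[k][r] for k in range(b) for r in range(m)]

x64 = [complex(i % 7, -(i % 3)) for i in range(64)]
err64 = max(abs(p - q) for p, q in zip(base_b_dft(x64), dft_direct(x64)))
print(err64)
```

Both matrix products use only the b × b matrix C_B, whose entries for b = 4 are ±1 and ±I, so the only general complex multiplications are the element-by-element twiddles in W_M.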

“Base-b” FFT Architecture

Base-b DFT equations:

\[ Y = W_M \bullet C_{M1} X_b \]

\[ Z_b = C_{M2} Y^t \]

Base-4 DFT architecture:

[Figure: virtual and physical arrays]

Processing flow for DFT of length N = Nr Nc

1. Nc column DFTs (Xci) of length Nr

2. Nr row DFTs (Xri) of length Nc
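
The two-step flow can be sketched in Python as follows. This is a generic row/column decomposition, not the systolic implementation itself. The slide lists only the column and row DFTs; the twiddle multiplication between the two steps is an assumption taken from the standard Cooley-Tukey factorization, and the function names are hypothetical.

```python
import cmath

def dft(x):
    """Reference DFT from the definition."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[m] * w ** ((m * k) % n) for m in range(n)) for k in range(n)]

def dft_rows_cols(x, nr, nc):
    """DFT of length N = Nr*Nc as Nc column DFTs followed by Nr row DFTs."""
    n = nr * nc
    assert len(x) == n
    w = cmath.exp(-2j * cmath.pi / n)

    # Step 1: Nc column DFTs of length Nr.
    # The input is viewed as an Nr x Nc matrix stored row by row.
    cols = [dft([x[nc * n1 + n2] for n1 in range(nr)]) for n2 in range(nc)]

    z = [0j] * n
    for k1 in range(nr):
        # Assumed twiddle step (standard Cooley-Tukey, not shown on the slide).
        row = [cols[n2][k1] * w ** (n2 * k1) for n2 in range(nc)]
        # Step 2: one of the Nr row DFTs of length Nc.
        out = dft(row)
        for k2 in range(nc):
            z[k1 + nr * k2] = out[k2]
    return z

x12 = [complex(i % 5, -(i % 3)) for i in range(12)]
err12 = max(abs(p - q) for p, q in zip(dft_rows_cols(x12, 4, 3), dft(x12)))
print(err12)
```

The same routine covers square cases such as Nr = Nc = 16 for a 256-point transform.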

Base-4 Array Architecture

[Figure: array processing elements for a 256-point FFT (Nr = Nc = 16) and a 1024-point FFT (Nr = Nc = 32)]

Interconnection Delays

65nm Technology: 256pt FFT

[Figure: FPGA layouts with the critical path highlighted]

• Altera Pipelined FFT: Fmax = 351 MHz

• Systolic: Fmax = 537 MHz

LTE Uplink: Single Carrier FDMA

• DFT spreading of data symbols in frequency domain

– Reduces PAPR in uplink

– Less dependence on frequency offset

• 35 DFT sizes N (12-points to 1296-points)

• Run-time choice of DFT size
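
The count of 35 sizes can be reproduced with a short enumeration. The sketch below rests on an assumption drawn from the LTE uplink specification rather than from the slides: the allowed DFT lengths are 12·2^a·3^b·5^c. The upper limit of 108 resource blocks is inferred from the 1296-point maximum (1296 = 12 × 108), and the function name is hypothetical.

```python
def lte_dft_sizes(max_resource_blocks=108):
    """List DFT lengths 12*rb where rb has no prime factor other than 2, 3, 5."""
    sizes = []
    for rb in range(1, max_resource_blocks + 1):
        rest = rb
        for prime in (2, 3, 5):
            while rest % prime == 0:
                rest //= prime
        if rest == 1:               # rb = 2^a * 3^b * 5^c
            sizes.append(12 * rb)
    return sizes

sizes = lte_dft_sizes()
print(len(sizes), sizes[0], sizes[-1])
```

Every size in the list factors into 2s, 3s, and 5s only, so a programmable array that handles small radices of those primes can cover the whole set at run time.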

LTE Systolic DFT

• Array size uses base-b = 6

• Example →

– N = 520-points (

– Use subset of physical array for P,Q≠6

[Figure labels: 36-pt DFTs, 15-pt DFTs]

Programmability

• Parameter List (Matlab):

– Matrix factorization parameters (ax,by,cz,…)

– Addresses for coefficients

240 points

LTE DFT: FPGA Cycle Counts

Design         Average Latency Time   Average Throughput Rate   Resource Block Computation Time
Altera         1.39                   0.47                      2.01
Xilinx         0.86                   0.65                      1.50
Systolic FFT   1.00                   1.00                      1.00

LTE DFT: FPGA Circuit Usage Comparisons

(65nm Technology)

Design     FPGA          LUT     ALM/LE   Fmax (MHz)
Systolic   Stratix III   3582    2733     394
Xilinx     Virtex-5      4707    3864     276
Altera     Stratix III   2600    n.a.     260
Chen       Virtex-5      7791    n.a.     123

LTE Systolic DFT: Performance Comparisons

Design         Average LTE Resource Block Compute Time
Systolic FFT   1.0
Xilinx         2.1
Altera         3.0

Fixed Size FFT: Power-of-two

• Streaming (continuous data in/out)

• Array size uses base-b = 4

• Altera Stratix III FPGAs (65nm technology)

                       Altera    Systolic FFT   Altera    Systolic FFT
                       20-bits   16-bits        20-bits   16-bits
Transform Size         256       256            1024      1024
ALMs                   4261      3982           4394      4331
Memory Bits (K)        49        40.6           195       145
Multipliers (18-bit)   24        33             24        33
SQNR                   76.6      86.7           81.3      82.8
Sample Rate (MHz)      387       566            382       533
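
For a streaming design, the sample rates in the table translate directly into a time per transform. The sketch below is simple arithmetic under one assumption not stated on the slide: a streaming core accepts one complex sample per cycle at the listed rate, so an N-point block occupies N sample periods. The function name is hypothetical.

```python
# (design, transform size N, sample rate in MHz), copied from the table.
rows = [
    ("Altera", 256, 387),
    ("Systolic FFT", 256, 566),
    ("Altera", 1024, 382),
    ("Systolic FFT", 1024, 533),
]

def transform_time_us(n_points, sample_rate_mhz):
    """Microseconds to stream one N-point block at one sample per cycle."""
    return n_points / sample_rate_mhz

times = {(design, n): transform_time_us(n, rate) for design, n, rate in rows}
for (design, n), t in times.items():
    print(f"{design:13s} N={n:5d}: {t:.3f} us per transform")
```

Under that assumption the ratio of sample rates is also the ratio of transform times for equal N.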

Variable Size FFT: Power-of-two

• Transform sizes: 128/256/512/1024/2048-points

• Streaming (continuous data in/out)

• Run-time transform size

• Array size uses base-b = 4

• Altera Stratix III FPGAs (65nm technology)

                        Systolic FFT             Altera
                        16-bits in/16-bits out   16-bits in/30-bits out
Architecture            Systolic                 Single Delay Feedback
ALMs                    4522                     3826
RAM Memory (K)          290                      208
Multipliers (18-bits)   33                       36
Fmax (MHz)              510                      315

Conclusion: Better FFTs are Possible

• Improved performance

– Algorithmic reduction in computation cycles

– Localized interconnects for high clock speeds (>500MHz for 65nm FPGA technologies)

• Reduced usage of FPGA logic cells

• Programmability

• Throughput scalability due to the use of systolic algorithms

• Higher dynamic range (smaller word lengths needed)