TRANSCRIPT
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech
Forecast
• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
  – Replace fixed-hardware ASICs with programmable GPUs
Carlo del Mundo, [email protected], carlodelmundo.com
Motivation
• FFT is a critical building block across many disciplines
http://www.ajnr.org/content/27/6/1230/F1.large.jpg
http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html
Introduction
• Wideband Channelization
  – Purpose: To isolate channels within a wideband signal
Figure: Stages in a PFB Channelizer http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html
Introduction (Channelization)
• Algorithm: Polyphase filter bank (PFB) channelizer
  – Problem: FFT stage grows fastest in channelization
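To make the role of the FFT stage concrete, here is a minimal CPU-side sketch of one output frame of a PFB channelizer in Python. It assumes one common formulation (weight the input by the prototype filter, fold it into one value per channel, then transform); the slides do not spell out the stages, so the function names, the all-ones filter, and the direct DFT standing in for an optimized FFT are all illustrative assumptions.

```python
import cmath


def dft(values):
    """Direct O(N^2) DFT, standing in for an optimized FFT."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for j, v in enumerate(values))
        for k in range(n)
    ]


def pfb_channelize_block(samples, taps, num_channels):
    """Produce one output sample per channel from one block of input.

    samples and taps must have equal length, a multiple of num_channels.
    """
    assert len(samples) == len(taps)
    assert len(taps) % num_channels == 0
    # Stage 1: weight the input by the prototype filter taps.
    weighted = [s * h for s, h in zip(samples, taps)]
    # Stage 2: fold the polyphase branches (indices m, m + M, m + 2M, ...).
    folded = [sum(weighted[m::num_channels]) for m in range(num_channels)]
    # Stage 3: the transform across branches separates the channels.
    return dft(folded)


if __name__ == "__main__":
    channels, taps_per_branch = 4, 2
    length = channels * taps_per_branch
    # A complex tone centred on channel 1.
    tone = [cmath.exp(2j * cmath.pi * n / channels) for n in range(length)]
    flat_filter = [1.0] * length  # hypothetical prototype filter
    for k, value in enumerate(pfb_channelize_block(tone, flat_filter, channels)):
        print(k, round(abs(value), 6))
```

In this formulation the filter stages cost a fixed number of multiply-adds per input sample, while the transform is the stage whose work per sample depends on the number of channels.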
Choosing the Right Processor
• Criteria: Programmability & Performance
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpg
http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg
http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
Outline
• Motivation
• Introduction
• Background
• Approach
  – System-level optimizations
  – Algorithm-level optimizations
• Results
  – Optimizations in isolation
  – Optimizations in concert
• Conclusion
Background (GPUs)
• GPU Memory Hierarchy
  – Global Memory
  – Image Memory
  – Constant Memory
  – Local Memory
  – Registers

Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit      Read Bandwidth (TB/s)
Registers        16.2
Constant         5.4
Local            2.7
L1/L2 Cache      1.35 / 0.45
Global           0.17
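The gaps in the table are easier to judge as ratios. The short Python sketch below divides each read bandwidth by the global-memory figure; the numbers are copied from the table, and splitting the "L1/L2 Cache" row into two entries is only a convenience for the calculation.

```python
# Read bandwidths in TB/s for the Radeon HD 6970, copied from the table.
READ_BANDWIDTH_TBPS = {
    "Registers": 16.2,
    "Constant": 5.4,
    "Local": 2.7,
    "L1 cache": 1.35,
    "L2 cache": 0.45,
    "Global": 0.17,
}


def speedup_over_global(unit):
    """Ratio of a memory unit's read bandwidth to global memory's."""
    return READ_BANDWIDTH_TBPS[unit] / READ_BANDWIDTH_TBPS["Global"]


if __name__ == "__main__":
    for unit in READ_BANDWIDTH_TBPS:
        print(f"{unit:10s} {speedup_over_global(unit):6.1f}x")
```

By these figures registers read roughly 95 times faster than global memory and local memory roughly 16 times faster, which is the motivation for keeping FFT operands on chip.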
Approach
• Act as the “human compiler”
  1. Derive a candidate set of optimizations for FFT on GPUs
  2. Apply optimizations in isolation
  3. Apply optimizations in concert
[Figure: Candidate Optimizations; Optimizations in Isolation; Optimizations in Concert]
• System-level Optimizations (applicable to any application)
  1. Register Preloading
  2. Vector Access/{Vector,Scalar} Arithmetic
  3. Constant Memory Usage
  4. Dynamic Instruction Reduction
  5. Memory Coalescing
  6. Image Memory
• Algorithm-level Optimizations
  1. Naïve Transpose (LM-CM)
  2. Compute/Transpose via LM (LM-CC)
  3. Compute/No Transpose via LM (LM-CT)
C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.
System-level Optimizations
1. Register Preloading (RP)
  – Load to registers first
Without Register Preloading
    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are passed as pointers into global memory
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // (remainder of the kernel is not shown on the slide)
    }
With Register Preloading
    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3;  // Register declaration
        // Explicit loads from global memory into registers
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // (remainder of the kernel is not shown on the slide)
    }
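The body of FFT4_in_order_output is not shown on the slides. As a rough illustration of what a 4-point FFT computes once its four operands sit in private variables, here is a generic radix-4 butterfly in Python. The forward sign convention, the function names, and the use of Python locals as stand-ins for registers are assumptions made for this sketch, not the authors' kernel.

```python
def fft4(a0, a1, a2, a3):
    """4-point forward DFT of four complex values, outputs in natural order."""
    s02, d02 = a0 + a2, a0 - a2
    s13, d13 = a1 + a3, a1 - a3
    return (
        s02 + s13,       # X[0]
        d02 - 1j * d13,  # X[1]
        s02 - s13,       # X[2]
        d02 + 1j * d13,  # X[3]
    )


def fft4_with_preload(buffer, indices=(0, 1, 2, 3)):
    """Copy the four operands into locals first, then run the butterfly."""
    r0, r1, r2, r3 = (buffer[i] for i in indices)  # explicit loads
    return fft4(r0, r1, r2, r3)


if __name__ == "__main__":
    print(fft4_with_preload([1, 2, 3, 4]))
```

Every operand is used more than once inside the butterfly, so on a GPU reading each one from global memory a single time and reusing it from a register avoids repeated slow accesses.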
2. Vector Access (float{2, 4, 8, 16})
  – Scalar Math (VASM)
    • float + float
  – Vector Math (VAVM)
    • float4 + float4
[Figure: vector access of a[0] a[1] a[2] a[3], followed by scalar or vector addition]
Algorithm-level optimizations
• Transpose – elements across the diagonal are exchanged
[Figure: 4x4 matrix (Original) and its Transposed matrix]
1. Naïve Transpose (LM-CM)
[Figure: Original and Transposed matrices; threads t0 t1 t2 t3; Register File; Local Memory]
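One way to read the figure is that each thread holds one row of the 4x4 matrix in its registers and the threads use local memory purely to exchange elements. The Python sketch below emulates that communication-only pattern on the CPU; the row-per-thread layout and the function name are assumptions, since the slides only show the figure labels.

```python
def transpose_via_local_memory(registers):
    """Emulate an n-thread transpose that communicates through local memory.

    registers[t] is the row held by thread t. Returns the rows each thread
    holds afterwards, which together form the transposed matrix.
    """
    n = len(registers)
    local_memory = [None] * (n * n)
    # Phase 1: every thread writes its own row to local memory (row-major).
    for t in range(n):
        for j in range(n):
            local_memory[t * n + j] = registers[t][j]
    # On a GPU a work-group barrier would separate the two phases.
    # Phase 2: every thread reads back one column with stride n.
    return [[local_memory[j * n + t] for j in range(n)] for t in range(n)]


if __name__ == "__main__":
    original = [[4 * t + j for j in range(4)] for t in range(4)]
    for row in transpose_via_local_memory(original):
        print(row)
```

No arithmetic happens while the data is in local memory here, which matches the "Communication Only" reading of LM-CM in the results legend.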
3. The pseudo transpose (LM-CT)
  – Idea:
    • Load data to local memory
    • Perform computation on columns, then rows.
  – Advantage:
    • Skips the transpose step
  – Disadvantage:
    • Local memory has lower throughput than registers.
[Figure: Original and Transposed matrices in Local Memory]
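The "columns, then rows" idea can be checked on the CPU with a 16-point Cooley-Tukey FFT split as 4x4. The Python sketch below keeps the data in one 4x4 tile (standing in for local memory), transforms the columns, applies twiddle factors, then transforms the rows, with no explicit transpose in between. The tile layout, the output indexing, and the function names are assumptions for illustration, not the authors' kernel.

```python
import cmath


def dft(values):
    """Direct DFT, used both as the 4-point building block and as a reference."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for j, v in enumerate(values))
        for k in range(n)
    ]


def fft16_columns_then_rows(x):
    """16-point FFT as 4x4 Cooley-Tukey without a transpose step."""
    assert len(x) == 16
    # The tile is stored row-major: tile[r][c] = x[4 * r + c].
    tile = [list(x[4 * r:4 * r + 4]) for r in range(4)]
    # Pass 1: 4-point DFT down each column, then scale by the twiddle factor.
    for c in range(4):
        column = dft([tile[r][c] for r in range(4)])
        for k1 in range(4):
            twiddle = cmath.exp(-2j * cmath.pi * c * k1 / 16)
            tile[k1][c] = column[k1] * twiddle
    # Pass 2: 4-point DFT along each row, operating on the same tile.
    for k1 in range(4):
        tile[k1] = dft(tile[k1])
    # Output bin k1 + 4 * k2 is found at tile[k1][k2].
    return [tile[k % 4][k // 4] for k in range(16)]


if __name__ == "__main__":
    signal = [complex(n, -0.5 * n) for n in range(16)]
    error = max(abs(a - b) for a, b in zip(fft16_columns_then_rows(signal), dft(signal)))
    print("max abs error vs direct DFT:", error)
```

The transpose is absorbed into the indexing: the second pass walks rows of the same tile, and the final read picks bins in column-major order. The cost named on the slide still applies, because both passes read their operands from the slower local memory rather than from registers.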
Results (Experimental Testbed)
• Algorithm:
  – 1D FFT (batched), N = 16 pts
  – Cooley-Tukey Decomposition

Table: GPU Testbed
Device (AMD Radeon)   Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)
HD 7970               2048    3788                        264
HD 6970 (VLIW)        1536    2703                        176
HD 5870 (VLIW)        1600    2720                        154
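A back-of-the-envelope estimate suggests why memory traffic matters so much for this workload. The Python sketch below combines the testbed's peak numbers with three assumptions that are not from the talk: the conventional 5 N log2 N flop count for an FFT, 8 bytes per point (single-precision complex, as with float2), and each 16-point FFT reading and writing its data in global memory exactly once. The resulting figures are estimates under those assumptions, not measurements.

```python
import math

# (peak GFLOPS, peak bandwidth in GB/s), copied from the testbed table.
TESTBED = {
    "HD 7970": (3788.0, 264.0),
    "HD 6970": (2703.0, 176.0),
    "HD 5870": (2720.0, 154.0),
}

N = 16               # FFT size used in the experiments
BYTES_PER_POINT = 8  # assumption: one float2 per point

FLOPS_PER_FFT = 5 * N * math.log2(N)     # assumption: 5 N log2 N convention
BYTES_PER_FFT = 2 * N * BYTES_PER_POINT  # assumption: one read plus one write
INTENSITY = FLOPS_PER_FFT / BYTES_PER_FFT  # flops per byte of global traffic


def bandwidth_bound_gflops(device):
    """Upper bound on GFLOPS when limited by global-memory bandwidth."""
    peak_gflops, peak_gbps = TESTBED[device]
    return min(peak_gflops, INTENSITY * peak_gbps)


if __name__ == "__main__":
    for device, (peak_gflops, _) in TESTBED.items():
        bound = bandwidth_bound_gflops(device)
        print(f"{device}: bound {bound:.1f} GFLOPS, peak {peak_gflops:.0f} GFLOPS")
```

Under these assumptions the bandwidth-bound estimate sits far below each device's arithmetic peak, so a batched 16-point FFT would be limited by memory traffic rather than by computation.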
Results (in isolation)
[Figure: results of applying each optimization in isolation. Panels: AMD Radeon HD 7970 (Scalar, non-VLIW); AMD Radeon HD 5870/6970 (VLIW)]
Legend: IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math {floatn}; VAVM{n}: Vectorized Access & Vector Math {floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
[Figure callouts: 100% and 160% for item 1; 40% and 0% (No Change) for item 2]
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
10%
41%
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
0% (No Change)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
20%
40%
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
0% (No Change)
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)2. 0% - Dynamic instruction reduction (LU,
CSE, IL)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
0% (No Change)
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)2. 0% - Dynamic instruction reduction (LU,
CSE, IL)
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)2. 0% - Dynamic instruction reduction (LU,
CSE, IL)3. 18% - Avoid large vectors & vector math
(VASM16, VAVM8/16)
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
synergy.cs.vt.edu
Accelerating Fast Fourier Transform for Wideband Channelization
Results (in isolation)
Carlo del Mundo, [email protected], carlodelmundo.com
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Improvements to Baseline (Max. % Increase)1. 160% - Minimize bus traffic via on-
chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
Neutral/Detrimental to Baseline (Min. % Decrease)1. 20% - Naïve transpose (LM-CM),
40% - Constant Memory (CM-K, CM-L)2. 0% - Dynamic instruction reduction (LU,
CSE, IL)3. 18% - Avoid large vectors & vector math
(VASM16, VAVM8/16)
61%
39%
50%
AMD Radeon HD 7970 (Scalar, non-VLIW)AMD Radeon HD 5870/6970 (VLIW)
Results (in isolation)
GPUs: AMD Radeon HD 7970 (Scalar, non-VLIW); AMD Radeon HD 5870/6970 (VLIW)
Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
Neutral/Detrimental to Baseline (Min. % Decrease)
1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L)
2. 0% - Dynamic instruction reduction (LU, CSE, IL)
3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16)
[Figure: bar chart of performance change relative to baseline for each optimization]
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
Results (in concert)
• Improvements (Max. % Increase)
  – {RP + LM-CM} best on-chip optimization
  – Use Constant Memory (CM) for twiddle calculations
  – Use global memory (instead of image memory)
  – Optimal set for AMD GPUs
    • RP – Register Preloading
    • LM-CM – Transpose via local memory
    • CM – Constant memory usage
    • CGAP – Coalesced Global Access Pattern
    • VASM2 – Vector Access, Scalar Math (float2)
[Figure: bar chart of speedups over baseline for combined optimizations; annotated speedups range from 1.5x to 6.5x]
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
2 All implementations are coalesced (CGAP) and use VASM2.
3 The speedups listed in the graph apply only to the Radeon HD 7970 (for brevity).
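As a rough illustration of the LM-CM entry in the optimal set (transpose via local memory), here is a host-side C model, not the authors' kernel. It assumes 4 work-items holding 4 elements each for a 16-point FFT. The names `transpose_via_local`, `WORK_ITEMS`, and `ELEMS` are hypothetical, and the work-items are simulated with ordinary loops.

```c
#define WORK_ITEMS 4  /* assumed work-items cooperating on one 16-point FFT */
#define ELEMS 4       /* assumed elements held in registers per work-item */

/* regs[t][e] models element e in the registers of simulated work-item t.
 * After the call, work-item t holds element t of every other work-item. */
void transpose_via_local(float regs[WORK_ITEMS][ELEMS]) {
    float local_mem[WORK_ITEMS * ELEMS];  /* models the on-chip local buffer */

    /* Phase 1: every work-item writes its registers to local memory. */
    for (int t = 0; t < WORK_ITEMS; ++t)
        for (int e = 0; e < ELEMS; ++e)
            local_mem[t * ELEMS + e] = regs[t][e];

    /* On a GPU, a work-group barrier would separate the two phases. */

    /* Phase 2: work-item t reads back column t of the 4x4 tile. */
    for (int t = 0; t < WORK_ITEMS; ++t)
        for (int e = 0; e < ELEMS; ++e)
            regs[t][e] = local_mem[e * ELEMS + t];
}
```

In this model, local memory is used only to exchange data, and all arithmetic would stay in registers. That matches the "Communication Only" description in the legend.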
Results (1D FFT 16-pts, GPU versions)
• Optimized GPU is faster than the baseline GPU by a factor of 14.5
Conclusions
• Contributions:
  – A portable building block for FFT towards GPU-based radios
  – Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
• Optimal set for AMD GPUs
  – RP – Register Preloading
  – LM-CM – Transpose via local memory
  – CM – Constant memory usage
  – CGAP – Coalesced Global Access Pattern
  – VASM2 – Vector Access, Scalar Math (float2)
• Contact:
  – Carlo del Mundo
  – [email protected]
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html
Appendix Slides
Introduction (FFT)
• Fast Fourier Transform (FFT)
  – A spectral method
• Key computational idiom for present and future applications (dwarf)§
List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce
§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
Background (Optimizing on GPUs)
1. RP (Register Preloading) - All data elements are first preloaded into the register file of the respective GPU. Computation is performed solely on registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by reorganizing the algorithm.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors for the twiddle multiplication stage of the FFT are precomputed on the CPU and stored in GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions to save computation. It may increase register live time and therefore register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for functions that are called frequently.
10. IM (Image Memory) - A texture image is used in place of global memory.
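To make the CGAP definition in item 2 concrete, the sketch below contrasts a coalesced index mapping with a blocked one, in plain C. It is an illustration only: the function names and the `num_threads` / `elems_per_thread` parameters are hypothetical and are not taken from the authors' kernels.

```c
/* Coalesced mapping: on its i-th load, work-item `tid` touches element
 * i * num_threads + tid, so neighbouring work-items read neighbouring
 * elements (for i = 0, the kth work-item reads element k). */
int coalesced_index(int tid, int i, int num_threads) {
    return i * num_threads + tid;
}

/* Blocked mapping, for contrast: each work-item walks its own contiguous
 * chunk, so neighbouring work-items are elems_per_thread elements apart. */
int blocked_index(int tid, int i, int elems_per_thread) {
    return tid * elems_per_thread + i;
}
```

With the coalesced mapping, adjacent work-items always differ by one element. With the blocked mapping, they differ by `elems_per_thread`, which spreads the loads of one wavefront across memory.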
Motivation (GPU FFT vs. CPU FFT)
• GPU FFT outperforms CPU FFT by factors as high as 6.5*
  – 1D batched FFT, N = 16 pts
* Device-Host Data Transfer Not Included
Introduction (Channelizer Architecture)
• Channelizer Architecture
  – FIR Filtering, FFT, and Channel Mapping.
S3: Constant Memory
• Fast cached lookup for frequently used data
__constant float2 twiddles[16] = {
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    (float2)(1.0f, 0.0f), (float2)(1.0f, 0.0f),
    ... more sin/cos values
};

Without Constant Memory
for (int j = 1; j < 4; ++j)
{
    double theta = -2.0 * M_PI * tid * j / 16;
    float2 twid = make_float2(cos(theta), sin(theta));
    result[j] = buffer[j*4] * twid;
}

With Constant Memory
for (int j = 1; j < 4; ++j)
    result[j] = buffer[j*4] * twiddles[4*j+tid];
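A host-side C sketch of the lookup variant follows, under several stated assumptions. The slide elides most of the table, so the entries here are reconstructed from the slide's formula theta = -2*pi*tid*j/16, stored at index 4*j + tid, with tid assumed to range over 0..3. The `*` in the slide code is taken to mean a complex product. `cfloat`, `cmul`, and `apply_twiddles` are hypothetical names and are not part of the original kernel.

```c
typedef struct { float re, im; } cfloat;  /* stand-in for OpenCL float2 */

/* Assumed layout: entry 4*j + tid holds exp(-2*pi*i * tid * j / 16). */
const cfloat twiddles[16] = {
    /* j = 0 */ { 1.0f, 0.0f }, { 1.0f, 0.0f },
                { 1.0f, 0.0f }, { 1.0f, 0.0f },
    /* j = 1 */ { 1.0f, 0.0f }, { 0.9238795f, -0.3826834f },
                { 0.7071068f, -0.7071068f }, { 0.3826834f, -0.9238795f },
    /* j = 2 */ { 1.0f, 0.0f }, { 0.7071068f, -0.7071068f },
                { 0.0f, -1.0f }, { -0.7071068f, -0.7071068f },
    /* j = 3 */ { 1.0f, 0.0f }, { 0.3826834f, -0.9238795f },
                { -0.7071068f, -0.7071068f }, { -0.9238795f, 0.3826834f },
};

/* Complex product (a.re + i*a.im) * (b.re + i*b.im). */
cfloat cmul(cfloat a, cfloat b) {
    cfloat r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Model of the "With Constant Memory" loop for one work-item: only
 * result[1..3] are written, mirroring the loop bounds on the slide. */
void apply_twiddles(const cfloat buffer[16], int tid, cfloat result[4]) {
    for (int j = 1; j < 4; ++j)
        result[j] = cmul(buffer[j * 4], twiddles[4 * j + tid]);
}
```

The table replaces a `cos`/`sin` evaluation per loop iteration with a single indexed read, which is the trade the slide describes.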