

Source: developer.download.nvidia.com/GTC/PDF/GTC2012/Posters/P0228_MI…

High Throughput MIMO-OFDM Detection with Graphics Processing Units

Dan Sui (1), Peng Wang (2), Yunzhou Li (1), Yangdong Deng (1), Bin Zhou (2), Jing Wang (1)

(1) Tsinghua University, Beijing 100084, China; (2) NVIDIA, China

System Model

Conclusions

Performance Results

Optimization Strategy

Contact Information
Dr. Dan Sui
Wireless & Mobile Communication Technology R&D Center
Research Institute of Information Technology (RIIT), Tsinghua University
[email protected]

MIMO-OFDM Transceiver Model

Minimum Mean Square Error (MMSE) Detector

Problem Statement
• implement an MMSE-based MIMO detector efficiently
• achieve a throughput over 100 Mbps
• satisfy LTE/LTE-Advanced requirements

Main difficulties
• handle millions of small matrices, e.g. 2×2, 4×4, 8×8
• matrix inversion is the most time-consuming step

GPU processing flow
Copy y, H to GPU → MMSE detector → Copy x̂ to CPU
• Kernel 1: compute J
• Kernel 2: compute J^-1
• Kernel 3: compute x̂

time (ms)  Bandwidth (MHz)  CPU (ms)  GPU Sync (ms)                   GPU Async (ms)
                                      kernel  data transfer  total
10         5                19.94     1.43    1.24           2.67     1.60
10         10               39.92     2.65    1.61           4.26     2.89
10         15               59.79     3.95    2.00           5.95     4.19
10         20               92.44     5.20    2.53           7.73     5.37
20         5                39.92     2.65    1.61           4.26     2.89
20         10               79.80     5.20    2.44           7.64     5.37
20         15               119.23    7.72    3.38           11.10    8.00
20         20               184.68    10.26   4.22           14.48    10.53
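The CPU and GPU Async columns of the execution-time table can be turned into speedup factors with simple division. The following is a minimal sketch of that arithmetic; the dictionary `TIMES` and the function `speedups` are illustrative names, not part of the poster, and the numbers are copied from the table.

```python
# (time in ms, bandwidth in MHz) -> (CPU time in ms, GPU async time in ms),
# copied from the poster's execution-time table.
TIMES = {
    (10, 5): (19.94, 1.60),
    (10, 10): (39.92, 2.89),
    (10, 15): (59.79, 4.19),
    (10, 20): (92.44, 5.37),
    (20, 5): (39.92, 2.89),
    (20, 10): (79.80, 5.37),
    (20, 15): (119.23, 8.00),
    (20, 20): (184.68, 10.53),
}


def speedups(times):
    """Return the CPU-to-GPU-async speedup factor for every table row."""
    return {key: cpu / gpu for key, (cpu, gpu) in times.items()}


if __name__ == "__main__":
    for (duration, bandwidth), factor in speedups(TIMES).items():
        print(f"time={duration} ms, bandwidth={bandwidth} MHz: {factor:.1f}x")
```

With these figures the asynchronous GPU version is between roughly 12 and 17.5 times faster than the CPU version, with the largest factor in the 20 ms, 20 MHz row.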

[Figure: MIMO-OFDM transceiver block diagram. Transmitter: Input Data → Channel Encoder → MIMO Encoder → OFDM Modulators 0, 1, …, M-2, M-1, which send x_0(n), x_1(n), …, x_{M-2}(n), x_{M-1}(n). Receiver: y_0(n), y_1(n), …, y_{N-2}(n), y_{N-1}(n) → OFDM Demodulators 0, 1, …, N-2, N-1 → MIMO Detection → Channel Decoder → Output Data.]

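Received signal:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}$$

MMSE estimate:

$$\hat{\mathbf{x}} = \mathbf{G}_{\mathrm{MMSE}}\,\mathbf{y}$$

MMSE filter:

$$\mathbf{G}_{\mathrm{MMSE}} = \left(\mathbf{H}^{H}\mathbf{H} + \frac{1}{\rho}\,\mathbf{I}_{M}\right)^{-1}\mathbf{H}^{H} = \mathbf{J}^{-1}\mathbf{H}^{H}$$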
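The three kernels of the MMSE detector (compute J, compute J^-1, compute x̂) can be sketched on the CPU for a single 2×2 channel matrix. This is a pure-Python illustration only, not the poster's GPU code: the function names are hypothetical, J is taken to be H^H H + (1/ρ) I as implied by the kernel sequence, ρ is treated as a plain positive number supplied by the caller, and the 2×2 closed-form inverse stands in for the general inversion step.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inv2x2(J):
    """Closed-form inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = J
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mmse_detect_2x2(H, y, rho):
    """Return x_hat = J^-1 H^H y for a channel matrix H with two columns."""
    Hh = hermitian(H)
    # Step 1 (Kernel 1 in the poster's flow): J = H^H H + (1/rho) I
    J = matmul(Hh, H)
    for i in range(2):
        J[i][i] += 1.0 / rho
    # Step 2 (Kernel 2): invert J
    J_inv = inv2x2(J)
    # Step 3 (Kernel 3): x_hat = J^-1 H^H y
    G = matmul(J_inv, Hh)
    return [sum(G[i][k] * y[k] for k in range(len(y))) for i in range(2)]
```

For example, with H equal to the 2×2 identity and `rho=1.0`, J is 2I, so the estimate is simply half of y.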

Gaussian elimination with complete pivoting
• permute the largest element (in magnitude) of the current submatrix A(k : N-1, k : N-1) into position (k, k)
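A minimal CPU sketch of matrix inversion with complete pivoting follows. It uses a Gauss-Jordan formulation (eliminating above and below each pivot), which is an assumption made for brevity, since the poster does not say which elimination variant its kernel uses. The function name is hypothetical.

```python
def invert_complete_pivoting(A):
    """Invert a small square matrix using Gauss-Jordan elimination
    with complete (row and column) pivoting."""
    n = len(A)
    a = [[complex(v) for v in row] for row in A]            # working copy
    inv = [[complex(i == j) for j in range(n)] for i in range(n)]
    col_swaps = []                                          # column swapped with k at step k
    for k in range(n):
        # Locate the largest-magnitude element of the submatrix a[k:, k:].
        p, q = max(((i, j) for i in range(k, n) for j in range(k, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) == 0.0:
            raise ValueError("matrix is singular")
        # Move it to position (k, k): row swap on both matrices,
        # column swap on the working copy only.
        a[k], a[p] = a[p], a[k]
        inv[k], inv[p] = inv[p], inv[k]
        for row in a:
            row[k], row[q] = row[q], row[k]
        col_swaps.append(q)
        # Normalise the pivot row, then clear column k in every other row.
        pivot = a[k][k]
        a[k] = [v / pivot for v in a[k]]
        inv[k] = [v / pivot for v in inv[k]]
        for i in range(n):
            if i != k:
                f = a[i][k]
                a[i] = [u - f * v for u, v in zip(a[i], a[k])]
                inv[i] = [u - f * v for u, v in zip(inv[i], inv[k])]
    # Undo the column swaps: each one becomes a row swap of the result,
    # applied in reverse order.
    for k in reversed(range(n)):
        q = col_swaps[k]
        inv[k], inv[q] = inv[q], inv[k]
    return inv
```

Complete pivoting searches the whole remaining submatrix at every step, which costs more comparisons than partial pivoting but keeps the pivots as large as possible; for the 2×2 to 8×8 sizes mentioned in the poster the search is small.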

One matrix per thread
• consumes too much shared memory
• limits the number of concurrent threads on an SM

One matrix per block
• yields little parallelism across the threads in a block, since the matrix dimension is typically 2×2 to 8×8

Our method: multiple matrices per block
• each thread reads N elements of a single matrix into shared memory
• N threads process one matrix inversion in parallel
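The multiple-matrices-per-block assignment can be illustrated with a small index-mapping sketch. Several details here are assumptions, not taken from the poster: that a thread's N elements form one row, that matrices are stored contiguously in row-major order in shared memory, and the function and parameter names themselves.

```python
def thread_to_elements(block_idx, thread_idx, n, matrices_per_block):
    """Map one thread to the matrix, row, and shared-memory offsets it loads.

    Assumes (hypothetically) that each group of n consecutive threads owns
    one n x n matrix and that each thread loads one row of it.
    """
    local_matrix = thread_idx // n          # which matrix inside the block
    row = thread_idx % n                    # which row of that matrix
    global_matrix = block_idx * matrices_per_block + local_matrix
    base = local_matrix * n * n + row * n   # row-major offset in shared memory
    offsets = [base + col for col in range(n)]
    return global_matrix, row, offsets
```

With `n=4` and `matrices_per_block=8`, a block uses 32 threads, and together they load all 128 elements of its eight matrices exactly once.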

Fine-tuning optimization
• overlap kernel execution and data transfer
• transfer data between the CPU and GPU while a kernel is executing
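The effect of the overlap can be estimated from the poster's own timing table: the synchronous total is kernel time plus transfer time, so the difference between that sum and the asynchronous time is the transfer time hidden behind kernel execution. The sketch below does only that arithmetic; `ROWS` and `hidden_transfer` are illustrative names, and the numbers are copied from the table.

```python
# (kernel ms, data transfer ms, GPU async ms) for the eight table rows,
# in the order they appear in the poster.
ROWS = [
    (1.43, 1.24, 1.60),
    (2.65, 1.61, 2.89),
    (3.95, 2.00, 4.19),
    (5.20, 2.53, 5.37),
    (2.65, 1.61, 2.89),
    (5.20, 2.44, 5.37),
    (7.72, 3.38, 8.00),
    (10.26, 4.22, 10.53),
]


def hidden_transfer(rows):
    """Return (hidden ms, hidden fraction of transfer time) for each row."""
    result = []
    for kernel, transfer, async_total in rows:
        hidden = kernel + transfer - async_total  # time saved by overlapping
        result.append((hidden, hidden / transfer))
    return result


if __name__ == "__main__":
    for hidden, fraction in hidden_transfer(ROWS):
        print(f"hidden: {hidden:.2f} ms ({fraction:.0%} of transfer time)")
```

On these figures, overlapping hides roughly 85% to 94% of the transfer time, so the asynchronous total sits only slightly above the kernel time alone.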

[Figure: matrices assigned to one block. The block holds B matrices J_0, J_1, J_2, …, J_{B-1}; matrix J_0 is shown with elements J_0(0, 0), J_0(0, 1), …, J_0(N-1, N-1), handled by thread 0, thread 1, …, thread N-1.]

Execution time for CPU and GPU under different circumstances

[Figure: throughput comparison. Panels: 2×2 MIMO, 16QAM, 5 MHz; 4×4 MIMO, 64QAM, 20 MHz. Axes: time (ms) and throughput (Mbps).]