high throughput mimo-ofdm detection with graphics...

High Throughput MIMO-OFDM Detection with Graphics Processing Units

Dan Sui¹, Peng Wang², Yunzhou Li¹, Yangdong Deng¹, Bin Zhou², Jing Wang¹

¹ Tsinghua University, Beijing 100084, China; ² NVIDIA, China

System Model

Conclusions

Performance Results

Optimization Strategy

Contact Information
Dr. Dan Sui
Wireless & Mobile Communication Technology R&D Center
Research Institute of Information Technology (RIIT)
Tsinghua [email protected]

MIMO-OFDM Transceiver Model

Minimum Mean Square Error (MMSE) Detector

Problem Statement

[Figure: GPU processing flow. Copy y, H to GPU → MMSE detector (Kernel 1: compute J; Kernel 2: compute J⁻¹; Kernel 3: compute x̂) → copy x̂ to CPU.]

Time (ms)  Bandwidth (MHz)  CPU (ms)  GPU Sync (ms)                 GPU Async (ms)
                                      kernel  data transfer  total
10         5                19.94     1.43    1.24           2.67   1.60
10         10               39.92     2.65    1.61           4.26   2.89
10         15               59.79     3.95    2.00           5.95   4.19
10         20               92.44     5.20    2.53           7.73   5.37
20         5                39.92     2.65    1.61           4.26   2.89
20         10               79.80     5.20    2.44           7.64   5.37
20         15               119.23    7.72    3.38           11.10  8.00
20         20               184.68    10.26   4.22           14.48  10.53
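The timings above can be turned into speedup factors with a few lines of arithmetic. The sketch below is illustrative only: the row values are copied from the execution-time table, the GPU Sync value is the "total" column, and the names `ROWS` and `speedups` are hypothetical.

```python
# Rows copied from the execution-time table:
# (time_ms, bandwidth_mhz, cpu_ms, gpu_sync_total_ms, gpu_async_ms)
ROWS = [
    (10, 5, 19.94, 2.67, 1.60),
    (10, 10, 39.92, 4.26, 2.89),
    (10, 15, 59.79, 5.95, 4.19),
    (10, 20, 92.44, 7.73, 5.37),
    (20, 5, 39.92, 4.26, 2.89),
    (20, 10, 79.80, 7.64, 5.37),
    (20, 15, 119.23, 11.10, 8.00),
    (20, 20, 184.68, 14.48, 10.53),
]


def speedups(rows):
    """Return (time, bandwidth, CPU/GPU-sync ratio, CPU/GPU-async ratio) per row."""
    return [(t, bw, cpu / sync, cpu / asyn) for t, bw, cpu, sync, asyn in rows]


for t, bw, s_sync, s_async in speedups(ROWS):
    print(f"{t:>2} ms, {bw:>2} MHz: sync {s_sync:5.2f}x, async {s_async:5.2f}x")
```

In every row the asynchronous total is lower than the synchronous total, so the asynchronous speedup over the CPU is the larger of the two.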

• efficiently implement an MMSE-based MIMO detector
• achieve a throughput over 100 Mbps
• satisfy LTE/LTE-Advanced requirements

Main difficulties
• handle millions of small matrices, e.g. 2×2, 4×4, 8×8
• matrix inversion is the most time-consuming step

[Figure: MIMO-OFDM transceiver block diagram. Transmitter: Input Data → Channel Encoder → MIMO Encoder → OFDM Modulators 0 … M-1, producing x_0(n) … x_{M-1}(n). Receiver: y_0(n) … y_{N-1}(n) → OFDM Demodulators 0 … N-1 → MIMO Detection → Channel Decoder → Output Data.]

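$$ \hat{\mathbf{x}} = \mathbf{G}_{\mathrm{MMSE}}\,\mathbf{y} $$

$$ \mathbf{G}_{\mathrm{MMSE}} = \left(\mathbf{H}^{H}\mathbf{H} + \frac{M}{\rho}\,\mathbf{I}\right)^{-1}\mathbf{H}^{H} = \mathbf{J}^{-1}\mathbf{H}^{H} $$

$$ \mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w} $$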
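As a small worked instance of the detector equations, the sketch below evaluates x̂ = J⁻¹ Hᴴ y for a single 2×2 channel in plain Python, with J = Hᴴ H + (M/ρ) I. It is a CPU-side illustration, not the poster's GPU kernels. All function names are hypothetical, ρ is assumed to be the signal-to-noise ratio, and the closed-form 2×2 inverse is used only to keep the example short.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix (assumed non-singular)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mmse_detect_2x2(H, y, rho):
    """Return x_hat = J^-1 H^H y with J = H^H H + (M / rho) I for a 2x2 channel."""
    M = len(H[0])                      # number of transmit antennas
    Hh = hermitian(H)
    J = matmul(Hh, H)                  # step 1: compute J
    for i in range(M):
        J[i][i] += M / rho
    G = matmul(inverse_2x2(J), Hh)     # step 2: invert J and form G_MMSE
    return [sum(G[i][k] * y[k] for k in range(len(y))) for i in range(M)]  # step 3


# Noise-free example: at very high rho the estimate approaches the sent symbols.
H = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j]]
x = [1 + 1j, -1 - 1j]
y = [sum(H[i][k] * x[k] for k in range(2)) for i in range(2)]
x_hat = mmse_detect_2x2(H, y, rho=1e6)
print(x_hat)
```

The three commented steps mirror the compute J, compute J⁻¹, compute x̂ split; in a real receiver the same computation is repeated independently for every subcarrier.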

Gaussian elimination with complete pivoting: at step k, permute the largest-magnitude element of the current submatrix A(k : N-1, k : N-1) into position (k, k).
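The pivoting rule can be sketched as a sequential Gauss-Jordan inversion in plain Python. This is a minimal reference version, not the GPU kernel: the function name `invert_complete_pivoting` and the augmented-matrix layout are assumptions made for illustration.

```python
def invert_complete_pivoting(A):
    """Invert a square matrix by Gauss-Jordan elimination with complete pivoting."""
    n = len(A)
    # Work on the augmented matrix [A | I].
    aug = [[complex(v) for v in A[i]] + [complex(1 if i == j else 0) for j in range(n)]
           for i in range(n)]
    col_perm = list(range(n))  # col_perm[j]: original column now sitting at position j
    for k in range(n):
        # Locate the largest-magnitude element of the submatrix A(k:n-1, k:n-1).
        p, q = max(((r, c) for r in range(k, n) for c in range(k, n)),
                   key=lambda rc: abs(aug[rc[0]][rc[1]]))
        if abs(aug[p][q]) == 0:
            raise ValueError("matrix is singular")
        aug[k], aug[p] = aug[p], aug[k]            # row swap moves the pivot to row k
        if q != k:                                 # column swap moves it to column k
            for row in aug:
                row[k], row[q] = row[q], row[k]
            col_perm[k], col_perm[q] = col_perm[q], col_perm[k]
        pivot = aug[k][k]
        aug[k] = [v / pivot for v in aug[k]]       # normalise the pivot row
        for r in range(n):
            if r != k and aug[r][k] != 0:          # eliminate column k in other rows
                f = aug[r][k]
                aug[r] = [vr - f * vk for vr, vk in zip(aug[r], aug[k])]
    # Column swaps permuted the unknowns, so undo them on the rows of the result.
    inv = [None] * n
    for j in range(n):
        inv[col_perm[j]] = aug[j][n:]
    return inv


print(invert_complete_pivoting([[4, 7], [2, 6]]))
```

Row swaps are applied to the whole augmented row, while column swaps touch only the left block and are recorded so the rows of the inverse can be put back in the original order at the end.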

One matrix per thread
• consumes too much shared memory
• limits the number of concurrent threads on an SM

One matrix per block
• yields little parallelism across the threads in a block, since the matrix dimension is typically 2×2 to 8×8

Our method: multiple matrices per block
• each thread reads N elements of a single matrix into shared memory
• N threads process one matrix inversion in parallel
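One way to picture the multiple-matrices-per-block scheme is through the index arithmetic that assigns each thread to a matrix. The sketch below models that mapping in Python rather than in a kernel. It assumes each thread owns one row (N elements) of its matrix and that leftover threads stay idle; the poster does not state either detail, and the function name is hypothetical.

```python
def thread_assignment(block_id, thread_id, n, threads_per_block):
    """Map a thread to (global matrix index, row index), or None if the thread is idle.

    Assumption: N consecutive threads share one N x N matrix and each thread
    loads one row (N elements) of that matrix into shared memory.
    """
    matrices_per_block = threads_per_block // n
    local_matrix = thread_id // n          # which matrix inside this block
    if local_matrix >= matrices_per_block:
        return None                        # threads that do not fill a whole matrix
    row = thread_id % n                    # which row of that matrix
    return block_id * matrices_per_block + local_matrix, row


# 4x4 matrices with 128 threads per block: 32 matrices are inverted per block.
print(thread_assignment(block_id=2, thread_id=5, n=4, threads_per_block=128))
```

With this mapping every thread in a block has work whenever the block size is a multiple of N, which is the parallelism that the one-matrix-per-block layout lacks for small matrices.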

Fine-tuning optimization
• overlap kernel execution and data transfer: transfer data between CPU and GPU while a kernel is executing
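The benefit of overlapping can be estimated with an idealized two-stage pipeline model. The sketch below is an assumption-laden illustration, not the authors' measurement method: it lumps all copies into a single transfer stage, splits the work into equal chunks, and ignores launch and synchronization overhead. The inputs are the kernel and data-transfer times of the 10 ms, 5 MHz table row; the function names are hypothetical.

```python
def serial_time(transfer_ms, kernel_ms):
    """Synchronous case: the transfer and the kernel run back to back."""
    return transfer_ms + kernel_ms


def overlapped_time(transfer_ms, kernel_ms, chunks):
    """Idealized two-stage pipeline: the transfer of chunk i+1 overlaps the kernel of chunk i."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    t = transfer_ms / chunks   # transfer time per chunk
    k = kernel_ms / chunks     # kernel time per chunk
    # The first chunk pays both stages; each later chunk adds only the slower stage.
    return t + k + (chunks - 1) * max(t, k)


# Inputs from the 10 ms, 5 MHz row: kernel 1.43 ms, data transfer 1.24 ms.
for chunks in (1, 2, 4, 8):
    print(chunks, round(overlapped_time(1.24, 1.43, chunks), 3))
```

In this model the total approaches the slower of the two stages as the number of chunks grows, so it is a lower bound under the stated assumptions and is not expected to reproduce the measured GPU Async column exactly.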

[Figure: matrix-to-thread mapping within a block. Matrices J_0, J_1, J_2, …, J_{B-1} share one block; each N×N matrix, with elements J_0(0, 0) … J_0(N-1, N-1), is handled by threads 0 … N-1.]

Execution time for CPU and GPU under different circumstances

Compare the throughput

[Figure: two throughput plots, one for 2×2 16QAM 5 MHz MIMO and one for 4×4 64QAM 20 MHz MIMO. Axes: time (ms) and Throughput (Mbps).]
