High Throughput MIMO-OFDM Detection with Graphics Processing Units
Dan Sui1, Peng Wang2, Yunzhou Li1, Yangdong Deng1, Bin Zhou2, Jing Wang1
1 Tsinghua University, Beijing 100084, China; 2 NVIDIA, China
System Model
Conclusions
Performance Results
Optimization Strategy
Contact Information
Dr. Dan Sui
Wireless & Mobile Communication Technology R&D Center
Research Institute of Information Technology (RIIT)
Tsinghua [email protected]
MIMO-OFDM Transceiver Model
Minimum Mean Square Error (MMSE) Detector
Problem Statement
Copy y, H to GPU → MMSE detector → Copy x̂ to CPU
MMSE detector kernels:
Kernel 1: Compute J
Kernel 2: Compute J^{-1}
Kernel 3: Compute x̂
time (ms)  Bandwidth (MHz)  CPU (ms)  GPU Sync (ms)                 GPU Async (ms)
                                      kernel  data transfer  total
10         5                19.94     1.43    1.24           2.67   1.60
10         10               39.92     2.65    1.61           4.26   2.89
10         15               59.79     3.95    2.00           5.95   4.19
10         20               92.44     5.20    2.53           7.73   5.37
20         5                39.92     2.65    1.61           4.26   2.89
20         10               79.80     5.20    2.44           7.64   5.37
20         15               119.23    7.72    3.38           11.10  8.00
20         20               184.68    10.26   4.22           14.48  10.53
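The measurements above can be cross-checked with a little arithmetic. The sketch below uses the rows whose first column is 10 and confirms that the synchronous total equals kernel time plus data-transfer time, then derives how many times faster each GPU variant is than the CPU. The names `rows_10ms` and `speedups` are illustrative and do not come from the poster.

```python
# Rows with time = 10 ms: bandwidth (MHz) -> (CPU, kernel, transfer, sync total, async), in ms.
rows_10ms = {
    5: (19.94, 1.43, 1.24, 2.67, 1.60),
    10: (39.92, 2.65, 1.61, 4.26, 2.89),
    15: (59.79, 3.95, 2.00, 5.95, 4.19),
    20: (92.44, 5.20, 2.53, 7.73, 5.37),
}


def speedups(rows):
    """Return {bandwidth: (CPU / GPU sync total, CPU / GPU async)}."""
    return {
        bw: (cpu / sync, cpu / asyn)
        for bw, (cpu, _kernel, _transfer, sync, asyn) in rows.items()
    }


for bw, (cpu, kernel, transfer, sync, asyn) in rows_10ms.items():
    # The synchronous total should be kernel time plus data-transfer time.
    assert abs(kernel + transfer - sync) < 0.005
    s_sync, s_async = speedups(rows_10ms)[bw]
    print(f"{bw:2d} MHz: sync {s_sync:.1f}x, async {s_async:.1f}x faster than CPU")
```

In these rows the asynchronous time stays close to the kernel time alone, which suggests that most of the transfer cost is hidden behind computation.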
• efficiently implement an MMSE-based MIMO detector
• achieve a throughput over 100 Mbps
• satisfy LTE/LTE-Advanced requirements
Main difficulties
• handle millions of small matrices, e.g. 2×2, 4×4, 8×8
• matrix inversion is the most time-consuming step
[Figure: MIMO-OFDM transceiver. Transmitter: Input Data → Channel Encoder → MIMO Encoder → OFDM Modulators 0 … M-1, transmitting x0(n) … xM-1(n). Receiver: y0(n) … yN-1(n) → OFDM Demodulators 0 … N-1 → MIMO Detection → Channel Decoder → Output Data.]
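$$\hat{\mathbf{x}} = \mathbf{G}_{\mathrm{MMSE}}\,\mathbf{y}$$
$$\mathbf{G}_{\mathrm{MMSE}} = \left(\mathbf{H}^{H}\mathbf{H} + \frac{1}{\rho}\mathbf{I}_{M}\right)^{-1}\mathbf{H}^{H} = \mathbf{J}^{-1}\mathbf{H}^{H}, \qquad \mathbf{J} = \mathbf{H}^{H}\mathbf{H} + \frac{1}{\rho}\mathbf{I}_{M}$$
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}$$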
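As a rough illustration of the MMSE detector, the following pure-Python sketch walks through the three steps for a single subcarrier: form J, invert it, and apply the filter to y. All function names are hypothetical, matrices are plain lists of rows, ρ is assumed to be a positive scalar, and `invert` uses Gauss-Jordan elimination with partial pivoting as a simple stand-in rather than the poster's GPU routine.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def invert(A):
    """Gauss-Jordan inverse with partial pivoting (stand-in for the GPU kernel)."""
    n = len(A)
    aug = [[complex(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))  # largest entry in column k
        aug[k], aug[p] = aug[p], aug[k]
        pivot = aug[k][k]
        aug[k] = [v / pivot for v in aug[k]]
        for r in range(n):
            if r != k:
                f = aug[r][k]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[k])]
    return [row[n:] for row in aug]


def mmse_detect(H, y, rho):
    """Return x_hat = J^{-1} H^H y with J = H^H H + (1/rho) I_M."""
    M = len(H[0])
    Hh = hermitian(H)
    J = matmul(Hh, H)                      # step 1: compute J
    for i in range(M):
        J[i][i] += 1.0 / rho
    J_inv = invert(J)                      # step 2: compute J^{-1}
    G = matmul(J_inv, Hh)
    return [row[0] for row in matmul(G, [[v] for v in y])]  # step 3: compute x_hat


# Noise-free 2x2 example: at high rho the estimate should be close to x.
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 + 0j]]
x = [1 + 1j, -1 + 1j]
y = [sum(H[i][j] * x[j] for j in range(2)) for i in range(2)]
print(mmse_detect(H, y, rho=1e9))
```

A real detector repeats this for every subcarrier and OFDM symbol, which is where the millions of small matrices come from.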
Gaussian elimination with complete pivoting
• permute the largest element (in magnitude) of the current submatrix A(k : N-1, k : N-1) into position (k, k)
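The pivoting rule can be sketched in a sequential, pure-Python form. This is an illustrative Gauss-Jordan inversion, not the poster's GPU kernel: at step k it searches A(k : N-1, k : N-1) for the largest-magnitude entry, swaps it into (k, k) with one row swap and one column swap, and undoes the column swaps at the end by reordering the rows of the result. The function name `invert_complete_pivoting` is hypothetical.

```python
def invert_complete_pivoting(A):
    """Invert a square matrix (list of rows) by Gauss-Jordan with complete pivoting."""
    n = len(A)
    a = [[complex(v) for v in row] for row in A]                # reduced to the identity
    inv = [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]
    col_of = list(range(n))      # col_of[c]: original column now sitting at position c
    for k in range(n):
        # Find the largest-magnitude entry in the submatrix a[k:, k:].
        pr, pc = max(((r, c) for r in range(k, n) for c in range(k, n)),
                     key=lambda rc: abs(a[rc[0]][rc[1]]))
        if abs(a[pr][pc]) == 0:
            raise ValueError("matrix is singular")
        # Row swap: applied to both the working matrix and the accumulated inverse.
        a[k], a[pr] = a[pr], a[k]
        inv[k], inv[pr] = inv[pr], inv[k]
        # Column swap: applied to the working matrix only, and recorded.
        for row in a:
            row[k], row[pc] = row[pc], row[k]
        col_of[k], col_of[pc] = col_of[pc], col_of[k]
        # Normalise the pivot row, then eliminate column k from every other row.
        p = a[k][k]
        a[k] = [v / p for v in a[k]]
        inv[k] = [v / p for v in inv[k]]
        for r in range(n):
            if r != k:
                f = a[r][k]
                a[r] = [u - f * v for u, v in zip(a[r], a[k])]
                inv[r] = [u - f * v for u, v in zip(inv[r], inv[k])]
    # inv now holds (A Q)^{-1}; reorder its rows to recover A^{-1} = Q (A Q)^{-1}.
    result = [None] * n
    for c in range(n):
        result[col_of[c]] = inv[c]
    return result


print(invert_complete_pivoting([[1, 2], [3, 4]]))
```

Complete pivoting costs a full submatrix search per step, but for 2×2 to 8×8 matrices that search is small.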
One matrix per thread
• consumes too much shared memory
• limits the number of concurrent threads on an SM
One matrix per block
• gives little parallelism across the threads in a block, since the matrix dimension is typically 2×2 to 8×8
Our method: multiple matrices per block
• each thread reads N elements of a single matrix into shared memory
• N threads process a matrix inversion in parallel
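One way to picture the multiple-matrices-per-block mapping is as index arithmetic, sketched here in Python rather than in a GPU kernel. The details are assumptions made for illustration: matrices are stored back to back in a flat row-major array, each group of N consecutive threads owns one matrix, and each thread loads one row. The function name `thread_assignment` is hypothetical.

```python
def thread_assignment(block_id, thread_id, n, threads_per_block):
    """Map a thread to (global matrix index, lane, flat offsets of the N elements it loads).

    Assumes a flat row-major layout and one row per thread; returns None for
    leftover threads when threads_per_block is not a multiple of n.
    """
    matrices_per_block = threads_per_block // n
    local_matrix = thread_id // n          # which matrix inside the block
    lane = thread_id % n                   # which of the N cooperating threads
    if local_matrix >= matrices_per_block:
        return None
    global_matrix = block_id * matrices_per_block + local_matrix
    base = global_matrix * n * n + lane * n
    return global_matrix, lane, list(range(base, base + n))


# 4x4 matrices with 128 threads per block: 32 matrices share one block.
print(thread_assignment(block_id=0, thread_id=5, n=4, threads_per_block=128))
```

With this layout every thread in a block has work during the load, which is the imbalance the one-matrix-per-block scheme suffers from.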
Fine-tuning Optimization
• overlap kernel execution and data transfer
• transfer data between CPU and GPU while a kernel is executing
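The benefit of overlapping can be estimated with an idealised two-stage pipeline model. This is only a sketch under strong assumptions, not the poster's implementation: the work is split into equal chunks, all transfers are lumped into a single stage, and the transfer of one chunk fully overlaps the kernel of the previous one. The function name `pipelined_time` and the chunk count are hypothetical.

```python
def pipelined_time(kernel_ms, transfer_ms, chunks):
    """Total time of an ideal two-stage pipeline with equally sized chunks."""
    k = kernel_ms / chunks      # kernel time per chunk
    t = transfer_ms / chunks    # transfer time per chunk
    # The first transfer and the last kernel cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return t + (chunks - 1) * max(k, t) + k


# Kernel and transfer times taken from the time = 10 ms, 5 MHz measurement.
for chunks in (1, 2, 4, 8):
    print(chunks, round(pipelined_time(1.43, 1.24, chunks), 3))
```

With one chunk the model reduces to the synchronous total of 2.67 ms, and with many chunks it approaches the larger of the two stage times. The measured asynchronous time for that case, 1.60 ms, lies between those two bounds.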
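$$\mathbf{J}_0 = \begin{bmatrix} J_0(0,0) & J_0(0,1) & \cdots & J_0(0,N-1) \\ J_0(1,0) & J_0(1,1) & \cdots & J_0(1,N-1) \\ \vdots & \vdots & \ddots & \vdots \\ J_0(N-1,0) & J_0(N-1,1) & \cdots & J_0(N-1,N-1) \end{bmatrix}$$
[Figure: one block holds the matrices J0, J1, J2, …, JB-1; thread 0, thread 1, …, thread N-1 are assigned to a matrix.]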
Execution time for CPU and GPU under different circumstances
Compare the throughput
[Figure: two panels, "2×2 16QAM 5MHz MIMO" and "4×4 64QAM 20MHz MIMO"; axes: time (ms) and Throughput (Mbps).]