Improving Memory System Performance for Soft Vector Processors
Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
Slide 2: Soft Processors in FPGA Systems

[Diagram: a soft processor programmed with C + compiler alongside custom logic designed with HDL + CAD; the trade-off is between easier design and faster, smaller, lower-power hardware]
Data-level parallelism → soft vector processors
Configurable – how can we make use of this?
Slide 3: Vector Processing Primer

    // C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];

    // Vectorized code
    set    vl, 16
    vload  vr0, b
    vload  vr1, a
    vadd   vr0, vr0, vr1
    vstore vr0, b

Each vector instruction holds many units of independent operations.
[Figure: with 1 vector lane, the single vadd performs b[0]+=a[0] through b[15]+=a[15] one element at a time]
Slide 4: Vector Processing Primer (continued)

Same C and vectorized code as the previous slide.

[Figure: with 16 vector lanes, all sixteen element operations b[0]+=a[0] through b[15]+=a[15] execute in parallel within the single vadd, giving a 16x speedup]
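To make the lane picture concrete, here is a minimal C sketch (not VESPA hardware; the element-to-lane mapping is an assumption chosen for illustration) of how a configurable number of lanes splits the sixteen element operations of the vadd above:

    #include <stdio.h>

    #define VL 16          /* vector length set by "set vl,16" */
    #define NUM_LANES 4    /* configurable lane count (1, 2, 4, 8, 16, ...) */

    int main(void) {
        int a[VL], b[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100 + i; }

        /* One "step" of the vadd per iteration of the outer loop:
           each lane handles one element, so VL/NUM_LANES passes are needed. */
        for (int pass = 0; pass < VL / NUM_LANES; pass++) {
            for (int lane = 0; lane < NUM_LANES; lane++) {
                int elem = pass * NUM_LANES + lane;
                b[elem] += a[elem];      /* work done by this lane */
            }
        }

        for (int i = 0; i < VL; i++) printf("b[%d] = %d\n", i, b[i]);
        return 0;
    }

With NUM_LANES set to 16 the outer loop collapses to a single pass, matching the 16x speedup in the figure; with 1 lane it takes 16 passes.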
Slide 5: Sub-Linear Scalability

[Chart: cycle performance relative to 1 lane for autcor, conven, ip_checksum, imgblend, and the geometric mean (GMEAN), with 1, 2, 4, 8, and 16 lanes; labeled 16-lane speedups are 3.1x, 4.7x, 5.2x, 6.0x, and 8.0x]
Vector lanes not being fully utilized
Slide 6: Where Are The Cycles Spent? (16 lanes)

[Chart: fraction of total cycles spent on memory unit stalls and on cache misses for autcor, conven, ip_checksum, imgblend, and their average; memory unit stall cycles account for 67% of total cycles on average]

Two thirds of all cycles are spent waiting on the memory unit, often because of cache misses.
Slide 7: Our Goals

1. Improve the memory system
   - Better cache design
   - Hardware prefetching
2. Evaluate improvements for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR-133MHz)
Slide 8: Current Infrastructure

[Diagram of the software/hardware flow: EEMBC C benchmarks are compiled with GCC and linked (ld) together with vectorized assembly subroutines assembled by GNU as with vector support, producing an ELF binary. On the software side the binary runs on the MINT instruction set simulator (scalar μP + VPU). On the hardware side the Verilog design is simulated in Modelsim (RTL simulator), which reports cycle counts, and synthesized with Altera Quartus II v8.0, which reports area and frequency. The two sides are cross-verified against each other.]
Slide 9: VESPA Architecture Design

[Block diagram: a 3-stage scalar pipeline (decode, register file, ALU, MUX, writeback) and a 3-stage vector control pipeline (VCRF/VSRF, logic, VCWB/VSWB) feed a 6-stage vector pipeline; the vector pipeline has decode/replicate and hazard-check stages followed by replicated lanes, each with a vector register file (VRRF), ALU, multiplier with saturation, shifter, saturation logic, and writeback (VRWB), plus a memory unit. Instruction cache and data cache sit in front, and the Dcache is shared between the scalar and vector units.]

Supports integer and fixed-point operations, and predication. 32-bit datapaths. Shared Dcache.
Slide 10: Memory System Design

[Diagram: the scalar processor and the 16-lane vector coprocessor (VESPA, 16 lanes) share a 4KB data cache with 16-byte lines through the vector memory crossbar; the Dcache connects to off-chip DDR with a 9-cycle access latency. Example access: vld.w, a load of 16 contiguous 32-bit words.]
Slide 11: Memory System Design (improved)

[Diagram: the same system, but the Dcache is enlarged 4x to 16KB and the cache line is widened 4x to 64 bytes; the same vld.w (16 contiguous 32-bit words) now touches far fewer cache lines]

Result: reduced cache accesses, plus some prefetching effect from the wider line.
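A quick back-of-the-envelope check of why the wider line helps, as a small sketch using the numbers on this slide (the aligned, contiguous access pattern is the assumption):

    #include <stdio.h>

    int main(void) {
        const int elems = 16;                 /* vld.w: 16 contiguous words */
        const int elem_bytes = 4;             /* 32-bit words */
        const int bytes = elems * elem_bytes; /* 64 bytes per vld.w */

        const int line_sizes[] = {16, 64};    /* old vs. new cache line (bytes) */
        for (int i = 0; i < 2; i++) {
            int lines = (bytes + line_sizes[i] - 1) / line_sizes[i];
            printf("%2dB line: an aligned vld.w touches %d cache line(s)\n",
                   line_sizes[i], lines);
        }
        /* 16B line -> 4 lines (up to 4 misses, each paying the ~9-cycle DDR
           access); 64B line -> 1 line, so at most one miss per vld.w. */
        return 0;
    }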
Slide 12: Improving Cache Design

- Vary the cache depth and cache line size using a parameterized design:
  - Cache line size: 16, 32, 64, 128 bytes
  - Cache depth: 4, 8, 16, 32, 64 KB
- Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
- Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
(A small sketch of the resulting design-space sweep follows.)
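For concreteness, a tiny sketch that enumerates the swept configurations (the benchmark runs and equivalent-LE area accounting are done by the real flow and are not modeled here):

    #include <stdio.h>

    int main(void) {
        const int line_bytes[]   = {16, 32, 64, 128};   /* cache line sizes */
        const int depth_kbytes[] = {4, 8, 16, 32, 64};  /* cache depths     */

        for (int d = 0; d < 5; d++)
            for (int l = 0; l < 4; l++)
                printf("config: %2dKB cache, %3dB line -> %4d lines\n",
                       depth_kbytes[d], line_bytes[l],
                       depth_kbytes[d] * 1024 / line_bytes[l]);
        /* Each of these 20 configurations is synthesized and run on all
           benchmarks; cycle counts and equivalent-LE area are recorded. */
        return 0;
    }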
Slide 13: Cache Design Space – Performance (Wall Clock Time)

[Chart: wall-clock speedup versus the original 4KB, 16B-line cache, for cache depths of 4KB to 64KB and line sizes of 16B to 128B; labeled points include 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, and 1.93, with clock frequencies of 122-129MHz on the larger configurations]

- The best cache design almost doubles the performance of the original VESPA.
- More pipelining/retiming could reduce the clock frequency penalty.
- Cache line size matters more than cache depth (lots of streaming).
Slide 14: Cache Design Space – Area

[Chart: system area relative to the 4KB, 16B-line cache, for depths of 4KB to 64KB and line sizes of 16B to 128B, with the storage built from M4K and MRAM block RAMs]

[Diagram: a 64B (512-bit) cache line is striped across M4K block RAMs, each used 16 bits wide (4096 bits total); 512/16 = 32 M4Ks are needed, and filling all 32 gives 16KB of storage]

System area is almost doubled in the worst case.
Slide 15: Cache Design Space – Area (design guidelines)

[Same area chart as the previous slide, annotated with two guidelines]

a) Choose the cache depth that fills the block RAMs already needed for the line size (see the sketch below).
b) Don't use MRAMs: they are big, few, and overkill.
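A small sketch of guideline (a), using the M4K numbers from the previous slide (the M4K is treated as a 4096-bit block in a 16-bit-wide configuration; this is illustrative arithmetic, not the actual CAD-flow calculation):

    #include <stdio.h>

    /* Stratix M4K block RAM: 4096 bits, used here 16 bits wide. */
    #define M4K_BITS   4096
    #define M4K_WIDTH  16

    int main(void) {
        int line_bytes = 64;                    /* chosen cache line size  */
        int line_bits  = line_bytes * 8;        /* 64B = 512 bits          */

        int m4ks  = line_bits / M4K_WIDTH;      /* 512/16 = 32 M4Ks needed */
        int depth = M4K_BITS / M4K_WIDTH;       /* 4096/16 = 256 lines     */
        int bytes = m4ks * (M4K_BITS / 8);      /* 32 * 512B = 16KB        */

        printf("%dB line -> %d M4Ks -> natural depth %d lines -> %dKB cache\n",
               line_bytes, m4ks, depth, bytes / 1024);
        /* Any shallower cache wastes block RAM already committed to line
           width, which is why a 64B line pairs naturally with a 16KB Dcache. */
        return 0;
    }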
Slide 16: Hardware Prefetching Example

[Diagram, no prefetching: two successive vld.w instructions each miss in the Dcache, and each miss pays the 9-cycle DDR penalty]

[Diagram, prefetching 3 blocks: the first vld.w misses and pays the 9-cycle penalty, but the miss also prefetches the next 3 cache lines from DDR, so the following vld.w hits]
Slide 17: Hardware Data Prefetching

- Advantages: little area overhead; memory fetching is parallelized with computation; the full memory bandwidth is used.
- Disadvantage: cache pollution.
- We use sequential prefetching triggered on: a) any miss, or b) a miss from a sequential vector memory instruction.
- We measure performance/area using a 64B-line, 16KB Dcache. (A sketch of the trigger logic follows.)
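A minimal sketch of the sequential prefetcher described above (an illustrative C model with a toy direct-mapped cache, not the Verilog implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64     /* 64B cache line, as evaluated on this slide */
    #define NUM_LINES  256    /* 16KB / 64B */

    /* Toy direct-mapped cache model (tags only), for illustration. */
    static uint32_t tags[NUM_LINES];
    static bool     valid[NUM_LINES];

    static bool lookup(uint32_t line) {
        return valid[line % NUM_LINES] && tags[line % NUM_LINES] == line;
    }
    static void fill(uint32_t line) {          /* fetch line from DDR */
        valid[line % NUM_LINES] = true;
        tags[line % NUM_LINES]  = line;
    }

    /* Sequential prefetch of K lines, triggered on any miss or only on misses
       from sequential vector memory instructions (the two policies compared). */
    static void access_with_prefetch(uint32_t addr, bool is_seq_vector_instr,
                                     int K, bool trigger_any_miss) {
        uint32_t line = addr / LINE_BYTES;
        if (lookup(line)) return;              /* hit: nothing to do */
        fill(line);                            /* demand miss */
        if (trigger_any_miss || is_seq_vector_instr)
            for (int i = 1; i <= K; i++)       /* prefetch the next K lines */
                if (!lookup(line + i)) fill(line + i);
    }

    int main(void) {
        access_with_prefetch(0x1000, true, 3, false);  /* vld.w miss + prefetch 3 */
        printf("next line cached: %d\n", lookup(0x1000 / LINE_BYTES + 1));
        return 0;
    }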
Slide 18: Prefetching K Blocks – Any Miss

[Chart: speedup versus no prefetching as the number of prefetched cache lines grows (0, 1, 3, 7, 15, 31, 63), for autcor, conven, viterb, fbital, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN; peak average speedup is 28%, the best benchmark reaches 2.2x, and several benchmarks are not receptive]

Only half the benchmarks are significantly sped up: a maximum of 2.2x, average of 28%.
Slide 19: Prefetching Area Cost – Writeback Buffer

- Prefetched lines can evict dirty lines; two options: deny the prefetch, or buffer all evicted dirty lines.
- The area cost is small: 1.6% of system area, mostly block RAMs, little logic.
- No clock frequency impact.

[Diagram: while prefetching 3 blocks on a vld.w miss (9-cycle penalty), dirty lines displaced from the Dcache are held in a writeback (WB) buffer before being written back to DDR]
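A minimal sketch of the "buffer all dirty lines" option (an illustrative C model; the buffer depth and structure are assumptions, not the actual hardware):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4    /* assumed writeback-buffer depth, for illustration */

    typedef struct { uint32_t line_addr; bool full; } WbEntry;
    static WbEntry wb[WB_ENTRIES];

    /* Try to queue a dirty line evicted by a prefetch.  Returns false when the
       buffer is full, in which case the prefetch would instead be denied. */
    static bool wb_push(uint32_t dirty_line) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (!wb[i].full) { wb[i] = (WbEntry){dirty_line, true}; return true; }
        return false;
    }

    /* Drain one buffered line to DDR when the memory bus is idle. */
    static void wb_drain(void) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].full) { /* issue DDR write of wb[i].line_addr */ wb[i].full = false; return; }
    }

    int main(void) {
        /* A prefetch wants to evict dirty line 0x40: buffer it rather than deny. */
        printf("buffered: %d\n", wb_push(0x40));
        wb_drain();
        return 0;
    }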
Slide 20: Any Miss vs Sequential Vector Miss

[Chart: average speedup versus number of cache lines prefetched (0 to 63) for the two trigger policies, "any cache miss" and "sequential vector miss only"; the two curves are essentially collinear]

The curves are collinear because nearly all misses in our benchmarks come from sequential vector memory instructions.
Slide 21: Vector Length Prefetching

- Previously: a constant number of cache lines prefetched. Now: prefetch a multiple of the current vector length.
- Applied only to sequential vector memory instructions, e.g. a vector load of 32 elements.
- Guarantees at most 1 miss per vector memory instruction.

[Diagram: a vld.w covering elements 0 through 31 fetches the missing line and prefetches enough lines to cover the rest of the vector, scaled by the multiplier k ("fetch + prefetch 28*k")]
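A sketch of how the prefetch amount can be derived from the vector length (illustrative arithmetic using this slide's example; the exact hardware calculation is not shown on the slide):

    #include <stdio.h>

    #define LINE_BYTES 64   /* 64B cache line */
    #define ELEM_BYTES 4    /* vld.w: 32-bit elements */

    /* Lines to fetch/prefetch for a sequential vector load of vl elements
       starting at addr, with prefetch multiplier k (1*VL, 2*VL, ...). */
    static int lines_for_vector_load(unsigned addr, int vl, int k) {
        unsigned first_line = addr / LINE_BYTES;
        unsigned last_line  = (addr + (unsigned)(k * vl) * ELEM_BYTES - 1) / LINE_BYTES;
        return (int)(last_line - first_line + 1);
    }

    int main(void) {
        /* Vector load of 32 words: fetching this many lines on a miss
           guarantees no further miss for this vector memory instruction. */
        printf("1*VL: %d lines\n", lines_for_vector_load(0, 32, 1));  /*  2 lines */
        printf("8*VL: %d lines\n", lines_for_vector_load(0, 32, 8));  /* 16 lines */
        return 0;
    }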
Slide 22: Vector Length Prefetching – Performance

[Chart: speedup versus amount of prefetching (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN; 1*VL already gains 21% on average with no cache pollution, the peak average speedup of 29% comes at 8*VL, the best benchmark reaches 2.2x, and some benchmarks are not receptive]

1*VL prefetching provides good speedup without tuning; 8*VL is best.
Slide 23: Overall Memory System Performance

[Chart: fraction of total cycles spent on memory unit stalls and on cache misses for three configurations: the original 16-byte-line (4KB) cache, the 64-byte-line (16KB) cache, and the 64-byte-line cache with prefetching; memory unit stall cycles fall from 67% to 48% to 31%, and miss cycles fall to 4% in the final configuration (one remaining label, 15, appears to belong to the middle configuration)]

- The wider line plus prefetching reduces memory unit stall cycles significantly (to 31%).
- The wider line plus prefetching eliminates all but 4% of miss cycles.
Slide 24: Improved Scalability

[Chart: cycle performance relative to 1 lane for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN, with 1, 2, 4, 8, and 16 lanes]

Previously: a 3-8x range, average of 5x for 16 lanes. Now: a 6-13x range, average of 10x for 16 lanes.
Slide 25: Summary

- Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line to 64B and the depth to 16KB.
- Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15. The vector length prefetcher gains 21% on average for 1*VL, works well for mixed workloads with no tuning and no cache pollution, and peaks at 8*VL with an average speedup of 29%.
- Overall, improved the VESPA memory system and scalability: miss cycles reduced to 4%, memory unit stall cycles reduced to 31%.