Improving Memory System Performance for Soft Vector Processors
Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
Slide 2: Soft Processors in FPGA Systems

[Diagram: a soft processor programmed with C + compiler alongside custom logic designed with HDL + CAD; the trade-off is between easier design and faster, smaller, lower-power hardware]
Data-level parallelism → soft vector processors
Configurable – how can we make use of this?
Slide 3: Vector Processing Primer

    // C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];

    // Vectorized code
    set    vl, 16
    vload  vr0, b
    vload  vr1, a
    vadd   vr0, vr0, vr1
    vstore vr0, b

Each vector instruction holds many units of independent operations.
[Figure: with 1 vector lane, the single vadd performs b[0]+=a[0] through b[15]+=a[15] one element at a time]
Slide 4: Vector Processing Primer (continued)

Same C and vectorized code as the previous slide.

[Figure: with 16 vector lanes, all sixteen element operations b[0]+=a[0] through b[15]+=a[15] execute in parallel within the single vadd, giving a 16x speedup]
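To make the lane picture concrete, here is a minimal C sketch (not VESPA hardware; the element-to-lane mapping is an assumption chosen for illustration) of how a configurable number of lanes splits the sixteen element operations of the vadd above:

    #include <stdio.h>

    #define VL 16          /* vector length set by "set vl,16" */
    #define NUM_LANES 4    /* configurable lane count (1, 2, 4, 8, 16, ...) */

    int main(void) {
        int a[VL], b[VL];
        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 100 + i; }

        /* One "step" of the vadd per iteration of the outer loop:
           each lane handles one element, so VL/NUM_LANES passes are needed. */
        for (int pass = 0; pass < VL / NUM_LANES; pass++) {
            for (int lane = 0; lane < NUM_LANES; lane++) {
                int elem = pass * NUM_LANES + lane;
                b[elem] += a[elem];      /* work done by this lane */
            }
        }

        for (int i = 0; i < VL; i++) printf("b[%d] = %d\n", i, b[i]);
        return 0;
    }

With NUM_LANES set to 16 the outer loop collapses to a single pass, matching the 16x speedup in the figure; with 1 lane it takes 16 passes.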
Slide 5: Sub-Linear Scalability

[Chart: cycle performance relative to 1 lane for autcor, conven, ip_checksum, imgblend, and the geometric mean (GMEAN), with 1, 2, 4, 8, and 16 lanes; labeled 16-lane speedups are 3.1x, 4.7x, 5.2x, 6.0x, and 8.0x]
Vector lanes not being fully utilized
Slide 6: Where Are The Cycles Spent? (16 lanes)

[Chart: fraction of total cycles spent on memory unit stalls and on cache misses for autcor, conven, ip_checksum, imgblend, and their average; memory unit stall cycles account for 67% of total cycles on average]

Two thirds of all cycles are spent waiting on the memory unit, often because of cache misses.
Slide 7: Our Goals

1. Improve the memory system
   - Better cache design
   - Hardware prefetching
2. Evaluate improvements for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR-133MHz)
Slide 8: Current Infrastructure

[Diagram of the software/hardware flow: EEMBC C benchmarks are compiled with GCC and linked (ld) together with vectorized assembly subroutines assembled by GNU as with vector support, producing an ELF binary. On the software side the binary runs on the MINT instruction set simulator (scalar μP + VPU). On the hardware side the Verilog design is simulated in Modelsim (RTL simulator), which reports cycle counts, and synthesized with Altera Quartus II v8.0, which reports area and frequency. The two sides are cross-verified against each other.]
Slide 9: VESPA Architecture Design

[Block diagram: a 3-stage scalar pipeline (decode, register file, ALU, MUX, writeback) and a 3-stage vector control pipeline (VCRF/VSRF, logic, VCWB/VSWB) feed a 6-stage vector pipeline; the vector pipeline has decode/replicate and hazard-check stages followed by replicated lanes, each with a vector register file (VRRF), ALU, multiplier with saturation, shifter, saturation logic, and writeback (VRWB), plus a memory unit. Instruction cache and data cache sit in front, and the Dcache is shared between the scalar and vector units.]

Supports integer and fixed-point operations, and predication. 32-bit datapaths. Shared Dcache.
Slide 10: Memory System Design

[Diagram: the scalar processor and the 16-lane vector coprocessor (VESPA, 16 lanes) share a 4KB data cache with 16-byte lines through the vector memory crossbar; the Dcache connects to off-chip DDR with a 9-cycle access latency. Example access: vld.w, a load of 16 contiguous 32-bit words.]
Slide 11: Memory System Design (improved)

[Diagram: the same system, but the Dcache is enlarged 4x to 16KB and the cache line is widened 4x to 64 bytes; the same vld.w (16 contiguous 32-bit words) now touches far fewer cache lines]

Result: reduced cache accesses, plus some prefetching effect from the wider line.
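A quick back-of-the-envelope check of why the wider line helps, as a small sketch using the numbers on this slide (the aligned, contiguous access pattern is the assumption):

    #include <stdio.h>

    int main(void) {
        const int elems = 16;                 /* vld.w: 16 contiguous words */
        const int elem_bytes = 4;             /* 32-bit words */
        const int bytes = elems * elem_bytes; /* 64 bytes per vld.w */

        const int line_sizes[] = {16, 64};    /* old vs. new cache line (bytes) */
        for (int i = 0; i < 2; i++) {
            int lines = (bytes + line_sizes[i] - 1) / line_sizes[i];
            printf("%2dB line: an aligned vld.w touches %d cache line(s)\n",
                   line_sizes[i], lines);
        }
        /* 16B line -> 4 lines (up to 4 misses, each paying the ~9-cycle DDR
           access); 64B line -> 1 line, so at most one miss per vld.w. */
        return 0;
    }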
Slide 12: Improving Cache Design

- Vary the cache depth and cache line size using a parameterized design:
  - Cache line size: 16, 32, 64, 128 bytes
  - Cache depth: 4, 8, 16, 32, 64 KB
- Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware.
- Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs.
(A small sketch of the resulting design-space sweep follows.)
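For concreteness, a tiny sketch that enumerates the swept configurations (the benchmark runs and equivalent-LE area accounting are done by the real flow and are not modeled here):

    #include <stdio.h>

    int main(void) {
        const int line_bytes[]   = {16, 32, 64, 128};   /* cache line sizes */
        const int depth_kbytes[] = {4, 8, 16, 32, 64};  /* cache depths     */

        for (int d = 0; d < 5; d++)
            for (int l = 0; l < 4; l++)
                printf("config: %2dKB cache, %3dB line -> %4d lines\n",
                       depth_kbytes[d], line_bytes[l],
                       depth_kbytes[d] * 1024 / line_bytes[l]);
        /* Each of these 20 configurations is synthesized and run on all
           benchmarks; cycle counts and equivalent-LE area are recorded. */
        return 0;
    }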
Slide 13: Cache Design Space – Performance (Wall Clock Time)

[Chart: wall-clock speedup versus the original 4KB, 16B-line cache, for cache depths of 4KB to 64KB and line sizes of 16B to 128B; labeled points include 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, and 1.93, with clock frequencies of 122-129MHz on the larger configurations]

- The best cache design almost doubles the performance of the original VESPA.
- More pipelining/retiming could reduce the clock frequency penalty.
- Cache line size matters more than cache depth (lots of streaming).
Slide 14: Cache Design Space – Area

[Chart: system area relative to the 4KB, 16B-line cache, for depths of 4KB to 64KB and line sizes of 16B to 128B, with the storage built from M4K and MRAM block RAMs]

[Diagram: a 64B (512-bit) cache line is striped across M4K block RAMs, each used 16 bits wide (4096 bits total); 512/16 = 32 M4Ks are needed, and filling all 32 gives 16KB of storage]

System area is almost doubled in the worst case.
Slide 15: Cache Design Space – Area (design guidelines)

[Same area chart as the previous slide, annotated with two guidelines]

a) Choose the cache depth that fills the block RAMs already needed for the line size (see the sketch below).
b) Don't use MRAMs: they are big, few, and overkill.
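A small sketch of guideline (a), using the M4K numbers from the previous slide (the M4K is treated as a 4096-bit block in a 16-bit-wide configuration; this is illustrative arithmetic, not the actual CAD-flow calculation):

    #include <stdio.h>

    /* Stratix M4K block RAM: 4096 bits, used here 16 bits wide. */
    #define M4K_BITS   4096
    #define M4K_WIDTH  16

    int main(void) {
        int line_bytes = 64;                    /* chosen cache line size  */
        int line_bits  = line_bytes * 8;        /* 64B = 512 bits          */

        int m4ks  = line_bits / M4K_WIDTH;      /* 512/16 = 32 M4Ks needed */
        int depth = M4K_BITS / M4K_WIDTH;       /* 4096/16 = 256 lines     */
        int bytes = m4ks * (M4K_BITS / 8);      /* 32 * 512B = 16KB        */

        printf("%dB line -> %d M4Ks -> natural depth %d lines -> %dKB cache\n",
               line_bytes, m4ks, depth, bytes / 1024);
        /* Any shallower cache wastes block RAM already committed to line
           width, which is why a 64B line pairs naturally with a 16KB Dcache. */
        return 0;
    }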
Slide 16: Hardware Prefetching Example

[Diagram, no prefetching: two successive vld.w instructions each miss in the Dcache, and each miss pays the 9-cycle DDR penalty]

[Diagram, prefetching 3 blocks: the first vld.w misses and pays the 9-cycle penalty, but the miss also prefetches the next 3 cache lines from DDR, so the following vld.w hits]
Slide 17: Hardware Data Prefetching

- Advantages: little area overhead; memory fetching is parallelized with computation; the full memory bandwidth is used.
- Disadvantage: cache pollution.
- We use sequential prefetching triggered on: a) any miss, or b) a miss from a sequential vector memory instruction.
- We measure performance/area using a 64B-line, 16KB Dcache. (A sketch of the trigger logic follows.)
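A minimal sketch of the sequential prefetcher described above (an illustrative C model with a toy direct-mapped cache, not the Verilog implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64     /* 64B cache line, as evaluated on this slide */
    #define NUM_LINES  256    /* 16KB / 64B */

    /* Toy direct-mapped cache model (tags only), for illustration. */
    static uint32_t tags[NUM_LINES];
    static bool     valid[NUM_LINES];

    static bool lookup(uint32_t line) {
        return valid[line % NUM_LINES] && tags[line % NUM_LINES] == line;
    }
    static void fill(uint32_t line) {          /* fetch line from DDR */
        valid[line % NUM_LINES] = true;
        tags[line % NUM_LINES]  = line;
    }

    /* Sequential prefetch of K lines, triggered on any miss or only on misses
       from sequential vector memory instructions (the two policies compared). */
    static void access_with_prefetch(uint32_t addr, bool is_seq_vector_instr,
                                     int K, bool trigger_any_miss) {
        uint32_t line = addr / LINE_BYTES;
        if (lookup(line)) return;              /* hit: nothing to do */
        fill(line);                            /* demand miss */
        if (trigger_any_miss || is_seq_vector_instr)
            for (int i = 1; i <= K; i++)       /* prefetch the next K lines */
                if (!lookup(line + i)) fill(line + i);
    }

    int main(void) {
        access_with_prefetch(0x1000, true, 3, false);  /* vld.w miss + prefetch 3 */
        printf("next line cached: %d\n", lookup(0x1000 / LINE_BYTES + 1));
        return 0;
    }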
Slide 18: Prefetching K Blocks – Any Miss

[Chart: speedup versus no prefetching as the number of prefetched cache lines grows (0, 1, 3, 7, 15, 31, 63), for autcor, conven, viterb, fbital, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN; peak average speedup is 28%, the best benchmark reaches 2.2x, and several benchmarks are not receptive]

Only half the benchmarks are significantly sped up: a maximum of 2.2x, average of 28%.
Slide 19: Prefetching Area Cost – Writeback Buffer

- Prefetched lines can evict dirty lines; two options: deny the prefetch, or buffer all evicted dirty lines.
- The area cost is small: 1.6% of system area, mostly block RAMs, little logic.
- No clock frequency impact.

[Diagram: while prefetching 3 blocks on a vld.w miss (9-cycle penalty), dirty lines displaced from the Dcache are held in a writeback (WB) buffer before being written back to DDR]
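A minimal sketch of the "buffer all dirty lines" option (an illustrative C model; the buffer depth and structure are assumptions, not the actual hardware):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4    /* assumed writeback-buffer depth, for illustration */

    typedef struct { uint32_t line_addr; bool full; } WbEntry;
    static WbEntry wb[WB_ENTRIES];

    /* Try to queue a dirty line evicted by a prefetch.  Returns false when the
       buffer is full, in which case the prefetch would instead be denied. */
    static bool wb_push(uint32_t dirty_line) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (!wb[i].full) { wb[i] = (WbEntry){dirty_line, true}; return true; }
        return false;
    }

    /* Drain one buffered line to DDR when the memory bus is idle. */
    static void wb_drain(void) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].full) { /* issue DDR write of wb[i].line_addr */ wb[i].full = false; return; }
    }

    int main(void) {
        /* A prefetch wants to evict dirty line 0x40: buffer it rather than deny. */
        printf("buffered: %d\n", wb_push(0x40));
        wb_drain();
        return 0;
    }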
Slide 20: Any Miss vs Sequential Vector Miss

[Chart: average speedup versus number of cache lines prefetched (0 to 63) for the two trigger policies, "any cache miss" and "sequential vector miss only"; the two curves are essentially collinear]

The curves are collinear because nearly all misses in our benchmarks come from sequential vector memory instructions.
Slide 21: Vector Length Prefetching

- Previously: a constant number of cache lines prefetched. Now: prefetch a multiple of the current vector length.
- Applied only to sequential vector memory instructions, e.g. a vector load of 32 elements.
- Guarantees at most 1 miss per vector memory instruction.

[Diagram: a vld.w covering elements 0 through 31 fetches the missing line and prefetches enough lines to cover the rest of the vector, scaled by the multiplier k ("fetch + prefetch 28*k")]
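A sketch of how the prefetch amount can be derived from the vector length (illustrative arithmetic using this slide's example; the exact hardware calculation is not shown on the slide):

    #include <stdio.h>

    #define LINE_BYTES 64   /* 64B cache line */
    #define ELEM_BYTES 4    /* vld.w: 32-bit elements */

    /* Lines to fetch/prefetch for a sequential vector load of vl elements
       starting at addr, with prefetch multiplier k (1*VL, 2*VL, ...). */
    static int lines_for_vector_load(unsigned addr, int vl, int k) {
        unsigned first_line = addr / LINE_BYTES;
        unsigned last_line  = (addr + (unsigned)(k * vl) * ELEM_BYTES - 1) / LINE_BYTES;
        return (int)(last_line - first_line + 1);
    }

    int main(void) {
        /* Vector load of 32 words: fetching this many lines on a miss
           guarantees no further miss for this vector memory instruction. */
        printf("1*VL: %d lines\n", lines_for_vector_load(0, 32, 1));  /*  2 lines */
        printf("8*VL: %d lines\n", lines_for_vector_load(0, 32, 8));  /* 16 lines */
        return 0;
    }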
Slide 22: Vector Length Prefetching – Performance

[Chart: speedup versus amount of prefetching (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN; 1*VL already gains 21% on average with no cache pollution, the peak average speedup of 29% comes at 8*VL, the best benchmark reaches 2.2x, and some benchmarks are not receptive]

1*VL prefetching provides good speedup without tuning; 8*VL is best.
Slide 23: Overall Memory System Performance

[Chart: fraction of total cycles spent on memory unit stalls and on cache misses for three configurations: the original 16-byte-line (4KB) cache, the 64-byte-line (16KB) cache, and the 64-byte-line cache with prefetching; memory unit stall cycles fall from 67% to 48% to 31%, and miss cycles fall to 4% in the final configuration (one remaining label, 15, appears to belong to the middle configuration)]

- The wider line plus prefetching reduces memory unit stall cycles significantly (to 31%).
- The wider line plus prefetching eliminates all but 4% of miss cycles.
Slide 24: Improved Scalability

[Chart: cycle performance relative to 1 lane for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN, with 1, 2, 4, 8, and 16 lanes]

Previously: a 3-8x range, average of 5x for 16 lanes. Now: a 6-13x range, average of 10x for 16 lanes.
Slide 25: Summary

- Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line to 64B and the depth to 16KB.
- Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15. The vector length prefetcher gains 21% on average for 1*VL, works well for mixed workloads with no tuning and no cache pollution, and peaks at 8*VL with an average speedup of 29%.
- Overall, improved the VESPA memory system and scalability: miss cycles reduced to 4%, memory unit stall cycles reduced to 31%.