Post on 14-Dec-2015
Achieving over 50% system speedup with custom instructions and multi-threading.
Kaiming Ho
Fraunhofer IIS
kaiming.ho@iis.fraunhofer.de
June 3rd, 2014

Overview
• Introduction and system description
• Motivation for work
• Optimization approach
– using user-defined instructions (UDI)
– using multi-threading (MT)
• Results
• Concluding remarks

Video Encoder System (Overview)
[Figure: block diagram of the video encoder system]
• video in (1080p30)
• dedicated hardware
• MIPS processor running s/w
• DDR memory
• ethernet out (1000Mbps)
– encoded byte stream (IP/UDP/RTP)
– statistics (IP/UDP)
• sample bytes shown in the figure: ff 4c ff 51 00 2f 00 00 07 80 00 04 38 00 ff 93 f3 b6 ...

Overview of software
• Main software is partitioned into three parts
– Each part must finish before the next starts
– from h/w -> PART1 (rate optimization) -> PART2 (codestream formation) -> PART3 (output to network) -> DONE
• Timestamps are added to measure how long each part takes. Add up time for all three parts for performance metric.
– convert absolute time to frames/sec. (33.33ms -> 30fps)
• s/w also instrumented to count instructions.
– can calculate instr./cycle (IPC)
• h/w delivers input at 30 fps. Analyze rate at which s/w is done.
– visualize in GUI
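The time-to-rate conversion above can be sketched in C as follows. This is only an illustration under stated assumptions: the function name frames_per_second and the use of per-part times in milliseconds are hypothetical, not taken from the actual instrumentation.

```c
/* Hypothetical helper: convert the summed time of the three parts
 * (in milliseconds) into an achievable frame rate. */
double frames_per_second(double part1_ms, double part2_ms, double part3_ms)
{
    double frame_ms = part1_ms + part2_ms + part3_ms;

    if (frame_ms <= 0.0)
        return 0.0;            /* no valid measurement */
    return 1000.0 / frame_ms;  /* e.g. 33.33 ms per frame is about 30 fps */
}
```

A total above 33.33ms gives a rate below 30 fps, which means the s/w cannot keep up with the h/w input.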

Optimization approach
1. Identify functional hot-spots which can be replaced by user-defined custom instructions (UDI).
– base instruction-set is extended
– One custom instruction replaces many instructions from the base ISA.
– Highest impact when
• # instructions replaced is high
• function is called often.
2. Use multi-threading (MT) to run all three parts simultaneously.
– stalls in execution pipeline reduce instructions/cycle (IPC).
– when one thread stalls, attempt to schedule an instruction from another thread.
– increases effective IPC.

Using User-defined instructions (UDI)
• MIPS UDI allows complex functions to be implemented in a single custom instruction.
– ISA is extended to include new custom instructions
– Fully supported in compiler tool-chain.
• Instructions take the form:
    reg_result = custom_udi(reg_src1, reg_src2);
– Two 32-bit source operands (both optional) and one 32-bit result (also optional).
– Typical RISC style.
– Instructions can be pure (no side-effects), or can update internal state.
• Instructions are likely domain specific.

UDI Examples (1)
• Bit accumulation, with zero-stuffing.
– hard for 32-bit processor to do.
• <n> bits are pushed into an accumulator.
• When eight 1’s in a row occur, an extra “0” is added.
• data is popped out 16/32-bits at a time.
[Figure: accumulator state, bit by bit, for the instruction sequence below]
    bitwr_push 0x1f2, 10
    bitwr_push 0xfd, 8
    bitwr_getlen r10    (r10 <= 19)
    bitwr_pop16 r11     (r11 <= 0xecff)
    bitwr_push 0x17ffd, 18
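The stuffing rule described above can be modelled in plain C, which also shows why it costs many base instructions per call. This is a minimal sketch under assumptions: the type bitwr_t, the MSB-first bit order and the 64-bit accumulator are all hypothetical, and the model is not claimed to reproduce the register values shown on the slide, since the bit and pop ordering of the real instructions is not specified here.

```c
#include <stdint.h>

/* Software model only; all names and the MSB-first order are assumptions. */
typedef struct {
    uint64_t bits;      /* pending bits, most recently pushed bit in the LSB */
    unsigned len;       /* number of pending bits, including stuffed zeros */
    unsigned ones_run;  /* length of the current run of consecutive 1 bits */
} bitwr_t;

static void bitwr_put_bit(bitwr_t *a, unsigned bit)
{
    a->bits = (a->bits << 1) | (bit & 1u);
    a->len++;
}

/* Push the low n bits of value (n <= 32), MSB first.
 * After eight 1 bits in a row, one extra 0 bit is inserted.
 * The caller must pop often enough that len never exceeds 64. */
void bitwr_push(bitwr_t *a, uint32_t value, unsigned n)
{
    while (n > 0) {
        unsigned bit;

        n--;
        bit = (value >> n) & 1u;
        bitwr_put_bit(a, bit);
        a->ones_run = bit ? a->ones_run + 1 : 0;
        if (a->ones_run == 8) {
            bitwr_put_bit(a, 0);   /* zero-stuffing */
            a->ones_run = 0;
        }
    }
}

unsigned bitwr_getlen(const bitwr_t *a)
{
    return a->len;
}

/* Pop the 16 oldest bits; the caller must check bitwr_getlen() >= 16 first. */
uint16_t bitwr_pop16(bitwr_t *a)
{
    uint16_t out;

    a->len -= 16;
    out = (uint16_t)(a->bits >> a->len);
    a->bits &= (((uint64_t)1 << a->len) - 1);  /* keep only the unread bits */
    return out;
}
```

In this model, pushing 0x17ffd as 18 bits into an empty accumulator gives a length of 19, because its run of thirteen 1s triggers one stuffed zero.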

UDI Examples (2)
• FIFO pointer management.
– not domain specific. Could find use in multiple applications.
[Figure: ring buffer with ring_start, ring_end, rd_ptr and wr_ptr]
• Internal state:
    struct {
        unsigned *ring_start;
        unsigned *ring_end;
        unsigned *wr_ptr;
        unsigned *rd_ptr;
    } FIFO_PTR;
• s/w writes one word at a time
– check for buffer full
– handle wraparound
    unsigned *FIFO_PTR_INC_WP() {
        unsigned *retval, *next_wp;
        next_wp = retval = FIFO_PTR.wr_ptr;
        // increment and wrap
        next_wp += 1;
        if (next_wp == FIFO_PTR.ring_end)
            next_wp = FIFO_PTR.ring_start;
        // check for full
        if (next_wp == FIFO_PTR.rd_ptr)
            return NULL;
        FIFO_PTR.wr_ptr = next_wp;
        return retval;
    }
• Usage:
    ptr = FIFO_PTR_INC_WP();
    if (ptr) *ptr = data;
• FIFO_PTR_INC_WP() reduced to one atomic UDI
    PC: bfc059fc UDI r3 // inc_wp
    PC: bfc05a00 BEQZ r3, 0xbfc05ac8
    PC: bfc05a04 NOP
    PC: bfc05a08 SW r3, 0(r3)
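The read side of the same ring can be sketched the same way. The write function below follows the code on the slide; FIFO_PTR_INIT and FIFO_PTR_INC_RP are hypothetical counterparts written for illustration, assuming that the buffer is empty when rd_ptr equals wr_ptr and that ring_end points one past the last slot (as the wrap test in the write function implies).

```c
#include <stddef.h>

/* Same state as on the slide, repeated so the sketch compiles on its own. */
static struct {
    unsigned *ring_start;
    unsigned *ring_end;   /* one past the last slot */
    unsigned *wr_ptr;
    unsigned *rd_ptr;
} FIFO_PTR;

/* Hypothetical setup helper: an empty ring over n_words slots. */
void FIFO_PTR_INIT(unsigned *buf, size_t n_words)
{
    FIFO_PTR.ring_start = buf;
    FIFO_PTR.ring_end = buf + n_words;
    FIFO_PTR.wr_ptr = buf;
    FIFO_PTR.rd_ptr = buf;
}

/* Write side, as on the slide: returns the slot to write, or NULL if full. */
unsigned *FIFO_PTR_INC_WP(void)
{
    unsigned *retval = FIFO_PTR.wr_ptr;
    unsigned *next_wp = retval + 1;

    if (next_wp == FIFO_PTR.ring_end)   /* wrap around */
        next_wp = FIFO_PTR.ring_start;
    if (next_wp == FIFO_PTR.rd_ptr)     /* full */
        return NULL;
    FIFO_PTR.wr_ptr = next_wp;
    return retval;
}

/* Hypothetical read side: returns the slot to read, or NULL if empty. */
unsigned *FIFO_PTR_INC_RP(void)
{
    unsigned *retval = FIFO_PTR.rd_ptr;
    unsigned *next_rp;

    if (retval == FIFO_PTR.wr_ptr)      /* empty */
        return NULL;
    next_rp = retval + 1;
    if (next_rp == FIFO_PTR.ring_end)   /* wrap around */
        next_rp = FIFO_PTR.ring_start;
    FIFO_PTR.rd_ptr = next_rp;
    return retval;
}
```

With this full test, a ring of N slots holds at most N-1 words, since one slot stays unused to distinguish full from empty.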

UDI name              cycles saved   instr. saved   freq. of use   overall
                      (per use)      (per use)      (per frame)    speedup
BIT WRITE (push)      46-161         29-77          20889          22% (BIT WRITE combined)
BIT WRITE (get_len)   46-108         24-48          4185
BIT WRITE (pop)       31-82          16-42          3288
FIFO PTR (inc wp)     39-101         22-46          3288           1.9% (FIFO PTR combined)
FIFO PTR (inc rp)     16             8              9

UDI savings
[Figure: execution trace with cycle count and instr. count, marking two replaced code sequences: 13 instr, 38 cyc. and 34 instr, 57 cyc.]
• Two UDI replace 47 standard instructions, taking 95 cycles.
• UDI does not stall.
• Amount saved is dependent on input.
• # standard instructions variable.
• With UDI, always 2 instructions.

multi-threading (1)
• instructions/cycle (IPC) is a measure of efficiency in CPU execution pipeline.
– stalls due to cache misses, multi-cycle instructions, branch penalties, etc… decrease IPC.
• A CPU working in multi-threaded mode attempts to schedule instructions from a different thread when one stalls.
– increases effective IPC
• Programs with low IPC in single-threaded mode benefit most from multi-threading.
Representative execution statistics of our program gathered in the lab:
    part1: 3056 cyc, 1587 instr.
    part2: 4597034 cyc, 1954337 instr.
    part3: 2454570 cyc, 816940 instr.
    total: 7054660 cyc, 2772864 instr.
    avg. IPC is 0.393
avg. IPC is low!!
Expect MT to have significant impact
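The average IPC quoted above follows directly from the per-part counters. A minimal sketch of that calculation, assuming a hypothetical part_stats record (the real instrumentation is not shown in the slides):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one part's counters. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
} part_stats;

/* Instructions per cycle; returns 0 when no cycles were counted. */
double ipc(part_stats s)
{
    return s.cycles ? (double)s.instructions / (double)s.cycles : 0.0;
}

/* Sum the counters of all parts to get whole-frame statistics. */
part_stats total_stats(const part_stats *parts, size_t n)
{
    part_stats total = {0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        total.cycles += parts[i].cycles;
        total.instructions += parts[i].instructions;
    }
    return total;
}
```

Fed with the three lines above, total_stats gives 7054660 cycles and 2772864 instructions, and ipc of that total is about 0.393.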

multi-threading (2)
• Execution of our program (in ST) over time is shown below.
[Figure: timeline of frame1, frame2 and frame3 against 30fps deadlines; each frame runs part1, part2, part3 in sequence and is marked "TOO SLOW"]
– Too slow. The 30fps time budget is overrun.
• With MT, each part runs in its own thread, and the threads are interleaved.
– overall effect is better performance.

Multi-threading and IRQ handling
• Traditional ST programs get interrupted when external IRQs are asserted.
– running of ‘normal’ program is interrupted by running the IRQ handler.
• When MT programs are architected the same way, ALL threads are interrupted when an IRQ occurs.
– On IRQ, CPU goes to exception level and MT is effectively turned off.
– very inefficient. When IRQ handler stalls, cycles are wasted.
• Our program takes many interrupts. (175k / sec.)
• Different approach:
– IRQ handler is given its own thread.
– Assertion of an IRQ does not cause a CPU interrupt. It wakes up the thread with the IRQ handler.
– When IRQ handler runs, it is scheduled simultaneously with other threads in the system.
– No IRQ overhead.
– CPU never goes to exception level.

Performance gain from MT
[Figure: performance plot comparing ST and MT, annotated "45%"]
• Original performance: 83.72ms
• With UDI and MT: 43.37ms
Discussion of Results
• Adding UDI decreases #instr. and IPC.
– custom instructions are part of multiplier pipeline.
• When MT is used, same # instr. takes longer.
– IPC of individual threads lower
– Overall IPC (performance) is higher.
• lower IPC in ST means greater gain from ST->MT
• Frequency of CPU does not matter
– Our application is not I/O or memory bound.

ST/noUDI (111MHz): 86.6ms. IPC 42.42%
      cyc.         instr.     IPC
p1:   2*1126       1130       50.17%
p2:   2*3301726    2967154    44.95%
p3:   2*1508140    1114213    37.01%

ST/UDI (111MHz): 65.4ms. IPC 39.39%
      cyc.         instr.     IPC
p1:   2*1126       1130       50.17%
p2:   2*2118058    1741554    41.17%
p3:   2*1508130    1114291    37.01%

MT/noUDI (111MHz): 68.6ms. (26% faster than ST/noUDI)
      cyc.         instr.     IPC
p1:   2*1458       1125       38.58%
p2:   2*3745384    2967201    39.61%
p3:   2*1508443    1080524    35.89%

MT/UDI (111MHz): 43.8ms. (49% faster than ST/UDI)
      cyc.         instr.     IPC
p1:   2*1973       1125       28.50%
p2:   2*2435277    1741563    35.76%
p3:   2*1508548    1078515    35.83%

ST/UDI/rate_alloc (111MHz): 89.5ms. IPC 35.22%
      cyc.         instr.     IPC
p1:   2*1339904    639013     23.84%
p2:   2*2115672    1741554    41.17%
p3:   2*1508090    1113907    37.00%

MT/UDI/rate_alloc (111MHz): 57.3ms. (34/30/32) (56% faster than ST/UDI/rate_alloc)
      cyc.         instr.     IPC
p1:   2*1531915    639041     19.25%
p2:   2*3187194    1741574    27.34%
p3:   2*2249536    1057951    23.56%

• adding extra processing with memory accesses and FPU decreases IPC.
• effect of MT is enhanced.
ST/noUDI -> MT/UDI: 98%
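The percentages attached to the MT results are consistent with a throughput gain computed as (time before / time after) - 1. The sketch below makes that arithmetic explicit; the function names are hypothetical, and reading the figures this way is an inference from the numbers rather than something the slides state.

```c
/* Throughput gain in percent when the frame time drops
 * from before_ms to after_ms. */
double gain_percent(double before_ms, double after_ms)
{
    if (after_ms <= 0.0)
        return 0.0;
    return (before_ms / after_ms - 1.0) * 100.0;
}

/* Reduction in frame time, in percent of the original time. */
double time_saved_percent(double before_ms, double after_ms)
{
    if (before_ms <= 0.0)
        return 0.0;
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

For example, gain_percent(86.6, 68.6) is about 26, gain_percent(65.4, 43.8) about 49, gain_percent(89.5, 57.3) about 56 and gain_percent(86.6, 43.8) about 98. The UDI-only step from 86.6ms to 65.4ms gives time_saved_percent of about 24, in line with the 20-25% saving quoted for UDI.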

Concluding Remarks
• Over 50% improvement in performance was obtained by using two simple techniques:
– Use of custom user-defined instructions (UDI)
– Use of multi-threading (MT) technology.
• UDI reduces the number of instructions executed. Consistently saves 20-25%.
– Easy to implement compared to dedicated h/w design.
– man-weeks of work vs. man-years.
• Benefit of MT is more variable.
– Between 26% and 49% has been measured.
– depends on operating point, image complexity and IPC of application.
– Heavily loaded systems benefit more.
– memory or I/O bound applications benefit more