Lecture 9: Thread Divergence, Scheduling, Instruction Level Parallelism

Page 1:

Lecture 9

Thread Divergence Scheduling

Instruction Level Parallelism

Page 2:

Announcements
•  Tuesday's lecture on 2/11 will be moved to room 4140, from 6:30 PM to 7:50 PM
•  Office hours are cancelled on Thursday; make an appointment to see me later this week
•  CUDA profiling with nvprof: see slides 13-16 of http://www.nvidia.com/docs/IO/116711/sc11-perf-optimization.pdf

Scott B. Baden /CSE 260/ Winter 2014 2

Page 3:

Recapping optimized matrix multiply
•  Volkov and Demmel, SC08
•  Improve performance using fewer threads
   u  Reducing concurrency frees up registers to trade locality against parallelism
   u  ILP to increase processor utilization


Page 4:

Hiding memory latency
•  Parallelism = latency × throughput
      Arithmetic: 576 ops/SM = 18 CP × 32 ops/SM/CP
      Memory: 150 KB = ~500 CP (1100 nsec) × 150 GB/sec
•  How can we keep 150 KB in flight?
   u  Multiple threads: ~35,000 threads @ 4B/thread
   u  ILP (increase fetches per thread)
   u  Larger fetches (64 or 128 bit/thread)
   u  Higher occupancy

Copy 1 float/thread, need 100% occupancy:
   int indx = threadIdx.x + block * blockDim.x;
   float a0 = src[indx];
   dest[indx] = a0;

Copy 2 floats/thread, need 50% occupancy:
   int indx = threadIdx.x + 2 * block * blockDim.x;
   float a0 = src[indx];
   float a1 = src[indx + blockDim.x];
   dest[indx] = a0;
   dest[indx + blockDim.x] = a1;

Copy 4 floats/thread, need 25% occupancy:
   int indx = threadIdx.x + 4 * block * blockDim.x;
   float a[4];  // in registers
   for (int i = 0; i < 4; i++) a[i] = src[indx + i * blockDim.x];
   for (int i = 0; i < 4; i++) dest[indx + i * blockDim.x] = a[i];

Page 5:

Today’s lecture

•  Thread divergence
•  Scheduling
•  Instruction Level Parallelism

Page 6:

Recapping: warp scheduling on Fermi
•  Threads are assigned to an SM in units of a thread block; multiple blocks per SM
•  Each block is divided into warps of 32 (SIMD) threads, the schedulable unit
   u  A warp becomes eligible for execution when all its operands are available
   u  Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
   u  All threads in a warp execute the same instruction; branches serialize execution
•  Multiple warps are simultaneously active, hiding data transfer delays
•  All registers in all the warps are available: zero-overhead scheduling
•  Hardware is free to assign blocks to any SM

[Figure: an SM with MT issue unit and shared/L1 memory feeding thread lanes t0 t1 t2 … tm; issue order over time interleaves warps: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]

Page 7:

Instruction Issue on Fermi
•  Each vector unit has
   •  32 CUDA cores for integer and floating-point arithmetic
   •  4 special function units for single-precision transcendentals
•  2 warp schedulers: each scheduler issues 1 (2) instructions for capability 2.0 (2.1)
•  One scheduler is in charge of odd warp IDs, the other even warp IDs
•  Only 1 scheduler can issue a double-precision floating-point instruction at a time
•  A warp scheduler can issue an instruction to ½ the CUDA cores in an SM
•  The scheduler must issue an integer or floating-point arithmetic instruction over 2 clock cycles

Page 8:

Dynamic behavior – resource utilization
•  Each vector core (SM) has 1024 thread slots and 8 block slots
•  Hardware partitions slots into blocks at run time, accommodating different processing capacities
•  Registers are split dynamically across all blocks assigned to the vector core
•  A register is private to a single thread within a single block

Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC

Page 9:

Thread Divergence
•  All the threads in a warp execute the same instruction
•  Different control paths are serialized
•  Divergence occurs when a predicate is a function of the thread ID

      if (threadIdx.x < 2) { }

•  No divergence if all threads in a warp follow the same path

      if (threadIdx.x / WARP_SIZE < 2) { }

•  Consider a reduction, e.g. the summation ∑i xi

Page 10:

Thread divergence
•  All the threads in a warp execute the same instruction
•  Different control paths are serialized

[Figure: a warp is a group of scalar threads; at a branch, the threads taking Path A and the threads taking Path B execute serially before reconverging; thread warps 3, 7, and 8 shown interleaved] (Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, UBC)

Page 11:

Divergence example

      if (threadIdx >= 2) a = 100;
      else a = -100;

[Figure: SIMD lanes P0, P1, …, PM-1, each with its own registers, all comparing threadIdx against 2] (Mary Hall)

Page 12:

Divergence example

      if (threadIdx >= 2) a = 100;
      else a = -100;

[Figure: first pass - lanes P0 and P1 are masked off (✗) while the lanes with threadIdx ≥ 2 execute a = 100 (✓)] (Mary Hall)

Page 13:

Divergence example

      if (threadIdx >= 2) a = 100;
      else a = -100;

[Figure: second pass - lanes P0 and P1 execute a = -100 (✓) while the remaining lanes are masked off (✗)] (Mary Hall)

Page 14:

Thread Divergence
•  All the threads in a warp execute the same instruction
•  Different control paths are serialized
•  Divergence occurs when a predicate is a function of the thread ID

      if (threadIdx.x < 2) { }

•  No divergence if all threads within a warp follow the same path

      if (threadIdx.x / WARP_SIZE < 2) { }

•  We can have different control paths within the thread block

Page 15:

Example – reduction – thread divergence

[Figure: tree summation of 3 1 7 0 4 1 6 3 across elements 0-11. Step 1: pairs 0+1, 2+3, 4+5, 6+7, 8+9, 10+11 produce 4 7 5 9 …; step 2: groups 0…3, 4…7, 8…11 produce 11 14 …; step 3: groups 0…7 and 8…15 produce the total, 25. Only even-numbered threads 0, 2, 4, 6, 8, 10 are active in step 1] (David Kirk/NVIDIA & Wen-mei Hwu/UIUC)

Page 16:

The naïve code

   __global__ void reduce(int *input, unsigned int N, int *total)
   {
       unsigned int tid = threadIdx.x;
       unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
       __shared__ int x[BSIZE];
       x[tid] = (i < N) ? input[i] : 0;
       __syncthreads();
       for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
           __syncthreads();
           if (tid % (2 * stride) == 0)
               x[tid] += x[tid + stride];
       }
       if (tid == 0) atomicAdd(total, x[tid]);
   }

Page 17:

Reducing divergence and avoiding bank conflicts

[Figure: sequential addressing - threads 0 through 15 add element pairs 0+16 through 15+31 across indices 0-19, keeping the active threads contiguous] (David Kirk/NVIDIA & Wen-mei Hwu/UIUC)

Page 18:

The improved code

Before (naïve):
   for (stride = 1; stride < blockDim.x; stride *= 2) {
       __syncthreads();
       if (tid % (2 * stride) == 0)
           x[tid] += x[tid + stride];
   }

After:
   __shared__ int x[BSIZE];
   unsigned int tid = threadIdx.x;
   unsigned int s;
   for (s = blockDim.x / 2; s > 0; s /= 2) {
       __syncthreads();
       if (tid < s)
           x[tid] += x[tid + s];
   }

•  All threads in a warp execute the same instruction: reduceSum<<<N/512, 512>>>(x, N)
•  No divergence until stride < 32
•  All warps active when stride ≥ 32

   s = 256: threads 0:255
   s = 128: threads 0:127
   s = 64:  threads 0:63
   s = 32:  threads 0:31
   s = 16:  threads 0:15
   …

Page 19:

Predication on Fermi
•  All instructions support predication in capability 2.x
•  Condition code or predicate per thread: set to true or false
•  An instruction executes only if its predicate is true

   if (x > 1)
       y = 7;

becomes

   test = (x > 1)
   test: y = 7

•  The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is not too large
•  The threshold is 7 instructions if the compiler predicts too many divergent warps, else 4

Page 20:

Concurrency – Host & Device
•  Nonblocking operations
   •  Asynchronous Device ↔ {Host, Device} transfers
   •  Kernel launch
•  Interleaved CPU and device calls: kernel calls run asynchronously with respect to the host
•  But a kernel call sequence runs sequentially on the device
•  Multiple kernel invocations running in separate CUDA streams: interleaved, independent computations

   cudaStream_t st1, st2, st3, st4;
   cudaStreamCreate(&st1); …
   cudaMemcpyAsync(d1, h1, N, H2D, st1);
   kernel2<<<Grid, Block, 0, st2>>>( …, d2, … );
   kernel3<<<Grid, Block, 0, st3>>>( …, d3, … );
   cudaMemcpyAsync(h4, d4, N, D2H, st4);
   CPU_function(…);

on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf

Page 21:

Today’s lecture

•  Bank conflicts (previous lecture)
•  Further improvements to Matrix Multiplication
•  Thread divergence
•  Scheduling

Page 22:

Dual warp scheduling on Fermi
•  2 schedulers/SM each find an eligible warp
   u  Each issues 1 instruction per cycle
   u  Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy
   u  Issue selection: round-robin / age of warp
   u  One scheduler handles odd warp IDs, the other even warp IDs
   u  A warp scheduler can issue an instruction to ½ the cores; each scheduler issues 1 (2) instructions for capability 2.0 (2.1)
   u  16 load/store units
   u  The scheduler must issue an integer or floating-point arithmetic instruction over 2 clock cycles
   u  Most instructions can be dual issued: 2 int, 2 FP, or int/FP/ld/st
   u  Only 1 double-precision instruction at a time
•  All registers in all the warps are available: zero-overhead scheduling
•  Overhead may be different when switching blocks

Page 23:

What makes a processor run faster?
•  Registers and cache
•  Pipelining
•  Instruction level parallelism
•  Vectorization, SSE

Page 24:

Pipelining
•  Assembly line processing - an auto plant
•  Dave Patterson's laundry example: 4 people doing laundry
      wash (30 min) + dry (40 min) + fold (20 min) = 90 min

[Figure: loads A, B, C, D staggered across a 6 PM to 9 PM timeline; stage widths 30, 40, 40, 40, 40, 20 min]

•  Sequential execution takes 4 × 90 min = 6 hours
•  Pipelined execution takes 30 + 4×40 + 20 min = 3.5 hours
•  Bandwidth = loads/hour
•  Pipelining helps bandwidth but not latency (90 min)
•  Bandwidth is limited by the slowest pipeline stage
•  Potential speedup = number of pipe stages

Page 25:

Instruction level parallelism
•  Execute more than one instruction at a time

      x = y / z;
      a = b + c;

•  Dynamic techniques
   u  Allow stalled instructions to proceed
   u  Reorder instructions

      x = y / z;            x = y / z;
      a = x + c;     ⇒      t = c – q;
      t = c – q;            a = x + c;

Page 26:

Multi-execution pipeline
•  MIPS R4000

[Figure: IF - ID - execution - MEM - WB, with four parallel execution paths: integer unit (ex), FP/int multiply (stages m1-m7), FP adder (stages a1-a4), FP/int divider (Div, latency = 25)] (David Culler)

      x = y / z;  a = b + c;  s = q * r;  i++;

      x = y / z;  t = c – q;  a = x + c;

Page 27:

Scoreboarding, Tomasulo's algorithm
•  Hardware data structures keep track of
   u  When instructions complete
   u  Which instructions depend on the results
   u  When it's safe to write a register
•  Deals with data hazards
   u  WAR (Write after read)
   u  RAW (Read after write)
   u  WAW (Write after write)

[Figure: Instruction Fetch, then Issue & Resolve, then operand fetch and execution units (FU), all controlled by the scoreboard]

      a = b * c
      x = y – b
      q = a / y
      y = x + b

Page 28:

Scoreboarding
•  Keep track of all register operands of all instructions in the Instruction Buffer
   u  An instruction becomes ready after the needed values are written
   u  Eliminates hazards
   u  Ready instructions are eligible for issue
•  Decouples the Memory/Processor pipelines
   u  Threads can issue instructions until the scoreboard prevents issue
   u  Allows Memory/Processor ops to proceed in parallel with other waiting Memory/Processor ops

[Figure: warp issue timeline across thread blocks TB1-TB3 and warps W1-W3 (TB = Thread Block, W = Warp); when a warp stalls (TB1 W1, TB2 W1, TB3 W2), another warp's instructions are switched in]

Page 29:

Static scheduling limits performance
•  The ADDD instruction is stalled on the DIVide…
•  …stalling further instruction issue, e.g. the SUBD

      DIV  F0,  F2, F4
      ADDD F10, F0, F8
      SUBD F12, F8, F14

•  But SUBD doesn't depend on ADDD or DIV
•  If we have two adder/subtraction units, one will sit idle uselessly until the DIV finishes

Page 30:

Dynamic scheduling
•  Idea: modify the pipeline to permit instructions to execute as soon as their operands become available
•  This is known as out-of-order execution (classic dataflow)
•  The SUBD can now proceed normally
•  Complication: dynamically scheduled instructions also complete out of order

Page 31:

Dynamic scheduling splits the ID stage
•  Issue sub-stage
   u  Decode the instructions
   u  Check for structural hazards
•  Read-operands sub-stage
   u  Wait until there are no data hazards
   u  Read operands
•  We need additional registers to store pending instructions that aren't ready to execute

[Figure: I$ feeding a multithreaded instruction buffer; register file, C$, and L1 shared memory feeding operand select; then MAD and SFU units] (Mary Hall)

Page 32:

Consequences of a split ID stage
•  We distinguish between the time when an instruction begins execution and when it completes
•  Previously, an instruction stalled in the ID stage, and this held up the entire pipeline
•  Instructions can now be in a suspended state, neither stalling the pipeline nor executing
•  They are waiting on operands

Page 33:

Two schemes for dynamic scheduling
•  Scoreboard
   u  CDC 6600
•  Tomasulo's algorithm
   u  IBM 360/91
•  We'll vary the number of functional units, their latency, and functional unit pipelining

Page 34:

What is scoreboarding?
•  A technique that allows instructions to execute out of order…
   u  So long as there are sufficient resources, and
   u  No data dependencies
•  The goal of scoreboarding
   u  Maintain an execution rate of one instruction per clock cycle

Page 35:

Multiple execution pipelines in DLX with scoreboarding

   Unit         Latency
   Integer       0
   Memory        1
   FP Add        2
   FP Multiply  10
   FP Div       40

[Figure: scoreboard exchanging control/status with an integer unit, FP add, FP divide, and two FP multiply units, connected to the registers by data buses]

Page 36:

Scoreboard controls the pipeline

[Figure: Instruction Fetch, then Instruction Decode into a pre-issue buffer; Instruction Issue (WAW check) into pre-execution buffers; Read Operands (RAW check) feeding execution units 1, 2, …; post-execution buffers, then Write Results (WAR check); the scoreboard / control unit oversees every stage] (Mike Frank)

Page 37:

What are the requirements?
•  Responsibility for instruction issue and execution, including hazard detection
•  Multiple instructions must be in the EX stage simultaneously
•  Either through pipelining or multiple functional units
•  DLX has: 2 multipliers, 1 divider, 1 integer unit (memory, branch, integer arithmetic)

Page 38:

How does it work?
•  As each instruction passes through the scoreboard, construct a description of its data dependencies (Issue)
•  The scoreboard determines when the instruction can read operands and begin execution
•  If the instruction can't begin execution, the scoreboard keeps a record and watches for the moment it can execute
•  It also controls when an instruction may write its result
•  All hazard detection is centralized

Page 39:

How is a scoreboard implemented?
•  A centralized bookkeeping table
•  Tracks instructions, the register operand(s) they depend on, and which register they modify
•  Status of result registers (who is going to write to a given register)
•  Status of the functional units
