
CALTECH CS137 Winter2006 -- DeHon 1

CS137: Electronic Design Automation

Day 9: January 30, 2006

Parallel Prefix


Today

• Bit-Level
  – Addition
  – LUT Cascades
• For Sums
  – Applications
• FSMs
• SATADD
• Data Forwarding
• Pointer Jumping
  – Applications


Introduction / Reminder

Addition in Log Time


Ripple Carry Addition

• Simple “definition” of addition
• Serially resolve carry at each bit


CLA

• Think about each adder bit as computing a function on the carry in
  – c[i]=g(c[i-1])
  – Particular function g will depend on a[i], b[i]
  – g=f(a,b)


Functions

• What functions can g(c[i-1]) be?
  – g(x)=1
    • a[i]=b[i]=1
  – g(x)=x
    • a[i] xor b[i]=1
  – g(x)=0
    • a[i]=b[i]=0


Functions

• What functions can g(c[i-1]) be?
  – g(x)=1 Generate
    • a[i]=b[i]=1
  – g(x)=x Propagate
    • a[i] xor b[i]=1
  – g(x)=0 Squash
    • a[i]=b[i]=0


Combining

• Want to combine functions
  – Compute c[i]=g[i](g[i-1](c[i-2]))
  – Compute the composition of two functions
• What functions will the composition of two of these functions be?
  – Same as before
    • Propagate, generate, squash


Compose Rules

(LSB MSB) → Compose Result

GG

GP

GS

PG

PP

PS

SG

SP

SS


Compose Rules

(LSB MSB) → Compose Result

GG → G
GP → G
GS → S
PG → G
PP → P
PS → S
SG → G
SP → S
SS → S
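The compose rules can be derived mechanically rather than memorized. The sketch below models the three carry functions as small Python functions and composes every (LSB, MSB) pair; the names `CARRY_FUNCS`, `classify`, `compose`, and `COMPOSE_TABLE` are illustrative and do not come from the lecture.

```python
# Each carry function is named by one letter:
#   G (generate): g(x) = 1,  P (propagate): g(x) = x,  S (squash): g(x) = 0
CARRY_FUNCS = {
    "G": lambda x: 1,
    "P": lambda x: x,
    "S": lambda x: 0,
}


def classify(func):
    """Return the letter of the carry function that matches func on both inputs."""
    for name, candidate in CARRY_FUNCS.items():
        if all(func(x) == candidate(x) for x in (0, 1)):
            return name
    raise ValueError("not a carry function")


def compose(lsb, msb):
    """Compose two carry functions: the MSB function is applied to the LSB output."""
    return classify(lambda x: CARRY_FUNCS[msb](CARRY_FUNCS[lsb](x)))


# Keys are written "LSB MSB", matching the slide's column order.
COMPOSE_TABLE = {lsb + msb: compose(lsb, msb) for lsb in "GPS" for msb in "GPS"}

for pair, result in COMPOSE_TABLE.items():
    print(pair, "->", result)
```

The pattern that falls out: a propagating MSB passes the LSB result through, and any other MSB decides the result alone.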


Combining

• Do it again…

• Combine g[i-3,i-2] and g[i-1,i]

• What do we get?


Reduce Tree


Associative Reduce → Prefix

• Shows us how to compute the Nth value in O(log(N)) time
• Can actually produce all intermediate values in this time
  – w/ only a constant factor more hardware


Prefix Tree


Parallel Prefix

• Important Pattern

• Applicable any time operation is associative

• Function Composition is always associative
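As a software illustration of the pattern, the sketch below computes every carry of an addition with a doubling-distance scan (Hillis-Steele style). This is a swapped-in variant chosen for brevity: it uses more operators than the tree on the earlier slides, but it shows the same point, that an associative operator lets all prefixes finish in about log2(N) rounds. All function names here are assumptions made for the example.

```python
def combine(lsb, msb):
    """Compose carry functions (G, P, S): a propagating MSB passes the LSB result through."""
    return lsb if msb == "P" else msb


def bit_function(a, b):
    """Carry function selected by one pair of operand bits."""
    if a and b:
        return "G"
    return "P" if a != b else "S"


def parallel_prefix(items, op):
    """Inclusive scan; the updates inside one round are independent of each other."""
    result = list(items)
    distance, rounds = 1, 0
    while distance < len(result):
        result = [
            op(result[i - distance], result[i]) if i >= distance else result[i]
            for i in range(len(result))
        ]
        distance *= 2
        rounds += 1
    return result, rounds


def carries(a_bits, b_bits, carry_in=0):
    """Carry out of every bit position (bit lists are LSB first)."""
    funcs = [bit_function(a, b) for a, b in zip(a_bits, b_bits)]
    prefixes, rounds = parallel_prefix(funcs, combine)
    evaluate = {"G": lambda c: 1, "P": lambda c: c, "S": lambda c: 0}
    return [evaluate[f](carry_in) for f in prefixes], rounds


# 11 + 6 with 4-bit operands, LSB first: carries out of bits 0..3
print(carries([1, 1, 0, 1], [0, 1, 1, 0]))  # ([0, 1, 1, 1], 2)
```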


Generalizing

LUT Cascade


Cascaded LUT Delay Model

• Tcascade = T(3LUT) + T(mux)
• Don’t pay
  – General interconnect
  – Full 4-LUT delay


Parallel Prefix LUT Cascade?

• Can we do better than N×Tmux?
• Can we compute LUT cascade in O(log(N)) time?
• Can we compute mux cascade using parallel prefix?
• Can we make mux cascade associative?


Parallel Prefix Mux cascade

• How can mux transform S→mux-out?
  – A=0, B=0 → mux-out=0
  – A=1, B=1 → mux-out=1
  – A=0, B=1 → mux-out=S
  – A=1, B=0 → mux-out=/S


Parallel Prefix Mux cascade

• How can mux transform S→mux-out?
  – A=0, B=0 → mux-out=0, Stop = S
  – A=1, B=1 → mux-out=1, Generate = G
  – A=0, B=1 → mux-out=S, Buffer = B
  – A=1, B=0 → mux-out=/S, Invert = I


Parallel Prefix Mux cascade

• How can 2 muxes transform input?

• Can I compute 2-mux transforms from 1 mux transforms?


Two-mux transforms

• SS → S
• SG → G
• SB → S
• SI → G
• GS → S
• GG → G
• GB → G
• GI → S
• BS → S
• BG → G
• BB → B
• BI → I
• IS → S
• IG → G
• IB → I
• II → B


Generalizing mux-cascade

• How can N muxes transform the input?

• Is mux transform composition associative?
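Both questions can be checked by brute force. The sketch below represents each one-mux transform by its two-entry truth table (which is just the mux data inputs A and B), composes them, tests associativity over all 64 triples, and reduces a cascade pairwise in a tree. The helper names are assumptions for illustration, not part of the lecture.

```python
from itertools import product

# A transform is the pair (output when input=0, output when input=1),
# i.e. the mux data inputs (A, B): Stop, Generate, Buffer, Invert.
TRANSFORMS = {"S": (0, 0), "G": (1, 1), "B": (0, 1), "I": (1, 0)}
NAMES = {table: name for name, table in TRANSFORMS.items()}


def compose(first, second):
    """Transform of two cascaded muxes: `second` sees the output of `first`."""
    f, g = TRANSFORMS[first], TRANSFORMS[second]
    return NAMES[(g[f[0]], g[f[1]])]


def is_associative():
    """Exhaustively compare both groupings for every triple of transforms."""
    return all(
        compose(compose(x, y), z) == compose(x, compose(y, z))
        for x, y, z in product(TRANSFORMS, repeat=3)
    )


def reduce_tree(names):
    """Combine neighbouring pairs level by level; returns (transform, levels)."""
    level, depth = list(names), 0
    while len(level) > 1:
        paired = [compose(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element passes up unchanged
        level, depth = paired, depth + 1
    return level[0], depth


print(is_associative())             # True
print(reduce_tree(list("IIBISI")))  # ('G', 3)
```

Because any N-mux cascade collapses to one of the same four transforms, the reduce tree needs only fixed-size combining cells.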


Associative Reduce Mux-Cascade

Can be hardwired, no general interconnect


For Sums


Prefix Sum

• Common Operation:
  – Want B[x] such that B[x]=A[0]+A[1]+…+A[x]
  – For i=0 to x
    • B[i]=B[i-1]+A[i] (with B[-1]=0)


Prefix Sum

• Compute in tree fashion
  – A[I]+A[I+1]
  – A[I]+A[I+1]+A[I+2]+A[I+3]
  – …
• Combine partial sums back down tree
  – S(0:7)+S(8:9)+S(10)=S(0:10)
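A recursive sketch of the up-then-down pattern, assuming a plain Python list as input: adjacent pairs are summed going up, the half-length problem is solved recursively, and the remaining positions are filled in on the way back down. The function name `prefix_sum` is illustrative.

```python
from itertools import accumulate


def prefix_sum(a):
    """Inclusive prefix sum computed in tree fashion."""
    n = len(a)
    if n <= 1:
        return list(a)
    # Up the tree: sums of adjacent pairs A[I]+A[I+1].
    pairs = [a[i] + a[i + 1] for i in range(0, n - 1, 2)]
    # Recursive call yields sub[k] = A[0] + ... + A[2k+1].
    sub = prefix_sum(pairs)
    # Back down: odd positions read sub directly, even positions add one element.
    out = []
    for i in range(n):
        if i == 0:
            out.append(a[0])
        elif i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(sub[i // 2 - 1] + a[i])
    return out


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(prefix_sum(data))  # [3, 4, 8, 9, 14, 23, 25, 31, 36, 39, 44]
assert prefix_sum(data) == list(accumulate(data))
```

Each recursion level only combines neighbours, so in hardware every level is one stage of adders working in parallel.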


Other simple operators

• Prefix-OR

• Prefix-AND

• Prefix-MAX

• Prefix-MIN


Find-First One

• Useful for arbitration
  – Finds first (highest-priority) requestor
  – Also magnitude finding in numbers
• How:
  – Prefix-OR
  – Locally compute X[I-1]^X[I]
  – Flags the first one
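The two steps can be sketched as below. `itertools.accumulate` is sequential and stands in for the prefix-OR tree here; the assumption that index 0 is the highest-priority position, and the name `find_first_one`, are choices made for this example.

```python
from itertools import accumulate
from operator import or_


def find_first_one(requests):
    """One-hot list flagging the first 1 in `requests` (index 0 = highest priority)."""
    prefix_or = list(accumulate(requests, or_))   # X[I]
    previous = [0] + prefix_or[:-1]               # X[I-1], with 0 left of position 0
    return [p ^ x for p, x in zip(previous, prefix_or)]


print(find_first_one([0, 0, 1, 0, 1, 1]))  # [0, 0, 1, 0, 0, 0]
```

The XOR is 1 only where the prefix-OR steps from 0 to 1, which happens at exactly one position if any request is present.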


Arbitration

• Often want to find first M requestors
  – E.g. Assign unique memory ports to first M processors requesting
• Prefix-sum across all potential requesters
• Counts requesters, giving unique number to each
• Know if one of first M
  – Perhaps which resource assigned
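A minimal sketch of this arbitration scheme, assuming requests arrive as a 0/1 list in priority order and that resources are numbered from 0; `arbitrate` is a hypothetical helper name, and `accumulate` again stands in for the parallel prefix-sum network.

```python
from itertools import accumulate


def arbitrate(requests, m):
    """Assign resources 0..m-1 to the first m requesters; others get None."""
    counts = list(accumulate(requests))  # inclusive prefix sum numbers each requester
    return [
        count - 1 if req and count <= m else None
        for req, count in zip(requests, counts)
    ]


print(arbitrate([0, 1, 1, 0, 1, 1], m=2))  # [None, 0, 1, None, None, None]
```

Each requester decides locally from its own count whether it is among the first M and which resource it gets.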


Partitioning

• Use something to order
  – E.g. spectral linear ordering
  – …or 1D cellular swap to produce linear order
• Parallel prefix on area of units
  – If not all same area
• Know where the midpoint is


Channel Width

• Prefix sum on delta wires at each node
  – To compute net channel widths at all points along channel
  – E.g. 1D ordered
• Maybe use with cellular placement scheme


Rank Finding

• Looking for I’th ordered element (counting from the smallest)
• Do a prefix-sum on high-bit only
  – Know m=number of things ≤ 01111111… (high bit clear)
• High-low search on result
  – I.e. if m ≥ I, recurse on half with leading zero
  – If m < I, search for (I-m)’th element in half with high-bit true
• Find median in log2(N) time
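The search can be sketched as a bit-by-bit selection. This version assumes non-negative integers of a known width and counts rank from the smallest element (k = 1 is the minimum); the `sum(flags)` call plays the role of the prefix-sum's final value, and the name `select_kth` is illustrative.

```python
def select_kth(values, k, bits):
    """Return the k-th smallest value (1-indexed) among `bits`-wide non-negative ints."""
    if not 1 <= k <= len(values):
        raise ValueError("k out of range")
    candidates = list(values)
    for b in reversed(range(bits)):          # examine the high bit first
        flags = [(v >> b) & 1 for v in candidates]
        ones = sum(flags)                    # what the prefix-sum's last entry reports
        zeros = len(candidates) - ones
        if k <= zeros:
            # Target is among the values whose current bit is 0.
            candidates = [v for v, f in zip(candidates, flags) if not f]
        else:
            # Skip the whole zero half and look in the half with the bit set.
            k -= zeros
            candidates = [v for v, f in zip(candidates, flags) if f]
    return candidates[0]


print(select_kth([5, 3, 9, 1, 7], k=3, bits=4))  # 5 (the median)
```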


FA/FSM Evaluation

(regular expression recognition)


Finite Automata

• Machine has finite state: S
• On each cycle
  – Input I
  – Compute output and new state
    • Based on inputs and current state
• Oi,S(i+1)=f(Si,Ii)
• Intuitively, a sequential process
  – Must know previous state to compute next
  – Must know state to compute output


Function Specialization

• But, this is just functions
  – …and function composition is associative
• Given that we know input sequence:
  – I0,I1,I2…
• Can compute specialized functions:
  – fi(s)=f(s,Ii)
• What is fi(s)?
  – Worst-case, a translation table:
    • S=0 → NS0, S=1 → NS1 ….


Function Composition

• Now: O(i+m),S(i+m+1)=f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si))))
• Can we compute the function composition?
  – f(i+1,i)(s)=f(i+1)(fi(s))
  – What is f(i+1,i)(s)?
    • A translation table just like fi(s) and f(i+1)(s)
    • Table of size |S|, can fill in in O(|S|) time
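A small sketch of translation tables and their composition. The three-state machine used here (a recognizer for the substring "ab", with state 2 as a sticky accept state) is a made-up example, as are the helper names.

```python
def next_state(state, ch):
    """Hypothetical FSM: 0 = nothing seen, 1 = just saw 'a', 2 = saw "ab" (sticky)."""
    if state == 2:
        return 2
    if ch == "a":
        return 1
    return 2 if (state == 1 and ch == "b") else 0


NUM_STATES = 3


def specialize(ch):
    """Translation table for f_i(s) = f(s, I_i): entry s holds the next state."""
    return [next_state(s, ch) for s in range(NUM_STATES)]


def compose(first, second):
    """Table for second(first(s)); one lookup per state, so O(|S|) work."""
    return [second[s] for s in first]


f_a, f_b = specialize("a"), specialize("b")
print(f_a)                # [1, 1, 2]
print(f_b)                # [0, 2, 2]
print(compose(f_a, f_b))  # [2, 2, 2]: "ab" reaches the accept state from any state
```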


Recursive Function Composition

• Now: O(i+m),S(i+m+1)=f(i+m)(f(i+m-1)(f(i+m-2)(…fi(Si))))
• We can compute the composition
  – f(i+1,i)(s)=f(i+1)(fi(s))
• Repeat to compute
  – f(i+3,i)(s)=f(i+3,i+2)(f(i+1,i)(s))
  – Etc. until have computed: f(i+m,i)(s) in O(log(m)) steps
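The repeated pairing can be sketched as a reduce tree over per-symbol translation tables. The FSM below (a recognizer for the substring "ab") and all helper names are assumptions for the example; the level count returned is the number of pairing stages, each of which could run fully in parallel.

```python
def next_state(state, ch):
    """Hypothetical 3-state recognizer for "ab"; state 2 accepts and is sticky."""
    if state == 2:
        return 2
    if ch == "a":
        return 1
    return 2 if (state == 1 and ch == "b") else 0


NUM_STATES = 3


def table_for(ch):
    """Specialized function f_i as a translation table."""
    return tuple(next_state(s, ch) for s in range(NUM_STATES))


def compose(first, second):
    """Table for applying `first`, then `second`."""
    return tuple(second[s] for s in first)


def evaluate_by_tree(text, start=0):
    """Final state after a non-empty `text`, plus the number of tree levels used."""
    level, depth = [table_for(ch) for ch in text], 0
    while len(level) > 1:
        paired = [compose(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd table passes up unchanged
        level, depth = paired, depth + 1
    return level[0][start], depth


def evaluate_sequentially(text, start=0):
    state = start
    for ch in text:
        state = next_state(state, ch)
    return state


print(evaluate_by_tree("xxaabxxx"))       # (2, 3): accepted, 3 levels for 8 inputs
print(evaluate_sequentially("xxaabxxx"))  # 2
```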


Implications

• If can get input stream,
  – Any FA can be evaluated in O(log(N)) time
  – Regular Expression recognition in O(log(N))
• Any streaming operator with finite state
  – Where the input stream is independent of the output stream
  – Can be run arbitrarily fast by using parallel-prefix on FSM evaluation


Saturated Addition

• S(i+1)=max(min(Ii+Si,maxval),minval)
• Could model as FSM with:
  – |S|=maxval-minval
• So, in theory, FSM result applies
• …but |S| might be 2^16, 2^24


SATADD Composition

• Can compute composition efficiently

[Papadantonakis et al. FPT2005]
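One way to see why a compact composition is possible (a sketch, not necessarily the formulation used in the cited paper): every saturating step has the form s → clamp(s + add, lo, hi), and composing two functions of that form gives another function of the same form. So a three-number description replaces a table of size |S|. The triple encoding and helper names below are assumptions for illustration.

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))


def satadd(increment, minval, maxval):
    """Represent s -> clamp(s + increment, minval, maxval) as a triple."""
    return (increment, minval, maxval)


def apply(func, s):
    add, lo, hi = func
    return clamp(s + add, lo, hi)


def compose(first, second):
    """Triple for second(first(s)); constant work regardless of |S|."""
    add1, lo1, hi1 = first
    add2, lo2, hi2 = second
    return (
        add1 + add2,
        clamp(lo1 + add2, lo2, hi2),  # where first's lower bound lands after second
        clamp(hi1 + add2, lo2, hi2),  # where first's upper bound lands after second
    )


up7, down5 = satadd(7, 0, 10), satadd(-5, 0, 10)
both = compose(up7, down5)
print(both)            # (2, 0, 5)
print(apply(both, 8))  # 5, same as saturating 8+7 to 10 and then subtracting 5
```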


SATADD Composition


SATADD Reduce Tree


Data Forwarding

UltraScalar, from Henry, Kuszmaul, et al., ARVLSI’99, SPAA’99, ISCA’00


Consider Machine

• Each FU has a full RF
  – FU=Functional Unit
  – RF=Register File
• Build network between FUs
  – use network to connect produce/consume
  – use register names to configure interconnect
• Signal data ready along network


Ultrascalar: concept model


Ultrascalar Concept

• Linear delay
• O(1) register cost / FU
• Complete renaming at each FU
  – different set of registers
  – so when say complete RF at each FU, that’s only the logical registers


Ultrascalar: cyclic prefix


Parallel Prefix

• Basic idea is one we saw with adders
• An FU will either
  – produce a register (generate)
  – or transmit a register (propagate)
  – can do tree combining
    • pair of FUs will either both propagate or will generate
    • compute function by pair in one stage
    • recurse to next stage
    • get log-depth tree network connecting producer and consumer
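For a single register, the generate/propagate operator can be sketched as below: `None` stands for an FU that propagates, any other value for an FU that produces the register. This is a simplified model of the idea, not the Ultrascalar circuit; `accumulate` (with its `initial` argument, Python 3.8+) is sequential and only shows the values a log-depth tree would deliver, which the associativity check justifies.

```python
from itertools import accumulate, product


def combine(earlier, later):
    """Register value visible after two adjacent groups of FUs.

    None means both groups propagate; otherwise the latest producer wins.
    """
    return later if later is not None else earlier


def incoming_values(writes, initial):
    """Value of one register as seen at the input of each FU, in program order."""
    seen = list(accumulate(writes, combine, initial=initial))
    return seen[:-1]  # the last entry is the value leaving the final FU


# FU1 writes 5 and FU4 writes 9; the register starts at 1.
print(incoming_values([None, 5, None, None, 9, None], initial=1))
# [1, 1, 5, 5, 5, 9]

# The operator is associative, which is what permits tree combining.
assert all(
    combine(combine(x, y), z) == combine(x, combine(y, z))
    for x, y, z in product([None, 1, 2], repeat=3)
)
```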


Ultrascalar: cyclic prefix


Pointer Jumping


Pointer Jumping Motivation

• Have a tree
  – E.g. is-a relationship tree in NETL
• Want to know if a node is of a particular type (is-a mammal)
• How long to find out?
  – Naïve: O(distance)
    • Spread one level per timestep


Following Pointer Chain

• Naïve: spread/color from target node
  – On each step push down to children
• Most nodes idle
  – Only active on the step something arrives
• Can the idle nodes do something to accelerate?


Jumping Intermediates

• Add notion of transitive parent
• Initially: transitive-parent=parent
• On each step:
  – If my transitive-parent marked
    • Mark self
  – else
    • Transitive-parent = transitive-parent(transitive-parent)


How Much Jumping?

• On each step:
  – If my transitive-parent marked
    • Mark self
  – else
    • Transitive-parent = transitive-parent(transitive-parent)
• How many such steps?
  – O(log(distance))
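The step rule can be simulated directly. The sketch below assumes the tree is given as a `parent` dictionary in which the root maps to itself, and applies the rule to every node synchronously (all nodes read the old state, then all update); the function name and representation are illustrative.

```python
def pointer_jump_mark(parent, target):
    """Mark every node at or below `target`; returns ({node: marked}, steps)."""
    tparent = dict(parent)                       # initially transitive-parent = parent
    marked = {node: node == target for node in parent}
    steps = 0
    while True:
        new_marked, new_tparent = dict(marked), dict(tparent)
        for node in parent:
            if marked[node]:
                continue
            if marked[tparent[node]]:
                new_marked[node] = True          # my transitive-parent is marked
            else:
                new_tparent[node] = tparent[tparent[node]]  # jump
        if new_marked == marked and new_tparent == tparent:
            return marked, steps                 # nothing changed: done
        marked, tparent, steps = new_marked, new_tparent, steps + 1


# A chain 0 <- 1 <- ... <- 8 with the target at the root: distance 8.
chain = {0: 0, **{k: k - 1 for k in range(1, 9)}}
marks, steps = pointer_jump_mark(chain, target=0)
print(all(marks.values()), steps)  # True 4
```

On this chain the naïve spread would take 8 steps to reach the farthest node.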


Pointer Jumping

• Same basic idea as data forwarding

• Can find length of a list in O(log(length)) time
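A sketch of list-length finding by pointer jumping, assuming the list is stored as an array of successor indices with `None` at the tail. Every node doubles the span of its pointer each round while accumulating the number of links it has skipped; the name `list_ranks` is illustrative.

```python
def list_ranks(successor):
    """Return (rank, rounds); rank[i] is the number of links from node i to the tail."""
    nxt = list(successor)
    rank = [0 if s is None else 1 for s in nxt]
    rounds = 0
    while any(s is not None for s in nxt):
        # All nodes read the old rank/nxt arrays, then update together.
        new_rank = [r if s is None else r + rank[s] for r, s in zip(rank, nxt)]
        nxt = [None if s is None else nxt[s] for s in nxt]
        rank = new_rank
        rounds += 1
    return rank, rounds


# Five-node list stored out of order: 0 -> 3 -> 2 -> 4 -> 1 (tail)
rank, rounds = list_ranks([3, None, 4, 2, 1])
print(rank[0] + 1, rounds)  # 5 3  (length 5 found in 3 rounds)
```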


Variations


Segmented Parallel Prefix

• fi() can ignore its input

– …or the function can let special I’s tell it to reset the state

• E.g. build huge/hardwired carry chain hardware and configurably break into separate adders (LUT cascades)
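Segmentation can be folded into the operator itself. The sketch below lifts any associative `op` to (starts-segment flag, value) pairs so that an element flagged as a segment start ignores everything to its left; the lifted operator is still associative, so the same prefix hardware applies. The helper names are assumptions, and `accumulate` stands in for the parallel network.

```python
from itertools import accumulate
from operator import add


def segmented(op):
    """Lift an associative op to (starts_segment, value) pairs."""
    def combine(left, right):
        left_flag, left_val = left
        right_flag, right_val = right
        if right_flag:                       # right begins a new segment: ignore left
            return (True, right_val)
        return (left_flag, op(left_val, right_val))
    return combine


def segmented_scan(flags, values, op):
    """Inclusive scan that restarts wherever flags[i] is true."""
    pairs = list(zip(map(bool, flags), values))
    return [value for _, value in accumulate(pairs, segmented(op))]


print(segmented_scan([1, 0, 0, 1, 0, 1], [1, 2, 3, 4, 5, 6], add))
# [1, 3, 6, 4, 9, 6]
```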


Cyclic Segmented Parallel Prefix

• Wrap output back to input
• Configurable segmentation defines the starting/stopping point
• E.g.
  – In Ultrascalar data forwarding
    • Leave data in place and use FUs in FIFO fashion, redefining the “head” at each cycle
  – Priority allocation scheme
    • Mark priority item as start of segment
      – Perhaps choose randomly (e.g. hardware router)


Admin

• Class Wed.

• Baseline due Friday


Big Ideas

• Any associative operation can be made parallel
  – Performed in log(N) time with O(N) hardware
• Any Finite Automaton computation can be accelerated with parallelism
  – (FA evaluation is in NC)
• Function composition is associative, so all functional operations can be made associative