
Modeling and Solving Code Generation for Real

Christian Schulte
KTH Royal Institute of Technology & SICS (Swedish Institute of Computer Science)

joint work with:
• Mats Carlsson, SICS
• Roberto Castañeda Lozano, SICS + KTH
• Frej Drejhammar, SICS
• Gabriel Hjort Blindell, KTH + SICS

funded by:
• Ericsson AB
• Swedish Research Council (VR 621-2011-6229)

Who Am I?

• Professor of Computer Science at KTH Royal Institute of Technology, Stockholm, Sweden

• Expert Researcher at SICS (Swedish Institute of Computer Science), Stockholm, Sweden

• Education
  • diploma in computer science, Karlsruhe, Germany, 1992
  • doctoral degree in engineering, Saarbrücken, Germany, 2001
  • docent in computer systems, KTH, Sweden, 2009

• Research interests
  • constraint programming
  • programming languages
  • systems-based research (Gecode, for example)

Sep 10, 2015 · Code Generation for Real, Schulte

Compilation

• Front-end: depends on source programming language
  • changes infrequently (well…)
• Optimizer: independent optimizations
  • changes infrequently (well…)
• Back-end: depends on processor architecture
  • changes often: new process, new architectures, new features, …

[Figure: source program → front-end → optimizer → back-end (code generator) → assembly program]

Generating Code: Unison

• Infrequent changes: front-end & optimizer
  • reuse state-of-the-art: LLVM, for example
• Frequent changes: back-end
  • use flexible approach: Unison

[Figure: source program → front-end → optimizer → back-end (code generator) → assembly program, with Unison taking the role of the back-end]

State of the Art

• Code generation organized into stages
  • instruction selection, register allocation, instruction scheduling

Instruction selection:
  x = y + z;   →   add r0 r1 r2
                   mv $a6f0 r0

Register allocation:
  x = y + z;   →   x: register r0
                   y: memory (spill to stack)
                   …

Instruction scheduling:
  x = y + z;        u = v – w;
  …            →    …
  u = v – w;        x = y + z;

State of the Art

• Code generation organized into stages
  • stages are interdependent: no optimal order possible

• Example: instruction scheduling before register allocation
  • increased delay between instructions can increase throughput
    → registers used over longer time-spans
    → more registers needed

• Example: register allocation before instruction scheduling
  • put variables into fewer registers
    → more dependencies among instructions
    → less opportunity for reordering instructions

• Stages use heuristic algorithms
  • for hard combinatorial problems (NP-hard)
  • assumption: optimal solutions not possible anyway
  • difficult to take advantage of processor features
  • error-prone when adapting to change

• Staging and heuristics preclude optimal code and make development complex

[Figure: the stages instruction selection, register allocation, and instruction scheduling, shown in different orders]

Rethinking: Unison Idea

• No more staging and complex heuristic algorithms!
  • many assumptions are decades old...
• Use state-of-the-art technology for solving combinatorial optimization problems: constraint programming
  • tremendous progress in last two decades...
• Generate and solve single model
  • captures all code generation tasks in unison
  • high level of abstraction: based on processor description
  • flexible: ideally, just change processor description
  • potentially optimal: tradeoff between decisions accurately reflected

Unison Approach

• Generate constraint model
  • based on input program and processor description
  • constraints for all code generation tasks
  • generate but not solve: simpler and more expressive
• Off-the-shelf constraint solver solves constraint model
  • solution is assembly program
  • optimization takes inter-dependencies into account

[Figure: input program and processor description yield constraints for instruction selection, instruction scheduling, and register allocation; together they form the constraint model, which an off-the-shelf constraint solver turns into the assembly program]

Talk Overview

• Constraint programming in a nutshell
• Register Allocation & Instruction Scheduling
  • Basic Register Allocation
  • Instruction Scheduling
  • Advanced Register Allocation
  • Global Register Allocation
  • Discussion
• Instruction Selection [if time allows]
  • Graph-based Instruction Selection
  • Universal Instruction Selection
  • Discussion
• Summary

Source Material

• Register Allocation & Instruction Scheduling
  • Constraint-based Register Allocation and Instruction Scheduling. Roberto Castañeda Lozano, Mats Carlsson, Frej Drejhammar, Christian Schulte. CP 2012.
  • Combinatorial Spill Code Optimization and Ultimate Coalescing. Roberto Castañeda Lozano, Mats Carlsson, Gabriel Hjort Blindell, Christian Schulte. LCTES 2014.
• Instruction Selection
  • Modeling Universal Instruction Selection. Gabriel Hjort Blindell, Roberto Castañeda Lozano, Mats Carlsson, Christian Schulte. CP 2015.
• Surveys
  • Survey on Combinatorial Register Allocation and Instruction Scheduling. Roberto Castañeda Lozano, Christian Schulte. CoRR entry, 2014. Revised version to be submitted.
  • Instruction Selection: Principles, Techniques and Applications. Gabriel Hjort Blindell. Springer, 2015. To appear.
• Additional Material
  • Integrated Register Allocation and Instruction Scheduling with Constraint Programming. Roberto Castañeda Lozano. Licentiate thesis, KTH Royal Institute of Technology, Sweden, 2014.
  • Constraint-based Code Generation. Roberto Castañeda Lozano, Gabriel Hjort Blindell, Mats Carlsson, Frej Drejhammar, Christian Schulte. M-SCOPES 2013.

CONSTRAINT PROGRAMMING IN A NUTSHELL

Constraint Programming

• Model and solve combinatorial (optimization) problems

• Modeling
  • variables
  • constraints
  • branching heuristics
  • (cost function)
• Solving
  • constraint propagation
  • heuristic search
• Of course simplified…
  …array of modeling and solving techniques

Problem: Send More Money

• Find distinct digits for letters such that

      SEND
    + MORE
    = MONEY

Constraint Model

• Variables:
    S, E, N, D, M, O, R, Y ∈ {0, …, 9}
• Constraints:
    distinct(S, E, N, D, M, O, R, Y)
        1000×S + 100×E + 10×N + D
      + 1000×M + 100×O + 10×R + E
      = 10000×M + 1000×O + 100×N + 10×E + Y
    S ≠ 0, M ≠ 0
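The model above can be checked with a few lines of Python. This is only a sketch: it enumerates all assignments of distinct digits by brute force instead of using constraint propagation, so it illustrates the model, not how a constraint solver works. The function name `solve_send_more_money` is made up for this example.

```python
from itertools import permutations

def solve_send_more_money():
    """Try every assignment of distinct digits and keep those satisfying the sum."""
    solutions = []
    # permutations() yields tuples of pairwise distinct digits: this is distinct(...)
    for s, e, n, d, m, o, r, y in permutations(range(10), 8):
        if s == 0 or m == 0:          # S != 0, M != 0
            continue
        send = 1000 * s + 100 * e + 10 * n + d
        more = 1000 * m + 100 * o + 10 * r + e
        money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if send + more == money:
            solutions.append((s, e, n, d, m, o, r, y))
    return solutions

solutions = solve_send_more_money()
print(solutions)  # one tuple (S, E, N, D, M, O, R, Y)
```

The enumeration visits about 1.8 million candidate assignments; the following slides show how propagation and search avoid most of that work.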

Constraints

• State relations between variables
  • legal combinations of values for variables
• Examples
  • all variables pairwise distinct: distinct(x1, …, xn)
  • arithmetic constraints: x + 2×y = z
  • domain-specific: cumulative(t1, …, tn), nooverlap(r1, …, rn)
• Success story: global constraints
  • modeling: capture recurring problem structures
  • solving: enable strong reasoning (constraint-specific methods)

Solving: Variables and Values

• Record possible values for variables
  • solution: single value left
  • failure: no values left

    x ∈ {1,2,3,4}   y ∈ {1,2,3,4}   z ∈ {1,2,3,4}

Constraint Propagation

• Prune values that are in conflict with constraint
  • propagation is often smart if not perfect!

Constraints: distinct(x, y, z), x + y = 3

    initial domains:              x ∈ {1,2,3,4}   y ∈ {1,2,3,4}   z ∈ {1,2,3,4}
    after propagating x + y = 3:  x ∈ {1,2}       y ∈ {1,2}       z ∈ {1,2,3,4}
    after propagating distinct:   x ∈ {1,2}       y ∈ {1,2}       z ∈ {3,4}
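The pruning steps on these slides can be reproduced with two small propagators that are run until nothing changes. This is a minimal sketch, not how Gecode or any other solver implements propagation: the function names are invented here, the sum propagator simply looks for a supporting partner value, and the distinct propagator uses naive enumeration of Hall sets, which is only practical for a handful of variables.

```python
from itertools import combinations

def propagate_sum(dom, x, y, total):
    """x + y = total: keep only values that have a partner in the other domain."""
    new_x = {v for v in dom[x] if total - v in dom[y]}
    new_y = {v for v in dom[y] if total - v in dom[x]}
    changed = new_x != dom[x] or new_y != dom[y]
    dom[x], dom[y] = new_x, new_y
    return changed

def propagate_distinct(dom, variables):
    """distinct: if k variables can only take k values, remove those values elsewhere."""
    changed = False
    for k in range(1, len(variables)):
        for group in combinations(variables, k):
            values = set().union(*(dom[v] for v in group))
            if len(values) == k:
                for other in variables:
                    if other not in group and dom[other] & values:
                        dom[other] = dom[other] - values
                        changed = True
    return changed

def propagate(dom):
    """Run both propagators until no domain changes any more (fixpoint)."""
    while True:
        changed_sum = propagate_sum(dom, "x", "y", 3)
        changed_distinct = propagate_distinct(dom, ["x", "y", "z"])
        if not (changed_sum or changed_distinct):
            return dom

domains = propagate({v: {1, 2, 3, 4} for v in "xyz"})
print(domains)
```

No variable is fixed after propagation, which is why search is needed next.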

Heuristic Search

• Propagation alone not sufficient
  • decompose into simpler sub-problems
  • search needed
• Create subproblems with additional constraints
  • enables further propagation
  • defines search tree
  • uses problem-specific heuristic

[Figure: search tree for distinct(x, y, z), x + y = 3]

    root:          x ∈ {1,2}   y ∈ {1,2}   z ∈ {3,4}
    branch x = 1:  x ∈ {1}     y ∈ {2}     z ∈ {3,4}
    branch x ≠ 1:  x ∈ {2}     y ∈ {1}     z ∈ {3,4}
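The search tree above can be sketched as a recursive propagate-and-branch procedure. The code below is an illustration under simplifying assumptions: the two constraints of the running example are hard-wired into `propagate`, the branching heuristic (smallest domain first, smallest value first) is just one possible choice, and all function names are hypothetical.

```python
from itertools import combinations

def propagate(dom):
    """Prune x + y = 3 and distinct(x, y, z) to a fixpoint; None signals failure."""
    dom = {v: set(d) for v, d in dom.items()}
    changed = True
    while changed:
        changed = False
        # x + y = 3: keep values with a supporting partner
        nx = {v for v in dom["x"] if 3 - v in dom["y"]}
        ny = {v for v in dom["y"] if 3 - v in dom["x"]}
        if nx != dom["x"] or ny != dom["y"]:
            dom["x"], dom["y"], changed = nx, ny, True
        # distinct(x, y, z): naive Hall-set pruning
        for k in (1, 2):
            for group in combinations("xyz", k):
                values = set().union(*(dom[v] for v in group))
                if len(values) == k:
                    for other in "xyz":
                        if other not in group and dom[other] & values:
                            dom[other] = dom[other] - values
                            changed = True
        if any(not d for d in dom.values()):
            return None                      # failure: no values left
    return dom

def search(dom):
    """Depth-first search: propagate, then branch on var = val and var != val."""
    dom = propagate(dom)
    if dom is None:
        return []
    if all(len(d) == 1 for d in dom.values()):
        return [{v: next(iter(d)) for v, d in dom.items()}]   # solution
    var = min((v for v in dom if len(dom[v]) > 1), key=lambda v: (len(dom[v]), v))
    val = min(dom[var])
    left = {**dom, var: {val}}               # subproblem with var = val
    right = {**dom, var: dom[var] - {val}}   # subproblem with var != val
    return search(left) + search(right)

solutions = search({v: {1, 2, 3, 4} for v in "xyz"})
print(solutions)
```

The first branching decision is x = 1 versus x ≠ 1, as on the slide; each subproblem then still needs one more decision on z.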

What Makes It Work?

• Essential: avoid search...

...as it always suffers from combinatorial explosion

• Constraint propagation drastically reduces search space

• Efficient and powerful methods for propagation available

• When using search, use a clever heuristic

• Array of modeling techniques available that reduce search

• Hybrid methods (together with LP, SAT, stochastic, ...)


Register Allocation & Instruction Scheduling

Unit and Scope

• Function is unit of compilation
  • generate code for one function at a time
• Scope
  • local: generate code for each basic block in isolation
  • global: generate code for whole function
• Basic block: instructions that are always executed together
  • enter execution at start
  • execute all instructions
  • leave execution at end
  • that is: no control flow within basic block (in or out)

BASIC REGISTER ALLOCATION

Local (and slightly naïve) register allocation


Local Register Allocation

• Instruction selection has already been performed

• Temporaries
  • defined or def-occurrence (lhs): t3 in t3 ← sub t1, 2
  • used or use-occurrence (rhs): t1 in t3 ← sub t1, 2
• Basic blocks are in SSA (static single assignment) form
  • each temporary is defined once
  • standard state-of-the-art approach

Example block:

    t2 ← mul t1, 2
    t3 ← sub t1, 2
    t4 ← add t2, t3
    ...
    t5 ← mul t1, t4
    jr t5

Liveness & Interference

• Temporary is live from def to last use, defining its live range
  • live ranges are linear (basic block + SSA)
• Temporaries interfere if their live ranges overlap
• Non-interfering temporaries can be assigned to same register

[Figure: live ranges of t1, t2, t3, t4, t5 drawn next to the example block
 t2 ← mul t1, 2; t3 ← sub t1, 2; t4 ← add t2, t3; ...; t5 ← mul t1, t4; jr t5]
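For the example block, live ranges and interference can be computed directly from the definitions above. The sketch below makes several assumptions that are not in the slides: the elided instructions ("...") are ignored, t1 is treated as live on entry (position 0), instructions are numbered 1 to 5 in program order, and a live range is the half-open interval from definition to last use, so a register can be reused by the instruction that performs the last use.

```python
from itertools import combinations

# (defined temporary or None, used temporaries), in program order
block = [
    ("t2", ["t1"]),         # t2 <- mul t1, 2
    ("t3", ["t1"]),         # t3 <- sub t1, 2
    ("t4", ["t2", "t3"]),   # t4 <- add t2, t3
    ("t5", ["t1", "t4"]),   # t5 <- mul t1, t4
    (None, ["t5"]),         # jr t5
]

def live_ranges(block, live_in=("t1",)):
    """Return {temporary: (start, end)} with start = def and end = last use."""
    start = {t: 0 for t in live_in}     # assumption: live-in temporaries start at 0
    end = dict(start)
    for position, (defined, used) in enumerate(block, start=1):
        for t in used:
            end[t] = position           # later uses overwrite earlier ones
        if defined is not None:
            start[defined] = position
            end[defined] = position
    return {t: (start[t], end[t]) for t in start}

def interferes(a, b):
    """Two half-open live ranges interfere if they overlap."""
    return a[0] < b[1] and b[0] < a[1]

ranges = live_ranges(block)
interfering = {(s, t) for s, t in combinations(sorted(ranges), 2)
               if interferes(ranges[s], ranges[t])}
print(ranges)
print(sorted(interfering))
```

Under these assumptions t1, t2, and t3 are pairwise interfering, so the block needs at least three registers unless something is spilled.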

Spilling

• If not enough registers available: spill

• Spilling moves temporary to memory (stack)
  • store in memory after defined
  • load from memory before used
  • memory access typically considerably more expensive
  • decision on spilling crucial for performance

• Architectures might have more than one register bank
  • some instructions only capable of addressing a particular bank
  • “spilling” from one register bank to another

• Unified register array
  • limited number of registers for each register file
  • memory is just another “register” file
  • unlimited number of memory “registers”

Coalescing

• Temporaries d (“destination”) and s (“source”) are move-related if
    d ← s
  • d and s should be coalesced (assigned to same register)
  • coalescing saves move instructions and registers
• Coalescing is important due to
  • how registers are managed (calling convention)
  • how our model deals with global register allocation (more later)

Copy Operations

• Copy operations replicate a temporary t to a temporary t’
    t’ ← {i1, i2, …, in} t
  • copy is implemented by one of the alternative instructions i1, i2, …, in
  • instruction depends on where t and t’ are stored
  • similar to [Appel & George, 2001]

• Example MIPS32
    t’ ← {move, sw, nop} t
  • t’ memory and t register: sw (spill)
  • t’ register and t register: move (move-related)
  • t’ and t same register: nop (coalescing)

• MIPS32: instructions can only be performed on registers

Model Variables

• Decision variables
  • reg(t) ∈ ℕ: register to which temporary t is assigned
  • instr(o) ∈ ℕ: instruction that implements operation o
  • cycle(o) ∈ ℕ: issue cycle for operation o
  • active(o) ∈ {0,1}: whether operation o is active

• Derived variables
  • start(t): start of live range of temporary t
      = cycle(o) where o defines t
  • end(t): end of live range of temporary t
      = max { cycle(o) | o uses t }

Sanity Constraints

• Copy operation o is active ⇔ no coalescing
    active(o) = 1 ⇔ reg(s) ≠ reg(d)
  • s is source of move, d is destination of move operation o

• Operations implemented by suitable instructions
  • single possible instruction for non-copy operations

• Miscellaneous
  • some registers are pre-assigned
  • some instructions can only address certain registers (or memory)

Geometrical Interpretation

• Temporary t is rectangle
  • width is 1 (occupies one register)
  • top = start(t): issue cycle of def
  • bottom = end(t): last issue cycle of any use
• Consequence of linear live range (basic block + SSA)

[Figure: temporary t drawn as a rectangle; horizontal axis: unified register array (registers r0, r1, …, rn-1 followed by memory registers m0, m1, …); vertical axis: clock cycle]

Register Assignment

• Register assignment = geometric packing problem
  • find horizontal coordinates for all temporaries
  • such that no two rectangles for temporaries overlap

• For block B
    nooverlap({⟨reg(t), reg(t)+1, start(t), end(t)⟩ | t ∈ B})

[Figure: rectangles for temporaries placed in the unified register array (registers r0, r1, …, rn-1 followed by memory registers m0, m1, …) over clock cycles]
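The packing view can be made concrete with a small backtracking search that assigns a column (register) to each rectangle. This is a sketch only: it stands in for the nooverlap global constraint by checking rectangle pairs explicitly, the live ranges are the ones assumed for the earlier example block (half-open intervals, t1 live on entry), and `overlap` and `assign` are hypothetical helper names.

```python
def overlap(r1, r2):
    """Rectangles (left, right, top, bottom), half-open in both dimensions."""
    return r1[0] < r2[1] and r2[0] < r1[1] and r1[2] < r2[3] and r2[2] < r1[3]

def assign(ranges, n_regs, reg=None):
    """Find reg(t) in 0..n_regs-1 such that no two rectangles overlap, or None."""
    reg = {} if reg is None else reg
    if len(reg) == len(ranges):
        return dict(reg)
    t = sorted(ranges)[len(reg)]                 # next unassigned temporary
    for r in range(n_regs):
        rect = (r, r + 1, *ranges[t])            # width 1: occupies one register
        if all(not overlap(rect, (reg[u], reg[u] + 1, *ranges[u])) for u in reg):
            reg[t] = r
            result = assign(ranges, n_regs, reg)
            if result is not None:
                return result
            del reg[t]                           # undo and try the next register
    return None

# assumed live ranges (start, end) for the example block
ranges = {"t1": (0, 4), "t2": (1, 3), "t3": (2, 3), "t4": (3, 4), "t5": (4, 5)}

print(assign(ranges, 2))   # two registers are not enough
print(assign(ranges, 3))
```

With three registers, t4 reuses the register of t2 and t5 reuses the register of t1, because their rectangles touch but do not overlap.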

Register Packing

• Temporaries might have different width width(t)
  • many processors support access to register parts
  • still modeled as geometrical packing problem [Pereira & Palsberg, 2008]

• Example: Intel x86
  • assign two 8 bit temporaries (width = 1) to 16 bit register (width = 2)
  • register parts: AH, AL, BH, BL, CH, CL
  • possible for 8 bit: AH, AL, BH, BL, CH, CL
  • possible for 16 bit: AH, BH, CH

[Figure: temporaries t1 to t4 packed into the register parts AH, AL, BH, BL, CH, CL (forming AX, BX, CX) over clock cycles]

    temporary   start   end   width
    t1          0       1     1
    t2          0       2     2
    t3          0       1     1
    t4          1       2     2

Modeling Register Packing

• Take width of temporaries into account (for block B)
    nooverlap({⟨reg(t), reg(t)+width(t), start(t), end(t)⟩ | t ∈ B})

• Exclude sub-registers depending on width(t)
  • simple domain constraint on reg(t)
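The x86 example can be sketched by extending rectangle packing with widths and a domain restriction on the start column. The values below are assumptions made for illustration: the temporaries use start/end/width values as read from the slide figure (with t2 taken to have width 2), columns 0 to 5 stand for AH, AL, BH, BL, CH, CL, and 16 bit temporaries may only start at AH, BH, or CH. `candidates` and `pack` are hypothetical names.

```python
PARTS = ["AH", "AL", "BH", "BL", "CH", "CL"]     # columns 0..5

def candidates(width):
    """Domain constraint on reg(t): wide temporaries must be aligned (AH, BH, CH)."""
    return [c for c in range(len(PARTS) - width + 1) if c % width == 0]

def pack(temps, placed=None):
    """temps: {t: (start, end, width)}; returns {t: first register part} or None."""
    placed = {} if placed is None else placed
    if len(placed) == len(temps):
        return {t: PARTS[c] for t, c in placed.items()}
    t = sorted(temps)[len(placed)]
    start, end, width = temps[t]
    for c in candidates(width):
        fits = True
        for u, cu in placed.items():
            su, eu, wu = temps[u]
            # rectangles overlap if they intersect in columns and in cycles
            if c < cu + wu and cu < c + width and start < eu and su < end:
                fits = False
                break
        if fits:
            placed[t] = c
            result = pack(temps, placed)
            if result is not None:
                return result
            del placed[t]
    return None

temps = {"t1": (0, 1, 1), "t2": (0, 2, 2), "t3": (0, 1, 1), "t4": (1, 2, 2)}
packing = pack(temps)
print(packing)
```

In the packing found here, t1 and t3 share AX (as AH and AL) during the first cycle, and t4 then takes all of AX, which mirrors the "two 8 bit temporaries in one 16 bit register" case from the slide.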

INSTRUCTION SCHEDULING

Local instruction scheduling (standard)


Dependencies

• Data and control dependencies
  • data, control, artificial (for making in and out first/last)

• If operation o2 depends on o1:
    active(o1) ∧ active(o2) ⇒ cycle(o2) ≥ cycle(o1) + latency(instr(o1))

[Figure: dependency graph for the block t3 ← li; t4 ← slti t2; bne t4, with nodes in, li, slti, bne, out; edges are labeled with latencies (1 or 2) and the temporaries they carry (t1 to t4)]
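The precedence constraint determines a lower bound on every issue cycle: the longest latency-weighted path from the entry. The sketch below computes these bounds for a graph with the same nodes as the figure. The edges and latencies are hypothetical, since the exact labels cannot be recovered from the transcript, and all operations are assumed to be active.

```python
def earliest_cycles(ops, deps):
    """ops: operations in topological order; deps: (o1, o2, latency) triples.

    Returns the smallest cycles satisfying cycle(o2) >= cycle(o1) + latency.
    """
    cycle = {o: 0 for o in ops}
    for o2 in ops:
        for o1, successor, latency in deps:
            if successor == o2:
                cycle[o2] = max(cycle[o2], cycle[o1] + latency)
    return cycle

ops = ["in", "li", "slti", "bne", "out"]
deps = [                       # hypothetical edges and latencies
    ("in", "li", 1),
    ("in", "slti", 1),
    ("slti", "bne", 1),
    ("li", "out", 1),
    ("bne", "out", 2),
]

cycles = earliest_cycles(ops, deps)
print(cycles)
```

A solver treats these values only as bounds: resource limits and register pressure may force later issue cycles.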

Processor Resources

• Processor resources: functional units, data buses, ...
  • also: instruction bundle width for VLIW processors (how many instructions can be issued simultaneously)

• Classical cumulative scheduling problem (example: functional units)
  • processor resource has capacity (#units)
  • instructions occupy parts of resource (1 unit)
  • resource consumption can never exceed capacity

• Modeling for block B
    cumulative({⟨cycle(o), dur(o,r), active(o)×use(o,r)⟩ | o ∈ B})
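What the cumulative constraint demands can be stated as a simple check over all cycles. The function below is a checker for a given schedule, not a propagator, and its name and the example numbers are made up; each task is a triple (cycle, duration, usage), where usage is already multiplied by active(o), so inactive operations consume nothing.

```python
def cumulative_ok(tasks, capacity):
    """True if the summed usage of running tasks never exceeds capacity."""
    horizon = max((cycle + dur for cycle, dur, _ in tasks), default=0)
    for now in range(horizon):
        load = sum(use for cycle, dur, use in tasks if cycle <= now < cycle + dur)
        if load > capacity:
            return False
    return True

# hypothetical resource: bundle width 2 (at most two instructions per cycle)
schedule = [(0, 1, 1), (0, 1, 1), (1, 1, 1)]
print(cumulative_ok(schedule, capacity=2))                  # fits
print(cumulative_ok(schedule + [(0, 1, 1)], capacity=2))    # three issues in cycle 0
print(cumulative_ok(schedule + [(0, 1, 0)], capacity=2))    # inactive copy: usage 0
```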

ADVANCED REGISTER ALLOCATION

Ultimate Coalescing & Spill Code Optimization

using alternative temporaries


Interference Too Naïve!

• Move-related temporaries might interfere…
  …but contain the same value!

• Ultimate notion of interference [Chaitin ea, 1981] =
    temporaries interfere ⇔ their live ranges overlap and they have different values

[Figure: t1 is defined, then t2 ← mv t1; both t1 and t2 are used afterwards, so the live ranges of t1 and t2 overlap and t1 and t2 interfere]

Spilling Too Naïve!

• Known as spill-everywhere model
  • reload from memory before every use of original temporary
• Example: t3 should be used rather than reloading t2
  • t2 allocated in memory!

[Figure: spilling t1]

    original:            after spilling t1:
    t1 ← ...             t1 ← ...
    ...                  t2 ← st t1
    ... ← t1 ...         ...
    ...                  t3 ← ld t2
    ... ← t1 ...         ... ← t3 ...
                         ...
                         t4 ← ld t2
                         ... ← t4 ...

Alternative Temporaries

• Used to track which temporaries are equal

• Representation is augmented by operands
  • act as def and use ports in operations
  • temporaries hold values transferred among operations by connecting to operands

• Example
  • operation t2 ← abs t1
  • transformed to p2:t2 ← abs p1:t1 (p1, p2 operands)
  • if t1 and t3 hold same value then transformed to
      p2:t2 ← abs p1:{t1,t3}
    where either t1 or t3 can be connected to p1

• Model: whether a temporary is live (it is being used)

GLOBAL REGISTER ALLOCATION

Register allocation for entire functions


Entire Functions

• Use control flow graph (CFG) and turn it into LSSA form
  • edges = control flow
  • nodes = basic blocks (no control flow)
  • LSSA = linear SSA = SSA for basic blocks plus… to be explained

    int fac(int n) {
      int f = 1;
      while (n > 0) {
        f = f * n; n--;
      }
      return f;
    }

[Figure: CFG of fac with three basic blocks]

    t3 ← li;  t4 ← slti t2;  bne t4
    t8 ← mul t7, t6;  t9 ← subiu t6;  bgtz t9
    jr t10

Linear SSA (LSSA)

• Linear live range of a temporary cannot span block boundaries

• Liveness across blocks defined by temporary congruence
    t ≡ t’ ⇔ t and t’ represent same original temporary

• Example: t3, t7, t8, t11 are congruent
  • correspond to the program variable f (factorial result)
  • not discussed: t1 return address, t2 first argument, t11 return value

• Advantage
  • simple modeling for linear live ranges (geometrical interpretation)
  • enables problem decomposition for solving

[Figure: CFG of fac in LSSA form; CFG edges are annotated with congruences]

    blocks:
      t3 ← li;  t4 ← slti t2;  bne t4
      t8 ← mul t7, t6;  t9 ← subiu t6;  bgtz t9
      jr t10

    congruences on the edges:
      t1 ≡ t5, t2 ≡ t6, t3 ≡ t7
      t1 ≡ t10, t3 ≡ t11
      t5 ≡ t10, t8 ≡ t11
      t6 ≡ t9, t7 ≡ t8

Global Register Allocation

• Try to coalesce congruent temporaries
  • this is why coalescing is (even more) crucial in this model

• Introduces natural problem decomposition
  • master problem (function): coalesce congruent temporaries
  • slave problems (basic blocks): register allocation & instruction scheduling

• What is happening
  • if register pressure is low…
      no copy instruction needed (nop) = coalescing
  • if register pressure is high…
      copy operation might be implemented by a move = no coalescing
      copy operation might be implemented by a load/store = spill

DISCUSSION


Solving

• Approach
  • use master-slave decomposition
  • use (very) naïve portfolio of heuristics for basic blocks
  • use some pre-solving (symmetry, no-goods, dominance)
  • not very advanced (future work)

• Benchmark setup
  • selection of medium-sized functions (25 to 1000 instructions)
  • comparison to LLVM 3.3 for Qualcomm’s Hexagon V4 using -O3
  • run for ten iterations where each iteration is given more time
  • using Gecode 4.2.1
  • full details in [Castañeda ea, LCTES 2014]

Experiments Summary

• Code quality (estimated)
  • 7% mean improvement over LLVM
  • provably optimal for 29% of functions

• Roughly quadratic average complexity up to 1000 instructions

• Can be easily changed to optimize for code size
  • 1% mean improvement over LLVM

Related Approaches

• Idea and motivation in Unison for combinatorial optimization is absolutely not new!
  • starting in the early 1990s [Castañeda & Schulte, CoRR 2014]

• Approaches differ
  • which code generation tasks covered
  • which technology used (ILP, CP, SAT, Genetic Algorithms, ...)

• Common to most approaches
  • compilation unit is basic block, or
  • just a single task covered, or
  • very poor scalability

• Challenge: integration, robustness, and scalability

Unique to Unison Approach

• First global approach for register allocation (function as compilation unit)

• Constraint programming using global constraints
  • sweet spot: cumulative and nooverlap

• Full register allocation with ultimate coalescing, packing, spilling, and spill code optimization
  • key property of model: spilling is internalized

• Robust at the expense of optimality
  • problem decomposition

• But: instruction selection not yet there!

Instruction Selection [Based on slides from Gabriel Hjort Blindell]

Graph-based Instruction Selection

• Represent program as graph: program graph

• Represent instructions as graph: instruction graph

• Select matches such that program graph is covered

    int f(int a) {
      int b = a * 2;
      int c = a * 4;
      return b + c;
    }

[Figure: program graph for f (two multiplications of a by a constant feed an addition whose result is returned) and the instruction graph for mac (a multiplication feeding an addition); matches of instruction graphs are selected so that the program graph is covered]
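Selecting matches so that the program graph is covered is an exact-cover problem with costs. The sketch below enumerates all selections for the small example. It rests on assumptions that are not in the slides: only the three operation nodes of f are covered (data nodes and ret are left out), every instruction costs 1, and the matches are listed by hand instead of being computed by a graph matcher.

```python
from itertools import combinations

# operation nodes of the program graph for f
nodes = {"mul1", "mul2", "add"}

# (instruction, covered nodes, cost); costs are hypothetical
matches = [
    ("mul", {"mul1"}, 1),
    ("mul", {"mul2"}, 1),
    ("add", {"add"}, 1),
    ("mac", {"mul1", "add"}, 1),   # multiply-accumulate: a multiplication plus the addition
    ("mac", {"mul2", "add"}, 1),
]

def best_cover(nodes, matches):
    """Cheapest selection of matches covering every node exactly once."""
    best = None
    for k in range(1, len(matches) + 1):
        for selection in combinations(matches, k):
            covered = [n for _, ns, _ in selection for n in ns]
            exactly_once = len(covered) == len(nodes) and set(covered) == nodes
            if exactly_once:
                cost = sum(c for _, _, c in selection)
                if best is None or cost < best[0]:
                    best = (cost, selection)
    return best

cost, selection = best_cover(nodes, matches)
print(cost, [name for name, _, _ in selection])
```

With these costs, one mul plus one mac beats the three single-operation instructions. Only one of the two multiplications can be folded into a mac, because the addition must be covered exactly once.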

State of the Art

• Local instruction selection

• Program graphs per block

• Graphs restricted to data flow
  • cannot handle control flow such as branching instructions

• Greedy heuristics
  • for example, maximal munch

Instruction Examples

• satadd
  • Exists in many DSPs
  • Incorporates control flow
  • Extends across basic blocks

• repeat
  • Exists in many processors, for example Intel’s x86
  • Incorporates control flow
  • Extends across basic blocks

• add4
  • SIMD-style instruction, very common
  • Requires global code motion: move computations across blocks
  • Depending on hardware may require copying: different register file

[Figure: control flow graph of the example loop; branch edges are labeled T and F]

    i = 0
    if i < N
    t1 = i * 4;  t2 = A + t1;  t3 = B + t1;  a = load t2;  b = load t3;  c = a + b;  if MAX < c
    c = MAX
    t4 = C + t1;  store t4, c;  i = i + 1

Universal Instruction Selection

• Global instruction selection

• Program graphs for entire functions

• Instruction graphs capture both data and control flow
  • handles broad range of instructions found in today’s processors

• Integrates global code motion

• Takes data-copying overhead into account

• Presupposes an expressive approach such as CP

[Figure: Program Graph (Example)]

[Figure: Instruction Graph (satadd)]

Approach

• Before: create instruction graphs

• Code generation
  • create program graph
  • compute possible matches (standard algorithm VF2 [Cordella ea, 2004])
  • generate model in MiniZinc
  • solve model with CPX 1.0.2

[Figure: tool chain. processor instructions → transformer → instruction graphs; input program → transformer → program graph; instruction graphs and program graph → matcher → matches → modeler → model → solver → output program]

Model Summary

• Decision variables
  • which match is selected?
  • in which block are selected matches placed?
  • in which block is data made available?

• Constraints (selection)
  • operations must be covered by exactly one match
  • control flow cannot be moved
  • data must be defined before used
  • definition edges must be enforced
  • blocks must be ordered (respect fall-through branching if possible)
  • implied and dominance constraints

• Objective functions
  • minimize estimated execution time
  • minimize code size

Experiments

• Benchmarks
  • 16 functions from MediaBench
  • program graphs have 34-203 nodes
  • all models solved to optimality with CPX 1.0.2

• For Simple MIPS32
  • simple RISC architecture: worst-case scenario
  • surprise: 1.4% mean speedup over LLVM 3.4
  • better: global code motion; worse: constant reloading
  • runtimes: 0.3-83.2 seconds, median 10.5 seconds

• For Funky MIPS32 (made up)
  • MIPS32 + common SIMD instructions: good case
  • 3% mean speedup over Simple MIPS32
  • surprise: sometimes SIMD-style is not really that good!
  • runtimes: 0.3-146.8 seconds, median 10.5 seconds

Discussion

• Overcomes many restrictions of state-of-the-art approaches
  • control flow
  • global code motion
  • sophisticated instructions

• Model and representation designed together
  • expressive representation requires expressive models

• Limitations
  • constant reloading
  • if-conversion (predication), well: no approach can do this anyway!

SUMMARY


The Only Important Slide

• Are you interested in combinatorial optimization for compilation?

• Do you want to do a postdoc in one of the most beautiful and dynamic cities in the world?

• Then talk to me!

• Open position at KTH for one year, might be prolonged to two years

• salary and benefits are good

• deadline is end of October


Now and Then…

• Status
  • instruction scheduling: local, standard
  • register allocation: global, unique
  • instruction selection: global, unique
  • not fully integrated
  • solving pretty naïve

• Future
  • instruction scheduling: superblocks, if-conversion (predication)
  • register allocation: rematerialization
  • more sophisticated solving
  • integration!!!

Project & Goals

• Unison has a considerable engineering part
  • processor descriptions (separate large project)
  • robust and maintainable tool chain
  • testing and transfer

• A production-quality tool that will be deployed
  • industrial strength re-implementation started

• An open-source contribution to LLVM
  • legal process started, but need to convince LLVM developers…

• Real significance
    simplicity even for today’s freak processors