modeling and solving code generation for real 2015.pdf · •model and solve combinatorial...

Modeling and SolvingCode Generation for Real

Christian SchulteKTH Royal Institute of Technology & SICS (Swedish Institute of Computer Science)

joint work with: Mats Carlsson SICS

Roberto Castañeda Lozano SICS + KTH

Frej Drejhammar SICS

Gabriel Hjort Blindell KTH + SICS

funded by: Ericsson AB

Swedish Research Council (VR 621-2011-6229)

Who Am I?

• Professor of Computer Science at KTH Royal Institute of Technology, Stockholm, Sweden

• Expert Researcher at SICS (Swedish Institute of Computer Science), Stockholm, Sweden

• Education• diploma in computer science, Karlsruhe, Germany, 1992

• doctoral degree in engineering, Saarbrücken, Germany, 2001

• docent in computer systems, KTH, Sweden, 2009

• Research interests• constraint programming

• programming languages

• systems-based research (Gecode, for example)

Compilation

• Front-end: depends on source programming language• changes infrequently (well…)

• Optimizer: independent optimizations• changes infrequently (well…)

• Back-end: depends on processor architecture• changes often: new process, new architectures, new features, …

front-end optimizer back-end(code generator)

sourceprogram

assemblyprogram

Generating Code: Unison

• Infrequent changes: front-end & optimizer• reuse state-of-the-art: LLVM, for example

• Frequent changes: back-end• use flexible approach: Unison

front-end optimizer back-end(code generator)

sourceprogram

assemblyprogram

Unison

State of the Art

• Code generation organized into stages• instruction selection,

instructionselection

x = y + z;add r0 r1 r2mv $a6f0 r0

State of the Art

• Code generation organized into stages• instruction selection, register allocation,

registerallocation

x = y + z;x register r0y memory (spill to stack)…

State of the Art

• Code generation organized into stages• instruction selection, register allocation, instruction scheduling

instructionscheduling

x = y + z;…u = v – w;

u = v – w;…x = y + z;

State of the Art

• Code generation organized into stages• stages are interdependent: no optimal order possible

registerallocation

State of the Art

• Example: instruction scheduling register allocation• increased delay between instructions can increase throughput

registers used over longer time-spans

more registers needed

register allocation

State of the Art

• Example: instruction scheduling register allocation• put variables into fewer registers

more dependencies among instructions

less opportunity for reordering instructions

registerallocation

State of the Art

• Stages use heuristic algorithms• for hard combinatorial problems (NP hard)

• assumption: optimal solutions not possible anyway

• difficult to take advantage of processor features

• error-prone when adapting to change

register allocation

State of the Art

• Stages use heuristic algorithms• for hard combinatorial problems (NP hard)

• assumption: optimal solutions not possible anyway

• difficult to take advantage of processor features

• error-prone when adapting to change

register allocation

preclude optimal code,make development

complex

Rethinking: Unison Idea

• No more staging and complex heuristic algorithms!• many assumptions are decades old...

• Use state-of-the-art technology for solving combinatorial optimization problems: constraint programming

• tremendous progress in last two decades...

• Generate and solve single model• captures all code generation tasks in unison

• high-level of abstraction: based on processor description

• flexible: ideally, just change processor description

• potentially optimal: tradeoff between decisions accurately reflected

Unison Approach

• Generate constraint model• based on input program and processor description

• constraints for all code generation tasks

• generate but not solve: simpler and more expressive

registerallocation

input program

processor description

constraint model

constraints

Unison Approach

• Off-the-shelf constraint solver solves constraint model• solution is assembly program

• optimization takes inter-dependencies into account

registerallocation

input program

processor description

constraint model

constraints

off-the-shelfconstraint

solver

assemblyprogram

Talk Overview

• Constraint programming in a nutshell

• Register Allocation & Instruction Scheduling• Basic Register Allocation

• Instruction Scheduling

• Advanced Register Allocation

• Global Register Allocation

• Discussion

• Instruction Selection [if time allows]• Graph-based Instruction Selection

• Universal Instruction Selection

• Discussion

• Summary

Source Material

• Register Allocation & Instruction Scheduling• Constraint-based Register Allocation and Instruction Scheduling,

Roberto Castañeda Lozano, Mats Carlsson, Frej Drejhammar, Christian Schulte. CP 2012.

• Combinatorial Spill Code Optimization and Ultimate Coalescing, Roberto Castañeda Lozano, Mats Carlsson, Gabriel Hjort Blindell, Christian Schulte. LCTES 2014.

• Instruction Selection• Modeling Universal Instruction Selection, Gabriel Hjort Blindell,

Roberto Castañeda Lozano, Mats Carlsson, Christian Schulte. CP 2015.

Source Material

• Surveys• Survey on Combinatorial Register Allocation and Instruction

Scheduling, Roberto Castañeda Lozano, Christian Schulte. CoRR entry, 2014. Revised version to be submitted.

• Instruction Selection: Principles, Techniques and Applications. Gabriel Hjort Blindell, Springer, 2015. To appear.

• Additional Material• Integrated Register Allocation and Instruction Scheduling with

Constraint Programming, Roberto Castañeda Lozano. KTH Royal Institute of Technology, Sweden, Licentiate thesis, 2014.

• Constraint-based Code Generation. Roberto Castañeda Lozano, Gabriel Hjort Blindell, Mats Carlsson, Frej Drejhammar, Christian Schulte. M-SCOPES 2013.

CONSTRAINT PROGRAMMINGIN A NUTSHELL

Constraint Programming

• Model and solve combinatorial (optimization) problems

• Modeling• variables

• constraints

• branching heuristics

• (cost function)

• Solving• constraint propagation

• heuristic search

• Of course simplified…

…array of modeling and solving techniques

Problem: Send More Money

• Find distinct digits for letters such that

+ MORE

= MONEY

Constraint Model

• Variables:

S,E,N,D,M,O,R,Y {0,…,9}

• Constraints:distinct(S,E,N,D,M,O,R,Y)

1000×S+100×E+10×N+D

+ 1000×M+100×O+10×R+E

= 10000×M+1000×O+100×N+10×E+Y

Constraints

• State relations between variables• legal combinations of values for variables

• Examples• all variables pair wise distinct: distinct(x1, …, xn)

• arithmetic constraints: x + 2×y = z

• domain-specific: cumulative(t1, …, tn)

nooverlap(r1, …, rn)

• Success story: global constraints• modeling: capture recurring problem structures

• solving: enable strong reasoning

constraint-specific methods

Solving: Variables and Values

• Record possible values for variables• solution: single value left

• failure: no values left

x{1,2,3,4} y{1,2,3,4} z{1,2,3,4}

Constraint Propagation

• Prune values that are in conflict with constraint

x{1,2,3,4} y{1,2,3,4} z{1,2,3,4}

distinct(x, y, z) x + y = 3

• Prune values that are in conflict with constraint

x{1,2} y{1,2} z{1,2,3,4}

• Prune values that are in conflict with constraint• propagation is often smart if not perfect!

x{1,2} y{1,2} z{3,4}

Heuristic Search

• Propagation alone not sufficient• decompose into simpler sub-problems

• search needed

• Create subproblems with additional constraints• enables further propagation

• defines search tree

• uses problem specific heuristic

x{1,2} y{1,2} z{3,4}

x{1} y{2} z{3,4}

x{2} y{1} z{3,4}

x=1 x≠1

What Makes It Work?

• Essential: avoid search...

...as it always suffers from combinatorial explosion

• Constraint propagation drastically reduces search space

• Efficient and powerful methods for propagation available

• When using search, use a clever heuristic

• Array of modeling techniques available that reduce search

• Hybrid methods (together with LP, SAT, stochastic, ...)

Register Allocation & Instruction Scheduling

Unit and Scope

• Function is unit of compilation• generate code for one function at a time

• Scope• local generate code for each basic block in isolation

• global generate code for whole function

• Basic block: instructions that are always executed together• execute at start

• execute all instructions

• leave execution at end

• that is: no control flow within basic block (in or out)

BASIC REGISTER ALLOCATION

Local (and slightly naïve) register allocation

Local Register Allocation

• Instruction selection has already been performed

• Temporaries• defined or def-occurence (lhs) t3 in t3 sub t1, 2

• used or use-occurence (rhs) t1 in t3 sub t1, 2

• Basic blocks are in SSA (single static assignment) form• each temporary is defined once

• standard state-of-the-art approach

t2 mul t1, 2t3 sub t1, 2t4 add t2, t3

...t5 mul t1, t4 jr t5

Liveness & Interference

• Temporary is live from def to last use, defining its live range• live ranges are linear (basic block + SSA)

• Temporaries interfere if their live ranges overlap

• Non-interfering temporaries can be assigned to same register

t2 mul t1, 2t3 sub t1, 2t4 add t2, t3

...t5 mul t1, t4 jr t5

t1 t2 t3 t4 t5

live ranges

Spilling

• If not enough registers available: spill

• Spilling moves temporary to memory (stack)• store in memory after defined• load from memory before used• memory access typically considerably more expensive• decision on spilling crucial for performance

• Architectures might have more than one register bank• some instructions only capable of addressing a particular bank• “spilling” from one register bank to another

• Unified register array• limited number of registers for each register file• memory is just another “register” file• unlimited number of memory “registers”

Coalescing

• Temporaries d (“destination”) and s (“source”) are move-related if

d s• d and s should be coalesced (assigned to same register)

• coalescing saves move instructions and registers

• Coalescing is important due to• how registers are managed (calling convention)

• how our model deals with global register allocation (more later)

Copy Operations

• Copy operations replicate a temporary t to a temporary t’

t’ {i1, i2, …, in} t• copy is implemented by one of the alternative instructions i1, i2, …, in• instruction depends on where t and t’ are stored

similar to [Appel & George, 2001]

• Example MIPS32

t’ {move, sw, nop} t• t’ memory and t register: sw spill

• t’ register and t register: move move-related

• t’ and t same register: nop coalescing

• MIPS32: instructions can only be performed on registers

Model Variables

• Decision variables• reg(t) N register to which temporary t is assigned

• instr(o) N instruction that implements operation o

• cycle(o) N issue cycle for operation o

• active(o) {0,1} whether operation o is active

• Derived variables• start(t) start of live range of temporary t

= cycle(o) where o defines t

• end(t) end of live range of temporary t

= max { cycle(o) | o uses t }

Sanity Constraints

• Copy operation o is active no coalescing

active(o) = 1 reg(s) ≠ reg(d)• s is source of move, d is destination of move operation o

• Operations implemented by suitable instructions• single possible instruction for non-copy operations

• Miscellaneous• some registers are pre-assigned

• some instructions can only address certain registers (or memory)

Geometrical Interpretation

• Temporary t is rectangle• width is 1 (occupies one register)

• top = start(t) issue cycle of def

• bottom = end(t) last issue cycle of any use

• Consequence of linear live range (basic block + SSA)

unified register array

r0 r1 rn-1

registers memory registers

m0 m1clock cycle

temporary t

Register Assignment

• Register assignment = geometric packing problem• find horizontal coordinates for all temporaries

• such that no two rectangles for temporaries overlap

• For block B

nooverlap({reg(t),reg(t)+1,start(t),end(t) | tB})

unified register array

r0 r1 rn-1

registers memory registers

m0 m1clock cycle

temporary t

Register Packing

• Temporaries might have different width width(t)• many processors support access to register parts

• still modeled as geometrical packing problem [Pereira & Palsberg, 2008]

Register Packing

• Example: Intel x86• assign two 8 bit temporaries (width = 1) to 16 bit register (width = 2)

• register parts: AH, AL, BH, BL, CH, CL

• possible for 8 bit: AH, AL, BH, BL, CH, CL

• possible for 16 bit: AH, BH, CH

AH AL BH BL CH CLAX BX CX

clock cycle

width(t1)=1

width(t3)=2

width(t3)=1

width(t4)=2

Register Packing

clock cycle

start(t1)=0

start(t2)=0

start(t3)=0

start(t4)=1

end(t1)=1

end(t2)=2

end(t3)=1

end(t4)=2

width(t1)=1

width(t3)=2

width(t3)=1

width(t4)=2

Register Packing

clock cycle

start(t1)=0

start(t2)=0

start(t3)=0

start(t4)=1

end(t1)=1

end(t2)=2

end(t3)=1

end(t4)=2

width(t1)=1

width(t3)=2

width(t3)=1

width(t4)=2

Modeling Register Packing

• Take width of temporaries into account (for block B)

nooverlap({reg(t),reg(t)+width(t),start(t),end(t) | tB})

• Exclude sub-registers depending on width(t)• simple domain constraint on reg(t)

INSTRUCTION SCHEDULING

Local instruction scheduling (standard)

Dependencies

• Data and control dependencies• data, control, artificial (for making in and out first/last)

• If operation o2 depends on o1:

active(o1) active(o2)

cycle(o2) cycle(o1) + latency(instr(o1))

t3lit4slti t2

bne t4

1 (t2)

1 (t4)

1 (t3)

1 (t2)

1 (t1)

1 (t2) 1

Processor Resources

• Processor resources: functional units, data buses, ...• also: instruction bundle width for VLIW processors (how many

instructions can be issued simultaneously)

• Classical cumulative scheduling problem functional

units• processor resource has capacity #units

• instructions occupy parts of resource 1 unit

• resource consumption can never exceed capacity

• Modeling for block B

cumulative({cycle(o),dur(o,r),active(o)use(o,r) | oB})

ADVANCED REGISTER ALLOCATION

Ultimate Coalescing & Spill Code Optimization

using alternative temporaries

Interference Too Naïve!

• Move-related temporaries might interfere…

…but contain the same value!

• Ultimate notion of interference =

temporaries interfere their live ranges overlap and

they have different values[Chaitin ea, 1981]

t1 ......

t2 mv t1...

... ... t1 ......

... ... t2 ...

t1 and t2 interfere

Spilling Too Naïve!

• Known as spill-everywhere model• reload from memory before every use of original temporary

• Example: t3 should be used rather than reloading t2• t2 allocated in memory!

t1 ......

... ... t1 ......

... ... t1 ...

t1 ...t2 st t1

...t3 ld t2... ... t3 ...

...t4 ld t2... ... t4 ...

spill t1

Alternative Temporaries

• Used to track which temporaries are equal

• Representation is augmented by operands• act as def and use ports in operations

• temporaries hold values transferred among operations by connecting to operands

• Example• operation t2 abs t1• transformed to p2:t2 abs p1:t1 (p1, p2 operands)

• if t1 and t3 hold same value then transformed to

p2:t2 abs p1:{t1,t3}

where either t1 or t3 can be connected to p1

• Model: whether a temporary is live (it is being used)

GLOBAL REGISTER ALLOCATION

Register allocation for entire functions

Entire Functions

• Use control flow graph (CFG) and turn it into LSSA form• edges = control flow

• nodes = basic blocks (no control flow)

• LSSA = linear SSA = SSA for basic blocks plus… to be explained

int fac(int n) {int f = 1;while (n > 0) {f = f * n; n--;

}return f;

t3lit4slti t2

bne t3

jr t10

t8mul t7,t6t9subiu t6

bgtz t9

Linear SSA (LSSA)

• Linear live range of a temporary cannot span block boundaries

• Liveness across blocks defined by temporary congruence

t t’ represent same original temporary

t3lit4slti t2

bne t4

jr t10

bgtz t9

t1t5t2t6

t1t10t3t11

t5t10t8t11

Linear SSA (LSSA)

• Example: t3, t7, t8, t11 are congruent• correspond to the program variable f (factorial result)

• not discussed: t1 return address, t2 first argument, t11 return value

t3lit4slti t2

bne t4

jr t10

bgtz t9

t1t5t2t6

t1t10t3t11

t5t10t8t11

Linear SSA (LSSA)

• Advantage• simple modeling for linear live ranges (geometrical interpretation)

• enables problem decomposition for solving

t3lit4slti t2

bne t4

jr t10

bgtz t9

t1t5t2t6

t1t10t3t11

t5t10t8t11

Global Register Allocation

• Try to coalesce congruent temporaries• this is why coalescing is (even more) crucial in this model

• Introduces natural problem decomposition• master problem (function) coalesce congruent temporaries

• slave problems (basic blocks) register allocation & instruction scheduling

• What is happening• if register pressure is low…

no copy instruction needed (nop)

= coalescing

• if register pressure is high…

copy operation might be implemented by a move

= no coalescing

copy operation might be implemented by a load/store

= spill

DISCUSSION

Solving

• Approach• use master-slave decomposition

• use naïve (very) portfolio of heuristics for basic blocks

• use some pre-solving (symmetry, no-goods, dominance)

• not very advanced (future work)

• Benchmark setup• selection of medium-sized functions (25 to 1000 instructions)

• comparison to LLVM 3.3 for Qualcomm’s Hexagon V4 using –O3

• run for ten iterations where each iteration is given more time

• using Gecode 4.2.1

• full details in [Castañeda ea, LCTES 2014]

Experiments Summary

• Code quality (estimated)• 7% mean improvement over LLVM

• provably optimal for 29% of functions

• Quadratic average (roughly) complexity up to 1000 instructions

• Can be easily changed to optimize for code size• 1% mean improvement over LLVM

Related Approaches

• Idea and motivation in Unison for combinatorial optimization is absolutely not new!

• starting in the early 1990s

[Castañeda & Schulte, CoRR 2014]

• Approaches differ• which code generation tasks covered

• which technology used (ILP, CP, SAT, Genetic Algorithms, ...)

• Common to most approaches• compilation unit is basic block, or

• just a single task covered, or

• very poor scalability

• Challenge: integration, robustness, and scalability

Unique to Unison Approach

• First global approach for register allocation (function as compilation unit)

• Constraint programming using global constraints• sweet spot: cumulative and nooverlap

• Full register allocation with ultimate coalescing, packing, spilling, and spill code optimization

• key property of model: spilling is internalized

• Robust at the expense of optimality• problem decomposition

• But: instruction selection not yet there!

Instruction Selection[Based on slides from Gabriel Hjort Blindell]

Graph-basedInstruction Selection

• Represent program as graph program graph

int f(int a) {int b = a * 2;int c = a * 4;return b + c;

• Represent instructions as graph instruction graph

• Select matches such that program graph is covered

State of the Art

• Local instruction selection

• Program graphs per block

• Graphs restricted to data flow• cannot handle control flow such as branching instructions

• Greedy heuristics• For example, maximal munch

Instruction Examples

• satadd

• Exists in many DSPs

• Incorporates control flow

• Extends across basic blocks

if i < N

t1 = i * 4t2 = A + t1; t3 = B + t1a = load t2; b = load t3c = a + bif MAX < c

t4 = C + t1store t4, ci = i + 1

c = MAX

• satadd

• repeat

• Exists in many processors• for example Intel’s x86

• Incorporates control flow

• Extends across basic blocks

if i < N

t4 = C + t1store t4, ci = i + 1

c = MAX

• satadd

• repeat

• add4

• SIMD-style instruction• very common

• Requires global code motion• move computations across

blocks

• Depending on hardware may require copying

• different register file

if i < N

t4 = C + t1store t4, ci = i + 1

c = MAX

Universal Instruction Selection

• Global instruction selection

• Program graphs for entire functions

• Instruction graphs capture both data and control flow• handles broad range of instructions found in today’s processors

• Integrates global code motion

• Takes data-copying overhead into account

• Presupposes an expressive approach such as CP

Program Graph (Example)

Instruction Graph (satadd)

Approach

• Before: create instruction graphs

• Code generation• create program graph

• compute possible matches (standard algorithm VF2 [Cordella ea, 2004])

• generate model in MiniZinc

• solve model with CPX 1.0.2

transformer

matcher modeler solver

processorinstructions

inputprogram

instruction graphs

program graph

matches model

outputprogram

Model Summary

• Decision variables• which match is selected?

• in which block are selected matches placed?

• in which block is data made available?

• Constraints (selection)• operations must be covered by exactly one match

• control flow cannot be moved

• data must be defined before used

• definition edges must be enforced

• blocks must be ordered (respect fall-through branching if possible)

• implied and dominance constraints

• Objective functions • minimize estimated execution time

• minimize code size

Experiments

• Benchmarks• 16 functions from MediaBench

• program graphs have 34-203 nodes

• all models solved to optimality with CPX 1.0.2

• For Simple MIPS32• simple RISC architecture: worst-case scenario

• surprise: 1.4% mean speedup over LLVM 3.4

• better: global code motion; worse: constant reloading

• runtimes: 0.3-83.2 seconds, median 10.5 seconds

• For Funky MIPS32 (made up)• MIPS32 + common SIMD instructions: good case

• 3% mean speedup over Simple MIPS32

• surprise: sometimes SIMD-style is not really that good!

• runtimes: 0.3-146.8% seconds, median 10.5 seconds

Discussion

• Overcomes many restrictions of state-of-the-art approaches• control flow

• global code motion

• sophisticated instructions

• Model and representation designed together• expressive representation requires expressive models

• Limitations• constant reloading

• if-conversion (predication), well: no approach can do this anyway!

SUMMARY

The Only Important Slide

• Are you interested in combinatorial optimization for compilation?

• Do you want to do a postdoc in one of the most beautiful and dynamic cities in the world?

• Then talk to me!

• Open position at KTH for one year, might be prolonged to two years

• salary and benefits are good

• deadline is end of October

Now and Then…

• Status• instruction scheduling: local, standard

• register allocation: global, unique

• instruction selection: global, unique

• not fully integrated

• solving pretty naïve

• Future• instruction scheduling: superblocks, if-conversion (predication)

• register allocation: rematerialization

• more sophisticated solving

• integration!!!

Project & Goals

• Unison has a considerable engineering part• processor descriptions (separate large project)

• robust and maintainable tool chain

• testing and transfer

• A production-quality tool that will be deployed• industrial strength re-implementation started

• An open-source contribution to LLVM• legal process started, but need to convince LLVM developers…

• Real significance

simplicity even for today’s freak processors

modeling and solving code generation for real 2015.pdf · •model and solve combinatorial...

Documents

smartphone heuristics

heuristics g2

nicolson_10 heuristics for interdisciplinary modeling...

mathematical models of branching actin networks: results...

understanding vsids branching heuristics in conflict

ca313 heuristics

david silver - ucl · the real world real-world problems...

plugin heuristics

acropora branching & non branching

usability heuristics

what looks left-branching is right-branching: evidence

counting-based search: branching heuristics for constraint...

heuristics 05

constraint branching and disjunctive cuts for mixed ... ·...

heuristics: heuristics and biases at the bargaining table

branching heuristics in differential collision search

sisb heuristics

hci - heuristics

rheology control by branching modeling - tu/e ·...

(4) heuristics