October 18, 2016

Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars

Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting

International Symposium on Microarchitecture (MICRO), 2016

Rampant Dynamism in Datacenters

Dynamism – Dynamic factors that affect application runtime environments

Co-running of applications

Microarchitectural flexibility

Platform diversity

Dynamism affects the runtime availability of resources


Static Compiler Optimizations

Compilation assumptions might not be met at runtime



Resource-dependent static optimizations do not react to dynamism




Loop Tiling

Restructures memory access pattern to utilize data reuse

Conceptualized before the multicore era, when runtime environments presented little dynamism

[Figure: static tile vs. ideal tile in four runtime environments – normal, co-running application, partitioned cache, different architecture]
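As a concrete illustration of loop tiling, the sketch below restructures a matrix multiplication so that it works on one block of the iteration space at a time. It is a minimal example, not code from the talk: the function names and the tile parameters `ti`, `tj`, `tk` are hypothetical, and Python is used only for readability. The cache benefit of tiling shows up in compiled code, where each block is small enough to stay resident in cache while it is reused.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A x B for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, ti, tj, tk):
    """Same computation, visiting the iteration space in ti x tj x tk tiles.

    (ti, tj, tk) is the tile shape (hypothetical parameter names).
    min() handles partial tiles at the matrix edge.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, ti):              # loops over tiles
        for jj in range(0, n, tj):
            for kk in range(0, n, tk):
                for i in range(ii, min(ii + ti, n)):      # loops inside one tile
                    for j in range(jj, min(jj + tj, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tk, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same result for any positive tile shape; only the order in which memory is touched changes, which is why the best shape depends on how much cache the loop actually has at runtime.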

Co-runner Tiling Comparison

[Figure panels: Static vs Dynamic]

Dynamism requires rethinking cache tiling

Design Objectives

Dynamic – Should react to changes in runtime environment

High accuracy – Should identify a high-performance tiling strategy

Low overhead – Should have low dynamic performance overhead

Current techniques are not enough

[Table: approaches compared on Dynamic, Accuracy, Low-overhead. Rows: White-box approach; BLAS libraries (math kernel libraries like Intel MKL, ATLAS); online generation of a black-box model]

Shape Shifter

Key Components

Dynamic tile generation – Companion thread (Protean Code + Polly)

Protean Code, MICRO 2014 and Polly, PLDI 2008

Detect tiling opportunities – Runtime Environment Monitor (REM)

Find a high-performance tile – Tile Selector

[Figure: ShapeShifter architecture. Labels: Application 1 and Application 2, each with a tiled loop; Companion 1 and Companion 2; dynamic compiler; code cache; REM; tile selector; companion controller]



Tile Selector


Overview

Online training – select tile size and generate training data

Tile selection – generate black-box model and select suitable tile shape

Monitored execution – detect tiling opportunities

[Figure: tile selector workflow, with REM and dynamic compiler. Online training: find tile size, training set, collect cache stats. Tile selection: tile performance model, choose tile. Monitored execution. Transition label: runtime environment change]
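The three phases in the overview can be read as a small state machine. The sketch below is one possible reading, not the talk's implementation: it assumes that a runtime environment change sends the system back to online training, and it uses a drift in IPC beyond a tolerance as a stand-in trigger. The slides only say that the REM collects cache stats, so the function names, the IPC-based check, and the 10% tolerance are all hypothetical.

```python
TRAINING = "online training"
SELECTION = "tile selection"
MONITORED = "monitored execution"


def environment_changed(baseline_ipc, current_ipc, tolerance=0.1):
    """Assumed trigger: IPC drifted by more than `tolerance` (relative)."""
    return abs(current_ipc - baseline_ipc) > tolerance * baseline_ipc


def next_phase(phase, baseline_ipc=None, current_ipc=None):
    """Advance the workflow by one step."""
    if phase == TRAINING:
        return SELECTION      # training data collected, build the model
    if phase == SELECTION:
        return MONITORED      # tile chosen, run and watch
    # Monitored execution: stay here until the monitor reports a change,
    # then (by assumption) retrain for the new environment.
    if environment_changed(baseline_ipc, current_ipc):
        return TRAINING
    return MONITORED
```

For example, `next_phase(MONITORED, baseline_ipc=2.0, current_ipc=1.5)` returns to online training, while a reading of 1.95 stays in monitored execution.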

Tile Selection

Black-box model is generated online

Uses tile parameters and IPC from training data

Model is specific to application and its current runtime environment

Predicts a tile suitable to current runtime environment

[Figure: training data (tile parameters, IPC) feed the black-box model; the model yields IPCpred for a set of tile shapes of the predicted size; labels IPCmax and Tshapeshifter mark the selected tile]
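The selection step described above can be sketched as: fit a model on (tile parameters, IPC) samples, predict IPC for every candidate shape of the chosen size, and keep the shape with the highest prediction. The slides do not say what kind of model ShapeShifter builds, so the inverse-distance-weighted predictor below is only a stand-in. Treating tile size as the product `ti * tj * tk`, and all function names, are likewise assumptions for illustration.

```python
import math


def predict_ipc(training, shape):
    """Stand-in black-box model: inverse-distance-weighted mean of measured IPC.

    training: list of (tile_parameters, measured_ipc) pairs.
    """
    num = den = 0.0
    for params, ipc in training:
        d = math.dist(params, shape)
        if d == 0.0:
            return ipc            # shape was measured directly
        w = 1.0 / (d * d)
        num += w * ipc
        den += w
    return num / den


def candidate_shapes(size, dims):
    """All 3D shapes built from `dims` whose volume equals `size` (assumed definition)."""
    return [(ti, tj, tk) for ti in dims for tj in dims for tk in dims
            if ti * tj * tk == size]


def select_tile(training, size, dims):
    """Return the candidate shape with the highest predicted IPC."""
    return max(candidate_shapes(size, dims),
               key=lambda shape: predict_ipc(training, shape))
```

Because the training samples come from the application as it runs in its current environment, the same code would pick a different shape after a co-runner arrives or the cache allocation changes.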

Insight for Co-optimization

Challenging to retile multiple applications simultaneously



Tile shape and tile size contribute differently to cache interference



Co-optimization – Find tile size for apps and then tile shape one-by-one
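One way to read the two-step co-optimization is sketched below: first pick a tile size for every application, then refine the tile shape of one application at a time while the others stay fixed. This is an interpretation of the slide, not the talk's algorithm. The joint search over size combinations, the use of the first shape as a default, and the `measure` callback (higher is better) are all hypothetical.

```python
from itertools import product


def co_optimize(apps, sizes, shapes_of, measure):
    """Two-phase search over tile sizes, then tile shapes.

    apps      : list of application names
    sizes     : candidate tile sizes
    shapes_of : function size -> list of candidate shapes of that size
    measure   : function {app: shape} -> score, higher is better (hypothetical)
    """
    def default_config(combo):
        # Use the first shape of each size as a placeholder during phase 1.
        return {app: shapes_of(size)[0] for app, size in zip(apps, combo)}

    # Phase 1: choose one tile size per application.
    best_sizes = max(product(sizes, repeat=len(apps)),
                     key=lambda combo: measure(default_config(combo)))
    config = default_config(best_sizes)

    # Phase 2: refine the shape of one application at a time,
    # keeping the other applications' tiles fixed.
    for app, size in zip(apps, best_sizes):
        config[app] = max(shapes_of(size),
                          key=lambda shape: measure({**config, app: shape}))
    return config
```

Splitting the search this way keeps phase 2 linear in the number of applications, which matches the slide's point that retiling everything simultaneously is hard.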

Experimental Evaluation

Methodology

Polybench application suite

Three sources of dynamism

Co-running applications

Microarchitectural flexibility – cache partitioning

Platform diversity – three platforms: AMD Bulldozer, Intel Haswell, Intel Atom

Tiling is performed in the shared cache

Co-runner

Arrival/departure of a co-runner

Static Best – best tile with no co-runner

Co-runner change – syr2k to correlation

Microarchitectural Flexibility

Change in cache allocations

Static Best – best tile with no cache resizing (16-way enabled)

Platform Diversity

Platform diversity – Intel Atom, Intel Haswell and AMD Bulldozer

Static Best – best tile on AMD Bulldozer

Conclusions

ShapeShifter – end-to-end dynamic loop co-optimization

Adapt tiling strategy to the application runtime environment

Loop co-optimization – tiling multiple applications on the fly

Novel black-box modelling approach – fast and accurate

ShapeShifter achieves significant performance improvements across different sources of dynamism

Why does the black-box model work?

There is a trade-off between the best tiling strategy and performance

We show that ShapeShifter (SS) chooses a tile close to the best one

Why 3D tiling?

Built on Polly, but the technique is not restricted to 3D tiling

Also memorize the compilation times

Two reasons for slowdown – tile doesn’t matter, or black-box model not good enough

Remember cache sizes

Prior work – refresh

Overhead – Companion thread

Three sources of overhead

Dynamic Compilation – 136 ms on Intel Haswell, 430 ms on AMD Bulldozer

Code redirection

Training

Overhead – training

Black-box model

Multiple high-performance tiles

ShapeShifter chooses one of the high-performance tiles

ShapeShifter vs Dynamic Oracle

ShapeShifter achieves 93% of the dynamic oracle performance on average

Co-runner
