October 18, 2016
Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars
Continuous Shape Shifting: Enabling Loop
Co-optimization via Near-Free Dynamic Code Rewriting
International Symposium on Microarchitecture (MICRO), 2016
Rampant Dynamism in Datacenters
Datacenters
Co-running of applications
Microarchitectural flexibility
Platform diversity
Dynamism affects the runtime availability of resources
Dynamism - Dynamic factors that affect application runtime environments
Static Compiler Optimizations
Compilation assumptions might not be met at runtime
Resource dependent static optimizations do not react to dynamism
Loop Tiling
Restructures memory access pattern to utilize data reuse
Conceptualized before multicore era, presenting little dynamism
[Figure: Static vs. Ideal tiling in four environments: Normal, Co-running application, Partitioned cache, Different architecture]
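To make the loop-tiling bullet concrete, here is a minimal sketch of a tiled matrix multiply. The function names and the tile parameters `ti`, `tj`, `tk` are illustrative and do not come from the talk. The point is only that the tile sizes are baked into the loop structure, so a tile fixed at compile time cannot follow a changing cache allocation. Python is used for readability; the cache effect itself would show up in compiled code.

```python
def matmul_tiled(a, b, n, ti, tj, tk):
    """Multiply two n x n matrices (lists of lists) using 3D loop tiling.

    ti, tj, tk are hypothetical tile sizes for the i, j and k loops.
    """
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile so the
    # touched blocks of a, b and c can be reused while they are still cached.
    for i0 in range(0, n, ti):
        for k0 in range(0, n, tk):
            for j0 in range(0, n, tj):
                for i in range(i0, min(i0 + ti, n)):
                    for k in range(k0, min(k0 + tk, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tj, n)):
                            c[i][j] += aik * b[k][j]
    return c


def matmul_naive(a, b, n):
    """Untiled reference version that computes the same product."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Changing `ti`, `tj` or `tk` changes only the memory access order, not the result, which is what makes retiling at runtime safe.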
Co-runner Tiling Comparison
Dynamism requires rethinking cache tiling
[Figure panels: Static vs Dynamic]
Design Objectives
Dynamic – Should react to changes in runtime environment
High accuracy – Should identify a high-performance tiling strategy
Low overhead – Should have low dynamic performance overhead
Current techniques are not enough
White-box approaches
Math kernel libraries like Intel MKL, ATLAS
Online generation of a black-box model
[Table: rows White-box approach, BLAS libraries; columns Dynamic, Accuracy, Low-overhead]
Key Components
Dynamic tile generation
Detect tiling opportunities
Find a high-performance tile
[Diagram labels: ShapeShifter; Application 1, Companion 1; Application 2, Companion 2; Dynamic compiler; Code cache; Tiled loop; REM; Tile selector; Companion controller]
Companion thread (Protean Code + Polly)
Runtime Environment Monitor (REM)
Tile Selector
Protean Code, MICRO 2014 and Polly, PLDI 2008
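As a rough illustration of how a companion thread could swap a newly tiled loop into a running application, the sketch below routes every call to the hot loop through an indirection slot. The class name `LoopSlot` and its methods are hypothetical; the talk does not describe the mechanism at this level, so this is an assumption about the general idea rather than the actual implementation.

```python
import threading


class LoopSlot:
    """Indirection point through which the application calls its hot loop."""

    def __init__(self, original):
        self._lock = threading.Lock()
        self._variant = original

    def install(self, variant):
        # Called by the (hypothetical) companion thread after it has compiled
        # a new tiled variant and placed it in its code cache.
        with self._lock:
            self._variant = variant

    def __call__(self, *args):
        # The application always goes through the slot, so the next call
        # picks up whichever variant is currently installed.
        with self._lock:
            variant = self._variant
        return variant(*args)
```

The application keeps calling the same slot while the companion replaces the variant behind it, so no restart is needed when the tile changes.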
Overview
Online training – select tile size and generate training data
Tile selection – generate black-box model and select suitable tile shape
Monitored execution – detect tiling opportunities
[Diagram labels: Tile selector; REM; Dynamic compiler; Monitored execution; Online training (Find tile size, Collect cache stats, Training set); Tile selection (Tile performance model, Choose tile); Runtime environment change]
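The three phases named on this slide can be read as a small control loop. The sketch below is one plausible reading: the phase names come from the slide, but the transition conditions (`tiling_opportunity`, `training_done`, `environment_changed`) are assumptions introduced for illustration.

```python
from enum import Enum


class Phase(Enum):
    MONITORED_EXECUTION = "monitored execution"
    ONLINE_TRAINING = "online training"
    TILE_SELECTION = "tile selection"


def next_phase(phase, tiling_opportunity=False, training_done=False,
               environment_changed=False):
    """Return the next phase of the (hypothetical) control loop."""
    if phase is Phase.MONITORED_EXECUTION:
        # Start (or restart) training when a tileable loop is detected or
        # the monitor reports a runtime environment change.
        if tiling_opportunity or environment_changed:
            return Phase.ONLINE_TRAINING
        return phase
    if phase is Phase.ONLINE_TRAINING:
        # Stay here until enough training samples have been collected.
        return Phase.TILE_SELECTION if training_done else phase
    # After a tile is chosen, go back to monitoring the running loop.
    return Phase.MONITORED_EXECUTION
```

In this reading, a co-runner arriving or a cache repartitioning simply sets `environment_changed`, which sends the loop back through training and selection.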
Tile Selection
Black-box model is generated online
Uses tile parameters and IPC from tile data
Model is specific to application and its current runtime environment
Predicts a tile suitable to current runtime environment
[Diagram labels: Tile parameters, IPC, Training data, Black-box model, IPCpred, Set of tile shapes of predicted size, IPCmax, Tshapeshifter]
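The selection flow on this slide can be sketched as follows: fit a model to (tile size, IPC) samples, let it predict a tile size, then compare shapes of that size by IPC and keep the best. The slide does not say what form the black-box model takes, so the quadratic fit in log2(tile size) below is an assumption chosen for brevity, and `measure_ipc` is a hypothetical callback standing in for a real measurement.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    # Build the 3x3 system for the basis [1, x, x^2], augmented with the RHS.
    s = [sum(x ** p for x in xs) for p in range(5)]
    m = [[s[r + c] for c in range(3)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][3] / m[r][r] for r in range(3)]


def predict_tile_size(samples, candidate_sizes):
    """samples: list of (tile_size_in_elements, measured_ipc) pairs."""
    xs = [math.log2(size) for size, _ in samples]
    ys = [ipc for _, ipc in samples]
    c0, c1, c2 = fit_quadratic(xs, ys)

    def ipc_pred(size):
        x = math.log2(size)
        return c0 + c1 * x + c2 * x * x

    # Pick the candidate size with the highest predicted IPC.
    return max(candidate_sizes, key=ipc_pred)


def shapes_of_size(size):
    """All power-of-two (ti, tj, tk) shapes whose product equals size."""
    e = int(math.log2(size))
    return [(2 ** a, 2 ** b, 2 ** (e - a - b))
            for a in range(e + 1) for b in range(e - a + 1)]


def select_tile(samples, candidate_sizes, measure_ipc):
    """Predict a size, then keep the shape of that size with the best IPC."""
    size = predict_tile_size(samples, candidate_sizes)
    return max(shapes_of_size(size), key=measure_ipc)
```

Because the model is refit from fresh samples each time, it stays specific to the application and the environment it is currently running in.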
Insight for Co-optimization
Challenging to retile multiple applications simultaneously
Tile shape and tile size contribute differently to cache interference
Co-optimization – Find tile size for apps and then tile shape one-by-one
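The two-step idea in the last bullet can be sketched as below: fix a tile size for every application first, then refine tile shapes one application at a time while the others keep their current shapes. All callbacks (`pick_size`, `shapes_of_size`, `measure_ipc`) are hypothetical placeholders, not part of the talk.

```python
def co_optimize(apps, pick_size, shapes_of_size, measure_ipc):
    """Two-phase sketch: sizes for every app first, then shapes one by one.

    apps           -- list of application names
    pick_size      -- hypothetical callback: app -> tile size
    shapes_of_size -- hypothetical callback: size -> list of candidate shapes
    measure_ipc    -- hypothetical callback: (app, {app: shape}) -> IPC of app
    """
    # Phase 1: fix a tile size per application and start from the first shape.
    sizes = {app: pick_size(app) for app in apps}
    current = {app: shapes_of_size(sizes[app])[0] for app in apps}
    # Phase 2: refine one application at a time while the others keep
    # running with their current shapes.
    for app in apps:
        def ipc_with(shape, app=app):
            return measure_ipc(app, {**current, app: shape})

        current[app] = max(shapes_of_size(sizes[app]), key=ipc_with)
    return current
```

Searching shapes one application at a time keeps the search linear in the number of applications instead of exploring every combination of shapes at once.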
Methodology
Polybench application suite
Three sources of dynamism
Co-running applications
Microarchitectural flexibility – cache partitioning
Platform diversity
Three platforms
AMD Bulldozer
Intel Haswell
Intel Atom
Tiling is performed in the shared cache
Co-runner
Arrival/departure of a co-runner
Static Best – best tile with no co-runner
Co-runner change
syr2k to correlation
Change in cache allocations
Microarchitectural Flexibility
Microarchitectural flexibility – cache partitioning
Static Best – best tile with no cache resizing (16-way enabled)
Platform Diversity
Platform diversity – Intel Atom, Intel Haswell and AMD Bulldozer
Static Best – best tile on AMD Bulldozer
Conclusions
ShapeShifter – end-to-end dynamic loop co-optimization
Adapt tiling strategy to the application runtime environment
Loop co-optimization – tiling multiple applications on the fly
Novel black-box modelling approach – fast and accurate
ShapeShifter achieves significant performance improvements across different sources of dynamism
Why does the black-box model work?
There is a trade-off between the best tiling strategy and performance
We show that ShapeShifter (SS) chooses a close one
Why 3D tiling?
Builds on Polly, but the technique is not restricted to 3D tiling
Also memorize the compilation times
Two reasons for slowdown – tile doesn’t matter, black-box model not good enough
Remember cache sizes
Prior work – refresh
Overhead – Companion thread
Three sources of overhead
Dynamic Compilation – 136 ms on Intel Haswell, 430 ms on AMD Bulldozer
Code redirection
Training
Black-box model
Multiple high-performance tiles
ShapeShifter chooses one of the high-performance tiles
ShapeShifter vs Dynamic Oracle
ShapeShifter achieves 93% of the dynamic oracle performance on average