

Towards Intelligent Programming Systems for Modern Computing

Computer Science, North Carolina State University

Xipeng Shen

Unprecedented Scale

(Sources: SciDAC, IBM)

Y2011: 20 petaflops, 10 MW power

Y202X: 1000 petaflops, 20 MW power

50X performance at only 2X power!

Heterogeneity Becomes the Norm

Massively parallel accelerators are becoming ubiquitous.

Thesis


To address the challenges in modern computing, one of the keys lies in making programming systems more intelligent.

In advancing programming systems, the right problem formulation goes a long way.

Modern Computing

Application: Data analytics, Machine learning, …

Infrastructure: Data centers, Cloud, IoT, …

Architecture: Heterogeneous parallel processors, Emerging complex memory, …

TOP: Algorithmic optimizer for data analytics [VLDB'15, ICML'15]

G-Streamline + PORPLE: Memory optimization for GPU [ASPLOS'11, Micro'14, ICS'16]

TOP: Enabling Algorithmic Optimizations for Distance-Related Problems

VLDB'2015, ICML'2015

Up to 100s of X speedups.

Yufei Ding

Role of Compiler

[Diagram: pipeline from Learning Problem to ML Algorithm to Implementation to Execution. ML experts design the algorithm; compilers traditionally work between Implementation and Execution, co-designed with the runtime system/architecture. Question: can compilers also work at the algorithm level, i.e., can compilers optimize algorithms?]

Why the Algorithm Level?

Reason 1: Large benefits: orders-of-magnitude speedups at no extra cost.

Reason 2: Compilers may outsmart ML experts. Really?

Example

[Diagram: a triangle with sides a, b, and d illustrates the triangle inequality: |a − b| ≤ d ≤ a + b. A companion sketch shows a point X and two cluster centers C1 and C2, the K-means setting in which the inequality lets distance computations be skipped.]
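To make the example concrete, here is a minimal C-style sketch (illustrative only, not TOP's code) of how the triangle inequality avoids a distance computation in K-means:

    /* Triangle inequality: |a - b| <= d <= a + b.
     * For a point x with known distance d(x, c1) to its current center c1:
     * if d(c1, c2) >= 2 * d(x, c1), then
     *   d(x, c2) >= d(c1, c2) - d(x, c1) >= d(x, c1),
     * so c2 cannot be closer than c1 and d(x, c2) need not be computed. */
    int can_skip_center(float d_x_c1, float d_c1_c2) {
        return d_c1_c2 >= 2.0f * d_x_c1;
    }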

TI-based enhanced algorithms in the literature:

• K-Means: NIPS'2012, SIAM'2010, ICML'2003

• K-NN: IJCNN'11, Vision Interface'10, SSDM'10

• P2P (Point-to-Point Shortest Path): SIAM'05, ALENEX'04

Observations

• TI has led to many enhanced algorithms across problems and domains.

• Applying TI well is tricky, hence the many manual efforts and publications.

Thoughts

• Can we have an abstraction to represent all the problems?

• Can we then generalize the TI optimizations into compiler-based transformations?

Abstract Distance Problem (Q, T, D, R, C)

• Q: query point set
• T: target point set
• D: distance
• R: relation
• C: constraints

Instances: K-Means, KNN, KNN join, ICP, N-Body, Shortest Distance
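As a hypothetical illustration only (the actual TOP interface appears later), the five-tuple could be written down as a C structure:

    /* Hypothetical rendering of the abstract distance problem (Q, T, D, R, C). */
    typedef struct {
        float *queries;      /* Q: query point set */
        float *targets;      /* T: target point set */
        int    nq, nt, dim;  /* set sizes and point dimensionality */
        float (*distance)(const float *q, const float *t, int dim); /* D */
        int    relation;     /* R: e.g., nearest, k-nearest, within-radius */
        int    constraint;   /* C: e.g., k for KNN, convergence criteria */
    } DistanceProblem;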

Our Analysis and Abstraction

[Diagram: the concrete problems (K-Means, KNN, KNN join, ICP, N-Body, Shortest Distance) map onto the abstract distance-related problem, from which the essence and 7 principles of TI optimizations are derived.]

Key Insights

• Reuse through landmarks (see the sketch below)

• Spatial & temporal reuses

• Elasticity through hierarchical landmarks

• Efficient bounds update through ghosts for iterative algorithms

• Order of comparison

[Diagram: a landmark L surrounded by query points q1-q3 and target points t1-t3.]

See VLDB'15 for details.
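A minimal sketch of the landmark-based reuse (illustrative helper names, not TOP's internals): once the distances from each query and each target to a landmark L are computed, every query-target distance can be bounded without being computed.

    #include <math.h>

    /* |d(q,L) - d(t,L)| <= d(q,t) <= d(q,L) + d(t,L):
     * one set of landmark distances is reused across all query-target pairs. */
    float ti_lower_bound(float d_qL, float d_tL) { return fabsf(d_qL - d_tL); }
    float ti_upper_bound(float d_qL, float d_tL) { return d_qL + d_tL; }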

TOP Framework

[Diagram: the abstract distance-related problem and the essence & 7 principles of TI optimizations are realized as the TOP framework: the TOP API describes the problem (K-Means, KNN, KNN join, ICP, N-Body, Shortest Distance); the compiler extracts the problem semantics; an optimization library supplies the building blocks.]

Usage

Workflow: a basic algorithm description in the TOP API → compiler → staged program code → TI Opt Lib → efficient execution. Applying TI goes from ad hoc to systematic.

K-means in the TOP API (reformatted from the slide):

    TOP_defDistance(Euclidean);
    T = init();
    changedFlag = 1;
    while (changedFlag) {
        N = TOP_findClosestTargets(1, S, T);  /* closest target in T for each point in S */
        TOP_update(T, &changedFlag, N, S);    /* update targets; set flag if any changed */
    }

On K-Means

[Charts: speedup (X) of TOP and Yinyang K-Means over the baseline of classic K-means (K = 1024) on a 16 GB, 8-core Intel Ivy Bridge machine.]

Clustering results are the same as the original method's. Code link in the ICML'15 paper.

On All Benchmarks

[Charts: (left) speedups (X) of the TOP-optimized versions vs. the manually optimized versions of KNN, KNN join, K-Means, ICP, N-Body, and P2P, log scale from 1 to 10^4; (right) number of distance calculations in the TOP versions vs. the manual versions, log scale from 10^6 to 10^13. Machine: Intel i5-4570 CPU, 8 GB memory.]

Average speedups: 50X (TOP) vs. 20X (manual). TOP saves at least 93% of the distance calculations.

Insight: the right abstraction and formulation turn a compiler into an automatic algorithm optimizer, yielding large speedups.

Modern Computing

Application: Data analytics, Machine learning, …

Infrastructure: Data centers, Cloud, IoT, …

Architecture: Heterogeneous parallel processors, Emerging complex memory, …

TOP: Algorithmic optimizer for data analytics [VLDB'15, ICML'15]

G-Streamline + PORPLE: Memory optimization for GPU [ASPLOS'11, Micro'14, ICS'16]

Overcome GPU Limitations

With: Guoyang Chen (Qualcomm), Bo Wu (Prof. @ Colorado School of Mines), Zheng Zhang (Prof. @ Rutgers Univ.)


Graphics Processing Unit (GPU)

[Diagram: GPU architecture; threads execute in SIMD groups (warps).]

• Massive parallelism

• Favorable computing power, cost effectiveness, and energy efficiency

Challenges

• Irregular memory & control

• Dynamic task parallelism

• Scheduling limitations

Our Explorations: Compiler-Based Software Solutions

11/06: CUDA release
9/07: LCPC talk by David Kirk
5/09: IPDPS, cross-input adaptive optimization
6/10: ICS, removing thread divergence dynamically
3/11: ASPLOS, G-Streamline
10/11: PACT, treating synchronization correctly in GPU-to-CPU translation
6/12: ICS, synchronization relaxation & optimization for GPU-to-CPU
2/13: PPoPP, memory coalescing
9/13: PACT, NVM for GPU
12/14: Micro, PORPLE
5/15: HotOS, co-run on fused systems
6/15: ICS, SM-centric
12/15: Micro, Free Launch
6/16: ICS, Multiview
2/17: PPoPP, EffiSha
5/17: IPDPS, co-scheduling on fused systems
Since then: Sweet KNN; VersaPipe; Lean DNN; …

Solutions

Compiler-based software solutions:

• Irregular memory & control: G-Streamline & PORPLE [ASPLOS'11, Micro'14, ICS'16]

• Dynamic task parallelism: Free Launch [Micro'15]

• Scheduling limitations: SM-Centric & EffiSha [ICS'15, PPoPP'17] (Monday, PPoPP Session 1)


Dynamic Irregularities

[Diagram: dynamic irregularities.
Irregular memory references: threads tid = 0..7 execute ... = A[P[tid]]. With P[] = {0, 1, 2, 3, 4, 5, 6, 7} all accesses fall into one memory segment; with P[] = {0, 5, 1, 7, 4, 3, 6, 2} they scatter across segments.
Irregular control flow (thread divergence): with a mix of zero and nonzero values in A[], branches like if (A[tid]) {...} and loops like for (i = 0; i < A[tid]; i++) {...} send threads of a warp down different paths.]

Irregularities degrade throughput by up to (warp size - 1) times (warp size = 32 in modern GPUs).
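Both kinds of irregularity can be expressed as a small CUDA kernel; this is a generic illustration, not G-Streamline code:

    #include <cuda_runtime.h>

    /* Irregular memory reference: A[P[tid]] may scatter a warp's accesses
     * across many memory segments, splitting one coalesced transaction
     * into up to 32. */
    __global__ void irregular_mem(const int *A, const int *P, int *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = A[P[tid]];
    }

    /* Irregular control flow: threads of a warp taking different branch
     * paths (thread divergence) are serialized. */
    __global__ void irregular_ctrl(const int *A, int *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) { out[tid] = A[tid] ? 1 : 0; }
    }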

Solution 1: Thread-Data Remapping

[Diagram: scattered accesses within a warp take 4 transactions per warp; after remapping they fall into one memory segment, 1 transaction per warp.]

Irregularity within a warp is problematic; across warps it is okay!

Principle of the solution: turn intra-warp irregularity into regularity, or into inter-warp irregularity.

Trans-1: Data Reordering

[Diagram: data reordering.
Original: threads tid = 0..7 execute ... = A[P[tid]] with P[] = {0, 5, 2, 3, 2, 3, 7, 6}, scattering accesses across A[].
Transformed: a relocation step copies the accessed elements into a new array A'[], and a redirection step replaces the index array so that threads execute ... = A'[Q[tid]] with Q[] = {0, 1, 2, 3, 2, 3, 6, 7}, making each warp's accesses contiguous. The transformation maintains the mapping between threads and data values.]
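A hedged host-side sketch of the two steps (array names follow the slide; G-Streamline's actual scheduling is more involved). Relocation copies each thread's datum into a contiguous slot of a new array; redirection rewrites the index array. This naive version duplicates repeated elements, which is harmless for read-only data:

    /* Transform ... = A[P[tid]] into ... = Aprime[Q[tid]] (host-side sketch). */
    void data_reorder(const float *A, const int *P, int n,
                      float *Aprime, int *Q) {
        for (int tid = 0; tid < n; tid++) {
            Aprime[tid] = A[P[tid]]; /* relocation: thread tid's datum at slot tid */
            Q[tid]      = tid;       /* redirection: the warp now reads contiguously */
        }
    }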

Trans-2: Job Swapping

• Job = operations + the data elements they access

[Diagram: job swapping.
Original: threads tid = 0..7 execute ... = A[P[tid]] with P[] = {0, 5, 2, 3, 2, 3, 7, 6}.
Transformed: a redirection step swaps jobs between threads: newtid = Q[tid]; ... = A[P[newtid]]; with Q[] = {0, 4, 2, 3, 1, 5, 6, 7}. No data is moved; threads exchange whole jobs.]
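A corresponding CUDA sketch of job swapping (hypothetical kernel; the real transformation chooses Q so each warp's jobs touch one memory segment). No data moves; the permutation Q reassigns whole jobs to threads:

    __global__ void swapped_kernel(const float *A, const int *P,
                                   const int *Q, float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            int newtid = Q[tid];         /* redirection: adopt another thread's job */
            out[newtid] = A[P[newtid]];  /* the operation and its data move together */
        }
    }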

G-Streamline [ASPLOS'2011]

1.08X to 2.5X speedups.

First framework enabling runtime thread-data remapping.

CPU-GPU pipeline to hide transformation overhead.

Kernel splitting to resolve dependences.


Solution 2: Data Placement

[Diagram: the GPU memories: global memory (L1/L2 cache), texture memory (texture cache), shared memory, constant memory, and the read-only cache.]


GPU Memory

• Global memory: coalescing; cache hierarchy

• Texture memory: 2D/3D locality; texture cache; read-only

• Shared memory: on-chip; bank conflicts

• Constant memory: broadcasting; cached; read-only

• L1/L2 cache: private/shared

• Read-only cache: read-only data

• Texture cache: 2D/3D locality; read-only


Data Placement Problem

[Diagram: arrays A, B, C, and D in a program must each be assigned to one of the memories: global memory (L1/L2 cache), texture memory (texture cache), shared memory, constant memory, or the read-only cache. Which placement?]

The choice can make a 3X performance difference.
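As a generic CUDA illustration of why placement matters (not PORPLE output): the same small read-only array can be placed in global or constant memory, with very different access costs:

    #include <cuda_runtime.h>

    __constant__ float coef_c[64];   /* constant memory: cached, broadcast-friendly */

    __global__ void use_global(const float *coef_g, float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = coef_g[tid % 64] * tid;  /* global-memory path */
    }

    __global__ void use_constant(float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = coef_c[tid % 64] * tid;  /* constant-cache path */
    }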


Data Placement Problem

Properties:

• Machine dependent: changes across models/generations

• Input dependent: changes across runs

Options:

• Manual efforts by programmers?

• Offline autotuning?


PORPLE as a Whole

[Diagram: offline, the architect/user writes a memory spec in MSL (memory specification language), aided by microkernels; PORPLE-C (the compiler) analyzes the original program for access patterns and emits a staged program. Online, given the input, a lightweight online profile feeds PLACER (the placing engine), which combines it with the memory spec and access patterns to compute the desired placement; the staged program then runs efficiently with that placement.]

More details in our Micro'2014 paper.


Properties of PORPLE

• Good portability to new memory: just needs a new MSL spec; the program adapts automatically

• Adaptivity to new program inputs: on-the-fly placement with placement-agnostic code (see the sketch below)

• Generality to regular & irregular programs: static analysis + lightweight online profiling
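One way to picture on-the-fly placement with placement-agnostic code (a hypothetical sketch, not PORPLE-C's generated code): the staged program reads data through an accessor that consults the placement chosen at run time:

    /* Hypothetical placement-agnostic read: the memory space is a runtime
     * decision instead of being hard-coded in the kernel. */
    enum Space { SPACE_GLOBAL, SPACE_SHARED };

    __device__ float read_elem(int space, const float *glob,
                               const float *shar, int i) {
        return (space == SPACE_SHARED) ? shar[i] : glob[i];
    }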

GPU models evaluated: K20c, M2075, C1060


Potential for Future Memory Systems

3D Stacked Memory

Persistent Memory

DRAM (NUMA)

Final Takeaways

• Large potential of compilers for modern computing

• The right problem formulation is key

TOP: an algorithmic optimizer; up to 100X speedups.

PORPLE: a portable solution to memory complexity; consistent speedups across GPUs.

G-Streamline: the first framework enabling runtime thread-data remapping on GPU.
