
Towards Intelligent Programming Systems for Modern Computing

Computer Science, North Carolina State University

Xipeng Shen

Unprecedented Scale

2011: 20 petaflops, 10 MW power

202X: 1000 petaflops, 20 MW power

50X performance, 2X power!

Sources: SciDAC, IBM

Heterogeneity Becomes the Norm

Massively parallel accelerators are becoming ubiquitous.

Thesis

One key to addressing the challenges of modern computing lies in making programming systems more intelligent.

In advancing programming systems, formulating the problem the right way goes a long way.

Modern Computing

Application: Data analytics, Machine learning, …
Infrastructure: Data centers, Cloud, IoT, …
Architecture: Heterogeneous parallel processors, Emerging complex memory, …

TOP: Algorithmic optimizer for data analytics [VLDB’15, ICML’15]
GStreamline + PORPLE: Memory optimization for GPU [ASPLOS’11, Micro’14, ICS’16]

TOP: Enabling Algorithmic Optimizations for Distance-Related Problems

VLDB’2015, ICML’2015

Speedups of up to hundreds of times.

Yufei Ding

Role of Compiler

[Diagram: a stack of Learning Problem, ML Algorithm, Implementation, and Execution, annotated with ML experts, compiler, and Runtime system/Architecture co-design.]

Can compilers optimize algorithms?

Why algorithm level?

Reason 1: Large benefits: orders of magnitude speedups at no extra cost.

Reason 2: Compiler may outsmart ML experts. Really?

Example: K-Means

Triangle inequality: |a − b| ≤ d ≤ a + b

[Figure: a point X and two centers C1 and C2 in K-Means, with distances a, b, and d among them.]
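One common way the triangle inequality saves work in K-Means can be sketched as follows. This is an illustrative sketch, not the TOP implementation: the function names, the toy points, and the specific pruning rule (skip a center c_j whenever d(c_best, c_j) ≥ 2·d(x, c_best)) are assumptions chosen for the example.

```python
import math

def dist(p, q):
    return math.dist(p, q)

def nearest_center(x, centers, center_dists):
    """Return (index of the nearest center, number of point-to-center distances computed)."""
    best = 0
    best_d = dist(x, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # Triangle inequality: d(x, c_j) >= d(c_best, c_j) - d(x, c_best).
        # So if d(c_best, c_j) >= 2 * d(x, c_best), c_j cannot be closer than c_best.
        if center_dists[best][j] >= 2 * best_d:
            continue
        d = dist(x, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

# Toy data (hypothetical): three well-separated centers and one point near the first.
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
center_dists = [[dist(a, b) for b in centers] for a in centers]
idx, computed = nearest_center((1.0, 1.0), centers, center_dists)
```

In this toy case the center-to-center distances rule out both other centers, so only one point-to-center distance is computed while the answer matches a brute-force search.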

K-Means: NIPS’2012, SIAM’2010, ICML’2003

K-NN: IJCNN’11, VisionInterface’10, SSDM’10

P2P (Point-to-Point Shortest Path): SIAM’05, ALENEX’04

Observations

• The triangle inequality (TI) has led to many enhanced algorithms across problems and domains.
• Applying TI well is tricky, hence the many manual efforts and publications.

Thoughts

• Can we have an abstraction to represent all the problems?
• Can we then generalize the TI optimizations into compiler-based transformations?

Abstract Distance Problem (Q, T, D, R, C)

Q: Query Point Set
T: Target Point Set
D: Distance
R: Relation
C: Constraints

Instances: KMeans, KNN, ICP, Shortest Distance, NBody, KNN join
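To make the five-element abstraction concrete, here is a minimal sketch of how two of the listed problems might be expressed as (Q, T, D, R, C). The slide only names the five components, so the class name, the field types, the "k-closest" relation string, and the brute-force solver are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple
import math

Point = Tuple[float, ...]

@dataclass
class DistanceProblem:
    queries: Sequence[Point]                    # Q: query point set
    targets: Sequence[Point]                    # T: target point set
    distance: Callable[[Point, Point], float]   # D: distance function
    relation: str                               # R: relation sought (assumed encoding)
    constraints: dict                           # C: constraints (assumed encoding)

def solve_naively(p):
    """Brute force: for each query, the indices of its k closest targets."""
    k = p.constraints["k"]
    result = []
    for q in p.queries:
        order = sorted(range(len(p.targets)),
                       key=lambda j: p.distance(q, p.targets[j]))
        result.append(order[:k])
    return result

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0)]
centers = [(0.0, 1.0), (10.0, 10.0)]

# One K-Means assignment step: every point looks for its single closest center.
kmeans_step = DistanceProblem(points, centers, math.dist, "k-closest", {"k": 1})
# KNN: one query looks for its two closest points.
knn = DistanceProblem([(0.5, 0.0)], points, math.dist, "k-closest", {"k": 2})
```

Both instances share one solver because they differ only in the tuple's contents, which is the property an optimizer working on the abstraction can exploit.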

Our Analysis and Abstraction

[Diagram: KMeans, KNN, ICP, Shortest Distance, NBody, and KNN join are analyzed and abstracted into the Abstract Distance-Related Problem and the Essence & 7 Principles of TI Optimizations.]

Key Insights

• Reuse through landmarks
• Spatial & temporal reuses
• Elasticity through hierarchical landmarks
• Efficient bounds update through ghosts for iterative algorithms
• Order of comparison

[Figure: query points q1, q2, q3 and target points t1, t2, t3 around a landmark L.]

See VLDB’15 for details.
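The first insight, reuse through landmarks, can be illustrated with a small sketch. For a landmark L, the triangle inequality gives |d(q, L) − d(L, t)| ≤ d(q, t) ≤ d(q, L) + d(L, t), so distances to L computed once can bound many query-target pairs. The function names and toy coordinates below are assumptions; the actual landmark selection and bound maintenance are described in the VLDB’15 paper, not here.

```python
import math

def landmark_bounds(d_q_L, d_L_t):
    """Lower and upper bounds on d(q, t) from distances to a shared landmark L."""
    return abs(d_q_L - d_L_t), d_q_L + d_L_t

def closest_target(q, targets, L, d_L_t):
    """Return (index of closest target, number of exact query-target distances computed).

    d_L_t holds precomputed landmark-to-target distances, reusable across queries.
    """
    d_q_L = math.dist(q, L)   # one landmark distance per query
    best, best_d, computed = None, math.inf, 0
    for j, t in enumerate(targets):
        lower, _ = landmark_bounds(d_q_L, d_L_t[j])
        if lower >= best_d:   # t cannot beat the current best: skip the exact distance
            continue
        d = math.dist(q, t)
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

L = (0.0, 0.0)
targets = [(1.0, 0.0), (8.0, 0.0), (0.0, 9.0)]
d_L_t = [math.dist(L, t) for t in targets]
best, computed = closest_target((2.0, 0.0), targets, L, d_L_t)
```

Here the lower bounds for the two far targets already exceed the best distance found, so only one exact query-target distance is needed.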

TOP Framework

[Diagram: the Abstract Distance-Related Problem and the Essence & 7 Principles of TI Optimizations, derived from KMeans, KNN, ICP, Shortest Distance, NBody, and KNN join, lead to the TOP framework, which consists of the TOP API, a Compiler, and an Opt Lib, with connections labeled "problem semantic" and "building blocks".]

Usage

[Diagram: TOP API (basic algorithm description), Compiler, staged program code, TI Opt Lib, efficient execution. Labels: Ad hoc, Systematic.]

    TOP_defDistance(Euclidean);
    T = init();
    changedFlag = 1;
    while (changedFlag) {
        N = TOP_findClosestTargets(1, S, T);
        TOP_update(T, &changedFlag, N, S);
    }
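For readers unfamiliar with the algorithm the TOP loop describes, the following sketch spells out a classic K-Means with the same loop shape. The two helper functions are hypothetical stand-ins whose behavior is inferred from the names TOP_findClosestTargets and TOP_update; they are not the TOP API, and they perform no TI optimization.

```python
import math

def find_closest_targets(points, centers):
    """For each point, the index of its closest center (k = 1)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, centers, assignment):
    """Move each center to the mean of its points; report whether any center changed."""
    changed = False
    new_centers = []
    for j, c in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == j]
        if members:
            c_new = tuple(sum(xs) / len(members) for xs in zip(*members))
        else:
            c_new = c   # keep an empty cluster's center where it is
        changed = changed or c_new != c
        new_centers.append(c_new)
    return new_centers, changed

S = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]   # points
T = [(0.0, 0.0), (10.0, 0.0)]                            # initial centers
changed = True
while changed:
    N = find_closest_targets(S, T)
    T, changed = update(S, T, N)
```

The point of the TOP design, as the slide presents it, is that the programmer writes only this basic description and the compiler plus the optimization library supply the distance-saving machinery.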

K-Means (K=1024)

[Chart: speedup (X) of TOP and Yinyang K-Means. Baseline: classic K-means (16GB, 8-core Intel Ivy Bridge).]

Code link in ICML’15 paper.

Clustering results are the same as the original method’s.

On K-Means

[Chart: speedup (X) of TOP and Yinyang K-Means. Baseline: classic K-means (16GB, 8-core).]

On All Benchmarks

[Charts comparing the manually optimized and TOP-optimized versions of Knn, Knnjoin, Kmeans, ICP, Nbody, and P2P, each with a reference line. Left panel, Speedups: speedups (X) by the manual version on the x-axis and by the TOP version on the y-axis, log scale from 1 to 10^4. Right panel, # distance calculations: count in the manual version on the x-axis and in the TOP version on the y-axis, log scale from 10^6 to 10^13.]

Average speedups: 50X vs 20X. Saves at least 93% of the calculations.

Insight: The right abstraction and formulation turn a compiler into an automatic algorithm optimizer, yielding large speedups.

Intel i5-4570 CPU and 8 GB memory


Overcome GPU Limitations

Guoyang Chen (Qualcomm)
Bo Wu (Prof. @ Colorado Mines)
Zheng Zhang (Prof. @ Rutgers Univ)

Graphics Processing Unit (GPU)

• Massive parallelism
• Favorable
  • computing power
  • cost effectiveness
  • energy efficiency

[Figure: a GPU, with a SIMD group (warp) marked.]

Challenges

• Irregular memory & control
• Dynamic task parallelism
• Scheduling limitations

Our Explorations: Compiler-based software solutions

11/06  CUDA release
9/07   LCPC talk by David Kirk
5/09   IPDPS: cross-input adaptive optimization
6/10   ICS: removing thread divergence dynamically
3/11   ASPLOS: GStreamline
10/11  PACT: treating synchronization correctly (GPU2CPU)
6/12   ICS: synchronization relaxation & optimization (GPU2CPU)
2/13   PPoPP: memory coalescing
9/13   PACT: NVM for GPU
12/14  Micro: PORPLE
5/15   HotOS: co-run on fused
6/15   ICS: SM-centric
12/15  Micro: Free Launch
6/16   ICS: Multiview
2/17   PPoPP: EffiSha
5/17   IPDPS: co-scheduling on fused system

Also: Sweet KNN; VersaPipe; Lean DNN; …

Solutions: Compiler-based software solutions

• Irregular memory & control: GStreamline & PORPLE [ASPLOS’11, Micro’14, ICS’16]
• Dynamic task parallelism: FreeLaunch [Micro’15]
• Scheduling limitations: SM-Centric & EffiSha [ICS’15, PPoPP’17] (Monday, PPoPP Session 1)

Dynamic Irregularities

Memory:

    P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2}
    P[ ] = { 0, 1, 2, 3, 4, 5, 6, 7}
    ... = A[P[tid]];

Control flow (thread divergence):

    A[ ]: 2 4 10 0 6 0 0
    if (A[tid]) {...}
    for (i=0;i<A[tid]; i++) {...}

[Figure: threads tid 0 to 7 accessing array A[ ], with a memory segment marked.]

Can degrade throughput by up to (warp size - 1) times (warp size = 32 in modern GPUs).
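The cost of irregular memory accesses can be estimated by counting how many distinct memory segments each warp touches, since each segment needs its own transaction. The sketch below applies that count to the two P arrays from the slide; the toy warp size of 4 and segment size of 4 elements are assumptions chosen so the 8-thread example spans two warps, not values from the talk.

```python
def transactions_per_warp(indices, warp_size, seg_size):
    """Number of distinct memory segments each warp touches (one transaction each)."""
    counts = []
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        counts.append(len({i // seg_size for i in warp}))
    return counts

irregular = [0, 5, 1, 7, 4, 3, 6, 2]   # P[ ] causing scattered accesses
regular = [0, 1, 2, 3, 4, 5, 6, 7]     # P[ ] giving coalesced accesses

# Toy configuration (assumed): 4 threads per warp, 4 elements per segment.
cost_irregular = transactions_per_warp(irregular, warp_size=4, seg_size=4)
cost_regular = transactions_per_warp(regular, warp_size=4, seg_size=4)

# Worst case at realistic scale: each of 32 threads hits a different segment.
worst = transactions_per_warp([i * 32 for i in range(32)], warp_size=32, seg_size=32)
```

In the worst case every thread of a warp lands in its own segment, so one warp issues as many transactions as it has threads instead of a single one.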

Solution 1: Thread-Data Remapping

[Figure: a warp whose accesses are spread over several memory segments needs 4 transactions per warp; a warp whose accesses fall within one memory segment needs 1 transaction per warp.]

Irregularity within a warp is problematic; irregularity across warps is okay!

Principle of solution: Turn intra-warp irregularity into regularity or into inter-warp irregularity.

Trans-1: Data Reordering

Original:

    P[ ] = {0,5,2,3,2,3,7,6}
    ... = A[P[tid]];

Transformed (<relocation> of A[ ] into A’[ ], then <redirection> through Q[ ]):

    Q[ ] = {0,1,2,3,2,3,6,7}
    ... = A’[Q[tid]];

[Figure: threads tid 0 to 7 accessing A[ ] (original) and A’[ ] (transformed). Legend: tid: thread ID; symbols for a thread, a data access, and a data relocation.]

Maintain the mapping between threads & data values.
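Data reordering can be checked on the slide's own arrays. In the sketch below, the relocation table `new_pos` (swap elements 1 and 5, and 6 and 7) is an assumption: it is one relocation that reproduces the Q[ ] shown on the slide. The helper names and the toy warp and segment sizes of 4 are likewise illustrative.

```python
def transactions_per_warp(indices, warp_size, seg_size):
    """Number of distinct memory segments each warp touches."""
    return [len({i // seg_size for i in indices[s:s + warp_size]})
            for s in range(0, len(indices), warp_size)]

def reorder(A, P, new_pos):
    """Relocate A[i] to position new_pos[i] and redirect the index array accordingly."""
    A2 = [None] * len(A)
    for old, new in enumerate(new_pos):
        A2[new] = A[old]                  # <relocation>
    Q = [new_pos[p] for p in P]           # <redirection>
    return A2, Q

A = list("abcdefgh")
P = [0, 5, 2, 3, 2, 3, 7, 6]
new_pos = [0, 5, 2, 3, 4, 1, 7, 6]        # assumed relocation: swap 1<->5 and 6<->7
A2, Q = reorder(A, P, new_pos)

before = transactions_per_warp(P, warp_size=4, seg_size=4)
after = transactions_per_warp(Q, warp_size=4, seg_size=4)
```

Each thread still reads the same value as before (A2[Q[tid]] equals A[P[tid]]), which is the mapping the slide says must be maintained, while the first warp's accesses now fall in a single segment.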

Trans-2: Job Swapping

• Job = operations + data elements accessed

Original:

    P[ ] = {0,5,2,3,2,3,7,6}
    ... = A[P[tid]];

Transformed (<redirection> through Q[ ]):

    Q[ ] = {0,4,2,3,1,5,6,7}
    newtid = Q[tid];
    . . .
    ... = A[P[newtid]];

[Figure: threads tid 0 to 7 accessing A[ ] in the original and transformed versions.]
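Job swapping leaves the data in place and instead changes which thread performs which job. The sketch below replays the slide's P[ ] and Q[ ] and counts segments per warp before and after; the toy warp size and segment size of 4 are assumptions, as is the helper name.

```python
def transactions_per_warp(indices, warp_size, seg_size):
    """Number of distinct memory segments each warp touches."""
    return [len({i // seg_size for i in indices[s:s + warp_size]})
            for s in range(0, len(indices), warp_size)]

P = [0, 5, 2, 3, 2, 3, 7, 6]   # original: thread tid accesses A[P[tid]]
Q = [0, 4, 2, 3, 1, 5, 6, 7]   # transformed: thread tid takes over job Q[tid]

# Element of A accessed by each thread after swapping: A[P[newtid]], newtid = Q[tid].
accessed = [P[Q[tid]] for tid in range(len(Q))]

before = transactions_per_warp(P, warp_size=4, seg_size=4)
after = transactions_per_warp(accessed, warp_size=4, seg_size=4)
```

Because Q[ ] is a permutation of the thread IDs, every job is still executed exactly once; threads 1 and 4 simply exchange jobs so that the first warp's accesses stay within one segment.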

G-Streamline [ASPLOS’2011]

1.08X to 2.5X speedups

First framework enabling runtime thread-data remapping.

CPU-GPU pipeline to hide transformation overhead.

Kernel splitting to resolve dependences.

Solution 2: Data Placement

• Global memory
• Texture memory
• Shared memory
• Constant memory
• L1/L2 cache
• Read-only cache
• Texture cache

GPU Memory

Global memory: coalescing; cache hierarchy
Texture memory: 2D/3D locality; texture cache; read-only
Shared memory: on-chip; bank conflicts
Constant memory: broadcasting; cached; read-only
L1/L2 cache: private/shared
Read-only cache: read-only data
Texture cache: 2D/3D locality; read-only

Data Placement Problem

[Diagram: data objects A, B, C, D in a program, each to be placed ("?") in global, texture, shared, or constant memory (with the L1/L2, read-only, and texture caches).]

3X performance difference

Data Placement Problem

Properties:

• Machine dependent: changes across models/generations
• Input dependent: changes across runs

Options:

• Manual efforts by programmers?
• Offline autotuning?

PORPLE as a Whole

[Diagram, split into offline and online parts. Components: MSL (memory specification language), PORPLE-C (compiler), PLACER (placing engine). Labels: architect/user, mem spec, original program, access patterns, staged program, microkernels, input, online profile, desired placement, efficient execution.]

More details in our Micro’2014 paper.

Properties of PORPLE

• Good portability to new memory
  • Just need new MSL spec
  • Program adapts automatically
• Adaptivity to new program inputs
  • On-the-fly placement with placement-agnostic code
• Generality to regular & irregular programs
  • Static analysis + lightweight online profiling

GPU Models

• K20c
• M2075
• C1060

Potential for Future Memory Systems

• 3D Stacked Memory
• Persistent Memory
• DRAM (NUMA)

Final Takeaways

• Large potential of compilers for modern computing
• The right problem formulation is key

TOP: An algorithmic optimizer. Up to 100x speedups.

PORPLE: Portable solution to memory complexity. Consistent speedups across GPUs.

GStreamline