Programming with Relaxed Synchronization

© 2012 IBM Corporation
Towards Approximate Computing: Programming with Relaxed Synchronization
Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan, Ravi Nair (presenting), Dan Prener
IBM T.J. Watson Research Center
October 21, 2012

Upload: racesworkshop

Post on 01-Jul-2015


DESCRIPTION

Presentation by Ravi Nair. Paper and more information: http://soft.vub.ac.be/races/paper/programming-with-relaxed-synchronization/

TRANSCRIPT


Page 2: Programming with Relaxed Synchronization


Emerging Use of Computing

- Emerging important applications
  - Search
    - Inexact answers to inexact queries
  - Proactive delivery of information
    - Advertisements, offers, news, alerts, advice
  - Media
    - Approximate rendition of audio, video, or images
  - Stream Processing
    - Fast decisions based on possibly incomplete data
- These applications benefit in cost-performance through the use of prediction techniques
- They allow trading off accuracy of results for speed of delivering results

Page 3: Programming with Relaxed Synchronization


Approximate Computing

- Current techniques
  - Exact (hence energy-inefficient) computation done on exact (hence expensive) hardware to produce results that need not be exact
  - With increasing unreliability and variability of devices in new generations of technology, the cost of such computation will increase dramatically
- What is needed: Approximate Computing
  - Computational models that allow computational results to fall in some acceptable range, and
  - Hardware that produces results that fall in some acceptable range

Page 4: Programming with Relaxed Synchronization


Dimensions of Approximate Computing

[Figure: three axes of approximation. Hardware: Reliable to Variable. Computation: Exact to Less Exact. Data: Accurate to Less Accurate, less up-to-date, possibly corrupted. The diagram marks "Computing model today".]

Page 5: Programming with Relaxed Synchronization


Improving Efficiency through Approximation

[Figure: three axes of approximation. Hardware: Reliable to Variable. Computation: Exact to Less Exact. Data: Accurate to Less Accurate, less up-to-date, possibly corrupted. The diagram marks "Computing model today", "Future model", and "Human Brain", with an arrow labeled "Path to Efficiency".]

Page 6: Programming with Relaxed Synchronization


Relaxed Synchronization

[Figure: three axes of approximation. Hardware: Reliable to Variable. Computation: Exact to Less Exact. Data: Accurate to Less Accurate, less up-to-date, possibly corrupted. The diagram marks "Computing model today", "Relaxed Synchronization", and "Human Brain", with an arrow labeled "Path to Efficiency".]

Page 7: Programming with Relaxed Synchronization


Relaxing Synchronization: Challenges and Opportunities

- Principal uses of synchronization
  - Ensuring that all threads see consistent values of shared variables (e.g. atomic updates with locks)
    - Best candidate for relaxing synchronization
  - Ensuring that threads reach various points in their execution in a predictable manner (e.g. barrier)
    - Can relax (depending on context)
  - Parallel update of data structures (e.g. linked lists)
    - Difficult to relax (can lead to fatal crash of the program)
- Key questions about relaxed synchronization
  - Will the program produce results that are acceptable?
  - If acceptable, would the results be produced in considerably less time?

Page 8: Programming with Relaxed Synchronization


Kmeans Clustering

- Partition data into k clusters
  - Randomly select k “initial” means
  - Create k clusters by associating every data point with the nearest centroid
  - Recompute centroids of k clusters
  - Iterate until centroids do not change
- Widely used
  - Market segmentation, data mining, computer vision, astronomy
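The four steps on this slide can be sketched in a few lines of sequential Python. This is a minimal illustration, not the benchmark code used in the talk: the function names, the `max_iters` safety cap, and the fixed `seed` are assumptions added for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means; max_iters and seed are assumptions for this sketch."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # randomly select k initial means
    assignment = []
    for _ in range(max_iters):
        # Associate every data point with the nearest centroid.
        assignment = [nearest(p, centroids) for p in points]
        # Recompute the centroid of each cluster (keep the old one if empty).
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        # Iterate until the centroids do not change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of 2-D tuples returns one centroid per group and a cluster index for every point.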

Page 9: Programming with Relaxed Synchronization


Kmeans Synchronization Overhead

- Kmeans: 8 clusters with large input set
- OpenMP based parallel code
- Parallel execution on an IBM Power7 machine

[Figure: percent of clustering time (0% to 100%) split into Synchronization and Compute, versus number of threads (2 to 8).]

Synchronization time as high as 90% for 8 threads

Page 10: Programming with Relaxed Synchronization


Relaxing Synchronization in Kmeans (NU-MineBench)

- Parallel evaluation of all points moving between clusters
  - Synchronization is used to update list of points moving
- Stopping criterion: } while (delta > threshold && loop++ < 500);
  - Number of points moving (delta) < 0.1% of total points (threshold)
  - Subject to a maximum of 500 iterations
- This is already an approximation
- What would happen if synchronization were relaxed?

[Figure: Performance for 8 Clusters. Execution time in seconds (0 to 60) for Original and Relaxed, and Speedup (0 to 10), versus number of threads (2 to 8).]
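To make the stopping criterion concrete, here is a small sequential sketch of the loop for 1-D data. It is not the NU-MineBench source. The function name `cluster_until_stable` is hypothetical, and the `drop_rate` parameter is a modeling assumption: it imitates an unsynchronized counter by losing a fraction of the increments to `delta`, whereas real relaxed code would simply omit the lock or atomic around the update.

```python
import random

def cluster_until_stable(points, centroids, drop_rate=0.0, seed=0):
    """1-D k-means loop using the slide's stopping rule.

    drop_rate > 0 simulates a relaxed delta counter by losing that
    fraction of increments (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centroids = list(centroids)
    membership = [-1] * len(points)
    threshold = 0.001 * len(points)        # 0.1% of total points
    loop = 0
    while True:                            # stands in for C's do { ... } while (...)
        delta = 0
        for i, x in enumerate(points):
            best = min(range(len(centroids)),
                       key=lambda c: abs(x - centroids[c]))
            if best != membership[i]:
                membership[i] = best       # the point moves regardless
                if rng.random() >= drop_rate:
                    delta += 1             # a lost update skips this count
        for c in range(len(centroids)):
            members = [x for x, m in zip(points, membership) if m == c]
            if members:
                centroids[c] = sum(members) / len(members)
        loop += 1
        # Stop when few points moved, or after at most 500 passes.
        if not (delta > threshold and loop < 500):
            break
    return membership, centroids, loop
```

With `drop_rate=0.0` the counter is exact. With a non-zero rate the loop may see a smaller `delta` and stop earlier, which is one way relaxation can trade accuracy for time.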

Page 11: Programming with Relaxed Synchronization


Performance Improvement

[Figure: Performance Speedup with Relaxed Synchronization. Speedup (0 to 14) versus number of clusters (4 to 13), with one series each for 2, 3, 4, 5, 6, 7, and 8 threads.]

Significant speedup observed in many combinations. No combination showed a degradation in performance.

Page 12: Programming with Relaxed Synchronization


Quality of Solution

- Computed total sum of distances of all points from the centroids of clusters they belong to
  - Good indicator of quality of solution
- Each bar in graph represents a different relaxed run
  - Non-determinism is a side-effect of relaxation

Quality of solution is not degraded due to relaxed synchronization

[Figure: Degradation in Quality of Solution, 8 clusters, 8 threads. Difference between relaxed cost and original cost (0.00% to 0.35%) for run numbers 1 to 10.]
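The quality measure on this slide can be written down directly. The sketch below assumes plain Euclidean distance (the slide does not say whether distances are squared), and the function names are hypothetical.

```python
import math

def clustering_cost(points, centroids, assignment):
    """Total distance of every point from the centroid of its cluster.

    Euclidean distance is an assumption of this sketch.
    """
    return sum(math.dist(p, centroids[a]) for p, a in zip(points, assignment))

def degradation_percent(relaxed_cost, original_cost):
    """Relative difference between relaxed and original cost, in percent."""
    return 100.0 * (relaxed_cost - original_cost) / original_cost
```

A relaxed run would be scored by computing `clustering_cost` for both the relaxed and the original clustering and passing the two totals to `degradation_percent`.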

Page 13: Programming with Relaxed Synchronization


Graph500

- Breadth-first search (BFS) of a very large scale-free graph
- Representative of certain real-life analytics problems
  - E.g. mail a flyer to all people who are connected directly and transitively to a certain person or sets of persons
- Experiments performed on a version of the award-winning IBM solution

[Figure: Sample Scale-Free Graphs]

Page 14: Programming with Relaxed Synchronization


Initial Experiment: Relaxing Synchronization

- The parallel examination of nodes on a wavefront requires an atomic update of a data structure representing list of visited nodes
  - Atomic update is a synchronizing operation
- Question: What would happen if we updated the data structure with a simple OR, rather than an atomic-OR operation?
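The difference between the two kinds of OR can be shown with a deterministic sketch. A simple OR on a shared word is a read, a modify, and a write; if two workers both read before either writes, one bit is lost. The names `plain_or_interleaved` and `AtomicWord` are hypothetical, and because Python has no hardware atomic OR, the atomic version is emulated here with a lock.

```python
import threading

def plain_or_interleaved(word, bit_a, bit_b):
    """Replay the worst-case interleaving of two non-atomic ORs.

    Both workers load the word before either one stores its result.
    """
    read_a = word              # worker A loads the visited word
    read_b = word              # worker B loads the same old value
    word = read_a | bit_a      # A stores its update
    word = read_b | bit_b      # B overwrites it, so A's bit is lost
    return word

class AtomicWord:
    """Stand-in for an atomic OR, emulated with a lock (an assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_or(self, mask):
        """OR mask into the word as one indivisible step; return the old value."""
        with self._lock:
            old = self._value
            self._value = old | mask
            return old

    @property
    def value(self):
        with self._lock:
            return self._value
```

In a BFS, a lost bit makes a vertex look unvisited again. That fits the redundant visits and the small number of missing or mislabeled vertices the talk reports for relaxed runs.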

Page 15: Programming with Relaxed Synchronization


Sample Result

- Each column represents a different relaxed run
  - Relaxed results are non-deterministic
- Results
  - Very few vertices missing
  - Very few vertices with wrong level numbers
  - Better than 3x the performance
- In most cases, it was possible to detect all missing vertices by simply validating the obtained tree

Configuration: CPUS 8, THREADS 16, SCALE 20

| Measurement | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| Time taken with accurate sync | 0.248 | 0.248 | 0.248 | 0.248 | 0.248 |
| Time taken with relaxed sync | 0.077 | 0.078 | 0.078 | 0.077 | 0.078 |
| Performance gain with relaxed sync | 3.20 | 3.20 | 3.20 | 3.20 | 3.19 |
| Vertices visited in accurate list | 646056 | 646056 | 646056 | 646056 | 646056 |
| Vertices visited in relaxed list | 645985 | 645991 | 645990 | 646002 | 645993 |
| Vertices common in both | 645985 | 645991 | 645990 | 646002 | 645993 |
| Missing vertices | 29 | 35 | 34 | 46 | 37 |
| Other vertices with wrong level numbers | 71 | 49 | 66 | 34 | 59 |

Page 16: Programming with Relaxed Synchronization


Relax-and-Check: A Framework to Exploit Relaxed Synchronization

- Assume there is a criterion that allows us to estimate the quality of a solution without adding high cost
- Relax-and-Check
  - Modifies an existing program
  - Run initially without synchronizations
  - Revert to fully synchronized version if quality does not fall within acceptable limits
- Such criteria are often already explicit in programs
  - E.g. convergence criteria
- Other “syndromes” can often be identified
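The control flow of Relax-and-Check is small enough to sketch as a higher-order function. This is an illustration of the idea on the slide, not the authors' framework: the function name and the three callables (`run_relaxed`, `run_synchronized`, `is_acceptable`) are hypothetical placeholders for an existing program's two variants and its quality criterion.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def relax_and_check(run_relaxed: Callable[[], T],
                    run_synchronized: Callable[[], T],
                    is_acceptable: Callable[[T], bool]) -> Tuple[T, str]:
    """Try the relaxed variant first; fall back if quality is not acceptable.

    Returns the result and a label saying which variant produced it.
    """
    result = run_relaxed()                  # run initially without synchronization
    if is_acceptable(result):               # cheap quality check, e.g. convergence
        return result, "relaxed"
    return run_synchronized(), "synchronized"   # revert to the safe version
```

The approach pays off only when the check is much cheaper than the synchronized run, which is why the slide assumes a low-cost quality criterion.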

Page 17: Programming with Relaxed Synchronization


Sample Results

- Three benchmarks from STAMP suite
- Acceptability criteria of results identical to original
- Significant reduction of run time due to relaxation of synchronization requirement

[Figure: one chart per benchmark: SSCA2, Kmeans (STAMP), Labyrinth.]

Page 18: Programming with Relaxed Synchronization


Relax-and-Check Strategy for BFS: A Slight Twist

- Very high correlation between validation errors and time to solution
  - Possibly because of redundant visit of vertices
  - Also, these occurred principally when hardware parameters were not suitable for problem size
- Use validation errors as measure of quality
  - Validate the solution produced by approximate technique
  - If number of errors exceeds a threshold, say 1000, revert to synchronized solution for future instances with similar problem size

[Figure: Performance vs. validation errors. Performance relative to accurate solution (0 to 3.5) versus number of validation errors (0 to 4000), with one series each for scale = 18, 20, 22, and 24.]

Relax-and-Check provides a technique to exploit the advantages of relaxed synchronization with a quality safety-net
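The BFS variant of the strategy can be sketched in two parts: a validator that counts inconsistencies in the BFS levels without a reference solution, and a policy that remembers which problem sizes should use the synchronized code from then on. Both are simplified assumptions, not the Graph500 validation rules or the authors' implementation. In particular, "similar problem size" is modeled here as an exact match on `scale`, and all names are hypothetical.

```python
def count_validation_errors(adj, root, level):
    """Count inconsistencies in BFS levels for an undirected graph.

    adj: dict vertex -> list of neighbours (integer vertex ids).
    level: dict vertex -> BFS level; unvisited vertices are absent.
    """
    errors = 0
    if level.get(root) != 0:
        errors += 1
    for u, neighbours in adj.items():
        for v in neighbours:
            if u < v:                                # check each edge once
                lu, lv = level.get(u), level.get(v)
                if (lu is None) != (lv is None):
                    errors += 1                      # edge leads to a missed vertex
                elif lu is not None and abs(lu - lv) > 1:
                    errors += 1                      # levels across an edge differ by > 1
    for v, lv in level.items():
        if v != root and not any(level.get(u) == lv - 1 for u in adj[v]):
            errors += 1                              # no neighbour one level up
    return errors

class RelaxedBfsPolicy:
    """Use relaxed BFS until a scale produces too many validation errors."""

    def __init__(self, max_errors=1000):
        self.max_errors = max_errors
        self.synchronized_scales = set()

    def run(self, scale, relaxed_bfs, synchronized_bfs, count_errors):
        if scale in self.synchronized_scales:
            return synchronized_bfs()
        result = relaxed_bfs()
        if count_errors(result) > self.max_errors:
            # Revert for future instances of this size; this run keeps its result.
            self.synchronized_scales.add(scale)
        return result
```

The threshold default of 1000 follows the example value on the slide. A real deployment would pick the threshold and the notion of "similar size" from measurements like those in the figure.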

Page 19: Programming with Relaxed Synchronization


Applications Suitable for Relax-and-Check

| Domain / Applications | Examples |
|---|---|
| Scientific | Finite Difference Methods, Successive over-relaxation, N-body methods |
| Data Mining, Text Mining, Risk Analysis, Machine Learning | K-means, Hierarchical clustering, Multinomial regression, Feature reduction |
| Numerical optimization, Forecasting, Financial Modeling, and Mathematical Programming | Non-convex optimization solvers, constrained optimization, multi-dimensional cubing |
| EDA | VLSI place and route |
| Graph processing and Social Network Analysis | Mesh refinement, Metis style graph partitioning, Graph traversals |
| Program analyses in compilers and software engineering | Iterative data flow analysis |

A large majority of applications in these domains choose a good solution from many correct solutions, and use acceptance criteria such as convergence tests, fixed-point tests, or boolean acceptability tests

Page 20: Programming with Relaxed Synchronization


Move Towards Attributes of the Human Brain

[Figure: three axes of approximation. Hardware: Reliable to Variable. Computation: Exact to Less Exact. Data: Accurate to Less Accurate, less up-to-date, possibly corrupted. The diagram marks "Computing model today", "Future model", and "Human Brain", with an arrow labeled "Path to Efficiency".]