Architecting for multiple cores


Page 1: Architecting for multiple cores

© 2009 Rock Solid Knowledge
http://www.rocksolidknowledge.com

Architecting for multiple cores

Andrew Clymer
[email protected]

Page 2: Architecting for multiple cores


Objectives

Why should I care about parallelism?

Types of parallelism

Introduction to Parallel Patterns

Page 3: Architecting for multiple cores


The free lunch is over
Herb Sutter, Dr. Dobb's Journal, 30th March 2005

Page 4: Architecting for multiple cores


Multi-core architectures

Symmetric multiprocessors (SMP): shared memory. Easy to program with, but scalability issues with memory bus contention.

Distributed memory: hard to program against (message-passing model). Scales well given very low latency interconnects or a low rate of message exchange.

Grid: heterogeneous, message passing, no central control.

Hybrid: many SMPs networked together.

Page 5: Architecting for multiple cores


The continual need for speed

For a given time frame X:
Data volume increases: more detail in games, IntelliSense
Better answers: speech recognition, route planning, Word gets better at grammar checking

Page 6: Architecting for multiple cores


Expectations

Amdahl's Law: if only n% of the compute can be parallelised, then at best you can get an n% reduction in the time taken.

Example: 10 hours of compute, of which 1 hour cannot be parallelised. The minimum time to run is 1 hour plus the time taken for the parallel part, so you can only ever achieve a maximum of 10x speed-up.
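In the standard form of the law, if a fraction P of the work can be parallelised across N cores, the best-case speed-up is

S(N) = \frac{1}{(1 - P) + P/N}

For the example above P = 0.9, so even with unlimited cores S = 1 / 0.1 = 10, the 10x ceiling quoted.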

Not even the most perfect parallel algorithms scale linearly:
Scheduling overhead
Combining results overhead

Page 7: Architecting for multiple cores


Types of parallelism

Coarse grain parallelism: large tasks, little scheduling or communication overhead.

Each service request is executed in parallel; parallelism is realised by many independent requests. Or: do X, do Y, do Z and combine the results, where X, Y and Z are large independent operations.

Applications: web page rendering, servicing multiple web requests, calculating the payroll.

Page 8: Architecting for multiple cores


Types of parallelism

Fine grain parallelism: a single activity broken down into a series of small co-operating parallel tasks.

For example, f(n) for n = 0...X can possibly be performed by many tasks in parallel for given values of n, or a computation structured as Input => Step 1 => Step 2 => Step 3 => Output.

Applications: sensor processing, route planning, graphical layout, fractal generation (trade show demo).

Page 9: Architecting for multiple cores


Moving to a parallel mindset

Identify areas of code that could benefit from concurrency

Keep in mind Amdahl’s law

Adapt algorithms and data structures to assist concurrency

Lock free data structures

Utilise a parallel framework:
Microsoft Parallel Extensions (.NET 4)
OpenMP
Intel Threading Building Blocks

Thinking Parallel

Page 10: Architecting for multiple cores


Balancing safety and concurrency

Reduce shared data: maximise "free running" work to produce scalability.

Optimise shared data access: select the cheapest form of synchronization.

Consider lock-free data structures.

If a larger proportion of time is spent reading than writing, consider optimising for read, as in the sketch below.
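One way to optimise for read in .NET is ReaderWriterLockSlim, which admits many concurrent readers while writers take exclusive access. The cache type below is an illustrative sketch, not from the slides:

using System.Collections.Generic;
using System.Threading;

// Hypothetical read-mostly cache: many readers proceed in parallel,
// writers take the lock exclusively.
class ReadMostlyCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public bool TryGet(TKey key, out TValue value)
    {
        _lock.EnterReadLock();            // shared: other readers are not blocked
        try { return _items.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    public void Set(TKey key, TValue value)
    {
        _lock.EnterWriteLock();           // exclusive: blocks readers and writers
        try { _items[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}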

Page 11: Architecting for multiple cores


Parallel patterns

Fork/Join

Pipeline

Geometric Decomposition

Parallel loops

Monte Carlo

Page 12: Architecting for multiple cores


Fork/Join

Book Hotel, Book Car, Book Flight

Sum X + Sum Y + Sum Z

Task<int> sumX = Task.Factory.StartNew(SumX);
Task<int> sumY = Task.Factory.StartNew(SumY);
Task<int> sumZ = Task.Factory.StartNew(SumZ);

// Accessing Result blocks until each task has completed
int total = sumX.Result + sumY.Result + sumZ.Result;

Parallel.Invoke(BookHotel, BookCar, BookFlight);
// Won't get here until all three tasks are complete

Page 13: Architecting for multiple cores


Pipeline

When an algorithm has:
Strict ordering, first in first out
Large sequential steps that are hard to parallelise

Each stage of the pipeline represents a sequential step, and runs in parallel

Queues connect tasks to form the pipeline

Example: signal processing
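A minimal sketch of the queue-connected approach using .NET 4's BlockingCollection; the stage bodies here are trivial stand-ins rather than real signal processing:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Queues connect the stages; each stage runs as its own task,
// but items still flow through in strict FIFO order.
var toStage2 = new BlockingCollection<int>(boundedCapacity: 100);
var toStage3 = new BlockingCollection<int>(boundedCapacity: 100);

var stage1 = Task.Factory.StartNew(() =>
{
    foreach (var sample in Enumerable.Range(0, 1000)) // stand-in for a signal source
        toStage2.Add(sample * 2);                     // stand-in for a sequential step
    toStage2.CompleteAdding();                        // tell the next stage no more items are coming
});

var stage2 = Task.Factory.StartNew(() =>
{
    foreach (var item in toStage2.GetConsumingEnumerable())
        toStage3.Add(item + 1);                       // another sequential step
    toStage3.CompleteAdding();
});

var stage3 = Task.Factory.StartNew(() =>
{
    foreach (var item in toStage3.GetConsumingEnumerable())
        Console.WriteLine(item);                      // stand-in for the output stage
});

Task.WaitAll(stage1, stage2, stage3);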

Page 14: Architecting for multiple cores


Pipeline

Advantages: easy to implement.

Disadvantages: in order to scale you need as many stages as cores.

Hybrid approach: parallelise a given stage further to utilise more cores.

Page 15: Architecting for multiple cores


Geometric Decomposition

Task parallelism is one approach; data parallelism is another.

Decompose the data into smaller subsets and have identical tasks each work on a smaller data set.

Combine the results.
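A minimal sketch of the idea, not from the slides: summing a large array by giving each task its own contiguous sub-range and then combining the partial sums (a simple reduction like this needs no edge sharing):

using System;
using System.Linq;
using System.Threading.Tasks;

// Decompose the array into one contiguous sub-range per core,
// sum each sub-range in its own task, then combine the partial results.
double[] data = Enumerable.Range(0, 1000000).Select(i => (double)i).ToArray();
int chunks = Environment.ProcessorCount;
int chunkSize = (data.Length + chunks - 1) / chunks;

var partials = new Task<double>[chunks];
for (int c = 0; c < chunks; c++)
{
    int start = c * chunkSize;
    int end = Math.Min(start + chunkSize, data.Length);
    partials[c] = Task.Factory.StartNew(() =>
    {
        double sum = 0;
        for (int i = start; i < end; i++)   // each task touches only its own region
            sum += data[i];
        return sum;
    });
}

double total = partials.Sum(t => t.Result); // combine step (blocks on each Result)
Console.WriteLine(total);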

Page 16: Architecting for multiple cores


Geometric Decomposition

Each task runs on its own sub-region; often tasks need to share edge data.

Updating and reading that edge data needs to be synchronised with neighbouring tasks.

Edge complications: [diagram of the data grid split into sub-regions handled by Task 1 through Task 6, with shared edges between neighbours]

Page 17: Architecting for multiple cores


Parallel loops

Loops can be successfully parallelised IF:
The order of execution is not important.
Each iteration is independent of the others.
Each iteration has sufficient work to compensate for the scheduling overhead.

If there is insufficient work per loop iteration, consider creating an inner loop, as in the sketch below.
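A minimal sketch of both forms using the .NET 4 Parallel Extensions; the loop body is an arbitrary stand-in workload:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var results = new double[1000000];

// Straightforward parallel loop: fine when each iteration carries enough work.
Parallel.For(0, results.Length, i =>
{
    results[i] = Math.Sqrt(i);              // stand-in per-iteration work
});

// Cheap iterations: hand each task a whole range and run an inner sequential
// loop, so the scheduling cost is paid per chunk rather than per iteration.
Parallel.ForEach(Partitioner.Create(0, results.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        results[i] = Math.Sqrt(i);
});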

Page 18: Architecting for multiple cores


Cost of Loop Parallelisation

Task Scheduling Cost = 50

Iteration Cost = 1

Total Iterations = 100,000

Sequential Time would be 100,000

Assuming fixed cost per iteration

Time = (Number of Tasks / Number of Cores) * (Task Scheduling Cost + (Iteration Cost * Iterations Per Task))

Number of Cores    Iterations Per Task    Number of Tasks    Time
2                  100                    1000               75,000
2                  50,000                 2                  50,050
4                  100                    1000               37,500
4                  25,000                 4                  25,050
256                100                    1000               586
256                391                    256                440
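The same arithmetic expressed as a small helper (a sketch; the method and parameter names are mine, not from the slide):

// Estimated parallel time under the slide's fixed-cost model.
static double EstimateTime(int cores, int tasks, double schedulingCost,
                           double iterationCost, double iterationsPerTask)
{
    double perTask = schedulingCost + iterationCost * iterationsPerTask;
    return (double)tasks / cores * perTask;
}

// e.g. EstimateTime(cores: 4, tasks: 4, schedulingCost: 50,
//                   iterationCost: 1, iterationsPerTask: 25000) == 25050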

Page 19: Architecting for multiple cores


Travelling Salesman problem

Calculate the shortest path between a series of towns, starting and finishing at the same location.

Possible combinations: n factorial.

15 cities produces 1,307,674,368,000 combinations (15!).

Took 8 cores approx. 18 days to calculate the exact shortest route.

Estimate: 256 cores would take approx. 0.5 days (18 days × 8 / 256 ≈ 0.56 days).

Page 20: Architecting for multiple cores


Monte Carlo

Some algorithms, even after heavy parallelisation, are too slow.

Often some answer in time T is better than a perfect answer at some later time T+.

Real-time speech processing, route planning.

Machines can make educated guesses too.

Thinking out of the box

Page 21: Architecting for multiple cores


Monte Carlo: Single Core

[Chart: % error for a 15-city route allowing 2 seconds for calculation using a single core. X axis: runs (1-100); Y axis: % error (0-50).]

Page 22: Architecting for multiple cores


Monte Carlo: 8 Cores

[Chart: % error for a 15-city route allowing 2 seconds for calculation using eight cores. X axis: runs (1-100); Y axis: % error (0-4).]

Page 23: Architecting for multiple cores


Monte Carlo: Average % Error over 1-8 Cores

[Chart: average % error for a 15-city route allowing 2 seconds for calculation. X axis: number of cores (1-8); Y axis: average % error (0-9).]

Page 24: Architecting for multiple cores


Monte Carlo

Each core runs independently making guesses.

Scales: the more cores, the greater the chance of a perfect guess. As we approach 256 cores this looks very attractive.

Strategies: pure random; hybrid (random + logic). A sketch of the pure random strategy follows.
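An illustrative sketch of the pure random strategy (my own code, assuming a symmetric distance matrix): one task per core shuffles candidate routes for the time budget, and the shortest route found by any core wins.

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Pure random Monte Carlo guess at a shortest tour: one task per core,
// each independently shuffling routes and keeping its best within the budget.
static int[] GuessBestRoute(double[,] distances, TimeSpan budget)
{
    int cities = distances.GetLength(0);
    var perCore = new Task<Tuple<double, int[]>>[Environment.ProcessorCount];

    for (int c = 0; c < perCore.Length; c++)
    {
        perCore[c] = Task.Factory.StartNew(() =>
        {
            var rnd = new Random(Guid.NewGuid().GetHashCode()); // per-task RNG
            var best = Enumerable.Range(0, cities).ToArray();
            double bestLength = RouteLength(best, distances);
            var candidate = (int[])best.Clone();
            var timer = Stopwatch.StartNew();

            while (timer.Elapsed < budget)
            {
                Shuffle(candidate, rnd);                        // random guess
                double length = RouteLength(candidate, distances);
                if (length < bestLength)
                {
                    bestLength = length;
                    candidate.CopyTo(best, 0);
                }
            }
            return Tuple.Create(bestLength, best);
        });
    }

    // Combine: take the shortest route found by any core.
    return perCore.Select(t => t.Result).OrderBy(r => r.Item1).First().Item2;
}

// Tour length, returning to the starting city.
static double RouteLength(int[] route, double[,] d)
{
    double total = 0;
    for (int i = 0; i < route.Length; i++)
        total += d[route[i], route[(i + 1) % route.Length]];
    return total;
}

// Fisher-Yates shuffle: a fresh random permutation each call.
static void Shuffle(int[] route, Random rnd)
{
    for (int i = route.Length - 1; i > 0; i--)
    {
        int j = rnd.Next(i + 1);
        int tmp = route[i]; route[i] = route[j]; route[j] = tmp;
    }
}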

Page 25: Architecting for multiple cores


Summary

Free Lunch is over

Think concurrent

Test on a variety of hardware platforms

Utilise a framework and patterns to produce scalable solutions, now and for the future.

Andrew Clymer
[email protected]
http://www.rocksolidknowledge.com (Blogs / Screencasts)