
Page 1

Performance of Parallel Programs

Fall 2011

Programming Practice

Bernd Burgstaller, Yonsei University

Bernhard Scholz, The University of Sydney

Page 2

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling

Page 3

Limits to Performance Scalability

Not all programs are “embarrassingly” parallel.

Programs have sequential parts and parallel parts.

Another example: color -> grayscale conversion (Assignment 2)

Sequential part 1: read the input data from file.

Parallel part: perform the grayscale conversion.

Sequential part 2: write the converted data to the output file.

1 a = b + c;
2 d = a + 1;
3 e = d + a;
4 for (k = 0; k < e; k++)
5     M[k] = 1;

Sequential part: cannot be parallelized because of data dependencies.

Parallel part: no data dependence, the different loop iterations (Line 5) can be executed in parallel.

Page 4

Amdahl’s Law

In 1967, Gene Amdahl, then a mainframe computer architect at IBM, stated that

“The performance improvement to be gained from some faster mode of execution is limited by the fraction of the time that the faster mode can be used.”

“Faster mode of execution” here means program parallelization.

The potential speedup is defined by the fraction of the code that can be parallelized.

(Timing diagram: sequential part (25 seconds) + parallelizable part (50 seconds) + sequential part (25 seconds) = 100 seconds. Use 5 threads T1–T5 for the parallelizable part: 25 seconds + 10 seconds + 25 seconds = 60 seconds.)

Page 5

Amdahl’s Law (cont.)

Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67

The parallel version is 1.67 times faster than the sequential version.

(Same timing diagram as on the previous slide: 25 s + 50 s + 25 s = 100 s sequentially; with 5 threads T1–T5 for the parallelizable part, 25 s + 10 s + 25 s = 60 s.)

Page 6

Amdahl’s Law (cont.)

We may use more threads executing in parallel for the parallelizable part, but the sequential part will remain the same.

The sequential part of a program limits the speedup that we can achieve!

Even if we theoretically could reduce the parallelizable part to 0 seconds, the best possible speedup in this example would be

speedup = 100 seconds / 50 seconds = 2.

(Timing diagram: 25 s + 50 s + 25 s = 100 s sequentially; with the parallelizable part reduced to 2 seconds, 25 s + 2 s + 25 s = 52 seconds.)

Page 7

Amdahl’s Law (cont.)

p = fraction of work that can be parallelized.

N = the number of threads executing in parallel.

new_running_time = (1 - p) * old_running_time + (p * old_running_time) / N

Speedup = old_running_time / new_running_time = 1 / ((1 - p) + p / N)

Observation: if the number of threads N goes to infinity, the speedup becomes 1 / (1 - p).

Parallel programming pays off for programs which have a large parallelizable part.

Page 8

Amdahl’s Law (Examples)

p = fraction of work that can be parallelized.

N = the number of threads executing in parallel.

speedup = old_running_time / new_running_time = 1 / ((1 - p) + p / N)

Example 1: p=0, an embarrassingly sequential program.

Example 2: p=1, an embarrassingly parallel program.

Example 3: p=0.75, N = 8.

Example 4: p=0.75, N = ∞.
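As an illustration (not from the original slides), a small C program that evaluates Amdahl's formula for these examples; the helper name amdahl_speedup is made up:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p / N) */
static double amdahl_speedup(double p, double N) {
    return 1.0 / ((1.0 - p) + p / N);
}

int main(void) {
    printf("p=0.00, N=8:   %.2f\n", amdahl_speedup(0.00, 8));   /* 1.00  */
    printf("p=1.00, N=8:   %.2f\n", amdahl_speedup(1.00, 8));   /* 8.00  */
    printf("p=0.75, N=8:   %.2f\n", amdahl_speedup(0.75, 8));   /* ~2.91 */
    printf("p=0.75, N=1e9: %.2f\n", amdahl_speedup(0.75, 1e9)); /* approaches 1/0.25 = 4 */
    return 0;
}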

Page 9

(Graph: Speedup according to Amdahl's Law — speedup factor (0 to 100) versus number of cores (1 to 16384), plotted for 80%, 90%, 95%, and 99% parallel code.)

Page 10

Gustafson’s Law

In 1988, John L. Gustafson reported the achieved speedup of an experiment as speedup = np + p * N, where np is the non-parallelizable and p the parallelizable fraction of the execution time on the parallel machine (np + p = 1).

Seemed like contradiction to Amdahl’s law:

With Amdahl’s law, no matter how many cores we add, the sequential part of the application imposes upper bound on possible speedup (see previous slide).

With Gustafson’s law, unlimited parallelism benefits application.

Example: N=256, p=99%, np=1%: speedup = 0.01 + 0.99 * 256 = 253.45.
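A small C sketch (not from the original slides) contrasting the two laws for these numbers; the function names amdahl and gustafson are made up:

#include <stdio.h>

/* Amdahl: the sequential fraction s of the original workload is fixed.      */
static double amdahl(double s, double N)     { return 1.0 / (s + (1.0 - s) / N); }

/* Gustafson: np is measured on the N-core run; the workload scales with N.  */
static double gustafson(double np, double N) { return np + (1.0 - np) * N; }

int main(void) {
    double N = 256;
    printf("Amdahl,    s  = 1%%: %.2f\n", amdahl(0.01, N));    /* ~72: bounded by 1/s = 100   */
    printf("Gustafson, np = 1%%: %.2f\n", gustafson(0.01, N)); /* 253.45: keeps growing with N */
    return 0;
}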

Page 11

Gustafson's Law (cont.)

Amdahl's Law uses a fixed (absolute) sequential portion.

That way, the sequential portion does not scale with higher N.

Gustafson's Law scales down the sequential part (np) with higher N: relative to the scaled workload, the sequential fraction becomes np / (np + p * N), which shrinks as N grows.


Sequential portions differ with Amdahl’s and Gustafson’s Laws.

Assume:

Amdahl's and Gustafson's Laws can be proven equivalent if we factor in the differences wrt. the scalability of the sequential portions of the code.

Page 12

Gustafson's Law (cont.)

Amdahl's Law uses a fixed (absolute) sequential portion.

That way, the sequential portion does not scale with higher N.

Gustafson's Law scales down the sequential part (np) with higher N:


Example: from Slide 10:

N=256, T(256)=100s, np(256)=1%, p(256)=99%

T(1)=1+256*99=25345s

Amdahl: ts(1)/T(1) = 1/25345 ≈ 0.0000395, i.e. about 0.004% (compare with np=1% on Slide 10!)

Amdahl with these numbers: speedup = 1 / (1/25345 + (1 - 1/25345)/256) = 25345/100 = 253.45 (compare with Gustafson's speedup on the previous slide!)

Page 13

(Graph: Speedup according to Gustafson's Law — speedup factor (0 to 300) versus number of cores (1 to 256), plotted for 50%, 80%, 95%, and 99% parallel code.)

Page 14

Amdahl’s and Gustafson’s Laws

Fundamental question: given an application, is the speedup potential bounded (as implied by Amdahl’s Law), or unbounded (as implied by Gustafson’s law)?

Answer lies in the sequential part of the application:

If the share of the sequential part stays constant as the number of available cores increases → Amdahl's Law: the application has an upper bound on the possible speedup.

• Example: application consisting of grayscale conversion (parallel part) and file I/O (sequential part). If file I/O does not become faster, then speedup capped (limited) through Amdahl’s Law.

If the share of the sequential part of the application diminishes as more cores are added → Gustafson's Law.

• Example: application consisting of grayscale conversion (parallel part) and file I/O (sequential part). If file I/O becomes faster as we add more cores (e.g., because I/O itself was improved/parallelized), then Gustafson's Law applies.

In both cases, the sequential portion of the application needs to be reduced to achieve higher speedups.

Page 15

Performance Scalability

On the previous slides, we assumed that performance directly scales with the number N of threads/cores used for the parallelizable part p.

E.g., if we double the number of threads/cores, we expect the execution time to drop by half. This is called linear speedup.

However, linear speedup is an “ideal case” that is often not achievable.

(Graph: speedup versus number of threads/cores; the sequential program is the baseline with speedup 1.)

The speedup graph shows that the curves level off as we increase the number of cores. This is a result of keeping the problem size constant: the amount of work per thread decreases, so the overhead for thread creation and so on becomes more significant.

Other reasons possible also (memory hierarchy, highly-contended lock, ...).

Efficiency = speedup / N. An ideal efficiency of 1 indicates linear speedup; efficiency > 1 indicates super-linear speedup (see the sketch at the end of this slide).

Super-linear speedups are possible due to the increased number of registers and the larger total cache capacity available with multiple cores.

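A minimal sketch (not from the original slides) of how speedup and efficiency relate, reusing the 5-thread example from slide 5:

#include <stdio.h>

int main(void) {
    double t_seq = 100.0;   /* sequential running time in seconds (example from slide 5) */
    double t_par = 60.0;    /* running time with N threads                               */
    int N = 5;

    double speedup    = t_seq / t_par;   /* 1.67: far from the ideal (linear) value of 5 */
    double efficiency = speedup / N;     /* 0.33: ideal efficiency would be 1            */
    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}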

Page 16

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling

Page 17

Granularity of parallelism = frequency of interaction between threads

Fine-grained Parallelism

Small amount of computational work between communication / synchronization stages.

Low computation to communication ratio.

High communication / synchronization overhead.

Less opportunity for performance improvement.

Coarse-grained Parallelism

Large amount of computational work between communication / synchronization stages.

High computation to communication ratio.

Low communication / synchronization overhead.

More opportunity for performance improvement.

Harder to load-balance effectively.


Page 18

Load Balancing Problem

Threads that finish early have to wait for the thread with the largest amount of work to complete.

This leads to idle times and lowers processor utilization.

(Diagram: between two synchronization points, the threads perform unequal amounts of work; threads that finish early sit idle until the slowest thread reaches the synchronization.)

Page 19

Load Balancing

Load imbalance: work not evenly assigned to cores/SPEs. Underutilizes parallelism!

The assignment of work, not data, is key.

Static assignment = assignment at program-writing time:

cannot be changed at run-time.

More prone to imbalance.

Dynamic assignment = assignment at run-time. The quantum of work must be large enough to amortize the overhead.

Example: threads fetch work from a work queue (see the code sketch after the diagram below).

With flexible allocations, load balancing can be addressed late in the design/programming cycle.


(Diagram: a shared work queue; over steps 1–3, threads T1 and T2 repeatedly dequeue the next work item as they become free.)
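A minimal pthread sketch (not from the original slides) of dynamic assignment via a shared work queue; the item count, thread count, and the process() placeholder are assumptions (compile with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

#define NUM_ITEMS 1000

static int next_item = 0;                         /* index of the next unassigned work item */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void process(int item) { (void)item; }     /* placeholder for the actual work */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);          /* fetch the next work item */
        int item = next_item < NUM_ITEMS ? next_item++ : -1;
        pthread_mutex_unlock(&queue_lock);
        if (item < 0)
            break;                                /* queue empty: thread is done */
        process(item);                            /* the work quantum must amortize the locking overhead */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (int i = 0; i < 4; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
    return 0;
}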

Page 20

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling

Page 21

Let’s consider a simple problem

Sum up the numbers in array[]. The array has length values.

Sequential solution:

int array[length];
sum = 0;

for (i = 0; i < length; i++) {
    sum = sum + array[i];
}

Page 22

Write a parallel program

We need to know something about the machine...

Let’s use a multicore architecture!

(Diagram: a dual-core machine — Core 0 and Core 1, each with its own L1 cache, sharing an L2 cache and main memory (RAM).)

Main memory access is extremely slow compared to the processing speed of today’s CPUs.

A cache is a small but fast memory that is used to store copies of data from the main memory.

Holds copies of data from the most recently used locations in main memory.

Memory hierarchy:

• registers → L1 cache → L2 cache → main memory.

Cache miss: CPU wants to access data which is not in the cache.

• Need to access data in level below, eventually in main memory (slow!).

Page 23

Divide into several parts...

Solution for several threads: divide the array into several parts.

array: 1 4 3 12 1 42 16 17 9 45 33 12 18 4 7 22

length=16 t=4

Thread 0 Thread 1 Thread 2 Thread 3

int length_per_thread = length / t;
int start = id * length_per_thread;
sum = 0;

for (i = start; i < start + length_per_thread; i++) {
    sum = sum + array[i];
}

What's wrong?

Page 24

Divide into several parts...

Solution for several threads: divide the array into several parts.

array: 1 4 3 12 1 42 16 17 9 45 33 12 18 4 7 22

length=16 t=4

Thread 0 Thread 1 Thread 2 Thread 3


int length_per_thread = length / t;
int start = id * length_per_thread;
sum = 0;
pthread_mutex_t m;

for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}

sum is a shared variable. We need to ensure mutual exclusion when accessing sum from different threads.

Page 25

The correct program is slow...

We are mainly acquiring/releasing the mutex. lock()/unlock() induces huge overhead!

Threads serialize at the mutex. Have to wait for each other.

Memory traffic for sum and mutex variable.

(Bar chart: running times for an array of 16 million unsigned chars — sequential: 62; mutex-based parallel versions: 925 and 3100.)

Page 26

Closer look: motion of sum and m

The variables sum and m need to be moved across the memory hierarchy because both threads access them.

Takes time.

Lock contention at mutex m.

Overhead for lock/unlock operations.

int length_per_thread = length / t;
int start = id * length_per_thread;
unsigned long long sum = 0;
pthread_mutex_t m;

for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}

(Diagram: the dual-core memory hierarchy — Core 0 and Core 1 with private L1 caches, a shared L2 cache, and RAM — across which sum and m keep moving.)

Page 27

Accumulate into private sum variable

Each thread adds into its own memory (variable private_sum[]).

Combine sums at the end.

#define t 4
int length_per_thread = length / t;
int start = id * length_per_thread;
unsigned long long private_sum[t] = {0};
unsigned long long sum = 0;
pthread_mutex_t m;

for (i = start; i < start + length_per_thread; i++) {
    private_sum[id] = private_sum[id] + array[i];
}

pthread_mutex_lock(&m);
sum = sum + private_sum[id];
pthread_mutex_unlock(&m);
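For reference, a self-contained sketch (not from the original slides) of the private-sum approach; the thread count, array size, and test data are assumptions, and the main thread combines the private sums after pthread_join() instead of using the mutex (compile with gcc -pthread):

/* Note: without padding, the private_sum slots share cache lines (see the false-sharing slides). */
#include <pthread.h>
#include <stdio.h>

#define T 4
#define LENGTH (16 * 1024 * 1024)

static unsigned char array[LENGTH];
static unsigned long long private_sum[T];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    int length_per_thread = LENGTH / T;
    int start = id * length_per_thread;
    for (int i = start; i < start + length_per_thread; i++)
        private_sum[id] += array[i];           /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[T];
    unsigned long long sum = 0;

    for (int i = 0; i < LENGTH; i++)
        array[i] = (unsigned char)(i & 0xff);  /* arbitrary test data */

    for (int i = 0; i < T; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < T; i++) {
        pthread_join(tid[i], NULL);
        sum += private_sum[i];                 /* combine after joining: no mutex required */
    }
    printf("sum = %llu\n", sum);
    return 0;
}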

Page 28

Keeping up, but not gaining

The 1-thread version is almost as good as the sequential version,

apart from the overhead for creating threads.

It gets worse with 2 or more threads. With the 9th thread, we introduce a new cache line and things get slightly better.

However, still not faster than the sequential version!

(Bar chart: running times of the private-sum version for an array of 16 million unsigned chars, with the sequential time shown for comparison.)

Page 29

False sharing

A private sum variable does not imply a private cache line.

(Diagram: private_sum[0] and private_sum[1] share one cache line; copies of that line are pulled from RAM through the shared L2 cache into the L1 caches of both Core 0 and Core 1.)

Page 30

False sharing (zoomed in)

CPU cannot load individual variables into cache

Can only load in units of cache-lines

64 bytes, usually

2 thread-private variables in the same cache-line are as bad as a single shared variable

the cache line ping-pongs between the cores when the two threads simultaneously access their private counters

(Diagram: Thread 0 on Core 0 and Thread 1 on Core 1 both access the same cache line; after being loaded from main memory, the line is repeatedly transferred between the two L1 caches via the shared L2 cache.)

Page 31

False sharing (solution)

Through padding memory, we can force each private counter into a separate cache line

the cache-line with a thread’s private counter is loaded into the core’s L1 cache

no need to exchange the other counter’s cachelines between L1 caches.

(Diagram: after padding, priv_sum[0] and priv_sum[1] occupy separate cache lines; each core's L1 cache holds only the line with its own thread's counter, so no cache lines bounce between the L1 caches.)

Page 32

Force into different cache lines

• Padding the private sum variables forces them into separate cache lines and removes false sharing.

• Two threads now almost twice as fast as one:

struct padded_int {
    unsigned long long value;   // 8 bytes wide
    char padding[56];           // assuming a cache line is 64 bytes
} private_sum[MaxThreads];
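With the padded struct, the accumulation loop from slide 27 only changes in how the private slot is addressed; a minimal sketch (the function name padded_worker and its parameter list are assumptions):

/* id, start, length_per_thread, array, and private_sum are set up as on the previous slides */
void padded_worker(int id, int start, int length_per_thread,
                   const unsigned char *array, struct padded_int *private_sum) {
    for (int i = start; i < start + length_per_thread; i++) {
        /* each thread touches only its own slot; the 56-byte padding keeps
           the slots in different cache lines, so no false sharing occurs   */
        private_sum[id].value += array[i];
    }
}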

(Bar chart: running times of the padded version for an array of 16 million unsigned chars; with padding, two threads are almost twice as fast as one, and the sequential time of 62 is shown for comparison.)

Page 33

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss (next)

Performance Measurement & Profiling

Page 34

The 4 Sources of Performance Loss

Linear speedup with parallel processing is frequently not attainable, for the following reasons:

1) Overhead which sequential computation does not have to pay

2) Non-parallelizable computation

3) Idle processors

4) Contention for resources

All other sources are special cases of these four!

In next slides, we’ll consider those four sources of performance loss.

From the textbook, Chapter 3.

Page 35

Example: The 4 Sources of Performance Loss

Task: Let’s assume we want to paint the four walls of our class-room.

A single painter will take 1 day to paint all four walls, and then it takes 1 day for the paint to dry.

Adding more painters to the job will speed things up, but we'll face the 4 sources of performance loss:

1) Overhead which sequential computation does not have to pay:

we need to decide which painters will paint which parts (walls) of the class-room

2) Non-parallelizable computation:

after the painters finish, it’ll take 1 day for the paint to dry

3) Idle processors:

if we use 1M painters, they have to queue up to enter the class-room and do their paint-job

4) Contention for resources

Assume we only have one bucket of paint shared by all participating painters...

Page 36

Loss 1: Overhead

Any cost incurred by the parallel solution but not required by the sequential solution is called overhead.

Examples: creating/joining threads, synchronization, ...

Overhead 1: Communication

Because sequential solution does not have to communicate with other threads, all communication is overhead.

• Example: private sums have to be communicated back to the main thread.

Cost for communication depends on hardware

• Examples: shared-memory communication is cheaper than sending messages over a network; non-uniform memory access (NUMA) architectures.

Overhead 2: Synchronization

Arises when one thread must wait for an event on another thread.

Example: thread must wait for another thread to free resource, e.g., mutex, in the sum computation example. Sequential solution does not require synchronization!

Page 37

Loss 1: Overhead (cont.)

Overhead 3: Computation

Parallel computations almost always perform extra computations not needed in sequential solution.

Examples:

• threads in the grayscale conversion need to compute the part of the array which they should work on.

• initialization of counters and so on is executed in each of the parallel threads.

Overhead 4: Memory

Parallel computations frequently require more memory than sequential computations.

• Example: padding in the parallel sum computation.

Padding is a small memory overhead that actually improves performance.

Applications with significant memory overhead will experience performance loss.

Page 38

Loss 2: Non-parallelizable computations

Non-parallelizable computations limit potential benefit from parallelism.

Amdahl’s law observes that if 1/S of a computation is inherently sequential, maximum speedup limited to a factor of S.

Special case: redundant computations

N threads execute loop k times:

for (i = 0; i < k; i++) { ... }

all N threads increment their private loop variable, test loop condition, ...

Idle threads waiting for one thread to perform computation

Example: disk I/O

Page 39

Loss 3: Idle Times

Ideally, all processors/cores are working all the time.

Thread might become idle because

1) of lack of work

2) it is waiting for an external event, such as the arrival of data from another thread

• Example: producer-consumer relationship

3) Load Imbalance

• uneven distribution of work to processors/cores

• Example: sequential computation running on single core, all other cores idle

4) Memory-bound computations

• Main memory very slow compared to CPU

• Caches hide memory latency, but applications with poor locality or large data sets spend hundreds of CPU cycles waiting for data from main memory.

Page 40

Loss 4: Contention for Resources

Contention is degradation of system performance caused by competition for a shared resource.

Example: highly contended lock.

Contention can degrade performance beyond the overhead observed from a 1-thread solution.

it means: 2 threads can be even slower than 1 thread

Example: both true sharing and false sharing create excessive load on the memory bus (bus traffic)

Affects even threads that access other memory locations (and hence do not contend for shared data)!

Page 41

Performance Trade-offs

We have seen 4 sources of performance loss.

Limiting one source of performance loss may increase another

Important to understand the trade-offs between different sources of performance loss.

Trade-off 1: Communication versus Computation

Often possible to reduce communication overhead by performing additional computations.

Redundant computations

• Re-compute value locally (within a thread) instead of transmitting from other thread

• works if cost (re-computation) < cost (transmission)

• Example: pseudo-random numbers

Overlapping Communication and Computation

• Communication that is independent of the computation can be performed while the computation is going on (hiding communication behind computation).

• Makes program more complicated (DMA-transfers, wait-for-completion, ...)

• Examples: in our lecture on OpenCL programming.

Page 42

Performance Trade-offs (cont.)

Trade-off 2: Memory versus Parallelism

Parallelism can often be increased at cost of increased memory usage

Privatization (e.g., private_sum variable)

Padding

Trade-off 3: Overhead versus Parallelism

Parallelize Overhead

• The final accumulation of the private_sums (Slide 27) can become a bottleneck with a large number of threads → parallelize the overhead itself!

• We will discuss this in a subsequent lecture.

Load-Balance versus Overhead

• increased parallelism can improve load-balance

• by over-decomposing the problem (= overhead) into fine-grained work-units

Granularity Trade-offs

• Increase the granularity of interaction

• Examples:

1) each thread grabs several work items at once from the work queue

2) send a whole row/column of a matrix instead of individual elements

Page 43

Summary: Common Performance Bottlenecks

Huge sequential part of a program.

Amdahl’s Law tells us that we cannot gain much in such a case.

Highly contended lock

All threads need to queue up to enter their critical section.

High communication overhead, little computation.

Remember: computation is our goal! Communication and synchronization is just overhead due to multiple threads.

Sharing of variables (mutexes and semaphores are also variables!)

• rand() uses a seed variable (shared!) → use rand_r() instead (see the sketch after this list)!

False sharing of variables in a cache line.

Poor load balancing.

Some threads do all the work (are overworked!), other threads are idle. Processor resources are not well-utilized.

Scalar code

Use vector units (SIMDization of scalar code) as much as possible.
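As a minimal illustration of the rand_r() point (not from the original slides; the thread function and seed choice are assumptions):

#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* each thread keeps its own seed, so no hidden shared state is touched */
void *worker(void *arg) {
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(long)arg;
    long local = 0;
    for (int i = 0; i < 1000000; i++)
        local += rand_r(&seed) % 128;   /* rand_r() only uses the caller's seed */
    return (void *)local;
}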

Page 44

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling (next)

With special thanks to our interns Hyewon, Changwook and Jongmyeong!

• For their kind exploration and lecture materials on Intel VTune

Page 45

Measuring Performance

Execution time ... what’s time?

We can measure the time from start to termination of our program.

Also called wallclock time.

gcc foo.c
time ./a.out
real    0m4.159s    <- elapsed wallclock time
user    0m0.008s    <- time executed in user mode
sys     0m0.026s    <- time executed in kernel mode (see next slide)


Page 46

Difference between user and system time

A program requests services from the operating system (file I/O, dynamic memory, ...) via calls to the operating system kernel. These are called system calls.

Depicted in blue in the above code example.

The time spent in system calls is counted as system time.

The rest of the code is executed outside of the system kernel. Counted as user time.

Depicted in orange in the above example.

Question: why is user time + system time ≠ wallclock time ?

(Diagram: the Linux operating system kernel schedules the threads of program 1 (T1–T3), program 2 (T1–T2), and program 3 (T1–T3) onto Core 1 and Core 2.)

#include <stdio.h>

int main() {
    int i, *j, k;
    FILE *f = fopen(...);
    i = 255;
    j = malloc(255 * sizeof(int));
    for (k = 0; k < i; k++)
        *(j + k) = 0;
    ...
}


Page 47

Wanted: an unloaded machine

Principal problem with wallclock time: different programs share the available CPUs/cores.

The Linux kernel schedules the threads of all programs on those CPUs/cores.

Every thread gets to run for a little while (its timeslice), then another thread gets to run.

Programs get in the way of each other. If we are interested in the execution time of program 1, the measured wallclock time contains also time executing program 2 and 3.

(Diagram: the Linux kernel time-slices the threads of programs 1–3 (T1–T8) onto Core 1 and Core 2; between the start and the termination of program 1, the cores also execute threads of programs 2 and 3.)

Page 48

Measuring Performance

On an unloaded machine, wallclock time gives us a crude program performance indication.

Run the program several times

Difference between cold start and warm start

• Linux file system caches

• instruction caches, data caches

Wallclock time does not tell us in which parts of the program we spend time.

gettimeofday() allows us to measure wallclock time for specific parts of the program (see the sketch after the code below).

We will study more elaborate performance measurement methods in the following slides.

int main(void) {

    float s1 = 0, s2 = 0;
    int i, j;

    /* record start-time */

    for (j = 0; j < T; j++) {
        for (i = 0; i < N; i++) {
            b[i] = 0;
        }
        for (i = 0; i < N; i++) {
            s1 += a[i] + b[i];
            s2 += a[i] * a[i] * b[i];
        }
    }

    /* record stop-time */
}
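A minimal sketch (not from the original slides) of how the start/stop markers above could be implemented with gettimeofday():

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval start, stop;

    gettimeofday(&start, NULL);      /* record start-time */
    /* ... code region to be measured ... */
    gettimeofday(&stop, NULL);       /* record stop-time */

    double elapsed = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_usec - start.tv_usec) / 1e6;
    printf("elapsed wallclock time: %.6f s\n", elapsed);
    return 0;
}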

Page 49

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling

With the kind help of our internship students:

Choi Hyewon, Jung Changwook and Lee Jongmyeong

Page 50

Dynamic Profiling

Profiling = Program Instrumentation + Execution

CPUs provide performance counters to measure various events:

cache hits

cache misses

number of instructions executed

pipeline stalls

number of loads from memory

number of stores to memory

And then there is the Heisenberg effect: measuring can perturb a program’s execution time!

GNU gprof

Intel VTune (http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239144.htm)

OProfile

Valgrind

Many more!

Page 51

Dynamic Profiling

Profiling through instrumentation

manual or automatic adding of counters

• requires changes to your program → Heisenberg effect!

measure exactly what you want, and where you want

Event-based profiling

CPU provides hw events that it can count with performance counters

• for examples see previous page

works without modifying code

but you should compile with –g (for source-level information)

Sample-based profiling

interrupt execution every x seconds and sample

works without modifying code (-g, same as above)


Page 52

Common Profiling Workflow

gcc -g

-g tells gcc to generate debug information for source-level analysis & display of results

(Workflow diagram: application source code → compile → binary → application runs (profiling executions) → performance profile → profile interpretation, binary analysis, and source correlation & UI.)
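For example, with GNU gprof (one of the tools listed on slide 50) the workflow could look roughly as follows; the file names are illustrative:

gcc -g -pg -o myapp myapp.c           # -pg adds gprof instrumentation, -g adds debug info
./myapp                               # profiling execution; writes gmon.out
gprof ./myapp gmon.out > profile.txt  # profile interpretation / source correlation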

Page 53

Intel VTune Amplifier XE 2011

A dynamic profiler, provided by Intel.

• It works on both Intel/AMD hardware, but the advanced analyses are available only for Intel processors, esp. Core(TM) 2 processors and later versions.

• Environment:

• 32/64-bit, x86-based architectures.

• Platforms: MS Windows, Linux (what we're discussing).

On Nov. 30th at 17:00 we'll have a seminar from Intel on VTune and related tools! Please mark it in your calendar now!

Page 54

Intel VTune Amplifier XE 2011 (cont.)

• Algorithm Analysis

– Hotspots: figuring out the time-consuming part of the code.

– Locks and Waits: analyzing the parts where threads wait for some synchronizations and I/O operations. We focus on waiting caused by synchronization, i.e. lock and unlock operations on mutexes and semaphores.

• Advanced Analysis: analyzing hardware issues, e.g., false and true sharing, memory access patterns, etc.

Page 55

Example Source Code

int array[length];
sum = 0;

get_randarray();   // fill array with random data

for (i = 0; i < length; i++) {
    sum = sum + array[i];
}

void get_randarray() {
    size_t i;
    srand((unsigned)time(NULL) + (unsigned)getpid());
    for (i = 0; i < NrElements; ++i)
        data[i] = ((rand() % 128000000) + 1);
}

Page 56

Hot-Spot Analysis

Elapsed wallclock time: 1.932s

computation time across all threads: 2.54s

1.596 + 0.92 + 0.016 + 0.008

VTune identified get_randarray as the function consuming the most execution time (1.596 s).

This is where you would start optimizing your program

functions are listed longest-running to shortest

The 4 threads consumed altogether 0.92 s.

Total thread count: main thread plus 4 Pthreads

Page 57

Hotspots by CPU usage

For most of the time (1.5 s), only one CPU core was running.

For a small fraction of the execution time, 2 or 4 cores were executing in parallel.

ideally, 80%–100% of the system's cores should be running (green range)

Page 58

Example Source Code

int array[length];
sum = 0;

get_randarray();   // fill array with random data

pthread_mutex_t m;

for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}

Let’s check the locking overhead...

Page 59

Locks and Waits

Wait time: time (sum) of all threads spent waiting for mutex.

Spin time: the time threads are actively burning CPU cycles in a synchronization construct

Wait count: the overall number of times the system wait API was called for the analyzed application.

Link to Intel documentation

Page 60

Locks and Waits (cont.)

Very little parallelism, due to mutex serialization

Page 61

Timeline

The threads are busy synchronizing, almost no work done...


Page 62

Intel VTune Sample Callgraph

Page 63

Outline

The fundamental laws of parallelism

Amdahl’s Law

Gustafson’s Law

Scalability

Load Balancing

Sources of Performance Loss

False sharing case study

The 4 sources of performance loss

Performance Measurement & Profiling

With the kind help of our internship students:

Choi Hyewon, Jung Changwook and Lee Jongmyeong