local performance
TRANSCRIPT
EPFL CS-206 – Spring 2015 Lec.3 - 1
CS-206 Concurrency���Lecture 3Performance &EfficiencySpring 2015Prof. Babak Falsafiparsa.epfl.ch/courses/cs206/
Adapted from slides originally developed by Maurice Herlihy and Nir Shavit from the Art of Multiprocessor Programming, and Babak FalsafiEPFL Copyright 2015
cache
BusBus
cachecache
1
shared counter
shared memory
void primePrint { int i = ThreadID.get(); // IDs in {0..9} for (j = i*109+1, j<(i+1)*109; j++) { if (isPrime(j)) print(j); }}
code
Local variables
EPFL CS-206 – Spring 2015 Lec.3 - 2
Lecture& Lab
M T W T F16-Feb 17-Feb 18-Feb 19-Feb 20-Feb23-Feb 24-Feb 25-Feb 26-Feb 27-Feb2-Mar 3-Mar 4-Mar 5-Mar 6-Mar9-Mar 10-Mar 11-Mar 12-Mar 13-Mar16-Mar 17-Mar 18-Mar 19-Mar 20-Mar23-Mar 24-Mar 25-Mar 26-Mar 27-Mar30-Mar 31-Mar 1-Apr 2-Apr 3-Apr6-Apr 7-Apr 8-Apr 9-Apr 10-Apr13-Apr 14-Apr 15-Apr 16-Apr 17-Apr20-Apr 21-Apr 22-Apr 23-Apr 24-Apr27-Apr 28-Apr 29-Apr 30-Apr 1-May4-May 5-May 6-May 7-May 8-May11-May 12-May 13-May 14-May 15-May18-May 19-May 20-May 21-May 22-May25-May 26-May 27-May 28-May 29-May
Where are We?
u Parallelismw Parallel thinkingw Division of workw Performance
u Next weekw Mutual Exclusion
EPFL CS-206 – Spring 2015 Lec.3 - 3
To design well-balanced parallel software we need tothink about how a problem can be solved in parallel,
divide the work evenly among threads,maximize the parallelism and reduce overhead
EPFL CS-206 – Spring 2015 Lec.3 - 4
Example Problem: Galactic Motion
u Computing the mutual interactions of N bodiesw n-body problemsw stars, planets, molecules…w modeled as a tree
u Can approximate influence of distant bodiesu How do we compute this problem in parallel?
EPFL CS-206 – Spring 2015 Lec.3 - 5
Example Problem: Ocean Simulation
u Simulate ocean eddy currentsu Discretize in space and time
w Modeled as grids of elements with velocityw Compute the impact of near neighbor elementsw Iterate until a solution is reached
u Used in weather prediction, climate science, …..
Ocean Basin
EPFL CS-206 – Spring 2015 Lec.3 - 6
Example Problem: Deep Learning
Used in search, machine translation, face recognition, investment banking, …..
EPFL CS-206 – Spring 2015 Lec.3 - 7
Technical distinction: Concurrency vs. Parallelism
u Concurrency: two or more threads make forward progress together
u Parallelism: two or more threads execute at the same timew All parallel threads are concurrent, but not vice versa
u Roughly how many threads vs. how many cores
Thread 1 Thread 2 Thread 1
Thread 1
Thread 2
EPFL CS-206 – Spring 2015 Lec.3 - 8
Terminology
u A Task is a piece of workw Ocean simulation: compute a grid point, row, planew Raytracing: one ray or group of rays
u Task grainw small è fewer operations (less work) per taskw large è more operations (more work) per task
u Threads performs tasksw Threads execute on processors
EPFL CS-206 – Spring 2015 Lec.3 - 9
Forms of Parallelismu Throughput parallelism
w Perform many (identical) sequential tasks at the same timew E.g., Google search, ATM (bank) transactions
u Functional or task parallelismw Perform tasks that are functionally different in parallelw E.g., iPhoto (face recognition with slide show)
u Pipeline parallelismw Perform tasks that are different in a particular orderw E.g., speech (signal, phonemes, words, conversation)
u Data parallelismw Perform the same task on different dataw E.g., Data analytics, image processing
}Re
duce
tim
e fo
r one
job
EPFL CS-206 – Spring 2015 Lec.3 - 10
Division of Work: It’s about Performance
u Balance workloadw Give each parallel task the same rough amount of work
u Reduce communicationw Balance computation time with communication timew Computation à useful work, Communication à overhead
u Reduce extra workw Creating a thread, assigning workw Scheduling threads on processors, OS, etc.
u These are at odds with each other
EPFL CS-206 – Spring 2015 Lec.3 - 11
Example: Division of Work
}
} Overhead
} {Load imbalance
Small tasksLarge tasks
EPFL CS-206 – Spring 2015 Lec.3 - 13
c11 c12 … c1nc21 c22 … c2n cn1 cn2 … cnn
!
"
#####
$
%
&&&&&
=
a11 a12 … a1na21 a22 … a2n an1 an2 … ann
!
"
#####
$
%
&&&&&
×
b11 b12 … b1nb21 b22 … b2n bn1 bn2 … bnn
!
"
#####
$
%
&&&&&
Matrix Multiplication
Where cij = aik ⋅bkjk=1
n
∑
Cn×n( ) = An×n( )× Bn×n( )
EPFL CS-206 – Spring 2015 Lec.3 - 14
c11 c12 … c1nc21 c22 … c2n cn1 cn2 … cnn
!
"
#####
$
%
&&&&&
=
a11 a12 … a1na21 a22 … a2n an1 an2 … ann
!
"
#####
$
%
&&&&&
×
b11 b12 … b1nb21 b22 … b2n bn1 bn2 … bnn
!
"
#####
$
%
&&&&&
Matrix Multiplication
For each i and j:Multiply the entries aik by the entries bkj for k = 1, 2, …, n and summing the results over k
Cn×n( ) = An×n( )× Bn×n( )
EPFL CS-206 – Spring 2015 Lec.3 - 15
class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
Matrix Multiplication
15
B
A
C
n
i
col
row
i
n
n
n
EPFL CS-206 – Spring 2015 Lec.3 - 16
Matrix Multiplication
class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
a thread
EPFL CS-206 – Spring 2015 Lec.3 - 17
Matrix Multiplication
class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
Which matrix entry to compute
EPFL CS-206 – Spring 2015 Lec.3 - 18
Matrix Multiplication
class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
Actual computation
EPFL CS-206 – Spring 2015 Lec.3 - 19
Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
EPFL CS-206 – Spring 2015 Lec.3 - 20
Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
Create n x n threads
EPFL CS-206 – Spring 2015 Lec.3 - 21
Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
Start them
EPFL CS-206 – Spring 2015 Lec.3 - 22
Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
Wait for them to
finish
Start them
EPFL CS-206 – Spring 2015 Lec.3 - 23
Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
Wait for them to
finish
Start them
What’s wrong with this picture?
EPFL CS-206 – Spring 2015 Lec.3 - 24
Thread Overhead
u One thread per taskw One dot product task
u Threads Require resourcesw State:
w Memory for stacksw A copy of the register filew Program state: Program Counter, Stack pointer,….
w Setup, teardownw Scheduler overhead
u Short-lived threadsw Bad ratio of work versus overhead
EPFL CS-206 – Spring 2015 Lec.3 - 25
One More “Big” Performance-Related Axiomu Amdahl’s Law
w In English: if you speed up only a small fraction of the execution time of a program or a computation, the speedup you achieve on the whole application is limited
u Example10 s 90 s
1 s 90 s
A 10x���speedup���on this part!
100s
91s
EPFL CS-206 – Spring 2015 Lec.3 - 26
Amdahl’s Law
Speedup = 1 Fractionenhanced + (1 - Fractionenhanced) Speedupenhanced
Parallel fraction
EPFL CS-206 – Spring 2015 Lec.3 - 27
Amdahl’s Law
Speedup = 1 Fractionenhanced + (1 - Fractionenhanced) Speedupenhanced
Sequentialfraction
EPFL CS-206 – Spring 2015 Lec.3 - 28
Amdahl’s Law
Simple example:Program runs for 100 seconds on a uniprocessor10% of the program can be parallelized on a multiprocessorAssume an ideal multiprocessor with 10 processors
Speedup = 1 Fractionenhanced + (1 - Fractionenhanced) Speedupenhanced
Speedup = 1 = 1 = 1 = 1.1 0.1 + (1-0.1) 0.01 + 0.9 0.91 10
EPFL CS-206 – Spring 2015 Lec.3 - 29
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
1 2 4 8 16 32 64 128
Tim
e
# of Processors
10% 50% 70% 80% 90%
Implications of Amdahl’s Law
Fractionenhanced
EPFL CS-206 – Spring 2015 Lec.3 - 30
Parallel execution is not ideal
u 10 processors rarely get a speedup of 10w Load imbalancew Thread start/join overheadw Communication overhead
u Even if Fractionenhanced is close to 100%w Speedupenhanced << p for p processorsw Our goal is to get it as close as possible to p
EPFL CS-206 – Spring 2015 Lec.3 - 32
Back to Matrix Multiplication
void multiply() { Worker[][] worker = new Worker[n][n]; for (int row …) for (int col …) worker[row][col] = new Worker(row,col); for (int row …) for (int col …) worker[row][col].start(); for (int row …) for (int col …) worker[row][col].join();}
Sequential(1 – Fractionenhanced)
EPFL CS-206 – Spring 2015 Lec.3 - 33
Matrix Multiplication
class Worker extends Thread { int row, col; Worker(int row, int col) { this.row = row; this.col = col; } public void run() { double dotProduct = 0.0; for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
Parallel(Fractionenhanced)
EPFL CS-206 – Spring 2015 Lec.3 - 34
Example: n = 16
u How many threads will our Matrix Multiplication create?
u How many of these threads are concurrent (i.e., degree of concurrency)?
EPFL CS-206 – Spring 2015 Lec.3 - 35
Example: Assume there are 4 processors
u How many threads per processor?
u How many threads run in parallel (i.e., degree of parallelism)?
EPFL CS-206 – Spring 2015 Lec.3 - 36
Best efficiency: Concurrency ~ Parallelismu All work is independentu Max parallelism? 4u Only need 4 threads
w Reduce thread.start(), thread.join() overheadw Only 4 start() and 4 join()w Workers (each thread) should do more workw 16x16 dot products divided by 4 = 64 dot products per thread
Rows 0-3 4-7 8-11 12-15
EPFL CS-206 – Spring 2015 Lec.3 - 37
Matrix Multiplication on “p” cores
void multiply() { BigWorker[] worker = new BigWorker[n]; for (int row=0; row < n; row+=n/p) worker[row] = new BigWorker(row); for (int row=0; row < n; row+=n/p)
worker[row].start(); for (int row=0; row < n; row+=n/p)
worker[row].join();}
EPFL CS-206 – Spring 2015 Lec.3 - 38
Matrix Multiplication: (n/c) x n per worker
void multiply() { BigWorker[] worker = new BigWorker[n]; for (int row=0; row < n; row+=n/p) worker[row] = new Worker(row); for (int row=0; row < n; row+=n/p)
worker[row].start(); for (int row=0; row < n; row+=n/p)
worker[row].join();}
Create p threads
EPFL CS-206 – Spring 2015 Lec.3 - 39
BigWorker: Each thread does (n/p) x n class BigWorker extends Thread { int begin_row; BigWorker(int begin_row) { this.begin_row = begin_row; } public void run() { double dotProduct = 0.0; for (int row=begin_row; row < begin_row+n/p; row++) for (int col=0; col < n; col++) for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
EPFL CS-206 – Spring 2015 Lec.3 - 40
BigWorker: Each thread does (n/p) x n class Worker extends Thread { int begin_row; BigWorker(int begin_row) { this.begin_row = begin_row; } public void run() { double dotProduct = 0.0; for (int row=begin_row; row < begin_row+n/p; row++) for (int col=0; col < n; col++) for (int i = 0; i < n; i++) dotProduct += A[row][i] * B[i][col]; C[row][col] = dotProduct; }}}
Multiple rowsAll columns
EPFL CS-206 – Spring 2015 Lec.3 - 41
What if n is not divisible by p?
u Lets pick n = 20 and p = 16w Matrices of 20x20 running on 16 cores
EPFL CS-206 – Spring 2015 Lec.3 - 42
Parallel Primality Testing
u Challengew Print primes from 1 to 1010
u Givenw Ten-processor multiprocessorw One thread per processor
u Goalw Get ten-fold speedup (or close)
EPFL CS-206 – Spring 2015 Lec.3 - 43
Load Balancing
u Split the work evenlyu Each thread tests range of 109
…
…109 10102·1091
P0 P1 P9
EPFL CS-206 – Spring 2015 Lec.3 - 44
Procedure for Thread i
void primePrint { int i = ThreadID.get(); // IDs in {0..9} for (j = i*109+1, j<(i+1)*109; j++) { if (isPrime(j)) print(j); } }
EPFL CS-206 – Spring 2015 Lec.3 - 45
Issues
u Higher ranges have fewer primesu Yet larger numbers harder to testu Thread workloads
w Unevenw Hard to predict
EPFL CS-206 – Spring 2015 Lec.3 - 46
Issues
u Higher ranges have fewer primesu Yet larger numbers harder to testu Thread workloads
w Unevenw Hard to predict
u Need dynamic load balancing
EPFL CS-206 – Spring 2015 Lec.3 - 48
Procedure for Thread i
int counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); } }
EPFL CS-206 – Spring 2015 Lec.3 - 49
Counter counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); } }
Procedure for Thread i
Shared counterobject
EPFL CS-206 – Spring 2015 Lec.3 - 50
Where Things Reside
cache
BusBus
cachecache
1
shared counter
shared memory
void primePrint { int i = ThreadID.get(); // IDs in {0..9} for (j = i*109+1, j<(i+1)*109; j++) { if (isPrime(j)) print(j); }}
code
Local variables
EPFL CS-206 – Spring 2015 Lec.3 - 51
Procedure for Thread i
Counter counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); } }
Stop when every value taken
EPFL CS-206 – Spring 2015 Lec.3 - 52
Counter counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); } }
Procedure for Thread i
Increment & return each new value
EPFL CS-206 – Spring 2015 Lec.3 - 53
Counter Implementation
public class Counter { private long value; public long getAndIncrement() { return value++; } }
EPFL CS-206 – Spring 2015 Lec.3 - 54
Counter Implementation
public class Counter { private long value; public long getAndIncrement() { return value++; } } OK for single thread,
not for concurrent threads
EPFL CS-206 – Spring 2015 Lec.3 - 55
What It Means
public class Counter { private long value; public long getAndIncrement() { return value++; } }
EPFL CS-206 – Spring 2015 Lec.3 - 56
What It Means
public class Counter { private long value; public long getAndIncrement() { return value++; } }
temp = value; value = temp + 1; return temp;
EPFL CS-206 – Spring 2015 Lec.3 - 57
time
Not so good…
Value… 1
read 1
read 1
write 2
read 2
write 3
write 2
2 3 2
EPFL CS-206 – Spring 2015 Lec.3 - 58
Is this problem inherent?
If we could only glue reads and writes together…
read
write read
write!! !!
EPFL CS-206 – Spring 2015 Lec.3 - 59
Challenge
public class Counter { private long value; public long getAndIncrement() { temp = value; value = temp + 1; return temp; } }
EPFL CS-206 – Spring 2015 Lec.3 - 60
Challenge
public class Counter { private long value; public long getAndIncrement() { temp = value; value = temp + 1; return temp; } }
Make these steps atomic (indivisible)
EPFL CS-206 – Spring 2015 Lec.3 - 61
Hardware Solution
public class Counter { private long value; public long getAndIncrement() { temp = value; value = temp + 1; return temp; } } ReadModifyWrite()
instruction
EPFL CS-206 – Spring 2015 Lec.3 - 62
An Aside: Java™
public class Counter { private long value; public long getAndIncrement() { synchronized { temp = value; value = temp + 1; } return temp; } }
EPFL CS-206 – Spring 2015 Lec.3 - 63
An Aside: Java™
public class Counter { private long value; public long getAndIncrement() { synchronized { temp = value; value = temp + 1; } return temp; } }
Synchronized block
EPFL CS-206 – Spring 2015 Lec.3 - 64
An Aside: Java™
public class Counter { private long value; public long getAndIncrement() { synchronized { temp = value; value = temp + 1; } return temp; } }
Mutual Exclusion