TRANSCRIPT
Performance Architecture: Introduction & exercises
P. Bakowski
[email protected]
P. Bakowski 2
What is performance?
Computing performance relates to the execution speed/time of a given task (program).
The execution time can be evaluated as the product of:
execution_time =
  number of instructions *
  number of clock cycles per instruction *
  clock cycle time
Number of instructions
The number of instructions represents the complexity (length) of the task/program.
It depends on:
source program complexity
the choice/suitability of the instruction set
clock cycle time
Discussion: CISC versus RISC
Number of clock cycles per instruction
The number of clock cycles per instruction reflects the complexity and the degree of parallelism of the internal architecture.
A single instruction needs several clock cycles, elaborated through several stages:
- Fetch
- Decode
- Execute
The essential « acceleration » mechanism is pipelining.
Clock cycle
The clock cycle is related to the clock frequency:
clock_cycle = 1/clock_frequency
This parameter depends on:
- the number of logic levels (gates) between the synchronization registers (2-16)
- the technology switching speed (5-50 ps)
Execution time - exercise
Exercise:
Given a program that involves the execution of 1 million instructions,
2 clock cycles per instruction (execution rate),
and
a 1 GHz clock,
what is the execution time for this program?
Very simple pipelining
[diagram: three successive instructions flow through the fetch | decode | execute stages, each shifted by one clock cycle]
Instruction types:
- calculation (no stall)
- branches (stall: 2 instructions lost)
- memory access (stall: additional clock cycle)
Pipeline performance - exercise
In this exercise we evaluate the execution speed-up for a given pipeline.
Each pipeline stage corresponds to one clock cycle.
The length of the pipeline is a parameter.
Pipeline performance
The executed program is a mix of instructions:
- calculation instructions may be executed without a pipeline stall
- conditional branch instructions involve bps stages of delay (stall)
- memory access instructions involve mps stages of delay
Pipeline performance
Given the clock frequency, the number of stages, the number of instructions to execute, and the proportion of conditional/memory_access/calculation instructions,
we can evaluate the necessary execution time for non-pipelined and pipelined architectures.
Pipeline performance
clock frequency in MHz:
number of pipeline stages:
total instruction number in millions (inb):
ratio of branch instructions [e.g. 0.2]:
average stall penalty for branch instructions [e.g. 4]:
ratio of memory access instructions [e.g. 0.1]:
average stall penalty for memory access instructions [e.g. 3]:
Pipeline performance
The non-pipelined execution time is given as:
npipe_time = clock_cycle * number_of_stages * number_of_instructions
The pipelined execution time is calculated as:
calrt = 1.0 - brrt - memrt
calrt - ratio of calculation instructions
pipe_time = clock_cycle*inb*calrt + clock_cycle*inb*brstall*brrt + clock_cycle*inb*memstall*memrt
Pipeline and cache memory
[diagram: the execute (write/read) stage accesses the cache memory; a hit is served by the cache, a miss goes to main memory]
Cache memory access
[diagram: on a hit the cache serves the access directly; on a miss a whole line is transferred from main memory into the cache]
Cache memory : access time
[diagram: hit/miss path between the cache and main memory; a miss transfers a whole line]
Exercise:
Hit ratio: 99%
Cache access time: 2 ns
Main memory access time: 20 ns
Data transfer time (bus): 2 Gb/s
Line size: 128 bytes
Question: What is the average access time to this memory system?
Three walls
Power wall (4 GHz => 160 W):
Unacceptable growth in power usage with clock rate.
Three walls
Instruction-level parallelism (ILP) wall: limits to available low-level parallelism.
Max 3-4 instructions in parallel.
Memory wall: a growing discrepancy of processor speeds relative to memory speeds.
Exercise:
Calculate the memory bandwidth for:
a 4 GHz clock and 4 instructions per clock cycle.
Symmetric multiprocessing
The simplest form of multiprocessing is based on an architecture of identical processing units communicating via shared memory: Symmetric Multi-Processing (SMP).
SMP implies the usage of threads.
[diagram: four CPUs, each with a private cache, connected to a shared memory]
Sequential-parallel execution
[diagram: a main thread runs sequentially, forks into parallel threads, then joins and continues sequentially]
Amdahl's law
The execution time of a mixed, sequential-parallel program, after the acceleration of the parallel part, is given by a simple equation known as Amdahl's law:
tex_new = tex_par/acceleration + tex_seq
Exercise:
Calculate the necessary acceleration to obtain the required new execution time.
Take tex_par = 10 s, tex_seq = 5 s, tex_new = 12 s.
OpenMP
OpenMP is a shared-memory application programming interface (API) whose features facilitate shared-memory parallel programming.
OpenMP's directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.
The first step in creating an OpenMP program from a sequential one is to identify the parallelism it contains. Basically, this means finding instructions, sequences of instructions, or even large regions of code that may be executed concurrently by different processors.
OpenMP

  #pragma omp ...
  {
    ...code...
  }

where the ...code... is called the parallel region.
That is the area of the code that you want to try to run in parallel, if possible.
When the compiler sees the start of the parallel region, it creates a pool of threads.
When the program runs, these threads start executing, and are controlled by the information in the directives.
OpenMP
The number of threads is controlled by an environment variable, OMP_NUM_THREADS. It could default to 1 or to the number of cores on your machine.

  export OMP_NUM_THREADS=4

  #include "stdio.h"
  #include <omp.h>

  int main(int argc, char *argv[])
  {
    #pragma omp parallel
    {
      printf("hello multicore user!\n");
    }
    return(0);
  }

Question: What message is displayed on the screen?
OpenMP – CPUs and threads

  #include "stdio.h"
  #include <omp.h>

  int main(int argc, char *argv[])
  {
    #pragma omp parallel
    {
      int NCPU, tid, NPR, NTHR;
      /* get the total number of CPUs/cores available for OpenMP */
      NCPU = omp_get_num_procs();
      /* get the current thread ID in the parallel region */
      tid = omp_get_thread_num();
      /* get the total number of threads available in this parallel region */
      NPR = omp_get_num_threads();
      /* get the total number of threads requested */
      NTHR = omp_get_max_threads();
      /* only execute this on the master thread! */
      if (tid == 0) {
        printf("%i : NCPU\t= %i\n", tid, NCPU);
        printf("%i : NTHR\t= %i\n", tid, NTHR);
        printf("%i : NPR\t= %i\n", tid, NPR);
      }
      printf("%i : I am thread %i out of %i\n", tid, tid, NPR);
    }
    return(0);
  }
shared, private and reduction
The following OpenMP example is a program which computes the dot product of two arrays a and b (that is sum(a[i]*b[i])) using a sum reduction.
The input variables a and b are shared tables of double values.

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define N 1000

  int main (int argc, char *argv[]) {
    double a[N], b[N];
    double sum = 0.0;
    int i, n, tid;
    /* Start a number of threads */
shared, private and reduction

    #pragma omp parallel shared(a) shared(b) private(i)
    {
      /* only one of the threads does this */
      #pragma omp single
      {
        n = omp_get_num_threads();
        printf("Number of threads = %d\n", n);
      }
      /* initialize a and b – parallel operations */
      #pragma omp for
      for (i=0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 1.0;
      }
shared, private and reduction

      /* parallel for loop computing the sum of a[i]*b[i] */
      #pragma omp for reduction(+:sum)
      for (i=0; i < N; i++) {
        sum += a[i]*b[i];
      }
    }
    /* End of parallel region */
    printf(" Sum = %2.1f\n", sum);
    exit(0);
  }
Summary
Processing performance
Pipeline performance
Memory system with cache
Three walls
Symmetric multiprocessing
Amdahl's Law
OpenMP