TRANSCRIPT
Performance Architecture: Introduction & exercises
P. Bakowski
[email protected]
P. Bakowski 2
What is performance?
Computing performance relates to the execution speed/time of a given task (program).
The execution time can be evaluated as the product of:
execution_time =
  number of instructions *
  number of clock cycles per instruction *
  clock cycle time
Number of instructions
The number of instructions represents the complexity (length) of the task/program.
It depends on:
source program complexity
the choice/suitability of the instruction set
clock cycle time
Discussion: CISC versus RISC
Number of clock cycles per instruction
The number of clock cycles per instruction reflects the complexity and the degree of parallelism of the internal architecture.
A single instruction needs several clock cycles, elaborated through several stages:
- Fetch
- Decode
- Execute
The essential « acceleration » mechanism is pipelining.
Clock cycle
The clock cycle is related to the clock frequency:
clock_cycle = 1/clock_frequency
This parameter depends on:
- the number of logic levels (gates) between the synchronization registers (2-16)
- the technology switching speed (5-50 ps)
Execution time - exercise
Exercise:
Given a program that involves the execution of 1 million instructions,
2 clock cycles per instruction (execution rate),
and
a 1 GHz clock,
what is the execution time for this program?
Very simple pipelining
[diagram: three successive instructions flow through the fetch | decode | execute stages, each shifted by one clock cycle]
Instruction types:
- calculation (no stall)
- branches (stall: 2 instructions lost)
- memory access (stall: additional clock cycle)
Pipeline performance - exercise
In this exercise we evaluate the execution speed-up for a given pipeline.
Each pipeline stage corresponds to one clock cycle.
The length of the pipeline is a parameter.
Pipeline performance
The executed program is a mix of instructions:
- calculation instructions may be executed without a pipeline stall
- conditional branch instructions involve bps stages of delay (stall)
- memory access instructions involve mps stages of delay
Pipeline performance
Given the clock frequency, the number of stages, the number of instructions to execute, and the proportion of conditional/memory_access/calculation instructions,
we can evaluate the necessary execution time for non-pipelined and pipelined architectures.
Pipeline performance
clock frequency in MHz:
number of pipeline stages:
total instruction number in millions (inb):
ratio of branch instructions [e.g. 0.2]:
average stall penalty for branch instructions [e.g. 4]:
ratio of memory access instructions [e.g. 0.1]:
average stall penalty for memory access instructions [e.g. 3]:
Pipeline performance
The non-pipelined execution time is given as:
npipe_time = clock_cycle * number_of_stages * number_of_instructions
The pipelined execution time is calculated as:
calrt = 1.0 - brrt - memrt
calrt - ratio of calculation instructions
pipe_time = clock_cycle*inb*calrt + clock_cycle*inb*brstall*brrt + clock_cycle*inb*memstall*memrt
Pipeline and cache memory
[diagram: the execute (write/read) stage accesses the cache memory; a hit is served by the cache, a miss goes to main memory]
Cache memory access
[diagram: on a hit the cache serves the access directly; on a miss a whole line is transferred from main memory into the cache]
Cache memory : access time
[diagram: hit/miss path between the cache and main memory; a miss transfers a whole line]
Exercise:
Hit ratio: 99%
Cache access time: 2 ns
Main memory access time: 20 ns
Data transfer time (bus): 2 Gb/s
Line size: 128 bytes
Question: What is the average access time to this memory system?
Three walls
Power wall (4 GHz => 160 W):
Unacceptable growth in power usage with clock rate.
Three walls
Instruction-level parallelism (ILP) wall: limits to available low-level parallelism.
Max 3-4 instructions in parallel.
Memory wall: a growing discrepancy of processor speeds relative to memory speeds.
Exercise:
Calculate the memory bandwidth for:
a 4 GHz clock and 4 instructions per clock cycle.
Symmetric multiprocessing
The simplest form of multiprocessing is based on an architecture of identical processing units communicating via shared memory: Symmetric Multi-Processing (SMP).
SMP implies the usage of threads.
[diagram: four CPUs, each with a private cache, connected to a shared memory]
Sequential-parallel execution
[diagram: a main thread runs sequentially, forks into parallel threads, then joins and continues sequentially]
Amdahl's law
The execution time of a mixed, sequential-parallel program, after the acceleration of the parallel part, is given by a simple equation known as Amdahl's law:
tex_new = tex_par/acceleration + tex_seq
Exercise:
Calculate the necessary acceleration to obtain the required new execution time.
Take tex_par = 10 s, tex_seq = 5 s, tex_new = 12 s.
OpenMP
OpenMP is a shared-memory application programming interface (API) whose features facilitate shared-memory parallel programming.
OpenMP's directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.
The first step in creating an OpenMP program from a sequential one is to identify the parallelism it contains. Basically, this means finding instructions, sequences of instructions, or even large regions of code that may be executed concurrently by different processors.
OpenMP

  #pragma omp ...
  {
    ...code...
  }

where the ...code... is called the parallel region.
That is the area of the code that you want to try to run in parallel, if possible.
When the compiler sees the start of the parallel region, it creates a pool of threads.
When the program runs, these threads start executing, and are controlled by the information in the directives.
OpenMP
The number of threads is controlled by an environment variable, OMP_NUM_THREADS. It could default to 1 or to the number of cores on your machine.

  export OMP_NUM_THREADS=4

  #include "stdio.h"
  #include <omp.h>

  int main(int argc, char *argv[])
  {
    #pragma omp parallel
    {
      printf("hello multicore user!\n");
    }
    return(0);
  }

Question: What message is displayed on the screen?
OpenMP – CPUs and threads

  #include "stdio.h"
  #include <omp.h>

  int main(int argc, char *argv[])
  {
    #pragma omp parallel
    {
      int NCPU, tid, NPR, NTHR;
      /* get the total number of CPUs/cores available for OpenMP */
      NCPU = omp_get_num_procs();
      /* get the current thread ID in the parallel region */
      tid = omp_get_thread_num();
      /* get the total number of threads available in this parallel region */
      NPR = omp_get_num_threads();
      /* get the total number of threads requested */
      NTHR = omp_get_max_threads();
      /* only execute this on the master thread! */
      if (tid == 0) {
        printf("%i : NCPU\t= %i\n", tid, NCPU);
        printf("%i : NTHR\t= %i\n", tid, NTHR);
        printf("%i : NPR\t= %i\n", tid, NPR);
      }
      printf("%i : I am thread %i out of %i\n", tid, tid, NPR);
    }
    return(0);
  }
shared, private and reduction
The following OpenMP example is a program which computes the dot product of two arrays a and b (that is sum(a[i]*b[i])) using a sum reduction.
The input variables a and b are shared tables of double values.

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define N 1000

  int main (int argc, char *argv[]) {
    double a[N], b[N];
    double sum = 0.0;
    int i, n, tid;
    /* Start a number of threads */
shared, private and reduction

    #pragma omp parallel shared(a) shared(b) private(i)
    {
      /* only one of the threads does this */
      #pragma omp single
      {
        n = omp_get_num_threads();
        printf("Number of threads = %d\n", n);
      }
      /* initialize a and b – parallel operations */
      #pragma omp for
      for (i=0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 1.0;
      }
shared, private and reduction

      /* parallel for loop computing the sum of a[i]*b[i] */
      #pragma omp for reduction(+:sum)
      for (i=0; i < N; i++) {
        sum += a[i]*b[i];
      }
    }
    /* End of parallel region */
    printf(" Sum = %2.1f\n", sum);
    exit(0);
  }
Summary
Processing performance
Pipeline performance
Memory system with cache
Three walls
Symmetric multiprocessing
Amdahl's Law
OpenMP