

Performance Architecture
Introduction & exercises

P. Bakowski

[email protected]


P. Bakowski 2

What is performance?

Computing performance relates to the execution speed/time of a given task (program).

The execution time can be evaluated as the product of:

execution_time =
    number of instructions *
    number of clock cycles per instruction *
    clock cycle time


P. Bakowski 3

Number of instructions

The number of instructions represents the complexity (length) of the task/program.

It depends on:

the source program complexity

the choice/suitability of the instruction set

the compiler (quality of the generated code)

Discussion: CISC versus RISC


P. Bakowski 4

Number of clock cycles per instruction

The number of clock cycles per instruction reflects the complexity and the degree of parallelism of the internal architecture.

A single instruction needs several clock cycles, since it is processed in several stages:

- Fetch
- Decode
- Execute

The essential « acceleration » mechanism is pipelining.


P. Bakowski 5

Clock cycle

The clock cycle is related to the clock frequency:

clock_cycle = 1/clock_frequency

This parameter depends on:

- the number of logic levels (gates) between the synchronization registers (2-16)
- the technology switching speed (5-50 ps)
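A rough, illustrative sketch of the relationship (the numbers are only example values): with, say, 10 logic levels of 30 ps each, the critical path is about 300 ps, so the clock cycle cannot be shorter than roughly 300 ps, i.e. about 3.3 GHz, ignoring register setup/hold overhead.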


P. Bakowski 6

Execution time - exercise

Exercise:

Given a program that involves the execution of 1 million instructions, an execution rate of 2 clock cycles per instruction, and a 1 GHz clock:

What is the execution time for this program?
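A worked sketch of the answer (assuming the average rate of 2 cycles per instruction holds for all instructions):

execution_time = 1,000,000 * 2 * (1 / 1 GHz) = 2,000,000 * 1 ns = 2 ms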


P. Bakowski 7

Very simple pipelining

fetch   decode  execute
        fetch   decode  execute
                fetch   decode  execute

(each column corresponds to one clock cycle)

Instruction types:

calculation (no stall)

branches (stall – 2 instructions lost)

memory access (stall – additional clock cycle)


P. Bakowski 8

Pipeline performance - exercise

In this exercise we evaluate the execution speed-up for a given pipeline.

Each pipeline stage corresponds to one clock cycle.

The length of the pipeline is a parameter.


P. Bakowski 9

Pipeline performance

The executed program is a mix of instructions:

calculation instructions may be executed without a pipeline stall

conditional branch instructions involve bps stages of delay (stall)

memory access instructions involve mps stages of delay (stall)


P. Bakowski 10

Pipeline performance

Given the clock frequency, the number of stages, the number of instructions to execute and the proportions of conditional/memory_access/calculation instructions,

we can evaluate the execution time for both non-pipelined and pipelined architectures.


P. Bakowski 11

Pipeline performance

clock frequency in MHz:

number of pipeline stages:

total instruction number in millions (inb):

ratio of branch instructions [e.g. 0.2]:

average stall penalty for branch instructions [e.g. 4]:

ratio of memory access instructions [e.g. 0.1]:

average stall penalty for memory access instructions [e.g. 3]:


P. Bakowski 12

Pipeline performance

Non-pipelined execution time is given as:

npipe_time = clock_cycle * number_of_stages * number_of_instructions

Pipelined execution time is calculated as:

calrt = 1.0 - brrt - memrt

calrt - ratio of calculation instructions

pipe_time = clock_cycle*inb*calrt + clock_cycle*inb*brstall*brrt + clock_cycle*inb*memstall*memrt
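A minimal C sketch of these two formulas, using the example values suggested with the parameter list above; the variable names (freq_mhz, stages, inb, brrt, brstall, memrt, memstall) are illustrative assumptions, not part of any official exercise code:

#include <stdio.h>

int main(void)
{
    /* example parameter values, following the suggestions above */
    double freq_mhz = 1000.0;  /* clock frequency in MHz */
    double stages   = 5.0;     /* number of pipeline stages */
    double inb      = 1.0e6;   /* total number of instructions */
    double brrt     = 0.2;     /* ratio of branch instructions */
    double brstall  = 4.0;     /* average branch stall penalty (cycles) */
    double memrt    = 0.1;     /* ratio of memory access instructions */
    double memstall = 3.0;     /* average memory access stall penalty (cycles) */

    double clock_cycle = 1.0 / (freq_mhz * 1.0e6);  /* seconds */
    double calrt = 1.0 - brrt - memrt;              /* ratio of calculation instructions */

    /* non-pipelined: every instruction traverses all stages one after another */
    double npipe_time = clock_cycle * stages * inb;

    /* pipelined: one cycle per calculation instruction, plus the stall
       cycles caused by branches and memory accesses, as in the formula above */
    double pipe_time = clock_cycle * inb * calrt
                     + clock_cycle * inb * brstall * brrt
                     + clock_cycle * inb * memstall * memrt;

    printf("non-pipelined: %g s, pipelined: %g s, speed-up: %g\n",
           npipe_time, pipe_time, npipe_time / pipe_time);
    return 0;
}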


P. Bakowski 13

Pipeline and cache memory

[Diagram: pipelined instructions (fetch / decode / execute or write/read); the write/read stage performs a cache memory access that results in a hit or a miss, and a miss triggers an access to main memory.]


P. Bakowski 14

Cache memory access

[Diagram: on a hit the data is served directly by the cache; on a miss a complete line is transferred from main memory into the cache.]


P. Bakowski 15

Cache memory: access time

[Diagram: cache and main memory; on a miss a full line is transferred from main memory into the cache.]

Exercise:
Hit ratio: 99%
Cache access time – 2 ns
Main memory access time – 20 ns
Data transfer time (bus) – 2 Gb/s
Line size – 128 bytes

Question: What is the average access time to this memory system?
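A worked sketch under one common interpretation (a miss pays the main memory access plus the transfer of one full line over the bus):

line transfer time = 128 bytes / 2 Gb/s = 1024 bits / (2 * 10^9 bits/s) = 512 ns
miss penalty = 20 ns + 512 ns = 532 ns
average access time = 2 ns + 0.01 * 532 ns ≈ 7.3 ns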


P. Bakowski 16

Three walls

Power wall (4 GHz => 160 W):

Unacceptable growth in power usage with clock rate.


P. Bakowski 17

Three wallsThree walls

Instruction-level parallelism Instruction-level parallelism ILP - wallILP - wall: Limits to available : Limits to available low-level parallelism.low-level parallelism.

Max 3-4 Max 3-4

Memory wallMemory wall: A growing discrepancy of processor speeds : A growing discrepancy of processor speeds relative to memory speeds.relative to memory speeds.

Exercise Exercise ::

Calculate the memory bandwidth for :Calculate the memory bandwidth for :

4 GHz clock and 4 instructions per clock cycle ?4 GHz clock and 4 instructions per clock cycle ?
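A worked sketch, assuming 32-bit (4-byte) instructions and counting instruction fetches only:

4 GHz * 4 instructions per cycle = 16 * 10^9 instructions/s
16 * 10^9 instructions/s * 4 bytes = 64 GB/s of instruction-fetch bandwidth (data accesses come on top of this).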


P. Bakowski 18

Symmetric multiprocessing

The simplest form of multiprocessing is based on an architecture of identical processing units communicating via shared memory – Symmetric Multi-Processing (SMP).

SMP implies the usage of threads.

[Diagram: four CPUs, each with its own cache, connected to a shared memory.]


P. Bakowski 19

Sequential-parallel execution

[Diagram: the main thread executes sequentially, forks into several threads for the parallel part, then joins and resumes sequential execution.]


P. Bakowski 20

Amdahl's law

The execution time of a mixed, sequential-parallel program, after the acceleration of the parallel part, is given by a simple equation known as Amdahl's law.

tex_new = tex_par/acceleration + tex_seq

Exercise:

Calculate the necessary acceleration to obtain the required new execution time.

Take tex_par = 10 s, tex_seq = 5 s, tex_new = 12 s.
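A worked sketch: 12 = 10/acceleration + 5, hence 10/acceleration = 7 and acceleration = 10/7 ≈ 1.43.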


P. Bakowski 21

OpenMP

OpenMP is a shared-memory application programming interface (API) whose features facilitate shared-memory parallel programming.

OpenMP's directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.

The first step in creating an OpenMP program from a sequential one is to identify the parallelism it contains. Basically, this means finding instructions, sequences of instructions, or even large regions of code that may be executed concurrently by different processors.



P. Bakowski 23

OpenMP

#pragma omp ...
{
    ...code...
}

where the ...code... is called the parallel region.

That is the area of the code that you want to try to run in parallel, if possible.

When the compiler sees the start of the parallel region, it creates a pool of threads.

When the program runs, these threads start executing and are controlled by the information in the directives.


P. Bakowski 24

OpenMP

The number of threads is controlled by an environment variable, OMP_NUM_THREADS. It could default to 1 or to the number of cores on your machine.

export OMP_NUM_THREADS=4

#include "stdio.h"
#include <omp.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("hello multicore user!\n");
    }
    return(0);
}

Question: What message is displayed on the screen?
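For reference, one way to build and run it (assuming gcc with OpenMP support and that the source is saved as, say, hello.c):

gcc -fopenmp hello.c -o hello
export OMP_NUM_THREADS=4
./hello

The body of the parallel region is executed once by every thread, so with this setting the message is printed four times, in no guaranteed order.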


P. Bakowski 25

OpenMP – CPUs and threads

#include "stdio.h"
#include <omp.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        int NCPU, tid, NPR, NTHR;
        /* get the total number of CPUs/cores available for OpenMP */
        NCPU = omp_get_num_procs();
        /* get the current thread ID in the parallel region */
        tid = omp_get_thread_num();
        /* get the total number of threads available in this parallel region */
        NPR = omp_get_num_threads();
        /* get the total number of threads requested */
        NTHR = omp_get_max_threads();
        /* only execute this on the master thread! */
        if (tid == 0) {
            printf("%i : NCPU\t= %i\n", tid, NCPU);
            printf("%i : NTHR\t= %i\n", tid, NTHR);
            printf("%i : NPR\t= %i\n", tid, NPR);
        }
        printf("%i : I am thread %i out of %i\n", tid, tid, NPR);
    }
    return(0);
}
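With OMP_NUM_THREADS=4, for example, only the master thread (tid 0) prints the NCPU/NTHR/NPR lines, while every thread prints its own "I am thread i out of 4" line; the interleaving of the output varies from run to run.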


P. Bakowski 26

shared, private and reduction

The following OpenMP example is a program which computes the dot product of two arrays a and b (that is, sum(a[i]*b[i])) using a sum reduction.

The input variables a and b are shared arrays of double values.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1000

int main (int argc, char *argv[]) {
    double a[N], b[N];
    double sum = 0.0;
    int i, n, tid;
    /* Start a number of threads */


P. Bakowski 27

shared, private and reduction

    #pragma omp parallel shared(a) shared(b) private(i)
    {
        /* only one of the threads does this */
        #pragma omp single
        {
            n = omp_get_num_threads();
            printf("Number of threads = %d\n", n);
        }
        /* initialize a and b – parallel operations */
        #pragma omp for
        for (i=0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 1.0;
        }


P. Bakowski 28

shared, private and reduction

        /* parallel for loop computing the sum of a[i]*b[i] */
        #pragma omp for reduction(+:sum)
        for (i=0; i < N; i++) {
            sum += a[i]*b[i];
        }
    }
    /* End of parallel region */

    printf("   Sum = %2.1f\n", sum);
    exit(0);
}
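Since both arrays are initialised to 1.0 and N is 1000, the reduction combines the per-thread partial sums and the program prints Sum = 1000.0, regardless of the number of threads.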


P. Bakowski 29

Summary

Processing performance

Pipeline performance

Memory system with cache

Three walls

Symmetric multiprocessing

Amdahl's law

OpenMP