Levels of Processor/Memory Hierarchy

Page 1: Levels of Processor/Memory Hierarchy


Levels of Processor/Memory Hierarchy

• Can be Modeled by Increasing Dimensionality of Data Array.

– Additional dimension for each level of the hierarchy.
– Envision data as reshaped to reflect increased dimensionality.
– Calculus automatically transforms algorithm to reflect reshaped data array.
– Data layout, data movement, and scalarization automatically generated based on reshaped data array.

Page 2: Levels of Processor/Memory Hierarchy


Levels of Processor/Memory Hierarchy (continued)

• Math and indexing operations in same expression

• Framework for design space search
  – Rigorous and provably correct
  – Extensible to complex architectures

Approach: Mathematics of Arrays. Example: "raising" array dimensionality for y = conv(x).

[Figure: memory hierarchy (Main Memory, L2 Cache, L1 Cache) against parallelism. The input x = < 0 1 2 … 35 > is mapped across three processors: P0 holds < 0 … 11 >, P1 holds < 12 … 23 >, and P2 holds < 24 … 35 >, and each processor's portion is further split into three-element rows (< 0 1 2 >, < 3 4 5 >, …) for the deeper levels of the hierarchy.]
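As a concrete illustration of this mapping (not from the slides), the sketch below views the flat 36-element x through processor/row/element coordinates; the names P, BLOCK, and PER_PROC are illustrative, assuming the 3-processor, 3-element-row layout in the figure.

```cpp
// Sketch: reshaping a flat 36-element array to reflect the processor and
// cache levels shown in the figure above.
#include <cstdio>

int main() {
    const int N = 36, P = 3, BLOCK = 3;       // 3 processors, 3-element rows
    const int PER_PROC = N / P;               // 12 elements per processor
    int x[N];
    for (int i = 0; i < N; ++i) x[i] = i;     // x = < 0 1 2 ... 35 >

    // View the flat x as a 3-d array: processor x row x element-within-row.
    for (int p = 0; p < P; ++p)
        for (int r = 0; r < PER_PROC / BLOCK; ++r) {
            std::printf("P%d row %d:", p, r);
            for (int e = 0; e < BLOCK; ++e)
                std::printf(" %d", x[p * PER_PROC + r * BLOCK + e]);
            std::printf("\n");
        }
    return 0;
}
```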

Page 3: Levels of Processor/Memory Hierarchy


Application Domain: Signal Processing

3-d Radar Data Processing: Composition of Monolithic Array Operations

[Figure: the radar processing chain Pulse Compression → Doppler Filtering → Beamforming → Detection, built from Convolution and Matrix Multiply. Both the algorithm and the architectural information (hardware info: memory, processor) are inputs. The algorithm is changed to better match hardware/memory/communication by lifting the dimension algebraically: model processors (dim = dim+1), model time-variance (dim = dim+1), model Level 1 cache (dim = dim+1); modeling all three gives dim = dim+3.]

Page 4: Levels of Processor/Memory Hierarchy


Some Psi Calculus Operations

Page 5: Levels of Processor/Memory Hierarchy


Convolution: PSI Calculus Description

PSI Calculus operators compose to form higher-level operations.

Definition of y = conv(h, x):

y[n] = Σ_{k=0..M-1} h[k] · x'[n+k], where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x' is x padded by M-1 zeros on either end.

Algorithm and PSI Calculus Description

Algorithm step → Psi Calculus, with example values for x = < 1 2 3 4 >, h = < 5 6 7 >:

• Initial step: x = < 1 2 3 4 >, h = < 5 6 7 >
• Form x': x' = cat(reshape(< M-1 >, < 0 >), cat(x, reshape(< M-1 >, < 0 >)))
    → x' = < 0 0 1 … 4 0 0 >
• Rotate x' (N+M-1) times: x'rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
    → x'rot = < 0 0 1 2 … >, < 0 1 2 3 … >, < 1 2 3 4 … >, …
• Take the size-of-h part of x'rot: x'final = binaryOmega(take, 0, reshape(< N+M-1 >, < M >), 1, x'rot)
    → x'final = < 0 0 1 >, < 0 1 2 >, < 1 2 3 >, …
• Multiply: Prod = binaryOmega(*, 1, h, 1, x'final)
    → Prod = < 0 0 7 >, < 0 6 14 >, < 5 12 21 >, …
• Sum: Y = unaryOmega(sum, 1, Prod)
    → Y = < 7 20 38 … >

Page 6: Levels of Processor/Memory Hierarchy


Experimental Platform and Method

Hardware
• DY4 CHAMP-AV board
  – Contains 4 MPC7400s and 1 MPC 8420
• MPC7400 (G4)
  – 450 MHz
  – 32 KB L1 data cache
  – 2 MB L2 cache
  – 64 MB memory/processor

Software
• VxWorks 5.2
  – Real-time OS
• GCC 2.95.4 (non-official release)
  – GCC 2.95.3 with patches for VxWorks
  – Optimization flags: -O3 -funroll-loops -fstrict-aliasing

Method
• Run many iterations; report average, minimum, and maximum time
  – From 10,000,000 iterations for small data sizes down to 1,000 for large data sizes
• All approaches run on the same data
• Only average times shown here
• Only one G4 processor used
• Use of the VxWorks OS resulted in very low variability in timing
• High degree of confidence in results

Page 7: Levels of Processor/Memory Hierarchy


Convolution and Dimension Lifting

• Model Processor and Level 1 cache.
  – Start with 1-d inputs (input dimension).
  – Envision a 2nd dimension ranging over output values.
  – Envision Processors reshaped into a 3rd dimension. The 2nd dimension is partitioned.
  – Envision Cache reshaped into a 4th dimension. The 1st dimension is partitioned.
  – "psi" Reduce to Normal Form.

Page 8: Levels of Processor/Memory Hierarchy


– Envision 2nd dimension ranging over output values.

Let tz = N+M-1, with M = size of h (4 in this example) and N = size of x.

[Figure: the filter coefficients h3 h2 h1 h0 aligned against the zero-padded input 0 0 0 x0 … x4 …, with the 2nd dimension (length tz) ranging over the output values.]

Page 9: Levels of Processor/Memory Hierarchy


– Envision Processors reshaped into a 3rd dimension. The 2nd dimension is partitioned.

Let p = 2 (processors).

[Figure: the tz-length output dimension is split into p blocks of length tz/2, one per processor, giving a 2 × tz/2 × … layout.]

Page 10: Levels of Processor/Memory Hierarchy


– Envision Cache reshaped into a 4th dimension. The 1st dimension is partitioned.

[Figure: each processor's tz/2-length block is further partitioned along the 1st (input) dimension into cache-sized pieces, giving a 4-dimensional layout; the slide labels the resulting shape with 2, tz/2, and 2 × 2 blocks.]

Page 11: Levels of Processor/Memory Hierarchy


ONF for the Convolution Decomposition with Processors & Cache

Generic form: 4-dimensional, after "psi" Reduction (time domain).

Let tz = N+M-1, M = size of h, N = size of x.

1. for i0 = 0 to p-1 do:                        (Processor loop)
2.   for i1 = 0 to tz/p - 1 do:                 (Time loop)
3.     sum ← 0
4.     for icacherow = 0 to M/cache - 1 do:     (Cache loop)
5.       for i3 = 0 to cache - 1 do:
6.         sum ← sum + h[(M - ((icacherow * cache) + i3)) - 1] * x'[((tz/p * i0) + i1) + (icacherow * cache) + i3]

sum is calculated for each element of y.
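A minimal C++ sketch (illustrative, not the authors' code) of the ONF loop nest above. It assumes p divides tz and cache divides M, takes x' as the already padded input, and adds the store into y that the slide describes as "sum is calculated for each element of y."

```cpp
// ONF loop nest for the convolution, written out in C++.
#include <cstddef>
#include <vector>

void conv_onf(const std::vector<double>& h,   // M coefficients
              const std::vector<double>& xp,  // x' : zero-padded input
              std::vector<double>& y,         // tz = N + M - 1 outputs
              int p, int cache) {
    const int M  = (int)h.size();
    const int tz = (int)y.size();
    for (int i0 = 0; i0 < p; ++i0)                          // Processor loop
        for (int i1 = 0; i1 < tz / p; ++i1) {               // Time loop
            double sum = 0.0;
            for (int icacherow = 0; icacherow < M / cache; ++icacherow)  // Cache loop
                for (int i3 = 0; i3 < cache; ++i3)
                    sum += h[(M - (icacherow * cache + i3)) - 1]
                         * xp[((tz / p) * i0 + i1) + icacherow * cache + i3];
            y[(tz / p) * i0 + i1] = sum;                    // one element of y
        }
}
```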

Page 12: Levels of Processor/Memory Hierarchy


Outline

• Overview
• Array Algebra (MoA) and Index Calculus (Psi Calculus)
• Time Domain Convolution
• Other algorithms in Radar
  – Modified Gram-Schmidt QR Decomposition: MoA to ONF; Experiments
  – Composition of Matrix Multiplication in Beamforming: MoA to DNF; Experiments
  – FFT
• Benefits of Using MoA and Psi Calculus

Page 13: Levels of Processor/Memory Hierarchy


Algorithms in Radar

[Flow diagram (MoA & Psi Calculus): Time Domain Convolution(x, y), Modified Gram-Schmidt QR(A), and A x (BH x C) Beamforming each get a manual description & derivation for 1 processor, yielding a DNF; lifting the dimension (processor, L1 cache) and reformulating turns the DNF into an ONF for 1 processor. Surrounding boxes: Mechanize using expression templates; Use to reason about RAW; Benchmark at NCSA with LAPACK; Compiler optimizations, DNF to ONF; Implement DNF/ONF in Fortran 90; Thoughts on an abstract machine.]

Page 14: Levels of Processor/Memory Hierarchy


Benefits of Using MoA and Psi Calculus

• Processor/Memory Hierarchy can be modeled by reshaping data using an extra dimension for each level.

• Composition of monolithic operations can be re-expressed as composition of operations on smaller data granularities
  – Matches memory hierarchy levels
  – Avoids materialization of intermediate arrays.

• Algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.

• Facilitates programming expressed at a high level
  – Facilitates intentional program design and analysis
  – Facilitates portability

• This approach is applicable to many other problems in radar.

Page 15: Levels of Processor/Memory Hierarchy


ONF for the QR Decomposition with Processors & Cache

Modified Gram-Schmidt

[Structure diagram: a Main Loop over the stages Initialization, Compute Norm, Normalize, Dot Product, and Orthogonalize; the compute stages are decomposed into Processor loops and Processor-Cache loops.]
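For reference, a plain single-processor Modified Gram-Schmidt QR sketch (the textbook algorithm whose stages the diagram names; this is not the slide's processor/cache-lifted ONF):

```cpp
// Textbook Modified Gram-Schmidt: factor an m x n matrix A (column-major,
// A[j*m + i] = element (i, j)) into Q (orthonormal columns, overwrites A)
// and upper-triangular R (row-major, R[j*n + k]). The stages mirror the
// diagram: Compute Norm, Normalize, Dot Product, Orthogonalize.
#include <cmath>
#include <cstddef>
#include <vector>

void mgs_qr(std::vector<double>& A, std::vector<double>& R, int m, int n) {
    R.assign(static_cast<std::size_t>(n) * n, 0.0);          // Initialization
    for (int j = 0; j < n; ++j) {
        double norm = 0.0;                                    // Compute Norm
        for (int i = 0; i < m; ++i) norm += A[j * m + i] * A[j * m + i];
        R[j * n + j] = std::sqrt(norm);
        for (int i = 0; i < m; ++i) A[j * m + i] /= R[j * n + j];  // Normalize
        for (int k = j + 1; k < n; ++k) {
            double dot = 0.0;                                 // Dot Product
            for (int i = 0; i < m; ++i) dot += A[j * m + i] * A[k * m + i];
            R[j * n + k] = dot;
            for (int i = 0; i < m; ++i)                       // Orthogonalize
                A[k * m + i] -= dot * A[j * m + i];
        }
    }
}
```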

Page 16: Levels of Processor/Memory Hierarchy


DNF for the Composition of A x (BH x C)

Generic form: 4-dimensional

Given A, B, X, Z: n-by-n arrays (Beamforming)

1. Z = 0
2. for i = 0 to n-1 do:
3.   for j = 0 to n-1 do:
4.     for k = 0 to n-1 do:
5.       z[k;] ← z[k;] + A[k;j] × X[j;i] × B[i;]
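A minimal C++ sketch (illustrative) of this DNF, with the implicit row operation z[k;] written out as an explicit innermost loop over l; matrices are assumed n × n and row-major:

```cpp
// DNF loop nest for the composed product: Z[k,l] += A[k,j] * X[j,i] * B[i,l].
#include <cstddef>
#include <vector>

void beamform_dnf(const std::vector<double>& A,
                  const std::vector<double>& X,
                  const std::vector<double>& B,
                  std::vector<double>& Z, int n) {
    Z.assign(static_cast<std::size_t>(n) * n, 0.0);     // 1. Z = 0
    for (int i = 0; i < n; ++i)                         // 2.
        for (int j = 0; j < n; ++j)                     // 3.
            for (int k = 0; k < n; ++k)                 // 4.
                for (int l = 0; l < n; ++l)             // 5. z[k;] += A[k;j] * X[j;i] * B[i;]
                    Z[k * n + l] += A[k * n + j] * X[j * n + i] * B[i * n + l];
}
```

Because the two matrix products are composed into one loop nest, the intermediate product is never materialized.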

Page 17: Levels of Processor/Memory Hierarchy


Typical C++ Operator Overloading

Example: A = B + C (vector add)

[Diagram: Main passes B& and C& to operator+; operator+ builds a temporary and returns a copy of it, which Main passes by reference to operator=.]

1. Pass B and C references to operator+
2. Create temporary result vector
3. Calculate results, store in temporary
4. Return copy of temporary
5. Pass results reference to operator=
6. Perform assignment

Two temporary vectors are created.

Additional Memory Use
• Static memory
• Dynamic memory (also affects execution time)

Additional Execution Time
• Cache misses/page faults
• Time to create a new vector
• Time to create a copy of a vector
• Time to destruct both temporaries
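A minimal sketch of the overloaded-operator style described above (with a modern compiler the returned temporary is usually elided, but the temporary built inside operator+ remains):

```cpp
// Naive overloaded operator+: each A = B + C builds a temporary result
// vector and returns it, so extra allocations and copies occur per expression.
#include <cstddef>
#include <vector>

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n = 0) : data(n) {}
};

Vec operator+(const Vec& b, const Vec& c) {          // 1. B& and C& passed in
    Vec temp(b.data.size());                         // 2. temporary result vector
    for (std::size_t i = 0; i < temp.data.size(); ++i)
        temp.data[i] = b.data[i] + c.data[i];        // 3. calculate into temporary
    return temp;                                     // 4. return the temporary
}

int main() {
    Vec A(4), B(4), C(4);
    A = B + C;   // 5-6. the result is passed to operator= and assigned into A
}
```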

Page 18: Levels of Processor/Memory Hierarchy


C++ Expression Templates and PETE

Parse trees, not vectors, created.

Reduced Memory Use
• Parse tree only contains references

Reduced Execution Time
• Better cache use
• Loop-fusion-style optimization
• Compile-time expression tree manipulation

PETE: http://www.acl.lanl.gov/pete

• PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory
• PETE provides:
  – Expression template capability
  – Facilities to help navigate and evaluate parse trees

Example: A = B + C

With expression templates, B + C produces an expression type (a parse tree), not a result vector:

Expression type: BinaryNode< OpAdd, Reference<Vector>, Reference<Vector> >
Parse tree: + with children B and C

[Diagram: Main passes B& and C& to operator+; operator= receives a reference to the expression tree and assigns into A.]

1. Pass B and C references to operator+
2. Create expression parse tree
3. Return expression parse tree
4. Pass expression tree reference to operator=
5. Calculate result and perform assignment
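A minimal expression-template sketch (illustrative, much simpler than PETE) showing how A = B + C builds a reference-holding parse tree and is evaluated in a single fused loop inside operator=:

```cpp
// Expression templates: operator+ returns a parse-tree node holding
// references; no temporary vector is created, and operator= runs one loop.
#include <cstddef>
#include <vector>

template <class L, class R>
struct AddNode {                 // parse-tree node for "left + right"
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Evaluate the whole expression tree in one loop (loop-fusion style).
    template <class Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ returns the parse tree (references only), not a result vector.
template <class L, class R>
AddNode<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec A(4), B(4), C(4), D(4);
    A = B + C + D;   // builds AddNode<AddNode<Vec,Vec>,Vec>, then one loop
}
```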

Page 19: Levels of Processor/Memory Hierarchy


Implementing Psi Calculus with Expression Templates

Example: A=take(4,drop(3,rev(B)))

B = < 1 2 3 4 5 6 7 8 9 10 >, A = < 7 6 5 4 >

Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form x[i] = y[stride*i + offset], l ≤ i < u.

1. Form expression tree
2. Add size information
3. Apply Psi Reduction rules
4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root

[Figure: the expression tree take(4, drop(3, rev(B))) annotated in two passes: size information at each node, then the Psi Reduction result at each node.]

Size info: B: size = 10; rev: size = 10; drop 3: size = 7; take 4: size = 4.

Reduction at each node:
  B    (size = 10):  A[i] = B[i]
  rev  (size = 10):  A[i] = B[-i + B.size - 1] = B[-i + 9]
  drop (size = 7):   A[i] = B[-(i+3) + 9] = B[-i + 6]
  take (size = 4):   A[i] = B[-i + 6]
  Result (size = 4): iterator with offset = 6, stride = -1

• Iterators used for efficiency, rather than recalculating indices for each i
• One "for" loop to evaluate each sub-expression
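A minimal sketch of the loop the reduced expression becomes; the stride and offset values come from the reduction above:

```cpp
// A = take(4, drop(3, rev(B))) reduces to A[i] = B[-1*i + 6] for 0 <= i < 4,
// i.e. stride = -1, offset = 6 -- no intermediate arrays are formed.
#include <cstdio>

int main() {
    int B[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int A[4];
    const int stride = -1, offset = 6;            // from the reduction above
    for (int i = 0; i < 4; ++i)                   // one "for" loop, no temporaries
        A[i] = B[stride * i + offset];
    for (int v : A) std::printf("%d ", v);        // prints: 7 6 5 4
    std::printf("\n");
}
```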