of parallel thigh performance computing

10
Design of Parallel & thigh Performance Computing Reasoning about Performance II Roofline Rode Balance Principles - Analysis Scheduling

Upload: others

Post on 29-Dec-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Design of Parallel & thigh Performance ComputingReasoning about Performance II

Roofline Rode - Balance Principles - Analysis Scheduling

Roofline Rodell Williams et al. 20087

Computer : Program :

men I= W/Q I fbpslsyde ]bandwidth A

LLC tsyteskyck ] T = runtime I cycles ]

CPU peak perf . T P= WIT ( performance ) [ fbpskyck ]tfbpskyclei

Roofline plot : ( example T= 2, A=4 )

Ptfbpskyde ] sound jasedenp fP±pI ) reason :¥=Ww¥a=¥2f^

Sind

Sounds ( PET ) PZ p. I2-

: ⇒ logp a- legit leg B1 -

r

' k - I× ← program Isl ⇒ P 2- A

-

i

run on Some

I. guts input^ 1 1 1 1 > I I f6ps/J > tif

YYYZ1 2

,

Operational Intensity : Upper Surely

Icu ) =Wen )/QCu ) x ,y : length n

, RB ,C= men,

as scalarHeflops I # Sytes

meme > cache

asymptotic best exact

daepy y-

- AttyOct ) OCI ) I' 112

dgemv y - Arty Oct ) OCI ) E' 14

-

fft oclogu ) oaogy ) -

dgemm C - AB -1C 0647 o Crf ) E YT should Se

sharp ifg= code tie

gut . 828Example . y = 12nA

⇒ n E 700

baek to Slides → measured roofline plots

G. Balance Principles I Ckuey 86 )

Compete : Pegram :

men Qy : data transfer men a LLC

sLLC sire j W : work = # ops

CPU IT

The computer is called Salaam ifcompute time = data transfer the

ire . , assuming perfect utilization

¥ = Ese⇒ Ip=¥Jr

r

assume it → at increases ;how A nesalauee ?

a.) A → a. A ,✓ set A grows slower than 5 !s

. ) y → x. 8 what is × ?

Example I : Mahn multiplicities Ca ABTC

algorithm worth opkmel I= hot - QCTF )

Ip =¥←=o( of ) t.at, r→e?g

Example 2 : FFT / dorky

algorithm ark optimal I= @( logo )

Is = [email protected] → ya

pots are unnealrttfe .

7. Balance principles I ( Czechowski et al.

Lou )

coal : none detailed principles for multi ones

algorithm / architecture adesrgn ,assess webof HW tends

Computer ; Algorithm :

slow men PRAM : W : work,

Di depthdi lately ,

B : throughput,

fast men sing dnansaehon fired Qp, , : men . transfersof she A

. . . each poll perf IT

P,

Ppassumptions : optimal W

,W/Q

processors

Processor is faleuafd ;ftwo * set Qp , rn fun Qnx ?

- there are general fends

Them E Tamp ("

compute - tuned and some Arnet results

" as :

of a- put ⇐ > F 3- ¥a

Derivation principles :

i. ) Estimate Tuen : Idea Hinde DAG in levels

level I : o a o a o o 9,mean transfers of the A

level 2 : \o/###qz

r - - - . - - - - - -- - -

- - - e

^

i

level D= o a a o o O O 9g

In each level 2- B model : a+9,⇐ Fit

⇒ Tues I C at 9)= a. Dt Q -- Qara

B I

2.) Estimably Top : Brent 's lemma

Taupe CD -1¥ ) . ¥

←heh - par . compute

Balance : Them stamp ← parallelism computer

⇐ At .fi?j)eoY-eltIa )^ T parallelism algorithmI

Ken#

new. par . algorithm

Example I ; matrix multiplication

use Q 7 r¥TrFp. Cluny et al.

2004 ) ⇒ Icu ) -- O City)

and D K W,

D G Q C a means much smaller )

⇒ balance principle t.IE ofFr )p

I → a. IT rebalance : A → af on p → al g

p → a. p u : ( A → af and p - say ) onCp → Pr )

Example 2 a sending FFFT⇒ PF e o Clay E)

Back to Sholes → Evolution of balance

Greedy scheduling , Analysts

Reminder : T,

=W, To =D Tp 7 T ,/p , To Tp E Dt w/p= Tothlp

Greedy scheduler :

# complete steps IT ,/pIf incomplete steps E I ( every incomplete step reduces

critical path by 1 )⇒ Tp e T ,/pt To

How far from optimal ?

Assume Tpt is optimal :

Tpt >, nae { Tlp , Ta }

Tp I T ,/p t To E 2 met { Tlp , To } E 2Tp← so at most

footer 2 away

enough parallelism : WID = T.to ⇒ p ( ⇒ Tok Tlp

⇒ Tp E T ,/p + To a T, /p

Work stealing scheduling : Analysts

Theorem : The scheduler achieves Tp E T ,/p+ Otto )

Proof sketch : Every processor is either working or stealingTotal true working : T

,

Total time stealing : Every steel has Yp chance to reducethe critical path ; hence OCP To )

⇒ total time : T, + Ocpty )

⇒ time per processor : total Line /p = T, /pt Otto )

Note : space requirement Sp I p S, .