Slide 1: Generalized Data Transfers at Memory Bandwidth
SIGMETRICS ‘96
Peter A. Dinda, David R. O’Hallaron
Carnegie Mellon University
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh
Slide 2: Generalized Data Transfers
[Figure: sending node memory holding data items A, B, C; receiving node memory holding locations D, E, F]
Slide 3: Address Relations
R = {(x,y) | data item at address x on sender is copied to address y on receiver}
Example: {(A,F),(B,D),(C,E)}
[Figure: sending node memory holding A, B, C; receiving node memory holding D, E, F]
Slide 4: Send/Recv Implementation
{(A,F), (B,D), (C,E)}
[Figure: message assembly on the sending node memory (A, B, C), data transfer of the message contents, and message disassembly into the receiving node memory (D, E, F)]
(also put and get communication models)
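The assembly, transfer, and disassembly steps named on this slide can be sketched as a gather into a message followed by a scatter out of it. This is only an illustration: the integer addresses standing in for A through F and the helper names `assemble` and `disassemble` are assumptions, not names from the slides.

```python
def assemble(relation, sender_memory):
    """Sender side: gather the source items, in relation order, into a message."""
    return [sender_memory[x] for x, _ in relation]

def disassemble(relation, message, receiver_memory):
    """Receiver side: scatter message items to their destination addresses."""
    for (_, y), item in zip(relation, message):
        receiver_memory[y] = item
    return receiver_memory

# Hypothetical integer addresses for A, B, C (sender) and D, E, F (receiver).
A, B, C = 0, 1, 2
D, E, F = 0, 1, 2
relation = [(A, F), (B, D), (C, E)]

message = assemble(relation, ["a", "b", "c"])          # message contents
received = disassemble(relation, message, [None] * 3)  # receiving node memory
print(message, received)
```

Both sides walk the same relation in the same order, so the message itself carries only data items and no addresses.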
Slide 5: Storing Address Relations
Compute Address Relation - “Inspector” (done once):
while not done
  compute_address_pair(x,y)
  store_address_pair(x,y)
end while
Assemble Message - “Executor” (repeated many times):
while not done
  get_address_pair(x,y)
  buffer[i++]=data[x]
end while
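A minimal runnable sketch of the inspector and executor split follows. The relation used here (a transpose of a small row-major array, listed in destination order) is a hypothetical stand-in chosen for illustration, and the function names are assumptions.

```python
def inspector(n):
    """Compute and store the address relation once.

    Hypothetical relation: transpose of an n x n row-major array,
    with pairs listed in order of the receiver address y.
    """
    stored = []
    for i in range(n):
        for j in range(n):
            y = i * n + j          # receiver address
            x = j * n + i          # sender address that feeds it
            stored.append((x, y))
    return stored

def executor(stored, data):
    """Assemble a message from the stored relation; cheap enough to repeat."""
    buffer = [None] * len(stored)
    for i, (x, _) in enumerate(stored):
        buffer[i] = data[x]
    return buffer

stored = inspector(2)                      # done once
for step in range(3):                      # repeated many times
    data = [step * 10 + k for k in range(4)]
    message = executor(stored, data)
print(stored, message)
```

The address computation is paid once in `inspector`; each later iteration only reads stored pairs and copies data.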
Slide 6: Inspector/Executor [Saltz, et al]
do i=1,1000
  call Work()
  call COPY()
  call Work()
enddo
[Figure: timelines of iterations i=1, 2, 3 under in-line computation and under inspector/executor; the inspector appears once, followed by an executor in each iteration]
Slide 7: Context: Array Assignments
dim A(N,N),B(N,N)
do i=1,1000
  call Work(A)
  call Work(B)
end
[Figure: abstraction of the assignment B=A from Array A to Array B]
We concentrate on B=A and B=TRANSPOSE(A)
More general forms exist
Slide 8: Distributed Arrays
Regular block-cyclic distributions as in High Performance Fortran (HPF)
[Figure: one panel per distribution, (*,BLOCK), (*,CYCLIC), and (*,CYCLIC(k)), each showing the elements processor 0 owns and the local array on processor 0]
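To make the three distributions concrete, the sketch below computes which processor owns each index along the distributed dimension. It assumes 0-based indices (HPF itself is 1-based) and the usual HPF conventions that BLOCK uses chunks of ceil(n/p) indices and CYCLIC is CYCLIC(1); the function names are illustrative.

```python
import math

def owner_block(i, n, p):
    """BLOCK: contiguous chunks of ceil(n/p) indices per processor."""
    return i // math.ceil(n / p)

def owner_cyclic(i, p, k=1):
    """CYCLIC(k): blocks of k indices dealt round-robin; CYCLIC is k=1."""
    return (i // k) % p

def local_index_cyclic(i, p, k=1):
    """Position of global index i within its owner's local array."""
    return (i // (k * p)) * k + i % k

n, p = 16, 4
block = [owner_block(i, n, p) for i in range(n)]
cyclic = [owner_cyclic(i, p) for i in range(n)]
cyclic2 = [owner_cyclic(i, p, k=2) for i in range(n)]

# Global indices processor 0 owns under CYCLIC(2), and where they land locally.
owned_by_0 = [i for i in range(n) if cyclic2[i] == 0]
local_on_0 = [local_index_cyclic(i, p, k=2) for i in owned_by_0]
print(block, cyclic, cyclic2, owned_by_0, local_on_0, sep="\n")
```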
Slide 9: Representative Assignments
(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), transpose
Slide 10: Representing Address Relations
General Purpose
Space Efficiency
Hardware Limited Performance
In-line expansion
Slide 11: AAPAIR: Simple Representation
Simple sequence of pointer pairs
{(A,F),(B,D),(C,E)}
[Figure: sending node memory (A, B, C), receiving node memory (D, E, F), and the stored pointer pairs]
PROBLEM: Space Efficiency
PROBLEM: Performance
Slide 12: AABLK: Run-length Encoding
Sequence of pointer, pointer, length triples
{(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)}
[Figure: the relation stored as the triples (A,F,2), (B,D,2), (C,E,2)]
PROBLEM: Strided Access
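A pointer, pointer, length encoding of this kind can be sketched as a run-length pass over the pair sequence. The code assumes addresses are counted in element units, so a run is a sequence of pairs in which both addresses advance by 1; the numeric addresses and function names are hypothetical.

```python
def encode_aablk(pairs):
    """Merge runs of consecutive (x, y) pairs into (x, y, length) triples."""
    triples = []
    for x, y in pairs:
        if triples:
            sx, sy, n = triples[-1]
            if x == sx + n and y == sy + n:   # pair extends the current run
                triples[-1] = (sx, sy, n + 1)
                continue
        triples.append((x, y, 1))
    return triples

def decode_aablk(triples):
    """Expand each triple back into its unit-stride pairs."""
    return [(x + i, y + i) for x, y, n in triples for i in range(n)]

# Hypothetical numeric addresses for the symbolic A..F.
A, B, C = 100, 200, 300
D, E, F = 500, 600, 700
pairs = [(A, F), (A + 1, F + 1), (B, D), (B + 1, D + 1), (C, E), (C + 1, E + 1)]
triples = encode_aablk(pairs)
print(triples)
```

Runs only form when both sides are contiguous, which is why strided access defeats this encoding.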
Slide 13: DMRLE: Handling Strides
Sequence of offset, offset, length triples
{(A,F),(B,E),(C,D)}
B-A = C-B = g
E-F = D-E = h
[Figure: the relation stored as the triples (A,F,1), (g,h,2)]
PROBLEM: Repeated Strides
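One reading of the offset, offset, length triples is: starting from (0, 0), each triple (dx, dy, n) advances the current sender and receiver addresses by (dx, dy) n times, emitting a pair after each step. The sketch below implements that reading; the interpretation, the numeric values chosen for A, F, g, and h, and the function names are all assumptions rather than details confirmed by the slides.

```python
def encode_dmrle(pairs):
    """Run-length encode the differences between successive pairs."""
    triples = []
    px, py = 0, 0                      # previous pair; first offset is from (0, 0)
    for x, y in pairs:
        dx, dy = x - px, y - py
        if triples and triples[-1][:2] == (dx, dy):
            triples[-1] = (dx, dy, triples[-1][2] + 1)
        else:
            triples.append((dx, dy, 1))
        px, py = x, y
    return triples

def decode_dmrle(triples):
    """Replay the offsets to recover the original pair sequence."""
    pairs, x, y = [], 0, 0
    for dx, dy, n in triples:
        for _ in range(n):
            x, y = x + dx, y + dy
            pairs.append((x, y))
    return pairs

# Hypothetical values: sender stride g = 8, receiver stride h = -16.
A, F, g, h = 100, 700, 8, -16
pairs = [(A, F), (A + g, F + h), (A + 2 * g, F + 2 * h)]   # (A,F), (B,E), (C,D)
triples = encode_dmrle(pairs)
print(triples)
```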
Slide 14: DMRLEC: Repeated Strides
Sequence of indices into table of offset, offset, length triples
{(A,F),(B,E),(C,D),(A’,F’),(B’,E’),(C’,D’)}
B-A = C-B = B’-A’ = C’-B’ = g
E-F = D-E = E’-F’ = D’-E’ = h
A’-C = u and F’-D = v
[Figure: table of triples 0: (A,F,1), 1: (g,h,2), 2: (u,v,1); index sequence 0 1 2 1]
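The table-plus-indices idea can be sketched as deduplicating the offset, offset, length triples and storing only their table positions. As before, this assumes each triple (dx, dy, n) means "advance by (dx, dy) n times, starting from (0, 0)"; that reading, the numeric values, and the function names are assumptions made for illustration.

```python
def encode_dmrlec(triples):
    """Replace repeated triples by indices into a table of distinct triples."""
    table, index_of, indices = [], {}, []
    for t in triples:
        if t not in index_of:
            index_of[t] = len(table)
            table.append(t)
        indices.append(index_of[t])
    return table, indices

def decode_dmrlec(table, indices):
    """Look up each triple and replay its offsets to recover the pairs."""
    pairs, x, y = [], 0, 0
    for idx in indices:
        dx, dy, n = table[idx]
        for _ in range(n):
            x, y = x + dx, y + dy
            pairs.append((x, y))
    return pairs

# Hypothetical values for the symbols on the slide.
A, F, g, h, u, v = 100, 700, 8, -16, 984, 1032
triples = [(A, F, 1), (g, h, 2), (u, v, 1), (g, h, 2)]
table, indices = encode_dmrlec(triples)
pairs = decode_dmrlec(table, indices)
print(table, indices, pairs, sep="\n")
```

With these values the repeated (g, h, 2) stride is stored once in the table and referenced twice by index.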
Slide 15: Address Relation Storage Costs
[Figure: total storage (bytes), log scale from 1 to 10000000, for various testcases; series: AAPAIR, AABLK, DMRLE, DMRLEC]
Slide 16: Copying & Superscalar Plateau
Maximum number of non load/store instructions before copy bandwidth suffers
Plateau = np = 2*3 = 6
[Figure: timeline of the copy loop's loads and stores separated by stalls, showing the free issue slots at each issue time t; n and p are dimensions labeled in the figure]
Slide 17: Paragon: No Superscalar Plateau
[Figure: copy rate (MB/s), 0 to 35, versus extra instructions in copy loop, 0 to 70]
Slide 18: Pentium 90: Clear Plateau
[Figure: copy rate (MB/s), 0 to 18, versus extra instructions in copy loop, 0 to 70]
Slide 19: DEC 3K/400: Complex Plateau
[Figure: copy rate (MB/s), 0 to 45, versus extra instructions in copy loop, 0 to 70]
Slide 20: Measurement Details
Portable library written in C
Four representative assignments
512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
Six machines
Assembly and disassembly rates
Slide 21: Measurement Testcases
(BLOCK,*) to (*,BLOCK)
(BLOCK,*) to (CYCLIC,*)
(CYCLIC,*) to (BLOCK,*)
(*,CYCLIC) to (CYCLIC,*), transpose
Slide 22: Performance: DEC 3K/400
[Figure: message assembly rate (MB/s), 0 to 45, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Slide 23: Performance: IBM 250 (PPC601)
[Figure: message assembly rate (MB/s), 0 to 35, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Slide 24: Performance: IBM SP2 (PWR2)
[Figure: message assembly rate (MB/s), 0 to 70, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Slide 25: Performance: Paragon
[Figure: message assembly rate (MB/s), 0 to 35, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, DMRLEC, Memory]
Slide 26: Performance: Pentium 90
[Figure: message assembly rate (MB/s), 0 to 20, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]
Slide 27: Performance: Pentium 133
[Figure: message assembly rate (MB/s), 0 to 50, for the testcases (B,*) to (*,B), (B,*) to (C,*), (C,*) to (B,*), and (*,C) to (C,*) T; series: AAPAIR, AABLK, DMRLE, DMRLEC, Memory]
Slide 28: Conclusions
Exploit “Superscalar Plateau” using compact address relation encodings
Cheap enough even for scalar machines
Generalized data transfer with hardware-limited throughput
Many possible applications
Slide 29: Copying with Address Relations
[Figure: block diagram of a copy engine and an address relation decoder; labels: sender data addresses, receiver data addresses, data items, address relation addresses, address relation data]