Culler 1997CS267 L28 Sort.1
CS 267 Applications of Parallel Computers
Lecture 28: LogP and the
Implementation and Modeling of Parallel Sorts
James Demmel
(taken from David Culler,
Lecture 18, CS267, 1997)
http://www.cs.berkeley.edu/~demmel/cs267_Spr99
Practical Performance Target (circa 1992)
• Sort one billion large keys in one minute on one thousand processors.
• Good sort on a workstation can do 1 million keys in about 10 seconds
– just fits in memory
– 16 bit Radix Sort
• Performance unit: µs per key per processor
– ~10 µs for a single Sparc 2
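The two figures on this slide can be put in the same unit with a little arithmetic. The sketch below (in Python rather than Split-C; the function name is hypothetical) converts "keys sorted in a given time on a given number of processors" into µs per key per processor.

```python
def us_per_key_per_processor(keys, seconds, processors):
    """Time budget in microseconds per key per processor."""
    keys_per_processor = keys / processors
    return seconds * 1e6 / keys_per_processor

# Target: one billion keys, one minute, one thousand processors.
target = us_per_key_per_processor(keys=1e9, seconds=60, processors=1000)
# Workstation: one million keys in about 10 seconds on one processor.
workstation = us_per_key_per_processor(keys=1e6, seconds=10, processors=1)
print(target, workstation)
```

Under these numbers the parallel target allows several times the single-workstation cost per key, and that slack is what communication has to fit into.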
Studies on Parallel Sorting
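[Figure: kinds of parallel sorting studies: Sorting Networks; PRAM Sorts (processors p sharing one memory MEM); Sorting on Network Y (processor/memory P/M nodes joined by a network); Sorting on Machine X; LogP Sorts]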
The Study
• Interesting parallel sorting algorithms (Bitonic, Column, Histo-radix, Sample)
• Analyze under LogP, with parameters for CM-5, to estimate execution time
• Implement in Split-C and execute on CM-5
• Compare ??
LogP
Deriving the LogP Model
° Processing
– powerful microprocessor, large DRAM, cache => P
° Communication
+ significant latency (100's of cycles) => L
+ limited bandwidth (1 – 5% of memory bw) => g
+ significant overhead (10's – 100's of cycles) => o
– on both ends
– no consensus on topology
=> should not exploit structure
+ limited capacity
– no consensus on programming model
=> should not enforce one
LogP
[Figure: P processor/memory (P/M) modules attached to an interconnection network, annotated with o (overhead), L (latency), g (gap), and P (processors)]
• L: latency in sending a (small) message between modules
• o: overhead felt by the processor on sending or receiving a message
• g: gap between successive sends or receives (1/BW)
• P: processors
• Limited volume (L/g to or from a processor)
Using the Model
° Send n messages from proc to proc in time 2o + L + g(n-1)
– each processor does o·n cycles of overhead
– has (g-o)(n-1) + L available compute cycles
° Send n messages from one to many in same time
° Send n messages from many to one in same time
– all but L/g processors block, so fewer available cycles
[Figure: timeline for two processors P showing o, L, and g for successive messages]
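The point-to-point formulas on this slide are easy to evaluate directly. The following is a minimal Python sketch (not Split-C); the `LogP` class and function names are hypothetical, the CM-5 values of L, o, and g are the ones quoted later in the deck, and P = 64 is only an illustrative choice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogP:
    L: float  # latency (µs)
    o: float  # send/receive overhead (µs)
    g: float  # gap between successive messages (µs)
    P: int    # number of processors

def send_time(m: LogP, n: int) -> float:
    """Time to send n messages from one processor to another: 2o + L + g(n-1)."""
    return 2 * m.o + m.L + m.g * (n - 1)

def overhead_time(m: LogP, n: int) -> float:
    """Overhead paid by each processor for n messages: o*n."""
    return m.o * n

def available_time(m: LogP, n: int) -> float:
    """Compute time left over while the messages are in flight: (g-o)(n-1) + L."""
    return (m.g - m.o) * (n - 1) + m.L

cm5 = LogP(L=6.0, o=2.2, g=4.0, P=64)  # P = 64 is an assumed machine size
print(send_time(cm5, 100), overhead_time(cm5, 100), available_time(cm5, 100))
```

Because the formulas use g rather than o as the per-message term, the sketch is only meaningful when g >= o, as it is for the CM-5 values.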
Use of the Model (cont)
° Two processors sending n words to each other (i.e., exchange) in time
2o + L + max(g,2o) (n-1) max(g,2o) + L
° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts (e.g., transpose, FFT, cyclic-to-block, block-to-cyclic): same time
Exercise: what’s wrong with the formula above?
– It assumes an optimal pattern of send/receive, so it could underestimate the time
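As a rough illustration of the exchange cost, the sketch below evaluates the leading expression from this slide, 2o + L + max(g, 2o)(n-1), in Python. The function name is hypothetical, and the CM-5 parameter values are taken from the parameter slide later in the deck.

```python
def exchange_time(n, L, o, g):
    """Two processors each sending n words to the other: 2o + L + max(g, 2o)(n-1).

    Each processor both sends and receives every step, so it pays 2o per word;
    the per-word cost is whichever of g and 2o is larger.
    """
    return 2 * o + L + max(g, 2 * o) * (n - 1)

# CM-5 values (µs): L = 6, o = 2.2, g = 4, so 2o = 4.4 exceeds g.
print(exchange_time(1, L=6.0, o=2.2, g=4.0))
print(exchange_time(101, L=6.0, o=2.2, g=4.0))
```

As the exercise on the slide points out, this assumes an ideal interleaving of sends and receives, so it is a lower-bound style estimate rather than a guaranteed time.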
LogP "philosophy"
• Think about:
– mapping of N words onto P processors
– computation within a processor, its cost, and balance
– communication between processors, its cost, and balance
• given a characterization of processor and network performance
• Do not think about what happens within the network
This should be good enough!
Typical Sort
Exploits the n = N/P grouping
° Significant local computation
° Very general global communication / transformation
° Computation of the transformation
Split-C
[Figure: global address space spanning processors P0, P1, ..., Pprocs-1, each with a local region]
• Explicitly parallel C
• 2D global address space
– linear ordering on local spaces
• Local and Global pointers
– spread arrays too
• Read/Write
• Get/Put (overlap compute and comm)
– x := G; . . .
– sync();
• Signaling store (one-way)
– G :- x; . . .
– store_sync(); or all_store_sync();
• Bulk transfer
• Global comm.
Basic Costs of operations in Split-C
• Read, Write: x = *G, *G = x; cost 2 (L + 2o)
• Store: *G :- x; cost L + 2o
• Get: x := *G (cost o); .... (2L + 2o); sync(); (cost o)
– with interval g
• Bulk store (n words with words/message): 2o + (n-1)g + L
• Exchange: 2o + 2L + (nL/g) max(g,2o)
• One to many
• Many to one
LogP model
• CM5:
– L = 6 µs
– o = 2.2 µs
– g = 4 µs
– P varies from 32 to 1024
• NOW
– L = 8.9
– o = 3.8
– g = 12.8
– P varies up to 100
• What is the processor performance?
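One way to read these parameters is to ask which term limits a steady stream of messages when a processor both sends and receives: the gap g or twice the overhead, 2o. The Python sketch below does that comparison; the dictionary layout and function names are hypothetical, and the NOW values are assumed to be in µs like the CM-5 ones, which the slide does not state.

```python
# LogP parameters from the slide; NOW units are assumed to be µs.
MACHINES = {
    "CM-5": {"L": 6.0, "o": 2.2, "g": 4.0},
    "NOW": {"L": 8.9, "o": 3.8, "g": 12.8},
}

def per_message_cost(o, g):
    """Steady-state cost per message when sending and receiving: max(g, 2o)."""
    return max(g, 2 * o)

def bottleneck(o, g):
    """Name the limiting term: the network gap or the processor overhead."""
    return "gap" if g >= 2 * o else "overhead"

for name, m in MACHINES.items():
    print(name, per_message_cost(m["o"], m["g"]), bottleneck(m["o"], m["g"]))
```

With these values the CM-5 comes out overhead-bound and the NOW gap-bound.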
Sorting
Local Sort Performance (11-bit radix sort of 32-bit numbers)
[Figure: µs/key (0 to 10) versus log N/P (0 to 20), one curve per entropy in key values: 31, 25.1, 16.9, 10.4, 6.2; a range of the N/P axis is marked "TLB misses"]
Entropy = -Σ_i p_i log p_i, where p_i = probability of key i
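The entropy measure used to label the key distributions can be computed from observed key frequencies. This is a minimal Python sketch; the function name is hypothetical, and it assumes base-2 logarithms so that the result is in bits, which matches the "Entropy (bits)" axes used elsewhere in the deck.

```python
import math
from collections import Counter

def key_entropy_bits(keys):
    """Empirical entropy -sum_i p_i log2 p_i of a list of keys, in bits."""
    counts = Counter(keys)
    total = len(keys)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(key_entropy_bits([0, 1, 2, 3]))   # four equally likely keys
print(key_entropy_bits([7, 7, 7, 7]))   # a single repeated key
```

Uniformly random 32-bit keys sit at the high-entropy end of the plots, while a single repeated key has entropy 0.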
Local Computation Parameters - Empirical

| Parameter | Operation | µs per key | Sort |
|---|---|---|---|
| swap | Simulate cycle butterfly per key | 0.025 lg N | Bitonic |
| mergesort | Sort bitonic sequence | 1.0 | Bitonic |
| scatter | Move key for cyclic-to-block | 0.46 | Bitonic |
| gather | Move key for block-to-cyclic | 0.52 if n <= 64K or P <= 64; 1.1 otherwise | Bitonic & Column |
| local sort | Local radix sort (11 bit) | 4.5 if n < 64K; 9.0 - (281000/n) otherwise | Bitonic & Column |
| merge | Merge sorted lists | 1.5 | Column |
| copy | Shift key | 0.5 | Column |
| zero | Clear histogram bin | 0.2 | Radix |
| hist | Produce histogram | 1.2 | Radix |
| add | Produce scan value | 1.0 | Radix |
| bsum | Adjust scan of bins | 2.5 | Radix |
| address | Determine destination | 4.7 | Radix |
| compare | Compare key to splitter | 0.9 | Sample |
| localsort8 | Local radix sort of samples | 5.0 | Sample |
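The piecewise entries in this table translate directly into small cost functions. The Python sketch below encodes the local sort and gather rows; the function names are hypothetical and "64K" is assumed to mean 65536 keys.

```python
K64 = 64 * 1024  # assumption: "64K" on the slide means 65536 keys

def local_sort_us_per_key(n):
    """Empirical local radix sort cost: 4.5 if n < 64K, else 9.0 - 281000/n."""
    if n < K64:
        return 4.5
    return 9.0 - 281000.0 / n

def gather_us_per_key(n, P):
    """Empirical block-to-cyclic gather cost: 0.52 if n <= 64K or P <= 64, else 1.1."""
    return 0.52 if (n <= K64 or P <= 64) else 1.1

for n in (16384, 65536, 1048576):
    print(n, local_sort_us_per_key(n), gather_us_per_key(n, P=512))
```

The jump in cost per key as n grows is the kind of storage-hierarchy effect the deck attributes to TLB misses.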
Bottom Line (Preview)
[Figure: µs/key (0 to 140) versus N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sorts, each at 1024 and 32 processors]
• Good fit between predicted and measured (10%)
• Different sorts for different sorts
– scaling by processor, input size, sensitivity
• All are global / local hybrids
– the local part is hard to implement and model
Odd-Even Merge - classic parallel sort
N values to be sorted
• Treat as two lists of M = N/2: A0 A1 A2 A3 … AM-1 and B0 B1 B2 B3 … BM-1
• Sort each separately
• Redistribute into even and odd sublists: A0 A2 … AM-2 B0 B2 … BM-2 and A1 A3 … AM-1 B1 B3 … BM-1
• Merge into two sorted lists: E0 E1 E2 E3 … EM-1 and O0 O1 O2 O3 … OM-1
• Pairwise swaps of Ei and Oi will put it in order
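The steps above can be sketched sequentially in Python. This follows the standard Batcher odd-even merge, in which the final pass compares each O_i with E_(i+1); that index pairing is an assumption about how the slide's pairwise swap step lines up, and the function names are hypothetical. Lengths are assumed to be powers of two.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length M into one sorted list."""
    m = len(a)
    assert m == len(b) and m >= 1 and m & (m - 1) == 0
    if m == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])  # E0 .. E(M-1): even-indexed elements
    odds = odd_even_merge(a[1::2], b[1::2])   # O0 .. O(M-1): odd-indexed elements
    out = [evens[0]]                          # the global minimum is always E0
    for i in range(m - 1):
        lo, hi = sorted((odds[i], evens[i + 1]))  # pairwise compare-and-swap
        out += [lo, hi]
    out.append(odds[-1])                      # the global maximum is always O(M-1)
    return out

def odd_even_merge_sort(keys):
    """Sort a list whose length is a power of two by recursive odd-even merging."""
    if len(keys) == 1:
        return list(keys)
    half = len(keys) // 2
    return odd_even_merge(odd_even_merge_sort(keys[:half]),
                          odd_even_merge_sort(keys[half:]))

print(odd_even_merge([2, 3, 4, 8], [1, 5, 6, 7]))
```

The two recursive merges are independent of each other, and so are all the compare-and-swap pairs in the final pass, which is where the parallelism comes from.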
Where’s the Parallelism?
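[Figure: the recursive merge producing E0 E1 E2 E3 … EM-1 and O0 O1 O2 O3 … OM-1, with levels labeled by number and size of subproblems: 1xN, 2xN/2, 4xN/4]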
Mapping to a Butterfly (or Hypercube)
[Figure: butterfly network taking two sorted sublists A0 A1 A2 A3 and B0 B1 B2 B3; example input 2 3 4 8 and 7 6 5 1 (second list reversed), output 1 2 3 4 5 6 7 8]
• Reverse order of one list via cross edges
• Pairwise swaps on way back
Bitonic Sort with N/P per node
all_bitonic(int A[PROCS]::[n])
  sort(tolocal(&A[ME][0]),n,0);
  for (d = 1; d <= logProcs; d++)
    for (i = d-1; i >= 0; i--) {
      swap(A,T,n,pair(i));
      merge(A,T,n,mode(d,i));
    }
  sort(tolocal(&A[ME][0]),n,mask(i));
[Figure: alternating sort and swap phases]
A bitonic sequence decreases and then increases (or vice versa).
Bitonic sequences can be merged like monotonic sequences.
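To make the bitonic merge concrete, here is a sequential Python sketch of the classic recursive bitonic sort. It is not the Split-C routine from the slide and does not model the N/P blocking or communication; the function names are hypothetical and the input length is assumed to be a power of two.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence (power-of-two length) with compare-and-swap stages."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lows, highs = [], []
    for x, y in zip(seq[:half], seq[half:]):   # compare element i with i + n/2
        lows.append(min(x, y))
        highs.append(max(x, y))
    # Both halves are again bitonic, and every low is <= every high.
    first, second = (lows, highs) if ascending else (highs, lows)
    return bitonic_merge(first, ascending) + bitonic_merge(second, ascending)

def bitonic_sort(keys, ascending=True):
    """Sort by building an increasing-then-decreasing sequence and merging it."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    half = n // 2
    up = bitonic_sort(keys[:half], True)
    down = bitonic_sort(keys[half:], False)
    return bitonic_merge(up + down, ascending)

print(bitonic_merge([2, 3, 4, 8, 7, 6, 5, 1]))
print(bitonic_sort([5, 1, 7, 3, 8, 2, 6, 4]))
```

Every compare-and-swap stage touches fixed index pairs regardless of the data, which is what lets the stages map onto a butterfly or hypercube.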
Bitonic Sort
• Block layout: lg N/P stages are local sort
• Remaining stages involve
– block-to-cyclic, local merges (i - lg N/P cols)
– cyclic-to-block, local merges (lg N/P cols within stage)
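The two remaps named here are pure index permutations. The Python sketch below simulates them on a list of per-processor lists, assuming the usual conventions that a block layout puts global index k on processor k // n and a cyclic layout puts it on processor k % P; the function names are hypothetical.

```python
def block_to_cyclic(blocks):
    """Remap P lists of n keys from block layout to cyclic layout."""
    P = len(blocks)
    flat = [key for block in blocks for key in block]   # global order, index k
    return [flat[p::P] for p in range(P)]               # processor p gets k % P == p

def cyclic_to_block(cyclic):
    """Inverse remap: from cyclic layout back to block layout."""
    P, n = len(cyclic), len(cyclic[0])
    flat = [cyclic[k % P][k // P] for k in range(P * n)]
    return [flat[p * n:(p + 1) * n] for p in range(P)]

blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]
cyclic = block_to_cyclic(blocks)
print(cyclic)
print(cyclic_to_block(cyclic))
```

In both directions each processor sends n/P keys to every processor, the balanced all-to-all pattern that the LogP formulas earlier in the deck are meant to cost.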
Analysis of Bitonic
• How do you do transpose?
• Reading Exercise
Bitonic Sort: time per key
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 80) versus N/P (16384 to 1048576), with one curve each for 512, 256, 128, 64, and 32 processors]
Bitonic: Breakdown
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 80) versus N/P (16384 to 1048576), broken down into Remap B-C, Remap C-B, Mergesort, Swap, and Localsort; P = 512, random keys]
Bitonic: Effect of Key Distributions
[Figure: µs/key (0 to 45) versus entropy (0, 6, 10, 17, 25, 31 bits), broken down into Swap, Merge Sort, Local Sort, Remap C-B, and Remap B-C; P = 64, N/P = 1 M]
Column Sort
(1) Sort
(2) Transpose - block to cyclic
(3) Sort
(4) Transpose - cyclic to block w/o scatter
(5) Sort
(6) Shift
(7) Merge
(8) Unshift
Work efficient
Treat data like an n x P array, with n >= P^2, i.e., N >= P^3
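The eight steps can be simulated sequentially. The Python sketch below follows the classical columnsort, treating each of the P columns as one processor's n keys; the function names are hypothetical. It asserts the classical requirements (n even, P dividing n, n >= 2(P-1)^2), which are an assumption of this sketch and stricter than the n >= P^2 rule of thumb on the slide. Steps (6) to (8) are folded into sorting windows that straddle adjacent columns.

```python
def _sort_columns(flat, r):
    """Sort each length-r column of a column-major flat list."""
    return [k for j in range(0, len(flat), r) for k in sorted(flat[j:j + r])]

def column_sort(cols):
    """Sort s columns of r keys each; the result is sorted in column-major order."""
    s, r = len(cols), len(cols[0])
    assert r % 2 == 0 and r % s == 0 and r >= 2 * (s - 1) ** 2
    flat = _sort_columns([k for c in cols for k in c], r)        # (1) sort
    t = [None] * (r * s)
    for k, key in enumerate(flat):                               # (2) block to cyclic:
        t[(k % s) * r + k // s] = key                            #     key k goes to column k % s
    flat = _sort_columns(t, r)                                   # (3) sort
    flat = [flat[(k % s) * r + k // s] for k in range(r * s)]    # (4) cyclic to block
    flat = _sort_columns(flat, r)                                # (5) sort
    half = r // 2
    for j in range(s - 1):                                       # (6)-(8) shift, merge, unshift:
        lo = j * r + half                                        #     sort each window that spans
        flat[lo:lo + r] = sorted(flat[lo:lo + r])                #     two neighbouring columns
    return [flat[j * r:(j + 1) * r] for j in range(s)]

print(column_sort([[8, 7, 6, 5], [4, 3, 2, 1]]))
```

The only communication is the two transposes and the half-column shift, all with fixed, data-independent patterns.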
Column Sort: Times
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 40) versus N/P (16384 to 1048576), with one curve each for 512, 256, 128, 64, and 32 processors]
Only works for N >= P^3
Column: Breakdown
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 40) versus N/P (16384 to 1048576), broken down into Sort1, Sort2, Sort3, Merge, Trans, Untrans, Shift, and Unshift; P = 64, random keys]
Column: Key distributions
[Figure: µs/key (0 to 35) versus entropy (0, 6, 10, 17, 25, 31 bits), broken down into Merge, Sorts, Remaps, and Shifts; P = 64, N/P = 1M]
Histo-radix sort
[Figure: keys spread over P processors, n = N/P per processor, with 2^r buckets]
Per pass:
1. compute local histogram
2. compute position of 1st member of each bucket in global array
– 2^r scans with end-around
3. distribute all the keys
Only r = 8, 11, 16 make sense for sorting 32 bit numbers
Histo-Radix Sort (again)
[Figure: local data and local histograms on each of P processors]
Each pass:
– form local histograms
– form global histogram
– globally distribute data
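The per-pass structure can be simulated with P lists standing in for the processors. The Python sketch below is sequential and ignores communication cost; the function name is hypothetical, and it assumes non-negative integer keys and a block layout in which processor 0 ends up holding the smallest keys.

```python
def histo_radix_sort(local_keys, r=8, key_bits=32):
    """LSD radix sort over P simulated processors, r bits per pass."""
    P = len(local_keys)
    sizes = [len(keys) for keys in local_keys]
    buckets = 1 << r
    for shift in range(0, key_bits, r):
        digit = lambda k: (k >> shift) & (buckets - 1)
        # 1. each processor forms its local histogram
        hist = [[0] * buckets for _ in range(P)]
        for p, keys in enumerate(local_keys):
            for k in keys:
                hist[p][digit(k)] += 1
        # 2. global scan: first position of each (bucket, processor) pair,
        #    ordered by bucket and then by processor so the pass stays stable
        start = [[0] * buckets for _ in range(P)]
        pos = 0
        for b in range(buckets):
            for p in range(P):
                start[p][b] = pos
                pos += hist[p][b]
        # 3. distribute every key to its global position
        out = [None] * sum(sizes)
        for p, keys in enumerate(local_keys):
            for k in keys:
                out[start[p][digit(k)]] = k
                start[p][digit(k)] += 1
        # cut the global array back into per-processor blocks
        local_keys, offset = [], 0
        for size in sizes:
            local_keys.append(out[offset:offset + size])
            offset += size
    return local_keys

print(histo_radix_sort([[9, 300, 2], [70000, 1, 256]], r=8))
```

Step 3 is the expensive part on a real machine: every key may move to a different processor on every pass, and skewed key distributions concentrate that traffic on a few destinations.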
Radix Sort: Times
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 140) versus N/P (16384 to 1048576), with one curve each for 512, 256, 128, 64, and 32 processors]
Radix: Breakdown
[Figure: µs/key (0 to 140) versus N/P (16384 to about 1E+06), broken down into Dist, GlobalHist, and LocalHist, alongside Dist-m, GlobalHist-m, and LocalHist-m]
Radix: Key distribution
[Figure: µs/key (0 to 70) versus entropy (0, 6, 10, 17, 25, 31) plus a Cyclic distribution, broken down into Dist, Global Hist, and Local Hist; one value is labeled 118]
Slowdown due to contention in redistribution
Radix: Stream Broadcast Problem
Broadcasting a stream of n values: (P-1)(2o + L + (n-1)g)?
Need to slow first processor to pipeline well
What’s the right communication mechanism?
• Permutation via writes
– consistency model?
– false sharing?
• Reads?
• Bulk Transfers?
– what do you need to change in the algorithm?
• Network scheduling?
Sample Sort
1. compute P-1 values of keys that
would split the input into roughly equal pieces.
– take S~64 samples per processor
– sort PS keys
– take key S, 2S, . . . (P-1)S
– broadcast splitters
2. Distribute keys based on splitters
3. Local sort
[4.] possibly reshift
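The steps above can be simulated with P lists standing in for the processors. This Python sketch is sequential and leaves out the optional reshift; the function name is hypothetical, sampling uniformly at random with replacement is an assumption (the slide only says to take about 64 samples per processor), and every processor is assumed to hold at least one key.

```python
import bisect
import random

def sample_sort(local_keys, S=64, seed=0):
    """Sample sort over P simulated processors; returns P sorted buckets."""
    P = len(local_keys)
    rng = random.Random(seed)
    # 1. take S samples per processor, sort the P*S samples,
    #    and pick keys S, 2S, ..., (P-1)S as the P-1 splitters
    samples = sorted(rng.choice(keys) for keys in local_keys for _ in range(S))
    splitters = [samples[j * S] for j in range(1, P)]
    # 2. distribute each key to the processor whose splitter range contains it
    buckets = [[] for _ in range(P)]
    for keys in local_keys:
        for k in keys:
            buckets[bisect.bisect_right(splitters, k)].append(k)
    # 3. local sort on every processor
    return [sorted(bucket) for bucket in buckets]

data = [[9, 4, 7, 1], [8, 2, 6, 3], [5, 12, 10, 11]]
print(sample_sort(data))
```

The result is globally sorted across buckets, but bucket sizes depend on how well the samples represent the input, which is why a final reshift may be needed to rebalance.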
Sample Sort: Times
[Figure: two panels, Predicted and Measured, each plotting µs/key (0 to 30) versus N/P (16384 to 1048576), with one curve each for 512, 256, 128, 64, and 32 processors]
Sample Breakdown
[Figure: µs/key (0 to 30) versus N/P (16384 to 1048576), broken down into Split, Sort, and Dist, alongside Split-m, Sort-m, and Dist-m]
Comparison
[Figure: µs/key (0 to 140) versus N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sorts, each at 1024 and 32 processors]
• Good fit between predicted and measured (10%)
• Different sorts for different sorts
– scaling by processor, input size, sensitivity
• All are global / local hybrids
– the local part is hard to implement and model
Conclusions
• Distributed memory model leads to hybrid global / local algorithms
• LogP model is good enough for the global part
– bandwidth (g) or overhead (o) matter most
– including end-point contention
– latency (L) only matters when BW doesn’t
– g is going to be what really matters in the days ahead (NOW)
• Local computational performance is hard!
– dominated by effects of storage hierarchy (TLBs)
– getting trickier with multilevels
» physical address determines L2 cache behavior
– and with real computers at the nodes (VM)
– and with variations in model
» cycle time, caches, . . .
• See http://www.cs.berkeley.edu/~culler/papers/sort.ps
• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps
– disk-to-disk parallel sorting