2010 Blue Waters Performance Modeling
Workshop – Opening and Introduction
Torsten Hoefler
1
With slides from: William Kramer, Marc Snir, William Gropp,
IBM, and the Blue Waters team
Introduction and Overview
• My slides contain only public information and
will be available online after the workshop
• No need to take pictures or notes!
• Parts of tomorrow will contain IBM confidential
information
• You may only attend the NDA session if your
institution signed and cleared all NDAs for you!
• You are responsible for maintaining the confidentiality
of the information!
2
Blue Waters in a Nutshell
• >300,000 compute cores
• based on Power7
• 10 PF/s peak
• 1 PF/s sustained
• >1 PiB RAM
• >10 PiB disk storage
• >0.5 EiB archival storage
3
Performance Modeling for Blue Waters
• Most users only have experience at
comparatively “small” scale (<8000 cores)
• Applications should be ready to run on the full
system
• This requires a clear understanding before the system is
deployed (a run, tweak, rerun loop is not possible)
Programmers need to develop a deep
understanding of application scaling and
bottlenecks at scale through performance modeling!
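To illustrate why small-scale experience does not carry over, here is a minimal sketch of an analytic performance model. It assumes a hypothetical application whose runtime is a perfectly divided compute term plus a tree-shaped collective that costs one latency step per level. The model form, the function names, and both parameter values are illustrative assumptions, not measurements of any Blue Waters code.

```python
import math

# Hypothetical model parameters (illustrative assumptions, not measurements).
SERIAL_WORK_S = 1.0e5    # total compute time on one core, in seconds
STEP_LATENCY_S = 2.0e-2  # cost of one level of a tree-shaped collective, in seconds

def modeled_time(p):
    """Runtime on p cores: evenly divided work plus a log2(p)-deep collective."""
    return SERIAL_WORK_S / p + STEP_LATENCY_S * math.log2(p)

def parallel_efficiency(p):
    """Speedup over one core divided by p, under the same model."""
    return modeled_time(1) / (p * modeled_time(p))

if __name__ == "__main__":
    # Compare the scale most users know with the full-system scale.
    for cores in (8_000, 300_000):
        print(f"{cores:>7} cores: efficiency {parallel_efficiency(cores):.2f}")
```

With these made-up parameters the model stays efficient at 8,000 cores but loses roughly half of its efficiency at 300,000 cores, an effect that small-scale runs alone would not reveal.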
4
From Chip to Entire Integrated System
5
[Figure: build-up from chip to the entire integrated system]
• P7 chip (8 cores)
• SMP node (32 cores)
• Drawer (256 cores)
• SuperNode (1,024 cores; 32 nodes / 4 CEC)
• Building block
• Blue Waters system
• NPCF
Other labels: on-line storage, near-line storage, L-Link cables
6
Power7 Chip (8 cores)
• Base Technology
• 45 nm, 576 mm²
• 1.2 B transistors
• Chip
• 8 cores
• 4 FMAs/cycle/core
• 32 MB L3 (private/shared)
• Dual DDR3 memory
• 128 GiB/s peak bandwidth
• (1/2 byte/flop)
• Clock range of 3.5 – 4 GHz
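The chip numbers above can be cross-checked with a few lines of arithmetic. This sketch assumes one FMA counts as two floating-point operations and reads "128 GiB/s" literally as 128 × 2^30 bytes per second. The constant and function names are made up for the example.

```python
CORES_PER_CHIP = 8
FMAS_PER_CYCLE_PER_CORE = 4
FLOPS_PER_FMA = 2                   # assumption: one FMA = one multiply + one add
MEM_BW_BYTES_PER_S = 128 * 2**30    # 128 GiB/s, as stated on the slide

def chip_peak_flops(clock_hz):
    """Peak floating-point rate of one chip at the given clock."""
    return CORES_PER_CHIP * FMAS_PER_CYCLE_PER_CORE * FLOPS_PER_FMA * clock_hz

def bytes_per_flop(clock_hz):
    """Peak memory bandwidth divided by peak floating-point rate."""
    return MEM_BW_BYTES_PER_S / chip_peak_flops(clock_hz)

if __name__ == "__main__":
    # Evaluate both ends of the stated clock range.
    for ghz in (3.5, 4.0):
        hz = ghz * 1e9
        print(f"{ghz} GHz: {chip_peak_flops(hz) / 1e9:.0f} GF/s peak, "
              f"{bytes_per_flop(hz):.2f} byte/flop")
```

Under these assumptions the ratio comes out slightly above one half across the clock range, consistent with the rounded "1/2 byte/flop" on the slide.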
7
Quad-chip MCM
L3 Cache/On-Chip Communication
8
• L1 32KB Instruction / core
• L1 32KB Data / core
• L2 = 256KB / core
• L3 = 4MB eDRAM / core
• Fast private and
shared region
[Figure: Quad Chip Module block diagram: four 8-core Power7 chips (P7-0 to P7-3), each with memory controllers MC0 and MC1, connected by links labeled A, B, C, W, X, Y, Z; clock groups A to D]
Quad Chip Module (4 chips)
• 32 cores!
• 32 cores * 8 F/cycle/core * 4 GHz ≈ 1 TF/s
• 4 threads per core (max)
• 4x32 MiB L3 cache
• 512 GB/s RAM BW (0.5 B/F)
• 800 W (0.8 W/GF)
• Flat shared memory!
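The module-level figures follow from the per-chip numbers. The sketch below redoes that arithmetic, assuming "8 F/core" means 8 floating-point operations per cycle per core and taking the 4 GHz clock used on the slide. Variable names are illustrative.

```python
CHIPS_PER_QCM = 4
CORES_PER_CHIP = 8
FLOPS_PER_CYCLE_PER_CORE = 8    # assumption: 4 FMAs per cycle, 2 flops each
CLOCK_HZ = 4.0e9
RAM_BW_BYTES_PER_S = 512e9      # 512 GB/s
POWER_W = 800.0

qcm_cores = CHIPS_PER_QCM * CORES_PER_CHIP
qcm_peak_flops = qcm_cores * FLOPS_PER_CYCLE_PER_CORE * CLOCK_HZ
qcm_bytes_per_flop = RAM_BW_BYTES_PER_S / qcm_peak_flops
qcm_watts_per_gflops = POWER_W / (qcm_peak_flops / 1e9)

if __name__ == "__main__":
    print(f"cores per QCM:    {qcm_cores}")
    print(f"peak:             {qcm_peak_flops / 1e12:.3f} TF/s")
    print(f"memory balance:   {qcm_bytes_per_flop:.2f} B/F")
    print(f"power efficiency: {qcm_watts_per_gflops:.2f} W per GF/s")
```

The power figure only works out to about 0.8 when expressed per GF/s of peak performance.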
9
Adding a Network Interface (Torrent)
10
[Figure: P7-IH Quad Block Diagram: 32-way SMP with 2-tier SMP fabric, 4-chip processor MCM + hub SCM with on-board optics. Labels: four 8-core Power7 chips (P7-0 to P7-3) with memory controllers MC0/MC1 and DIMM 0 to DIMM 15; hub chip module with 7 inter-hub board-level L-buses (3.0 Gb/s @ 8B+8B, 90% sustained peak); D0-D15 (320 GB/s) and Lr0-Lr23 (240 GB/s) links over 28x XMIT/RCV pairs @ 10 Gb/s; PCIe 16x, 16x, and 8x]
• Connects the QCM to PCI-e
• (two 16x and one 8x PCI-e slot)
• Connects 8 QCMs via a low-latency,
high-bandwidth copper fabric
• Provides a message passing
mechanism with very
high bandwidth
• Provides the lowest possible
latency between 8 QCMs
[Figure: Torrent hub chip block diagram. Labels: HUB to QCM connections (address/data) over 8B W, X, Y, and Z buses with TOD sync; L local: HUB to HUB copper board wiring, buses LL0 to LL6 (8B+8B); L remote: 4-drawer interconnect to create a supernode, 24 optical buses LR0 to LR23 (6x); D bus: interconnect of supernodes, 16 optical buses D0 to D15 (12x); PCI-E buses PX0 and PX1 (16x) and PX2 (8x) with hot-plug control; service interfaces (FSI, I2C, TPMD, SVIC, SEEPROM); 28 I2C links to optical modules]
1.1 TB/s HUB
11
• 192 GB/s Host Connection
• 336 GB/s to 7 other local nodes
• 240 GB/s to local-remote nodes
• 320 GB/s to remote nodes
• 40 GB/s to general purpose I/O
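The "1.1 TB/s" headline figure appears to be the sum of the five link classes listed above. A quick check, with dictionary keys chosen only for this example:

```python
# Per-class hub bandwidth in GB/s, copied from the bullets above.
HUB_BANDWIDTH_GB_S = {
    "host connection": 192,
    "7 other local nodes": 336,
    "local-remote nodes": 240,
    "remote nodes": 320,
    "general purpose I/O": 40,
}

total_gb_s = sum(HUB_BANDWIDTH_GB_S.values())

if __name__ == "__main__":
    print(f"total hub bandwidth: {total_gb_s} GB/s (~{total_gb_s / 1000:.1f} TB/s)")
```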
12
First Level Interconnect
L-Local
HUB to HUB Copper Wiring
256 Cores
[Figure: drawer board layout, 1st level local interconnect (256 cores): QCM 0 to QCM 7 (each with chips P7-0 to P7-3), HUB 0 to HUB 7, DIMM00 to DIMM15 for each of nodes N0 to N7, PCIe 1 to PCIe 17, DCA-0 (top) and DCA-1 (bottom) connectors, optical fan-out from hub modules (2,304 fiber 'L-Link', 64/40 optical 'D-Link')]
Drawer
• 8 nodes
• 32 chips
• 256 cores
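The core counts at each packaging level are products of the per-level fan-outs quoted in these slides. The sketch below multiplies them out. The constant names are illustrative, and the four-chips-per-node and four-drawers-per-supernode figures come from the Quad Chip Module and supernode slides.

```python
CORES_PER_CHIP = 8
CHIPS_PER_NODE = 4          # one Quad Chip Module per node
NODES_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4

cores_per_node = CORES_PER_CHIP * CHIPS_PER_NODE
chips_per_drawer = CHIPS_PER_NODE * NODES_PER_DRAWER
cores_per_drawer = cores_per_node * NODES_PER_DRAWER
nodes_per_supernode = NODES_PER_DRAWER * DRAWERS_PER_SUPERNODE
cores_per_supernode = cores_per_drawer * DRAWERS_PER_SUPERNODE

if __name__ == "__main__":
    print(f"node:      {cores_per_node} cores")
    print(f"drawer:    {chips_per_drawer} chips, {cores_per_drawer} cores")
    print(f"supernode: {nodes_per_supernode} nodes, {cores_per_supernode} cores")
```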
13
P7-IH Board Layout: 2nd Level of Interconnect (1,024 cores)
4.6 TB/s bisection BW
BW of 1150 10G-E ports
[Figure: four drawer board layouts (2nd level interconnect, 1,024 cores), each showing QCM 0 to QCM 7, HUB 0 to HUB 7 (61x96 mm), PCIe 1 to PCIe 17, DCA-0 (top) and DCA-1 (bottom) connectors, FSP/CLK-A and FSP/CLK-B, and optical fan-out from hub modules (2,304 fiber 'L-Link', 64/40 optical 'D-Link')]
14
Second Level Interconnect
Optical ‘L-Remote’ Links from HUB
4 drawers
1,024 Cores
L-Link Cables
Super Node (32 Nodes / 4 CEC)
Supernode
Global Interconnection Network
15
• This space is intentionally left blank
• More details in the NDA sessions
A photo of the RAM for distraction.
National Petascale Computing Facility
16
A facility dedicated to Blue Waters
Back to Performance Modeling
17
• Main goals of this workshop:
• Ignite performance modeling efforts within all
PRAC teams in collaboration with NCSA
• Start to gather a deep understanding of the
performance characteristics of all codes
Logistics
18
• Today:
• A team from LANL will present a tutorial about
performance modeling
• Specific examples and use-cases
• Tomorrow:
• Hands-on sessions to get modeling of
applications started
• Supported by LANL and NCSA teams
• Try to work with your PoC
NDA issues
19
• Not all participants are covered by all
necessary NDAs
• Badges will be marked
• Please be careful what you talk about
• You are responsible for the information
• Everything in my slides can be communicated
freely!
I’m here to help!
20
• We have 15 training accounts on a Power5 system
available for tomorrow
• It runs AIX
• Ask me if you need one
• Let me know if you have questions, problems,
or comments!