2012 ATLAS Technical Interchange Meeting
Annecy, France
Stephen Gray
Dell Global CERN/LHC Technologist
+1.512.574.5032 | [email protected]
Dell LHC Program
Building a “Bulldozer” Processor
• Each processor die is composed of 4 “Bulldozer” modules
• Module divisions are transparent to shared hardware, operating system or application
• The modular architecture speeds chip development and increases product flexibility
Server: “Interlagos” - 16 cores (2 dies); “Valencia” - 8 cores (1 die)
Client: “Zambezi” - 8 cores (1 die)
[Die diagram: “Bulldozer” modules, 8MB shared L3 cache per die, NB/HT links, memory controller]
DELL/AMD CONFIDENTIAL
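The core counts on this slide follow from simple arithmetic: 4 modules per die and 8 cores per die imply 2 cores per module. The sketch below only restates the slide's figures; the names `PARTS`, `cores_per_module` and `topology` are illustrative and do not come from the presentation.

```python
# Figures taken from the slide: 4 "Bulldozer" modules per die, 8 cores per die.
MODULES_PER_DIE = 4
CORES_PER_DIE = 8  # "Valencia" and "Zambezi" are listed as 8 cores on 1 die

# Dies per package as listed on the slide (dictionary name is illustrative).
PARTS = {"Interlagos": 2, "Valencia": 1, "Zambezi": 1}


def cores_per_module() -> int:
    """Cores in one module, derived from the per-die figures."""
    return CORES_PER_DIE // MODULES_PER_DIE


def topology(dies: int) -> dict:
    """Return die, module and core counts for a package with `dies` dies."""
    modules = dies * MODULES_PER_DIE
    return {"dies": dies, "modules": modules, "cores": modules * cores_per_module()}


summary = {name: topology(dies) for name, dies in PARTS.items()}
for name, t in summary.items():
    print(f"{name}: {t['dies']} die(s), {t['modules']} modules, {t['cores']} cores")
```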
Romley EP Platform
• Sandy Bridge CPUs: up to 8 cores / socket with up to 20MB of cache
• Memory: DDR3 & DDR3L; RDIMMs, UDIMMs & LR DIMMs; 4 channels per socket, up to 3 DPC; speeds up to DDR3 1600
• PCI Express* 3.0: 40 lanes per socket; extra Gen 2 x4 on 2nd CPU
• QPI: 2 QPI links with bandwidth up to 8 GT/s
• Patsburg PCH: optimized for server & WS; integrated storage with up to 8 ports of 6Gb/s SAS, RAID 5 optional
[Block diagram: two Sandy Bridge CPUs connected by QPI, each with DDR3 memory channels and PCIe Gen 3 x8 links, plus a PCIe Gen 2 x4 link and a DMI2 link to the Patsburg PCH]
DELL/Intel CONFIDENTIAL
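The channel and lane counts above translate into peak theoretical bandwidth per socket. The following is a rough sketch, assuming the standard 64-bit (8-byte) DDR3 channel width and the PCIe 3.0 rate of 8 GT/s per lane with 128b/130b encoding; those two constants are not stated on the slide, and real sustained throughput will be lower.

```python
# From the slide: 4 DDR3 channels per socket at up to DDR3-1600,
# and 40 PCIe 3.0 lanes per socket.
DDR3_CHANNELS = 4
DDR3_MEGATRANSFERS = 1600      # million transfers per second per channel
BYTES_PER_TRANSFER = 8         # assumption: standard 64-bit DDR3 channel

PCIE3_LANES = 40
PCIE3_GT_PER_S = 8.0           # assumption: PCIe 3.0 raw rate per lane
PCIE3_ENCODING = 128 / 130     # assumption: 128b/130b line coding


def memory_bandwidth_gb_s() -> float:
    """Peak theoretical memory bandwidth per socket in GB/s."""
    return DDR3_CHANNELS * DDR3_MEGATRANSFERS * BYTES_PER_TRANSFER / 1000


def pcie_bandwidth_gb_s() -> float:
    """Peak theoretical PCIe 3.0 bandwidth per socket, per direction, in GB/s."""
    return PCIE3_LANES * PCIE3_GT_PER_S * PCIE3_ENCODING / 8


print(f"Memory: {memory_bandwidth_gb_s():.1f} GB/s per socket")
print(f"PCIe 3.0: {pcie_bandwidth_gb_s():.1f} GB/s per socket, per direction")
```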
All HS06 Tests Before 12/2011
[Chart: HEPSPEC06/system (6.00 to 606.00) vs. cores present (6, 8, 12, 16, 32, 64) for Intel Sandy Bridge and AMD Interlagos. Data labels: 79.09, 101.52, 157.33, 519.46, 48.55, 65.20, 129.61, 330.25, 531.50]
Notes:
* All tests are 32-bit, hyperthreading disabled, clock speed-up enabled
* Multiple tests on the same processor type are averaged
* 32 core AMD is 3.0 GHz, all others are 2.3 GHz
* Intel 6 & 12 core is 2.0 GHz, 8 core is 1.6 GHz, 32 core is 2.7 GHz
* 64 core Intel is a 4-socket 2.4GHz R820
Core Control
[Chart: HEPSPEC06 (0 to 600) vs. cores/threads (1, 2, 4, 8, 16, 32, 64). Data labels by series:
AMD Interlagos - Numactl Binding: 11, 21, 43, 86, 170, 337, 504
Intel Sandy Bridge - BIOS Core Downing: 0, 0, 0, 101, 175, 358, 566
Intel Sandy Bridge - Numactl Binding: 21, 27, 51, 93, 161, 316, 536
Intel Sandy Bridge - Numactl Binding HT Off: 22, 41, 74, 128, 255, 502, 0]
Notes:
- All tests used RHEL 6.2 and GCC 4.4.5
- Intel SB numbers are from an R820 with 4 x 2.4GHz 8 core engineering processors and HT enabled
- AMD Interlagos numbers come from a C6145 with 4 x 6276 2.3GHz production processors
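Restricting a benchmark run to a chosen set of cores with numactl could look roughly like the sketch below. It only builds the command line; `--physcpubind` and `--localalloc` are real numactl options, while the helper name `numactl_command` and the script `./run_benchmark.sh` are hypothetical placeholders, not the commands used for these tests.

```python
import shlex


def numactl_command(cpus, benchmark_cmd, local_alloc=True):
    """Build a numactl command line that pins a process to the given CPUs.

    cpus          -- iterable of logical CPU numbers to bind to
    benchmark_cmd -- list of program and arguments to launch (hypothetical here)
    local_alloc   -- if True, ask numactl to allocate memory on the local node
    """
    cpu_list = ",".join(str(c) for c in cpus)
    cmd = ["numactl", f"--physcpubind={cpu_list}"]
    if local_alloc:
        cmd.append("--localalloc")
    return cmd + list(benchmark_cmd)


# Example: bind a run to logical CPUs 0 through 7.
cmd = numactl_command(range(8), ["./run_benchmark.sh"])
print(shlex.join(cmd))
```

Which logical CPU numbers share a socket or a core depends on the machine, so the CPU list should come from the system's actual topology rather than from a fixed range.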
Very Cool Scalability
[Chart: speed-up (0.0 to 1.2) vs. cores/threads (1, 2, 4, 8, 16, 32, 64) for AMD Interlagos - 4 Socket Numactl Binding and Intel Sandy Bridge - 4 Socket Numactl Binding. Data labels: 101.3%, 100.4%, 100.7%, 97.1%, 97.8%, 49.6%, 28.6%, 88.9%, 82.4%, 73.1%, 96.3%, 69.6%]
Notes:
- RHEL 6.2 and GCC 4.4.5 used for all tests
- Sandy Bridge numbers are from an R820 with 4 sockets of 2.4GHz 8 core engineering processors with HT enabled
- Interlagos numbers are from a C6145 with 1 tray of 4 socket Opteron 6276 2.3GHz production processors
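One common way to express scalability of this kind is parallel efficiency: the score on n cores divided by n times the single-core score. The sketch below assumes that definition, which the slide does not spell out, and uses the rounded AMD Interlagos data labels from the "Core Control" chart as input, so it will not reproduce the percentages on this chart exactly.

```python
def parallel_efficiency(scores: dict) -> dict:
    """Map core count -> score / (cores * single-core score).

    Assumes `scores` contains an entry for 1 core to use as the baseline.
    """
    base = scores[1]
    return {n: s / (n * base) for n, s in sorted(scores.items())}


# Rounded HEPSPEC06 data labels (AMD Interlagos, numactl binding), assumed to
# correspond in order to 1, 2, 4, 8, 16, 32 and 64 cores.
amd_scores = {1: 11, 2: 21, 4: 43, 8: 86, 16: 170, 32: 337, 64: 504}

efficiency = parallel_efficiency(amd_scores)
for cores, eff in efficiency.items():
    print(f"{cores:>2} cores: {eff:.1%}")
```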
AMD C6145 Interlagos Map
Intel Sandy Bridge: Get the Map Right
The Problem is an Old One
• New x86 systems think they are SMP
• As many CPUs in 2U as an HP SuperDome in a 42U rack (circa 2004)
• One must relearn process/thread binding
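Binding starts with knowing which logical CPUs belong to which NUMA node. On Linux this map is exposed under `/sys/devices/system/node/node*/cpulist`; the sketch below parses that range format into explicit CPU numbers. The function names and the sample string are illustrative assumptions, not taken from the presentation.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list:
    """Expand a Linux cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_map(root: str = "/sys/devices/system/node") -> dict:
    """Return {node number: [logical CPUs]}; empty if the path is absent."""
    numa_map = {}
    for cpulist in Path(root).glob("node[0-9]*/cpulist"):
        node = int(cpulist.parent.name[len("node"):])
        numa_map[node] = parse_cpulist(cpulist.read_text())
    return numa_map


# Hypothetical sample string in the sysfs cpulist format.
print(parse_cpulist("0-3,8,10-11"))
for node, cpus in sorted(read_numa_map().items()):
    print(f"node {node}: {cpus}")
```

On Linux, a CPU list obtained this way could then be passed to `os.sched_setaffinity` or to `numactl --physcpubind`.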
OS Effect On HS06
[Chart: HEPSPEC06 (0 to 600) by operating system (5.5/5.7, 6.2, RHEL 7A wo tune, RHEL 7A w tune, RHEL 7A w tune & avx). Data labels by series:
Intel Westmere - C6100 w 2 x Intel X5670 2.66GHz 6C: 198, 198, 207
Intel Sandy Bridge - R820 w 2 x Intel SB 2.4GHz 8C: 291, 309, 586, 587
AMD Interlagos - C6145 w 4 x AMD 6276 2.3GHz 16C: 428, 503, 541, 548]
Notes:
- R820 with RHEL 7A and GCC 4.6.2 compiled all HEPSPEC06 benchmarks except Deal II (see whitepaper)
- C6145 with RHEL 7A and GCC 4.6.2 compiled all HEPSPEC06 benchmarks except Deal II (see whitepaper)
- The "w tune" designation refers to the linux64-cern.cfg file compiler flags being modified to include -march=bdver1 for AMD's Interlagos and -march=corei7 for Intel's Sandy Bridge
- No patching or tuning of the OS was done
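The "w tune" change amounts to appending a processor-specific `-march` option to the compiler flags. A minimal sketch of that selection follows; the two `-march` values are the ones quoted in the notes, while the dictionary keys, the helper name and the base flag `-O2` are hypothetical and do not reflect the actual contents of linux64-cern.cfg.

```python
# -march values quoted in the slide notes; the keys are illustrative labels.
MARCH_BY_CPU = {
    "interlagos": "-march=bdver1",
    "sandybridge": "-march=corei7",
}


def tuned_flags(base_flags: str, cpu: str) -> str:
    """Append the CPU-specific -march flag, or return the flags unchanged."""
    march = MARCH_BY_CPU.get(cpu)
    return f"{base_flags} {march}" if march else base_flags


BASE_FLAGS = "-O2"  # hypothetical base flags, not taken from the cfg file
for cpu in ("interlagos", "sandybridge", "westmere"):
    print(f"{cpu}: {tuned_flags(BASE_FLAGS, cpu)}")
```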
Newer OSes Vs SL 5.5/5.7
[Chart: Operating System Perform... vs SL or RHEL 5.5/5.7, percent increase (80% to 220%):
Inter RHEL 7A: 186%
SB RHEL 7A: 201%
Inter RHEL 6.2: 118%
SB RHEL 6.2: 106%
Westmere SL 6.2: 100%]
Notes:
- No tuning was performed on RHEL 7A runtimes
- The standard linux32_cern.cfg was used for all testing
- AMD Interlagos numbers are from a C6145 tray with 4 x AMD 6276 2.3 GHz 16 core processors
- Intel Sandy Bridge numbers are from an R820 with a 2.4 GHz 8 core processor
- Intel Westmere numbers are from a C6100 tray with Intel X5650 2.66 GHz 6 core processors
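The percentages here read as the newer OS score expressed relative to the SL 5.5/5.7 score. The sketch below assumes that interpretation and takes its inputs from the data labels on the "OS Effect On HS06" chart; pairing those labels with these bars is an assumption. The Interlagos RHEL 7A bar is left out because its inputs are not clear from the slides.

```python
def percent_of_baseline(new_score: float, baseline_score: float) -> float:
    """Express a score as a percentage of the baseline score."""
    return 100.0 * new_score / baseline_score


# (newer OS HS06, SL 5.5/5.7 HS06), assumed pairings of chart data labels.
cases = {
    "SB RHEL 7A": (586, 291),
    "Inter RHEL 6.2": (503, 428),
    "SB RHEL 6.2": (309, 291),
    "Westmere SL 6.2": (198, 198),
}

percentages = {name: percent_of_baseline(new, old) for name, (new, old) in cases.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:.0f}%")
```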
40,000 HS06 Target
[Chart: Systems Required, servers required (0 to 210) by system and OS:
SB R 7A w tune - 64T: 68
Inter R 7A w tune - 64C: 73
SB R 7A wo tune - 64T: 68
Inter R 7A wo tune - 64C: 74
Inter R 6.2 - 64C: 80
Inter SL 5.7 - 64C: 93
SB R 6.2 - 64T: 129
SB SL 5.7 - 64T: 137
Westmere SL 5.5/6.2 - 12C: 202]
Notes:
- The standard linux64_cern.cfg was used for SL 5.7 and RHEL 6.2
- AMD Interlagos numbers are from 4 x 2.3GHz 16c processors
- Intel Sandy Bridge numbers are from 4 x 2.4GHz 8c processors
- Intel Westmere numbers are from 2 x 2.66 GHz 6c processors
- Hyperthreading was enabled on all Intel testing
- HS06 numbers are based on total system cores/threads
- The "w tune" designation refers to the linux64-cern.cfg file compiler flags being modified to include -march=bdver1 for AMD's Interlagos and -march=corei7 for Intel's Sandy Bridge
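The server counts are consistent with dividing the 40,000 HS06 target by the per-system score and rounding to the nearest whole server. The sketch below assumes the per-system scores are the data labels from the "OS Effect On HS06" chart; that pairing and the dictionary name are assumptions.

```python
TARGET_HS06 = 40_000

# Per-system HS06 scores, assumed to match the bars of this chart.
HS06_PER_SYSTEM = {
    "SB R 7A w tune - 64T": 587,
    "Inter R 7A w tune - 64C": 548,
    "SB R 7A wo tune - 64T": 586,
    "Inter R 7A wo tune - 64C": 541,
    "Inter R 6.2 - 64C": 503,
    "Inter SL 5.7 - 64C": 428,
    "SB R 6.2 - 64T": 309,
    "SB SL 5.7 - 64T": 291,
    "Westmere SL 5.5/6.2 - 12C": 198,
}


def servers_required(hs06_per_system: float, target: float = TARGET_HS06) -> int:
    """Servers needed to reach the target, rounded to the nearest server."""
    return round(target / hs06_per_system)


required = {name: servers_required(score) for name, score in HS06_PER_SYSTEM.items()}
for name, count in required.items():
    print(f"{name}: {count}")
```

For an actual purchase, rounding up with `math.ceil` would be the safer choice, since a fractional server still has to be bought whole; that gives slightly higher counts for some rows.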
Walk Away
• Intel Sandy Bridge is Fast (Porsche GT3)
• Must learn to use Numactl to bind threads
• Expensive - Intel = $18362.22, ~1000 HS06
Intel Solution: Dell PowerEdge C6220, $18.36/HS06, 8 E5-2670 2.6GHz 8C, 128GB 1600MHz (total RAM per C6220, 2GB/core), 8 500GB drives.
• Interlagos (Volkswagen GTI)
• Must learn to use Numactl to bind threads
• For some applications you must turn off half the cores
• Cheaper - AMD = $11011.65, ~1000 HS06
AMD Solution: Dell PowerEdge C6145, $11.01/HS06, 8 6276 2.3GHz 16C, 256GB 1600MHz (total RAM per C6145, 2GB/core), 8 500GB drives.
• New Operating Systems and GCC are your friend
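The price comparison reduces to dollars per HS06. The sketch below recomputes that figure from the quoted prices and the approximate 1000 HS06 per solution, then scales it to the 40,000 HS06 target used earlier in the deck; the scaled totals are an illustration and are not figures from the presentation.

```python
TARGET_HS06 = 40_000

# Quoted price in USD and approximate HS06 per solution, from the slide.
SOLUTIONS = {
    "Intel (PowerEdge C6220)": (18362.22, 1000),
    "AMD (PowerEdge C6145)": (11011.65, 1000),
}


def cost_per_hs06(price_usd: float, hs06: float) -> float:
    """Dollars per unit of HS06."""
    return price_usd / hs06


per_hs06 = {name: cost_per_hs06(price, hs06) for name, (price, hs06) in SOLUTIONS.items()}
for name, cost in per_hs06.items():
    print(f"{name}: ${cost:.2f}/HS06, about ${cost * TARGET_HS06:,.0f} for {TARGET_HS06:,} HS06")
```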