Introduction to DICC HPC Design
TRANSCRIPT
Compute Servers
Total of 15 Compute Servers:
● 11 Dell Compute Servers
○ Acquired in 2015
○ 6 online, 5 offline
● 4 Gigabyte Compute Servers
○ Acquired in 2021
○ 4 online
What does DICC HPC currently have?
Processor:
● 4 x AMD Opteron Processor 6366HE 16C/16T 1.8GHz
For each processor:
● 16x Cores - red
● 4x L3 Cache: 8MB each, 32MB total - green
● 8x L2 Cache: 2MB each, 16MB total - yellow
● 16x L1d Cache: 16KB each, 256KB total - orange
● 8x L1i Cache: 64KB each, 512KB total
● 2x NUMA Nodes - orange dotted line
(A quick way to verify this layout on a node is sketched after this specification.)
Memory:
● 16 x 16GB DDR3 1333MHz
Network:
● 1 Gigabit Ethernet - Management
● 2x 10 Gigabit Ethernet - Compute and Storage Interconnect
[Diagrams: processor topology (cores, caches, NUMA nodes, Memory/IO die) and memory layout with 16GB RAM modules.]
Specification of Dell Compute Servers
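The core, cache, and NUMA layout listed in the specification above can be checked directly on a compute node. A minimal sketch, assuming the standard lscpu and numactl utilities are installed (the slides do not mention them):

# Minimal topology check, assuming `lscpu` and `numactl` are installed.
import subprocess

def show_topology():
    # Core count, L1d/L1i/L2/L3 cache sizes, and NUMA node count per CPU.
    print(subprocess.run(["lscpu"], capture_output=True, text=True).stdout)
    # Which cores and how much memory belong to each NUMA node.
    print(subprocess.run(["numactl", "--hardware"],
                         capture_output=True, text=True).stdout)

if __name__ == "__main__":
    show_topology()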
Processor:
● 2 x AMD EPYC 7F72 24C/48T 3.2GHz
For each processor:
● 24x Cores - red
● 12x L3 Cache: 16MB each, 192MB total - green
● 24x L2 Cache: 512KB each, 12MB total - yellow
● 24x L1d Cache: 32KB each, 768KB total - orange
● 24x L1i Cache: 32KB each, 768KB total
● 1x NUMA Node
Memory:
● 8 x 32GB DDR4 3200MHz
Network:
● 1 Gigabit Ethernet - Management
● 2x 10 Gigabit Ethernet - Compute and Storage Interconnect
● RDMA support through RoCEv2
Specification of Gigabyte Compute Servers
[Diagrams: processor topology (cores, caches, NUMA node, Memory/IO die) and memory layout with 32GB RAM modules.]
Compute Servers
Total Available Resources in DICC
Dell Opteron Servers:
● Currently Online:
○ 384 Cores
○ 1.536 TB of Memory
Gigabyte EPYC Servers:
● Currently Online:
○ 192 Cores (384 with Multithreading)
○ 1TB of Memory
CPU Benchmark - HPLinpack - Opteron
Half Node (32 CPUs) = 160.25 GFlops
Single Node Estimation = ~320 GFlops
4 Nodes Estimation = ~1.3 TFlops
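A sketch of the arithmetic behind these estimates, assuming roughly linear scaling from the measured half-node run (the 160.25 GFlops figure comes from the benchmark above; everything else is extrapolation, not an HPLinpack run):

# Linear extrapolation from the measured half-node HPLinpack result.
half_node_gflops = 160.25                          # measured on 32 of a node's 64 cores
single_node_gflops = 2 * half_node_gflops          # ~320 GFlops
four_nodes_tflops = 4 * single_node_gflops / 1000  # ~1.3 TFlops
print(f"Single node ~ {single_node_gflops:.0f} GFlops, 4 nodes ~ {four_nodes_tflops:.1f} TFlops")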
GPU Servers
What does DICC HPC currently have?
Total of 5 GPU Servers:
● 4 Online
○ GPU Nodes 1-4
● 1 In Progress
○ GPU Node 5
Available GPUs:
● 8x Nvidia Tesla K10
● 2x Nvidia Tesla K40
● 1x Nvidia Titan X
● 2x Nvidia Titan Xp
● 2x Nvidia V100 (In progress)
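To see which of these GPUs a particular node (or an allocated job) actually exposes, a minimal sketch, assuming the NVIDIA driver and the standard nvidia-smi tool are present on the node:

# List the visible GPUs and their memory, assuming `nvidia-smi` is installed.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)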
Processor:
● 2 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz 8C/16T
GPU:
● 4x Nvidia Tesla K10 8GB GDDR5
Memory:
● 6 x 8GB DDR3 1600MHz (GPU01)
● 8 x 8GB DDR3 1600MHz (GPU03)
Network:
● 1 Gigabit Ethernet - Management
● 10 Gigabit Ethernet - Storage Interconnect
Specification of GPU Node 01 and 03
Processor:
● 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz 8C/16T
GPU:
● 1x Nvidia GeForce Titan X 12GB GDDR5
● 2x Nvidia Titan Xp 12GB GDDR5X
Memory:
● 8 x 16GB DDR4 2400MHz
Network:
● 1 Gigabit Ethernet - Management
● 10 Gigabit Ethernet - Storage Interconnect
Specification of GPU Node 02
Processor:
● 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz 8C/16T
GPU:
● 2x Nvidia Tesla K40 12GB GDDR5
Memory:
● 4 x 16GB DDR4 2400MHz
Network:
● 1 Gigabit Ethernet - Management
● 10 Gigabit Ethernet - Storage Interconnect
Specification of GPU Node 04
What does DICC currently have?
Storage
Total of 3 storage systems:
● Home Directory Storage (/home)
● Lustre Directory Storage (/lustre)
● Backup Storage
Processor:
● 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz 6C/12T
Memory:
● 4 x 16GB DDR3 1333MHz
Storage:
● MegaRAID SAS 2108
● 32x 4TB SAS Drives 7200 RPM
Network:
● 1 Gigabit Ethernet - Management
● 10 Gigabit Ethernet - Storage Interconnect
Specification of NFS Server (Home Directory)
Processor:
● Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz 12C/24T
Memory:
● 4 x 16GB DDR4 2666MHz
Storage:
● SAS3008 PCI-Express Fusion-MPT SAS-3
● 40x 4TB SAS Drives 7200 RPM
Network:
● 1 Gigabit Ethernet - Management
● 10 Gigabit Ethernet - Mellanox ConnectX-4 Lx - RoCEv2 supported
Specification of Lustre Servers - LustreOSS01-02 (Lustre Directory)
Overview of DICC Storage Connection
[Diagram: Compute 01-15, GPU01-04, and the Login Node (umhpc) connect to the NFS Server and LustreOSS01-02 over 10GbE, with RoCEv2-capable links marked.]
Home Directory Storage
Home Directory (/home)
● NFSv4
● XFS File System
● RAID 10
● Usable Capacity: ~60TB
● Backed up daily
● Quota per user: 100GB
Purpose: For users to store important files and computation results.
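To check usage against the 100GB quota, something like the following can be run on the login node; this is a sketch assuming the standard Linux quota tool is enabled for /home (the slides do not say how quota reporting is exposed):

# Report home-directory usage against the quota, assuming the standard
# Linux `quota` tool is enabled for /home on the cluster.
import subprocess

# -s prints sizes in human-readable units instead of raw blocks.
print(subprocess.run(["quota", "-s"], capture_output=True, text=True).stdout)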
Lustre Directory Storage
Lustre Directory (/lustre)
● High Performance Storage
● Lustre Filesystem
● RAID 6
● 2 OSS nodes
● 8 OSTs
● Usable Capacity: ~230TB
● No backup
● No quota limit
Purpose: For users to store their raw datasets and run computation jobs.
Policy: Data that has not been accessed in the last 60 days will be removed and is not recoverable.
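A minimal sketch (not the cluster's actual purge tool) for spotting files this policy would target, assuming the purge is keyed on access time (atime) as the policy states; the scanned path is a hypothetical example:

# Walk a directory tree and list files whose access time (atime) is older
# than 60 days, i.e. candidates for the /lustre purge policy above.
import os
import time

CUTOFF = time.time() - 60 * 24 * 3600  # 60 days ago, in seconds

def stale_files(root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < CUTOFF:
                    yield path
            except OSError:
                pass  # file disappeared or is unreadable; skip it

if __name__ == "__main__":
    for path in stale_files("/lustre/example-project"):  # hypothetical path
        print(path)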
Storage
What is Lustre Filesystem?
Lustre is a type of parallel file system, generally used for large-scale cluster computing.
Lustre default file striping count and size
Storage
Default object count = 1
Default striping size = 1MB
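These defaults can be inspected and overridden per file or directory with the standard Lustre lfs tool. A minimal sketch (the directory path is a hypothetical example, not a real DICC path):

# Inspect and change Lustre striping with the standard `lfs` tool.
import subprocess

target = "/lustre/example-project"  # hypothetical directory

# Show the current stripe (object) count and stripe size.
subprocess.run(["lfs", "getstripe", target], check=True)

# New files created in this directory will be striped across 4 objects (OSTs)
# with a 1MB stripe size, instead of the single-object default.
subprocess.run(["lfs", "setstripe", "-c", "4", "-S", "1m", target], check=True)

Striping large files across several OSTs is what lets multiple OSS nodes serve a single file in parallel; small files are usually left at the default single object.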
Storage Benchmark Test
Benchmark Test details:
● Using the IOR package
● Running on cpu12-15
● 8 processes per node
Use Case 1 - Multiple Nodes, Single File
● Single 64GB file
● Block size - 1MB
● NFS
● Lustre - 1, 2, 4, 8 Objects
Use Case 2 - Multiple Nodes, Multiple Files
● 32x 2GB files
● Block size - 1MB
● NFS
● Lustre
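The slides do not give the exact commands used; as an illustration, a run resembling Use Case 2 could be launched with IOR under MPI roughly as follows (process counts, flags, and paths are assumptions):

# Launch an IOR run over MPI: one file per process, 2GB per process,
# 1MB transfers. Flags and paths are assumptions, not the slides' commands.
import subprocess

cmd = [
    "mpirun", "-np", "32",            # e.g. 4 nodes x 8 processes per node
    "ior",
    "-a", "POSIX",                    # POSIX I/O backend
    "-w", "-r",                       # run the write phase, then the read phase
    "-F",                             # file per process (multi-file use case)
    "-b", "2g",                       # block size: 2GB written per process
    "-t", "1m",                       # transfer size: 1MB per I/O call
    "-o", "/lustre/example/ior-test", # hypothetical output path
]
subprocess.run(cmd, check=True)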
Currently supported:
● 10GbE
● 10GbE with RoCEv2 support using Mellanox ConnectX-4 Lx
Network
What does DICC currently have?
Overview of DICC Network
[Diagram: DICC network topology - Compute 01-15 on 10GbE Compute links (RoCEv2 where supported), and GPU01-04, LustreOSS01-02, the NFS Server, and the Login Node (umhpc) on 10GbE Storage links (RoCEv2 where supported).]
Workload Manager
SLURM - Simple Linux Utility for Resource Management
Provides three key functions:
● allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
● providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
● arbitrating contention for resources by managing a queue of pending jobs.
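A minimal job script sketch showing how work is typically handed to SLURM; the resource values are placeholders, not DICC defaults, and no partition is specified because the slides do not name any:

#!/usr/bin/env python3
# Minimal SLURM batch script sketch. SLURM reads the #SBATCH directives below
# regardless of the interpreter; resource values here are placeholders.
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=00:30:00

import socket

# The "work": just report where SLURM placed the job.
print(f"Running on {socket.gethostname()}")

Submitted with sbatch job.py; squeue then shows it waiting in the queue, which is the third function above in action.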
Thank you
If you have any further questions, please feel free to drop me an email:
Or log in to our service desk and create a request.
This will be my last presentation in DICC as I will be leaving UM at the end of the month.
If you are interested in joining DICC, please feel free to drop an email to my supervisor:
[email protected] OR [email protected]
Please note that there will be another two training sessions tomorrow, 2nd September 2021.