
Page 1:

The BioHPC Nucleus Cluster & Future Developments


Page 2:

Today we’ll talk about the BioHPC Nucleus HPC cluster – with some technical details for those interested!

How is it designed?

What hardware does it use?

How does this affect the work I need to run?

Future Plans (2017 cluster upgrade and more!)

Overview


Page 3:

HPC Clusters


HPC clusters consist of 3 major components

Compute Nodes

• Powerful servers that run your jobs

• Some also contain GPU cards

High-Speed Network

• Transfers data to/from compute nodes

• Carries communication for parallel code

High-Speed, High Capacity Storage

• Terabytes of storage for your research data

• 10s of GB per second bandwidth to feed nodes


Page 4:

Balance in Clusters


Some clusters need much more storage than compute: data-intensive tasks (e.g. next-generation sequencing)

Some clusters need very little storage, but a lot of compute: compute-intensive tasks (e.g. physical process modelling)

Some clusters don’t need very high-performance networking: embarrassingly parallel tasks (no communication between tasks)

Best solution depends on the workload of users

Nucleus is a balanced, general purpose cluster

Slight bias toward storage – more storage than a typical HPC system of its size

Page 5:

Compared to your PC


Combined, the cluster is between 1,000 and 8,000x faster/larger than a typical PC

Compute Nodes

• 8500 cores (~2,000x Desktop)

• 45TB RAM (~5,000x Desktop)

High-Speed Network

• 5.5Tbps Throughput (~5,000x Desktop)

High-Speed, High Capacity Storage

• >8PB Storage (~8000x Desktop)

• 90GB/s Throughput (~1000x Desktop)

Typical desktop PC: 4 cores, 8GB RAM, 1TB HDD

openclipart.org - https://openclipart.org/detail/17924/computer

Page 6:

Nucleus has 196 Compute Nodes

Based on standard servers used by businesses:

Lots of CPU cores: 32, 48 or 56 logical per server

each physical core has 2 logical cores

Lots of RAM: 128, 256, or 384GB per server

Differences from business servers:

Very little local storage

Keep things on central storage systems

High Speed Infiniband Network

Much faster than normal business networking

Compute Nodes


It’s possible to buy individual machines that are much faster… but this is the sweet spot for the price-performance of a cluster of machines.

Page 7:

We add nodes often, and buy newer, faster machines when they become available:

24 × 128GB nodes – oldest – Xeon E5 – 32 logical cores

78 × 256GB nodes – most nodes – Xeon E5 v3 – 48 logical cores

48 × 256GBv1 nodes – fastest CPU nodes – Xeon E5 v4 – 56 logical cores

2 × 384GB nodes – largest RAM – Xeon E5/v2 – 32/40 logical cores

Compute Nodes - Types


Newer nodes have more cores

Can be much faster if your work can use the extra cores

Also have newer numerical features – can speed up linear algebra a lot


Page 8:

Most jobs that users run don’t use compute nodes fully.

56 cores is a lot to fill up. It might be slower to split a task into 56 parts due to overhead.

Combine smaller jobs – run programs in parallel on fewer nodes (see the sketch at the end of this slide)

Watch out for RAM usage – 256GB / 56 cores is ~4.5GB per core.

You might need to run fewer than 56 tasks at once

How Does this Affect Me – Cores and RAM?

Herzeel et al., Performance Analysis of BWA Alignment, ExaScience Life Lab: http://www.exascience.com/wp-content/uploads/2013/12/Herzeel-BWAReport.pdf
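
As a concrete illustration, here is a minimal sketch of packing several smaller tasks onto one node – the partition name and program are placeholders, so check the BioHPC guides for the real ones:

#!/bin/bash
#SBATCH --partition=256GB   # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --time=04:00:00

# Run 8 independent tasks in parallel on one node
# (8 tasks x ~4.5GB each stays comfortably under 256GB RAM)
for i in $(seq 1 8); do
    ./my_analysis input_${i}.dat > output_${i}.log &   # my_analysis is a placeholder
done
wait   # block until all background tasks finish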

Page 9:

Older nodes are often less busy – shorter waits.

If your code is not specifically optimized for new CPUs, the older nodes are often not much slower.

E.g. the newest 256GBv1 node is often only ~25% faster than the oldest 128GB node on code that is not specifically optimized for many cores and newer CPU numerical features.

Running a test ChIP-Seq workflow (minimal 385MB test dataset):

32 Cores AVX Xeon E5 (v1) 128GB - 255s

56 Cores AVX2 Xeon E5 v4 256GBv1 - 194s

How Does this Affect Me? – CPU types


75% more cores, 24% speedup

astrocyte_example_chipseq workflow, run on a single node

Page 10:

Optimized numerical code will benefit from new CPUs – but you must compile it for specific machines

To compile for a specific machine (fastest possible binaries) use:

GNU gcc: -march=native

Intel icc: -xHost
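
For example, a minimal sketch of machine-specific compilation – the source file name is a placeholder:

# GNU compiler: optimize for the CPU of the machine you compile on
gcc -O3 -march=native -o matmul matmul.c

# Intel compiler: same idea, using -xHost
icc -O3 -xHost -o matmul matmul.c

One caveat: a binary built with -march=native on a new AVX2 node may fail with an illegal-instruction error on an older AVX-only node, so build on the oldest node type you intend to use, or keep separate binaries.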

4096×4096 element matrix multiplication benchmark:

32 Cores AVX Xeon E5 (v1) 128GB - 507ms

56 Cores AVX2 Xeon E5 v4 256GBv1 - 168ms

How Does this Affect Me? – CPU types


75% more cores, 3x speedup

MKL sgemm – mean time across 1000 replicate computations. Intel 2016 compiler, -xHost -O3 options for machine-specific optimization

Page 11:

Approx. 300 nodes will soon be added to the cluster (from TACC Stampede)

32GB RAM, 32 logical cores, similar to existing 128GB nodes

Ideal for smaller RAM, interactive jobs. Will improve immediate availability of sessions.

New Nodes for Low Memory Tasks – Coming Soon


Page 12:

Nucleus has 20 GPU Compute Nodes

Single or Multiple GPUs

GPU – NVIDIA Tesla K20 or K40

GPUv1 – Dual NVIDIA Tesla P100

Differences vs Consumer GPUs

Double Precision Arithmetic Performance

Can be important for high accuracy work

Reliability and Stability

GPU Nodes


On well-suited tasks, 2x P100 GPUs can be 20x faster than using 56 CPU cores

Page 13:

New Dual P100 nodes are much faster than K40 nodes for GPU compute intensive software

Relion CryoEM Classification

Dual P100 approx. >6x faster than single K40

Speed-up on small benchmark limited by CPU initialization step

TensorFlow AlexNet Benchmark

Dual P100 approx. 4.3x faster than single K40

If you are using heavy GPU compute, the new GPUv1 nodes should be preferred

Make sure your application can use, and is set to use, both GPU cards (see the check below)!

Relion & Tensorflow Benchmarking

Image: GFDL, https://commons.wikimedia.org/w/index.php?curid=11356884
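
As a quick sanity check, a sketch of confirming that a job on a GPUv1 node actually sees both P100 cards:

# List GPUs visible to this job; expect two Tesla P100 entries
nvidia-smi -L

# Many frameworks honour CUDA_VISIBLE_DEVICES; make sure it isn't limiting you to one card
echo "GPUs restricted to: ${CUDA_VISIBLE_DEVICES:-(not set - all visible)}"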

Page 14:

Older K20 and K40 nodes are still ideal for:

3D Visualization – very good 3D rendering performance

Programs with limited GPU support (can use only 1 GPU, or little of the code is GPU-optimized)

Please use them when they are appropriate, so P100 nodes are available for heavy computation

K20 & K40 GPU Nodes Still Very Useful!


Page 15:

We use a normal Ethernet network to manage the nodes

Just like the network connected to your desktop – 1Gbps

~125us latency for messages

Your Jobs on Nucleus use the high-speed Infiniband network.

56Gbps connection per node

2:1 blocking – each node guaranteed at least 28Gbps

~0.7us latency for messages

Supports RDMA - Remote Direct Memory Access

Transfers data between the RAM of nodes, without using the CPU

High Speed Network - Infiniband

Image: David Monniaux, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1587748

Page 16:

Nodes have 2 network addresses

192.168.54.x - 1Gbps Ethernet

10.10.10.x - 56Gbps Infiniband

Storage traffic and MPI traffic are set up to use the fast Infiniband network.

Sometimes parallel programs (non-MPI) try to use the first network interface (1GbE)

You must tell them to use Infiniband or things will be slow! (see the sketch below)

How Does this Affect Me? - Infiniband
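
As one illustration, a sketch of pointing a non-MPI parallel program at the Infiniband interface – ib0 is the typical interface name but worth confirming on the node, and the server program and its flag are placeholders:

# Show the node's Infiniband interface; the 10.10.10.x address is the fast network
ip addr show ib0

# Extract the Infiniband IP and hand it to the program instead of the default hostname
IB_IP=$(ip -4 -o addr show ib0 | awk '{print $4}' | cut -d/ -f1)
./my_parallel_server --bind-address ${IB_IP}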


Page 17:

Storage Systems


We use 2 main high-performance storage systems, plus others for lower-speed tasks

They use large hard drives – to give a lot of capacity per $ for your data

A single hard drive in your desktop/laptop is slow

Lots of hard drives (100s) working together can be very quick!

Page 18:

DDN SFA12K ExaScaler Lustre System

420 6TB drives in 40 disk pools

4 IO Servers, 2 Metadata Servers

Redundancy in pools gives 1.7PB usable space

Each pool provides up to 1GB/s throughput

Total max throughput ~30GB/s

Connected to cluster Infiniband Network

Project Storage System


Page 19:

IBM Elastic Storage Server GL6

(SpectrumScale/GPFS)

712 8TB drives, and 4 IO Servers

Redundancy in pools, gives 3.4 PB usable space

Total Max throughput ~20GB/s*

* Limited by network

Located in Clements University Hospital

Connected to cluster Infiniband network with 4 pairs of fiber under Harry Hines Boulevard

Work Storage System


Page 20:

Metadata is the information about a file or directory

Name, dates, permissions, location of data

Data is what’s actually stored in your files

On a single PC the data and metadata stay together on the disk

On HPC storage we spread the data out over disk pools, so we can have fast parallel access for reads and writes

Metadata has to be kept separately, and served to clients separately

HPC filesystems have huge numbers of files = a lot of metadata to manage

This is a difficult problem when many clients could be using the same files

Data & Metadata


Page 21:

HPC storage systems read and write data very quickly

HPC storage systems handle metadata slowly

Slow operations:

Creating, deleting many files and folders

Getting information about directories containing 1000s of files

When writing code and workflows, prefer large files instead of many small files (see the sketch at the end of this slide).

Use image stacks, instead of 1000s of individual TIFF files

Use archives (tar, zip) to store small files you aren’t working with

Use node /local and /tmp space for large numbers of very small files

Data & Metadata – How Does this Affect Me?
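
As a small illustration, a sketch of keeping small-file churn off the parallel filesystems – the /project path is a placeholder in the same style as the striping example later:

# Pack thousands of small files into a single archive on project storage
tar -czf /project/department/myuser/small_files.tar.gz small_files/

# Unpack and work on tiny intermediate files in node-local space instead
cd /tmp
tar -xzf /project/department/myuser/small_files.tar.gz
# ... run tools against the extracted files here, then archive results back ...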


Page 22:

On HPC filesystems the data for a file is striped across the disk pools to achieve high speed.

/work does this for you automatically

/project does not stripe by default. You need to stripe very large files to get the best speeds

File Striping


Stripe count guidelines:

1 – Default; any file that doesn’t fit the criteria below. Don’t stripe small files!

2 – Moderate-size files (2-10GB) read by 1-2 concurrent processes

4 – Moderate-size files (2-10GB) read by 3+ concurrent processes regularly; large files (10GB+) read by 1-2 concurrent processes

8 – Large files (10GB+) read by 3+ concurrent processes regularly; any very large files (200GB+, to balance storage target usage)

https://portal.biohpc.swmed.edu/content/guides/storage-cheat-cheet/

lfs setstripe -c 4 /project/department/myuser/bigfiles
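
Striping set on a directory applies only to files created in it afterwards. As a quick check, lfs getstripe shows the layout a directory will hand out – the path is the same placeholder as above:

# Verify the default stripe count that new files in this directory will inherit
lfs getstripe -d /project/department/myuser/bigfiles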

Page 23:

Future Plans…


Nucleus is an excellent resource, built up over the past 4 years thanks to our contributing departments

BioHPC is focused on an exciting future with new ways to use Nucleus, to advance your research

Page 24:

A newer version of Linux (we currently run Red Hat EL 6)

Improved security, usability and compatibility with newer software

Popular software that will work (again)

Google Chrome

Visual Studio Code

Atom Editor

Update to Red Hat EL 7


Page 25:

More graphical tools:

Modern desktop environment, web browsers, office suite, editors

OpenGL (3D) support on all compute nodes, not just GPU nodes

Use simple 3D software in the web GUI on any machine

Full-featured Interactive Sessions


Page 26:

Containers - Singularity


Singularity will allow containers to be run on BioHPC

Supports Docker containers

Supports containers using GPUs

Use software in a different environment, e.g. Ubuntu Linux

Direct access to 3045 tools from the biocontainers project

Integrate with Astrocyte/Nextflow for reproducible workflows

http://singularity.lbl.gov/
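
As a taste of what this will enable, a hedged sketch of typical Singularity usage – the module name and container image are illustrative, not the confirmed BioHPC setup:

# Load Singularity (module name is an assumption)
module load singularity

# Run a command inside an Ubuntu container pulled from Docker Hub
singularity exec docker://ubuntu:16.04 cat /etc/os-release

# --nv exposes the host NVIDIA driver and GPUs inside the container
singularity exec --nv docker://ubuntu:16.04 nvidia-smi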

Page 27:

Approx. 300 nodes will soon be added to the cluster (from TACC Stampede)

32GB RAM, 32 logical cores, similar to existing 128GB nodes

Ideal for smaller RAM, interactive jobs. Will improve immediate availability of sessions.

New Nodes for Low Memory Tasks


Page 28:

Xeon Phi (Knights Corner)


Nodes from Stampede have Xeon Phi (Knights Corner) coprocessors

61 Cores, 8GB RAM

3x faster than CPUs for numerical work

Run standard code, unlike GPUs

Can be used to speed up compute intensive, highly parallel code

We will add a function to the portal to help launch code on the Xeon Phi MICs (see the sketch below)
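
For the curious, a sketch of what native Xeon Phi compilation looks like with the Intel compiler – how BioHPC will expose the MICs is still being worked out, so treat this as illustrative only:

# Cross-compile for the Knights Corner coprocessor
icc -mmic -qopenmp -O3 -o mycode.mic mycode.c

# Copy the binary to the coprocessor and run it natively (mic0 is the usual device name)
scp mycode.mic mic0:
ssh mic0 ./mycode.mic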

Page 29:

On the new Nucleus cluster, you can launch an NVIDIA DIGITS session from the BioHPC Portal.

DIGITS provides an easy-to-use, web-browser interface to deep learning tools.

Easy to define models. Create and execute multiple runs, using GPU computation.

Deep Learning with NVIDIA DIGITS – Coming January


Coming Soon!

Page 30:

Start & connect to dedicated Python, R, and DIGITS environments

Directly from the BioHPC Portal

Portal DIGITS, RStudio & Jupyter – Coming 2018


Page 31:


Astrocyte will become a gateway to use resources beyond Nucleus

Distributed Computing, on Campus or in the Cloud - Planned


Diagram: the Nucleus cluster linked to a 3rd-party cloud, the BioHPC cloud (~500 cores), and workstations/thin clients

Page 32:

Workflow Designer – Alpha version November


Choose tools to create a workflow in your web browser

Run analyses and share workflows with your lab, or more widely

Page 33:

Workflow Visualization & Interactivity - Planned


Downstream visualization of workflow results with interactive tools

NGS Visualization apps

Clinical / Microscopy

Page 34:

It’s Your Cluster!


Nucleus was built with your department contributions

BioHPC is here to help you do your research

What works well? What do you need?

Let us know!

[email protected]

Microsoft Teams: BioHPC General