Minerva User Group 2018

Jan 18, 2018

Patricia Kovatch
Bhupender Thakur, PhD
Francesca Tartaglione, MS
Dansha Jiang, PhD
Eugene Fluder, PhD
Hyung Min Cho, PhD
Lili Gai, PhD


Outline

Welcome and general comments

● 2017 Accomplishments
● 2017 Minerva usage
● 2017 outages and known issues
● Survey results and discussion

Q&A

● Road map for 2018
    ○ Compute and storage upgrade
    ○ OS upgrade and package rebuild
    ○ Demeter data science cluster
    ○ Cloud services such as VM, Spark and containers (Docker/Shifter)
    ○ Documentation and tutorial sessions

Q&A; Floor walk


Welcome and general comments


2017 Accomplishments

● GPFS upgrade Upgraded Minerva file system from release 3.5 to 4.1 which introduced improvements and bug fixes.

● IBM ESS storage installation and data migration Installed 6 PB of IBM Elastic Storage Server to provide additional and faster storage to Minerva users and started users’ data migration to the new pool.

● TSM upgrade Successfully upgraded TSM software from v6.1 to v8.1, as well as tape library and tape drives firmware of our archive storage system for better performance and reliability.

● New database server deployment and data migration Installed and configured a new database server (7.3 SSD disk space, CentOS 7.3, MariaDB v10.1). Migrated the users and their data from the old system, which was reaching EOL.

● New compute nodes including high-mem node and GPU nodes Installed and configured a new high-mem node (16 CPU @ 3.2 GHz, 1.7 TB memory); it has been job tested and is now in production. Installed and configured a new GPU node equipped with 4 × P100 Nvidia GPUs; it has been job tested and will open to users in 01/2018.

2017 Accomplishments - continued

● Minerva OS upgrade Started the preparations for the cluster operating system update from CentOS 6.3 to CentOS 7 (OS image setup, packages rebuild, testing).

● Consulting services Provided consulting for users’ private node purchases.

● User support Continued to support Minerva users through the ticketing system (closed more than 2,304 tickets in 2017) and in-person meetings.

● Packages installation Installed more than 100 packages to satisfy users’ requests and needs (1,299 total packages and growing).

● Accomplished a new round of allocations Allocated 228 users’ projects for a total of 2.6 PB on BODE and 3.7 PB on ESS storage.

● New storage purchase New Flash file system (260 TB) for metadata and small files. New ESS file system (3.7 PB) to be added to the existing storage.

● Collaboration accounts for groups

2017 Minerva Usage


2017 Minerva usage summary

Storage
    High-speed storage used: 5.1 PB (77% utilization)
    Archival storage used: 6.8 PB (13.7 PB total including offsite copy)

Compute
    Number of jobs run: 28,672,579
    CPU-hours utilized: 61,208,294 hours

Accounts
    Number of new users: 422
    Number of active users in 2017: 767
    Number of total users: 1,735 (600 external users)
    Number of project groups: 267

System
    Number of maintenance sessions: 3 planned / 4 unplanned (99% uptime)

Jobs and CPU-hours breakdown by compute resource:

Compute           # Jobs        CPU-hours     Utilization
Manda             14,709,127    28,183,931    88%
Mothra             8,369,565    11,467,197    93%
BODE               8,994,398     7,636,607    79%
Hi-memory node           746        22,215
GPU nodes              2,596       999,166
Total:            28,672,579    61,208,294    91%

20% increase in compute cycles used compared to 2016

CPU-hours breakdown by project

[Chart: CPU-hours by project. Labeled projects: Gabriel Hoffman, Marta Filizola, Hardik Shah, Eimear Kenny, Avner Schlessinger, Rui Chang, Gaurav Pandey.]

Job Mix


What Minerva’s storage processes on a daily basis: # of reads and writes

High-speed storage (Orga)

Storage                        Size     Reads/day (count)   Writes/day (count)   Reads/day (volume)   Writes/day (volume)
FLASH (Small File/Metadata)    150 TB   ~1.05 billion       ~0.6 billion         -                    -
Total (GSS+DDN+FLASH)          8 PB     ~1.8 billion        ~1 billion           ~500 TB              ~175 TB
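
To put the daily totals on a per-second scale, the short sketch below divides each figure by the number of seconds in a day. It assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly over 24 hours, which real workloads will not follow; the numbers are the approximate totals from the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_total: float) -> float:
    """Average rate per second for a quantity measured per day."""
    return daily_total / SECONDS_PER_DAY


# Approximate daily totals for the whole file system (GSS+DDN+FLASH).
# Volumes assume decimal terabytes (1 TB = 1e12 bytes).
daily_totals = {
    "read operations": 1.8e9,
    "write operations": 1.0e9,
    "bytes read": 500e12,
    "bytes written": 175e12,
}

for label, total in daily_totals.items():
    print(f"{label}: {per_second(total):,.0f} per second on average")
```

Under these assumptions the file system averages roughly 20,800 reads per second and about 5.8 GB read per second.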

Storage usage breakdown by research project

Total: 5.1 PB used by 1,452,320,208 files from 267 projects.

● Alison Goate, LOAD: 30,060,193 files
● Milind Mahajan, Production Bioinformatics Group: 22,532,449 files
● Gabriel Hoffman, Common Mind Psychiatry: 170,553,026 files
● Bin Zhang, Adineto: 52,056,583 files
● Hardik Shah (Robert Sebra), Pacbio SmrtPortal, Genetics and Genomic Sciences: 26,342,148 files
● Bin Zhang, AMPADWGS: 131,182 files
● Michael Marin, MMAAAS: 344,530 files
● Lisa Edelman, CCSQXT: 19,398,482 files
● Gabriel Hoffman, Psychiatry: 19,167,326 files
● Milind Mahajan, Genomics Core Facility: 208,067,608 files


Archive storage

Current archive storage usage
    Archived data: 6.85 PB (270,461,256 files)
    Total data with offsite copy: 13.71 PB
    Number of tapes used: 8,673

Statistics of 2017
    Archived data in 2017: 2,834 TB
    Retrieved data: 7.5%
    # of archive operations: 23,065
    # of retrieve operations: 7,917
    # of archive users: 106
    # of retrieve users: 63

Archival storage occupancy by users


2017 Outages and Known issues


2017 Minerva Outages

Date        Type       Duration  Summary
04/26/2017  Unplanned  ~6 h      GPFS issue, long waiters and file system not accessible
08/02/2017  Unplanned  ~4 h      ESS recovery groups lost causing long waiters and file system issues
08/05/2017  Unplanned  ~4 h      ESS node communication problems causing long waiters and file system issues
08/08/2017  Unplanned  ~3 h      Two ESS nodes went arbitrating and several servers in unknown state
08/22/2017  Planned    ~8 h      The new version of GPFS (4.1.1-15) was installed on all Minerva nodes and storage servers
09/19/2017  Planned    ~8 h      Upgraded GPFS on GSS, worked on GOLD and DB1
11/07/2017  Planned    ~8 h      Generated new keys for GPFS, tested ESS system on orga compute nodes, upgraded minerva2, added fixes for LSF
11/06/2017  Planned    ~7 days   TSM upgrade, including firmware upgrade on the tape library, OS upgrade on hpctsm1, and TSM upgrade on both server nodes and client nodes

Total: 99% uptime
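
The slides do not say how the 99% figure was calculated. As a rough cross-check, the sketch below adds up the approximate durations listed for 2017 and converts them to an uptime percentage. Treating the ~7 day TSM upgrade as affecting only the archive system, and therefore excluding it from cluster downtime, is an assumption made here for illustration.

```python
HOURS_IN_2017 = 365 * 24  # 8,760 hours

# (start date, kind, approximate duration in hours) as listed for 2017.
outages = [
    ("04/26/2017", "unplanned", 6),
    ("08/02/2017", "unplanned", 4),
    ("08/05/2017", "unplanned", 4),
    ("08/08/2017", "unplanned", 3),
    ("08/22/2017", "planned", 8),
    ("09/19/2017", "planned", 8),
    ("11/07/2017", "planned", 8),
    ("11/06/2017", "planned", 7 * 24),  # TSM (archive) upgrade, ~7 days
]


def uptime_percent(events, hours_in_period=HOURS_IN_2017):
    """Percentage of the period not covered by the listed events."""
    downtime = sum(hours for _, _, hours in events)
    return 100.0 * (1 - downtime / hours_in_period)


# Assumption: the TSM upgrade did not take the compute cluster offline.
cluster_events = [event for event in outages if event[0] != "11/06/2017"]

print(f"All listed events:     {uptime_percent(outages):.2f}% uptime")
print(f"Excluding TSM upgrade: {uptime_percent(cluster_events):.2f}% uptime")
```

With the TSM upgrade excluded, the listed events total 41 hours, which is consistent with the reported 99% uptime.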


LSF Job Scheduler Overload in Dec 2017

- Caused by certain pipelines that submitted tens of thousands of short (~minutes) jobs and queried LSF too often.

"Batch system concurrent query limit exceeded ... retrying in 1 second(s)."

- Resolved by communicating with users to understand and optimize their scripts. We also put a limit on a couple of users’ jobs.
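
One common way to reduce scheduler load from pipelines like these is to pack many short tasks into fewer, longer jobs. The sketch below is an illustration only, not the fix the HPC team applied: the helper names, the `./process_sample.sh` command and the batch size of 500 are all hypothetical. It only groups shell commands into batch scripts and does not call LSF.

```python
from typing import Iterable, Iterator, List


def chunk(commands: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size commands."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for command in commands:
        batch.append(command)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover commands that did not fill a whole batch
        yield batch


def batch_script(commands: List[str]) -> str:
    """Build one shell script that runs the commands one after another."""
    # "set -e" stops the script at the first failing command.
    lines = ["#!/bin/bash", "set -e"] + commands
    return "\n".join(lines) + "\n"


# Hypothetical workload: 25,000 short tasks, each a few minutes long.
commands = [f"./process_sample.sh sample_{i}" for i in range(25_000)]
scripts = [batch_script(batch) for batch in chunk(commands, batch_size=500)]

print(f"{len(commands)} tasks packed into {len(scripts)} batch scripts")
```

Each generated script could then be submitted as a single job, so the scheduler would track 50 jobs instead of 25,000 in this example.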


2017 Survey Results


Survey results and discussion

We asked four questions:

Q1: Overall, how satisfied are you with queue structure, compute and storage resources?

Q2: Please rate current software environment (packages and services such as database, web, container etc):

Q3: Please rate your satisfaction with operations (documentation, ticket system, responsiveness of staff,...):

Q4: General suggestions for service improvement.

We received 42 responses and 52 comments.


Q1: Overall, how satisfied are you with queue structure, compute and storage resources?

Comments:

● Short of compute resources: long waiting time for jobs to start; queues are crowded.
● Storage issues: not enough space in Scratch; would like a longer time before deletion; file system not stable.
● Better queue support: compute resources can be occupied by a single group; separate queues are wanted for many small jobs and for high-CPU jobs.
● Memory usage exceeded on login nodes; need more interactive nodes.

Q2: Please rate current software environment (packages and services such as database, web, container etc):

Comments:

● Missing packages are installed quickly, but installed packages need to be kept up to date.
● OS upgrade with services such as containers, VM, Hadoop, Spark and HDFS.

Q3: Please rate your satisfaction with operations (documentation, ticket system, responsiveness of staff,...):

Comments:

● Lack of documentation and examples; existing documentation is outdated.
● Training for new users.

Summary

Thank you for your feedback!

Actions we took and/or are taking:

● Upgrading storage and compute nodes.
● Upgrading the OS and rebuilding packages.
● Deploying the Demeter data science cluster for the community.
● Making cloud services available such as VMs, Spark and containers (Docker/Shifter).
● Updating the documentation and increasing the number of tutorial sessions.

Please continue to provide feedback at any time via our ticketing system or talk to us directly.


2018 Minerva Roadmap


Drivers for our 2018 Roadmap

Hardware reaching end of life (no vendor will support it anymore)

● Storage components
    ○ Flash (stores all metadata for the file system) 50 TB, EOL in Q1.
    ○ Ramsan (stores tiny files which see frequent use) 100 TB, EOL in Q1.
    ○ GSS (the default location for all data; sees the highest use) 2.9 PB, unsupported in Q1.
    ○ DDN10k (the oldest and the smallest storage tier) 1.2 PB, EOL in Q1.
● Compute
    ○ Manda compute nodes, EOL in Q4.

Outdated OS/software stack

● Minerva OS and base packages
    ○ CentOS 6.4 is out of support.
    ○ Older base compilers and libraries block newer packages.
    ○ Newer MPI/OFED stack will require dependent packages to be rebuilt.

Feedback from user survey


Storage and file system upgrade plan

[Diagram. Client side: the Manda and Mothra/Bode partitions, each with login nodes, management nodes and compute nodes, attached to a 1G Ethernet network and an Infiniband (IB) network. File system side (Orga): GSS data pool (3 PB), DDN 12K data pool (4 PB), DDN 10K data pool (1.5 PB), Flash data pool (160 TB) and ESS data pool (6 PB), with a "Data Transfer" label. Caption: GPFS upgrade from v3.5 to v4.2.]

Current file system upgrade status

Storage upgrade completed in 2017:

After the last town hall we set up a separate GPFS cluster to:

● Identify with the vendors (IBM and Mellanox) what went wrong during the first integration.
● Solve the remaining issues (mainly related to the Infiniband network).
● Extensively test and stress the ESS file system to be sure the new storage is stable.

Currently:

● We gradually integrated the ESS storage into Orga.
● We started the data migration from the DDN10k (~790 TB) and GSS (~2.2 PB) to the ESS pool (currently migrated ~27% of data).

In future:

● Complete the current file-system upgrade (ETA 2018/03).
    ○ Remove GSS and DDN10k from Orga after data migration completion.
    ○ Upgrade client cluster and file system cluster to GPFS v4.2.


New storage purchase and integration

In Dec 2017, we purchased:

● New Flash file system 260 TB for metadata and small files.
● New ESS file system 3.7 PB to be added to the existing storage.

Integration schedule:

● We will work on a plan with IBM to integrate the new ESS storage and replace the metadata tier.
● ETA: April-May 2018 (need to coordinate with IBM).

Storage tier after the upgrade                                     Size
NewFlash (stores all metadata for the file system)                 132 TB
NewRamsan (stores tiny files which see frequent use)               132 TB
ESS (IBM storage where the current data is being migrated to)      5.1 PB
NewESS                                                             3.7 PB
Total storage available after the upgrade:                         8.8 PB
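
As a quick check on the tier sizes, the sketch below sums them in terabytes. It assumes decimal units (1 PB = 1,000 TB); grouping ESS and NewESS as the data tiers behind the 8.8 PB total is an inference from the arithmetic, not something the slide states.

```python
TB_PER_PB = 1000  # assumption: decimal units

# Tier sizes after the upgrade, in TB.
tiers_tb = {
    "NewFlash": 132,
    "NewRamsan": 132,
    "ESS": 5100,     # 5.1 PB
    "NewESS": 3700,  # 3.7 PB
}

# Assumption: only the two ESS tiers count toward the quoted total.
data_tiers = ("ESS", "NewESS")
data_total_pb = sum(tiers_tb[name] for name in data_tiers) / TB_PER_PB
flash_total_tb = tiers_tb["NewFlash"] + tiers_tb["NewRamsan"]

print(f"Data tiers (ESS + NewESS): {data_total_pb} PB")
print(f"Flash tiers (NewFlash + NewRamsan): {flash_total_tb} TB")
```

The two ESS tiers add up to the quoted 8.8 PB, and the two flash tiers add up to 264 TB, close to the 260 TB Flash purchase.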


Compute upgrades for 2018

Login nodes

● New set of public login nodes to replace overloaded minerva2.

Compute upgrade

● Newer racks of high density nodes to replace Manda compute partition. We will tailor the nodes based on usage (more memory? more cores? more GPUs? …)

Infrastructure upgrade:

● New management and infrastructure nodes.
● New network switches.
● Dedicated data transfer nodes.
● Additional VMs for web services.
● A migration path to EDR+ Infiniband fabric.

If you have special requirements for compute resources, please let us know; we are happy to work with you.


OS upgrade and package rebuild

● A CentOS 7 image is being tested on new login nodes: data2. A new set of load-balanced login nodes will be made available to users for CentOS 7 testing.

● We are currently rebuilding the software packages on a test cluster, which we will open for early user testing in 02/2018.

● The new pool of compute nodes (due this year) will be installed with CentOS 7 and newer packages.

● We will update the rest of the compute partitions, i.e., Manda, Mothra and BODE, after the integration of the new compute nodes.

We need your help to test the new OS and packages!


Demeter data science cluster

● The Demeter cluster has been run by Hammerbacher Lab and is being transferred to HPC.
● It is a Hadoop cluster with 80+ nodes, 3 PB storage space as an HDFS file system.
● The Demeter cluster is not under maintenance from any vendor, so when compute nodes or storage fail, they cannot be replaced.

We are in the process of upgrading it to make it available to you to determine the demand for this type of cluster.

We plan to open it for early user testing in April 2018.

If you have an Apache/Spark pipeline which can benefit from this resource, let us know!


Cloud services - VM, containers, and other services

Containers

● We will provide containers via Shifter or Singularity on Minerva by April 2018.
● This will be part of the new OS stack on the compute nodes.
● We will be adding additional GPU nodes (with newer GPUs) to support containers.

VMs

● We are considering migrating Minerva user websites to multiple VMs to support multiple package requirements (would like your feedback).

● We are considering support for user VMs going forward (would like your feedback).

Database

● We will provide support for MongoDB user databases going forward.


Documentation and training

● For the most recent announcements and updates:
    ○ Join our mail-list: [email protected]
    ○ Follow us on Twitter @mssmhpc
    ○ Minerva user group meetings will be scheduled as needed.

● Four training sessions will be offered this year.
    ○ Two sets of training sessions in spring and fall. Topics include “Introduction to Minerva” and “LSF job scheduler”.
    ○ Introduction to Scientific Computing BSR1015 is now a two credit course with an expanded lab. It’s being taught this spring by Anthony Costa, PhD.

● Documentation update on the website (https://hpc.mssm.edu/).
    ○ We will refresh the website by March.
    ○ We will add newer pages/articles as needed over the next 3-6 months.
    ○ We will provide additional training material (including slides) online.

● We are also considering a new ticket system/knowledge base (please give us feedback).


HPC Roadmap - 2018

[Timeline chart. Activities shown:]

● Finish ESS migration
● New ESS and Flash storage deployment
● CentOS 7 upgrade and package rebuild
● Cloud technology deployment and testing
● New compute nodes
● Compute node early testing
● Documentation and web page updates
● Spring training
● Fall training
● Demeter data science cluster reinstall and deployment

Questions and comments


Thank you!