
ClusterVision Engineer Innovate Integrate

Themis Athanassiadou

HPC Project Manager

ClusterVision


About ClusterVision

• 12 years as Europe's dedicated specialist for High-Performance Computing

• End-to-end hardware/software/services solution provider

• HPC engineering and innovation is at the heart of what we do

• Active in Europe, the Middle East & Africa, and Asia-Pacific

• Based in Amsterdam, well connected to major European business locations

• Quality: ISO 9001:2008 and ISO 14001 certified

• More than 400 projects and 250 customers


ClusterVision Customers

Government Education Industry


HPC in Industry

Share of TOP500 systems by segment:

Industry 51%
Research 24%
Academic 18%
Government 4%
Vendor 3%


“Today, to Out-Compute is to Out-Compete”


HPC:

• Enables development of new products and services

• Reduces time to market

• Reduces R&D costs

• Improves product quality

• Reduces personnel costs


HPC powers industry giants…


and empowers a broad spectrum of other businesses


Adopting HPC can be challenging for many businesses

• Lack of infrastructure

• Cost of equipment

• Cost of operation

• Lack of expertise

• Lack of experience


ClusterVision innovations that ease HPC adoption

OpenStack Cloud Compute

▪ Convergence of big data, cloud and HPC

▪ Merge with the general IT infrastructure

▪ Open industry standard, open source

▪ Securely host CPU cycles and storage

HPC mineral-oil-based cooling solution

▪ Saves ~20% power: no air cooling or fans

▪ Less current leakage, higher HPC performance

▪ Skinless servers

▪ Racks and oil can be reused for more than 15 years

Remote System Administration

▪ Outsourced infrastructure management
▪ Power of scaling: lower your cost
▪ Especially suitable for OpenStack
▪ You can focus on workflow instead of hardware

Trinity: HPC-enabled cloud environment

▪ OpenStack base
▪ Support for containers
▪ Enhanced for high-performance computing
▪ Full HPC ecosystem


Why cloud? Freedom of choice and flexibility

Main characteristics:

• On-demand self-service

• Broad network access

• Resource pooling

• Rapid elasticity

• Measured service

Service models:

• Software as a Service

• Platform as a Service

• Infrastructure as a Service

Deployment models:

• Private Cloud

• Public Cloud

• Hybrid Cloud

• Community Cloud


Ideally…

Engineering (HPC nodes), the IT department and Finance all share one pool of physical hardware (servers, database, storage, GPU pool).


To get there, you need some form of virtualization

Diagram: Engineering, the IT department and Finance each run in their own VM; a virtual machine monitor runs the VMs on the shared physical hardware (servers, database, storage, GPU pool).


Virtualization technologies

HYPERVISORS (e.g. VMware, Xen, KVM)
• Full virtualization
• Great workload isolation
• Slower

VS

CONTAINERS (e.g. LXC, Docker)
• Lightweight virtualization
• Good workload isolation
• Faster

Image credit: CISCO


Which is best for HPC?

HPC applications:
• are usually tuned to specific hardware
• strive to maximize performance (compute, I/O)
• sometimes require a very fast network (InfiniBand)

Clouds, however, only guarantee a "minimal" level of performance.

IBM research report: RC25482 (AUS1407-001) July 21, 2014

CPU test: Linpack performance on 2 sockets (16 cores)


Is performance using container virtualization good enough?

According to the IBM study:

• Docker equals or exceeds KVM performance in every case tested (CPU, memory, I/O).
• For I/O-intensive workloads, both forms of virtualization should be used carefully.
• Network performance, which is very important in many applications, needs to be tested.

CPU test: Linpack performance on 2 sockets (16 cores)

Container-based virtualization is a great starting point for HPC in the cloud.
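The IBM comparison boils down to the ratio of throughput under virtualization to bare-metal throughput. A minimal sketch of that arithmetic, assuming Linpack-style results in GFLOPS; the function names and the sample numbers are illustrative assumptions, not figures from the report:

```python
def relative_performance(virtualized_gflops: float, native_gflops: float) -> float:
    """Fraction of bare-metal throughput retained under virtualization (1.0 = no loss)."""
    if native_gflops <= 0:
        raise ValueError("native_gflops must be positive")
    return virtualized_gflops / native_gflops


def overhead_percent(virtualized_gflops: float, native_gflops: float) -> float:
    """Throughput lost to virtualization, as a percentage of bare metal."""
    return 100.0 * (1.0 - relative_performance(virtualized_gflops, native_gflops))


if __name__ == "__main__":
    # Hypothetical Linpack results in GFLOPS, for illustration only (not IBM's data).
    native = 300.0
    measured = {"container": 297.0, "hypervisor": 255.0}
    for name, gflops in measured.items():
        print(f"{name}: {relative_performance(gflops, native):.1%} of native, "
              f"{overhead_percent(gflops, native):.1f}% overhead")
```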


Building a suitable HPC Cloud

An HPC Cloud should strive to find the balance between flexibility, convenience and acceptable performance.



Choosing a suitable Cloud for HPC

• It depends on your needs (public or private? level of abstraction? specialized hardware? performance?)

• A number of companies, including Penguin, R-HPC, Amazon, Univa, SGI, Sabalcore, UberCloud and Gompute, offer specialized HPC clouds. Evaluate them and choose.

• For full freedom of choice, customization and security, build your cloud from an open-source project (OpenStack, OpenNebula, Eucalyptus) and add HPC functionality using containers.


OpenStack: the cloud-building toolkit of choice

• Open-source set of software tools for building and managing cloud computing platforms for public and private clouds.
• Currently managed by the OpenStack Foundation.
• More than 200 companies have joined, including Dell, Intel, Red Hat and Oracle.
• Rapidly becoming the industry standard.
• Primarily deployed as an infrastructure as a service (IaaS) solution.



Challenges addressed by OpenStack

• Manager: Resources are wasted when the Virtual Desktop Infrastructure (VDI) is idle at night.
  Solution: With OpenStack you can easily switch idle VDI resources to HPC.

• Admin: Inability to collaborate with external parties due to the lack of a security infrastructure for hosting CPUs/disks.
  Solution: With OpenStack, you can securely host CPU/disk for paying customers through virtual instances/environments.

• Admin: Inflexibility in building and maintaining similar environments on multiple physical platforms.
  Solution: With OpenStack, you can easily share images that contain predefined application binaries and/or environments.

• User: The finance department needs to run tomorrow's payroll and doesn't have the resources to do so in time.
  Solution: With OpenStack, this user can more easily access a larger, or even external, infrastructure with the required CPU/storage environment.

• Manager: Virtual server and HPC environments run independently.
  Solution: With OpenStack, you merge these into a single efficient environment.


Enhancing OpenStack for HPC

• Full HPC stack
  - Monitoring
  - Checking
  - Module and library environment
  - Scheduler (SLURM, PBS, SGE etc.)
• Performance optimisations (containers)
• Typical HPC services and integration
  - InfiniBand
  - Parallel filesystems
  - GPUs/accelerators


Trinity: Linking Cloud and HPC

Trinity is a set of software tools for building and managing virtual HPC or OpenStack environments in a platform as a service (PaaS) solution, customized for HPC performance.


• Adds ease of management (Trinity dashboard) to HPC

• Scalable to tens of thousands of nodes

• Full hardware support (IPMI, InfiniBand, PXE)

• Provides full HPC stacks (schedulers, MPI, libraries)

• No performance loss (virtualization based on Docker)

• Allows customers to host their own private or public IaaS cloud (for general IT)

• Load balancing (HPC) partitions

• Environment customization


Features compared: Bright CM, IBM Platform, Trinity

• Node provisioning
• Health check and monitoring
• GUI and command line interfaces
• SLURM, SGE & PBS support
• Parallel shell
• Modules environment
• Compilers, debuggers & profilers
• MPI + scientific libraries
(All of the standard HPC cluster manager requirements included)
• Containerized HPC building blocks
• Cloud computing ready


A traditional cluster

Login node(s) Worker node(s) Storage node(s)


A Trinity managed cluster

Login node(s) Worker node(s) Storage node(s)

Virtual Cluster A: runs the HPC stack for department A

Virtual Cluster B: runs the HPC stack for department B

Virtual Cluster C: runs VDI

Virtual Cluster D: runs general IT infrastructure using OpenStack

Trinity Dashboard

(Single management interface)
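Partitioning one physical cluster into virtual clusters A to D amounts to giving each node at most one owner. A minimal sketch of that bookkeeping, assuming an in-memory table; `PartitionTable` and its methods are hypothetical names, not Trinity's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class PartitionTable:
    """Hypothetical model: each physical node belongs to at most one virtual cluster."""
    nodes: set[str]
    owner: dict[str, str] = field(default_factory=dict)

    def assign(self, cluster: str, requested: list[str]) -> None:
        """Add nodes to a virtual cluster, refusing nodes owned by another cluster."""
        unknown = [n for n in requested if n not in self.nodes]
        if unknown:
            raise ValueError(f"unknown nodes: {unknown}")
        taken = [n for n in requested if self.owner.get(n, cluster) != cluster]
        if taken:
            raise ValueError(f"nodes already in another virtual cluster: {taken}")
        for n in requested:
            self.owner[n] = cluster

    def release(self, cluster: str) -> None:
        """Return all nodes of a virtual cluster to the free pool."""
        self.owner = {n: c for n, c in self.owner.items() if c != cluster}

    def members(self, cluster: str) -> list[str]:
        return sorted(n for n, c in self.owner.items() if c == cluster)

    def free_nodes(self) -> list[str]:
        return sorted(self.nodes - set(self.owner))


if __name__ == "__main__":
    table = PartitionTable(nodes={f"node{i:02d}" for i in range(1, 9)})
    table.assign("A", ["node01", "node02", "node03"])  # HPC stack, department A
    table.assign("C", ["node07", "node08"])            # VDI
    print(table.members("A"), table.free_nodes())
```

Rejecting a node that already belongs to another virtual cluster is what keeps the partitions isolated; moving capacity between departments then becomes an explicit release followed by an assign.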


In summary:

• An HPC cloud is a powerful tool in the arsenal of any business, small or large.
• It provides both power and flexibility at many stages of product design and testing.
• It can reduce cost by consolidating resources used for different purposes.
• New software technologies are alleviating performance concerns.
• There are many vendor choices to suit every need.
• Trinity is a great choice for a private cloud, providing a full cluster manager, HPC stack and cloud management for HPC, IT and data.


Thank You!


HPC, Cloud and BigData are coming together

Diagram layers: hardware resources (nodes, network, disks etc.), resource management, compute virtualization, and workloads (HPC, CRM, database, VDI, email). Also shown: monitoring, authentication, deployment, object storage and dashboard.

HPC, Cloud and BigData have a lot of overlap:
• Centralised resources
• Same complex management, same complex environment
• Similar high-performance storage and powerful server requirements
• Almost the same physical networking
• Same controller <-> worker-node relationship


Neat tricks in Virtual Machines

Reliable checkpoint restart of jobs

• Towards a 100% scheduling efficiency

• Fast track high priority users

• Move jobs within the cluster & outside the cluster

The price: losing performance (5% -> 1%)

Diagram: jobs on nodes A, B, C and D plotted against time t, in two panels that each mark the current time.
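The slide describes checkpoint/restart at the virtual-machine level. The sketch below swaps in the simpler application-level variant to illustrate the same idea: a job periodically saves its state, so it can be stopped (for example, to fast-track a high-priority user) and resumed later, possibly on another node that shares the filesystem. The toy workload, file layout and the `run` function are assumptions for illustration only:

```python
import os
import pickle
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state atomically: dump to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the last saved state, or None if the job has never checkpointed."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path: str, total_steps: int, stop_after: Optional[int] = None,
        checkpoint_every: int = 10) -> dict:
    """Toy job: sum 0..total_steps-1, resuming from the checkpoint at `path` if present.

    `stop_after` simulates preemption by stopping after that many steps in this run.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    steps_this_run = 0
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and steps_this_run >= stop_after:
            save_checkpoint(path, state)  # checkpoint just before being preempted
            return state
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "job.ckpt")
        run(ckpt, 100, stop_after=35)  # first run is preempted after 35 steps
        print(run(ckpt, 100))          # restart resumes at step 35; prints {'step': 100, 'total': 4950}
```

Writing to a temporary file and renaming it with `os.replace` keeps the previous checkpoint intact if the job is killed mid-write.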


Why HPC in the Cloud?

What is OpenStack?

A set of software tools for building and managing cloud computing platforms for public and private clouds.

OpenStack is primarily deployed as an infrastructure as a service (IaaS) solution.

OpenStack began in 2010 as a joint project of Rackspace Hosting and NASA.

Currently, it is managed by the OpenStack Foundation. More than 200 companies have joined this project, including Dell, Intel and Oracle.


Challenges addressed with OpenStack

• Manager: Resources are wasted when the Virtual Desktop Infrastructure (VDI) is idle at night.
  Solution: With OpenStack you can easily switch idle VDI resources to HPC.

• Admin: Inability to collaborate with external parties due to the lack of a security infrastructure for hosting CPUs/disks.
  Solution: With OpenStack, you can securely host CPU/disk for paying customers through virtual instances/environments.

• User group: Inflexibility in building and maintaining similar environments on multiple physical platforms.
  Solution: With OpenStack, you can easily share images that contain predefined application binaries and/or environments.

• User: Has a paper/conference deadline tomorrow and requires vast amounts of CPU cycles immediately.
  Solution: With OpenStack, this user can more easily access a larger, or even external, infrastructure with the required CPU/storage environment.

• Manager: VMware and HPC environments run independently of each other.
  Solution: With OpenStack, you merge these into a single efficient environment.


Using the cloud to address growing challenges in HPC

IT Manager

▪ Rising infrastructure & personnel costs

▪ Growing complexity / fragmented infrastructure

▪ Increasingly complicated personnel needs

System Administrator

▪ Growing complexity of hardware & software environment

▪ Managing tenants with different needs/workflows

▪ Managing secure access to resources

▪ Dealing with hardware changes

User

▪ Cluster environment differs from the workstation
▪ Software stack needed for the workflow is not pre-installed
▪ Resources are unavailable when they are needed most


Why HPC in the Cloud?

What can the Cloud bring to HPC?

HPC and Cloud also have significant differences

Cloud:
● Splits bigger computing units into smaller ones
● Timesharing execution model
● Elastic
● Provides commodity (virtual) hardware
● Increases utilisation to 85%

HPC:
● Merges smaller computing units into a single whole
● Batch execution model
● Backfilling
● Needs specialised hardware (GPU, IB)
● Utilisation already above 85%
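Backfilling is how batch schedulers keep utilisation high: when the highest-priority job cannot start yet, smaller jobs further down the queue may run in the gaps, as long as they do not delay it. A minimal sketch of the EASY variant (a single reservation for the head of the queue), assuming each job runs for exactly its requested walltime; `Job` and `easy_backfill` are hypothetical and not taken from SLURM or any other scheduler:

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime: float  # requested run-time limit; assumed to equal the actual runtime


def easy_backfill(total_nodes: int, running: list[tuple[int, float]],
                  queue: list[Job], now: float = 0.0) -> list[str]:
    """Return the names of queued jobs that may start at time `now`.

    running: (nodes, end_time) for jobs already executing.
    queue:   waiting jobs in priority order.
    """
    free = total_nodes - sum(n for n, _ in running)
    releases = [(end, n) for n, end in running]  # (time, nodes freed)
    pending = list(queue)
    started: list[str] = []

    # 1. Start jobs strictly in priority order while they fit.
    while pending and pending[0].nodes <= free:
        job = pending.pop(0)
        free -= job.nodes
        releases.append((now + job.walltime, job.nodes))
        started.append(job.name)
    if not pending:
        return started

    # 2. Reserve nodes for the blocked head job at the earliest possible time
    #    (the "shadow time"); nodes beyond its need at that time are "extra".
    head = pending.pop(0)
    avail, shadow = free, None
    for end, n in sorted(releases):
        avail += n
        if avail >= head.nodes:
            shadow = end
            break
    if shadow is None:
        raise ValueError(f"job {head.name} needs more than {total_nodes} nodes")
    extra = avail - head.nodes

    # 3. Backfill: a later job may start now if it fits in the free nodes and either
    #    finishes before the shadow time or only uses the extra nodes.
    for job in pending:
        if job.nodes > free:
            continue
        if now + job.walltime <= shadow:
            free -= job.nodes
            started.append(job.name)
        elif job.nodes <= extra:
            free -= job.nodes
            extra -= job.nodes
            started.append(job.name)
    return started


if __name__ == "__main__":
    queue = [Job("A", 8, 5.0), Job("B", 2, 5.0), Job("C", 2, 20.0), Job("D", 3, 1.0)]
    print(easy_backfill(10, [(6, 10.0)], queue))  # ['B', 'C']
```

In the example, 6 of 10 nodes are busy until t = 10, so the 8-node job A must wait until then. The short 2-node job B finishes before t = 10, and the long 2-node job C fits in the 2 nodes that A will not need, so both start immediately; D no longer fits.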


ClusterVision Recent References

Dolphin Geophysical

▪ Lasting Partnership

▪ Onshore / offshore

▪ HPC solutions & extended consulting services

▪ Bright, BeeGFS, Dell / Asus hardware

Volvo IT

▪ HPC services partner for Dell

▪ Main installs

▪ Lyon (France)

▪ Gothenburg (Sweden)

Ecole Polytechnique Fédérale de Lausanne

▪ Framework contract: 512+ compute nodes

▪ Intel Ivy Bridge / Haswell (True Scale & servers)

▪ Close collaboration on application fine tuning

▪ xCAT 2

National Supercomputer Center (Sweden)

▪ 640 compute nodes

▪ Asus systems with 4 nodes in 2U

▪ First Haswell reference: Xeon E5-2640 v3 (8 cores / 2.6 GHz)


What is RSA (Remote System Administration) and why does it fit so nicely?

What?

ClusterVision end-to-end HPC management

Why?

- Avoid single point of failure

- Empower your highly qualified admins

- Use the power of scaling: lower your cost

- Reliability of service delivery

How?

- Central monitoring system

- Know instantly when something is wrong and react to it

- Software updates

- Remote and onsite repair

- Notification and explanation for actions

- Management reporting
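Central monitoring of this kind is typically assembled from small check plugins. A minimal sketch of a disk-usage health check that follows the Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus a one-line status message with performance data after `|`); the thresholds and monitored path are assumptions, and this is not ClusterVision's RSA tooling:

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent: float, warn: float, crit: float) -> int:
    """Map a usage percentage onto a Nagios-style status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0) -> tuple[int, str]:
    """Return (exit_code, status_line) for disk usage at `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent, warn, crit)
    # Status text, then performance data: label=value;warn;crit;min;max
    line = (f"DISK {STATUS_NAMES[code]} - {path} {used_percent:.1f}% used"
            f" | used={used_percent:.1f}%;{warn:g};{crit:g};0;100")
    return code, line


def main() -> int:
    code, line = check_disk()
    print(line)
    return code
```

When run by a monitoring agent, the script would end with `sys.exit(main())` so that the exit code, not just the printed line, reaches Nagios.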


OpenStack Trinity


Based on open source components with custom OpenStack plugins

● xCAT: deployment
● Docker: virtualization and image management
● SLURM: scheduling
● OpenMPI: communication
● RSA/Nagios: monitoring and health checks

Cookbook-style recipes and easy customization


Trinity

Why xCAT?

1. xCAT is stable, full-featured, well tested and open source
2. xCAT is also usable without OpenStack
3. Supports node discovery, image management, stateless nodes, IPMI abstraction and PXE boot
4. OpenStack Ironic is still in beta
5. Bright is expensive
6. The main xCAT weakness (lousy UI) will be mitigated by standardization and by adding a custom OpenStack dashboard to xCAT


Trinity

Why Docker?

Docker is an operating system level virtualization framework.

1. Used extensively by Google and other industry giants

2. Lightweight

3. Much faster than full machine virtualization

4. Good support for image management and versioning

5. Pluggable virtualization drivers (LXC, OpenVZ)

6. Pluggable storage virtualization (AUFS, devicemapper, VFS)
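Pluggable storage drivers such as AUFS present a container's filesystem as a stack of read-only image layers with one writable layer on top. A simplified in-memory model of the lookup and copy-on-write behaviour, assuming dictionaries stand in for layers and a sentinel object marks deleted files (AUFS itself uses `.wh.`-prefixed whiteout files on disk); `LayeredFS` is hypothetical and not Docker's code:

```python
from typing import Optional

WHITEOUT = object()  # marks a path deleted in an upper layer


class LayeredFS:
    def __init__(self, image_layers: list[dict[str, bytes]]):
        # Image layers are ordered bottom (base image) to top and are never modified.
        self._image_layers = image_layers
        self._writable: dict[str, object] = {}  # per-container layer

    def read(self, path: str) -> Optional[bytes]:
        """Return the topmost version of `path`, or None if absent or deleted."""
        for layer in [self._writable, *reversed(self._image_layers)]:
            if path in layer:
                value = layer[path]
                return None if value is WHITEOUT else value
        return None

    def write(self, path: str, data: bytes) -> None:
        # Copy-on-write: changes land only in the writable layer.
        self._writable[path] = data

    def delete(self, path: str) -> None:
        # Hide lower-layer copies instead of modifying read-only layers.
        self._writable[path] = WHITEOUT
```

Because image layers are never modified, several containers can share the same base layers while each keeps its own writable layer.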


Current status of Trinity

Phase 1 of the project is complete.

We support the following core features:
1. Cluster partitioning (virtual clusters)
2. Different OS and application versions on different clusters
3. Isolated (sandboxed) applications between clusters
4. Setting up OpenStack inside Trinity

Supporting this “in the field” will require manual customization and improvisation. There is no overall dashboard yet.

Proof of concept (Q3 2014): Bristol and Cambridge universities.


A new approach: Container-based virtualization

Linux containers

LXC is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a single control host.

- Lightweight
- Fast provisioning
- Workload isolation
- Near bare metal performance

Docker is an open-source project that automates the deployment of applications inside containers by providing an additional layer of abstraction and automation.


Adopting cloud computing is easier

• Lack of infrastructure

• Cost of equipment

• Cost of operation

• Lack of expertise

• Lack of experience


Current status of HARP

Phase 2 (Q3 2014):

1. Allow remote management (RSA inside OpenStack dashboard)

2. Allow self-service partitioning by customers (use case 5)

Phase 3 (Q4 2014):

1. Allow automated elasticity (meta-scheduling: repartition resources based on a calendar, use case 2)

2. Allow sharing of resources between HPC clusters

3. Allow self-service of jobs by customers of our customers (SaaS model)
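Calendar-based repartitioning (Phase 3, and the earlier example of VDI capacity sitting idle at night) can be expressed as a rule that says how many nodes each partition should own at a given time. A minimal sketch, assuming a fixed node pool and a simple office-hours rule; the partition names, hours and both functions are hypothetical, not part of Trinity or HARP:

```python
from datetime import datetime


def target_split(when: datetime, total_nodes: int, vdi_daytime_nodes: int,
                 day_start: int = 8, day_end: int = 18) -> dict[str, int]:
    """How many nodes the VDI and HPC partitions should own at `when`.

    Hypothetical rule: VDI gets its nodes on weekdays between day_start and day_end;
    at night and at weekends every node goes to HPC.
    """
    if not 0 <= vdi_daytime_nodes <= total_nodes:
        raise ValueError("vdi_daytime_nodes must be between 0 and total_nodes")
    office_hours = when.weekday() < 5 and day_start <= when.hour < day_end  # Mon=0..Fri=4
    vdi = vdi_daytime_nodes if office_hours else 0
    return {"vdi": vdi, "hpc": total_nodes - vdi}


def moves(current: dict[str, int], target: dict[str, int]) -> dict[str, int]:
    """Nodes to add (positive) or drain (negative) per partition to reach the target."""
    return {name: target[name] - current.get(name, 0) for name in target}


if __name__ == "__main__":
    current = {"vdi": 40, "hpc": 60}
    target = target_split(datetime.now(), total_nodes=100, vdi_daytime_nodes=40)
    print(target, moves(current, target))
```

A meta-scheduler would evaluate such a rule periodically and drain or hand over nodes according to `moves`; draining running HPC jobs before the morning handover is left out of this sketch.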