
Page 1:

WekaIO: Extracting Performance from Modern HPC Hardware

Derek Burke, Regional Account Manager
Chris Weeden, Senior Systems Engineer

Page 2:

Company Overview

COMPANY

Founded in 2013

US HQ in San Jose, CA

Engineering in Tel Aviv, Israel

• Founding team who built XIV, acquired by IBM

• Strong patent portfolio

• 54 Patents Filed

• 10 issued

• Matrix became GA in July 2017

• Showcased at AWS re:Invent in Dec. 2017

• Partner focused

• Rapidly growing customer base

INVESTMENT: $40.35M

OUR ACCOLADES

WHAT WE DO: WekaIO Matrix is the fastest, most scalable parallel file system for AI and technical compute workloads, on premises and in the public cloud.

Page 3:

Single Namespace for Multiple Workloads

[Architecture diagram: application servers and GPU servers connect over an Ethernet or InfiniBand network to MatrixFS on the storage servers, either through the Matrix Client or via NFS/SMB; a unified namespace spans the storage servers and an S3 data lake, on premises or in the public cloud.]

Workloads shown: Microscopy, Genomics, Imaging Device, Video Editing, Media Rendering, HPC, OLTP Databases, Data Mining, Real-time Analytics, Financial Processing & Analyses, Fraud Detection, Geospatial Research.

Page 4:

Focused On Performance Use Cases

• Seismic
• Reservoir simulation
• Analytics
• Machine Learning / AI
• Real-time analytics
• IoT
• Genomics sequencing and analytics
• Drug discovery
• Microscopy
• Algorithmic trading
• Business analytics (SAS Grid)
• Risk analysis (Monte Carlo simulation)
• CFD, simulations
• EDA
• Media rendering
• Transcoding
• Visual Effects (VFX)

Page 5:

We Make a Bold Claim…

From one rack of commodity hardware with a 100GbE network, approximately:

• 400 GB/s

• 30 million IOPS

Page 6:

SPEC SFS2014 vda Results

Lower and Longer Are Better

1.8 ms at the highest latency point

The test is 90% write intensive

Page 7:

SPEC SFS2014 eda Results

Lower and Longer Are Better

0.45 ms latency

The frontend workload is 50% stat operations; the backend is 50% reads and writes

Page 8:

SPEC SFS2014 vdi Results

Lower and Longer Are Better

0.42 ms latency

The test is 60% random write, 20% random read

Page 9:

SPEC SFS2014 db Results

Lower and Longer Are Better

0.29 ms latency

The test is 80% random read, 20% random write

Page 10:

How do we do it…

Page 11:

Why Data Locality is Irrelevant

o A modern network, even at 10 Gbit Ethernet, moves a 4 KB page 30 times faster than an SSD read and 10 times faster than an SSD write

o With the right networking stack, shared storage is faster than local storage

[Chart: time to complete a 4 KB page move, in microseconds, for 100Gbit (EDR), 10Gbit (SDR), SSD Write, and SSD Read]
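As a rough back-of-envelope check of the chart's scale (assuming full link utilization and typical NAND read latencies on the order of 100 microseconds, not figures taken from the slide):

(4096 bytes × 8 bits) / 100 Gb/s ≈ 0.33 µs        (4096 bytes × 8 bits) / 10 Gb/s ≈ 3.3 µs

The wire time for a 4 KB page is small compared with the flash access itself, which is what makes the locality argument work.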

Page 12:

The Kernel is designed for multi-tasking…

• The system needs to be notified of each new packet and pass its data into a specially allocated buffer, the sk_buff struct (Linux allocates one of these buffers for every packet).

• To do this, Linux uses an interrupt mechanism: an interrupt is generated several times as a new packet enters the system, and the packet then has to be transferred to user space.

• As more packets have to be processed, more resources are consumed, negatively impacting overall system performance.

• sk_buff struct: the Linux network stack was originally designed to be compatible with as many protocols as possible, so metadata for all of these protocols is carried in the sk_buff struct even though it is not needed for processing any specific packet. Because of this overly complicated struct, processing is slower than it could be.

• When an application in user space needs to send or receive a packet, it executes a system call. The context is switched to kernel mode and then back to user mode, which consumes a significant amount of system resources.
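For reference, a minimal sketch of this conventional kernel socket path (plain POSIX sockets, not WekaIO code): every packet costs an interrupt on arrival, a system call, a user/kernel context switch, and a copy out of the kernel's sk_buff into the user buffer.

/* Minimal sketch of the conventional kernel receive path described above. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);        /* kernel-managed socket */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                    /* arbitrary example port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        perror("bind");
        return 1;
    }

    char buf[2048];
    for (;;) {
        /* One syscall per packet; the thread may sleep until an interrupt
         * signals arrival, then the payload is copied from kernel space. */
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL);
        if (n < 0) {
            perror("recvfrom");
            break;
        }
        /* ... process n bytes in user space ... */
    }
    close(fd);
    return 0;
}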

Page 13:

Context Switching is wasteful…
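One way to see this cost on Linux (an illustrative sketch, not from the deck): count the voluntary and involuntary context switches a process accumulates while doing blocking I/O, via getrusage(2).

/* Counts context switches accumulated during blocking reads on stdin. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static void report(const char *label)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: voluntary=%ld involuntary=%ld\n",
           label, ru.ru_nvcsw, ru.ru_nivcsw);
}

int main(void)
{
    char buf[4096];
    report("start");
    /* Each blocking read may put the thread to sleep and wake it later,
     * i.e. at least one voluntary context switch per call when no data
     * is immediately available. */
    for (int i = 0; i < 1000; i++) {
        if (read(STDIN_FILENO, buf, sizeof buf) <= 0)
            break;
    }
    report("after blocking reads");
    return 0;
}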

Page 14:

DPDK bypasses the Kernel…

Page 15:

Components of DPDK:

• Environment Abstraction Layer (EAL): responsible for gaining access to low-level resources such as hardware and memory space.

• Memory Manager: responsible for allocating pools of objects in memory. A pool is created in huge-page memory space and uses a ring to store free objects.

• Buffer Manager: significantly reduces the time the operating system spends allocating and de-allocating buffers, using techniques such as bulk allocation, buffer chains, and per-core buffer caches.

• Queue Manager: implements safe lockless queues instead of spinlocks, allowing different software components to process packets while avoiding unnecessary wait times.

• Packet Flow Classification: the DPDK flow classifier implements hash-based flow classification to quickly place packets into flows for processing.

• Poll Mode Drivers: instead of relying on interrupts, a PMD polls the NIC for arrived packets, so packet arrival never interrupts the CPU (see the sketch below).
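A minimal sketch of the poll-mode receive pattern these components enable (generic DPDK usage, not WekaIO's code; EAL initialization, mempool creation, and port/queue setup are assumed and omitted):

/* DPDK poll-mode receive loop: no interrupts, no syscalls, no sk_buff. */
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC queue directly from user space; returns immediately
         * with 0..BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Packet data is reachable via rte_pktmbuf_mtod(bufs[i], void *). */
            rte_pktmbuf_free(bufs[i]);   /* return the mbuf to its mempool */
        }
    }
}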

Page 16:

We also use SR-IOV and SPDK

[Diagram panels: Multiple Virtual NICs (SR-IOV) | Benefits of SPDK]
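A minimal sketch of the SPDK pattern the right-hand panel refers to (generic SPDK NVMe usage, not WekaIO's code; environment setup and controller/namespace/queue-pair attachment via spdk_nvme_probe() are assumed and omitted). The NVMe driver lives in user space and completions are polled rather than delivered by interrupt:

/* User-space NVMe read with polled completion. */
#include <stdbool.h>
#include <spdk/nvme.h>

static volatile bool io_done;

/* Completion callback: invoked from the polling thread, not an interrupt. */
static void read_complete(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
    io_done = true;
}

void read_one_block(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                    void *buf /* DMA-able, e.g. from spdk_dma_zmalloc() */)
{
    io_done = false;

    /* Queue a 1-LBA read at LBA 0 directly onto the NVMe submission queue. */
    spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* lba */, 1 /* lba_count */,
                          read_complete, NULL, 0 /* io_flags */);

    /* Poll the completion queue in user space until the read finishes. */
    while (!io_done)
        spdk_nvme_qpair_process_completions(qpair, 0);
}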

Page 17:

Software Architecture

Runs in LXC Container for isolation

Kernel Module

▪ VFS Interface

Front End

▪ POSIX Client

▪ Other protocol access

Back End

▪ Data placement

▪ Data protection

▪ File system metadata

▪ Tiering

Networking

▪ SR-IOV for the network stack

▪ I/O bypasses the kernel network stack

Storage Agent

▪ I/O bypasses the kernel

[Architecture diagram: Application (user space) → Kernel Module (VFS) or NFS (kernel) → Front End (client access) → Networking (SR-IOV) → Back End (storage services) → Storage Device Agent → Flash (hardware)]

Page 18:

WekaIO Data Path

[Same architecture diagram as on the previous slide: Application → Kernel Module / NFS → Front End → Networking (SR-IOV) → Back End → Storage Device Agent → Flash]

▪ Application IO (file operations)

▪ Access the WekaIO Client as a local FS

▪ User-space, low-latency

▪ POSIX-complete, high-perf

▪ Kernel module for VFS integration

OR

▪ Client-side NFS

▪ Bottlenecked by the kernel

▪ Handled by WekaIO's Front End (see the sketch below)
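From the application's point of view the WekaIO client behaves as a local POSIX file system: ordinary calls go through the VFS kernel module into the user-space front end. A minimal sketch (the mount point and file name below are hypothetical examples, not paths from the deck):

/* Plain POSIX I/O against a WekaIO mount; nothing WekaIO-specific in the code. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path on a WekaIO mount. */
    int fd = open("/mnt/weka/dataset/sample.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    static char buf[1 << 20];                  /* 1 MiB read buffer */
    ssize_t n = pread(fd, buf, sizeof buf, 0); /* served by the front end */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}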

Page 19:

WekaIO Data Path

[Same architecture diagram, now repeated across the nodes of the WekaIO cluster]

▪ Application IO (file operations)

▪ Access the WekaIO Client as a local FS

▪ User-space, low-latency

▪ POSIX-complete, high-perf

▪ Kernel module for VFS integration

OR

▪ Client-side NFS

▪ Bottlenecked by the kernel

▪ Handled by WekaIO's Front End

▪ WekaIO Front Ends are cluster-aware

▪ Incoming read requests are optimized with respect to data location and loading conditions

▪ Incoming writes can go anywhere

▪ Metadata is fully distributed

▪ No redirects required

▪ SR-IOV optimizes network access


Page 20:

WekaIO Data Path

[Same architecture diagram, shown against the WekaIO cluster]

▪ Application IO (file operations)

▪ Access the WekaIO Client as a local FS

▪ User-space, low-latency

▪ POSIX-complete, high-perf

▪ Kernel module for VFS integration

OR

▪ Client-side NFS

▪ Bottlenecked by the kernel

▪ Handled by WekaIO's Front End

▪ WekaIO Front Ends are cluster-aware

▪ Incoming read requests are optimized with respect to data location and loading conditions

▪ Incoming writes can go anywhere

▪ Metadata is fully distributed

▪ No redirects required

▪ SR-IOV optimizes network access

▪ WekaIO directly accesses NVMe flash

▪ Bypassing the kernel gives better performance


Page 21:

File System Scales Linearly with Cluster Size

[Charts, each plotted against cluster size from 0 to 240 nodes:
 Linear Scalability – IOPS: 100% random write and 100% random read, y-axis 0 to 8M IOPS
 Linear Scalability – Latency (QD1): read and write latency, y-axis 0.2 to 0.6 ms
 Linear Scalability – Throughput: 100% write and 100% read, y-axis 0 to 100 GB/second]

Test environment: 30 to 240 r3.8xlarge instances in a single AZ, using 2 cores, 2 local SSD drives, and 10 GB of RAM on each instance (about 5% of the instance's CPU/RAM).

~30K OPS per AWS instance

~375 MB/sec per AWS instance

<400 microsecond latency
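A back-of-envelope check of how the per-instance figures aggregate (assuming simple linear scaling, not a published result):

240 instances × ~375 MB/s ≈ 90 GB/s aggregate throughput
240 instances × ~30K OPS ≈ 7.2M aggregate OPS

which is consistent with the right-hand ends of the throughput and IOPS curves above.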

Page 22:

WekaIO Matrix: #1 File System on SPEC SFS 2014

[Chart: SPEC SFS 2014 benchmark metric for Video Streams, SW Build, Database, EDA, and VDI, comparing WekaIO against Oracle, NetApp, and DDN]

Summary of SPEC SFS 2014 Testing

Benchmark            #1 Position   Score   ORT (ms)   #2 Position   Score   ORT (ms)
Software build       WekaIO        5700    0.26       NetApp        4200     0.78
Database             WekaIO        4480    0.34       Oracle        2240     0.78
Engineering design   WekaIO        2000    0.48       Oracle         900     0.61
Video streams        WekaIO        6800    1.56       DDN           3400    50.07
Virtual desktop      WekaIO        1600    0.48       DDN            800     2.58

Page 23:

#1 File System on the IO500 Test

o 31% better than the world's largest supercomputer

o 85% better than Lustre

[Chart: IO500 Ten Node Challenge score for DDN Lustre, IBM Spectrum Scale, and WekaIO]

The test measures bandwidth and metadata performance. https://www.vi4io.org/io500/list/19-01/10node

Page 24:

https://docs.weka.io https://start.weka.io

Thank you.