customer-oriented storage performance management · 2019-12-21 · performance monitoring in spite...

26
Customer-Oriented Storage Performance Management Dany Felzenszwalbe Intel

Upload: others

Post on 11-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Customer-Oriented Storage Performance Management

Dany Felzenszwalbe Intel

Page 2: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Legal Notices

This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright © 2014, Intel Corporation. All rights reserved.

2

Page 3: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

About the Presenter

Dany Felzenszwalbe

Joined Intel in 2001 Working in PDIT / EC Product Development IT Engineering Computing

Primary focus is NAS/NFS data and storage solutions

3

Page 4: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Agenda Intel Design NAS environment overview Performance monitoring challenges Performance management overview Reactive performance management Proactive performance management Lessons learnt Call for Action

4

Page 5: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Intel’s Design NAS Environment

5

43 Petabytes of NFS capacity • On > 1000 fileservers • Multiple vendors / platforms • In 44 locations

91% in the largest 10 data centers • 58% in the largest 4 data centers • From 1 to 190 fileservers in a site

More than 1000 projects • About 30,000 Users (aka Customers) • 20-30 million batch jobs running weekly

Page 6: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Motivation

6

All of my rsync jobs are stuck for a few hours !

“I’m trying to check-out a model and it’s taking 30 minutes instead of 2… !!!”

“My Opus GUI got stuck - any idea why… ?!”

“My team’s Batch jobs are failing all of the sudden !”

“Are there NFS problems AGAIN with our tools ?”

“The regression today takes much longer than usually - all Netbatch jobs ran slower. Any idea why ?”

“a simple ls takes ages to complete - why ?”

“We are aware of the issue and have already suspended Batch jobs to resolve it…”

Impact: Interactive users work gets interrupted Batch jobs run slower, cause longer waits which wastes cycles Batch jobs get suspended as a resolution and waste cycles

We want to be as proactive as possible in identifying and solving NFS performance issues

Page 7: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

NFS Performance

4 performance (and cost) tiers are defined for NFS

The design teams/activities are expected to use the appropriate tiers

When performance is “bad” the tier “doesn’t matter”

Understanding when it happens is challenging

Severe NFS performance issues could impact a project’s TTM

Until 2007 we relied on customers as our Monitoring

7

Page 8: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Customer-Oriented Storage Performance Management

8

Proactive Performance Management Reactive Performance Management – Real-time

Automatic Analysis

Server Client

Resources Utilization

Filesystem Latency

Monitoring

User Experience

Custom Use-cases

Solutions Tool-Box

And Optimizations

Future work Analytics

Long-term Trending

Performance Management

Visualization

OPS Latency Monitoring

Resolution

Page 9: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Resources Utilization Monitoring

Monitoring the fileservers’ CPU utilization, disk-drives utilization, network collisions and such

Used Cricket (Open Source from SourceForge) for tracking and as an alerting mechanism

No correlation between high utilization and users reporting performance issues

9

Page 10: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Client-Side User-Experience Monitoring Simulate user activity to identify problems Slow mounts and reads

Using home grown monitoring framework 2 clients Different subnets Duration matters Time of day Server type/tier

10

Two clients mount and copy a 10MB

file from every fileserver every minute

Threshold is crossed

Read latency

Mount latency

Page 11: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Custom Use-Case Performance Monitoring

In spite of the successful monitoring capabilities some undetected events remained painful

Still needed to identify problems before customers Cloning with GIT and a similar tool were chosen

and monitors were created Hard to cover many use-cases Very project and customer dependent Not very granular

Looking to generalize and ease implementation

11

Page 12: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Latencies Monitoring

Latency is a key metric that should provide visibility into customer impact

Fileserver reported latencies should correlate with the user-experience monitoring and should also match other performance issues (“misses”)

Different platforms offer different capabilities We have been experimenting with “disk”

latencies and NFS operations latencies

12

Page 13: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

NFS operations latencies

13

link_latency

Page 14: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Analysis and Resolution

We know that a NFS server is slow – now what ?

We need to promptly clear the impact We need the same solutions for any platform

14

!

Page 15: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Yana – Yet Another NFS Analyzer Analyzes the NFS traffic in a packet trace and

checks fileserver and Batch environment info Dynamically triggered when monitoring identifies

performance degredation

Analysis

15

Server model, tier and groups using it

Workload (OPS mix)

Top-hitter user with OPS mix and Batch jobs amounts and

paths being used

Fileserver name, severity (duration), copy/mount

Page 16: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Resolution

The most common reason for “NFS Slowness” are multiple Batch jobs saturating the server

The “Yana” information tells us what to do “Rasta” – Review and Suspend Tool/Advisor Manually run Suspends batch jobs Resumes jobs gradually Exclusion for “Special” users supported

16

Page 17: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Automatic Resolution

Different approach – “autosuspend” No human intervention required Starts at the first sign of slowness Hooks on “yana” and suspends jobs by clients’ traffic

Tradeoff between NFS Slowness and Batch jobs suspending

17

Rasta suspending

jobs

Autosuspend takes action

after 90 seconds

Page 18: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Analytics

Can't see the forest for the trees Many sites, servers, admins, customers, jobs Needed trending and visualization Needed to focus on the “big” top-hitters

18

BI Needed !

Page 19: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Trending

19

Page 20: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Top-Hitters

20

The fileservers with the most slowness and the users with the longest job suspends are what we want to take action on !

Page 21: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Drill-down for action

21

We can tell which problems are reoccurring by fileserver, specific export/path and activity

Page 22: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Our Solutions Tool-Box

Data migration – to appropriate tier or platform Batch dispatch configurations – slow ramp Local disk caching enablement Smarter Allocation – consider slowness history Customer Flow changes Refresh of old HW Backups tuning

22

Page 23: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Future Plans

Checking latencies usability Integration of new platforms Analyzing non-NFS traffic Monitoring SMB performance QoS implementation Automation for reoccurring top-hitters

23

Page 24: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Lessons Learnt

User-experience monitoring Sometimes the simplest solution works best Automate everything, with caution

NFS performance issues can be resolved Sometimes a downtime may be required

24

Cost Tier

Page 25: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Call For Action

What we need from the NAS Vendors and/of the open source community: Identify “Hot files/users/clients” Alerts on counters with thresholds When there is a performance problem ? And Why would be ideal –

CPU ? Cache ? Disks ? Network ? Standards and API ?

25

Page 26: Customer-Oriented Storage Performance Management · 2019-12-21 · Performance Monitoring In spite of the successful monitoring capabilities some undetected events remained painful

2014 Storage Developer Conference. © Intel. All Rights Reserved.

Q&A

Any questions or feedback, please contact me at [email protected]

26