Reproducible and Scalable Experiments with SkyhookDM Ceph

Jayjeet Chakraborty

Carlos Maltzahn, Ivo Jimenez, Jeff LeFevre

UC Santa Cruz

Post on 10-Aug-2021

Generic Systems Experimentation Workflow

● Boot VMs or bare metal nodes

● Build, deploy and run experiments

● Prepare plots and notebooks

Doing this manually is time-consuming and error-prone!

High-Level Workflow for SkyhookDM Experiments

● Spawn nodes and deploy k8s. Get the config.

● Benchmark network and disk I/O.

● Deploy Ceph and run RADOS benchmarks.

● Benchmark throughput by running queries on hundreds of GBs of tabular data.

● Study the Jupyter notebooks, plots and Grafana snaps.

Should be platform-independent and automated!

If done manually,

Overview of Containers

● Less resource usage than VMs

● Platform independent and portable software

● Consistent operation across environments

● Greater efficiency

Containerizing Commands

$ docker run -e BLOCKDEVICE=sdb -e IODEPTH=32 -v $PWD:/workspace --rm --entrypoint /bin/bash -w /workspace bitnami/kubectl:1.17.4 ./run_benchmarks.sh

Solves platform dependency, but still lacks automation!
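To make the command above easier to vary, it can be assembled in a script rather than typed by hand. The sketch below only builds and prints the argument list; it does not run Docker. The function name `docker_run_argv` and the host directory `/tmp/experiment` are hypothetical stand-ins for illustration; the image, environment variables and script name are the ones shown on the slide.

```python
import shlex


def docker_run_argv(env, host_dir, image, script, mount="/workspace"):
    """Build the argv for a `docker run` like the one on the slide (not executed here)."""
    argv = ["docker", "run"]
    # One -e flag per environment variable passed into the container.
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    argv += [
        "-v", f"{host_dir}:{mount}",   # bind-mount the experiment directory
        "--rm",                        # remove the container when it exits
        "--entrypoint", "/bin/bash",   # run the script with bash
        "-w", mount,                   # start in the mounted directory
        image,
        script,
    ]
    return argv


argv = docker_run_argv(
    env={"BLOCKDEVICE": "sdb", "IODEPTH": "32"},
    host_dir="/tmp/experiment",        # hypothetical; the slide uses $PWD
    image="bitnami/kubectl:1.17.4",
    script="./run_benchmarks.sh",
)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would execute it on a machine with Docker installed. This still leaves ordering, dependencies and reruns to the user, which is the gap the next slide addresses.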

Introducing Popper

(Diagram: layered stack with Popper on top of Containers, the Operating System, and Hardware.)

Slide borrowed from Ivo Jimenez

“Popperizing” the SkyhookDM Experimentation Workflow

Implementation Highlights

● Highly configurable and scalable workflows with parameter sweeps.

● Monitoring infrastructure with Prometheus + Grafana.

● Everything runs in Kubernetes, using Rook.io, kube-prometheus, kubespray, kubestone, etc.
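As an illustration of what a parameter sweep involves, the sketch below expands a grid of parameter values into one configuration per combination. It is not taken from the SkyhookDM workflows: the helper `sweep`, the `BLOCKSIZE` parameter and all value lists are hypothetical, with `IODEPTH` borrowed from the earlier docker command.

```python
from itertools import product


def sweep(grid):
    """Yield one dict per combination of parameter values (Cartesian product)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Hypothetical grid: two queue depths x two block sizes = four runs.
grid = {"IODEPTH": [16, 32], "BLOCKSIZE": ["4k", "4m"]}
configs = list(sweep(grid))
for cfg in configs:
    print(cfg)
```

Each resulting dictionary could be handed to a benchmark step as environment variables, so adding a value to the grid scales the experiment without editing the step itself.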


Workflows for Every High-level Step


Monitoring with Prometheus and Grafana


Jupyter Notebooks for further Exploration

Case Study: Benchmarking SkyhookDM on River SSL

● Baselined the River SSL Kubernetes cluster and ran performance benchmarks for SkyhookDM Ceph.

● Discovered a bottleneck in network I/O on 10GbE links.

● Discovered unbalanced CPU usage across OSDs due to unbalanced placement groups (PGs).

New avenues for further investigation!
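For context on the 10GbE finding, a back-of-the-envelope calculation shows the ceiling such a link imposes. This is a rough sketch, not a measurement from the case study: it ignores protocol overhead, and the 100 GB data size is an assumed example in the spirit of the "hundreds of GBs" mentioned earlier.

```python
def line_rate_bytes_per_s(gigabits_per_s):
    """Theoretical link ceiling in bytes per second (8 bits per byte, no overhead)."""
    return gigabits_per_s * 1e9 / 8


def min_transfer_seconds(data_bytes, gigabits_per_s):
    """Lower bound on the time to move data_bytes over one link at line rate."""
    return data_bytes / line_rate_bytes_per_s(gigabits_per_s)


ceiling = line_rate_bytes_per_s(10)            # 10GbE link
seconds = min_transfer_seconds(100e9, 10)      # assumed 100 GB of data
print(f"ceiling: {ceiling / 1e9:.2f} GB/s")
print(f"at least {seconds:.0f} s to move 100 GB")
```

A single 10GbE link therefore tops out at 1.25 GB/s, so moving 100 GB takes at least 80 seconds however fast the disks and CPUs are.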

Future Work

● Capture and create workflows for other categories of Ceph benchmarks, such as CephFS and Ceph RBD.

● Create experiment/benchmark workflows for other popular systems, e.g., key-value stores like RocksDB and databases like PostgreSQL.

Thank you!

Questions?

Visit https://github.com/uccross/skyhookdm-workflows/

https://twitter.com/heyjc25

[email protected]