Introduction to Cluster Schedulers

Introduction to Cluster Schedulers. 劉 春来 (Ryu Shunrai). 2015-02-18 @ Container勉強会 (Container study group), Tokyo

Upload: chuenlye-leo

Post on 14-Jul-2015


Page 1: Introduction to Cluster Schedulers

Introduction to Cluster Schedulers

劉 春来 (Ryu Shunrai)

2015-02-18 @ Container勉強会 (Container study group), Tokyo

Page 2: Introduction to Cluster Schedulers

About me

DevOps @ Cloud team, CyberAgent

mail: [email protected]: @chunlai_226

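Page 3: Introduction to Cluster Schedulers

Cluster scheduling: why this talk?

Containers carry many kinds of workloads: batch and service (a rough two-way split)
↓
The question of where each workload actually runs (cluster scheduling)
↓
Background from our PaaS evaluation: resource sharing across multiple clusters on a PaaS with multiple workloads and multiple tenants

(a cluster scheduler with dynamic sharing)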

Page 4: Introduction to Cluster Schedulers

What is dynamic sharing?

First, let's look at its opposite: static partitioning.

Page 5: Introduction to Cluster Schedulers

Static partitioning

Clusters such as the web cluster, the DB cluster, and the Hadoop cluster each own a dedicated set of servers and do not share them.

● hard to utilize machines
● hard to scale elastically
● hard to deal with failures

Illustrated on pp. 30-40 of: https://speakerdeck.com/benh/apache-mesos-nyc-meetup

Page 7: Introduction to Cluster Schedulers
Page 8: Introduction to Cluster Schedulers

Dynamic sharing

Running multiple frameworks in a single cluster can
● maximize utilization
● share data between frameworks
● simplify the infrastructure
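
The utilization argument can be made concrete with a little arithmetic. The sketch below compares unmet demand under fixed partitions with unmet demand under one shared pool. The machine counts and the helper names `static_unmet` and `shared_unmet` are made up for illustration and do not come from the talk.

```python
# Hypothetical numbers, chosen only for illustration.
def static_unmet(demand, partition_size):
    """Demand that cannot be served when each framework owns a fixed partition."""
    return {name: max(0, need - partition_size[name]) for name, need in demand.items()}


def shared_unmet(demand, total):
    """Demand that cannot be served when all frameworks draw from one pool."""
    return max(0, sum(demand.values()) - total)


partitions = {"web": 40, "db": 30, "hadoop": 30}  # 100 machines, statically split
demand = {"web": 15, "db": 20, "hadoop": 55}      # machines each framework needs now

unmet_static = static_unmet(demand, partitions)
unmet_shared = shared_unmet(demand, sum(partitions.values()))

print(unmet_static)  # hadoop is short of machines while web and db sit partly idle
print(unmet_shared)  # the same 100 machines cover the total demand of 90
```

With these numbers the Hadoop partition is 25 machines short even though 35 machines are idle elsewhere, while the shared pool has no shortfall.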

Page 9: Introduction to Cluster Schedulers

Challenges of dynamic sharing

Dynamic sharing brings large benefits, but it also makes cluster scheduling more complex:

● a wide range of requirements and policies have to be taken into account

● clusters and their workloads keep growing and since the scheduler's workload is roughly proportional to the cluster size, the scheduler is at risk of becoming a scalability bottleneck.

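Page 10: Introduction to Cluster Schedulers

Two representative systems: Mesos and Omega

● Mesos: open-source software that grew out of a research project. It has a published paper and large-scale production use at Twitter, Airbnb, and others. Twitter manages more than 30,000 servers with Mesos (http://www.centurylinklabs.com/interviews/making-clustered-infra-look-like-one-big-server-with-mesosphere/)

● Omega (not open source):
1) Google's next-generation cluster management platform (its predecessor is a system called Borg, with several years of production use). See: https://www.usenix.org/cluster-management-google
2) The Omega paper: Google's cluster scheduler, 2013
※ One of the co-authors of the Mesos paper is also a co-author of the Omega paper.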

Page 11: Introduction to Cluster Schedulers

Three types of cluster schedulers

● Monolithic schedulers: the scheduler of Borg (Omega's predecessor), Apache Hadoop YARN (according to the Omega paper)

● Two-level schedulers: Mesos, Hadoop-on-Demand

● Shared-state schedulers: Omega

Page 12: Introduction to Cluster Schedulers

Scheduler architectures

Page 13: Introduction to Cluster Schedulers

Monolithic scheduler

Uses a single, centralized scheduling algorithm for all jobs.

Google's current (2013) cluster scheduler is effectively monolithic, although it has acquired many optimizations over the years: it provides internal parallelism and multi-threading to address head-of-line blocking and scalability.

Page 14: Introduction to Cluster Schedulers

Two-level scheduler (Mesos)

Mesos: controls resource allocations to schedulers

Schedulers: make decisions about what to run given allocated resources
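
The split between the two levels can be sketched in a few lines. This is a toy model, not the Mesos API: the class names `Allocator`, `Framework`, and `Offer`, the round-robin offer order, and the CPU-only resources are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int


@dataclass
class Framework:
    """Second level: decides what to run on the resources it is offered."""
    name: str
    pending: list                       # CPU demand of each waiting task
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        """Launch as many pending tasks as fit; return the CPUs actually used."""
        used = 0
        for cpus in list(self.pending):  # iterate over a copy while removing
            if used + cpus <= offer.cpus:
                self.pending.remove(cpus)
                self.launched.append((offer.node, cpus))
                used += cpus
        return used


class Allocator:
    """First level: decides which framework is offered which free resources."""

    def __init__(self, free):
        self.free = dict(free)          # node -> free CPUs

    def offer_round(self, frameworks):
        for i, node in enumerate(sorted(self.free)):
            fw = frameworks[i % len(frameworks)]  # naive round-robin (assumption)
            used = fw.on_offer(Offer(node, self.free[node]))
            self.free[node] -= used     # the unused part stays in the pool


allocator = Allocator({"node1": 4, "node2": 8})
batch = Framework("batch", pending=[1, 1, 1])
service = Framework("service", pending=[6, 4])
allocator.offer_round([batch, service])

print(batch.launched, service.launched, service.pending, allocator.free)
```

In this run the service framework is only ever shown node2, so its second task stays pending even though it never saw node1. That is the restricted-visibility property of the offer model in miniature.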

Page 15: Introduction to Cluster Schedulers

Mesos architecture

Page 16: Introduction to Cluster Schedulers

Mesos: Example of resource offer

Page 17: Introduction to Cluster Schedulers

Two-level scheduler (Mesos)

An obvious fix to the issues of static partitioning is to adjust the allocation of resources to each scheduler dynamically, using a central coordinator to decide how many resources each sub-cluster can have.

Mesos works best when
1) tasks are short-lived
2) tasks relinquish resources frequently
3) job sizes are small compared to the size of the cluster

Page 18: Introduction to Cluster Schedulers

Why did Google not adopt them?

Neither the monolithic scheduler nor the two-level scheduler meets Google's needs:

0) What are Google's needs?

Page 19: Introduction to Cluster Schedulers

Cluster workloads

Simple two-way split:

● batch jobs: perform a computation and then finish. For simplicity we put all low priority jobs and those marked as "best effort" or "batch" into the batch category

● service jobs: long-running service jobs that provide end user operations(e.g., web services) and internal infrastructure services(e.g. storage service, naming service, locking service)

Page 20: Introduction to Cluster Schedulers

Cluster traces from Google

● most (>80%) jobs are batch jobs
● the majority of resources (55-80%) are allocated to service jobs
● service jobs typically run for much longer (20-40% of them run for over a month) and have fewer tasks than batch jobs

※ The workloads at Yahoo and Facebook look similar

Page 21: Introduction to Cluster Schedulers

Google's needs

● Many batch jobs are short, and fast turnaround is important, so a lightweight, low-quality approach to placement works just fine.

● Long-running, high-priority service jobs must meet stringent availability and performance targets, so careful placement of their tasks is needed to maximize resistance to failures and provide good performance.

● "head of line blocking" problem: while it is very reasonable to spend a few seconds making a decision whose effects last for several weeks, it can be problematic if an interactive batch job has to wait for such a calculation. This problem can be avoided by introducing parallelism.

In short, Google's needs require a scheduler architecture that
● can accommodate both types of jobs
● flexibly supports job-specific policies
● and also scales to an ever-growing amount of scheduling work.
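
Head-of-line blocking is easy to see with a toy queue model. The sketch below serves scheduling decisions strictly in order and records when each decision completes. The job names and decision times are hypothetical, and `finish_times` is a helper invented for this example.

```python
def finish_times(jobs):
    """Serve jobs one after another; return when each scheduling decision completes."""
    clock, done = 0.0, {}
    for name, decision_seconds in jobs:
        clock += decision_seconds
        done[name] = clock
    return done


# Hypothetical decision times in seconds: one careful service placement,
# followed by two quick batch placements.
jobs = [("service-A", 5.0), ("batch-1", 0.01), ("batch-2", 0.01)]

# One queue for everything: the batch jobs wait behind the service job.
single_queue = finish_times(jobs)

# Two independent queues, one per job type, running in parallel.
service_queue = finish_times([j for j in jobs if j[0].startswith("service")])
batch_queue = finish_times([j for j in jobs if j[0].startswith("batch")])

print(single_queue["batch-1"], batch_queue["batch-1"])
```

In the single queue the first batch job is placed only after the five-second service decision. With a separate batch queue it is placed almost immediately, and the service job is unaffected.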

Page 22: Introduction to Cluster Schedulers

Why did Google not adopt them?

Neither the monolithic scheduler nor the two-level scheduler meets Google's needs:

1) Monolithic scheduler:
● It complicates an already difficult job: the scheduler has to minimize the time a job spends waiting before it starts running.
● It is surprisingly difficult to support a wide range of policies in a sustainable manner using a single-algorithm implementation.

This kind of software engineering consideration, rather than performance scalability, was our primary motivation to move to an architecture that supported concurrent, independent scheduling components.

So the motivation was software engineering rather than performance scalability!

Page 23: Introduction to Cluster Schedulers

Why did Google not adopt them?

Neither the monolithic scheduler nor the two-level scheduler meets Google's needs:

2) Two-level scheduler:
● No global view of the overall cluster state
● Lock issue: pessimistic concurrency control
● It assumes that resources become available frequently and that scheduler decisions are quick, so it works best with short tasks, frequently relinquished resources, and job sizes that are small compared to the size of the cluster. Google's cluster workloads do not have these properties, especially in the case of service jobs.

Page 24: Introduction to Cluster Schedulers

Shared-state scheduler (Omega)

● each scheduler has full access to the entire cluster
● uses optimistic concurrency control

This immediately eliminates two of the issues of the two-level scheduler approach:
➔ limited parallelism due to pessimistic concurrency control
➔ restricted visibility of resources in a scheduler framework

Page 25: Introduction to Cluster Schedulers

Shared-state scheduler (Omega)

● No central resource allocator in Omega (it is simplified to a persistent data store)
● All resource-allocation decisions take place in the schedulers.
● "cell state": a resilient master copy of the resource allocations maintained in the cluster. Each scheduler is given a private, local, frequently-updated copy of cell state for making scheduling decisions. The scheduler can see the entire state of the cell.
● Omega schedulers operate completely in parallel and do not have to wait for jobs in other schedulers, so there is no inter-scheduler head-of-line blocking.

The performance viability of the shared-state approach is ultimately determined by the frequency at which transactions fail and the costs of such failures.

The batch scheduler is the main scalability bottleneck; the Omega model can scale to a high workload while still providing good behavior for service jobs.
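
The optimistic commit against shared cell state can be sketched as follows. This is a minimal illustration, not Omega's implementation: the `CellState` class, the per-machine version counter used for conflict detection, and the "most free CPUs" placement rule are all assumptions made for the example.

```python
class CellState:
    """Shared master copy: free CPUs per machine, plus a version per machine."""

    def __init__(self, free):
        self.free = dict(free)
        self.version = {machine: 0 for machine in free}

    def snapshot(self):
        """Private copy handed to a scheduler for local decision making."""
        return dict(self.free), dict(self.version)

    def commit(self, machine, cpus, seen_version):
        """Apply a claim only if the machine is unchanged since the snapshot."""
        if self.version[machine] != seen_version or self.free[machine] < cpus:
            return False                # conflict: the transaction fails
        self.free[machine] -= cpus
        self.version[machine] += 1
        return True


def schedule(cell, cpus, max_retries=3):
    """Pick a machine from a fresh private copy, then try to commit the claim."""
    for _ in range(max_retries):
        free, version = cell.snapshot()
        machine = max(free, key=free.get)   # toy policy: most free CPUs
        if free[machine] < cpus:
            return None                     # nothing fits anywhere
        if cell.commit(machine, cpus, version[machine]):
            return machine
    return None


cell = CellState({"m1": 8, "m2": 6})

# Two schedulers take snapshots at the same time and both choose m1.
free_a, ver_a = cell.snapshot()
free_b, ver_b = cell.snapshot()
first = cell.commit("m1", 4, ver_a["m1"])   # scheduler A commits first
second = cell.commit("m1", 4, ver_b["m1"])  # scheduler B conflicts on m1
retry = schedule(cell, 4)                   # B resyncs and tries again

print(first, second, retry, cell.free)
```

Neither scheduler ever waits on a lock. The cost of a conflict is the wasted decision and the retry, which is why the failure frequency and the cost per failure determine whether the approach is viable.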

Page 26: Introduction to Cluster Schedulers

Comparison of cluster schedulers

Approach | Resource choice | Interference | Alloc. granularity | Cluster-wide policies
Monolithic | all available | none (serialized) | global policy | strict priority (preemption)
Statically partitioned | fixed subset | none (partitioned) | per-partition policy | scheduler-dependent
Two-level (Mesos) | dynamic subset | pessimistic | hoarding | strict fairness
Shared-state (Omega) | all available | optimistic | per-scheduler policy | free-for-all, priority preemption

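Page 27: Introduction to Cluster Schedulers

Mesos and PaaS

Background from our PaaS evaluation (p3): resource sharing across multiple clusters on a PaaS with multiple workloads and multiple tenants

(a cluster scheduler with dynamic sharing)

Workloads on a PaaS: long-running processes / one-off tasks / scheduled jobs
Service jobs make up a larger share, so scheduling service jobs matters even more.

Mesos frameworks for long-running services: Aurora, Marathon, Singularity, and others
Open question: given the problems with Mesos that the Omega paper (2013) pointed out (especially for service jobs), what is the current state of Mesos, and how do the individual frameworks deal with them?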

Page 28: Introduction to Cluster Schedulers

Mesos and PaaS

About Kubernetes

Run Kubernetes on Mesos: https://github.com/mesosphere/kubernetes-mesos

Run Kubernetes on Hadoop YARN: http://hortonworks.com/blog/docker-kubernetes-apache-hadoop-yarn/

Page 29: Introduction to Cluster Schedulers

References

Mesos paper: http://mesos.berkeley.edu/mesos_tech_report.pdf

Mesos presentations: http://mesos.apache.org/documentation/latest/mesos-presentations/

Omega paper: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

Page 30: Introduction to Cluster Schedulers

Thank you for your attention.