OSDC 2015: Bernd Mathiske | Why the Datacenter Needs an Operating System
TRANSCRIPT
Dr. Bernd Mathiske
Senior Software Architect
Mesosphere
Why the Datacenter needs an Operating System
Bringing Google-Scale
Computing to Everybody
A Slice of Google Tech Transfer History
2005: MapReduce -> Hadoop (Yahoo)
2007: Linux cgroups for lightweight isolation (Google)
2009: BigTable -> MongoDB
2009: “The Datacenter as a Computer” - Barroso, Hölzle (Google)
2009: Mesos - a distributed operating system kernel (UC Berkeley)
2010: Large scale production Mesos deployment (Twitter)
since 2010: Many more frameworks and quite a few meta-frameworks
Notable Operating System Developments
Single-something => multi-something: user, tasking, threading, core, …
More: bits, memory, storage, bandwidth…
OS virtualization => lightweight virtualization (cgroups, LXCs, jails, …)
Packaging => containers (docker, rkt, lmctfy, …)
Static libraries => dynamic libraries => static libraries
Cluster Operating Systems (Hardware Clustering)
Researched since the 1980s
Trying to provide (the illusion of) a single system image
Aiming at HA, load balancing, location transparency (e.g. for storage)
Many systems: Amoeba, ChorusOS, GLUnix, Hurricane, MOSIX, Plan9, RHCS, Spring, Sprite, Sumo, QNX, Solaris MC, UnixWare, VAXclusters, …
Relatively low scale (up to 100s of nodes)
Complicated to manage, less dynamic than software clustering
From HPC Grid to Enterprise Cloud
Condor, LSF, Maui, Moab, Quartz, SLURM, …
Typically for batch jobs
Also cover services => SOA => more job schedulers
=> grid computing => grid middleware … => cloud stacks
From Server Virtualization to App Aggregation
[Figure: In the client-server era (small apps, big servers), server virtualization split one server into many; in the cloud era (big apps, small servers), app aggregation combines many servers to host one app.]
Cloud Computing
SaaS: Salesforce demonstrated success, then many followed
PaaS: Deis, Dotcloud, OpenShift, Heroku, Pivotal, Stackato, …
IaaS: AWS, Azure, DigitalOcean, GCE…
Private cloud stacks including IaaS: Eucalyptus, CloudStack, Joyent, OpenStack, SmartCloud, vSphere, …
Datacenter
✴ A facility used to house computer systems and associated components (e.g. networking, storage, cooling, sensors)
✴ In this talk we focus on how to manage and use a single production cluster of networked computers in a datacenter
✴ Such clusters range in size from 10s to 10000s of nodes
✴ Why should we and how can we end up with just one production cluster?
Datacenter Services
✴ LAMP (Linux, Apache, MySQL, PHP) or similar
✴ MEAN (MongoDB, Express.js, Angular.js, Node.js) or similar
✴ Cassandra, ElasticSearch, Exelixi, Hadoop, Hypertable, Jenkins, Kafka, MPI, Spark, Storm, SSSP, Torque, …
✴ Private PaaS: Deis, …
✴ …
Operate your Laptop like your Datacenter?
From Static Partitioning to Elastic Sharing
[Figure: Under static partitioning, web, cache, and Hadoop each own a fixed slice of the cluster, and unused capacity inside every slice is wasted; under elastic sharing, all frameworks draw from one pool, so idle resources stay free for whoever needs them and utilization can approach 100%.]
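The contrast between static partitioning and elastic sharing can be put into numbers with a toy model. All figures here are hypothetical, chosen only to illustrate why fixed slices waste capacity:

```python
# Toy model: three frameworks on a 100-unit cluster.
# Static partitioning caps each framework at its own fixed slice;
# elastic sharing lets idle capacity flow to whoever has demand.
demand = {"web": 20, "cache": 10, "hadoop": 55}  # hypothetical current demand

def static_utilization(demand, slice_size):
    # Each framework can only use its own fixed slice, even if
    # another framework's slice sits idle.
    return sum(min(d, slice_size) for d in demand.values())

def elastic_utilization(demand, capacity=100):
    # Frameworks draw from one shared pool until it is exhausted.
    return min(sum(demand.values()), capacity)

print(static_utilization(demand, slice_size=33))  # Hadoop is capped at 33
print(elastic_utilization(demand))
```

With these numbers, static partitioning serves only 63 of the 85 demanded units (Hadoop is throttled at its slice while web and cache leave theirs idle), whereas elastic sharing serves all 85.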
Software Clustering
Layer between node OS and application frameworks
Scale
Multi-tenancy
High availability
Available Open Source Components
✴ 2-level scheduler: Apache Mesos
✴ Meta-frameworks / schedulers: Aurora, Chronos, Marathon, Kubernetes, Swarm, …
✴ Service discovery: Consul, HAProxy, Mesos DNS, …
✴ Highly available configuration: zk, etcd, …
✴ Storage: HDFS, Ceph, …
✴ Node OSs: lots of Linux variants
✴ Lots of app frameworks: Spark, Storm, Cassandra, Kafka, …
2-Level Scheduling
Scale: from 1 node to at least 10000s of nodes
Optimizing resource management
End-to-end principle: “application-specific functions ought to reside in the end nodes of a network rather than intermediary nodes”
-> Requirement for general multi-tenancy
-> Requirement for having only one production cluster
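Following the end-to-end principle, 2-level scheduling keeps placement decisions in the framework, not the master. A minimal sketch of the offer cycle, with hypothetical class and method names (the real Mesos API differs):

```python
# Level 1: the master only advertises free resources.
# Level 2: each framework's own scheduler decides what to take,
# because application-specific logic belongs in the "end nodes".

class ToyMaster:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def offer_to(self, scheduler):
        # Offer everything that is currently free; subtract what was taken.
        accepted = scheduler.resource_offer(self.free_cpus)
        self.free_cpus -= accepted
        return accepted

class ToyFramework:
    def __init__(self, cpus_wanted):
        self.cpus_wanted = cpus_wanted

    def resource_offer(self, offered_cpus):
        # Framework-side placement policy: take only what is still needed.
        take = min(self.cpus_wanted, offered_cpus)
        self.cpus_wanted -= take
        return take

master = ToyMaster(free_cpus=8)
fw = ToyFramework(cpus_wanted=5)
print(master.offer_to(fw))   # framework accepts 5 CPUs
print(master.free_cpus)      # 3 CPUs remain for other tenants
```

Because the master never needs to understand any framework's workload, arbitrarily many tenants can share one production cluster.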
How Mesos Works
[Figure: A framework consists of a scheduler and executors. The scheduler registers with the leading Mesos master (several masters coordinate through zk/etcd for high availability); the master forwards resource offers from the slaves, and accepted offers launch executors on slaves, each executor running one or more tasks of the app.]
Ways to Run an Application
1. Vanilla job
• Employ meta-framework for invocation: Chronos, Aurora, Kubernetes, …
2. Application of an adapted framework
• Hadoop, Spark, Storm, ElasticSearch, Cassandra, Kafka, many more…
3. Non-adapted services
• Employ meta-framework for invocation: Marathon, Aurora, Kubernetes, …
• Provide (select) a service discovery solution
4. Program your own scheduler (and executor)
The Mesos Framework API
✴ Currently like internal Mesos communication:
• protobuf messages over HTTP
✴ Soon:
• JSON messages over HTTP (stream)
=> no need to link against the binary Mesos library, and less to reimplement
From ca. a dozen supported programming languages => any language
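A sketch of what such a JSON-over-HTTP message could look like, modeled on what later became the Mesos v1 scheduler endpoint; the exact field names here are assumptions based on that later API, not on the version described in the talk:

```python
import json

# Registering (subscribing) a framework as a plain JSON message,
# instead of protobuf messages via a linked Mesos library.
subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {
        "framework_info": {
            "user": "demo",          # hypothetical values
            "name": "my-framework",
        }
    },
}

# This body would be POSTed to the master's scheduler endpoint;
# any language with an HTTP client and JSON support can do this.
body = json.dumps(subscribe)
print(body)
```

Since every mainstream language ships HTTP and JSON support, this removes the need for per-language Mesos bindings.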
How to Implement a Framework
✴ Scheduler interface: 1 half of 2-level scheduling
• The framework knows best when to do what with what kind of resources
• About a dozen callbacks, main functionality in 2 of them:
- receive resource offers
- receive task status updates
✴ Executor interface: task life-cycle management and monitoring
• Command line executor included in Mesos
• Docker executor included in Mesos
• Custom executors often not needed
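Because a Docker executor is built in, launching a containerized task usually needs no custom executor code at all. The dict below mirrors the shape of Mesos's TaskInfo/ContainerInfo protobufs; the field names follow those protos, but this plain-dict rendering is only an illustration, not the real API:

```python
# Sketch of a task description that uses the built-in Docker support.
# Values (image, ids, sizes) are hypothetical.
def docker_task(task_id, agent_id, image, cmd, cpus=0.5, mem=128):
    return {
        "task_id": {"value": task_id},
        "slave_id": {"value": agent_id},
        "name": task_id,
        "command": {"value": cmd},
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},   # image pulled from a registry
        },
        "resources": [
            {"name": "cpus", "type": "SCALAR", "scalar": {"value": cpus}},
            {"name": "mem", "type": "SCALAR", "scalar": {"value": mem}},
        ],
    }

task = docker_task("hello-1", "agent-42", "alpine:3", "echo hello")
print(task["container"]["docker"]["image"])
```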
Scheduler SPI (implemented by Framework)
public interface Scheduler {
  void registered(SchedulerDriver driver, FrameworkID frameworkId, MasterInfo masterInfo);
  void reregistered(SchedulerDriver driver, MasterInfo masterInfo);
  void resourceOffers(SchedulerDriver driver, List<Offer> offers);
  void offerRescinded(SchedulerDriver driver, OfferID offerId);
  void statusUpdate(SchedulerDriver driver, TaskStatus status);
  void frameworkMessage(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, byte[] data);
  void disconnected(SchedulerDriver driver);
  void slaveLost(SchedulerDriver driver, SlaveID slaveId);
  void executorLost(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, int status);
  void error(SchedulerDriver driver, String message);
}
Minimal Scheduler Implementation

class MyFrameworkScheduler implements Scheduler {
  ...
  private TaskGenerator _taskGen;

  public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
    if (_taskGen.doneCreatingTasks()) {
      for (Offer offer : offers) {
        driver.declineOffer(offer.getId());
      }
    } else {
      for (Offer offer : offers) {
        List<TaskInfo> taskInfos = _taskGen.generateTaskInfos(offer);
        driver.launchTasks(offer.getId(), taskInfos, _filters);
      }
    }
  }

  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    _taskGen.observeTaskStatusUpdate(status);
    if (_taskGen.done()) {
      driver.stop();
    }
  }
  ...
}
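The interesting work in the example above happens inside `generateTaskInfos(offer)`. A Python sketch of its typical logic, with hypothetical names and a simplified offer shape: pack as many fixed-size tasks into one offer as its resources allow, and let the rest of the offer be declined or filtered:

```python
# Hypothetical per-task resource requirements.
TASK_CPUS, TASK_MEM = 1.0, 256.0

def generate_task_infos(offer, tasks_remaining):
    """Fit as many fixed-size tasks as possible into one resource offer."""
    cpus, mem = offer["cpus"], offer["mem"]
    tasks = []
    while (len(tasks) < tasks_remaining
           and cpus >= TASK_CPUS and mem >= TASK_MEM):
        tasks.append({"cpus": TASK_CPUS, "mem": TASK_MEM})
        cpus -= TASK_CPUS
        mem -= TASK_MEM
    return tasks

offer = {"cpus": 3.5, "mem": 1024.0}
print(len(generate_task_infos(offer, tasks_remaining=10)))  # 3: CPU-bound
```

With 3.5 CPUs on offer, only three 1-CPU tasks fit even though memory would allow a fourth; real frameworks apply exactly this kind of application-specific fitting logic when they accept offers.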
The Developer’s Perspective
✴ Focus on application logic, not datacenter structure
✴ Avoid networking-related code
✴ Reuse of built-in fault-tolerance and high availability
✴ Reuse distributed (infrastructure) frameworks (e.g., storage)
=> API, SDK for datacenter services
The Operations Engineer’s Perspective
✴ Ease of deployment/management
✴ Uniformity of deployment/management
✴ Hardware utilization rate
✴ Scaling up as business grows
✴ Scaling out sporadically
✴ Cost and time for moving to a different datacenter
✴ High availability and fault-tolerance of system services
✴ Monitoring
✴ Troubleshooting
Necessary Multi-Tenancy Features
Task containerization
Resource isolation
Resource and task attributes
Static and dynamic resource reservations
Reservation levels
Meta-frameworks
Dynamic scheduler update and reconfiguration
Security
Desirable Multi-Tenancy Features
Optimistic offers
Oversubscription
Task preemption, migration, resizing, reconfiguration
Rate limiting
Auto-scaling => hybrid cloud
Infrastructure frameworks
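A central multi-tenancy mechanism underlying these features is fair resource allocation. Mesos's default allocator is based on Dominant Resource Fairness (DRF); a minimal sketch of the idea, with hypothetical cluster totals and allocations:

```python
# Dominant Resource Fairness (DRF) in miniature: offer resources next
# to the framework whose *dominant share* (its largest share across
# all resource types) is currently smallest.
TOTAL = {"cpus": 9.0, "mem": 18.0}  # hypothetical cluster capacity

def dominant_share(allocated):
    return max(allocated[r] / TOTAL[r] for r in TOTAL)

def next_framework(allocations):
    # allocations: framework name -> {"cpus": ..., "mem": ...}
    return min(allocations, key=lambda f: dominant_share(allocations[f]))

allocations = {
    "A": {"cpus": 2.0, "mem": 2.0},  # dominant share 2/9 (CPU-dominant)
    "B": {"cpus": 1.0, "mem": 8.0},  # dominant share 8/18 (memory-dominant)
}
print(next_framework(allocations))  # A is furthest below its fair share
```

Comparing dominant shares rather than any single resource keeps a memory-hungry and a CPU-hungry tenant from starving each other on one cluster.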
Using Docker Containers in Mesos
Mesos Master Server
  init
  |
  +- mesos-master
  |
  +- marathon

Mesos Slave Server
  init
  |
  +- docker
  |  |
  |  +- lxc
  |     |
  |     +- (user task, under container init system)
  |
  +- mesos-slave
     |
     +- /var/lib/mesos/executors/docker
        |
        +- docker run …

When a user requests a container, Mesos, LXC, and Docker are tied together for launch; images are pulled from a Docker registry.
Other Schedulers as Meta-Frameworks in a 2-level Scheduler
YARN => https://github.com/mesos/myriad
Kubernetes => https://github.com/mesosphere/kubernetes-mesos
Swarm => Swarm on Mesos (new project)
=> run everything in one cluster
Myriad : Virtual YARN Clusters on Mesos
◦ POST /api/clusters: Registers a new YARN cluster
◦ GET /api/clusters: Lists all registered clusters
◦ GET /api/clusters/{clusterId}: Lists the cluster with {clusterId}
◦ PUT /api/clusters/{clusterId}/flexup: Expands the size of cluster with {clusterId}
◦ PUT /api/clusters/{clusterId}/flexdown: Shrinks the size of cluster with {clusterId}
◦ DELETE /api/clusters/{clusterId}: Unregisters YARN cluster with {clusterId}. Also kills all the nodes.
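A small client sketch for the flexup endpoint listed above, using only the standard library. The host, port, and request payload (`instances`) are assumptions for illustration; consult the Myriad documentation for the actual request body:

```python
from urllib import request
import json

def myriad_request(method, path, payload=None,
                   base="http://myriad.example:8192"):  # placeholder host
    """Build an HTTP request for Myriad's cluster-management REST API."""
    data = json.dumps(payload).encode() if payload is not None else None
    return request.Request(base + path, data=data, method=method,
                           headers={"Content-Type": "application/json"})

# Hypothetical call: grow virtual YARN cluster "c1" by two NodeManagers.
req = myriad_request("PUT", "/api/clusters/c1/flexup", {"instances": 2})
print(req.get_method(), req.full_url)
# A caller would then execute it with request.urlopen(req).
```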
[Figure: Myriad runs the YARN ResourceManager together with a Myriad scheduler as a framework on Mesos. On a flexup request, the scheduler accepts a Mesos resource offer (e.g. 2.5 CPU, 2.5 GB) and the Myriad executor launches a YARN NodeManager (e.g. sized 2.0 CPU, 2.0 GB) as a Mesos task on that node; YARN containers (C1, C2) then run inside the NodeManager.]
Kubernetes in Mesos
Portability
[Figure: Framework apps, meta-frameworks, vanilla apps, and infrastructure frameworks all run on Mesos, and Mesos runs unchanged on a public cloud, a managed cloud, or your own datacenter.]
The Application User’s Perspective
✴ Focus on apps, services, parameters, results
✴ Avoid dealing with datacenter operations/management
✴ Avoid adjusting system settings
✴ High availability
✴ Throughput
✴ Responsiveness
✴ Predictability
✴ Run everything I need
✴ Return on and safety of investment
The Datacenter is the new form factor
✴ 2-level scheduler => single production cluster
✴ scalability and portability => avoiding hardware/cloud lock-in
✴ built-in container support => running containers at scale
✴ automation => operator efficiency
✴ repositories => apps/services readily available
✴ API and SDK => productive/quick app/service development
Above the Clouds with Open Source!