How to Run from a Zombie: CloudStack Distributed Process Management
DESCRIPTION
Exploration of CloudStack's distributed process management requirements and the challenges they present in the context of the CAP theorem. These challenges will be addressed through a distributed process model that emphasizes efficiency, fault tolerance, and operational transparency.
TRANSCRIPT
HOW TO RUN FROM A ZOMBIE: CLOUDSTACK DISTRIBUTED PROCESS MANAGEMENT
John Burwell
([email protected] | [email protected] | @john_burwell)
Tuesday, June 25, 13
I Am Not A Zombie
• Apache CloudStack PMC Member
• Consulting Engineer @ Basho Technologies
• Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform
Current Process Management
• No consistent system-wide model
• Fail slowly, fail quietly
• Resource overcommitment issues
• Lack of instrumentation
What is a cloud?
Hopefully not ...
[Diagram labels: Hosts, Virtual Routers, Virtual Machines, Primary Storage, Networks, Secondary Storage, Load Balancers, Zone, Cluster, Pod]
[Diagram labels: Resource, Process, State, Partition, Orchestration]
Resource: a “thing” with a bounded capacity
At its core, CloudStack ...
Integrates infrastructure components
Manages resources
Consistency
Availability
Partition Tolerance
PICK 2
CloudStack provides zones, clusters, and pods to partition resources.
Orchestration operations are eventually consistent
... but resource operations must be consistent & serialized.
A partition-tolerant system cannot be simultaneously consistent and available.
AP: Orchestration Processes
CP: Resource Management Processes
CP Resource?
• Ordered/Serialized operations
• Prevent overcommitment
• Execution location independent
• Lock free
Orchestration Coordination
1. Build a list of commands to be executed against a resource
2. Enqueue the list of commands to the resource management layer for execution
3. A process applies the commands to the resource
4. Aggregate the results from the reply
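The four steps above can be sketched roughly as follows. This is a minimal Python illustration, not CloudStack code: `Command`, `ResourceManager`, `enqueue`, `run_pending`, and `orchestrate` are hypothetical names, and the inline call to `run_pending` stands in for the separate process that would normally consume the resource's queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Command:
    """One operation to apply to a resource (hypothetical type)."""
    name: str
    apply: Callable[[dict], str]  # mutates the resource state, returns a result


class ResourceManager:
    """Hypothetical resource management layer: one FIFO queue per resource."""

    def __init__(self) -> None:
        self.state: Dict[str, dict] = {}
        self.queues: Dict[str, Deque[List[Command]]] = {}
        self.replies: List[Tuple[str, List[str]]] = []

    def enqueue(self, resource_id: str, commands: List[Command]) -> None:
        # Step 2: hand the command list to the resource's queue.
        self.state.setdefault(resource_id, {})
        self.queues.setdefault(resource_id, deque()).append(commands)

    def run_pending(self, resource_id: str) -> None:
        # Step 3: a single consumer applies queued commands in order.
        pending = self.queues.get(resource_id, deque())
        while pending:
            commands = pending.popleft()
            results = [c.apply(self.state[resource_id]) for c in commands]
            self.replies.append((resource_id, results))


def allocate_vm(state: dict) -> str:
    state["vms"] = state.get("vms", 0) + 1
    return "allocated"


def start_vm(state: dict) -> str:
    state["running"] = state.get("running", 0) + 1
    return "started"


def orchestrate(manager: ResourceManager, resource_id: str) -> Dict[str, str]:
    # Step 1: build the list of commands for exactly one resource.
    commands = [Command("allocate_vm", allocate_vm), Command("start_vm", start_vm)]
    manager.enqueue(resource_id, commands)
    manager.run_pending(resource_id)  # run inline for the sake of the sketch
    # Step 4: aggregate the results from the reply.
    _, results = manager.replies.pop()
    return dict(zip((c.name for c in commands), results))
```

Calling `orchestrate(ResourceManager(), "host-1")` returns one result per command, keyed by command name.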
[Diagram labels: Resource, Process, State, Queue, Unit of Work, Exclusive Consumer; relationships marked 1:1]
Unit Of Work (UoW)
• Definition: An ordered list of commands executed against one and only one resource.
• Created in the Orchestration layer
• Executed by processes in the resource management layer
• Failure of a command halts UoW execution
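A unit of work as defined above might look like the following sketch. The `UnitOfWork` class and its fields are assumptions made for illustration; the only behavior taken from the slide is that commands run in order against a single resource and that the first failure halts execution.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class UnitOfWork:
    """Ordered commands bound to exactly one resource (hypothetical type)."""
    resource_id: str
    commands: List[Callable[[dict], None]]
    completed: int = 0                  # number of commands that succeeded
    error: Optional[Exception] = None   # the failure that halted execution

    def execute(self, resource_state: dict) -> bool:
        """Apply commands in order; stop at the first failure."""
        for command in self.commands:
            try:
                command(resource_state)
            except Exception as exc:
                self.error = exc
                return False  # remaining commands are never run
            self.completed += 1
        return True
```

Recording `completed` and `error` gives the orchestration layer enough information to decide how to recover.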
Instrumentation
• Collect and report statistics on a per resource basis
• Inspect and remove pending UoWs for a resource
• Kill a running process
• View a history of UoWs completed by a resource
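Most of these capabilities follow naturally from giving each resource its own queue. The sketch below is one hypothetical shape for that bookkeeping; `InstrumentedQueue` and its method names are invented for illustration, and killing a running process is left out because it depends on the process implementation.

```python
from collections import deque
from typing import Deque, Dict, List


class InstrumentedQueue:
    """Hypothetical per-resource queue that tracks statistics and history."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()
        self.history: List[str] = []
        self.stats: Dict[str, int] = {"enqueued": 0, "completed": 0, "removed": 0}

    def enqueue(self, uow_id: str) -> None:
        self.pending.append(uow_id)
        self.stats["enqueued"] += 1

    def inspect(self) -> List[str]:
        """Return a snapshot of the pending UoW ids, oldest first."""
        return list(self.pending)

    def remove(self, uow_id: str) -> bool:
        """Remove a pending UoW; return False if it is not queued."""
        if uow_id in self.pending:
            self.pending.remove(uow_id)
            self.stats["removed"] += 1
            return True
        return False

    def complete_next(self) -> str:
        """Mark the oldest pending UoW as completed and record it in history."""
        uow_id = self.pending.popleft()
        self.history.append(uow_id)
        self.stats["completed"] += 1
        return uow_id
```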
When Gravity Fails
• Process execution fails
• Resources become unavailable
• Slow consumers
Fail Fast; Fail Loudly
• If the resource can be returned to a consistent state, reply with the process failure
• If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW
• The orchestration layer will determine the appropriate recovery strategy (e.g., retry the request on another resource)
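The two failure paths above can be sketched as follows. This is an illustration under assumptions: `Resource`, `UnrecoverableError`, and the reply tuples are hypothetical, and a real implementation would need its own way to decide whether a resource was restored to a consistent state.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Reply = Tuple[str, str]  # (unit-of-work id, "ok" or a failure message)


class UnrecoverableError(Exception):
    """Hypothetical marker: the resource could not be made consistent again."""


class Resource:
    def __init__(self, name: str) -> None:
        self.name = name
        self.status = "available"
        self.pending: Deque[Tuple[str, Callable[[], None]]] = deque()

    def submit(self, uow_id: str, work: Callable[[], None]) -> None:
        if self.status == "failed":
            # Fail fast: refuse new work instead of queueing it silently.
            raise RuntimeError(f"{self.name} is in a failure state")
        self.pending.append((uow_id, work))

    def run(self) -> List[Reply]:
        replies: List[Reply] = []
        while self.pending:
            uow_id, work = self.pending.popleft()
            try:
                work()
                replies.append((uow_id, "ok"))
            except UnrecoverableError as exc:
                # Inconsistent state: mark the resource failed, drain the
                # queue, and reply with a failure for every pending UoW.
                self.status = "failed"
                replies.append((uow_id, f"failed: {exc}"))
                while self.pending:
                    pending_id, _ = self.pending.popleft()
                    replies.append((pending_id, f"failed: {self.name} unavailable"))
            except Exception as exc:
                # Consistent state: report the failure and keep serving.
                replies.append((uow_id, f"failed: {exc}"))
        return replies
```

Every submitted unit of work receives exactly one reply, so the orchestration layer never has to wait on work that will not run.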
Preventing A Logjam
• Bounded Queues
• Request and Message Timeouts
• A failure to enqueue a request or a request timeout triggers the resource’s circuit breaker
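A bounded queue combined with a circuit breaker might be sketched like this, using Python's standard `queue.Queue`. The `GuardedQueue` class, the `failure_threshold` parameter, and the `reset` method are assumptions for illustration; the slide does not specify when the breaker opens or closes.

```python
import queue


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a request without trying the queue."""


class GuardedQueue:
    """Bounded queue whose breaker opens after enqueue failures (hypothetical)."""

    def __init__(self, capacity: int, failure_threshold: int = 1) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self._threshold = failure_threshold
        self._failures = 0
        self.open = False

    def enqueue(self, request: str, timeout: float = 0.01) -> None:
        if self.open:
            raise CircuitOpenError("resource circuit breaker is open")
        try:
            # Blocks for at most `timeout` seconds, then raises queue.Full.
            self._queue.put(request, timeout=timeout)
            self._failures = 0
        except queue.Full:
            self._failures += 1
            if self._failures >= self._threshold:
                self.open = True  # trip the breaker for later callers
            raise

    def reset(self) -> None:
        """Close the breaker again, e.g. after the resource recovers."""
        self._failures = 0
        self.open = False
```

Once the breaker is open, callers are rejected immediately rather than waiting on a queue that is already backed up.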
How could we implement this model?
Lightweight Threads
A thread that is not scheduled by the operating system, avoiding context-switch overhead.
Actor Model
• An actor represents state and behavior
• Communicate by message passing
• Each actor is allocated a lightweight thread and mailbox
• Location independent
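To make the actor idea concrete, here is a small sketch in Python, with `asyncio` tasks standing in for lightweight threads and an `asyncio.Queue` as the mailbox. It is not Akka or Quasar code; `ResourceActor`, `ask`, and the message names are hypothetical. Because a single task drains the mailbox, reservations are serialized without locks.

```python
import asyncio
from typing import List, Tuple


class ResourceActor:
    """Hypothetical actor: private state plus a mailbox drained by one task."""

    def __init__(self, capacity: int) -> None:
        self.free = capacity
        self.mailbox: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        # Messages are handled one at a time, in arrival order.
        while True:
            message, amount, reply = await self.mailbox.get()
            if message == "stop":
                reply.set_result("stopped")
                return
            if message == "reserve" and amount <= self.free:
                self.free -= amount
                reply.set_result("reserved")
            else:
                # Unknown messages and over-capacity requests are rejected.
                reply.set_result("rejected")

    async def ask(self, message: str, amount: int = 0) -> str:
        """Send a message to the mailbox and wait for the actor's reply."""
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((message, amount, reply))
        return await reply


async def demo() -> Tuple[List[str], int]:
    actor = ResourceActor(capacity=4)
    worker = asyncio.create_task(actor.run())
    # Two concurrent requests for 3 units compete for a capacity of 4.
    results = await asyncio.gather(actor.ask("reserve", 3), actor.ask("reserve", 3))
    await actor.ask("stop")
    await worker
    return list(results), actor.free
```

Running `asyncio.run(demo())` grants one of the two reservations and rejects the other, so the resource is never overcommitted.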
[Diagram labels: Orchestration, Unit of Work, Mailbox, Resource Actor, FSM]
Java Actor Frameworks
• Akka (http://akka.io)
• Quasar (https://github.com/puniverse/quasar)
Summary
• Orchestration and Resource Management must be properly divided to satisfy CAP
• To provide resource serialization guarantees, assign a queue and a process to each resource
• Fail fast, fail loudly
• An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource
Thoughts? Questions?
Thank you!
Slides available @ http://speakerdeck.com/jburwell