DUDE, THIS ISN’T WHERE I PARKED MY INSTANCE?

Moving instances around your OpenStack cloud for fun and profit.

Stephen Gordon (@xsgordon), Sr. Technical Product Manager, Red Hat

October 29th, 2015



AGENDA

● What are we moving? *
● Why are we moving instances?
● How are we moving instances?
● What new enhancements do we get in:
  ○ Liberty?
  ○ Mitaka?

* #spoileralert: instances

WHAT ARE WE MOVING?

What is an instance (“server”)?

GUEST CONFIGURATION
● Guest configuration including vCPUs, memory, devices, etc.

GUEST STORAGE
● Initial image or volume.

GUEST STATE
● In-memory state.
● On-disk state.

All paths for moving instances involve moving some subset of these elements.

WHY ARE WE MOVING INSTANCES?

Moving instances is an operational tool for use...

WHEN PERFORMING NODE MAINTENANCE
● Adding hardware
● Updating software
● Response to imminent failure

IN REACTION TO NODE FAILURE
● Host lost power
● Host lost connectivity
● Host otherwise went down (e.g. DC fire)

FOR CAPACITY MANAGEMENT
● Consolidate or spread instances to save power or avoid resource contention issues respectively.

HOW ARE WE MOVING INSTANCES?

MECHANISMS FOR MOVING INSTANCES
Let me google that for you!

$ nova help | grep -E '(migrat|evacuat)'
evacuate              Evacuate server from failed host.
live-migration        Migrate running server to a new machine.
migrate               Migrate a server. The new host will be..
migration-list        Print a list of migrations.
host-servers-migrate  Migrate all instances of the specified host to...
host-evacuate         Evacuate all instances from failed host.
host-evacuate-live    Live migrate all instances of the specified host to...


MECHANISMS FOR MOVING INSTANCES

EVACUATE
Rebuild an instance from a compute node that is down onto a different compute node.

MIGRATE
Rebuild* an instance from a compute node that is up onto a different compute node**.

LIVE-MIGRATION
Move an instance to a different compute node without downtime.

* By rebuild we really mean resize.
** This behavior will change if you turn on resizing to the same host (off by default).

HELPERS FOR MOVING INSTANCES

HOST-EVACUATE
Rebuild all instances from a compute node that is down onto another compute node.

HOST-SERVERS-MIGRATE
Rebuild* all instances from a compute node that is up onto another compute node**.

HOST-EVACUATE-LIVE
Move all instances on a compute node to another compute node without downtime.

* By rebuild we really mean resize.
** This behavior will change if you turn on resizing to the same host (off by default).
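The choice between the three mechanisms and their host-wide helpers comes down to two questions: is the source compute node up, and can the instance tolerate downtime? A minimal sketch of that decision follows. The function names and the `whole_host` flag are hypothetical, used only for illustration; the returned strings are the nova subcommands listed above.

```python
# Hypothetical helper, not part of nova: picks the subcommand named on the
# slides from two facts about the situation.
HOST_WIDE_HELPER = {
    "evacuate": "host-evacuate",
    "migrate": "host-servers-migrate",
    "live-migration": "host-evacuate-live",
}


def choose_mechanism(source_host_is_up, downtime_acceptable):
    """Return the per-instance nova subcommand for this situation."""
    if not source_host_is_up:
        # A down node can only be evacuated (rebuild elsewhere).
        return "evacuate"
    if downtime_acceptable:
        # Cold migration: the instance is shut down, moved, then restarted.
        return "migrate"
    return "live-migration"


def choose_command(source_host_is_up, downtime_acceptable, whole_host=False):
    """Return the helper variant when every instance on the host must move."""
    mechanism = choose_mechanism(source_host_is_up, downtime_acceptable)
    return HOST_WIDE_HELPER[mechanism] if whole_host else mechanism


if __name__ == "__main__":
    print(choose_command(source_host_is_up=True, downtime_acceptable=False,
                         whole_host=True))
```

Planned maintenance on a healthy node with running workloads maps to `host-evacuate-live`; a node that has already failed leaves only the evacuate path.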

EVACUATION

nova evacuate [--password <password>] [--on-shared-storage] <server> [<host>]

● Works when the compute node hosting the instance fails due to a hardware failure or other issue.
● Rebuilds the instance on a new compute node, selected either by the scheduler or, optionally, by the user initiating the evacuation.
  ○ The benefit over and above starting afresh is keeping the same UUID, IP, etc.
● Requires that Nova recognizes the source compute node is down.
● Shared storage is needed to preserve user data on disk, but it is not mandatory for the evacuation itself.
● Allows injecting a new admin password (if shared storage is not being used).

$ nova evacuate instance-001
+-----------+--------------+
| Property  | Value        |
+-----------+--------------+
| adminPass | pjaDV46p94Nz |
+-----------+--------------+
$

COLD MIGRATION

nova migrate [--poll] <server>

● Works when the compute node hosting the instance is up (at least to start with…).
● Rebuilds the instance on a new host selected by the scheduler.
  ○ Actually uses the resize path in the code base.
  ○ Shuts down the instance.
  ○ Copies the disk to the new compute node.
  ○ Starts the instance there and removes it from the source hypervisor.
● Instance’s current host must be operational.
● Like resize, requires a manual confirmation step.
● Unlike evacuation and live migration, doesn’t allow specification of a target host to override the scheduler.

$ nova migrate instance-001 --poll
Server migrating... 100% complete
Finished
$ nova list
+--------------+--------------+---------------+------------+-------------+ ...
| ID           | Name         | Status        | Task State | Power State | ...
+--------------+--------------+---------------+------------+-------------+ ...
| 5819a2e0-... | instance-001 | VERIFY_RESIZE | -          | Running     | ...
+--------------+--------------+---------------+------------+-------------+ ...
$ nova resize-confirm instance-001
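The manual confirmation step is easy to forget when scripting cold migrations. The sketch below shows one way to wait for `VERIFY_RESIZE` and then confirm. It is not tied to any client library: `get_status` and `confirm` are hypothetical callables you would supply (for example thin wrappers around `nova show` and `nova resize-confirm`), and the timeout values are arbitrary assumptions.

```python
import time


def confirm_when_ready(get_status, confirm, timeout_s=600, interval_s=5,
                       sleep=time.sleep):
    """Poll an instance until it reaches VERIFY_RESIZE, then confirm.

    get_status: hypothetical callable returning the current status string.
    confirm:    hypothetical callable that issues the resize confirmation.
    Returns True if the migration was confirmed, False on ERROR or timeout.
    """
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        if status == "VERIFY_RESIZE":
            confirm()
            return True
        if status == "ERROR":
            return False
        sleep(interval_s)
        waited += interval_s
    return False


if __name__ == "__main__":
    # Stand-in status source so the sketch runs without a cloud.
    fake = iter(["RESIZE", "RESIZE", "VERIFY_RESIZE"])
    done = confirm_when_ready(lambda: next(fake),
                              lambda: print("confirmed"),
                              sleep=lambda seconds: None)
    print(done)
```

Injecting the callables keeps the loop testable and leaves the choice of CLI or API client to the operator.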

LIVE MIGRATION

nova live-migration [--block-migrate] [--disk-over-commit] <server> [<host>]

● Moves a powered-on virtual machine to a new compute node without any (noticeable) downtime.
● Two approaches to live migration:
  ○ Using shared storage (including volume-based).
    ■ Requires either /var/lib/nova/instances/ to be on shared storage (e.g. NFS, GlusterFS, Ceph, etc.) across all compute nodes in the migration domain; or
    ■ Volume-backed instances.
    ■ Still requires memory state transfer/sync.
  ○ Using block migration.
    ■ Direct transfer/sync of not just memory state but also disks from the source compute node to the destination.

LIVE MIGRATION - HOW IT WORKS

1. Scheduler selects the destination host, unless the user specified one.
2. Check migration source and destination (disk, RAM, CPU model, mapped volumes).
3. Iterative pre-copy, copying memory pages from the active virtual machine on the source to a new paused instance on the destination.
4. Source instance is paused while the remaining memory pages and CPU state are copied.
5. Destination instance is started, and the source is cleaned up.
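Steps 3 and 4 are where migrations succeed or stall, so a toy model helps build intuition. The sketch below is a deliberately simplified simulation, not QEMU's actual algorithm: it assumes a constant dirty rate and constant bandwidth, and every name and parameter in it is hypothetical.

```python
def simulate_precopy(ram_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                     max_downtime_ms, max_rounds=30):
    """Toy model of iterative pre-copy.

    Returns (precopy_rounds, final_pause_ms) when the remaining data can be
    sent within max_downtime_ms, or None if it never converges.
    """
    remaining_mb = float(ram_mb)  # the first round must copy all guest RAM
    for rounds_done in range(max_rounds):
        pause_ms = remaining_mb / bandwidth_mb_per_s * 1000.0
        if pause_ms <= max_downtime_ms:
            # Step 4: pause the source and send what is left.
            return rounds_done, pause_ms
        # Step 3: copy while the guest keeps running; pages dirtied during
        # the copy have to be sent again in the next round.
        copy_seconds = remaining_mb / bandwidth_mb_per_s
        remaining_mb = min(float(ram_mb), dirty_mb_per_s * copy_seconds)
    return None


if __name__ == "__main__":
    print(simulate_precopy(4096, 100, 1000, 50))
    print(simulate_precopy(4096, 1000, 1000, 50))
```

In this model each round shrinks the remaining data by the ratio of dirty rate to bandwidth, so a guest that dirties memory as fast as the network can carry it never converges.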

LIVE MIGRATION - HOW IT DOESN’T WORK
CPU mode/model compatibility

● Maximum performance is obtained by exposing as many host CPU features to the guest as possible.
● Live migration will fail if the destination host is not able to expose the same CPU features to guests as the source host.
● Performance versus flexibility trade-off.
● Nova provides configuration keys, including libvirt_cpu_mode, for deployers to make the performance versus flexibility trade-off for their environment:
  ○ host-passthrough
  ○ host-model
  ○ custom

$ virsh cpu-models x86_64
...
SandyBridge
Westmere
Nehalem
...

$ grep 'libvirt_cpu_mode' /etc/nova/nova.conf
libvirt_cpu_mode = custom
libvirt_cpu_model = SandyBridge

Can also use qemu-kvm -cpu help
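The CPU compatibility rule amounts to a subset test: every feature the guest was started with must also be available on the destination. The sketch below illustrates only that idea. In a real deployment the comparison is done by the hypervisor stack rather than by hand, and the function name and feature names here are illustrative assumptions.

```python
def check_cpu_compatibility(guest_features, dest_host_features):
    """Return (ok, missing): ok is True when the destination can expose
    every CPU feature the guest currently sees."""
    missing = set(guest_features) - set(dest_host_features)
    return (not missing, sorted(missing))


if __name__ == "__main__":
    # Illustrative feature sets: a guest started with host-passthrough on a
    # newer host sees more features than an older destination offers.
    guest = {"sse4.2", "aes", "avx"}
    older_host = {"sse4.2", "aes"}
    print(check_cpu_compatibility(guest, older_host))
```

Pinning a custom model that every node supports shrinks the guest's feature set, which is why it trades performance for a check that passes on more destinations.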

LIVE MIGRATION - OTHER WAYS TO FAIL

● Incompatible QEMU machine types
● Inconsistent networking configuration
  ○ Source hypervisor must be able to reach the destination’s live_migration_uri and vice versa (live_migration_uri = qemu+tcp://%s/system)
● Inconsistent clocks
  ○ Synchronize clocks using ntp or chronyd
● Incompatible VNC listening addresses
● Incompatible or no SSH tunnelling configuration

LIVE MIGRATION - OTHER OPERATOR ISSUES

● Migrations take too long or fail to complete.
● Many common user operations are not supported during migration (e.g. pause).
● Need to use virsh, bypassing Nova, to:
  ○ Control a running migration (e.g. throttle or cancel)
  ○ Monitor a running migration
  ○ Tune migration max downtime
● Certain instance configurations cannot be migrated:
  ○ Instances that use a config drive (e.g. config_drive_format=iso9660) or mix local/remote storage
  ○ Instances that have passed-through devices associated with them (SR-IOV, GPU, etc.)
● Live migration doesn’t correctly account for overcommit when checking destination host validity.
● The tenant admin initiating the migration needs to know whether shared storage or block migration is available.

LIBERTY

LONG RUNNING LIVE MIGRATIONS
I’m gonna let you finish...but...

● Primary factors in determining how long it will take to migrate a guest:
  ○ Amount of guest RAM
  ○ Speed with which guest RAM is being dirtied
  ○ Speed of the migration network
● Previously, live migrations in OpenStack ran with a fixed maximum downtime as determined by QEMU.
● As of Liberty:
  ○ The allowable downtime is scaled up exponentially (to a limit) to allow a better chance of completion.
  ○ The number of concurrent outbound live migrations is limited.
  ○ The number of concurrent inbound build requests is limited.
● QEMU endeavors to estimate when the number of dirty pages is low enough to finalize the migration.

LONG RUNNING LIVE MIGRATIONS
New configuration keys to control this behavior...

● Scaling downtime to finalize migration:
  ○ live_migration_downtime - Maximum permitted guest downtime for switchover (minimum 100ms)
  ○ live_migration_downtime_steps - Number of incremental steps to reach max downtime value (minimum 3)
  ○ live_migration_downtime_delay - Time to wait, in seconds, between each step in increase of max downtime (minimum 10s)
● Timeouts:
  ○ live_migration_completion_timeout - Time to wait (in seconds) for migration to complete (default 800 seconds, 0 means no timeout) - is scaled by GB of guest RAM
  ○ live_migration_progress_timeout - Time to wait (in seconds) for migration to make forward progress (default 150 seconds).

● Concurrent operations:
  ○ max_concurrent_live_migrations - Maximum outbound live migrations to run concurrently, defaults to 1. Do not change unless absolutely sure.
  ○ max_concurrent_builds - Maximum inbound instance builds to run concurrently, defaults to 10.
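The two timeouts interact with guest size: the completion timeout is scaled by GB of guest RAM, while the progress timeout is not. A rough sketch of the resulting abort decision follows. The function is hypothetical and only mirrors the descriptions above, not Nova's code; the defaults are the ones quoted on the slide.

```python
def should_abort(elapsed_s, since_progress_s, guest_ram_gb,
                 completion_timeout_s=800, progress_timeout_s=150):
    """Return a reason string if the migration should be aborted, else None.

    elapsed_s:        seconds since the migration started
    since_progress_s: seconds since the migration last made forward progress
    """
    if since_progress_s > progress_timeout_s:
        return "no progress"
    # 0 disables the completion timeout; otherwise it scales with guest RAM.
    if completion_timeout_s and elapsed_s > completion_timeout_s * guest_ram_gb:
        return "took too long"
    return None


if __name__ == "__main__":
    # With the default of 800 seconds, a 4 GB guest gets 3200 seconds.
    print(should_abort(elapsed_s=1000, since_progress_s=10, guest_ram_gb=4))
    print(should_abort(elapsed_s=3300, since_progress_s=10, guest_ram_gb=4))
```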

LONG RUNNING LIVE MIGRATIONS EXAMPLE
400 millisecond max, 10 steps, 30 second delay, 3 GB guest

● Delay between steps is set to 30 * 3 = 90 seconds (seconds of delay * GB of RAM).
  ○ 0 seconds -> set downtime to 37ms
  ○ 90 seconds -> set downtime to 38ms
  ○ 180 seconds -> set downtime to 39ms
  ○ 270 seconds -> set downtime to 42ms
  ○ 360 seconds -> set downtime to 46ms
  ○ 450 seconds -> set downtime to 55ms
  ○ 540 seconds -> set downtime to 70ms
  ○ 630 seconds -> set downtime to 98ms
  ○ 720 seconds -> set downtime to 148ms
  ○ 810 seconds -> set downtime to 238ms
  ○ 900 seconds -> set downtime to 400ms
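The schedule above follows an exponential curve. The sketch below reproduces it from the three configuration values plus the guest size. The formula is an assumption, modelled on how the libvirt driver appears to compute the steps and chosen because it matches the slide's numbers; it is not quoted from Nova, and the function name is hypothetical.

```python
def downtime_steps(max_downtime_ms, steps, delay_s, guest_ram_gb):
    """Yield (seconds_since_start, downtime_ms) pairs.

    Assumed formula: a constant offset plus an exponentially growing term
    that reaches max_downtime_ms on the final step.
    """
    # Minimums as listed with the configuration keys.
    max_downtime_ms = max(max_downtime_ms, 100)
    steps = max(steps, 3)
    delay_s = max(delay_s, 10)

    delay = int(delay_s * guest_ram_gb)  # delay scales with guest RAM
    offset = max_downtime_ms / float(steps + 1)
    base = (max_downtime_ms - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield int(delay * i), int(offset + base ** i)


if __name__ == "__main__":
    for at_s, downtime_ms in downtime_steps(400, 10, 30, 3):
        print("%d seconds -> set downtime to %dms" % (at_s, downtime_ms))
```

The early steps barely move the limit, so a migration that can finish with a short pause does so; the limit only grows quickly once the migration has been running for several minutes.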

MARK HOST DOWN API CALL

● Liberty provides a mechanism for external tools to report into Nova when a node has failed (“mark host down”/“force down” API call).
● As soon as the host has been explicitly marked down, evacuation can commence, triggered by the external tool.
● Used to provide “instance high availability” using e.g. Pacemaker.
  ○ http://redhatstackblog.redhat.com/2015/09/24/highly-available-virtual-machines-in-rhel-openstack-platform-7/
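For an external monitoring tool, marking a host down is a single authenticated REST call. The sketch below only builds the request and sends nothing. The path, body fields and microversion header reflect my understanding of the compute API's force-down call (microversion 2.11) and should be checked against the API reference for your release; the function name and host name are hypothetical.

```python
import json


def build_force_down_request(host, forced_down=True):
    """Build (but do not send) a 'force down' request for a compute host.

    Assumption: the call is PUT /os-services/force-down, relative to the
    compute endpoint, at API microversion 2.11. A real call also needs an
    X-Auth-Token header.
    """
    return {
        "method": "PUT",
        "path": "/os-services/force-down",
        "headers": {
            "Content-Type": "application/json",
            "X-OpenStack-Nova-API-Version": "2.11",
        },
        "body": json.dumps({
            "host": host,
            "binary": "nova-compute",
            "forced_down": forced_down,
        }),
    }


if __name__ == "__main__":
    request = build_force_down_request("compute-01.example.com")
    print(request["method"], request["path"])
    print(request["body"])
```

Once the host is marked down, the external tool can start evacuating instances without waiting for Nova to notice the failure itself.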

MITAKA AND BEYOND

CURRENTLY UNDER DISCUSSION

Short Term
● CI coverage
● Improve API documentation
● Support for migrating instances with mixed storage
● Support for pausing (and perhaps cancelling) migrations
● Better resource tracking
● Use Libvirt storage pools instead of SSH for migrate/resize.
  ○ Enabler for other work including migrating suspended instances.
● Correct memory overcommit handling for live migration.

Mid to Long Term
● TLS encryption (work underway in QEMU)
● Auto-convergence - adjusting instance activity to help complete migration
● Post copy migration - start instance at destination and then copy memory over on demand


Q & A

FAQ

● Where can I find the slides?
  ○ http://www.slideshare.net/sgordon2
● Where can I submit anonymised feedback?
  ○ Session Feedback Survey in the official OpenStack Summit App
● Where can I contact you?
  ○ Twitter: @xsgordon
  ○ Email: [email protected]
  ○ IRC: sgordon on irc.freenode.net
● How can I get involved?
  ○ https://etherpad.openstack.org/p/mitaka-live-migration


THANK YOU


@xsgordon - Stephen Gordon

RECOMMENDED READING, VIEWING, AND REFERENCES

● Outstanding work items:
  ○ Etherpad: https://etherpad.openstack.org/p/mitaka-live-migration
  ○ Bug list: https://docs.google.com/spreadsheets/d/19MFatOpjePS4JtkVHXCh6Qa8XUf6T2t0Igy1PucZ3Zk/edit#gid=2127877307
● Past presentations:
  ○ Live Migration at HP Public Cloud:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/live-migration-at-hp-public-cloud
  ○ Intel Dive into VM Live Migration:
    ■ https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/dive-into-vm-live-migration