20110702 vmfs intro

Upload: linuxfb

Post on 12-May-2015

TRANSCRIPT

Page 1: 20110702 vmfs intro

VMFS Introduction

[email protected]@linuxfb.org

Page 2: 20110702 vmfs intro

Agenda

• ESX Introduction
• VMFS Design Goals
• VMFS Architecture
• SAN Impact
• Conclusion

Page 3: 20110702 vmfs intro

ESX System Setup

Page 4: 20110702 vmfs intro

Guest Memory Layers

Shadow page tables (VA-MA).

Page sharing (BA-MA).

Page 5: 20110702 vmfs intro

ESX IO Stack

An average IO request involves only offset remapping.
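
To make the remapping claim concrete, here is a minimal sketch of translating a guest disk offset into a volume offset. The `Extent` table and its field names are assumptions made for illustration; the slides do not describe the actual mapping structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """One contiguous run of a virtual disk file on the volume (hypothetical)."""
    file_offset: int    # byte offset inside the virtual disk file
    volume_offset: int  # byte offset of the same data on the volume
    length: int         # extent length in bytes


def remap(extents, guest_offset):
    """Translate a guest disk offset into a volume offset by table lookup."""
    for ext in extents:
        if ext.file_offset <= guest_offset < ext.file_offset + ext.length:
            return ext.volume_offset + (guest_offset - ext.file_offset)
    raise ValueError("offset is not mapped")


MiB = 1 << 20
# Two 1 MiB extents that are adjacent in the file but far apart on the volume.
extents = [Extent(0, 40 * MiB, MiB), Extent(MiB, 7 * MiB, MiB)]
```

With this table, `remap(extents, MiB + 4096)` lands 4096 bytes into the second extent. No metadata is modified and no lock is taken on this path, which is the point the slide makes.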

Page 6: 20110702 vmfs intro

Agenda

• ESX Introduction
• VMFS Design Goals
• VMFS Architecture
• SAN Influence and Impact
• Conclusion

Page 7: 20110702 vmfs intro

Use Case

• Small number of files (30~100 per VM)
• Files are either very small (a few KBs) or very large (many GBs)
• SAN storage is the underlying substrate.
• All storage exported by these storage systems is shared among all ESX servers

Page 8: 20110702 vmfs intro

Design Goals

• Metadata overhead should be very low
• VM IO throughput and latency should be as good as a directly attached raw device
• A clustered lock manager for moderating access to files among ESX servers
• Help VMs react deterministically to transient and non-transient SAN events and error conditions.

Page 9: 20110702 vmfs intro

Agenda

• ESX Introduction
• VMFS Design Goals
• VMFS Architecture
• SAN Influence and Impact
• Conclusion

Page 10: 20110702 vmfs intro

VMFS Architecture

• A volume is an aggregation of resources and on-disk locks.
• A resource is either an inode, a file block, a sub-block or an indirect block.
• Each lock moderates access to a subset of resources. Hosts negotiate access to a resource by acquiring the relevant locks.
• VMFS = a clustered lock manager + a resource manager + a journaling module + a data mover + a VM IO manager + a POSIX system call frontend

Page 11: 20110702 vmfs intro

VMKernel Logical Volume

VMFS volumes are by default created inside VMKernel logical volumes. A VMKernel logical volume can span multiple devices.

Page 12: 20110702 vmfs intro

VMFS On-disk Layout

Page 13: 20110702 vmfs intro

Four Resources

• file blocks
• sub-blocks
• pointer blocks
• file descriptors

Resources are grouped together into collections called CLUSTERs, and clusters are further grouped together into CLUSTER GROUPS.

Page 14: 20110702 vmfs intro

Block Mapping

• Packed inside inode
• Sub-block addressing
• File block addressing
• Pointer block addressing

Can upgrade automatically.
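
One way to read the automatic upgrade is that a file moves to the next addressing mode once it outgrows the current one. The sketch below illustrates that reading only; every size limit in it is a hypothetical value picked for the example, not a figure from the slides.

```python
from enum import Enum


class Addressing(Enum):
    PACKED_IN_INODE = 1
    SUB_BLOCK = 2
    FILE_BLOCK = 3
    POINTER_BLOCK = 4


# Hypothetical limits, chosen only to make the example concrete.
INODE_PAYLOAD = 1024          # bytes of file data that fit inside the inode
SUB_BLOCK_SIZE = 64 * 1024    # size of one sub-block
FILE_BLOCK_SIZE = 1 << 20     # size of one file block
DIRECT_SLOTS = 256            # file block addresses held directly in the inode


def addressing_for(size):
    """Pick the smallest addressing mode that can hold a file of `size` bytes."""
    if size <= INODE_PAYLOAD:
        return Addressing.PACKED_IN_INODE
    if size <= SUB_BLOCK_SIZE:
        return Addressing.SUB_BLOCK
    if size <= DIRECT_SLOTS * FILE_BLOCK_SIZE:
        return Addressing.FILE_BLOCK
    return Addressing.POINTER_BLOCK
```

Under these assumptions a small configuration file stays packed in its inode, while a multi-gigabyte virtual disk ends up with pointer block addressing. This matches the bimodal file sizes described in the use case.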

Page 15: 20110702 vmfs intro

System Files

System files are created at file system format time, and each manages one type of resource.

Page 16: 20110702 vmfs intro

System Files

• Use file blocks.
• Same read/write method as regular files.

Checking file data consistency essentially provides metadata consistency.

Page 17: 20110702 vmfs intro

Cluster Groups

• Cluster groups are repeated to create a file system.
• An existing VMFS volume grows over unused space on the disk or spans new disks by laying out new cluster groups that refer to the newly added space.
• The VMFS resource manager makes hosts operate on different and distant cluster groups within a system file. This reduces the possibility of multiple hosts contending on the same lock(s) and increases the efficiency of the clustered lock manager.
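
The slides do not say how hosts are steered toward different cluster groups. As a sketch of the idea, the hypothetical policy below spaces the hosts' starting groups evenly and lets each host scan forward from its own start.

```python
def start_group(host_slot, num_hosts, num_groups):
    """First cluster group a host tries (hypothetical even-spacing policy)."""
    return (host_slot * num_groups) // num_hosts


def search_order(host_slot, num_hosts, num_groups):
    """All cluster groups in the order this host would try them."""
    start = start_group(host_slot, num_hosts, num_groups)
    return [(start + i) % num_groups for i in range(num_groups)]


# Four hosts sharing sixteen cluster groups start four groups apart.
starts = [start_group(slot, 4, 16) for slot in range(4)]
```

Because each host begins its allocation search far from the others, two hosts need the same cluster lock only when one of them has used up the free resources in its own neighbourhood.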

Page 18: 20110702 vmfs intro

On-disk Lock

• A single-sector data structure.
• Locking is lease-based.
• Atomic disk operations (SCSI reserve, read-modify-write, SCSI release)

Page 19: 20110702 vmfs intro

On-disk Lock Data Structure

• HostID: A 128-bit unique identifier of the ESX host that owns the lock at a given point in time. All zeros means no owner.
• Mode: A set of non-zero values to indicate whether a lock is free, held exclusively, held by multiple hosts for shared read access, or held by multiple hosts for shared read and write access.
• Generation: A monotonically increasing counter, updated every time a lock is acquired, released or broken. While the hostID field sufficiently disambiguates operations on a lock from different hosts, this field disambiguates multiple operations on a lock by the same host.
• HBregion: For each valid hostID (if any) currently using the lock, a pointer to the on-disk heartbeat region of the host.
• HBgen: A generation number to validate the HBregion reference as being current or stale. It disambiguates locks held by a given host before and after a host crash and before and after a storage outage.
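
The five fields can be sketched as a record that serializes into one sector. The field widths other than the 128-bit hostID, the 512-byte sector size, the mode constants and the single-owner simplification are all assumptions for illustration; this is not the real on-disk layout.

```python
import struct
from dataclasses import dataclass

SECTOR = 512                 # assumed sector size
_FMT = "<16sIQQQ"            # hostID, mode, generation, HBregion, HBgen (assumed widths)
NO_OWNER = bytes(16)         # all-zero hostID means no owner
FREE, EXCLUSIVE, SHARED_READ, SHARED_RW = 1, 2, 3, 4  # hypothetical non-zero modes


@dataclass
class LockRecord:
    host_id: bytes = NO_OWNER
    mode: int = FREE
    generation: int = 0
    hb_region: int = 0       # offset of the owner's heartbeat sector
    hb_gen: int = 0

    def pack(self) -> bytes:
        """Serialize the record and pad it to exactly one sector."""
        body = struct.pack(_FMT, self.host_id, self.mode, self.generation,
                           self.hb_region, self.hb_gen)
        return body.ljust(SECTOR, b"\0")

    @classmethod
    def unpack(cls, sector: bytes) -> "LockRecord":
        """Rebuild a record from the first bytes of a sector image."""
        return cls(*struct.unpack_from(_FMT, sector))
```

Keeping the whole record inside a single sector matters because a sector is the unit a disk reads and writes in one operation, so a lock image is never seen half-updated.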

Page 20: 20110702 vmfs intro

On-disk Heartbeat

• A single-sector data structure
• Every host accessing a VMFS volume acquires a heartbeat on disk to declare liveness to other hosts.
• Allocated from a 1 MB reserved region of the volume, which allows 2048 hosts to access the volume concurrently.
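
The 2048-host figure follows from the region size if each heartbeat occupies one 512-byte sector. The sector size and the slot numbering below are assumptions; the slide only states the 1 MB region and the host count.

```python
HB_REGION_BYTES = 1 << 20   # the 1 MB reserved region from the slide
SECTOR_BYTES = 512          # assumed sector size

# One single-sector heartbeat per host gives the host limit.
MAX_HOSTS = HB_REGION_BYTES // SECTOR_BYTES


def heartbeat_offset(region_start: int, slot: int) -> int:
    """Byte offset of a host's heartbeat sector (hypothetical slot numbering)."""
    if not 0 <= slot < MAX_HOSTS:
        raise IndexError("no such heartbeat slot")
    return region_start + slot * SECTOR_BYTES
```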

Page 21: 20110702 vmfs intro

HB Failure Handling

• Hosts are free to break a lock if the owner's heartbeat timestamp does not change for 20 seconds. A host should replay the journal when taking a stale lock.
• If a host fails to update its heartbeat timestamp for five HB periods (about 15 seconds and 40 HB IO tries), it fences itself and aborts all in-flight IOs.
• The lock manager tries to rejoin the cluster if the IO error is not permanent, and reclaims its HB slot.
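
The two timeouts can be sketched as follows. The 20-second and five-period figures come from the slide; the function and class names are hypothetical, and the sketch leaves out journal replay and rejoining the cluster.

```python
LOCK_BREAK_TIMEOUT = 20.0   # seconds without a heartbeat change (from the slide)
MAX_MISSED_PERIODS = 5      # failed heartbeat periods before self-fencing (from the slide)


def may_break_lock(ts_seen, ts_now, waited):
    """A waiting host may break the lock once the owner's heartbeat
    timestamp has stayed unchanged for the full timeout."""
    return ts_seen == ts_now and waited >= LOCK_BREAK_TIMEOUT


class HeartbeatWriter:
    """Owner side: counts consecutive failed heartbeat periods."""

    def __init__(self):
        self.missed = 0
        self.fenced = False

    def on_period(self, write_ok):
        if self.fenced:
            return
        if write_ok:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= MAX_MISSED_PERIODS:
            self.fenced = True  # a real host would also abort in-flight IOs here
```

The owner gives up after about 15 seconds while other hosts wait 20 seconds, so a host that has lost storage access stops issuing IO before anyone else can take over its locks.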

Page 22: 20110702 vmfs intro

On-disk Lock & HB

• Each host can join a cluster by acquiring an on-disk HB.
• It can also hold thousands of on-disk locks

Page 23: 20110702 vmfs intro

Journaling

• Each host maintains its own journal on the volume.
• HB region on disk stores journal location.

Page 24: 20110702 vmfs intro

Transaction State Machine

Page 25: 20110702 vmfs intro

Optimistic Locking

• All hosts in a VMFS cluster generally operate on mutually exclusive subsets of locks on the volume.
• A host that is interested in acquiring a given lock will typically find it to be free on disk.
• Instead of acquiring all locks up front, a host first reads all the locks; if they are free, it modifies metadata in memory, then upgrades the locks and commits.
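
A minimal sketch of that flow, with the on-disk locks modelled as a plain dictionary. The function name and return convention are assumptions, and the upgrade step here is not atomic the way a real disk reservation would make it.

```python
def optimistic_update(disk_locks, needed, host, stage_changes):
    """Try a metadata update without taking locks first.

    disk_locks maps a lock name to its owner (None when free).
    Returns the staged changes on commit, or None when the caller
    should fall back to the ordinary lock-first path.
    """
    # Read the locks without acquiring them.
    if any(disk_locks[name] is not None for name in needed):
        return None
    # The locks look free: build the new metadata in memory only.
    staged = stage_changes()
    # Upgrade: re-check and take every lock, then commit.
    if any(disk_locks[name] is not None for name in needed):
        return None
    for name in needed:
        disk_locks[name] = host
    return staged
```

The bet is that contention is rare, so the cost of reading locks and occasionally falling back is lower than the cost of acquiring every lock before doing any work.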

Page 26: 20110702 vmfs intro

Transaction State Machine w/ op lock

Page 27: 20110702 vmfs intro

Transaction State Machine w/ op lock

Upgrade Lock

1: reserve disk;
2: issue asynchronous (async) reads of all required locks;
3: if any lock is acquired by remote host, abort and fall back to normal TSM;
4: issue async writes of all required locks;
5: wait for all async writes to complete;
6: release disk;
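
The six steps can be sketched against an in-memory stand-in for the shared disk. `FakeDisk` and its methods are hypothetical, and the reads and writes run synchronously here, whereas the slide issues them asynchronously.

```python
class FakeDisk:
    """In-memory stand-in for a shared disk with a reservation (hypothetical)."""

    def __init__(self, locks):
        self.locks = dict(locks)   # lock name -> owning host, None when free
        self.reserved_by = None

    def reserve(self, host):
        if self.reserved_by not in (None, host):
            raise RuntimeError("reservation conflict")
        self.reserved_by = host

    def release(self, host):
        if self.reserved_by == host:
            self.reserved_by = None


def upgrade_locks(disk, host, needed):
    """Return True if every lock was taken, False to fall back to the normal TSM."""
    disk.reserve(host)                                   # step 1
    try:
        owners = [disk.locks[name] for name in needed]   # step 2
        if any(o not in (None, host) for o in owners):   # step 3
            return False
        for name in needed:                              # steps 4 and 5
            disk.locks[name] = host
        return True
    finally:
        disk.release(host)                               # step 6
```

Holding the reservation across the read and the write is what makes the check in step 3 still valid when the writes in step 4 land.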

Page 28: 20110702 vmfs intro

Agenda

• ESX Introduction
• VMFS Design Goals
• VMFS Architecture
• SAN Influence and Impact
• Conclusion

Page 29: 20110702 vmfs intro

Adaptive SAN-aware retries

• For some SAN errors, instead of letting the guest OS retry the IO, VMkernel retries the IO after an optimal time.
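
A sketch of the retry idea follows. The error kinds, the delay table and the retry limit are invented for illustration; the slides do not list which SAN errors are retried or how the optimal delay is chosen.

```python
import time

# Hypothetical error kinds and per-error retry delays, in seconds.
RETRY_DELAY = {"path_failover": 0.5, "device_busy": 0.1}


class SanError(Exception):
    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def issue_with_retries(io, max_retries=3, sleep=time.sleep):
    """Run `io`; retry known transient SAN errors before the guest sees them."""
    for attempt in range(max_retries + 1):
        try:
            return io()
        except SanError as err:
            delay = RETRY_DELAY.get(err.kind)
            if delay is None or attempt == max_retries:
                raise          # not retryable, or out of retries: surface the error
            sleep(delay)       # wait a delay suited to this error kind, then retry
```

Retrying below the guest keeps a short SAN hiccup from turning into a guest-visible IO failure, which is how this supports the design goal of deterministic reaction to transient SAN events.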

Page 30: 20110702 vmfs intro

Adaptive SAN-aware retries

Page 31: 20110702 vmfs intro

Data Mover

• clone(srcFileHandle, srcFileOffset, dstFileHandle, dstFileOffset, length, policies)
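
As an illustration of the call shape only, the sketch below implements a `clone` with the same parameter order over ordinary seekable Python file objects. The `chunk_size` policy key is a made-up example of what `policies` might carry.

```python
def clone(src, src_offset, dst, dst_offset, length, policies=None):
    """Copy `length` bytes from src@src_offset to dst@dst_offset in chunks.

    Returns the number of bytes actually copied.
    """
    chunk = (policies or {}).get("chunk_size", 1 << 20)  # hypothetical policy
    copied = 0
    while copied < length:
        src.seek(src_offset + copied)
        data = src.read(min(chunk, length - copied))
        if not data:           # source ended early
            break
        dst.seek(dst_offset + copied)
        dst.write(data)
        copied += len(data)
    return copied
```

Expressing a copy as one call with offsets and a length, instead of a stream of guest reads and writes, gives the layer below the freedom to choose how the bytes are actually moved.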

Page 32: 20110702 vmfs intro

Data Mover

Page 33: 20110702 vmfs intro

Directive SCSI CMD

• operator(VMID, source_blocklist, destination_blocklist)
• Zero, clone, delete

Page 34: 20110702 vmfs intro

Directive SCSI CMD

• atomic_test_and_set(block_number, old_image, new_image)
• For the VMFS lock manager, a new lock algorithm: read a lock image from disk and, if the lock is free, issue an atomic_test_and_set with a new_image containing host-specific hostID, generation and heartbeat information.
• 4 IOs -> 2 IOs
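
The IO count can be checked with a small model that counts every command sent to the disk. The `CountingDisk` class and the byte-string lock images are assumptions for the example; only the `atomic_test_and_set(block_number, old_image, new_image)` signature comes from the slide.

```python
FREE = b"free"  # hypothetical image of an unowned lock


class CountingDisk:
    """In-memory disk that counts each command it receives."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.ios = 0

    def reserve(self):
        self.ios += 1

    def release(self):
        self.ios += 1

    def read(self, block_number):
        self.ios += 1
        return self.blocks[block_number]

    def write(self, block_number, image):
        self.ios += 1
        self.blocks[block_number] = image

    def atomic_test_and_set(self, block_number, old_image, new_image):
        self.ios += 1
        if self.blocks[block_number] != old_image:
            return False
        self.blocks[block_number] = new_image
        return True


def acquire_with_reservation(disk, block_number, new_image):
    """Older path: reserve, read, write, release."""
    disk.reserve()
    try:
        if disk.read(block_number) != FREE:
            return False
        disk.write(block_number, new_image)
        return True
    finally:
        disk.release()


def acquire_with_ats(disk, block_number, new_image):
    """Newer path: read, then one atomic test-and-set."""
    seen = disk.read(block_number)
    return seen == FREE and disk.atomic_test_and_set(block_number, seen, new_image)
```

On a free lock the reservation path issues four commands and the test-and-set path issues two. The second path also avoids reserving the whole disk, so other hosts are not blocked while one lock sector is updated.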

Page 35: 20110702 vmfs intro

Agenda

• ESX Introduction
• VMFS Design Goals
• VMFS Architecture
• SAN Influence and Impact
• Conclusion

Page 36: 20110702 vmfs intro

Performance