Block Reclamation


Page 1: Block reclamation

Block Reclamation

Why is 'Block Space Reclamation' needed?

To ensure a Thin storage environment stays Thin

[email protected] July, 2014

Page 2: Block reclamation

Background: Thin provisioning allows administrators to allocate logical capacity that is greater than a storage system's total physical capacity. It does so by using 'on-demand' block allocation based on host writes, rather than allocating all of the blocks during the initial volume creation. As a result of this on-demand approach to allocating physical capacity, customers can realize significant economic benefits by over-provisioning, or thin provisioning, their storage. By and large, this is because they no longer have to commit considerable storage capacity up front (as with thick provisioning) to users or business groups that often consume only a fraction of the allocated physical capacity. Thin provisioning broke the direct relationship between storage purchasing and provisioning, which in the past had led to shockingly low levels of capacity utilisation but high levels of investment.

So, how did it all begin? Traditional storage provisioning maintains a one-to-one map between internal disk drives and the capacity used by servers. In the world of block storage, a server would "see" a fixed-size drive, volume or LUN, and every bit of that capacity would exist on hard disk drives residing in the storage array. The 100 GB C drive in a Windows server, for example, would access 100 GB of reserved RAID-protected capacity on a set of disk drives in a storage array.

The simplest implementation of thin provisioning is straightforward: storage capacity is aggregated into "pools" of same-sized pages, which are then allocated to servers on demand rather than on initial creation. In our example, the 100 GB C drive might contain only 10 GB of files, and this space alone would be mapped to 10 GB of capacity in the array. As new files are written, the array would pull additional capacity from the free pool and assign it to that server. This type of "allocate-on-write" thin provisioning is fairly widespread today. Most midrange and enterprise storage arrays, and some smaller devices, include this capability either natively or as an added-cost option.

All was good, until the problem became apparent: such systems are only thin for a time. Most file systems use 'clean' (previously unwritten) space for new files to avoid fragmentation; deleted content is simply marked unused at the file-system layer rather than zeroed out or otherwise freed at the storage layer. These systems will eventually consume their entire storage allocation, even without much additional new data being written.

Page 3: Block reclamation

Root cause: basically a lack of communication between applications and storage systems. File systems aren't generally thin-aware, and no mechanism existed to report when capacity is no longer needed. The key to effective thin provisioning is discovering opportunities to reclaim unused capacity; that is, as soon as data is deleted from the host file system, the freed storage space should automatically be reclaimed. Hence, even though thin provisioning worked wonders at the beginning, as files got deleted or moved on the host file system it eventually filled up the LUN/volume on the storage array and defeated the very purpose of 'thin provisioning'.

So, what did it lack? Thin provisioning lacked a space reclamation mechanism. It didn't know how to reclaim the dead space on the storage array.

Let's examine what exactly we mean by 'dead space'. In order to fully appreciate dead space reclamation, one must examine the host front end and the storage back end. Once a host writes to a thin provisioned volume, physical capacity is allocated to the host file system. Sounds good so far! But... unfortunately, if the host deletes the file, only the host file system frees up that space.

Page 4: Block reclamation

As seen in the illustration above, the physical capacity of the storage system remains unchanged. In other words, the storage system does not free up the capacity for the deleted host file; this stranded capacity is commonly referred to as "dead space", and the act of releasing it is known as "hole punching".

NetApp has long made use of the SnapDrive plug-in: storage managed by SnapDrive on the server/host system logically appears to come from a locally attached storage subsystem. In reality, the capacity comes from the NetApp system. One advantage of this is that it allows NetApp to use interfaces to the Windows API (specifically, by becoming part of the device driver layer and using the IOCTL functions) to watch for file system changes on the host, and inform the NetApp system of these changes via new and additional SCSI commands.

To address this thin provisioning limitation, the SCSI T10 Technical Committee established the T10 SCSI Block Commands 3 (SBC-3) specification, which defines the "UNMAP" command for a diverse spectrum of storage devices including hard disk drives (HDDs) and numerous other storage media. Using SCSI UNMAP, IT administrators can now reclaim both host file system space and back-end storage dead space. However, not only does SCSI UNMAP require T10 SBC-3 compliant SCSI hardware, it also requires the necessary software application programming interfaces (APIs), such as those now included in Windows Server 2012 and Windows 8.

Page 5: Block reclamation

That being said, previous Windows OS releases do not support the necessary APIs.

Unmapping, in simple words, is basically de-allocating the relationship between an LBA and a physical block in a logical unit. On the file-system side, this is also known as HOLE PUNCHING.

What is hole punching? Hole punching in file systems is marking a portion of a file as being unneeded, so that the storage associated with that portion of the file can then be freed.

Background: block reclamation was always a SAN hardware feature, as it was easier to get control over the entire stack [OS/kernel/filesystem/driver/storage]. However, once a LUN is attached to a host system, the host takes over the LUN and formats it with either an open source or a proprietary file system such as NTFS. For something like NTFS, where the data structures are proprietary and not officially documented, a storage vendor generally provides its own plug-in to take advantage of 'block reclamation', but not any longer. The good news is that a number of popular OS/filesystem/application vendors are now working towards providing this feature natively in their new product releases by adopting the "T10 SBC-3 specification". For more information on T10, please see the last page.
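To illustrate what hole punching looks like from the host side, here is a minimal sketch using the fallocate utility from util-linux on Linux. The file name is illustrative, the file system must support hole punching (for example ext4 or XFS), and older util-linux releases may lack the --punch-hole option:

[root@redhat ~]# # Punch a 1 MiB hole at the start of an existing file
[root@redhat ~]# fallocate --punch-hole --offset 0 --length 1M /files/demo.dat
[root@redhat ~]# # The apparent file size is unchanged, but the allocated blocks shrink
[root@redhat ~]# du -h --apparent-size /files/demo.dat
[root@redhat ~]# du -h /files/demo.dat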

OS & Storage vendors adapting to T10 Standards to reclaim space

Page 6: Block reclamation

Microsoft adopts the T10 standard with the Windows 2012/Windows 8 NTFS file system: as files are added to an NTFS volume, more entries are added to the MFT and so the MFT increases in size. When files are deleted from an NTFS volume, their MFT entries are marked as free and may be reused, but the MFT does not shrink; the space used by these entries is not reclaimed from the disk. Microsoft solved this problem by adopting the T10 standard in Windows Server 2012 and Windows 8.

By default, Windows 8 and Windows Server 2012 enable real-time space reclamation using SCSI UNMAP. That means no third-party APIs are required to reclaim the dead space.

So, what's new with Windows 2012 that allows space reclaim? A new API implementation, known as IOCTL DSM allocation, retrieves the logical block address (LBA) status of thin provisioned LUNs. All logical blocks are grouped into slabs (clusters) that are classified as mapped, de-allocated or anchored; the de-allocated and anchored states are what Windows considers unmapped. This is transparent to users and ensures the Windows thin provisioning framework, which includes space reclamation, performs as intended.

For further information about the Windows 2012 thin provisioning features, reference the following link: http://msdn.microsoft.com/en-us/library/windows/hardware/hh770514.aspx
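As a quick way to check whether a given Windows Server 2012 or Windows 8 host is issuing delete notifications (and therefore real-time UNMAP), the built-in fsutil tool can query the relevant behaviour flag. A hedged sketch; the second line shows representative output rather than output captured for this document:

C:\> fsutil behavior query DisableDeleteNotify
DisableDeleteNotify = 0

A value of 0 means delete notification is enabled, i.e. real-time space reclamation is active.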

Page 7: Block reclamation

Caution: as previously described, any time a large file is deleted, multi-level space reclamation occurs. This may impact performance, depending on how often users or applications delete large files. Proper planning, which can be accomplished by establishing performance baselines, should help to alleviate any real-time space reclamation performance impacts. If Windows space reclamation planning identifies a high probability of performance impact, consider the following option: real-time space reclamation can be disabled for large file deletions via the following Windows registry setting.

1. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
2. Set the DisableDeleteNotification value to 1

Note: this Windows registry setting affects all LUNs on that particular Windows host.

For further information, see Plan and Deploy Thin Provisioning: http://technet.microsoft.com/en-us/library/jj674351.aspx
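The same delete-notification behaviour can also be toggled with fsutil from an elevated command prompt instead of editing the registry by hand. A minimal sketch, assuming the fsutil flag name DisableDeleteNotify corresponds to the registry setting described above; like the registry change, it applies to the whole host:

REM Disable real-time delete notification (UNMAP) host-wide; 1 = disabled
C:\> fsutil behavior set DisableDeleteNotify 1
REM Re-enable it later
C:\> fsutil behavior set DisableDeleteNotify 0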

Page 8: Block reclamation

Symantec Storage Foundation provides middleware functionality for its software-based dynamic disks that goes beyond typical Windows NTFS volume definitions. This differentiation carries over to its alternative space reclamation approach: rather than using SCSI UNMAP commands, Symantec Storage Foundation employs SCSI WRITE SAME commands to achieve the same end result.

Thin reclamation using the Thin Reclamation API and the Thin Provisioning Reclamation Add-on:
http://www.symantec.com/business/support/index?page=content&id=HOWTO78517
http://public.dhe.ibm.com/common/ssi/ecm/en/tsw03164usen/TSW03164USEN.PDF

Page 9: Block reclamation

Red Hat Enterprise Linux 6 introduced the SCSI UNMAP command to the ext4 file system to support releasing space on SAN platforms that also implement the UNMAP command. The Linux kernel uses discard requests to inform the storage that a given range of blocks is no longer in use.

How to use the 'discard' option: create a new ext4 file system and mount it using the 'discard' option. This is the piece that tells Red Hat to send the SCSI UNMAP command to Storage Center when it is done with blocks of storage.

[root@redhat ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@redhat ~]# mkfs.ext4 -L DemoVol /dev/sdb
[root@redhat ~]# mount -o discard LABEL=DemoVol /files/

For more details, please see the Dell community page: http://en.community.dell.com/techcenter/b/techcenter/archive/2011/06/29/native-free-space-recovery-in-red-hat-linux.aspx
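If mounting with the discard option is not desirable (online discard adds a small overhead on busy file systems), unused blocks can instead be discarded in one batch pass. A hedged sketch using fstrim from util-linux; it may require a later RHEL 6 update than the 6.0 release shown above:

[root@redhat ~]# # Batch discard: report all currently unused blocks to the storage
[root@redhat ~]# fstrim -v /files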

Page 10: Block reclamation

VMware began supporting SCSI UNMAP commands when it introduced the VMware vSphere 5.0 Storage APIs for Array Integration (VAAI) primitives. However, VMware discovered issues affecting Storage vMotion and VM snapshot consolidation that led it to alter its SCSI UNMAP support for the next release, vSphere 5.1. Specifically, vSphere 5.1 does not provide proactive or automatic space reclamation via SCSI UNMAP commands; manual user intervention or scripts must be used to realize the SCSI UNMAP benefits, preferably outside of peak business hours.

SUMMARY

Note: With T10 SBC-3 adoption by both OS and storage vendors, proprietary APIs will no longer be required.

Page 11: Block reclamation

Demonstration to test the block reclamation theory Using NetApp SnapDrive API

Scenario: [All Thick]
Thick Volume = 4 GB
Thick LUN = 2 GB
Windows 2008 R2 NTFS file system = 1.97 GB [after formatting]

This is how the VOLUME usage looks in OnCommand System Manager.

We copied 1.33 GB of data to this LUN.

Page 12: Block reclamation

This is how LUN space usage stands after copying 1.33GB to the LUN.

We created a snapshot

Now, we intend to delete 655MB from the NTFS file system on host OS.

Shift + delete -> New folder (5) of 655MB size.

Page 13: Block reclamation

We checked the new space usage on the volume (1.33 GB – 655 MB ≈ 703 MB), and it looks like NTFS has freed the blocks and gained the space back on the host side.

However, when we checked the space usage on the LUN [storage array], it looks like the blocks freed on the host haven't been reflected on the LUN side. It means the storage array [filer] has no idea about the blocks that have been freed on the host file system, and it's still showing the old allocated space.

Page 14: Block reclamation

How NetApp uses the SnapDrive API to tackle this problem

How do we reclaim the unused BLOCKS? NetApp SnapDrive runs its Space Reclaimer scanner and informs the NetApp filer that these blocks should be freed from the storage subsystem.

SnapDrive predicts, based on the initial scan, that it can free up to 670 MB worth of blocks back to the storage POOL. [Remember, we deleted 655 MB of data on the host NTFS file system.]
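For reference, SnapDrive for Windows also exposes Space Reclaimer through its sdcli command line, so the same scan and reclaim can be scripted. A hedged sketch only; the drive letter is illustrative and the exact sub-command flags should be verified against your SnapDrive release:

REM Start space reclamation on the LUN mounted as drive G: (flags illustrative)
C:\> sdcli spacereclaimer start -d G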

Page 15: Block reclamation

Enable the check box if you wish to [it is quite self-explanatory] and then click OK to begin space reclamation.

Page 16: Block reclamation

Once the 'space reclamation' process finished, we checked the LUN space on the filer. It looks like we have regained space on the LUN.

Bottom line: with a THICK LUN, there is no benefit as far as 'reclaiming' the dead space goes, because you can't really give that space to any other volume; it is fixed to that volume. However, it does improve 'space reporting' in NetApp System Manager and/or other reporting tools. Hence, you would no longer see 100% 'LUN' usage in the reporting tool in spite of having plenty of free space in the host file system. Unfortunately, as said earlier, the space we reclaimed above has just gone back to the 'LUN' but not to the volume or aggregate. The only way we could have made use of the freed space was by giving it back to either:

The Volume [by making the LUN 'THIN' and the VOLUME 'THICK']. Usually, as a best practice, we create a one-to-one mapping between LUN and volume; hence, in this case, even if the space were given back to the volume, we could not actually share the regained space with other LUNs unless multiple LUNs share the same volume. Or,

The Aggregate [by making both the LUN and the VOLUME 'THIN'].

Page 17: Block reclamation

VMware introduced space reclamation as part of VAAI (VMware Storage APIs for Array Integration). vSphere 5.0 introduced the VAAI Thin Provisioning Block Space Reclamation (UNMAP) primitive. This feature was designed to efficiently reclaim deleted space to meet continuing storage needs, and ESXi 5.x issues UNMAP commands for space reclamation during several operations.

When is UNMAP called? When you delete virtual machine files from a VMFS datastore, or migrate them through Storage vMotion, the datastore frees blocks of space and informs the storage array via the UNMAP command, so that the blocks can be reclaimed.

Soon a problem was discovered with UNMAP: poor system performance. As a result, VMware recommended disabling UNMAP on ESXi 5.0 hosts with thin-provisioned LUNs, and the UNMAP operation was disabled by default in ESXi500-201112001 (ESXi 5.0 Patch 02) and ESXi 5.0 Update 1. Reclamation is now a manual process; tasks such as Storage Migration and Snapshot Consolidation do not automatically attempt UNMAP on the back-end LUN. If you continue to use an unpatched ESXi 5.0 host, you must manually disable UNMAP on all hosts. For more information, see Disabling VAAI Thin Provisioning Block Space Reclamation (UNMAP) in ESXi 5.0 (2007427).

ESXi 5.0 Update 1 includes an updated version of vmkfstools that provides an option (-y) to send the UNMAP command to the storage array, regardless of the ESXi host's global setting. This option also exists on earlier ESXi versions, but does not reclaim the space when run. Note: when you run vmkfstools --help, the -y option is not displayed in the help output.

To avoid the use of UNMAP commands on thin provisioned LUNs:

1. Log in to your host using the Tech Support mode. For more information on using Tech Support mode, see Tech Support Mode in ESXi 4.1 and 5.0 (1017910).

2. From your ESXi 5.0 host, run this command:

esxcli system settings advanced set --int-value 0 --option /VMFS3/EnableBlockDelete

Page 18: Block reclamation

3. To verify this setting, run this command:

esxcli system settings advanced list --option /VMFS3/EnableBlockDelete
   Path: /VMFS3/EnableBlockDelete
   Type: integer
   Int Value: 0          <<<<<<<<<< 0 means Disabled
   Default Int Value: 1
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Enable VMFS block delete

KB article: Disabling VAAI Thin Provisioning Block Space Reclamation
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007427
KB article: Using vmkfstools to manually reclaim VMFS deleted blocks
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2014849

Note: To verify that you have a T10 storage array, consult the VMware Compatibility Guide. http://www.vmware.com/resources/compatibility/search.php

List of VAAI capable storage arrays: http://v-reality.info/2010/10/list-of-vaai-capable-storage-arrays/
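With automatic reclamation disabled, deleted blocks can still be handed back manually using the vmkfstools -y option mentioned above. A minimal sketch; the datastore name is illustrative, and the percentage controls how much free space the temporary balloon file may occupy, so avoid values close to 100 on nearly full datastores:

~ # cd /vmfs/volumes/datastore1
/vmfs/volumes/datastore1 # vmkfstools -y 60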

Page 19: Block reclamation

FAQ ON THIN PROVISIONING (NetApp)

WHAT IS BASIC THIN PROVISIONING (GENERIC DEFINITION)? Answer: Thin provisioning provides the ability to allocate space from a pool of storage to a volume or LUN only when the data is written, rather than preallocating the space. This allows the storage to be purchased incrementally as it is needed, rather than purchasing large amounts of storage upfront based on guesses about storage requirements.

SO, WHAT IS THICK PROVISIONING? Answer: Thick provisioning is the traditional approach of fully preallocating all space to a volume or LUN on its creation, rather than waiting for data to be written to the volume or LUN.

WHAT ARE THE KEY BENEFITS OF USING NETAPP THIN PROVISIONING? Answer: NetApp thin provisioning can increase storage utilization while providing the flexibility to address the challenges in a dynamic IT environment. Since space is not taken from the storage pool until data is written to a volume or LUN, the unused space is available to any thin-provisioned volume or LUN using that common shared pool. For more details about NetApp thin provisioning, refer to TR-3563, “NetApp Thin Provisioning Increases Storage Utilization with on-Demand Allocation.” http://media.netapp.com/documents/tr-3563.pdf

CAN I GROW OR SHRINK THE SHARED STORAGE POOL (AGGREGATE)? Answer: The aggregate can be expanded, but cannot be reduced.

CAN I ALLOCATE MORE STORAGE TO VOLUMES AND LUNS THAN IS AVAILABLE IN THE AGGREGATE?

Answer: Yes, this is possible when volumes or LUNs use thin provisioning. This is known as overcommitment.

IS THERE AN ADVANTAGE TO THIN PROVISIONING A LUN WITHIN A VOLUME, BUT NOT THIN PROVISIONING THE VOLUME? WHEN WOULD IT BE USED?

Answer: Doing this is useful if it is desirable to have the LUNs use the volume as the shared pool of guaranteed space instead of the aggregate.

Page 20: Block reclamation

CAN I USE THIN PROVISIONING WITH OTHER NETAPP STORAGE EFFICIENCY FEATURES?

Answer: Yes. As a matter of fact, using other NetApp storage efficiency features, such as deduplication and FlexClone®, can provide even greater storage utilization.

CAN THIN PROVISIONING BE DISABLED AT ANY TIME? Answer: Yes. It is possible to turn off NetApp thin provisioning at any time for volumes or LUNs.
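For context, on Data ONTAP operating in 7-Mode the thin/thick behaviour is controlled by the volume space guarantee and the LUN space reservation. A hedged sketch with illustrative volume and LUN names (clustered Data ONTAP uses different command syntax):

filer> vol options vol1 guarantee none
filer> lun set reservation /vol/vol1/lun1 disable

Setting the guarantee back to "volume" and re-enabling the reservation makes them thick again.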

WHERE WILL I SEE THE INCREASE IN STORAGE UTILIZATION WHEN I USE THIN PROVISIONING?

Answer: The easiest way to recognize the increase in storage utilization as a result of using NetApp thin provisioning is to measure storage utilization with the Operations Manager Storage Efficiency Dashboard and/or the storage efficiency section of My AutoSupport.

CAN I SET THRESHOLDS BASED ON THE FULLNESS OF THE AGGREGATE? Answer: Yes. In Operations Manager, use aggrFullThreshold and aggrNearlyFullThreshold.

CAN I SET THRESHOLDS BASED ON THE FULLNESS OF THE VOLUME? Answer: Yes. In Operations Manager, use volFullThreshold and volNearlyFullThreshold.

CAN I SET THRESHOLDS BASED ON THE LEVEL OF OVERCOMMITMENT OF THE AGGREGATE?

Answer: Yes. In Operations Manager, use aggrOvercommittedThreshold and aggrNearlyOvercommittedThreshold.

CAN I SET THRESHOLDS BASED ON THE LEVEL OF OVERCOMMITMENT OF THE VOLUME?

Answer: Yes. In Operations Manager, use volOvercommittedThreshold and volNearlyOvercommittedThreshold.
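These thresholds are set through Operations Manager (DataFabric Manager). A hedged sketch of the CLI form with illustrative percentage values:

dfm options set aggrFullThreshold=90
dfm options set aggrNearlyFullThreshold=80
dfm options set aggrOvercommittedThreshold=100
dfm options set volFullThreshold=90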

CAN I USE BOTH THIN AND THICK PROVISIONING TOGETHER? Answer: Yes. It is possible to have thin-provisioned and thick-provisioned volumes and LUNs within the same aggregate. Thin provisioning can be enabled at any time without any performance impact.

Page 21: Block reclamation

Virtual machine disk provisioning methods (VMware): VMDKs can be provisioned using two different methods, namely thick provisioning and thin provisioning. Thick provisioning can be further categorized into two methods: 1. Lazy zeroed thick 2. Eager zeroed thick. Before we define what these two are, it is important to understand what 'zeroing' is. Zeroing is the process of writing zeroes to the disk blocks corresponding to a VMDK, to make sure that any existing data in those blocks is not exposed via the new VMDK.

Eager zeroed thick provisioning: an eager zeroed thick disk, when created, gets all of the space allocation it needs, and all of the disk blocks allocated to it are zeroed out at the time of creation. Therefore, it takes longer to create than a lazy zeroed or thin-provisioned disk.

However, it offers better first-write performance, because the disk blocks corresponding to an eager zeroed disk are already zeroed out during its creation.

Lazy zeroed thick provisioning: a lazy zeroed thick disk also gets all of the space allocation it needs at the time of creation but, unlike an eager zeroed disk, it DOES NOT write zeroes to all of the disk blocks; each disk block is zeroed out only during the first write. Although it doesn't offer the first-write performance of an eager zeroed disk, all subsequent writes to already-zeroed blocks have the same performance.

Thin-provisioned disk: this type of disk does not use all of the disk space assigned to it during creation. It only consumes the disk space needed by the data on the disk. For example, if you create a thin VMDK of 100 GB, it will not use 100 GB of space at the back end; if a 100 MB file is added to the VMDK, then the VMDK will grow by roughly 100 MB only.
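These three formats map directly to the vmkfstools disk-format options. A hedged sketch with an illustrative size and datastore path; the same choices are offered in the vSphere Client when creating a virtual disk:

# Eager zeroed thick: all blocks allocated and zeroed up front
vmkfstools -c 10G -d eagerzeroedthick /vmfs/volumes/datastore1/demo/demo-ezt.vmdk
# Lazy zeroed thick: blocks allocated up front, zeroed on first write
vmkfstools -c 10G -d zeroedthick /vmfs/volumes/datastore1/demo/demo-lzt.vmdk
# Thin: space consumed only as data is written
vmkfstools -c 10G -d thin /vmfs/volumes/datastore1/demo/demo-thin.vmdk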

Page 22: Block reclamation

T10 Technical Committee

For more information on the SCSI T10 SBC-3 UNMAP command:

1. Go to the T10 website: www.t10.org/
2. Click 'Search docs'.

3. In the search box, enter 'unmap' and click 'Search' or press the Return key.

Page 23: Block reclamation

4. The search should return quite a few documents that you can refer to for more information on the 'UNMAP' command and how it works internally.

Informational articles:
http://www.13thmonkey.org/documentation/SCSI/spc3r23.pdf
http://www.snia.org/sites/default/files2/SDC2011/presentations/monday/FrederickKnight_File_Systems_Thin_Provisioning.pdf
https://communities.netapp.com/community/netappblogs/efficiency/blog/2010/08/04/punching-holes

Thin Provisioning: [Must Read]
http://msdn.microsoft.com/en-us/library/windows/hardware/dn265487(v=vs.85).aspx

Note: For any current working draft, you will need to be a member of T10.

PS: This document is my own small effort to shed light on thin provisioning and block reclamation; there may be some information that is incorrect, and I hope the reader will point it out. Thanks! Courtesy: T10.org, Symantec, IBM, Red Hat, VMware, Dell, Microsoft and NetApp

[email protected] July, 2014