Flash Architecture and Effects


TRANSCRIPT

Page 1: Flash Architecture and Effects

© Copyright 2010 EMC Corporation. All rights reserved.

Flash Architecture and Effects

Dave Zeryck

Page 2: Flash Architecture and Effects


Roadmap Information Disclaimer

EMC makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, “Roadmap Information”).

Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby.

Roadmap Information is EMC Restricted Confidential and is provided under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.

Page 3: Flash Architecture and Effects


Agenda

The “Rule of Thumb” and where it came from

Flash Architecture

Flash Write Details – a drama in 3 acts

Flash & SP Write Cache

Best Practices/Best Use

Page 4: Flash Architecture and Effects


The Rule Of Thumb

Flash manufacturers say 30K IOPS per drive; EMC says 2,500. Why?

– YES, you can get 30,000 IOPS from one drive in special cases:

Very small I/O size

Reads

– 2,500 IOPS is the expected performance of certain drives under adverse conditions

Flash IOPS vary greatly with several factors:

– Drive model

– Read/write ratio (writes are slower to process)

– I/O size (has a very large impact on IOPS)

– Concurrency (more concurrent requests yield more IOPS)

Applications may have to be adjusted to get high IOPS from Flash drives

[Chart: Flash IOPS under varying read/write ratios – un-cached IOPS from six spindles, ORION (Oracle) benchmark]

100% read: 31,912 IOPS
 90% read: 12,853 IOPS
 50% read:  5,691 IOPS
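To make the sensitivity to read/write mix concrete, here is a minimal sketch of a first-order mixing model: assume fixed pure-read and pure-write service rates and average the per-I/O cost. The 30,000 and 2,500 figures are illustrative assumptions taken from the rule of thumb, not measurements, and the ORION results above show that real drives do not follow such a simple curve exactly.

```python
# A first-order sketch (not EMC's sizing model): estimate mixed-workload IOPS
# as a weighted harmonic mean of assumed pure-read and pure-write rates.
# READ_IOPS and WRITE_IOPS are illustrative assumptions, not measurements.

def mixed_iops(read_fraction: float, read_iops: float, write_iops: float) -> float:
    """Invert the average service demand per I/O back into an IOPS figure."""
    write_fraction = 1.0 - read_fraction
    return 1.0 / (read_fraction / read_iops + write_fraction / write_iops)

READ_IOPS, WRITE_IOPS = 30_000.0, 2_500.0
for rf in (1.0, 0.9, 0.5):
    print(f"{rf:4.0%} read -> ~{mixed_iops(rf, READ_IOPS, WRITE_IOPS):,.0f} IOPS")
```

Even this rough model shows why a small fraction of writes pulls throughput down sharply: the slow operation dominates the average service time.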

Page 5: Flash Architecture and Effects


The Rule Of Thumb

What's behind the Rule of Thumb and Flash behavior?
– That's what we'll investigate in this presentation

Who is the Rule of Thumb aimed at?
– The average user: not performance critical, modest SLA, does not want to 'architect' storage

Who is this presentation for?
– Those who want, or need, to know just how far you can push the technology

– Those who must know how to meet an SLA with the utmost in economy and precision

Understanding Flash drives can help you target their use better
– Know your results before you run the workload

– Help set expectations with your users

– Know which drives will work best for high-priority applications

What we'll cover in this presentation
– Why writes are so different from reads

– What differences between models affect performance

– How to get the most out of your Flash drives

Page 6: Flash Architecture and Effects


Flash Architecture: Anatomy

Flash drives are like little storage systems
– Front-end ports (dual-ported)

– Controller logic

To determine the location of each LBA

To manage housekeeping

– Fast buffer storage

For writes and metadata

– Multiple paths to storage

[Diagram: Fibre Channel front end → controller logic → buffer → NAND (up to 16 chips, over up to 16 channels)]

Page 7: Flash Architecture and Effects


Flash Architecture: Anatomy

The fast memory buffer holds:
– An index of all locations

– Incoming writes

Incoming writes are buffered
– Status is returned immediately in most cases

– Incoming writes are gathered into blocks

– Blocks are written to the NAND asynchronously

[Diagram: the buffer holds the index (a map of all LBA locations), new writes, and metadata; user data and a metadata copy are written to the Flash NAND chips as self-identifying data]

Page 8: Flash Architecture and Effects


Flash Architecture: Anatomy

Flash resiliency
– Power capacitors maintain power to the buffer in the event of a system power failure

Contents are written to the persistent store (Flash) if power fails

– The index table is backed up to Flash when the drive powers down

– On power-up, the table is reloaded into the buffer and a consistency check is run

If the table is found to be inconsistent, it is rebuilt by reading all of the Flash metadata and reconstructing the data. All of the Flash data is self-identifying.

[Diagram: on power fail, all buffer contents (index of LBA locations, new writes, metadata) are secured to the persistent Flash NAND chips]

Page 9: Flash Architecture and Effects


Flash Architecture: Pages

The architecture of a Flash drive affects its operation
– Cells are addressed by pages

73 GB and 200 GB drives use 4 KB pages

400 GB drives have 16 KB pages

Page contents are a contiguous address space, like SP cache pages
– Like an SP cache page, a small I/O will be held within one page

– Like a disk sector, the entire page must be valid before writing

The drive cannot write a partial page to the NAND chip

[Diagram: a 4 KB I/O in a 4 KB Flash page (LBA 0x0400–0x13FF); two 2 KB I/Os in a 4 KB Flash page (LBA 0x0400–0x13FF and 0x1400–0x23FF) must be contiguous with respect to LBA]

Page 10: Flash Architecture and Effects


Flash Architecture: Blocks

The NAND storage is mapped, like a filesystem

Pages are grouped together into blocks (example: 256 KB)
– Not to be confused with:

a SCSI "block", which is a sector on an HDD

a filesystem block/page

– Multiple pages in a block are "jumbled" together

– The addresses of pages in a block do not have to be contiguous

[Diagram: a 256 KB block of 4 KB or 16 KB pages. The pages in this block can be from random locations in the LBA map (e.g., LBA 0x0400, 0x2400, 0x4400, 0x6400); the Flash keeps a logical map of each page, its location in the Flash, and the LBA to which it corresponds]
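Since the pages in a block can come from anywhere in the LBA space, the drive must keep a logical-to-physical map, as the figure describes. A minimal sketch of such a map (an illustrative model with assumed sizes and structures, not any vendor's firmware):

```python
# Minimal sketch of a logical-to-physical page map (illustrative FTL model).
# Sizes follow the deck's example: 4 KB pages grouped into 256 KB blocks.

PAGE_SIZE = 4 * 1024                 # 4 KB pages (73/200 GB drives)
PAGES_PER_BLOCK = 64                 # 64 x 4 KB = 256 KB per block
SECTOR_SIZE = 512                    # assumed SCSI sector size

# logical page number -> (physical block, page slot within that block)
page_map: dict[int, tuple[int, int]] = {}

def locate(lba: int) -> tuple[int, int] | None:
    """Find where the page holding this LBA currently lives on the NAND."""
    logical_page = (lba * SECTOR_SIZE) // PAGE_SIZE
    return page_map.get(logical_page)

# Pages from scattered LBAs can share one physical block, as in the figure:
page_map[0x0400 * SECTOR_SIZE // PAGE_SIZE] = (7, 0)
page_map[0x2400 * SECTOR_SIZE // PAGE_SIZE] = (7, 1)
```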

Page 11: Flash Architecture and Effects


Flash Architecture: Channels and Devices

Channels are paths to physical devices (chips)
– Flash drives have multiple channels: discrete devices can be read from or written to simultaneously

– Large I/O is striped across the channels

So, parts of a large I/O are split between multiple blocks

[Diagram: a 512 KB host I/O becomes two 256 KB block images in the drive buffer; each block image is accessed by a specific channel (Channel1, Channel2, … ChannelX) to a specific NAND chip (Chip1, Chip2, … ChipX)]

Page 12: Flash Architecture and Effects


Flash Architecture: Blocks

Important!

Writes to NAND are done at the block level

Block images are held in the buffer until the block is full, then written to a previously erased block on disk

There must be an erased block available for the write
– We'll cover how blocks are erased…

[Diagram: a block image (of 4 KB or 16 KB pages) assembled in the buffer is written in its entirety to an erased block on the Flash chips]

Page 13: Flash Architecture and Effects


Flash Write Details: Page States

Flash as a mapped device
– Workload can affect page state

– Page state can affect availability of blocks

– Availability of free (erased) blocks determines write performance

Part 1: Page States
– A page is in one of three possible states:

Valid page: the page contains good data (referenced by host and Flash)

Invalid page: the page contains 'stale' data, one of:

• Overwritten in the host filesystem

• Moved/coalesced by the Flash itself

Erased: pages in an erased block; the block is not in use

– Pages become randomized due to random writes

A block may have a mix of valid, invalid, and erased pages

Page 14: Flash Architecture and Effects


Flash Write Details: Page States

A block's pages are either valid or invalid
– If a page is referenced by the Flash metadata, it is valid

– How does it become invalid? Let's follow the fate of a small file

[Diagram: logical view on host – a small file (8 KB) is mapped to 2 filesystem blocks of 4 KB each; the 8 KB of file data fits in 2 pages of 4 KB each within a Flash block on the drive (Block1, holding pages for LBAs 0x2A1, 0x040, 0x240, 0xCF0, 0x0FA). Legend: valid / invalid / erased / file]

Page 15: Flash Architecture and Effects


Flash Write Details: Page States

How a page becomes invalid
– The first 4 KB of the file is overwritten by the host, to the existing location (LBA)

– The new value is stored in a page in the buffer (block image)

– The old page in NAND is marked invalid

[Diagram: Step 1 – the user updates the file; the host overwrites the existing filesystem page (LBA 0x040 New in the logical view on host)]

Page 16: Flash Architecture and Effects


Flash Write Details: Page States

[Diagram: Step 2 – the Flash drive stores the data in a block image in the drive's buffer; the new page (0x040 New) now sits in the buffer's block image while the old copy remains on the chip]

Page 17: Flash Architecture and Effects


Flash Write Details: Page States

[Diagram: Step 3 – the Flash invalidates the old page on the chip by setting a bit in the mapping database; the data is left in place, but its reference is removed from the index. At some point the new block image in the buffer is written to the chip in a different block]
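Putting the three steps together, here is a minimal sketch of the overwrite path (an illustrative model, not actual drive firmware; the names and structures are assumptions):

```python
# Minimal sketch of the three-step overwrite path above (illustrative FTL
# model, not drive firmware). Page states follow the deck's legend.

from enum import Enum

class PageState(Enum):
    VALID = "valid"      # referenced by host and Flash
    INVALID = "invalid"  # stale: overwritten or moved/coalesced
    ERASED = "erased"    # part of an erased, unused block

page_map: dict[int, tuple[int, int]] = {}          # logical page -> (block, slot)
page_state: dict[tuple[int, int], PageState] = {}  # physical page -> state

def host_overwrite(logical_page: int, buffer_slot: tuple[int, int]) -> None:
    """Steps 1-2: the new copy is staged in a buffered block image.
    Step 3: the old NAND copy is left in place, but its index reference is
    dropped by marking it invalid."""
    old = page_map.get(logical_page)
    if old is not None:
        page_state[old] = PageState.INVALID        # flip the bit in the map
    page_map[logical_page] = buffer_slot
    page_state[buffer_slot] = PageState.VALID
```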

Page 18: Flash Architecture and Effects


Flash Write Details: Reserve Capacity

Part 2: How Flash drives provide good write performance

Some percentage of the drive's capacity is reserved
– It is not included in the "user addressable" capacity

– HOWEVER, this capacity will be used to provide ready blocks for incoming writes

Example: binding a LUN does not 'fill' the drive
– Even if you bind the full available capacity

[Diagram: example Flash drive with 50% of capacity reserved – addressable capacity plus reserve blocks. As soon as the LUN is bound, we write zeros to all sectors; this dirties 'every' block. But the reserve blocks stand ready to accept 'incoming' writes]

Page 19: Flash Architecture and Effects


Flash Write Details: Reserve Capacity

Sustained heavy writes can saturate a Flash drive

We will take a simple example – a "4 MB Flash drive"
– The example Flash has 16 blocks addressable, 16 blocks reserve

– 1 block = 256 KB; 16 blocks = 4 MB

– The user binds a 4 MB LUN, consuming all addressable blocks

– The user writes 1 MB updates continuously

[Diagram: example Flash, 4 MB addressable – state of a new drive, before LUNs are bound. Legend: valid / invalid / erased]

Page 20: Flash Architecture and Effects


Flash Write Details: Reserve Capacity

[Diagram, three stages of the example drive (addressable capacity plus reserve blocks; legend: valid / invalid / erased):

1. LUN bound – all blocks of addressable capacity have been written by the zero process

2. User writes 1 MB – the Flash writes to erased blocks in reserve and invalidates the existing blocks

3. User overwrites 1 MB of existing data – the Flash again uses erased blocks in reserve]
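A minimal simulation of this walkthrough (illustrative only; it models whole-block overwrites and ignores page-level detail) shows how quickly continuous updates exhaust the erased reserve:

```python
# Minimal simulation of the deck's "4 MB Flash drive" example (illustrative
# model only): 16 addressable blocks plus 16 reserve blocks of 256 KB each.

erased = list(range(16, 32))            # the 16 reserve blocks start erased
live = {lb: lb for lb in range(16)}     # logical block -> physical block
dirty: list[int] = []                   # invalidated blocks awaiting erase

def overwrite_block(logical_block: int) -> None:
    """A full-block overwrite consumes an erased block and dirties the old one."""
    if not erased:
        raise RuntimeError("no erased blocks left: the drive must erase first")
    fresh = erased.pop()
    dirty.append(live[logical_block])
    live[logical_block] = fresh

# The user rewrites the same 1 MB (4 blocks) over and over; four passes
# consume all 16 reserve blocks.
for _ in range(4):
    for lb in range(4):
        overwrite_block(lb)
print(f"erased left: {len(erased)}, awaiting erase: {len(dirty)}")  # 0 and 16
```

After four passes the erased list is empty, which is exactly the situation the next slide picks up: blocks must be erased before any further data is written.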

Page 21: Flash Architecture and Effects


Flash Write Details: Reserve Capacity

[Diagram: the user writes two more 1 MB I/Os to the disk]

Before any additional data can be written, some blocks on the disk must be erased

That's pretty simple in the case of large regions rendered invalid
– Entire blocks are invalid, so they can be erased quickly

– Random write performance is very good for large I/O

The drive will do this as part of normal housekeeping, with available cycles

Page 22: Flash Architecture and Effects


Flash Write Details: Reserve Capacity

What about small sustained writes?

[Diagram: if small random writes are sustained at a high rate, over time all blocks will end up with some amount of valid and invalid pages. Legend: valid / invalid / erased]

Now the drive has a bigger job to do

In order to clear space for additional writes, blocks must be consolidated

Page 23: Flash Architecture and Effects


Flash Write Details: Consolidation

Part 3: Erasing Blocks
– The drive will erase blocks during 'idle' periods, when incoming I/O is at a low rate

To be erased, a block must have all invalid pages
– Every valid page in a block must first be written to another block

– That requires additional activity:

Read the pages into a buffered block

Erase the old locales in NAND

Write out the consolidated block to NAND

[Diagram: two sparse blocks being consolidated (housekeeping) – step 1, read valid pages into the buffer. Note: only 16 pages are shown per block to keep the graphics a reasonable size]

Page 24: Flash Architecture and Effects


Flash Write Details: Consolidation

[Diagram, continuing the example: two sparse blocks being consolidated (housekeeping) – 1. read valid pages into the buffer; 2. erase the blocks on the chip; 3. write the consolidated block to the chip]

The drive will erase blocks when the incoming I/O rate allows; as above, every valid page in a block must first be written to another block before the donors can be erased.
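A minimal sketch of this consolidation pass (illustrative model only):

```python
# Minimal sketch of the housekeeping pass above (illustrative model only):
# read valid pages from sparse blocks into the buffer, write them back as one
# consolidated block, and erase the donor blocks.

def consolidate(sparse_blocks: list[list[bool]]) -> tuple[list[bool], int]:
    """Each block is a list of page-validity flags (True = valid page)."""
    buffer: list[bool] = []
    for block in sparse_blocks:
        buffer.extend(p for p in block if p)   # 1. read valid pages into buffer
        block.clear()                          # 2. erase the old block on chip
    return buffer, len(sparse_blocks)          # 3. write consolidated block out

# Two half-valid blocks of 16 pages, as drawn in the deck's figure:
blocks = [[i % 2 == 0 for i in range(16)] for _ in range(2)]
merged, erased = consolidate(blocks)
print(f"{len(merged)} valid pages -> 1 full block; {erased} erased, net +1 free")
```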

Page 25: Flash Architecture and Effects


Flash Write Details: Consolidation

Do Flash drives 'slow down' over time?
– As we have seen, free space is a factor, but so is time

Total capacity utilization can affect the response time of sustained writes
– Higher capacity utilization results in more valid pages in each block

– Over time, the distribution of valid pages becomes more random, and capacity utilization increases

– If blocks have a high percentage of valid pages, it is more difficult to consolidate and erase a block

– The drive needs more time to do housekeeping (sustained writes may use up free blocks)

Case 1: 50% of total capacity utilized (reserved + user)
– Each block averages 50% valid pages
– Read only 2 blocks into the buffer to free 1 block

Case 2: 75% of capacity utilized
– Each block averages 75% valid pages
– Read 4 blocks into the buffer to free 1 block
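The two cases generalize: if blocks average a valid-page fraction u, each block read during housekeeping yields (1 - u) of a block's worth of erasable space, so freeing one whole block costs roughly 1 / (1 - u) block reads. A small sketch reproducing the deck's two cases (the 90% row is an extrapolation, not from the slide):

```python
# Consolidation cost implied by the two cases above: reading n blocks with
# valid fraction u frees n blocks but consumes n*u of them for the rewritten
# valid pages, netting n*(1 - u) free blocks. Net one block => n = 1/(1 - u).

def blocks_read_to_free_one(valid_fraction: float) -> float:
    return 1.0 / (1.0 - valid_fraction)

for u in (0.50, 0.75, 0.90):
    print(f"{u:4.0%} valid -> read ~{blocks_read_to_free_one(u):.0f} blocks to free 1")
```

At 50% this gives 2 blocks and at 75% it gives 4, matching the cases above; the cost climbs steeply as utilization rises.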

Page 26: Flash Architecture and Effects


Flash Write Details: Backfill

Small writes and backfill, aka 'write amplification'
– Writing an I/O smaller than the page requires a read-modify-write

– The existing page on the Flash chip must be read into the buffer

– Once read, the old page will be invalidated, as the new page contains the current (merged) version of the data

[Diagram: a 2 KB write merges with an existing 4 KB page on the chip – the existing page is read into the block image in the disk buffer (4 KB pages; only 8 pages shown), the page is completed in the buffer, and the existing page on the chip is invalidated]
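A minimal sketch of this backfill path (sizes follow the slide; the byte-level merge is an illustrative assumption):

```python
# Minimal sketch of the read-modify-write (backfill) path for a sub-page
# write. Illustrative model only; real firmware merges at its own granularity.

PAGE_SIZE = 4 * 1024

def backfill_write(nand_page: bytes, offset: int, new_data: bytes) -> bytes:
    """Merge a small write into the existing 4 KB page read from the chip."""
    assert len(nand_page) == PAGE_SIZE and offset + len(new_data) <= PAGE_SIZE
    merged = bytearray(nand_page)                      # 1. read existing page
    merged[offset:offset + len(new_data)] = new_data   # 2. overlay the small write
    return bytes(merged)                               # 3. complete page goes to a
                                                       #    block image; old page
                                                       #    is invalidated

old_page = bytes(PAGE_SIZE)                            # existing 4 KB page on chip
new_page = backfill_write(old_page, 0, b"\xff" * 2048) # 2 KB host write
```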

Page 27: Flash Architecture and Effects


Flash and Write Cache

Original guidance: "Flash does not need SP cache"
– Conservative: avoid side effects from a full cache

New guidance: "Flash can help SP cache in many cases"

Experience: many uses of Flash + SP cache in the field
– No major problems encountered, many benefits seen

Examples
– Allows consolidation of I/O (necessary for log files)

– Improves response time for writes: a write-cache mirror operation is faster than parity writes to Flash

[Diagram: host writes sequentially into processor memory; the cached writes are destaged as a full RAID 5 stripe (Data1, Data2, Data3, Data4, Parity)]
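The benefit of coalescing sequential writes in cache can be counted in back-end I/Os. This is standard RAID 5 arithmetic, not an EMC-specific figure:

```python
# Why coalescing sequential writes in SP cache helps RAID 5 (standard RAID 5
# arithmetic): a small write needs a read-modify-write of data and parity,
# while a cached full-stripe write needs only writes.

def small_write_ios() -> int:
    # read old data + read old parity + write new data + write new parity
    return 4

def full_stripe_ios(data_disks: int) -> int:
    # parity is computed from the complete stripe in cache: just write it all
    return data_disks + 1

print(f"4 small writes, uncached: {4 * small_write_ios()} back-end I/Os")
print(f"same 4 writes as one full stripe: {full_stripe_ios(4)} back-end I/Os")
```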

Page 28: Flash Architecture and Effects


Best Practices

Our goal is to show the best potential for the drives
– There is no load which will break the drive, overheat it, or tire it out

The following slides have two themes: Best Use and OK to Use

Best Use covers those applications that get the maximum performance advantage from the drives:
– High random read rates

– Smaller I/O

– I/O patterns that are not optimal for cached FC implementations

Why pay Flash prices for I/O that FC + Flare cache handle just fine?

OK to Use covers profiles that will do just fine with Flash, but:
– Cached FC could do them as well

– They do not give you the big "Flash advantage" you might expect

All of these are field-tested designs

Page 29: Flash Architecture and Effects


Best Practices: Best Use

Databases (the most common use of Flash): 4 to 15 Flash drives typical
– Indexes and busy tables: "the 10% of the table spaces that do 80% of all I/O"

The biggest disk-for-disk increase is in read-heavy tables (10–20X)

– TEMP space

BUT – turn on SP write cache

– Some clients use Flash for write-heavy tables

Use SP write cache for better response time

Flash flushes cache faster, with better results for other (FC-based) tables as well

[Diagram: before – write cache at 90%, disks busy, cache full, some I/O waiting on cache; all FC drives are busy. After – write cache at 40%, disks less busy, cache flushes faster to Flash and to FC as well; FC drive queues are lower, with Flash handling the heavy writers. Small footprint, big effect]

Page 30: Flash Architecture and Effects


Best Practices: Best Use

Really big databases are a little different
– We see up to 30 Flash drives in larger DBs

– Some users have SP write caching OFF for Flash, to maximize write throughput

Oracle ASM 11gR2
– An ASM instance can be presented with different ASM disk groups (pools)

– The user can designate a group as FAST, AVERAGE, or SLOW

– We suggest you designate Flash as "FAST"

[Diagram: before – all writes mirrored between SPA and SPB, SP and cache busy; FC drives need write cache (heavy cache mirroring). After – busy tables on UNCACHED Flash drives: less mirror traffic, better response time; FC drives cached, Flash uncached (reduced cache mirroring)]

Page 31: Flash Architecture and Effects


Best Practices: Best Use

Messaging (Exchange, Notes) benefits from the same effects
– Move some of the databases to Flash, and all users benefit

– Use RAID 5 for Exchange on Flash

Turn on SP write cache

Writes flush to RAID 5 on Flash faster than to RAID 1/0 on FC

Reads are likely better distributed than from RAID 1/0 on Flash

Flash rebuilds faster than FC, and the impact is less

[Diagram: same before/after picture as the database slide – write cache drops from 90% to 40%, FC drive queues are lower, with Flash handling the heavy writers. Small footprint, big effect]

Page 32: Flash Architecture and Effects


Best Practices: OK to Use – But Why?

Databases:
– Oracle Flash Recovery: SATA drives do fine here and are more economical

– Redo logs: FC is sufficient, at less cost

Turn SP write cache on for redo LUNs, even if the redo logs are on Flash

– Archive logs: FC, and even SATA, do fine here

Media: mostly FC is used here
– Editing configurations are the best fit for Flash in media

Flash is very quick to serve the small metadata operations

– There is some advantage to using Flash with multistream access

Large reads and writes in parallel (sharing disks among streams) do not suffer from the "disk seek inflation" seen on rotating media

– FC will still give more predictable write performance at a micro level, due to Flash's internal structure

– Any time power/cooling issues are the #1 concern

Page 33: Flash Architecture and Effects


Best Practices: Flash and SP Write Cache

Original guidance was no SP write cache with Flash drives
– Flash is fast even without it

– We did not want Flash LUNs to hit force flushes in cache

Extensive field use shows Flash + SP cache is very effective
– No pathological cases encountered, due to the conservative guidance

– Please avoid heavy writes to SATA when using write cache and Flash

SP cache is very effective with Flash drives
– Write caching of sequential writes, to optimize RAID 5 updates

– Faster response time for small writes

Page 34: Flash Architecture and Effects


Best Practices: Flash Drives, EMC FAST Cache, and EMC FAST

The smaller Flash drives handle writes better than the largest (400 GB)
– The 400 has a large page size as well as modest reserve space

We expect the FAST Cache to be write-intensive
– The 73 GB drive will be the best performer in FAST Cache

– The 200 GB drive will also do well and is appropriate for large FAST Cache implementations

– The 400 GB drive would perform better in a tiered pool than in FAST Cache

Migration once a day vs. constantly adjusting

In FAST pools, more spindles is always better, even when they don't rotate
– You cannot short-stroke certain drives like you can with a Flare LUN, so the 73 GB drive will be the star performer in pools

– The 200 and the 400 will follow in performance, due both to behavior and to drive count

Page 35: Flash Architecture and Effects


Summary

Flash drives are revolutionary: truly random-access storage, and therefore different behavior

There are also implementation details of Flash that make the drives behave differently
– Writes take time to absorb

– Constant write loads can show differences between the models

– Our rule of thumb is a good guide for conservative design

SP write cache is effective with Flash drives
– Field-tested designs validate many benefits

Best practices use SP cache with some applications, not with others
– Fit the solution to the problem

Page 36: Flash Architecture and Effects


Win an Iomega® Portable Hard Drive

Provide Feedback and Design with Us!

We are interested in feedback from CLARiiON, Celerra & Centera customers.

Go to www.emc.com/umsgsurvey for a short survey.

One winner will be randomly selected for an Iomega portable hard drive, and select respondents will also get an opportunity to participate in usability studies.

Page 37: Flash Architecture and Effects


Related EMC World Sessions

Lectures

CLARiiON Flash Cache: Automated Performance Optimization

Going FAST with CLARiiON

Leveraging CLARiiON FAST & FAST Cache with Enterprise Applications

The New Features and Functions of CLARiiON FAST

Performance as a Function of Utilization on CLARiiON

Hands-on Workshops

CLARiiON Navisphere Analyzer - Hands-on Workshop

CLARiiON VM-aware Unisphere, FAST and EMC Virtual Storage Integrator Hands-on lab with VMware vSphere 4

Please help us improve by filling out session evaluations!

Flash Architecture and Effects

Dave Zeryck
