spyglass: fast, scalable metadata search for large-scale … · 2019-02-25 · • cost &...

Andrew W Leung • Ethan L MillerUniversity of California, Santa Cruz

Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems

Minglong Shao • Timothy Bisson • Shankar PasupathyNetApp

7th USENIX Conference on File and Storage TechnologyFebruary 24-27, 2009

The data management problem

• Large-scale storage systems are difficult to manage• Management questions become difficult or impossible to ask• Must sift through billions of files

• Impacts users and administrators• User - “Where are my recently modified presentation files?”• Admin - “Which apps and users are consuming the most space?”

• Challenge: need fast answers to difficult questions• Example -- smart data restore

• A buggy script deletes a number of users’ files• “Which files should be restored from backup?”

• Search current and previous versions• Locate the affected files to restore• Find the files to restore first

2

What can be done?

• Need to quickly gather and search info from the storage system• Ad hoc queries over file properties• I.e., metadata search capabilities

• Metadata includes file inode and extended attributes• Formatted as <attribute, value> pairs

• Metadata search becoming more common• Desktop - Spotlight, Beagle, Windows search• Small-scale enterprise - IndexEngines, Google Enterprise, FAST

3

How can this be done at very large-scales?

Large-scale search challenges

4

Challenge Existing Solutions Why Is This Difficult?

Cost & resources Require dedicated hardware. Can’t afford dedicated hardware.

Must share resources.

MetadataCollection

Slow to collect changes.Impacts the storage system. Collect millions of changes.

Scale & performance

Use general-purpose DBMSs.Few storage optimizations. Searching 1010 - 1011 pairs.

• Large-scale search challenges are not currently addressed• Requires a specialized solution

Spyglass overview

• How should a solution address these challenges?• Leverage metadata search properties

• Metadata query properties• Conducted user survey

• File metadata properties• Analyzed 3 storage systems

• Spyglass uses new index designs:• Hierarchical partitioning• Signature files• Partition versioning - in the paper• Snapshot-based metadata crawling

5

Storagesystem

Spyglass

Cache Crawler

Internal File System

Partitions

Version 1

Version n

Index

Query

Results

Metadata query properties

• Surveyed over 30 users and administrators• Asked “how would you use metadata search?”

6

Property Description Example Solution

Multiple attributes

Includes multiple metadata attributes.

Where are my recently modified presentations?

Use all attributes in query execution.

Localized Includes a directory pathname.

Which files in my home directory can be deleted?

Include namespace knowledge.

Back-in-time Searches multiple metadata versions.

Which files have grown the most in the last week?

Version index changes.

File metadata properties

• Analyzed 3 storage systems at NetApp• Web server -- 15 million files• Engineering server -- 60 million files• Home directory servers -- 300 million files

7

Property Description Example Solution

Spatial locality

Values clustered in namespace.

Andrew’s files mostly in/home/andrew

Allow index control using the namespace.

Skewed frequencies

A few values occur frequently.

Intersections are more uniform.

Many PDFs or Andrew’s files.

Fewer Andrew’s PDF files.

Query execution using intersections.

Hierarchical partitioning

• Partition the index using the namespace• Parts of the namespace are indexed in separate partitions

• Exploits spatial locality• Allows index control at the granularity of sub-trees• Uses a simple greedy algorithm

• Partitions are stored sequentially on disk• Fast access to each partition

8

/

home proj usr

andrew jim distmeta reliability include

thesis scidac src experiments

Query execution

• Performance is tied to the number of partitions searched• How do we determine which partitions to search?

• Users can localize their queries• Users control the size of the search space

• Signature files• Automatically determines which partitions to search• Exploits attribute value intersections

• Searches only partitions containing all query values

9

Signature files

• Compactly summarizes a partition’s contents• A signature file for each attribute in the partition• Small enough to fit in memory• Created as files are inserted

• Only search a partition if all tested bits are 1

• False positives can be reduced by• Increasing signature size• Changing the hashing function

10

1 0 1 1 0 1 0 1•••

hash() mod

doc xls cppt

py pl h mpgjpg

mov

1 1 0 1 0 0 1 1•••hash()

<1 1-4 5-31 32-127

128-255

256-511

100MB-500MB

>500MBpy 32-127

plpy

Partition design

• Each partition stores metadata in a KD-tree• Not explicitly tied to a particular index structure

• KD-trees• A multi-dimensional binary tree• Provides fast, multi-dimensional search• Allows a single index structure to be used

• Performance is bound by reading partitions from disk• Partitions are managed by a caching sub-system

• Uses LRU• Captures the likely Zipf-like query distributions• Ensures popular partitions are in-memory

11

Metadata collection

• Metadata collection must• Scale to millions of changes• Not degrade storage system performance

• Calculates the difference between to two snapshots• Leverages the inode file in WAFL snapshots• Only re-crawls metadata for changed files

12

Block 1

Snapshot 1

Block 2 Block 3

Block 4 Block 5 Block 6

Block 1

Block 2 Block 3

Block 4 Block 5 Block 6

Snapshot 2

InodeChangeBlock 7

Block 8

Block 9

Evaluation

• Evaluate performance, scalability, and efficiency• Using our real-world traces from NetApp

• In the talk we evaluate• Metadata collection

• Compare performance to a simple straw-man• Search performance

• Compare performance to PostgreSQL and MySQL

• In the paper we also evaluate• Update performance• Space requirements• Index locality• Versioning overhead

13

Collection performance

• Compare snapshot-based collection (SB)• To a parallel file system crawl (SM)

14

0 50

100 150 200 250 300 350

0 20 40 60 80 100

Tim

e(M

in)

Files (Millions)SMSB

• Baseline crawl

• Reads entire inode file

• 10x faster that SM

0 50

100 150 200 250 300 350

0 20 40 60 80 100Ti

me

(Min

)

Files (Millions)SM-10%SM-5%SM-2%

SB-10%SB-5%SB-2%

• Incremental crawl

• 2%, 5%, and 10% change

• Finishes in less than 45 min

10x> 4 hr

Search performance

• Set 1: 75% queries finish in less than 1 second• Many partitions are eliminated from search

• Set 2: Localization significantly improves performance

15

Query Selectivity

.000001 .00001 .0001 .001 .01 .1

Qu

ery

Exe

cu

tio

n T

ime

100us

1ms

10ms

100ms

1s

10s

Spyglass System X System Y

Figure 4.10: Comparison of Selectivity Impact. The selectivity of queries in my query set is plotted against

the execution time for that query. I find that query performance in Spyglass is much less correlated to the

selectivity of the query predicates than the DBMSs, which are closely correlated with selectivity.

Set Search Metadata Attributes

Set 1 Which user and application files consume the most space? Sum sizes for files using owner and ext

Set 2 How much space in this part do files from query 1 consume? Use query 1 with an additional directory path.

Table 4.5: Query Sets. A summary of the searches used to generate my evaluation query sets.

to store the base table and index structures. Figure 4.9 shows that this approach can incur over a 100%

overhead.

Selectivity Impact. In this experiment, I evaluated the effect of metadata selectivity on the performance of

Spyglass and the DBMSs. I again generated query sets of ext and owner from the Web trace with varying

selectivity—the ratio of the number of results to all records. Figure 4.10 plots query selectivity against query

execution time. I found that the performance of PostgreSQL and MySQL is highly correlated with query

selectivity. However, this correlation is much weaker in Spyglass, which exhibits much more variance.

For example, a Spyglass query with selectivity 7 ! 10!6 runs in 161ms while another with selectivity

8! 10!6 requires 3ms. This variance is caused by the higher sensitivity of Spyglass to hierarchical locality

and query locality, as opposed to simple query selectivity. This behavior is unlike that of a DBMS, which

accesses records from disk based on the predicate it thinks is the most selective. The weak correlation with

selectivity in Spyglass means it is less affected by the highly skewed distribution of storage metadata which

makes determining selectivity difficult.

Search Performance. To evaluate Spyglass search performance I generated sets ofquery derived from real-

world queries in my user study; there are, unfortunately, nostandard benchmarks for file system search. My

first set is based on a storage administrator searching for the user and application files that are consuming the

most space—an example of a simple two attribute search. The second set is an administrator localizing the

same search to only part of the namespace, which shows how localizing the search changes performance.

57

• Evaluate 100 queries generated from 300 million file trace

Query Execution Time

100ms1s 5s 10s 25s 100s

Frac

tion

of Q

uerie

s

00.20.40.60.8

1

Spyglass Postgres MySQLQuery Execution Time

100ms1s 5s 10s 25s 100s

Frac

tion

of Q

uerie

s

00.20.40.60.8

1

Spyglass Postgres MySQL

Conclusions

• Metadata search can greatly improve how we manage data• Large-scale storage systems present unique challenges

• Cost & resources, metadata collection, performance & scalability

• There are opportunities to leverage query and file properties• Conducted a survey of real users and administrators• Analyzed real-world large-scale storage systems

• Spyglass is a new metadata search design• Hierarchical partitioning• Partition versioning• Snapshot-based metadata collection

16

Thank you! Questions?

17

Thanks to our sponsors:

Partition versioning

• Index versioning provides• Back-in-time search capabilities• Fast, out-of-place index updates

• Each partition manages its own versions with a version vector• Exploits file update locality

• Each version is a batch of index updates• Represents the state of metadata at a given time• Absorbs frequent file re-modifications• However, creates a stale index

18

Versioning design

• Versions contain incremental metadata changes• Changes roll results forward

• Stored sequentially with the partition• Updates are fast - small sequential writes• Search overhead is low - Longer sequential reads

19

T0 T1 T2 T3

Baselineindex

T0 T2T0 T0 T2 T3

Incrementalindexes

/

home proj usr

john jim distmeta reliability include

thesis scidac src experiments

Update performance

• Between 8x and 44x faster than DBMSs• DBMS load table and build indexes

• Scales linearly

20

Web Eng Home

Upda

te T

ime

(s)

0

25

250

2500

25000

250000

3m52s

48m2s

45m6s 20m

12s

2h44m43s

14h50m5s 1h

38m26s

18h7m22s

2d11h31m33s

Spyglass PostgreSQL TablePostgreSQL Index

MySQL TableMySQL Index

• Build baseline for each full snapshot

Index locality

• Attribute intersections reduce the search space• 50% of queries access less than 2% of partitions

• Selective attributes improve cache hit ratio• 95% of queries have 95% cache hits

21

• Evaluate how partitions are queried and cached

• Generate queries based on attribute distributions

Percent of Queries

0 20 40 60 80 100

Perc

ent o

f Sub−t

ree

Parti

tions

Que

ried

0

20

40

60

80

100

ext owner ext/ownerPercent of Queries

0 20 40 60 80 100

Perc

ent o

f Cac

he H

its

020406080

100

ext owner ext/owner

Versioning overhead

• Each versions adds 10% runtime overhead• Overhead is not evenly distributed

• 50% of queries have less than a 5 ms overhead• A few queries contribute most to overhead

22

• Evaluate overhead of 1 to 3 days of changes

Number of Versions0 1 2 3

Tota

l Run

Tim

e (s

)

0100200300400500

Query Overhead

1ms 10ms 100ms 1s 10s

Frac

tion

of Q

uerie

s

00.20.40.60.8

1

1 Version 2 Versions 3 Versions

spyglass: fast, scalable metadata search for large-scale … · 2019-02-25 · • cost &...

Documents