TRANSCRIPT
Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
Andrew W Leung • Ethan L Miller, University of California, Santa Cruz
Minglong Shao • Timothy Bisson • Shankar Pasupathy, NetApp
7th USENIX Conference on File and Storage Technologies, February 24-27, 2009
The data management problem
• Large-scale storage systems are difficult to manage
  • Management questions become difficult or impossible to ask
  • Must sift through billions of files
• Impacts users and administrators
  • User - “Where are my recently modified presentation files?”
  • Admin - “Which apps and users are consuming the most space?”
• Challenge: need fast answers to difficult questions
• Example -- smart data restore
  • A buggy script deletes a number of users’ files
  • “Which files should be restored from backup?”
  • Search current and previous versions
  • Locate the affected files to restore
  • Find the files to restore first
What can be done?
• Need to quickly gather and search info from the storage system
  • Ad hoc queries over file properties
  • I.e., metadata search capabilities
• Metadata includes file inode and extended attributes
  • Formatted as <attribute, value> pairs
• Metadata search becoming more common
  • Desktop - Spotlight, Beagle, Windows search
  • Small-scale enterprise - IndexEngines, Google Enterprise, FAST
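As a rough illustration of an ad hoc query over <attribute, value> pairs, here is a minimal sketch. The records, attribute names, and the `search` helper are all hypothetical; this is a linear scan, not how Spyglass executes queries.

```python
from typing import Any, Dict, List

Metadata = Dict[str, Any]

# Hypothetical records: one dictionary of <attribute, value> pairs per file.
FILES: List[Metadata] = [
    {"path": "/home/andrew/talk.ppt", "owner": "andrew", "ext": "ppt",
     "size": 4096, "mtime": 1235000000},
    {"path": "/home/andrew/old.ppt", "owner": "andrew", "ext": "ppt",
     "size": 2048, "mtime": 1200000000},
    {"path": "/home/jim/notes.txt", "owner": "jim", "ext": "txt",
     "size": 512, "mtime": 1235000500},
]


def _satisfies(value: Any, want: Any) -> bool:
    """A predicate is either an exact value or a callable test on the value."""
    return bool(want(value)) if callable(want) else value == want


def search(files: List[Metadata], **predicates: Any) -> List[Metadata]:
    """Return the files whose metadata satisfies every predicate (linear scan)."""
    return [
        meta for meta in files
        if all(attr in meta and _satisfies(meta[attr], want)
               for attr, want in predicates.items())
    ]


# "Where are my recently modified presentation files?"
recent = search(FILES, owner="andrew", ext="ppt",
                mtime=lambda t: t > 1230000000)
print([m["path"] for m in recent])
```

A scan like this touches every record, which is exactly what becomes impractical at billions of files.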
How can this be done at very large scales?
Large-scale search challenges
Challenge | Existing Solutions | Why Is This Difficult?
Cost & resources | Require dedicated hardware. | Can’t afford dedicated hardware. Must share resources.
Metadata collection | Slow to collect changes. Impacts the storage system. | Collect millions of changes.
Scale & performance | Use general-purpose DBMSs. Few storage optimizations. | Searching 10^10 - 10^11 pairs.
• Large-scale search challenges are not currently addressed
  • Requires a specialized solution
Spyglass overview
• How should a solution address these challenges?
  • Leverage metadata search properties
• Metadata query properties
  • Conducted user survey
• File metadata properties
  • Analyzed 3 storage systems
• Spyglass uses new index designs:
  • Hierarchical partitioning
  • Signature files
  • Partition versioning - in the paper
  • Snapshot-based metadata crawling
[Figure: Spyglass architecture. Labels: storage system, Spyglass, crawler, cache, index, partitions, version 1 ... version n, internal file system, query, results.]
Metadata query properties
• Surveyed over 30 users and administrators
  • Asked “how would you use metadata search?”

Property | Description | Example | Solution
Multiple attributes | Includes multiple metadata attributes. | Where are my recently modified presentations? | Use all attributes in query execution.
Localized | Includes a directory pathname. | Which files in my home directory can be deleted? | Include namespace knowledge.
Back-in-time | Searches multiple metadata versions. | Which files have grown the most in the last week? | Version index changes.
File metadata properties
• Analyzed 3 storage systems at NetApp
  • Web server -- 15 million files
  • Engineering server -- 60 million files
  • Home directory servers -- 300 million files

Property | Description | Example | Solution
Spatial locality | Values clustered in namespace. | Andrew’s files mostly in /home/andrew | Allow index control using the namespace.
Skewed frequencies | A few values occur frequently. Intersections are more uniform. | Many PDFs or Andrew’s files. Fewer Andrew’s PDF files. | Query execution using intersections.
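The skewed-frequency property can be illustrated with a toy count. The sketch below uses made-up (owner, ext) pairs, not data from the NetApp traces. It shows how a single popular value matches many files while the intersection of two values matches far fewer.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical (owner, ext) pairs; not taken from the analyzed traces.
FILES: List[Tuple[str, str]] = [
    ("andrew", "pdf"), ("andrew", "c"), ("andrew", "c"), ("andrew", "h"),
    ("jim", "pdf"), ("jim", "pdf"), ("jim", "pdf"), ("jim", "txt"),
]

owner_counts = Counter(owner for owner, _ in FILES)
ext_counts = Counter(ext for _, ext in FILES)


def intersection_count(files: List[Tuple[str, str]], owner: str, ext: str) -> int:
    """Number of files that match both attribute values at once."""
    return sum(1 for o, e in files if o == owner and e == ext)


print(owner_counts["andrew"])                      # Andrew's files
print(ext_counts["pdf"])                           # PDF files
print(intersection_count(FILES, "andrew", "pdf"))  # Andrew's PDF files
```

Evaluating the intersection up front, rather than one popular value at a time, is what keeps the candidate set small.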
Hierarchical partitioning
• Partition the index using the namespace
  • Parts of the namespace are indexed in separate partitions
• Exploits spatial locality
  • Allows index control at the granularity of sub-trees
  • Uses a simple greedy algorithm
• Partitions are stored sequentially on disk
  • Fast access to each partition
[Figure: example namespace tree divided into sub-tree partitions. Level 1: /; level 2: home, proj, usr; level 3: andrew, jim, distmeta, reliability, include; level 4: thesis, scidac, src, experiments.]
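The slides do not spell out the greedy algorithm, so the following is only one plausible sketch. It assumes a bottom-up pass that folds each sub-tree into its parent's partition while a size cap allows, and otherwise closes the sub-tree as its own partition. The `TREE` layout, file counts, and `cap` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical namespace: directory -> (files directly inside, child directories).
TREE: Dict[str, Tuple[int, List[str]]] = {
    "/": (0, ["/home", "/proj", "/usr"]),
    "/home": (0, ["/home/andrew", "/home/jim"]),
    "/home/andrew": (60, []),
    "/home/jim": (50, []),
    "/proj": (10, ["/proj/distmeta", "/proj/reliability"]),
    "/proj/distmeta": (30, []),
    "/proj/reliability": (20, []),
    "/usr": (5, ["/usr/include"]),
    "/usr/include": (40, []),
}


def partition(tree: Dict[str, Tuple[int, List[str]]], root: str,
              cap: int) -> Dict[str, List[str]]:
    """Greedy bottom-up packing. Returns {partition root: directories in it}.

    Assumption: a single directory holding more than `cap` files still
    becomes one (oversized) partition; this sketch does not split directories.
    """
    partitions: Dict[str, List[str]] = {}

    def visit(directory: str) -> Tuple[List[str], int]:
        files, children = tree[directory]
        dirs, count = [directory], files
        for child in children:
            child_dirs, child_count = visit(child)
            if count + child_count <= cap:
                # Small enough: fold the child's sub-tree into this group.
                dirs.extend(child_dirs)
                count += child_count
            else:
                # Too big: the child's sub-tree becomes its own partition.
                partitions[child] = child_dirs
        return dirs, count

    root_dirs, _ = visit(root)
    partitions[root] = root_dirs
    return partitions


parts = partition(TREE, "/", cap=100)
for part_root in sorted(parts):
    print(part_root, parts[part_root])
```

With these made-up counts the namespace splits into four partitions, each a connected piece of the tree, so a query confined to one sub-tree touches few partitions.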
Query execution
• Performance is tied to the number of partitions searched
  • How do we determine which partitions to search?
• Users can localize their queries
  • Users control the size of the search space
• Signature files
  • Automatically determines which partitions to search
  • Exploits attribute value intersections
  • Searches only partitions containing all query values
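A localized query can skip every partition outside the named directory. Below is a minimal sketch of that pruning step, assuming each partition is identified by the path of its sub-tree root. The partition roots and the `partitions_for` helper are hypothetical.

```python
from typing import List


def is_under(path: str, ancestor: str) -> bool:
    """True if `path` equals `ancestor` or lies beneath it in the namespace."""
    if ancestor == "/":
        return path.startswith("/")
    return path == ancestor or path.startswith(ancestor + "/")


def partitions_for(query_dir: str, partition_roots: List[str]) -> List[str]:
    """Partitions that may hold files under `query_dir`.

    These are the partitions rooted at or below the query directory, plus
    the deepest partition whose root is an ancestor of it (which may index
    the directory itself).
    """
    selected = {r for r in partition_roots if is_under(r, query_dir)}
    covering = [r for r in partition_roots if is_under(query_dir, r)]
    if covering:
        # All candidates are prefixes of query_dir, so the longest is deepest.
        selected.add(max(covering, key=len))
    return sorted(selected)


# Hypothetical partition roots.
ROOTS = ["/", "/home/jim", "/proj", "/usr"]
print(partitions_for("/home", ROOTS))
print(partitions_for("/proj/distmeta", ROOTS))
```

The narrower the directory in the query, the fewer partitions have to be read from disk.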
Signature files
• Compactly summarizes a partition’s contents
  • A signature file for each attribute in the partition
  • Small enough to fit in memory
  • Created as files are inserted
• Only search a partition if all tested bits are 1
• False positives can be reduced by
  • Increasing signature size
  • Changing the hashing function
[Figure: two signature files shown as bit arrays (1 0 1 1 0 1 0 1 ... and 1 1 0 1 0 0 1 1 ...). File extensions (doc, xls, c, ppt, py, pl, h, mpg, jpg, mov) map to bits via hash() mod the signature size; file sizes map to bits via hash() over size ranges (<1, 1-4, 5-31, 32-127, 128-255, 256-511, ..., 100MB-500MB, >500MB). Example values: py, pl, 32-127.]
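The bit-testing rule above can be sketched as follows. This is a minimal illustration, assuming one fixed-size bit array per attribute and a CRC32 hash. The class names, the 64-bit signature size, and the hash choice are assumptions, not details from the talk.

```python
import zlib
from typing import Any, Dict, List


class Signature:
    """Fixed-size bit array summarizing the values of one attribute."""

    def __init__(self, nbits: int = 64) -> None:
        self.nbits = nbits
        self.bits = 0

    def _bit(self, value: Any) -> int:
        # hash(value) mod signature size; CRC32 is an arbitrary stable choice.
        return zlib.crc32(str(value).encode("utf-8")) % self.nbits

    def add(self, value: Any) -> None:
        self.bits |= 1 << self._bit(value)

    def may_contain(self, value: Any) -> bool:
        # A set bit may be a false positive; a clear bit is a definite miss.
        return ((self.bits >> self._bit(value)) & 1) == 1


class Partition:
    """Keeps one signature per attribute, updated as files are inserted."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.signatures: Dict[str, Signature] = {}

    def insert(self, metadata: Dict[str, Any]) -> None:
        for attr, value in metadata.items():
            self.signatures.setdefault(attr, Signature()).add(value)

    def may_match(self, query: Dict[str, Any]) -> bool:
        """Search the partition only if every tested bit is 1."""
        return all(
            attr in self.signatures and self.signatures[attr].may_contain(value)
            for attr, value in query.items()
        )


def partitions_to_search(partitions: List[Partition],
                         query: Dict[str, Any]) -> List[str]:
    return [p.name for p in partitions if p.may_match(query)]


home = Partition("home")
home.insert({"ext": "py", "owner": "andrew"})
empty = Partition("empty")
print(partitions_to_search([home, empty], {"ext": "py", "owner": "andrew"}))
```

Because each signature is only a few bytes, the whole set can stay in memory and rule partitions out before any disk read.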
Partition design
• Each partition stores metadata in a KD-tree
  • Not explicitly tied to a particular index structure
• KD-trees
  • A multi-dimensional binary tree
  • Provides fast, multi-dimensional search
  • Allows a single index structure to be used
• Performance is bound by reading partitions from disk
  • Partitions are managed by a caching sub-system
    • Uses LRU
    • Captures the likely Zipf-like query distributions
    • Ensures popular partitions are in-memory
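The caching behavior can be sketched with an ordered dictionary. This is a minimal LRU illustration; the `PartitionCache` class and its `load` callback (standing in for a read from disk) are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable


class PartitionCache:
    """Keeps the most recently used partitions in memory (LRU eviction)."""

    def __init__(self, capacity: int, load: Callable[[str], Any]) -> None:
        self.capacity = capacity
        self.load = load          # hypothetical: reads a partition from disk
        self.cached: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, partition_id: str) -> Any:
        if partition_id in self.cached:
            self.cached.move_to_end(partition_id)   # mark as most recent
            self.hits += 1
            return self.cached[partition_id]
        self.misses += 1
        partition = self.load(partition_id)
        self.cached[partition_id] = partition
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)         # evict least recent
        return partition


cache = PartitionCache(capacity=2, load=lambda pid: f"contents of {pid}")
for pid in ["a", "b", "a", "c", "b"]:
    cache.get(pid)
print(cache.hits, cache.misses, list(cache.cached))
```

Under a Zipf-like access pattern, a few popular partitions account for most requests, so even a small cache absorbs much of the disk traffic.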
Metadata collection
• Metadata collection must
  • Scale to millions of changes
  • Not degrade storage system performance
• Calculates the difference between two snapshots
  • Leverages the inode file in WAFL snapshots
  • Only re-crawls metadata for changed files
[Figure: inode file block trees for Snapshot 1 (Blocks 1-6) and Snapshot 2 (Blocks 1-6 plus Blocks 7, 8, and 9, created by an inode change).]
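The snapshot comparison can be sketched at a high level. The model below is simplified and hypothetical: each snapshot maps a position in the inode file to a (block id, inodes) pair, and an unchanged block keeps the same id in both snapshots. It is not the WAFL on-disk format.

```python
from typing import Any, Dict, Tuple

# position in inode file -> (block id, {inode number: metadata})
Snapshot = Dict[int, Tuple[int, Dict[int, Dict[str, Any]]]]


def changed_inodes(old: Snapshot, new: Snapshot) -> Dict[int, str]:
    """Return {inode: 'created' | 'modified' | 'deleted'} between snapshots."""
    changes: Dict[int, str] = {}
    for position in old.keys() | new.keys():
        old_id, old_inodes = old.get(position, (None, {}))
        new_id, new_inodes = new.get(position, (None, {}))
        if old_id == new_id:
            continue  # shared block: skip without reading its inodes
        for inode in old_inodes.keys() | new_inodes.keys():
            before, after = old_inodes.get(inode), new_inodes.get(inode)
            if before == after:
                continue
            if before is None:
                changes[inode] = "created"
            elif after is None:
                changes[inode] = "deleted"
            else:
                changes[inode] = "modified"
    return changes


snap1: Snapshot = {
    0: (4, {100: {"size": 10}, 101: {"size": 20}}),
    1: (5, {200: {"size": 30}}),
    2: (6, {300: {"size": 40}}),
}
snap2: Snapshot = {
    0: (4, {100: {"size": 10}, 101: {"size": 20}}),   # block shared, unchanged
    1: (5, {200: {"size": 30}}),                      # block shared, unchanged
    2: (9, {300: {"size": 45}, 301: {"size": 1}}),    # rewritten block
}
print(changed_inodes(snap1, snap2))
```

Only the rewritten block is inspected, so the cost tracks the amount of change rather than the total number of files.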
Evaluation
• Evaluate performance, scalability, and efficiency
  • Using our real-world traces from NetApp
• In the talk we evaluate
  • Metadata collection
    • Compare performance to a simple straw-man
  • Search performance
    • Compare performance to PostgreSQL and MySQL
• In the paper we also evaluate
  • Update performance
  • Space requirements
  • Index locality
  • Versioning overhead
Collection performance
• Compare snapshot-based collection (SB)
  • To a parallel file system crawl (SM)
[Figure: baseline crawl. Time (min), 0-350, vs. files (millions), 0-100, for SM and SB.]
• Baseline crawl
  • Reads entire inode file
  • 10x faster than SM
[Figure: incremental crawl. Time (min), 0-350, vs. files (millions), 0-100, for SM and SB at 2%, 5%, and 10% change. Annotations: 10x, > 4 hr.]
• Incremental crawl
  • 2%, 5%, and 10% change
  • Finishes in less than 45 min
Search performance
• Set 1: 75% of queries finish in less than 1 second
  • Many partitions are eliminated from search
• Set 2: Localization significantly improves performance
[Figure: query execution time (100us to 10s, log scale) vs. query selectivity (.000001 to .1) for Spyglass, System X, and System Y.]
Figure 4.10: Comparison of Selectivity Impact. The selectivity of queries in my query set is plotted against
the execution time for that query. I find that query performance in Spyglass is much less correlated to the
selectivity of the query predicates than the DBMSs, which are closely correlated with selectivity.
Set | Search | Metadata Attributes
Set 1 | Which user and application files consume the most space? | Sum sizes for files using owner and ext
Set 2 | How much space in this part do files from query 1 consume? | Use query 1 with an additional directory path.
Table 4.5: Query Sets. A summary of the searches used to generate my evaluation query sets.
to store the base table and index structures. Figure 4.9 shows that this approach can incur over a 100%
overhead.
Selectivity Impact. In this experiment, I evaluated the effect of metadata selectivity on the performance of
Spyglass and the DBMSs. I again generated query sets of ext and owner from the Web trace with varying
selectivity—the ratio of the number of results to all records. Figure 4.10 plots query selectivity against query
execution time. I found that the performance of PostgreSQL and MySQL is highly correlated with query
selectivity. However, this correlation is much weaker in Spyglass, which exhibits much more variance.
For example, a Spyglass query with selectivity 7 × 10^-6 runs in 161ms while another with selectivity
8 × 10^-6 requires 3ms. This variance is caused by the higher sensitivity of Spyglass to hierarchical locality
and query locality, as opposed to simple query selectivity. This behavior is unlike that of a DBMS, which
accesses records from disk based on the predicate it thinks is the most selective. The weak correlation with
selectivity in Spyglass means it is less affected by the highly skewed distribution of storage metadata which
makes determining selectivity difficult.
Search Performance. To evaluate Spyglass search performance I generated sets of queries derived from
real-world queries in my user study; there are, unfortunately, no standard benchmarks for file system search. My
first set is based on a storage administrator searching for the user and application files that are consuming the
most space—an example of a simple two attribute search. The second set is an administrator localizing the
same search to only part of the namespace, which shows how localizing the search changes performance.
• Evaluate 100 queries generated from the 300 million file trace
[Figure: two panels, each plotting fraction of queries (0-1) vs. query execution time (100ms to 100s) for Spyglass, Postgres, and MySQL.]
Conclusions
• Metadata search can greatly improve how we manage data
• Large-scale storage systems present unique challenges
  • Cost & resources, metadata collection, performance & scalability
• There are opportunities to leverage query and file properties
  • Conducted a survey of real users and administrators
  • Analyzed real-world large-scale storage systems
• Spyglass is a new metadata search design
  • Hierarchical partitioning
  • Partition versioning
  • Snapshot-based metadata collection
Thank you! Questions?
Thanks to our sponsors:
Partition versioning
• Index versioning provides
  • Back-in-time search capabilities
  • Fast, out-of-place index updates
• Each partition manages its own versions with a version vector
  • Exploits file update locality
• Each version is a batch of index updates
  • Represents the state of metadata at a given time
  • Absorbs frequent file re-modifications
  • However, creates a stale index
Versioning design
• Versions contain incremental metadata changes
  • Changes roll results forward
• Stored sequentially with the partition
  • Updates are fast - small sequential writes
  • Search overhead is low - longer sequential reads
[Figure: timeline T0, T1, T2, T3 over a namespace tree (/; home, proj, usr; john, jim, distmeta, reliability, include; thesis, scidac, src, experiments). Each partition holds a baseline index followed by its own incremental indexes. Version labels shown: T0 T2, T0, T0 T2 T3.]
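Rolling results forward through incremental versions can be sketched as follows. This is a minimal model, assuming each version records only the files changed since the previous one, with `None` marking a deletion. The `VersionedPartition` class and its methods are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Metadata = Dict[str, Any]
Changes = Dict[str, Optional[Metadata]]   # path -> new metadata, None = deleted


class VersionedPartition:
    """A baseline index plus incremental versions, kept in time order."""

    def __init__(self, baseline: Dict[str, Metadata]) -> None:
        self.baseline = dict(baseline)
        self.versions: List[Tuple[int, Changes]] = []

    def add_version(self, time: int, changes: Changes) -> None:
        # Out-of-place update: append a small batch, never rewrite the baseline.
        # Assumes versions are added in increasing time order.
        self.versions.append((time, dict(changes)))

    def state_at(self, time: int) -> Dict[str, Metadata]:
        """Start from the baseline and roll forward through versions <= time."""
        state = dict(self.baseline)
        for version_time, changes in self.versions:
            if version_time > time:
                break
            for path, meta in changes.items():
                if meta is None:
                    state.pop(path, None)
                else:
                    state[path] = meta
        return state


part = VersionedPartition({"/home/john/thesis.tex": {"size": 100}})
part.add_version(2, {"/home/john/thesis.tex": {"size": 150},
                     "/home/john/notes.txt": {"size": 5}})
part.add_version(3, {"/home/john/notes.txt": None})

# Back-in-time question: how much has the thesis grown since T0?
growth = (part.state_at(3)["/home/john/thesis.tex"]["size"]
          - part.state_at(0)["/home/john/thesis.tex"]["size"])
print(growth)
```

Appending a version is a small sequential write, while a query pays for reading the baseline plus whichever versions it needs.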
Update performance
• Between 8x and 44x faster than DBMSs
  • DBMS load table and build indexes
• Scales linearly
[Figure: update time (s, log scale, 0 to 250000) for the Web, Eng, and Home traces. Series: Spyglass, PostgreSQL Table, PostgreSQL Index, MySQL Table, MySQL Index. Bar labels: 3m52s, 48m2s, 45m6s, 20m12s, 2h44m43s, 14h50m5s, 1h38m26s, 18h7m22s, 2d11h31m33s.]
• Build baseline for each full snapshot
Index locality
• Attribute intersections reduce the search space
  • 50% of queries access less than 2% of partitions
• Selective attributes improve cache hit ratio
  • 95% of queries have 95% cache hits
• Evaluate how partitions are queried and cached
• Generate queries based on attribute distributions
[Figure: percent of sub-tree partitions queried (0-100) vs. percent of queries (0-100) for ext, owner, and ext/owner.]
[Figure: percent of cache hits (0-100) vs. percent of queries (0-100) for ext, owner, and ext/owner.]
Versioning overhead
• Each version adds 10% runtime overhead
  • Overhead is not evenly distributed
• 50% of queries have less than a 5 ms overhead
  • A few queries contribute most to overhead
• Evaluate overhead of 1 to 3 days of changes
[Figure: total run time (s), 0-500, vs. number of versions (0-3).]
[Figure: fraction of queries (0-1) vs. query overhead (1ms to 10s) for 1, 2, and 3 versions.]