Path Processing using Solid State Storage
August 2012 © 2012 IBM Corporation
Manos Athanassoulis, DIAS, EPFL*
Mustafa Canim, IBM Watson Research Labs
Kenneth Ross, IBM Watson Research Labs, Columbia University
Bishwaranjan Bhattacharjee, IBM Watson Research Labs
*Work done during an internship at IBM.
Why Path Processing?
Applications use linkage information
• Social
• Scientific
• Government
• Financial
• Knowledge: Watson (Jeopardy champion)
Graph processing is not enough
• Link type modeled by RDF
Why Solid State Storage (SSS)?
Increasing capacity
• Exponential increase
• Follows Moore’s law
Read performance
• Orders of magnitude (OOM) faster than disks
Random read performance
• Crucial for path processing
New technologies
• Flash already mature
• Phase Change Memory (PCM)
• … more technologies are coming
Path processing
1) Cannot prefetch
2) Retrieve-data-then-follow-link
3) A lot of useless data is retrieved
How can Solid State Storage help?
Path processing (and Solid State Storage)
1) Small access latency
2) Read mostly useful data
3) Efficient random I/O accesses
4) Can we do something better?
Build SSS-aware systems
In the rest of the talk …
RDF data model and systems
Solid State Storage for Path Processing
• Technology
• Flash vs PCM
Storing and managing RDF data over Solid State Storage
Conclusions
Resource Description Framework (RDF) meta-data model
Data is represented in statements, each one comprising a triple:
Statement: <Subject, Predicate, Object>
Each statement describes a property of a subject:
<“IBM”, “is-a”, “Corporation”>
or a connection between two objects:
<“Manos”, “interned-at”, “IBM”>
or a value of a property of a subject:
<“Manos”, “born-in”, “1984”>
The notation is more complex:
• Subjects are Uniform Resource Identifiers (URIs)
• Predicates are URIs
• Objects are either URIs or literals
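The statement model above can be sketched in a few lines of Python. This is only an illustration: the `Triple` class and its `is_literal` flag are hypothetical names, and real subjects and predicates would be full URIs rather than the short strings used on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <Subject, Predicate, Object>."""
    subject: str      # a URI in real RDF data (short names used here)
    predicate: str    # a URI in real RDF data
    obj: str          # either a URI or a literal value
    is_literal: bool  # hypothetical flag: True when obj is a literal


statements = [
    Triple("IBM", "is-a", "Corporation", False),   # property of a subject
    Triple("Manos", "interned-at", "IBM", False),  # connection between two objects
    Triple("Manos", "born-in", "1984", True),      # literal value of a property
]

# Objects that are resources (not literals) are links a query can follow.
links = [(t.subject, t.obj) for t in statements if not t.is_literal]
```

Only the non-literal objects produce links, which is what makes the data a graph that path queries can traverse.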
RDF data management
Two alternatives are used to store data:
Relational RDF storage
• Use existing relational stores
• Create relational tables
• Basic approach: A triple-store
• One big table with three columns
Native RDF storage
• Tailored to the needs of the specific workload
• No underlying system assumed
Can we take the best of both worlds?
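As a rough illustration of the basic relational approach, the sketch below builds a triple-store as one table with three columns, using Python's built-in `sqlite3` module as a stand-in for an existing relational store. The table name `triples` and the column names are assumptions made for the example.

```python
import sqlite3

# In-memory database standing in for an existing relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Manos", "interned-at", "IBM"),
        ("IBM", "is-a", "Corporation"),
        ("Manos", "born-in", "1984"),
    ],
)

# A two-hop link-following query needs one self-join per extra hop:
# start at "Manos", follow "interned-at", then follow "is-a".
two_hop = conn.execute(
    """
    SELECT t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t2.subject = t1.object
    WHERE t1.subject = ? AND t1.predicate = ? AND t2.predicate = ?
    """,
    ("Manos", "interned-at", "is-a"),
).fetchall()
```

Each additional hop adds another self-join on the same table, which is one reason a workload-specific native layout can be attractive for long paths.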
Outline
RDF data model and systems
Solid State Storage for Path Processing
• Technology
• Flash vs PCM
Storing and managing RDF data over Solid State Storage
Conclusions
Solid State Storage facts
We have access to a PCI-based PCM prototype (compared with FusionIO)
PCM prototype vs Flash state-of-the-art

4K accesses            PCM prototype*   FusionIO
Read BW                800MB/s          700MB/s
Read latency (HW)      20µs (4KB)       50µs (512B)
Write BW               40MB/s           550MB/s
Write latency (HW)     250µs (4KB)      <200µs (4KB)
Endurance              1M cycles        100K cycles
Write Cap (TB/GB)      1000             590

4K accesses            PCM prototype*   FusionIO
Read latency (SW+HW)   36µs             72µs
STDEV                  3%               60%
Write latency (SW+HW)  386µs            241µs
STDEV                  11%              20%

*Very early Micron PCM prototype
Exploiting Solid State Storage for path processing
Path processing involves link-following queries
• Access latency is critical
Solid State Storage is tailored for path processing:
• Orders of magnitude (OOM) lower read latency than traditional storage
• Very fast random-read performance
PCM is expected to outperform Flash in read performance
Next in this talk:
• PCM vs Flash when running link-following queries
• Storing and managing RDF data on Solid State Storage
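To see why access latency is critical, consider a back-of-the-envelope model in which a link-following query is a chain of dependent reads, so its latency is roughly the number of hops times the per-read latency. The sketch below applies that model to the 4K read latencies (SW+HW) reported for the two devices earlier in the talk. The function name and the model itself are assumptions; it ignores CPU time, queueing and latency variance.

```python
# Measured single-access read latencies (SW+HW, 4K) quoted earlier in the talk.
READ_LATENCY_US = {"PCM prototype": 36, "FusionIO": 72}


def query_latency_us(hops: int, read_latency_us: int) -> int:
    """Estimated latency of one link-following query with `hops` dependent reads.

    Assumption: reads are strictly sequential and CPU time is negligible.
    """
    return hops * read_latency_us


# Estimate a 100-hop query on each device.
estimates = {dev: query_latency_us(100, lat) for dev, lat in READ_LATENCY_US.items()}
```

Under this simplified model a 100-hop query takes about 3.6 ms on the PCM prototype and about 7.2 ms on FusionIO, so the per-read latency gap carries over directly to the whole query.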
PCM vs Flash in path processing
Prototype implementation of link-following queries
Workload: Given a randomly generated graph, execute link-following queries of variable length without buffering
Graph generation
• 5GB synthetic data with a random number of edges (between 3 and 30 edges per vertex)
Querying parameters
• Number of threads (1, 2, 4, 8, 16, 32, 64, 96, 128, 192)
• Page size (4K, 8K, 16K, 32K)
• Length of the query (2, 4, 10, 100 accesses per query)
Hypothesis: PCM can offer important performance improvements
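The workload can be sketched as follows. This is a simplified, in-memory illustration rather than the prototype itself: the real experiment stores the graph in pages on the device and reads them without buffering, while here the adjacency lists live in a Python list. The function names and the small graph size are assumptions.

```python
import random


def generate_graph(num_vertices: int, rng: random.Random) -> list[list[int]]:
    """Give every vertex between 3 and 30 outgoing edges to random vertices."""
    return [
        [rng.randrange(num_vertices) for _ in range(rng.randint(3, 30))]
        for _ in range(num_vertices)
    ]


def follow_links(graph: list[list[int]], start: int, accesses: int,
                 rng: random.Random) -> list[int]:
    """Visit `accesses` vertices, each chosen from the previous vertex's edges.

    Every step depends on the data fetched by the step before it, so the
    next access cannot be issued (or prefetched) in advance.
    """
    path = [start]
    while len(path) < accesses:
        neighbours = graph[path[-1]]         # retrieve data ...
        path.append(rng.choice(neighbours))  # ... then follow a link
    return path


rng = random.Random(42)  # fixed seed so the run is repeatable
graph = generate_graph(1_000, rng)
path = follow_links(graph, start=0, accesses=100, rng=rng)
```

In the experiment each `graph[...]` lookup would be a page read from PCM or Flash, which is where the device's random-read latency dominates.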
PCM vs Flash
Query length: 100 hops
PCM performs consistently better for smaller page granularities
Pythia: An RDF repository for Solid State Storage
Building an SSS-aware RDF repository
We focused on building a graph-based RDF repository
We need to design a new system which:
• Takes into account the graph structure of the data
• Supports any RDF-based query
We introduce Pythia, a new RDF repository, which uses:
• The notion of RDF-tuple
• New internal structures
• New data layout
RDF-tuple
<Subject>,
<Predicate1>, {<Object1_1>, <Object1_2>, …},
<Predicate2>, {<Object2_1>, <Object2_2>, …},
…
<PredicateN>, {<ObjectN_1>, <ObjectN_2>, …}
The RDF-tuple design:
• allows us to locate within a page the most important information of a Subject
• allows us to avoid repeating redundant information (Subject and Predicate resources)
• This is further optimized by the URL Dictionary
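A minimal sketch of how flat triples could be grouped into RDF-tuples, so that each Subject and each Predicate is stored once with the list of its Objects. The function name `build_rdf_tuples` and the nested-dictionary representation are assumptions for illustration; Pythia's actual on-page format is not shown here.

```python
from collections import defaultdict

# Flat triples using the placeholder names from the RDF-tuple notation.
triples = [
    ("S", "P1", "O1_1"),
    ("S", "P1", "O1_2"),
    ("S", "P2", "O2_1"),
    ("T", "P1", "O1_1"),
]


def build_rdf_tuples(triples):
    """Group triples as subject -> predicate -> list of objects."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {s: dict(preds) for s, preds in grouped.items()}


rdf_tuples = build_rdf_tuples(triples)
```

Four triples collapse into two RDF-tuples, and the subject "S" with its predicate "P1" is no longer repeated for every object.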
Pythia
[Architecture diagram. Components shown, placed across DRAM and SSS: Query Engine; two Hash Indexes; Main storage (S, P, O); Aux storage (O, P, S); Literals Dictionary; URL Dictionary; Repository for Very Large Objects]
Data layout on Pythia
[Figure: a page holding Tuple 0 to Tuple 3. Each tuple stores:]
• Tuple metadata
• Subject resource
• Predicates’ dictionary IDs
• Objects: (if literal) literal dictionary ID
• Objects: (else) object resource and pageID, tupleID
Field sequence shown in the figure: LEN, Sptr, nofP, then (Optr, dicID) pairs; per predicate nofO, then (local, ORpt, pID, tID) groups; followed by <Subject>, <Object1_1>, <Object1_2>, …, <Object2_1>, …
Storing Yago2 using Pythia
Yago2 is a semantic knowledge base, introduced by the Max-Planck Institute in 2007, derived from Wikipedia, WordNet, and GeoNames (currently ~10M entries, 460M facts).
Yago2 in Pythia
Initial data: 2.3GB
Main DB files: 1.3GB
Large objects: 192MB
• Can be aggressively decreased with page-level compression (tuples will move to the main file as well)
Indexes: 121MB (hash-based, in memory)
Dictionaries: 569MB
• Possible optimization: Take into account the type of literal (now string)
More than 99% of the SPO tuples can fit in a single 4K page
Evaluating Pythia (Setup & Dataset)
Prototype C++ implementation
System setup
• 24-core Intel Xeon X560 with Linux x86_64 (2.6.32-28)
• 32GB of memory
• 12GB PCM card (Micron prototype card)
• 74GB Flash card (fusionIO)
Workload: Yago2
Queries: a mix of 6 queries with randomized parameters
How often can you ask Pythia?
How fast does Pythia answer?
Pythia vs RDF-3X
RDF-3X is the de facto research state-of-the-art
• Data is kept in a virtual table and accessed through compressed indexes
• 6 indexes (all permutations of S, P, O) and 3 aggregate indexes
Indexes: SPO, SOP, OSP, OPS, PSO, POS
Aggregate indexes: (S, Count), (P, Count), (O, Count)
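The idea of keeping one index per permutation of S, P, O can be sketched with `itertools.permutations`. This is not RDF-3X's implementation: RDF-3X uses compressed indexes, and plain sorted lists stand in for them here. The `POSITION` mapping and the example triples are assumptions for illustration.

```python
from itertools import permutations

triples = [
    ("IBM", "is-a", "Corporation"),
    ("Manos", "interned-at", "IBM"),
    ("Manos", "born-in", "1984"),
]

# Position of each component inside a (subject, predicate, object) triple.
POSITION = {"S": 0, "P": 1, "O": 2}

# One sorted copy of the data per ordering: SPO, SOP, PSO, POS, OSP, OPS.
indexes = {}
for order in permutations("SPO"):
    name = "".join(order)
    indexes[name] = sorted(
        tuple(t[POSITION[c]] for c in order) for t in triples
    )
```

With all six orderings available, any triple pattern with bound components can be answered by a range scan on the index whose prefix matches the bound positions, at the cost of storing the data six times.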
Pythia vs RDF-3X
Q1: Find all male citizens of Greece.
Q2: Find all OECD member economies that Switzerland deals with.
Q3: Find all mafia films that Al Pacino acted in.
Size on disk for Yago2:
Raw data: 2.3GB
Pythia: 2.2GB (no compression)
• 1.5GB db files (on disk)
• 0.7GB dictionaries/indexes (loaded in memory during startup)
RDF-3X: 2.2GB (aggressive compression)
• 2.2GB in a single file (on disk)
Conclusions
Solid State Storage is naturally tailored for path processing
• PCM, Flash and more new technologies
PCM’s comparative advantage against Flash is lower read latency
• 1.5x-2.5x speedup in a workload with dependent reads
Pythia: A solid-state-storage-aware path-processing system
• 1.5x-2.5x higher bandwidth on PCM compared to Flash
• 1.5x-2.0x lower response times on PCM compared to Flash
• Competitive against state-of-the-art (RDF-3X)
Thank you!
Pythia (Greek: Πυθία; IPA pɪθiːɑː), commonly known as the Oracle of Delphi, was the priestess at the Temple of Apollo at Delphi, located on the slopes of Mount Parnassus, delivering prophecies.