XFilter and Distributed Data Storage
Zachary G. Ives, University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 20, 2023
Some portions derived from slides by Raghu Ramakrishnan
Readings & Reminders
Homework 2 Milestone 1 due 3/1
Please read for Wednesday: Stoica et al., "Chord"
Write a 1-paragraph summary of the key ideas and post it to the discussion list
Next week: Monday will be an abbreviated (1-hour) lecture
Wednesday: guest lecture by Marie Jacob on Q – search across databases
2
3
Recall: XFilter [Altinel & Franklin 00]
4
How Does It Work?
Each XPath segment is basically a subset of regular expressions over element tags
Convert into finite state automata
Parse data as it comes in – use the SAX API
Match against finite state machines
Most of these systems use modified FSMs because they want to match many patterns at the same time
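As a concrete, purely illustrative sketch of the SAX-driven approach (not XFilter's actual code), the skeleton below shows element events arriving through the standard org.xml.sax callbacks; the StreamingMatcher name and the depth bookkeeping are hypothetical.

  import java.io.File;
  import javax.xml.parsers.SAXParser;
  import javax.xml.parsers.SAXParserFactory;
  import org.xml.sax.Attributes;
  import org.xml.sax.helpers.DefaultHandler;

  // Hypothetical skeleton: each SAX event would be handed to the matching engine,
  // which advances the per-query state machines (XFilter does this with CL/WL lists).
  public class StreamingMatcher extends DefaultHandler {
    private int depth = 0;    // current depth in the XML tree

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
      depth++;
      // look up qName in the query index and try to advance the matching automata here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      // roll back any states that were entered at this depth
      depth--;
    }

    public static void main(String[] args) throws Exception {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new File(args[0]), new StreamingMatcher());
    }
  }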
5
Path Nodes and FSMs
XPath parser decomposes XPath expressions into a set of path nodes
These nodes act as the states of the corresponding FSM
A node in the Candidate List denotes the current state
The rest of the states are in the corresponding Wait Lists
Simple FSM for /politics[@topic=“president”]/usa//body:
(FSM diagram: states Q1_1, Q1_2, Q1_3 linked by transitions on politics, usa, and body)
6
Decomposing Into Path Nodes
Each path node records:
Query ID
Position in state machine
Relative Position (RP) in tree:
  0 for the root node if it's not preceded by "//"
  -1 for any node preceded by "//"
  Else 1 + (number of "*" nodes from the predecessor node)
Level:
  If the current node has a fixed distance from the root, then 1 + distance
  Else if RP = -1, then -1, else 0
Finally, NextPathNodeSet points to the next node
Q1 = /politics[@topic="president"]/usa//body

        Position   RP   Level
  Q1-1      1       0     1
  Q1-2      2       1     2
  Q1-3      3      -1    -1

Q2 = //usa/*/body/p

        Position   RP   Level
  Q2-1      1      -1    -1
  Q2-2      2       2     0
  Q2-3      3       1     0
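Purely as an illustration, the decomposition above can be written down as one small record per path node; the PathNode class and field names below are hypothetical, but the values are exactly the Q1/Q2 entries from this slide.

  // Hypothetical representation of an XFilter path node (names are illustrative).
  public class PathNode {
    final String queryId;   // e.g., "Q1"
    final int position;     // position in the query's state machine (1-based)
    final int relativePos;  // RP: 0 = root, -1 = preceded by "//", else 1 + #wildcards
    final int level;        // expected depth; -1 or 0 when the depth is not fixed
    final String element;   // element tag this path node waits for

    PathNode(String queryId, int position, int relativePos, int level, String element) {
      this.queryId = queryId;
      this.position = position;
      this.relativePos = relativePos;
      this.level = level;
      this.element = element;
    }

    // Q1 = /politics[@topic="president"]/usa//body
    static final PathNode[] Q1 = {
      new PathNode("Q1", 1, 0, 1, "politics"),
      new PathNode("Q1", 2, 1, 2, "usa"),
      new PathNode("Q1", 3, -1, -1, "body")
    };

    // Q2 = //usa/*/body/p
    static final PathNode[] Q2 = {
      new PathNode("Q2", 1, -1, -1, "usa"),
      new PathNode("Q2", 2, 2, 0, "body"),
      new PathNode("Q2", 3, 1, 0, "p")
    };
  }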
Thinking of XPath Matching as Threads
7
What Is a Thread?
It includes a promise of CPU scheduling, plus context
Suppose we do the scheduling based on events
… Then the "thread" becomes a context:
Active state
What's to be matched next
Whether it's a final state
8
9
Query Index
Query index entry for each XML tag
Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes
“Live” queries’ states are in CL; “pending” queries + states are in WL
Events that cause state transition are generated by the XML parser
Query Index contents after adding Q1 and Q2:
  politics:  CL = {Q1-1}   WL = { }
  usa:       CL = {Q2-1}   WL = {Q1-2}
  body:      CL = { }      WL = {Q1-3, Q2-2}
  p:         CL = { }      WL = {Q2-3}
10
Encountering an Element
Look up the element name in the Query Index and examine all nodes in the associated CL
Validate that we actually have a match
Example – startElement: politics
Look up "politics" in the Query Index; its CL contains path node Q1-1, whose entry holds Query ID = Q1, Position = 1, Relative Position = 0, Level = 1, and a NextPathNodeSet.
11
Validating a Match
We first check that the current XML depth matches the level in the user query:
If the level in the CL node is less than 1, then ignore the height
Else the level in the CL node must equal the height
This ensures we're matching at the right point in the tree!
Finally, we validate any predicates against attributes (e.g., [@topic=“president”])
12
Processing Further Elements
Queries that don’t meet validation are removed from the Candidate Lists
For other queries, we advance to the next state
We copy the next node of the query from the WL to the CL, and update the RP and level
When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
When we encounter an end element, we must remove that element from the CL
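Putting slides 9–12 together, a much-simplified sketch of the Candidate List / Wait List bookkeeping could look like the following; the class and method names are made up, and predicate evaluation, relative-position checks, and end-element rollback are all omitted.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical, simplified sketch of XFilter-style event handling.
  public class QueryIndexSketch {

    // One path-node "state": the tag it waits for, its expected level, and what follows it.
    static class State {
      final String queryId; final String tag; final int level; final State next;
      State(String queryId, String tag, int level, State next) {
        this.queryId = queryId; this.tag = tag; this.level = level; this.next = next;
      }
    }

    private final Map<String, List<State>> candidateList = new HashMap<>();
    private final Map<String, List<State>> waitList = new HashMap<>();
    private int depth = 0;

    // Register a query: its first state goes in the CL, the rest go in WLs.
    void addFirstState(State s)   { candidateList.computeIfAbsent(s.tag, k -> new ArrayList<>()).add(s); }
    void addWaitingState(State s) { waitList.computeIfAbsent(s.tag, k -> new ArrayList<>()).add(s); }

    public void startElement(String tag) {
      depth++;
      for (State s : new ArrayList<>(candidateList.getOrDefault(tag, new ArrayList<>()))) {
        // level check from slide 11: level < 1 means "any depth", otherwise it must match
        if (s.level >= 1 && s.level != depth) continue;
        if (s.next == null) {
          System.out.println(s.queryId + " matched - deliver document to subscriber");
        } else {
          // advance: copy the next path node from its Wait List to its Candidate List
          waitList.getOrDefault(s.next.tag, new ArrayList<>()).remove(s.next);
          candidateList.computeIfAbsent(s.next.tag, k -> new ArrayList<>()).add(s.next);
        }
      }
    }

    public void endElement(String tag) {
      // a full implementation would also undo CL entries made at this depth
      depth--;
    }
  }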
A Simpler Approach
Instantiate a DOM tree for each document
Traverse it and recursively match XPaths
Pros and cons?
13
14
Publish-Subscribe Model Summarized
XFilter has an elegant model for matching XPaths
A good deal more complex than HW2, in that it supports wildcards (*) and //
Useful for applications with RSS (Rich Site Summary or Really Simple Syndication)
Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
An instance of a more general concept, a topic-specific crawler
Revisiting Storage and Crawling with a Distributed Spin
In recent weeks:
Index structures, primarily intended for single machines: B+ Trees
Data / document formats
Basics of single-machine crawling: seed URLs, robots.txt, etc.
Now: let's revisit (most of) the above in a setting where there are multiple machines working together!
First: storage
15
16
How Do We Distribute a B+ Tree?
We need to host the root at one machine and distribute the rest
What are the implications for scalability?
Consider building the index as well as searching it
17
Eliminating the Root
Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure
Two strategies:
Modified tree structure (e.g., BATON, Jagadish et al.)
Non-hierarchical structure
18
A “Flatter” Scheme: Hashing
Start with a hash function with a uniform distribution of values: h(name) → a value (e.g., a 32-bit integer)
Map from values to hash buckets, generally using mod (# buckets)
Put items into the buckets; may have "collisions" and need to chain
(Figure: four buckets 0–3; h(x) values 0, 4, 8, 12 all map to bucket 0 under mod 4 and are linked on an overflow chain)
19
Dividing Hash Tables Across Machines
Simple distribution – allocate some number of hash buckets to various machines
Can give this information to every client, or provide a central directory
Can evenly or unevenly distribute buckets
Lookup is very straightforward
A possible issue – data skew: some ranges of values occur frequently
Can use dynamic hashing techniques
Can use a better hash function, e.g., SHA-1 (160-bit key)
20
Some Issues Not Solved with Conventional Hashing
What if the set of servers holding the inverted index is dynamic?
Our number of buckets changes
How much work is required to reorganize the hash table?
Solution: consistent hashing
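To see the size of the problem, here is a small illustrative experiment (not from the slides): with plain mod-based placement, growing the bucket count from 8 to 9 relocates the vast majority of keys.

  // Illustrative only: counts how many of 10,000 keys move when the bucket count
  // changes from 8 to 9 under plain "hash mod #buckets" placement.
  public class RehashCost {
    public static void main(String[] args) {
      int keys = 10_000, before = 8, after = 9, moved = 0;
      for (int k = 0; k < keys; k++) {
        int h = ("key" + k).hashCode() & 0x7fffffff;   // stand-in for a uniform hash like SHA-1
        if (h % before != h % after) moved++;
      }
      System.out.printf("%d of %d keys change bucket (%.0f%%)%n",
          moved, keys, 100.0 * moved / keys);
    }
  }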
21
Consistent Hashing – the Basis of “Structured P2P”
Intuition: we want to build a distributed hash table where the number of buckets stays constant, even if the number of machines changes
Requires a mapping from hash entries to nodes
Don't need to re-hash everything if a node joins/leaves
Only the mapping (and allocation of buckets) needs to change when the number of nodes changes
Many examples: CAN, Pastry, Chord
For this course, you'll use Pastry
But Chord is simpler to understand, so we'll look at it
22
Basic Ideas
We're going to use a giant hash key space
SHA-1 hash: 20 bytes, or 160 bits
We'll arrange it into a "circular ring" (it wraps around at 2^160 to become 0)
We'll actually map both objects' keys (in our case, keywords) and nodes' IP addresses into the same hash key space
  "abacus" –SHA-1→ k10        130.140.59.2 –SHA-1→ N12
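A minimal sketch of that mapping, assuming keys and addresses are simply SHA-1-hashed strings: both end up as positions in the same 160-bit space (the k10 and N12 labels above are just shorthand for such hash values).

  import java.math.BigInteger;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;

  // Sketch: both keys (keywords) and node addresses hash into the same 160-bit
  // circular ID space; interpreting the digest as a non-negative BigInteger
  // gives a position on the ring.
  public class RingIds {
    static BigInteger sha1(String s) throws Exception {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      return new BigInteger(1, md.digest(s.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
      System.out.println("key  id for \"abacus\":     " + sha1("abacus"));
      System.out.println("node id for 130.140.59.2: " + sha1("130.140.59.2"));
    }
  }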
23
Chord Hashes a Key to its Successor
Circular hash ID space
Nodes and blocks have randomly distributed IDs
Successor: node with the next highest ID
(Figure: ring with nodes N10, N32, N60, N80, N100; key hashes k10, k11, k30, k33, k40, k52, k65, k70, k99, k112, k120 are each stored at their successor node)
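The successor rule itself is easy to sketch on a single machine with a sorted map; this shows only the placement rule, not Chord's distributed lookup, and the class name is made up.

  import java.math.BigInteger;
  import java.util.Map;
  import java.util.TreeMap;

  // Single-machine sketch of the successor rule: a key k lives at the first node
  // whose ID is >= k, wrapping around to the smallest node ID at the top of the ring.
  public class SuccessorPlacement {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();   // nodeId -> node name

    void addNode(BigInteger id, String name) { ring.put(id, name); }

    String successor(BigInteger key) {
      Map.Entry<BigInteger, String> e = ring.ceilingEntry(key);   // next node at or after the key
      if (e == null) e = ring.firstEntry();                       // wrap around the ring
      return e.getValue();
    }

    public static void main(String[] args) {
      SuccessorPlacement p = new SuccessorPlacement();
      // Using the small IDs from the figure (real IDs would be 160-bit hashes)
      for (int id : new int[] {10, 32, 60, 80, 100})
        p.addNode(BigInteger.valueOf(id), "N" + id);
      System.out.println(p.successor(BigInteger.valueOf(70)));    // -> N80
      System.out.println(p.successor(BigInteger.valueOf(112)));   // -> N10 (wraps around)
    }
  }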
24
Basic Lookup: Linear Time
(Figure: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; the question "Where is k70?" is forwarded node by node until k70's predecessor answers "N80")
Lookups find the ID's predecessor
Correct if successors are correct
25
“Finger Table” Allows O(log N) Lookups
(Figure: N80's fingers jump ½, ¼, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring)
Goal: shortcut across the ring – binary search
Reasonable lookup latency
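For intuition, the i-th finger of node n targets (n + 2^(i-1)) mod 2^m, and the finger entry stores the successor of that target; the toy computation below uses a 7-bit ring and node N80 from the figure purely for illustration.

  import java.math.BigInteger;

  // Sketch: finger targets for node n. With m bits this produces shortcuts roughly
  // half-way, quarter-way, eighth-way, ... around the ring.
  public class FingerTargets {
    public static void main(String[] args) {
      int m = 7;                                   // tiny ring (0..127) for readability
      BigInteger two = BigInteger.valueOf(2);
      BigInteger ringSize = two.pow(m);
      BigInteger n = BigInteger.valueOf(80);       // node N80 from the figure
      for (int i = 1; i <= m; i++) {
        BigInteger target = n.add(two.pow(i - 1)).mod(ringSize);
        System.out.printf("finger[%d] targets id %s (store successor(%s))%n", i, target, target);
      }
    }
  }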
26
Node Joins
How does the node know where to go? (Suppose it knows 1 peer)
What would need to happen to maintain connectivity?
What data needs to be shipped around?
(Figure: new node N120 joins a ring of N5, N10, N20, N32, N40, N60, N80, N99, N110)
27
A Graceful Exit: Node Leaves
What would need to happen to maintain connectivity?
What data needs to be shipped around?
(Figure: ring of N5, N10, N20, N32, N40, N60, N80, N99, N110, with one node departing gracefully)
28
What about Node Failure?
Suppose a node just dies?
What techniques have we seen that might help?
29
Successor Lists Ensure Connectivity
Each node stores r successors, r = 2 log N
Lookup can skip over dead nodes to find objects
(Figure: successor lists on the ring –
  N5: N10, N20, N32     N10: N20, N32, N40    N20: N32, N40, N60
  N32: N40, N60, N80    N40: N60, N80, N99    N60: N80, N99, N110
  N80: N99, N110, N5    N99: N110, N5, N10    N110: N5, N10, N20)
30
Objects are Replicated as Well
When a “dead” peer is detected, repair the successor lists of those that pointed to it
Can take the same scheme and replicate objects on each peer in the successor list
Do we need to change the lookup protocol to find objects if a peer dies?
Would there be a good reason to change the lookup protocol in the presence of replication?
What model of consistency is supported here? Why?
31
Stepping Back for a Moment: DHTs vs. Gnutella and Napster 1.0
Napster 1.0: central directory; data on peers
Gnutella: no directory; flood peers with requests
Chord, CAN, Pastry: no directory; hashing scheme to look for data
Clearly, Chord, CAN, and Pastry have guarantees about finding items, and they are decentralized
But non-research P2P systems haven't adopted this paradigm: Kazaa, BitTorrent, … still use variations of the Gnutella approach
Why? There must be some drawbacks to DHTs…
32
Distributed Hash Tables, Summarized
Provide a way of deterministically finding an entity in a distributed system, without a directory, and without worrying about failure
Can also be a way of dividing up work: instead of sending data to a node, we might send it a task
Note that it's up to the individual nodes to do things like store data on disk (if necessary; e.g., using B+ Trees)
33
Applications of Distributed Hash Tables
To build distributed file systems (CFS, PAST, …)
To distribute “latent semantic indexing” (U. Rochester)
As the basis of distributed data integration (U. Penn, U. Toronto, EPFL) and databases (UC Berkeley)
To archive library content (Stanford)
34
Distributed Hash Tables and Your Project
If you're building a mini-Google, how might DHTs be useful in:
Crawling + indexing URIs by keyword?
Storing and retrieving query results?
The hard parts:
Coordinating different crawlers to avoid redundancy
Ranking different sites (often more difficult to distribute)
What if a search contains 2+ keywords?
(You’ll initially get to test out DHTs in Homework 3)
35
From Chord to Pastry
What we saw were the basic data algorithms of the Chord system
Pastry is slightly different:
It uses a different mapping mechanism than the ring (but one that works similarly)
It doesn't exactly use a hash table abstraction – instead there's a notion of routing messages
It allows for replication of data and finds the closest replica
It's written in Java, not C
… And you'll be using it in your projects!
36
Pastry API Basics (v 1.4.3_02)
See freepastry.org for details and downloads
Nodes have identifiers that will be hashed: interface rice.p2p.commonapi.Id
2 main kinds of NodeIdFactories – we'll use socket-based
Nodes are logical entities: one machine can host more than one virtual node
Several kinds of NodeFactories: create virtual Pastry nodes
All Pastry nodes have built-in functionality to manage routing
Derive from the "common API" interface rice.p2p.commonapi.Application
37
Creating a P2P Network
Example code in DistTutorial.java
Create a Pastry node:

  Environment env = new Environment();
  PastryNodeFactory d = new SocketPastryNodeFactory(new NodeFactory(keySize), env);
  // Need to compute the InetSocketAddress of a host to be addr
  NodeHandle aKnownNode = ((SocketPastryNodeFactory) d).getNodeHandle(addr);
  PastryNode pn = d.newNode(aKnownNode);
  MyApp app = new MyApp(pn);   // MyApp is the base class of your application!

No need to call a simulator – this is real!
38
Pastry Client APIs
Based on a model of routing messages
Derive your message from class rice.p2p.commonapi.Message
Every node has an Id (NodeId implementation)
Every message gets an Id corresponding to its key
Call endpoint.route(id, msg, hint) (aka routeMsg) to send a message (endpoint is an instance of Endpoint)
The hint is the starting point, of type NodeHandle
At each intermediate point, Pastry calls a notification: forward(id, msg, nextHop)
At the end, Pastry calls a final notification: deliver(id, msg), aka messageForAppl
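A rough sketch of what a message and an application could look like against the common API; treat it as an outline rather than drop-in code: the KeywordMsg/KeywordApp names are made up, how the Endpoint is obtained differs across FreePastry releases, and recent releases pass a RouteMessage to forward() rather than the (id, msg, nextHop) form named above.

  import rice.p2p.commonapi.*;

  // Hypothetical message: carries the key's Id plus a small payload.
  class KeywordMsg implements Message {
    final Id key;
    final String payload;
    KeywordMsg(Id key, String payload) { this.key = key; this.payload = payload; }
    public int getPriority() { return Message.LOW_PRIORITY; }
  }

  // Hypothetical application built on the routing model described above.
  class KeywordApp implements Application {
    private final Endpoint endpoint;

    // The Endpoint comes from your PastryNode when the application is registered;
    // the exact call differs across FreePastry versions, so it is just passed in here.
    KeywordApp(Endpoint endpoint) { this.endpoint = endpoint; }

    void send(Id key, String payload) {
      endpoint.route(key, new KeywordMsg(key, payload), null);   // null hint: start locally
    }

    public void deliver(Id id, Message msg) {
      // final notification: this node is responsible for id
      System.out.println("Delivered " + ((KeywordMsg) msg).payload + " for key " + id);
    }

    public boolean forward(RouteMessage msg) { return true; }    // let routing proceed
    public void update(NodeHandle handle, boolean joined) { }    // neighbor set changed
  }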
39
IDs
Pastry has mechanisms for creating node IDs itself
Obviously, we need to be able to create IDs for keys
Need to use java.security.MessageDigest:

  MessageDigest md = MessageDigest.getInstance("SHA");
  byte[] content = myString.getBytes();
  md.update(content);
  byte[] shaDigest = md.digest();
  rice.pastry.Id keyId = new rice.pastry.Id(shaDigest);
40
How Do We Create a Hash Table (Hash Map/Multiset) Abstraction?
We want the following:
  put(key, value)
  remove(key)
  valueSet = get(key)
How can we use Pastry to do this?
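One possible shape of an answer, sketched without any real networking: hash the key to an Id, route a PUT/GET/REMOVE message to that Id, and let the node whose deliver() fires maintain a local multimap. Everything below is a hypothetical, single-machine stand-in for that flow.

  import java.util.Collections;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  // Conceptual, single-machine sketch of layering put/get/remove on message routing.
  // In a real Pastry app, send() would be endpoint.route(hash(key), msg, null), and
  // handle() would run inside deliver() on whichever node owns hash(key).
  public class DhtSketch {
    enum Op { PUT, GET, REMOVE }

    static class DhtMessage {
      final Op op; final String key; final String value;
      DhtMessage(Op op, String key, String value) { this.op = op; this.key = key; this.value = value; }
    }

    private final Map<String, Set<String>> localStore = new HashMap<>();

    // What the node responsible for hash(key) does when the message is delivered.
    Set<String> handle(DhtMessage m) {
      switch (m.op) {
        case PUT:
          localStore.computeIfAbsent(m.key, k -> new HashSet<>()).add(m.value);
          return null;
        case REMOVE:
          localStore.remove(m.key);
          return null;
        default:   // GET
          return localStore.getOrDefault(m.key, Collections.emptySet());
      }
    }

    public static void main(String[] args) {
      DhtSketch node = new DhtSketch();
      node.handle(new DhtMessage(Op.PUT, "penn", "https://www.upenn.edu"));
      System.out.println(node.handle(new DhtMessage(Op.GET, "penn", null)));
    }
  }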
41
Next Time
Distributed filesystem and database storage: GFS, PNUTS