an algebra for basic graph patternsconferenze.dei.polimi.it/lid2008/lid08fletcher.pdf · an algebra...
Post on 29-Sep-2020
1 Views
Preview:
TRANSCRIPT
An Algebra forBasic Graph Patterns
George FletcherWashington State University, Vancouver
LID 2008, Rome19-20 May 2008
(web) data:
‣ social
‣ semantic
‣ structured, semi-structured, unstructured
semantic web, data spaces, web 3.0, ...
data model?
requirements
1. “things” and their relationships
2. don’t force artificial data/metadata distinctions
3. support both structure and no structure
Resource Description Framework
• W3C standard
• data model for social semantic web
Resource Description Framework
• W3C standard
• data model for social semantic web
• triples
‣ dataspace = finite ternary relation over “things”
‣ things = URI’s, literals
• fulfills 1, 2, and 3
An Algebra For Basic Graph Patterns
George H.L. Fletcher
School of Engineering and Computer ScienceWashington State University, Vancouver
fletcher@vancouver.wsu.edu
Abstract. Motivated by recent developments in the dataspaces, web,and personal information management communities, we outline researchdirections on query processing for SPARQL, the W3C recommendationlanguage for querying RDF triple stores. The core of each SPARQL queryis a basic graph pattern (BGP). BGP is a little logic for extracting sub-sets of related nodes in an RDF graph. In this paper we undertake aformal study of BGP with an eye towards e!cient SPARQL query evalu-ation. Our main contributions are (1) an algebraization of BGP, and (2)first steps towards a framework for the design of structural indexes toaccelerate processing of queries in this algebra.
1 Introduction
The flexibility and fundamental character of ternary relations for data modelingwas recognized very early in the development of modern logic [34]. Althoughthere were some initial e!orts towards realizing this idea in information systems(e.g., [31]), “triples” have only recently gained traction with the developmentof the W3C Resource Description Framework (RDF) [24, 29]. As a data modelfor capturing “things” and their “relationships,” treating each as co-equal firstclass objects, RDF successfully blurs the somewhat arbitrary distinction one istypically forced to make between “data” and “metadata.”1
Example 1 Consider the following set of atoms, denoting a small subset of theobjects in a particular individual’s dataspace of information:
{Yamada, McShea, Herzog, doc1, doc2, doc3, knows, performed, authored,type, social action, is a kind of, past action, created on, 14.5.08, PDF, MP3}
A sample of some of the (triple) relationships holding between these objects isgiven in Figure 1(a). For example, the first triple asserts that Yamada authoredDocument 1; the second triple asserts that Yamada knows McShea; and the thirdtriple asserts that ‘knows’ is a kind of social action. This “graph” of relationshipscan be visualized as in Figure 1(b).
1 Note that RDF has a!nities to data models previously proposed in the context ofgraph databases (e.g., [5, 18]) and data integration systems (e.g., [4, 30, 47]).
RDF
˘!Yamada, authored, doc1",!Yamada, knows, McShea",!knows, is a kind of, social action",!Herzog, authored, doc2",!Herzog, authored, doc3",!McShea, performed, doc3",!McShea, past action, authored",!doc1, type, PDF",!doc2, type, MP3",!doc3, type, MP3",!doc3, created on, 14.5.08"
¯
(a) A triple graph
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
(b) A visualization of this graph
Fig. 1. A small subset of a personal dataspace graph.
The flexibility of the RDF data model has made it a natural choice for tech-nologies being developed in the social and semantic web research communities[2, 40]. In addition to the central role RDF plays in these areas, recent e!ortstowards the development of dataspaces [13, 20] of personal information [26] havealso highlighted the natural fit of triples in modeling modern data scenarios. Asthe use of RDF grows, the need for RDF data management becomes increasinglycrucial.
There has been an explosion of proposals for RDF query languages [16]. TheW3C recommendation query language SPARQL [36], however, is emerging asthe most mature and accepted. SPARQL is a declarative language with a syntaxvery much like SQL.
Example 2 Consider the query “Retrieve authors and the types of their docu-ments.” In SPARQL, where variables are identified by a preceding ?, this querycan be posed against the graph in Figure 1(a) as follows:
SELECT ?a ?tWHERE { ?a authored ?d . ?d type ?t . }
The WHERE clause of a SPARQL query specifies a basic graph pattern (BGP).Such patterns, which are at the heart of all SPARQL queries, identify a subsetof related atoms to be extracted from the RDF graph, which is then returned asa set of variable mappings.
Example 3 The BGP of Example 2 is (?a authored ?d) ; (?d type ?t). Thesemantics of this pattern with respect to the graph of Figure 1(a) is a set ofvariable mappings, each of which makes the pattern hold in the graph, i.e.,
1 2 3Yamada doc1 PDFHerzog doc2 MP3Herzog doc3 MP3
RDF
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
˘!Yamada, authored, doc1",!Yamada, knows, McShea",!knows, is a kind of, social action",!Herzog, authored, doc2",!Herzog, authored, doc3",!McShea, performed, doc3",!McShea, past action, authored",!doc1, type, PDF",!doc2, type, MP3",!doc3, type, MP3",!doc3, created on, 14.5.08"
¯
(a) A triple graph
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
(b) A visualization of this graph
Fig. 1. A small subset of a personal dataspace graph.
The flexibility of the RDF data model has made it a natural choice for tech-nologies being developed in the social and semantic web research communities[2, 40]. In addition to the central role RDF plays in these areas, recent e!ortstowards the development of dataspaces [13, 20] of personal information [26] havealso highlighted the natural fit of triples in modeling modern data scenarios. Asthe use of RDF grows, the need for RDF data management becomes increasinglycrucial.
There has been an explosion of proposals for RDF query languages [16]. TheW3C recommendation query language SPARQL [36], however, is emerging asthe most mature and accepted. SPARQL is a declarative language with a syntaxvery much like SQL.
Example 2 Consider the query “Retrieve authors and the types of their docu-ments.” In SPARQL, where variables are identified by a preceding ?, this querycan be posed against the graph in Figure 1(a) as follows:
SELECT ?a ?tWHERE { ?a authored ?d . ?d type ?t . }
The WHERE clause of a SPARQL query specifies a basic graph pattern (BGP).Such patterns, which are at the heart of all SPARQL queries, identify a subsetof related atoms to be extracted from the RDF graph, which is then returned asa set of variable mappings.
Example 3 The BGP of Example 2 is (?a authored ?d) ; (?d type ?t). Thesemantics of this pattern with respect to the graph of Figure 1(a) is a set ofvariable mappings, each of which makes the pattern hold in the graph, i.e.,
1 2 3Yamada doc1 PDFHerzog doc2 MP3Herzog doc3 MP3
RDF
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
˘!Yamada, authored, doc1",!Yamada, knows, McShea",!knows, is a kind of, social action",!Herzog, authored, doc2",!Herzog, authored, doc3",!McShea, performed, doc3",!McShea, past action, authored",!doc1, type, PDF",!doc2, type, MP3",!doc3, type, MP3",!doc3, created on, 14.5.08"
¯
(a) A triple graph
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
(b) A visualization of this graph
Fig. 1. A small subset of a personal dataspace graph.
The flexibility of the RDF data model has made it a natural choice for tech-nologies being developed in the social and semantic web research communities[2, 40]. In addition to the central role RDF plays in these areas, recent e!ortstowards the development of dataspaces [13, 20] of personal information [26] havealso highlighted the natural fit of triples in modeling modern data scenarios. Asthe use of RDF grows, the need for RDF data management becomes increasinglycrucial.
There has been an explosion of proposals for RDF query languages [16]. TheW3C recommendation query language SPARQL [36], however, is emerging asthe most mature and accepted. SPARQL is a declarative language with a syntaxvery much like SQL.
Example 2 Consider the query “Retrieve authors and the types of their docu-ments.” In SPARQL, where variables are identified by a preceding ?, this querycan be posed against the graph in Figure 1(a) as follows:
SELECT ?a ?tWHERE { ?a authored ?d . ?d type ?t . }
The WHERE clause of a SPARQL query specifies a basic graph pattern (BGP).Such patterns, which are at the heart of all SPARQL queries, identify a subsetof related atoms to be extracted from the RDF graph, which is then returned asa set of variable mappings.
Example 3 The BGP of Example 2 is (?a authored ?d) ; (?d type ?t). Thesemantics of this pattern with respect to the graph of Figure 1(a) is a set ofvariable mappings, each of which makes the pattern hold in the graph, i.e.,
1 2 3Yamada doc1 PDFHerzog doc2 MP3Herzog doc3 MP3
RDF
subjects predicates objects
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
˘!Yamada, authored, doc1",!Yamada, knows, McShea",!knows, is a kind of, social action",!Herzog, authored, doc2",!Herzog, authored, doc3",!McShea, performed, doc3",!McShea, past action, authored",!doc1, type, PDF",!doc2, type, MP3",!doc3, type, MP3",!doc3, created on, 14.5.08"
¯
(a) A triple graph
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
(b) A visualization of this graph
Fig. 1. A small subset of a personal dataspace graph.
The flexibility of the RDF data model has made it a natural choice for tech-nologies being developed in the social and semantic web research communities[2, 40]. In addition to the central role RDF plays in these areas, recent e!ortstowards the development of dataspaces [13, 20] of personal information [26] havealso highlighted the natural fit of triples in modeling modern data scenarios. Asthe use of RDF grows, the need for RDF data management becomes increasinglycrucial.
There has been an explosion of proposals for RDF query languages [16]. TheW3C recommendation query language SPARQL [36], however, is emerging asthe most mature and accepted. SPARQL is a declarative language with a syntaxvery much like SQL.
Example 2 Consider the query “Retrieve authors and the types of their docu-ments.” In SPARQL, where variables are identified by a preceding ?, this querycan be posed against the graph in Figure 1(a) as follows:
SELECT ?a ?tWHERE { ?a authored ?d . ?d type ?t . }
The WHERE clause of a SPARQL query specifies a basic graph pattern (BGP).Such patterns, which are at the heart of all SPARQL queries, identify a subsetof related atoms to be extracted from the RDF graph, which is then returned asa set of variable mappings.
Example 3 The BGP of Example 2 is (?a authored ?d) ; (?d type ?t). Thesemantics of this pattern with respect to the graph of Figure 1(a) is a set ofvariable mappings, each of which makes the pattern hold in the graph, i.e.,
1 2 3Yamada doc1 PDFHerzog doc2 MP3Herzog doc3 MP3
RDF
RDF/XML, N3, NTriples, ...
... an old idea
Charles S. Peirce (1839-1914)
SPARQL
SPARQL
• W3C standard RDF query language
SPARQL
• W3C standard RDF query language
• SQL-like syntax
SELECT ?a ?tWHERE{
?a authored ?d .?d type ?t
}
SPARQL
• W3C standard RDF query language
• SQL-like syntax
SELECT ?a ?tWHERE{
?a authored ?d .?d type ?t
}
• query processing?
SPARQL
• core of expression is a basic graph pattern
SELECT ?a ?tWHERE{
?a authored ?d .?d type ?t
}
Basic graph patterns
(v1, v2, v3); (v4, v5, v6); v2 = authored; v3 = v4; v5 = type
SELECT ?a ?tWHERE{
?a authored ?d .?d type ?t
}
Basic graph patterns
(v1, v2, v3); (v4, v5, v6); v2 = authored; v3 = v4; v5 = type
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
Basic graph patterns
(v1, v2, v3); (v4, v5, v6); v2 = authored; v3 = v4; v5 = type
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
1 2 3 4 5 6Yamada authored doc1 doc1 type PDFHerzog authored doc2 doc2 type MP3Herzog authored doc2 doc2 type MP3
Basic graph patterns
1 6Yamada PDFHerzog MP3
(v1, v2, v3); (v4, v5, v6); v2 = authored; v3 = v4; v5 = type
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
Basic graph patterns
(v1, v2, v3); · · · ; (v3n!2, v3n!1, v3n); c1; · · · ; cm
Basic triple algebra
E ::= ! | E · E | !c(E)
Basic triple algebra
Given RDF graph G
E ::= ! | E · E | !c(E)
[[!]]G = G
[[E1 · E2]]G = [[E1]]G " [[E2]]G[[!c(E)]]G = {m # [[E]]G | m satisfies c}
(v1, v2, v3); (v4, v5, v6); v2 = authored; v3 = v4; v5 = type
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
Basic triple algebra
doc3doc2
MP3
Yamada Herzog McShea
socialaction
14.5.08
doc1
type
knows
authored performed
created on
past action
is a kind of
1 2 3 4 5 6Yamada authored doc1 doc1 type PDFHerzog authored doc2 doc2 type MP3Herzog authored doc2 doc2 type MP3
Basic triple algebra!3=4(!2=authored(!) · !2=type(!))
FactFor every BGP there is a semantically equivalent BTA expression, and vice versa.
Basic triple algebra
Basic triple algebra
• little conjunctive algebra
• can leverage ~40 years of research on relational algebra evaluation
‣ e.g., Stocker et al. WWW 2008
• our interest: design space of index data structures for acceleration of SPARQL query processing
Indexing RDF
• state of the art: value-based indexing
Indexing RDF
• state of the art: value-based indexing
• missing: structure-based indexing
‣ e.g., dataguide and A(k) indexes for XML
‣ e.g., join indexes for relational and OO data
Indexing RDF
• methodology for coupling query languages and structural indexes (PODS 2006 & DBPL 2007)
• language indistinguishability
• build data structure over the partition induced by language indistinguishability of objects
Indexing RDF
Let G be a graph, s, s! ! subjects(G), and L be a subset ofBTA.
We say s and s! are L-equivalent, denoted s "L s!, if forany E ! L and n ! N, it is the case that there exists amapping m ! [[E]]G with m[3n + 1] = s if and only if thereexists a mapping m! ! [[E]]G with m![3n + 1] = s!.
The partition induced by "L on subjects(G) is called theL-partition of subjects(G).
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L0 = BTA
• partition is too fine
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L0 = BTA
• partition is too fine
• !1=s(!) " BTA
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L0 = BTA
{[Yamada], [Herzog], [McShea],[doc1], [doc2], [doc3]}
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L1
• disallow conditions involving atoms
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L1
• disallow conditions involving atoms
•
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L1
!3=4(! ·! )
{[Yamada, Herzog, McShea],[doc1, doc2, doc3]}
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L2
• only allow atoms in selections on predicates
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L2
• only allow atoms in selections on predicates
• partition is too fine
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L2
• only allow atoms in selections on predicates
• partition is too fine
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
L2
{[Yamada], [Herzog], [McShea],[doc1], [doc2], [doc3]}
Lk3 ! L2
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
• only allow atoms in selections on predicates
Lk3 ! L2
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
• only allow atoms in selections on predicates
• only allow join paths of length at most k
Lk3 ! L2
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
• only allow atoms in selections on predicates
• only allow join paths of length at most k
Lk3 ! L2
doc3doc2
MP3
Yamada Herzog McShea
14.5.08
doc1
type
authored performed
created on
{[Yamada, Herzog], [McShea],[doc1], [doc2, doc3]}
Lki ! BTA
• localized fragments of BTA
Lki ! BTA
• localized fragments of BTA
• building blocks of large classes of queries
Lki ! BTA
• design and implementation of index data structures built over partitions induced by localized fragments of BTA
• query processing using these indexes
‣ query decomposition
ongoing work
• BP expressiveness results for SPARQL
• “ontology” discovery for RDF
future work
Thank you!
Questions?
top related