VLDB 2005

An Efficient SQL-based RDF Querying Scheme

Eugene Inseok Chong Souripriya Das George Eadon Jagannathan Srinivasan

New England Development Center, Oracle

Talk Outline

• Introduction
• Functionality
• Design and Implementation
• Performance
• Conclusions and Future Work

Introduction

RDF (Resource Description Framework)
• RDF is a W3C Standard for describing resources on the web
• Uniform Resource Identifiers (URIs) are used to identify resources
  • Example: http://www.oracle.com/people#John
• RDF triples are used to make statements about a resource
  • Format: (subject predicate object)
  • Example: (:John :brotherOf :Mary)
  • Represents a directed, labeled edge in an RDF graph:
    :John --:brotherOf--> :Mary

RDF Data and Graph Example
Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name “John”)
  (:Mary :name “Mary”)
  (:Matt :name “Matt”)
[Figure: RDF graph of the family data. :John --:brotherOf--> :Mary --:parentOf--> :Matt, and each of the three nodes has a :name edge to its literal (“John”, “Mary”, “Matt”).]

RDF Querying Problem

• Given
  • RDF graphs: the data set to be searched
  • Graph Pattern: containing a set of variables
• Find
  • Matching Subgraphs
• Return
  • Sets of variable bindings: where each set corresponds to a Matching Subgraph

RDF Query Example
Family Data:
  (:John :brotherOf :Mary)
  (:Mary :parentOf :Matt)
  (:John :name “John”)
  (:Mary :name “Mary”)
  (:Matt :name “Matt”)
Graph Pattern: (names of Mary’s brothers)
  (?x :brotherOf ?y)
  (?y :name “Mary”)
  (?x :name ?n)
Variable Bindings: x = :John, y = :Mary, n = “John”
Matching Subgraph:
  (:John :brotherOf :Mary)
  (:Mary :name “Mary”)
  (:John :name “John”)
[Figure: RDF graph of the family data]
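The matching step in this example can be sketched in a few lines of Python. This is only an illustration of graph-pattern matching over an in-memory list of triples, not the scheme presented in the talk; the names `unify` and `match` are hypothetical helpers introduced here.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Family data from the slide, with literals kept in double quotes.
FAMILY: List[Triple] = [
    (":John", ":brotherOf", ":Mary"),
    (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", '"John"'),
    (":Mary", ":name", '"Mary"'),
    (":Matt", ":name", '"Matt"'),
]


def unify(pattern: Triple, triple: Triple,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return `binding` extended so that `pattern` equals `triple`, or None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def match(patterns: List[Triple], data: List[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield one set of variable bindings per matching subgraph."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in data:
        extended = unify(patterns[0], triple, binding)
        if extended is not None:
            yield from match(patterns[1:], data, extended)


PATTERN: List[Triple] = [
    ("?x", ":brotherOf", "?y"),
    ("?y", ":name", '"Mary"'),
    ("?x", ":name", "?n"),
]
bindings = list(match(PATTERN, FAMILY))
print(bindings)
```

Each pattern triple is tried against every data triple, so the cost grows quickly with pattern size; this nested search is what a relational engine would evaluate as a self-join.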

RDF Storage Issues
• Need to store RDF <subject, predicate, object> triples where the individual components can be URIs, blank nodes, or literals
• Namespaces used in URIs could be long
• Multiple triples describe a resource resulting in repetition of (possibly long) URIs
• Different representations possible for a literal occurring in multiple triples
  • e.g. 120 120.0 12.0e+1 1.20e+2
• RDF graph may include schema triples
  • e.g. (:brotherOf rdfs:domain :Male)

RDF Querying Issues in SQL
• Support specification of graph pattern-based SQL query
• Occurrence of same variables in multiple triples of graph pattern: Processing requires self-join
  • e.g. (?x :brotherOf ?y) (?y :name “Mary”) (?x :name ?n)
• Query processing (e.g., for filter conditions, ORDER BY) requires datatype-specific comparison semantics
  Schema Triple: (:age rdfs:range xsd:int)
  Graph Pattern: (?x :age ?a)
  Filter Condition: a > 60
  ORDER BY: a DESCENDING
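A small Python sketch shows why the comparison has to be datatype-specific. The age values are hypothetical, and the sketch assumes literals are held as text, as they would be if stored in a character column.

```python
# Hypothetical :age literals, held as text.
ages = ["7", "61", "120"]

# Comparing the raw text is lexicographic: "7" sorts after "60" and "120"
# sorts before it, so the filter a > 60 returns the wrong rows.
as_text = [a for a in ages if a > "60"]

# Knowing from the schema triple that :age has range xsd:int, the values
# can be compared and ordered as integers instead.
as_numbers = [a for a in ages if int(a) > 60]
descending = sorted(as_numbers, key=int, reverse=True)

print(as_text, as_numbers, descending)
```

The same issue applies to ORDER BY: the sort key has to be the typed value, not its lexical form.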

RDF Querying Issues: Inference
• Query processing may involve inferencing.
• Example:
  Data: (:Jim :brotherOf :John) (:John :fatherOf :Mary)
  Graph Pattern: (?x :uncleOf ?y)
  Result: Empty
  Rule: (?x :brotherOf ?y) (?y :fatherOf ?z) → (?x :uncleOf ?z)
  Inferred data: (:Jim :uncleOf :Mary)
  Result: x = :Jim, y = :Mary
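The effect of the rule can be reproduced with a short Python sketch. It hard-codes the single uncle rule from the example and applies it once; it is an illustration of the idea, not how rulebases are evaluated in the system described. The function names are hypothetical.

```python
DATA = {
    (":Jim", ":brotherOf", ":John"),
    (":John", ":fatherOf", ":Mary"),
}


def apply_uncle_rule(triples):
    """Apply (?x :brotherOf ?y) (?y :fatherOf ?z) -> (?x :uncleOf ?z) once."""
    inferred = set()
    for (x, p1, y) in triples:
        if p1 != ":brotherOf":
            continue
        for (y2, p2, z) in triples:
            if p2 == ":fatherOf" and y2 == y:
                inferred.add((x, ":uncleOf", z))
    return inferred


def uncles(triples):
    """Evaluate the graph pattern (?x :uncleOf ?y) as (x, y) pairs."""
    return sorted((s, o) for (s, p, o) in triples if p == ":uncleOf")


without_rule = uncles(DATA)
with_rule = uncles(DATA | apply_uncle_rule(DATA))
print(without_rule, with_rule)
```

Querying the asserted data alone gives no result; querying the asserted data together with the inferred triples gives the binding shown on the slide.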

RDF Querying Approach
• General Approach
  • Create a new (declarative, SQL-like) query language
    • e.g.: RQL, SeRQL, TRIPLE, N3, Versa, SPARQL, RDQL, RDFQL, SquishQL, RSQL, etc.
• SQL-based Approach
  • Introduces a SQL Table Function RDF_MATCH that uses a SPARQL-like graph pattern to express RDF queries
• Benefits of SQL-based Approach
  • Leverages all the powerful constructs in SQL (e.g., SELECT / FROM / WHERE, ORDER BY, GROUP BY, aggregates, Join) to process graph query results
  • RDF queries can easily be combined with conventional queries on database tables, thereby avoiding staging

Embedding RDF Query in SQL
• SELECT …
  FROM …, TABLE ( <RDF Query, expressed as RDF_MATCH Table Function invocation> ) t, …
  WHERE …;
• Use of RDF_MATCH Table Function allows embedding a graph query in a SQL query

Functionality

RDF_MATCH Table Function
• Input parameters
  RDF_MATCH (
    Pattern,     -- graph pattern
    Models,      -- data (set of RDF graphs)
    RuleBases,   -- rules (0 or more rulebases)
    Aliases      -- list of prefixes for namespaces
  )
• Returns a set of columns containing variable bindings
  • Variable matching URI returned as single VARCHAR2 column with the same name (e.g. x for ?x)
  • Variable matching literal returned as a pair of VARCHAR2 columns with a name (e.g. x for ?x) and the type (x$type for ?x)

RDF_MATCH Example
• Example: student reviewers less than 25 years old
  SELECT t.r reviewer, t.c conf, t.a age
  FROM TABLE ( RDF_MATCH (
         '(?r rdf:type :Student)
          (?r :reviewerOf ?c)
          (?r :age ?a)',
         RDFModels('reviewers'),
         NULL,
         RDFAliases(…))) t
  WHERE t.a < 25;

Specifying Rules
• RDFS rulebase: Pre-Loaded
• Can add User-defined rules
  • Rule: “Chairperson of Conference is also a reviewer”
    ('rb',                      -- rulebase name
     'ChairpersonRule',         -- rule name
     '(?r :ChairpersonOf ?c)',  -- antecedents
     NULL,                      -- filter condition
     NULL,                      -- aliases
     '(?r :ReviewerOf ?c)')     -- consequents

RDF_MATCH Example with rulebase
• Query: Find reviewers of conferences
• SELECT t.r reviewer
  FROM TABLE( RDF_MATCH(
         '(?r :ReviewerOf ?c)',
         RDFModels ('reviewers'),
         RDFRules ('rb'),
         NULL)) t;
• Data (:Mary :ChairpersonOf :IDBC2005)
• Inferred data (:Mary :ReviewerOf :IDBC2005)

Design & Implementation

RDF Data Storage
• Triples Data stored after normalization in two tables
  • UriMap(UriID, UriValue,…) contains mapping of (URIs, blank nodes, literals) to internal identifiers
  • IdTriples (ModelID, SubjectID, PropertyID, ObjectID,…) contains the triple information encoded as three identifiers
• Multiple representation of literals: The first occurrence treated as canonical, rest mapped to canonical representation
  • e.g. 120.0 120 1.20e+2 12.0e+1
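The two-table layout can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: only the columns named on the slide are modelled (the elided ones are left out), numeric literals are recognised by parsing them with `Decimal` (an assumption made here, not a detail from the talk), and the `:weight` property and the helper names are hypothetical.

```python
import sqlite3
from decimal import Decimal, InvalidOperation

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);
    CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,
                            PropertyID INTEGER, ObjectID INTEGER);
""")

canonical = {}  # numeric value -> first lexical form seen for it


def canonical_form(value: str) -> str:
    """Map every spelling of a number to the first spelling encountered."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return value  # URI, blank node, or non-numeric literal
    return canonical.setdefault(number, value)


def value_id(value: str) -> int:
    """Return the internal identifier for a value, creating it if needed."""
    value = canonical_form(value)
    conn.execute("INSERT OR IGNORE INTO UriMap (UriValue) VALUES (?)", (value,))
    row = conn.execute("SELECT UriID FROM UriMap WHERE UriValue = ?",
                       (value,)).fetchone()
    return row[0]


def add_triple(model_id: int, s: str, p: str, o: str) -> None:
    conn.execute("INSERT INTO IdTriples VALUES (?, ?, ?, ?)",
                 (model_id, value_id(s), value_id(p), value_id(o)))


# Two spellings of the same number end up sharing one UriMap row.
add_triple(1, ":John", ":weight", "120.0")
add_triple(1, ":Mary", ":weight", "1.20e+2")

uri_rows = conn.execute("SELECT COUNT(*) FROM UriMap").fetchone()[0]
object_ids = [r[0] for r in conn.execute("SELECT ObjectID FROM IdTriples")]
stored = conn.execute("SELECT UriValue FROM UriMap WHERE UriID = ?",
                      (object_ids[0],)).fetchone()[0]
print(uri_rows, object_ids, stored)
```

Long URIs are stored once in UriMap, and the triples table holds only integers, so repeated resources cost one identifier per occurrence.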

RDF_MATCH Query Processing
• Substitute aliases with namespaces in search pattern
• Convert URIs and literals to internal IDs
• Generate Query
  • Generate self-join query based on matching variables
  • Generate SQL subqueries for rulebases component (if any)
  • Generate the join result by joining internal IDs with UriMap table
  • Use model IDs to restrict IdTriples table
• Compile and Execute the generated query
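The query-generation steps can be sketched with Python and sqlite3. This is a rough illustration only: it skips alias substitution and rulebases, uses the simplified two-column UriMap and four-column IdTriples tables, and `generate_sql` is a hypothetical helper, not the function used in the actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UriMap (UriID INTEGER PRIMARY KEY, UriValue TEXT UNIQUE);
    CREATE TABLE IdTriples (ModelID INTEGER, SubjectID INTEGER,
                            PropertyID INTEGER, ObjectID INTEGER);
""")

FAMILY = [
    (":John", ":brotherOf", ":Mary"), (":Mary", ":parentOf", ":Matt"),
    (":John", ":name", '"John"'), (":Mary", ":name", '"Mary"'),
    (":Matt", ":name", '"Matt"'),
]
ids = {}  # value -> internal ID


def value_id(value):
    if value not in ids:
        ids[value] = len(ids) + 1
        conn.execute("INSERT INTO UriMap VALUES (?, ?)", (ids[value], value))
    return ids[value]


for s, p, o in FAMILY:
    conn.execute("INSERT INTO IdTriples VALUES (1, ?, ?, ?)",
                 (value_id(s), value_id(p), value_id(o)))

COLUMNS = ("SubjectID", "PropertyID", "ObjectID")


def generate_sql(patterns, model_id):
    """Build a self-join over IdTriples, one table alias per pattern triple."""
    tables, conditions, params, first_seen = [], [], [], {}
    for i, pattern in enumerate(patterns):
        alias = f"t{i}"
        tables.append(f"IdTriples {alias}")
        conditions.append(f"{alias}.ModelID = ?")  # restrict by model ID
        params.append(model_id)
        for column, term in zip(COLUMNS, pattern):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                conditions.append(f"{ref} = ?")    # constant -> internal ID
                params.append(ids.get(term, -1))   # -1 matches nothing
            elif term in first_seen:
                conditions.append(f"{ref} = {first_seen[term]}")  # join
            else:
                first_seen[term] = ref
    select = []
    for var, ref in first_seen.items():
        name = var[1:]
        tables.append(f"UriMap u_{name}")          # map IDs back to values
        conditions.append(f"u_{name}.UriID = {ref}")
        select.append(f"u_{name}.UriValue AS {name}")
    sql = (f"SELECT {', '.join(select)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(conditions)}")
    return sql, params


PATTERN = [("?x", ":brotherOf", "?y"), ("?y", ":name", '"Mary"'),
           ("?x", ":name", "?n")]
sql, params = generate_sql(PATTERN, model_id=1)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

A variable that occurs in more than one pattern triple becomes an equality condition between two aliases of IdTriples, which is the self-join mentioned above.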

Optimization: Table Function Rewrite

• TableRewriteSQL( )
  • Takes RDF Query (specified via arguments) as input
  • Generates a SQL string
• Substitute the table function call with the generated SQL string
• Reparse and execute the resulting query
• Advantages
  • Avoid execution-time overhead (linear in number of result rows) associated with table function infrastructure
  • Leverage SQL optimizer capabilities to optimize the resulting query (including filter condition pushdown)

Optimization: Materialized Join Views
• Generic Materialized Join views (MJVs)
  • Subject-Subject, Object-Subject, …
• Subject-property matrix MJVs (SPMJVs)
  • custom, workload based (e.g., frequent search patterns)
Example: Select student name, university, and age
• Select r, u, a ……
  ‘(?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)’……
• SPMJV: < Student enrolledAt age >
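The idea behind a subject-property matrix view can be sketched with sqlite3. The sketch makes several simplifying assumptions: triples are stored as plain text rather than internal IDs, the student data is invented for illustration, and SQLite's `CREATE TABLE ... AS SELECT` stands in for a materialized view (it is a one-time snapshot and is not refreshed when the triples change).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (Subject TEXT, Property TEXT, Object TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", [
    # Hypothetical sample data.
    (":Ann", "rdf:type", ":Student"),
    (":Ann", ":enrolledAt", ":MIT"),
    (":Ann", ":age", "22"),
    (":Bob", "rdf:type", ":Student"),
    (":Bob", ":enrolledAt", ":BU"),
    (":Bob", ":age", "27"),
    (":Eve", "rdf:type", ":Professor"),
    (":Eve", ":age", "45"),
])

# Pay for the three-way self-join once: one row per student,
# one column per property of interest.
conn.execute("""
    CREATE TABLE StudentSPMJV AS
    SELECT t.Subject AS Student, e.Object AS enrolledAt, a.Object AS age
    FROM Triples t
    JOIN Triples e ON e.Subject = t.Subject AND e.Property = ':enrolledAt'
    JOIN Triples a ON a.Subject = t.Subject AND a.Property = ':age'
    WHERE t.Property = 'rdf:type' AND t.Object = ':Student'
""")

# The pattern (?r rdf:type :Student) (?r :enrolledAt ?u) (?r :age ?a)
# can now be answered by scanning a single table, with no joins.
rows = conn.execute(
    "SELECT Student, enrolledAt, age FROM StudentSPMJV ORDER BY Student"
).fetchall()
print(rows)
```

Queries whose pattern uses only the subject type and properties covered by the view avoid the self-joins entirely, which is why such views are chosen from frequent search patterns.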

Performance

Dataset

• WordNet: lexical database for English language
• UniProt: large scale (80 million triples)
  • Protein and annotation data

Experiments

• Varying number of triples in search pattern
• Varying filter conditions
• Varying projection list
• Large-scale RDF data
• Subject-property MJVs

Varying Number of Triples

• ‘(?a wn:hyponymOf ?b) (?b wn:hyponymOf ?c) …..

• Increasing number of self-joins

Varying Number of Triples

[Figure: Time (seconds), axis from 0 to 1.6, plotted against # of Triples in the Pattern, axis from 0 to 8. Two series: Without MJV and With MJV.]

Varying Projection List

• ‘(?c0 wn:wordForm ?word) (?c0 wn:wordForm ?syn1) (?c1 wn:wordForm ?syn1) …. (5 triples)

• Benefit of the projection list optimization
  • Eliminate joins with UriMap table for variables not referenced outside of RDF_MATCH

Varying Projection List

[Figure: Time (seconds), axis from 0 to 0.4, plotted against Projection List Size, axis from 0 to 5.]

Large-Scale RDF Data

• UniProt – 10M, 20M, 40M, 80M triples
• 6 example queries given with UniProt
• Number of matches remains constant as dataset size changes (ROWNUM)

UniProt Sample Queries

Query | Description | Pattern | Projection | Result limit
Q1 | Display the ranges of transmembrane regions | 6 triples, 5 vars | 3 vars | 15000 rows
Q2 | List proteins with publications by authors with matching names | 5 triples, 5 vars, 1 LIKE pred. | 3 vars | 10 rows
Q3 | Count the number of times a publication by a specific author is cited | 3 triples, 2 vars | 0 vars | 32 rows
Q4 | List resources that are related to proteins annotated with a specific keyword | 3 triples, 2 vars | 1 var | 3000 rows
Q5 | List genes associated with human diseases | 7 triples, 5 vars | 3 vars | 750 rows
Q6 | List recently modified entries | 2 triples, 2 vars, 1 range pred. | 2 vars | 8000 rows

Query Response Times
RDF_MATCH Performance Scalability

             | Q1   | Q2     | Q3     | Q4   | Q5   | Q6
10 M Triples | 0.86 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.46
20 M Triples | 0.95 | < 0.01 | < 0.01 | 0.03 | 0.19 | 0.47
40 M Triples | 0.96 | < 0.01 | < 0.01 | 0.03 | 0.18 | 0.47
80 M Triples | 1.03 | < 0.01 | < 0.01 | 0.03 | 0.20 | 0.49
Maximum      | .054 | 0.002  | 0.002  | .011 | .065 | 0.07

Conclusions

Conclusions and Future Work
• SQL-based RDF querying scheme
  • RDF_MATCH table function
  • Supports graph-pattern based query on RDF data with RDFS and user-defined rules
• Efficient Execution
  • Table Function Rewrite
  • Materialized Join Views: Generic and Subject-Property
  • Rule Indexes
• Future work
  • OPTIONAL support – outer-join
  • Provenance support
