CASS-MT Review: 6-Apr-2011
Task 3: Semantic Databases on the XMT
PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn
Cray: David Mizell
SNL: Eric Goodman, Edward Jimenez, Greg Mackey


Page 1

CASS-MT Review: 6-Apr-2011
Task 3: Semantic Databases on the XMT

PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn
Cray: David Mizell
SNL: Eric Goodman, Edward Jimenez, Greg Mackey

Page 2

Recap from August Review

We built a simple automatic query translator
  Much of the work was done by hand

Lessons learned from experiments:
  Query optimization must happen early and often
  An efficient semantic search engine will almost certainly need data-driven and query-driven optimization

Now that BTC is largely past, we are continuing forward with these goals

Page 3

Query Optimization Research Agenda

Prerequisites for query optimization research:
  Build out an end-to-end query engine
    Enables: validation, measurement, profiling
  Build a simple research compiler
    Enables: rapid prototyping, attribute aggregation

Not to be construed as standing up a product
  Glue code is not engineered to be robust
  Compiler is a first pass at using intermediate forms

Page 4

A Modular Query Engine

Page 5

Progress: Data Ingest

Portability important
  Using MTGL on multiple systems:
    Cray XMT Threadstorm nodes
    Cray XMT service node (Linux)
    Cray CX-1

Endian-ness an issue for storing/retrieving binary triples
  Ensure the first triple has a small (< 2^32) “Subject”
  Upon reading triples, if the first integer is >= 2^32, then swap bytes on all integers read in. Do this on all systems.

Swap 100,000,000 uint64_t
  Identical code compiled and run on each platform

  Cray CX-1    XMT Login    XMT 16 Proc
  3.59s        5.06s        5.07s
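
The byte-order check described on this slide can be sketched as follows. This is a minimal Python illustration, not the MTGL/C++ code that was timed; the names `read_triples` and `swap64` are hypothetical, and the sketch assumes triples are stored as three packed 64-bit unsigned integers and that the first subject is nonzero (a zero reads the same in either byte order, so it could not reveal a foreign file).

```python
import struct

LIMIT = 1 << 32  # the writer guarantees the first subject is below 2^32


def swap64(x: int) -> int:
    """Reverse the byte order of a 64-bit unsigned integer."""
    return int.from_bytes(x.to_bytes(8, "little"), "big")


def read_triples(raw: bytes) -> list:
    """Decode packed uint64 triples, swapping bytes if the file looks foreign."""
    if len(raw) % 24 != 0:
        raise ValueError("expected a whole number of 24-byte triples")
    count = len(raw) // 8
    # "=" means native byte order with standard sizes, so Q is 8 bytes.
    values = list(struct.unpack(f"={count}Q", raw))
    # A small nonzero value written in the other byte order shows up with its
    # nonzero bytes in the upper half of the word, i.e. as a value >= 2^32.
    if values and values[0] >= LIMIT:
        values = [swap64(v) for v in values]
    return [tuple(values[i:i + 3]) for i in range(0, count, 3)]
```

Because the same test runs on every platform, a file written on a little-endian service node and one written on a big-endian node both decode to the same triples without any header or flag in the file.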

Page 6

Progress: Graph Representation

Work in progress: use MTGL and sample code from SPEED-MT to build out components:

  Build a compressed_sparse_row<BidirectionalS> representation of the RDF graph
  Focus on an ability to memory map graph data structures for fast reloading (rememd).
  Adapt Search-Space Recursive Descent code (described in August, 2010 review) to the MTGL-based data structure.
  Redesign Dictionary Encoding storage on disk:
    Use only one file that supports an “Endian-ness hack”
    Avoid the need to parse strings from char-array to rebuild in-memory data dictionary.
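
To make the bidirectional compressed-sparse-row idea concrete, here is a small Python sketch. It is not MTGL code: the function names are hypothetical, and storing the predicate as an edge label alongside each target is an assumption about how an RDF graph might be laid out. It builds one CSR structure for outgoing edges and a second one for incoming edges from dictionary-encoded (subject, predicate, object) triples.

```python
from itertools import accumulate


def build_csr(num_vertices, edges):
    """Return (offsets, targets, labels) for edges given as (src, label, dst)."""
    degree = [0] * num_vertices
    for src, _, _ in edges:
        degree[src] += 1
    # offsets[v]..offsets[v + 1] is the slice of edges leaving vertex v.
    offsets = [0] + list(accumulate(degree))
    targets = [0] * len(edges)
    labels = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source vertex
    for src, label, dst in edges:
        slot = cursor[src]
        targets[slot] = dst
        labels[slot] = label
        cursor[src] += 1
    return offsets, targets, labels


def build_bidirectional(num_vertices, triples):
    """Build forward and reverse CSR arrays from integer-encoded triples."""
    forward = build_csr(num_vertices, triples)
    reverse = build_csr(num_vertices, [(o, p, s) for s, p, o in triples])
    return forward, reverse


def neighbors(csr, v):
    """List the (label, neighbor) pairs stored for vertex v."""
    offsets, targets, labels = csr
    lo, hi = offsets[v], offsets[v + 1]
    return list(zip(labels[lo:hi], targets[lo:hi]))
```

Because each structure is just three flat integer arrays, it is the kind of layout that lends itself to being written to disk and memory mapped for fast reloading.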

Page 7

A Transformation-oriented Query Compiler

Page 8

Progress: Parser

Query Parsing
  SPARQL 1.0 implemented as ANTLR LL(*) grammar
  Tested using SPARQL Performance Benchmark (SP2Bench) and Data Access Working Group (DAWG) tests
    Currently passes 175 of 214 tests (81%)
  We are not currently working to improve the coverage
    SPARQL parsing is not a priority
    We needed enough coverage to get interesting properties (OPTIONAL, UNION, FILTER, blank nodes, etc.)

Page 9

Progress: Intermediate Representation

Query language is not amenable to optimization
  So we lower into a more comfortable form

GPIR: Graph Pattern Intermediate Representation
  Query is represented as a graph
  Entity references are unified (all ?x refer to the same thing)
  Entities are tagged with language attributes
    e.g., all triples from a UNION statement are tagged with a union ID and a common union group ID

EPIR: Execution Plan Intermediate Representation
  Still very much a work-in-progress

Page 10

Progress: Intermediate Representation

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?predicate
WHERE {
  { ?person rdf:type foaf:Person .
    ?subject ?predicate ?person }
  UNION
  { ?person rdf:type foaf:Person .
    ?person ?predicate ?object }
}

Parses to:

# Query Graph <21148736> in GPIR
9:0-5-8
3:0-1-2
6:4-5-0
7:0-1-2
%
0:variable:T
0:label:"?person"
1:label:"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
2:label:"http://xmlns.com/foaf/0.1/Person"
3:union_group:1
3:optional:F
3:union_id:0
4:variable:T
4:label:"?subject"
5:variable:T
5:select:T
5:label:"?predicate"
6:union_group:1
6:optional:F
6:union_id:0
7:union_group:1
7:optional:F
7:union_id:1
8:variable:T
8:label:"?object"
9:union_group:1
9:optional:F
9:union_id:1
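
The GPIR dump on this slide can be read as a list of `id:subject-predicate-object` records followed by `id:attribute:value` records. The Python sketch below shows one way such a structure could be produced from the query's triple patterns. It is an illustration only: `build_gpir` is a hypothetical name, and the ID allocation order (terms numbered on first appearance, each triple numbered after its terms) is inferred from the dump rather than taken from the compiler itself.

```python
def build_gpir(groups, select=()):
    """Build (triples, attrs) from union branches.

    groups: list of (union_group, union_id, [(s, p, o), ...]) with string terms.
    select: variables named in the SELECT clause.
    """
    ids = {}      # term -> node ID, so every occurrence of "?x" is one node
    attrs = {}    # ID -> {attribute: value} for both terms and triples
    triples = []  # (triple_id, subject_id, predicate_id, object_id)
    next_id = 0

    def node(term):
        nonlocal next_id
        if term not in ids:
            ids[term] = next_id
            attrs[next_id] = {"label": term}
            if term.startswith("?"):
                attrs[next_id]["variable"] = True
            next_id += 1
        return ids[term]

    for group, union_id, patterns in groups:
        for s, p, o in patterns:
            s_id, p_id, o_id = node(s), node(p), node(o)
            triple_id = next_id
            next_id += 1
            # Every triple of a UNION branch carries its branch ID and group ID.
            attrs[triple_id] = {
                "union_group": group,
                "union_id": union_id,
                "optional": False,
            }
            triples.append((triple_id, s_id, p_id, o_id))

    for term in select:
        attrs[ids[term]]["select"] = True
    return triples, attrs
```

Fed the two UNION branches of the query on this slide, the sketch assigns `?person` the ID 0 in both branches and yields triples 3, 6, 7 and 9, which matches the numbering in the dump (the dump lists the triples in a different order).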

Page 11

Progress: Transforms

Transforms operate on an IR
  Input and output are same format, so they can be chained

Example transform: xform_rem_uni
  Removes union group attributes which only have one member
  Think of this as algebraic simplification on math expressions (A+0 => A), except for SPARQL UNION statements
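
A transform of this kind might look like the following Python sketch. Only the name `xform_rem_uni` comes from the slide; the IR shape (a list of triples plus a dictionary of per-ID attributes) and the reading of "one member" as "one UNION branch" are assumptions made for illustration, and `chain` is a hypothetical helper.

```python
from functools import reduce


def xform_rem_uni(ir):
    """Drop union attributes from groups that contain a single branch."""
    triples, attrs = ir
    branches = {}  # union_group -> set of union_ids seen in that group
    for a in attrs.values():
        if "union_group" in a:
            branches.setdefault(a["union_group"], set()).add(a["union_id"])
    single = {g for g, members in branches.items() if len(members) == 1}

    new_attrs = {}
    for key, a in attrs.items():
        if a.get("union_group") in single:
            # A one-branch UNION is just a plain pattern, like A+0 => A.
            a = {k: v for k, v in a.items()
                 if k not in ("union_group", "union_id")}
        new_attrs[key] = dict(a)
    # Same (triples, attrs) shape as the input, so transforms compose.
    return triples, new_attrs


def chain(*transforms):
    """Compose transforms left to right into a single IR -> IR function."""
    return lambda ir: reduce(lambda acc, t: t(acc), transforms, ir)
```

Since the output has the same format as the input, `chain(xform_rem_uni, other_transform)` is itself a transform, which is what makes a pipeline of small rewrites practical.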

Page 12

Potential Optimizing Transforms

Longer-term, we are looking at several different types of transforms to attempt. Here are some examples:

Impossible query identification: a triple pattern, constraint, or inferred interaction does not exist in the data
Deterministic bind: if a property is known to be unique (e.g., rdf:type is usually unique), a traversal can avoid nondeterminism while satisfying that constraint
Selectivity-based strategy detection: if the pattern does not include complex interactions (or can be reduced to one that does not), a simpler execution strategy can be chosen on-the-fly
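
As a rough illustration of the first idea, the Python sketch below flags a query as impossible when one of its patterns names a constant predicate that never occurs in the data. The function names are hypothetical, and the check assumes a purely conjunctive pattern (no UNION or OPTIONAL, where a failing branch would not rule out the whole query); it covers only the simplest case of a missing triple pattern, not constraints or inferred interactions.

```python
from collections import Counter


def predicate_counts(triples):
    """Count how often each predicate occurs in (s, p, o) data triples."""
    return Counter(p for _, p, _ in triples)


def is_impossible(patterns, counts):
    """True if some pattern uses a constant predicate absent from the data.

    Terms starting with "?" are variables and can match any predicate.
    """
    return any(
        not p.startswith("?") and counts[p] == 0
        for _, p, _ in patterns
    )
```

The counts are a data-driven summary that can be computed once at ingest time, so the check costs one lookup per pattern and lets the engine return an empty result without touching the graph.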

Page 13

Future Work: specific directions

Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query
Work with Bill Howe on query language and hybrid search strategies
Expand our collaboration with Task 1
Support Task 16 (Mayo)
Engage with Bioinformatics domain to find/build interestingly large and complex Bio dataset (i.e., more complex than UniProt)
Find collections of complex queries
Continue work on search engine comparison:
  Array-based
  Subgraph-isomorphism (MTGL)
  Sprinkle-SPARQL
  Query-optimization infused with pattern matching
Extend study of larger path types (n=4,5) and/or non-linear motifs