lod2 webinar series: virtuoso 7

Post on 01-Nov-2014


DESCRIPTION

This webinar, part of the LOD2 webinar series, presents Virtuoso 7: Virtuoso Column Store, Adaptive Techniques for RDF Graph Databases. It discusses the application of column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics. Virtuoso is an innovative enterprise grade multi-model data server for agile enterprises & individuals. It delivers an unrivaled platform agnostic solution for data management, access, and integration. The unique hybrid server architecture of Virtuoso enables it to offer traditionally distinct server functionality within a single product. If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services, and concrete use cases that can be realised using LOD, then join us in the free LOD2 webinar series! http://lod2.eu/BlogPost/webinar-series

TRANSCRIPT

LOD2 Webinar . 29.11.2011 . Page 1 http://lod2.eu

Creating Knowledge out of Interlinked Data

http://lod2.eu

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries, the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.


Once per month the LOD2 webinar series offers a free webinar about tools and services along the Linked Open Data Life Cycle.

Stay with us and learn more about acquisition, editing, composing, connected applications, and finally publishing Linked Open Data.

© 2012 OpenLink Software, All rights reserved.

Virtuoso 7.0: Enabling Massively Scalable Big Data Analytics for RDF & SQL Data Management

By Orri Erling, Virtuoso Program Manager & Hugh Williams, Professional Services Manager

Making Technology Work For You


Company Overview

OpenLink Company Overview

•  OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry acclaimed technology innovator in the following areas:

   ◦  ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle, SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL

   ◦  High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology

   ◦  Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)

   ◦  Web Application Server Technology

   ◦  Linked Data Deployment & Management

   ◦  Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)

   ◦  Identity Management


Products & Services: Software Products

•  OpenLink Universal Data Access Drivers (UDA) - High-performance data access drivers for ODBC, JDBC, ADO.NET, and OLE DB that provide transparent access to enterprise databases.

•  OpenLink Virtuoso - available in single server and cluster editions that are deployed in cloud and/or enterprise modes.

•  OpenLink Data Spaces Platform and Applications

•  OpenLink Ajax Toolkit

•  OpenLink Data Explorer

•  An Open Source Data Access SDK for ODBC

All OpenLink products are delivered by download from the Internet (http, ftp, etc.). Temporary licenses are issued upon download and may be extended as needed, on a case-by-case basis. Permanent licenses are issued once payment is received.

Products & Services: Professional and Support Services

•  OpenLink Product Support provides front-line email and phone support, web-based online support, and a variety of premium services such as phone, emergency, and onsite support.

•  Our Support staff is comprised of individuals with extensive knowledge of data access, data migration, database administration, programming APIs, and other relevant skills.

•  Services are sold in either Standard "Bronze" or Premium "Platinum" Support packages, with varying hours of availability, response times, etc.

•  We also offer Custom Development, Training, and other Consultancy services. These services can be offered on- or off-site. Expenses for travel, accommodations, food, etc., associated with on-site services are charged separately.


Customers

OpenLink's installed base is in excess of 10,000 customers worldwide. Examples include:

•  Data.Gov (U.S. Govt. Open Linked Data initiative)

•  Verizon

•  Raytheon

•  Bank of America

•  CGI Federal

•  Elsevier

•  French National Library

•  Globo

•  Scottish Government

•  St Jude's Medical

•  Barclays Bank

•  Wells Fargo

•  and many more

Office Locations

USA: OpenLink Software, Inc, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900. Fax: +1 781 229 8030

UK: OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701. Fax: +44 (0)20 8681 7702

Virtuoso Universal Server Overview

Situation Analysis

Data is growing exponentially along the following dimensions:

•  Volume

•  Velocity

•  Variety

All of this happens while the total number of hours in a day remains 24.

Product Value Proposition

Enterprise and Individual Agility via Data Access, Integration, and Management, without compromising performance, scalability, security, and platform independence.

"Virtuoso locks you into an experience (openness, performance, and scale) not the platform itself." -- Kingsley Idehen, Founder & CEO, OpenLink Software

Product Architecture

A high-performance, scalable, secure, and operating-system-independent server designed to handle contemporary challenges associated with standards compliant data access, data integration, and data management.

Data Virtualization Middleware

An in-built middleware layer (“Sponger”) for creating Transient & Persistent Views over Heterogeneous Data Sources.

Sophisticated Content Crawler

DBMS hosted Content Crawler that leverages loosely coupled binding to the Sponger Middleware component for transformation of unstructured and semi-structured data into Linked Data.

Core Platform behind LOD Cloud

Core Platform (Graph DBMS and Linked Data Deployment) behind DBpedia, many bubbles in the LOD Cloud, and the LOD Cloud cache itself.

Virtuoso Linked Data projects

•  DBpedia - public SPARQL endpoint over the DBpedia data (and international Chapters)

•  LOD Cloud Cache - public server hosting LOD cloud datasets

•  URIBurner - Linked Data generation & transformation service

•  Linked Geo Data - OpenStreetMap Spatial data as Linked Data

•  Sindice - SPARQL endpoint behind its Semantic Web Index

•  Data.gov - US Government Linked Data

•  Health.data.gov - Clinical Quality Linked Data on health.data.gov

•  Seevl - Linked Data music discovery service

•  Bio2RDF - Life science data mapped to Linked Data

•  Neurocommons - Life science data mapped to Linked Data

•  Musicbrainz - MusicBrainz database published as Linked Data

•  Open PHACTS - DBpedia-like Linked Data Space for Pharma

•  Others - Many others …


Powerful Standards Support

ODBC compliance enables use of client applications (e.g. Microsoft Access) as front-ends for Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Powerful Standards Support Cont’d

ODBC & HTML5 compliance enables development of rich client apps that leverage the WebDB-ODBC bridge for accessing data across: Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Insight Discovery & Exploration


Native Faceted Browsing that enables multi-dimensional drill-downs via any browser

Insight Discovery & Exploration

Microsoft Silverlight or HTML5 based PivotViewer Front-End for SPARQL and SPARQL-FED Queries

Powerful SPARQL Query Service


Basic SPARQL Endpoint for Creating Query Definitions & Sharing Query Results.

Example: health.data.gov data directly from a Web Browser.

Powerful SPARQL Query Builder

Use Query By Example (QBE) Patterns to Construct & Share Query Results.

How Do I Get Going?

•  Download, install, and experience the power of coherent integration of disparate data sources, data access protocols, and data representation formats.

•  In a nutshell, commence exploitation of powerful business intelligence, socially enhanced collaboration, data virtualization, and entity analytics without writing a line of code!

•  Turn "Big Data" into exploitable "Smart Data" without compromise!

•  Will be integrated into the next release of the LOD2 Stack

Virtuoso 7.0


Flexible Big Data Challenge

•  Data Agility is challenged by Volume, Velocity, and Variety

•  “Schema Last” is great - if the price is right

•  RDF, graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores

•  Inference may be good for integration, if it can express the right things, beyond OWL

•  RDF data management technology must learn from the lessons of SQL RDBMS, everything applies

Virtuoso 7.0 Mission Statement

Destruction of the following items as impediments to Big (Open) Linked Data exploitation:

•  Performance

•  Scalability

•  Platform Independence

•  Security & Privacy

•  Price

Virtuoso 7.0 & Big Data Myths

Myths put to rest:

•  Scalable Open Ended SPARQL Endpoints

•  Scalable Open Ended Read-Write SPARQL Endpoints

•  Fine-grained Access Controls underlying Read-Only or Read-Write endpoints.

Virtuoso Column Store Features

•  Supports SQL and SPARQL query languages

•  Compact column-wise storage

•  Vectored execution of commands

•  Shared nothing scale out for clusters

•  Powerful procedure language with parallel, distributed control structures

•  Full-text and geospatial indexes

Storage Engine

•  Freely mix column-wise and row-wise indices

•  All SQL and RDF data types natively supported, single execution engine for SQL/SPARQL

•  Column compression 3x more space efficient than row-wise compression for RDF

•  Column stores are not only for big scans; random access surpasses rows as soon as there is some locality

•  9 B/quad with DBpedia, 7 B/quad with BSBM or RDF-H, 14 B/quad with web crawls (PSOG, POSG, SP, OP, GS, excluding literals)
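One reason a column-wise layout can compress a sorted RDF index well is that the leading columns contain long runs of repeated values. The following is a minimal Python sketch of that effect using run-length encoding. It is an illustration only and does not describe Virtuoso's actual storage format; the function name and the sample quads are hypothetical.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]

# Hypothetical quads written in (P, S, O, G) order and sorted,
# the way rows of a PSOG index would be ordered.
quads = sorted([
    ("rdf:type", "s1", "Person", "g1"),
    ("rdf:type", "s2", "Person", "g1"),
    ("rdf:type", "s3", "City", "g1"),
    ("name", "s1", "Alice", "g1"),
    ("name", "s2", "Bob", "g1"),
])

# Split the rows into one list per column.
p_col, s_col, o_col, g_col = (list(col) for col in zip(*quads))

# The leading column (P) and the graph column (G) collapse into a few runs.
p_runs = run_length_encode(p_col)
g_runs = run_length_encode(g_col)
print(p_runs)
print(g_runs)
```

In a row-wise layout every row would repeat the predicate and graph values, whereas here five rows reduce to two runs for P and one run for G.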

Execution Engine

•  Vectoring is not only for column stores

•  Vectoring makes a random access into a linear merge join if there is any locality: always a win, mileage depends on run time factors

•  Vectoring eliminates interpretation overhead and makes CPU friendly code possible

•  Even with run time data typing, vectoring allows use of type-specific operators on homogeneous data, e.g. arithmetic

•  Dynamically adjust vector size: a larger vector may not fit in cache but will get better locality for random access
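The claim that vectoring turns random access into a linear merge join can be sketched in a few lines of Python. The sketch assumes a sorted list of integer keys standing in for an index; both function names are hypothetical, and this is not Virtuoso's implementation.

```python
from bisect import bisect_left

def lookup_one_by_one(index_keys, probes):
    """Random access: an independent binary search for every probe key."""
    hits = []
    for key in probes:
        pos = bisect_left(index_keys, key)
        if pos < len(index_keys) and index_keys[pos] == key:
            hits.append(key)
    return hits

def lookup_vectored(index_keys, probes):
    """Vectored access: sort the batch, then make one forward pass over the index."""
    hits = []
    pos = 0
    for key in sorted(probes):
        # The cursor only moves forward, so nearby keys share the same index region.
        while pos < len(index_keys) and index_keys[pos] < key:
            pos += 1
        if pos < len(index_keys) and index_keys[pos] == key:
            hits.append(key)
    return hits

index_keys = [2, 3, 5, 8, 13, 21, 34]
probes = [21, 4, 3, 13]
print(lookup_vectored(index_keys, probes))
```

Both functions find the same keys; the vectored version visits the index in order, which is where the locality benefit comes from when the probe batch is large.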

Graph operations

•  Run time computation plus caching instead of materialization

•  SPARQL/SQL extension for arbitrary transitive subqueries:

•  Flexible options for returning shortest paths, all paths, all / distinct reachable, attributes of steps on paths etc.

•  Efficient execution, searching the graph from both ends if looking for a path with ends given

•  Query operators for RDF hierarchy traversal

•  Special query operator for OWL sameAs and IFP based identity

•  Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
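Searching the graph from both ends when both path endpoints are given can be sketched as a bidirectional breadth-first search. The Python below is a generic textbook version under the assumption of a directed edge list; the function name is hypothetical and it is not the Virtuoso transitive operator.

```python
from collections import defaultdict

def shortest_path_length(edges, source, target):
    """Return the edge count of a shortest directed path, or None if unreachable."""
    if source == target:
        return 0
    out_edges, in_edges = defaultdict(set), defaultdict(set)
    for subj, obj in edges:
        out_edges[subj].add(obj)
        in_edges[obj].add(subj)

    # Distances from the source (forward) and to the target (backward).
    dist_fwd, dist_bwd = {source: 0}, {target: 0}
    frontier_fwd, frontier_bwd = {source}, {target}

    while frontier_fwd and frontier_bwd:
        # Expand the smaller frontier one full level.
        forward = len(frontier_fwd) <= len(frontier_bwd)
        if forward:
            frontier, dist, other, adjacent = frontier_fwd, dist_fwd, dist_bwd, out_edges
        else:
            frontier, dist, other, adjacent = frontier_bwd, dist_bwd, dist_fwd, in_edges

        best = None
        next_frontier = set()
        for node in frontier:
            for neighbour in adjacent[node]:
                if neighbour in other:
                    # The two searches meet; keep the shortest meeting in this level.
                    total = dist[node] + 1 + other[neighbour]
                    best = total if best is None else min(best, total)
                elif neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    next_frontier.add(neighbour)
        if best is not None:
            return best
        if forward:
            frontier_fwd = next_frontier
        else:
            frontier_bwd = next_frontier
    return None

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]
print(shortest_path_length(edges, "a", "d"))
```

Each side explores only about half the path depth, which is why this tends to touch far fewer nodes than a one-sided search on graphs with high fanout.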

Query Optimization Challenges

•  Typical SQL stats do not help

•  Need to measure data cardinalities starting from constants in the query

•  Need to sample fanout predicate by predicate, as needed

•  Predicate and class hierarchies are easy to handle in sampling

•  sameAs or IFP inference voids all guesses

•  Is hash join worthwhile? High setup cost means that one must be sure of cardinalities first

Deep Sampling

•  Everything is a join -> sampling must also do joins

•  As the candidate plan grows, the cost model executes all the ops on a sample of the data

•  Actual cardinality and locality are known, also when search conditions are correlated

•  Having high confidence in the cost model, hash join plans become safe and attractive

•  Even though there is an indexed access path for all, a scan can be better because it produces results in order. Need to be sure of selectivity before taking the risk
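The idea of a cost model that executes the join steps on a sample can be illustrated with a small Python sketch. It assumes a toy in-memory index of the form predicate -> subject -> list of objects and a hypothetical function name; the sampling and scaling scheme is a simple one chosen for illustration, not the optimizer's actual algorithm.

```python
import random

def estimate_chain_cardinality(index, start, predicates, sample_size=100, seed=0):
    """Estimate the row count of the join chain start -p1-> ?a -p2-> ?b ..."""
    rng = random.Random(seed)
    bindings = [start]
    scale = 1.0  # how many real rows each sampled row stands for
    for predicate in predicates:
        by_subject = index.get(predicate, {})
        # Execute this join step on the current sample of bindings.
        next_bindings = [obj for subj in bindings for obj in by_subject.get(subj, [])]
        if not next_bindings:
            return 0.0
        if len(next_bindings) > sample_size:
            # Keep a bounded sample and remember the scale factor.
            scale *= len(next_bindings) / sample_size
            next_bindings = rng.sample(next_bindings, sample_size)
        bindings = next_bindings
    return scale * len(bindings)

index = {
    "knows": {
        "alice": ["bob", "carol"],
        "bob": ["dave"],
        "carol": ["dave", "erin"],
    }
}
print(estimate_chain_cardinality(index, "alice", ["knows", "knows"]))
```

Because the sample follows the actual joins from the constant in the query, correlated conditions are reflected in the estimate instead of being multiplied as if independent.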

Elastic Cluster

•  Data is partitioned by key, different indices may have different partition keys

•  Partitions may split and migrate between servers

•  Partitions may be kept in duplicate for fault tolerance/load balancing

•  Actual access stats drive partition split and placement
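Key-based partitioning with a different partition key per index can be sketched as follows. The choice of CRC32 as hash, the partition key picked for each index, and all names are assumptions made for illustration; the slides do not say which column each index is partitioned on.

```python
import zlib

def partition_of(key, num_partitions):
    """Map a key to a logical partition with a stable hash (CRC32, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def place_quad(quad, num_partitions):
    """Return the logical partition each index would use for one (g, s, p, o) quad."""
    g, s, p, o = quad
    return {
        "PSOG": partition_of(s, num_partitions),  # assumed: partitioned on subject
        "POSG": partition_of(o, num_partitions),  # assumed: partitioned on object
    }

# More logical partitions than servers, so a single partition can migrate
# without rehashing any keys.
partition_to_server = {partition: partition % 2 for partition in range(8)}
partition_to_server[5] = 0  # migrate logical partition 5 from server 1 to server 0

quad = ("g1", "s1", "p1", "o1")
placement = place_quad(quad, 8)
print({name: partition_to_server[part] for name, part in placement.items()})
```

Keeping the key-to-partition hash separate from the partition-to-server table is what lets placement change in response to access statistics.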

Optimizing for Cluster

•  Vectored execution is natural in a cluster since single-tuple messages are not an option

•  Keep max ops in flight at all times, always send long messages

•  Fully distributed query coordination:

   ◦  Any node can service a client request. Correlated subqueries, stored procedures may execute anywhere, arbitrary parallelism and recursion between partitions

   ◦  On single shared memory box, cluster is approximately even with single process multithreading, low overhead

   ◦  1.8x more throughput in BSBM BI when going from 1 to 2 machines

   ◦  Distributed stored procedures, send the proc to the data, as in map-reduce, except that there are no limits on cross partition calling/recursion

   ◦  Choice of transactional and auto-commit update semantics, can have atomic ops without global transaction
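Sending long messages instead of single-tuple messages amounts to grouping a vector of operations by destination partition before anything goes on the wire. A minimal Python sketch, with a hypothetical function name and a modulo partition function standing in for the real partitioning:

```python
from collections import defaultdict

def batch_by_partition(keys, num_partitions, partition_fn):
    """Group a vector of lookup keys into one batch per destination partition."""
    batches = defaultdict(list)
    for key in keys:
        batches[partition_fn(key, num_partitions)].append(key)
    return dict(batches)

def modulo_partition(key, num_partitions):
    """Stand-in partition function for integer keys (an assumption)."""
    return key % num_partitions

keys = [1, 2, 3, 4, 5, 6]
batches = batch_by_partition(keys, 2, modulo_partition)
# Six lookups become two messages, one per partition.
print(batches)
```

With a vector of thousands of keys the number of messages stays bounded by the number of partitions, which is the property the slide relies on.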

Cluster Architecture Diagrams


LOD Cache

•  55 billion triples in LOD cache, only 384 GB of RAM, 2TB disk

•  2 x 384 GB of RAM, 4TB SSD

•  Most of Linked Open Data and Web Crawls

•  http://lod.openlinksw.com

•  http://lod.openlinksw.com/sparql
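A SPARQL endpoint such as the one listed for the LOD Cache is queried over HTTP with the query text passed in the standard `query` parameter of the SPARQL Protocol. The Python sketch below only builds the request and does not send it; the sample query and the function name are illustrative assumptions, and whether the endpoint is still reachable is not something this transcript can confirm.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

ENDPOINT = "http://lod.openlinksw.com/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"  # illustrative query

def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# To actually run the query (requires network access):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       body = response.read()
```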

Independent Benchmark Report from CWI: Berlin SPARQL Benchmark

#Triples     | Source File Size | Compressed Source File Size | Source Data Files Per Loader Node | Final Database File Size | Load Time
50 Billion   | 2.8 TB           | 240 GB                      | 30 GB                             | 1.8 TB                   | 10h 54s
150 Billion  | 8.5 TB           | 728 GB                      | 91 GB                             | 5.6 TB                   | n/a
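A few ratios can be derived from the load figures above. The Python sketch below does the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) since the slide does not say which convention is used; the dictionary layout and function name are hypothetical.

```python
TB = 10**12  # decimal units assumed
GB = 10**9

loads = {
    "50 Billion": {
        "triples": 50 * 10**9,
        "source": 2.8 * TB,
        "compressed": 240 * GB,
        "database": 1.8 * TB,
    },
    "150 Billion": {
        "triples": 150 * 10**9,
        "source": 8.5 * TB,
        "compressed": 728 * GB,
        "database": 5.6 * TB,
    },
}

def summarize(row):
    """Derive size ratios from one row of the load table."""
    return {
        "database_bytes_per_triple": row["database"] / row["triples"],
        "source_compression_ratio": row["source"] / row["compressed"],
        "database_to_source": row["database"] / row["source"],
    }

for name, row in loads.items():
    print(name, {key: round(value, 2) for key, value in summarize(row).items()})
```

Under these assumptions the final database comes to roughly 36 to 37 bytes per triple at both scales. That figure covers the whole database file, so it is not directly comparable with the per-index bytes-per-quad numbers on the Storage Engine slide, which exclude literals.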

Berlin SPARQL Benchmark

Store Comparisons Summary: Exploration oriented queries (QMpH)

Dataset sizes: 100 Million Triples, 200 Million Triples, 1 Billion Triples

Virtuoso 6: 37,678.319; 32,969.006; 8,984.789

Virtuoso 7: 47,178.820; 27,933.682

Berlin SPARQL Benchmark

Store Comparisons Summary: Business Intelligence oriented queries (QMpH)

Dataset sizes: 10 Million Triples, 100 Million Triples, 1 Billion Triples

Virtuoso 6: 431.465; 35.342; 2.383

Virtuoso 7: 996.795; 75.236

Berlin SPARQL Benchmark

Store Comparisons Summary: Exploration oriented queries (Cluster Edition) (QMpH)

            | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7  | 2,360.210          | 4,253.157          | 2,090.574

Berlin SPARQL Benchmark

Store Comparisons Summary: Business Intelligence oriented queries (Cluster Edition) (QMpH)

            | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7  | 13.078             | 0.964              | 0.285

Future Work

•  Complete deep sampling: enhanced query optimization plans

•  Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible

Additional Information

•  OpenLink Software

   ◦  OpenLink Software - www.openlinksw.com

   ◦  OpenLink Virtuoso - virtuoso.openlinksw.com

   ◦  Universal Data Access - uda.openlinksw.com

•  Social Media Data spaces

   ◦  http://virtuoso.openlinksw.com/blog/ (weblog)

   ◦  https://plus.google.com/112399767740508618350/posts (Google+)

   ◦  https://twitter.com/OpenLink (Twitter)

   ◦  http://www.linkedin.com/company/openlink-software (LinkedIn)

   ◦  Hashtag: #LinkedData (Anywhere)

EU-FP7 LOD2 WP6 – 25.-26.03.2013. Page 47 http://lod2.eu


LOD2 Stack Usability Survey 2013

http://www.surveygizmo.com/s3/1188229/LOD2-Stack-Usability-Survey-2013
