lod2 webinar series: virtuoso 7

LOD2 Webinar . 29.11.2011 . Page 1 http://lod2.eu

Creating Knowledge out of Interlinked Data

http://lod2.eu

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries, the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.


Once per month, the LOD2 webinar series offers a free webinar about tools and services along the Linked Open Data Life Cycle.

Stay with us and learn more about acquisition, editing, composing, connected applications, and finally publishing Linked Open Data.

Virtuoso 7.0: Enabling Massively Scalable Big Data Analytics for RDF & SQL Data Management

By Orri Erling, Virtuoso Program Manager & Hugh Williams, Professional Services Manager

Making Technology Work For You

Company Overview

OpenLink Company Overview

•  OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas:
    •  ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle, SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
    •  High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology
    •  Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)
    •  Web Application Server Technology
    •  Linked Data Deployment & Management
    •  Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
    •  Identity Management

Products & Services: Software Products

•  OpenLink Universal Data Access Drivers (UDA) - High-performance data access drivers for ODBC, JDBC, ADO.NET, and OLE DB that provide transparent access to enterprise databases.
•  OpenLink Virtuoso - available in single server and cluster editions that are deployed in cloud and/or enterprise modes.
•  OpenLink Data Spaces Platform and Applications
•  OpenLink Ajax Toolkit
•  OpenLink Data Explorer
•  An Open Source Data Access SDK for ODBC

All OpenLink products are delivered by download from the Internet (http, ftp, etc.). Temporary licenses are issued upon download and may be extended as needed, on a case-by-case basis. Permanent licenses are issued once payment is received.

Products & Services: Professional and Support Services

•  OpenLink Product Support provides front-line email and phone support, web-based online support, and a variety of premium services such as phone, emergency, and onsite support.

•  Our Support staff is comprised of individuals with extensive knowledge of data access, data migration, database administration, programming APIs, and other relevant skills.

•  Services are sold in either Standard "Bronze" or Premium "Platinum" Support packages, with varying hours of availability, response times, etc.

•  We also offer Custom Development, Training, and other Consultancy services. These services can be offered on- or off-site. Expenses for travel, accommodations, food, etc., associated with on-site services are charged separately.

Customers

OpenLink's installed base is in excess of 10,000 customers worldwide. Examples include:

•  Data.Gov (U.S. Govt. Open Linked Data initiative)
•  Verizon
•  Raytheon
•  Bank of America
•  CGI Federal
•  Elsevier
•  French National Library
•  Globo
•  Scottish Government
•  St Jude's Medical
•  Barclays Bank
•  Wells Fargo
•  and many more

Office Locations

USA OpenLink Software, Inc 10 Burlington Mall Road Suite 265 Burlington, MA 01803 Tel.: +1 781 273 0900 Fax: +1 781 229 8030

UK OpenLink Software Ltd. Airport House Purley Way Croydon, Surrey CR0 0XZ Tel.: +44 (0)20 8681 7701 Fax: +44 (0)20 8681 7702

Virtuoso Universal Server Overview

Situation Analysis

Data is growing exponentially along the following dimensions:

•  Volume
•  Velocity
•  Variety

All of this happens while the number of hours in a day remains 24.

Product Value Proposition

Enterprise and Individual Agility via Data Access, Integration, and Management, without compromising performance, scalability, security, and platform independence.

"Virtuoso locks you into an experience (openness, performance, and scale), not the platform itself." -- Kingsley Idehen, Founder & CEO, OpenLink Software

Product Architecture

A high-performance, scalable, secure, and operating-system-independent server designed to handle contemporary challenges associated with standards-compliant data access, data integration, and data management.

Data Virtualization Middleware

An in-built middleware layer (“Sponger”) for creating Transient & Persistent Views over Heterogeneous Data Sources.

Sophisticated Content Crawler

DBMS-hosted Content Crawler that leverages loosely coupled binding to the Sponger Middleware component for transformation of unstructured and semi-structured data into Linked Data.

Core Platform behind LOD Cloud

Core Platform (Graph DBMS and Linked Data Deployment) behind DBpedia, many bubbles in the LOD Cloud, and the LOD Cloud cache itself.

Virtuoso Linked Data projects

•  DBpedia - public SPARQL endpoint over the DBpedia data (and international Chapters)

•  LOD Cloud Cache - public server hosting LOD cloud datasets

•  URIBurner - Linked Data generation & transformation service

•  Linked Geo Data - OpenStreetMap Spatial data as Linked Data

•  Sindice - SPARQL endpoint behind its Semantic Web Index

•  Data.gov - US Government Linked Data

•  Health.data.gov - Clinical Quality Linked Data on health.data.gov

•  Seevl - Linked Data music discovery service

•  Bio2RDF - Life science data mapped to Linked Data

•  Neurocommons - Life science data mapped to Linked Data

•  Musicbrainz - MusicBrainz database published as Linked Data

•  Open PHACTS - DBpedia-like Linked Data Space for Pharma

•  Others - Many others …

Powerful Standards Support

ODBC compliance enables use of client applications (e.g. Microsoft Access) as front-ends for Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Powerful Standards Support Cont’d

ODBC & HTML5 compliance enables development of rich client apps that leverage the WebDB-ODBC bridge for accessing data across Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Insight Discovery & Exploration

Native Faceted Browsing that enables multi-dimensional drill-downs via any browser

Insight Discovery & Exploration

Microsoft Silverlight or HTML5 based PivotViewer Front-End for SPARQL and SPARQL-FED Queries

Powerful SPARQL Query Service

Basic SPARQL Endpoint for Creating Query Definitions & Sharing Query Results.

Example: health.data.gov data directly from a Web Browser.

Powerful SPARQL Query Builder

Use Query By Example (QBE) Patterns to Construct & Share Query Results.

How Do I Get Going?

•  Download, install, and experience the power of coherent integration of disparate data sources, data access protocols, and data representation formats.
•  In a nutshell, commence exploitation of powerful business intelligence, socially enhanced collaboration, data virtualization, and entity analytics without writing a line of code!
•  Turn "Big Data" into exploitable "Smart Data" without compromise!
•  Will be integrated into the next release of the LOD2 Stack

Virtuoso 7.0

Flexible Big Data Challenge

•  Data Agility is challenged by Volume, Velocity, and Variety
•  “Schema Last” is great - if the price is right
•  RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores
•  Inference may be good for integration, if it can express the right things, beyond OWL
•  RDF data management technology must learn from the lessons of SQL RDBMS; everything applies

Virtuoso 7.0 Mission Statement

Destruction of the following items as impediments to Big (Open) Linked Data exploitation:

•  Performance
•  Scalability
•  Platform Independence
•  Security & Privacy
•  Price

Virtuoso 7.0 & Big Data Myths

Myths put to rest:

•  Scalable Open Ended SPARQL Endpoints
•  Scalable Open Ended Read-Write SPARQL Endpoints
•  Fine-grained Access Controls underlying Read-Only or Read-Write endpoints

Virtuoso Column Store Features

•  Supports SQL and SPARQL query languages
•  Compact column-wise storage
•  Vectored execution of commands
•  Shared nothing scale out for clusters
•  Powerful procedure language with parallel, distributed control structures
•  Full-text and geospatial indexes

Storage Engine

•  Freely mix column-wise and row-wise indices
•  All SQL and RDF data types natively supported, single execution engine for SQL/SPARQL
•  Column compression 3x more space efficient than row-wise compression for RDF
•  Column stores are not only for big scans; random access surpasses rows as soon as there is some locality
•  9 B/quad with DBpedia, 7 B/quad with BSBM or RDF-H, 14 B/quad with web crawls (PSOG, POSG, SP, OP, GS, excluding literals)
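To see why column-wise storage can compress an RDF index well, consider quads sorted in PSOG order: the leading P column then consists of long runs of the same predicate id. The Python sketch below is only an illustration of that effect. Run-length encoding is a generic column compression technique chosen here as an assumption, and the integer ids and helper names are hypothetical; none of this describes Virtuoso's actual storage format.

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical quads as (P, S, O, G) integer ids, sorted in PSOG order.
quads = sorted([
    (1, 10, 100, 7), (1, 11, 101, 7), (1, 12, 100, 7),
    (2, 10, 200, 7), (2, 11, 201, 7), (2, 12, 202, 7),
])

# Column-wise layout: one list per column instead of one tuple per row.
p_col, s_col, o_col, g_col = (list(col) for col in zip(*quads))

p_runs = rle_encode(p_col)  # two runs instead of six values
g_runs = rle_encode(g_col)  # one run instead of six values
```

In a row-wise layout the repeated P and G values are interleaved with S and O, so runs like these do not form.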

Execution Engine

•  Vectoring is not only for column stores
•  Vectoring makes a random access into a linear merge join if there is any locality: always a win, mileage depends on run time factors
•  Vectoring eliminates interpretation overhead and makes CPU friendly code possible
•  Even with run time data typing, vectoring allows use of type-specific operators on homogeneous data, e.g. arithmetic
•  Dynamically adjust vector size: larger vector may not fit in cache but will get better locality for random access
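The claim that vectoring turns random access into a linear merge join can be sketched as follows. This is a minimal illustration, assuming a sorted in-memory list stands in for an index; the function names and data are hypothetical and do not reflect Virtuoso internals.

```python
from bisect import bisect_left

def lookup_single(sorted_keys, probe):
    """Tuple-at-a-time access: an independent binary search per probe."""
    i = bisect_left(sorted_keys, probe)
    return i < len(sorted_keys) and sorted_keys[i] == probe

def lookup_vectored(sorted_keys, probes):
    """Vectored access: sort the batch, then make one forward merge pass.

    Returns the set of probe values that occur in sorted_keys.
    """
    found = set()
    i = 0
    for probe in sorted(probes):
        # The cursor only moves forward, so probes that are close in
        # key order touch neighbouring index entries.
        while i < len(sorted_keys) and sorted_keys[i] < probe:
            i += 1
        if i < len(sorted_keys) and sorted_keys[i] == probe:
            found.add(probe)
    return found

index_keys = [2, 3, 5, 8, 13, 21, 34]  # hypothetical sorted index
batch = [21, 4, 3, 34, 8]              # a vector of lookups

hits = lookup_vectored(index_keys, batch)
```

Both functions find the same keys; the vectored form replaces many unrelated searches with a single ordered pass, which is where locality pays off.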

Graph operations

•  Run time computation plus caching instead of materialization
•  SPARQL/SQL extension for arbitrary transitive subqueries:
•  Flexible options for returning shortest paths, all paths, all / distinct reachable, attributes of steps on paths etc.
•  Efficient execution, searching the graph from both ends if looking for a path with ends given
•  Query operators for RDF hierarchy traversal
•  Special query operator for OWL sameAs and IFP based identity
•  Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
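Searching from both ends when both endpoints of a path are given can be illustrated with a bidirectional breadth-first search. The sketch below is a generic textbook version under stated assumptions (an in-memory edge list, directed edges, hypothetical node names); it is not the SPARQL/SQL transitive extension itself.

```python
from collections import defaultdict

def bidirectional_path_length(edges, start, goal):
    """Return the edge count of a shortest directed path, or None."""
    if start == goal:
        return 0
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)

    # Distances discovered so far from each end of the path.
    dist_f, dist_b = {start: 0}, {goal: 0}
    frontier_f, frontier_b = {start}, {goal}

    while frontier_f and frontier_b:
        # Expand one whole level of the smaller frontier.
        expand_forward = len(frontier_f) <= len(frontier_b)
        if expand_forward:
            frontier, dist, other, adj = frontier_f, dist_f, dist_b, forward
        else:
            frontier, dist, other, adj = frontier_b, dist_b, dist_f, backward
        next_frontier = set()
        for node in frontier:
            for neighbor in adj.get(node, ()):
                if neighbor in other:
                    # The two searches meet: join the half paths.
                    return dist[node] + 1 + other[neighbor]
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    next_frontier.add(neighbor)
        if expand_forward:
            frontier_f = next_frontier
        else:
            frontier_b = next_frontier
    return None

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]
length = bidirectional_path_length(edges, "a", "d")
```

Each side explores only about half of the path depth, which usually visits far fewer nodes than a one-sided search on graphs with high fanout.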

Query Optimization Challenges

•  Typical SQL stats do not help
•  Need to measure data cardinalities starting from constants in the query
•  Need to sample fanout predicate by predicate, as needed
•  Predicate and class hierarchies are easy to handle in sampling
•  sameAs or IFP inference voids all guesses
•  Is hash join worthwhile? High setup cost means that one must be sure of cardinalities first

Deep Sampling

•  Everything is a join -> sampling must also do joins
•  As the candidate plan grows, the cost model executes all the ops on a sample of the data
•  Actual cardinality and locality are known, also when search conditions are correlated
•  Having high confidence in the cost model, hash join plans become safe and attractive
•  Even though there is an indexed access path for all, a scan can be better because it produces results in order. Need to be sure of selectivity before taking the risk
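The idea of estimating cardinality by actually running a join on a sample can be sketched in a few lines. This is a simplified illustration, assuming a two-pattern join of the form `?a first_pred ?b . ?b second_pred ?c` over an in-memory triple list; the function names, the sampling scheme, and the data are hypothetical and much cruder than a real cost model.

```python
import random
from collections import defaultdict

def build_index(triples):
    """Index triples as predicate -> subject -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        index[p][s].append(o)
    return index

def estimate_join_cardinality(index, first_pred, second_pred,
                              sample_size, seed=0):
    """Estimate the result size of a two-step join from a sample."""
    first_rows = [(s, o)
                  for s, objs in index[first_pred].items()
                  for o in objs]
    if not first_rows:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(first_rows, min(sample_size, len(first_rows)))
    # Execute the second join step on the sampled rows. The measured
    # fanout reflects any correlation between the two patterns.
    matches = sum(len(index[second_pred].get(b, ())) for _, b in sample)
    return matches * len(first_rows) / len(sample)

triples = [
    ("alice", "knows", "bob"), ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
    ("carol", "livesIn", "paris"),
    ("bob", "livesIn", "berlin"), ("bob", "livesIn", "rome"),
]
index = build_index(triples)
exact = estimate_join_cardinality(index, "knows", "livesIn", sample_size=3)
rough = estimate_join_cardinality(index, "knows", "livesIn", sample_size=1)
```

With the sample covering every row the estimate equals the true join size; smaller samples trade accuracy for cost, which is the tension a sampling optimizer has to manage.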

Elastic Cluster

•  Data is partitioned by key, different indices may have different partition keys
•  Partitions may split and migrate between servers
•  Partitions may be kept in duplicate for fault tolerance/load balancing
•  Actual access stats drive partition split and placement
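Partitioning by key, with a different partition key per index, might look like the following sketch. The choice of subject for the PSOG index and object for the POSG index is an assumption made for illustration (the slides do not state which columns are used), as are the partition count, the hash function, and all names. Partition splitting and migration are not modeled.

```python
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical cluster size

def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (crc32)."""
    return crc32(str(key).encode("utf-8")) % num_partitions

# Assumed for illustration: each index is partitioned on a different
# column of the quad.
PARTITION_COLUMN = {"PSOG": "s", "POSG": "o"}

def place(quad, index_name):
    """Return the partition holding this quad's entry in one index."""
    return partition_of(quad[PARTITION_COLUMN[index_name]])

quad = {"p": "foaf:knows", "s": "ex:alice", "o": "ex:bob", "g": "ex:graph"}
psog_partition = place(quad, "PSOG")
posg_partition = place(quad, "POSG")
```

The same quad can therefore live on different servers depending on the index, so a lookup by subject and a lookup by object each go straight to a single partition.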

Optimizing for Cluster

•  Vectored execution is natural in a cluster since single-tuple messages are not an option
•  Keep max ops in flight at all times, always send long messages
•  Fully distributed query coordination:
    •  Any node can service a client request. Correlated subqueries, stored procedures may execute anywhere, arbitrary parallelism and recursion between partitions
    •  On single shared memory box, cluster is approximately even with single process multithreading, low overhead
    •  1.8x more throughput in BSBM BI when going from 1 to 2 machines
    •  Distributed stored procedures, send the proc to the data, as in map-reduce, except that there are no limits on cross partition calling/recursion
    •  Choice of transactional and auto-commit update semantics, can have atomic ops without global transaction
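The point about always sending long messages can be made concrete with a small batching sketch: instead of one message per key, a vector of lookups is grouped by target partition so that each partition receives a single message. The placement rule and all names below are hypothetical.

```python
from collections import defaultdict

def batch_by_partition(keys, partition_of):
    """Group lookup keys into one batch (message) per target partition."""
    batches = defaultdict(list)
    for key in keys:
        batches[partition_of(key)].append(key)
    return dict(batches)

def toy_partition(key, num_partitions=3):
    """Hypothetical placement rule for integer keys."""
    return key % num_partitions

keys = [1, 2, 3, 4, 5, 6, 7]
messages = batch_by_partition(keys, toy_partition)
```

Seven single-key requests collapse into three messages here, and the ratio improves as the vector grows.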

Cluster Architecture Diagrams

LOD Cache

•  55 billion triples in LOD cache, only 384 GB of RAM, 2 TB disk
•  2 x 384 GB of RAM, 4 TB SSD
•  Most of Linked Open Data and Web Crawls
•  http://lod.openlinksw.com
•  http://lod.openlinksw.com/sparql
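A query against the public endpoint listed for the LOD Cache can be prepared with the standard library alone. The sketch below only builds the request URL and does not send it; it assumes the endpoint accepts the standard SPARQL Protocol GET form with a `query` parameter, and the sample query and function name are illustrative. Whether the endpoint is reachable today is not something this sketch checks.

```python
from urllib.parse import urlencode

ENDPOINT = "http://lod.openlinksw.com/sparql"  # endpoint named on the slide

def build_sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})

# Illustrative query: fetch a handful of arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

url = build_sparql_get_url(ENDPOINT, QUERY)
```

The resulting URL can be pasted into a browser or passed to any HTTP client; the desired result format is normally negotiated through the `Accept` request header.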

Independent Benchmark Report from CWI: Berlin SPARQL Benchmark

#Triples    | Source File Size | Compressed Source File Size | Source Data Files Per Loader Node | Final Database File Size | Load Time
50 Billion  | 2.8 TB           | 240 GB                      | 30 GB                             | 1.8 TB                   | 10h 54s
150 Billion | 8.5 TB           | 728 GB                      | 91 GB                             | 5.6 TB                   | n/a

Store Comparisons Summary:

Exploration oriented queries (QMpH)
Dataset sizes: 100 Million Triples; 200 Million Triples; 1 Billion Triples
Virtuoso 6: 37,678.319; 32,969.006; 8,984.789
Virtuoso 7: 47,178.820; 27,933.682

Business Intelligence oriented queries (QMpH)
Dataset sizes: 10 Million Triples; 100 Million Triples; 1 Billion Triples
Virtuoso 6: 431.465; 35.342; 2.383
Virtuoso 7: 996.795; 75.236

Exploration oriented queries (Cluster Edition) (QMpH)

           | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7 | 2,360.210          | 4,253.157          | 2,090.574

Business Intelligence oriented queries (Cluster Edition) (QMpH)

           | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7 | 13.078             | 0.964              | 0.285

Future Work

•  Complete deep sampling: enhanced query optimization plans
•  Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible

Additional Information

•  OpenLink Software
    •  OpenLink Software - www.openlinksw.com
    •  OpenLink Virtuoso - virtuoso.openlinksw.com
    •  Universal Data Access - uda.openlinksw.com
•  Social Media Data spaces
    •  http://virtuoso.openlinksw.com/blog/ (weblog)
    •  https://plus.google.com/112399767740508618350/posts (Google+)
    •  https://twitter.com/OpenLink (Twitter)
    •  http://www.linkedin.com/company/openlink-software (LinkedIn)
    •  Hashtag: #LinkedData (Anywhere)

EU-FP7 LOD2 WP6 – 25.-26.03.2013. Page 47 http://lod2.eu


LOD2 Stack Usability Survey 2013

http://www.surveygizmo.com/s3/1188229/LOD2-Stack-Usability-Survey-2013
