Test Sets: LUBM, SP2B, Barton Set, Billion Triple Challenge

Varsha Dubey


Page 1: Test Sets: LUBM, SP2B, Barton Set, Billion Triple Challenge

Varsha Dubey

Page 2: Agenda

General introduction of all the test sets

LUBM: Introduction, LUBM for OWL, LUBM Benchmark, Summary

SP2B: Introduction, A SPARQL Performance Benchmark, Summary

Barton Data Set: Introduction, Barton Data Set as RDF Benchmark, Summary

Billion Triple Challenge: Introduction, Solutions Proposed, Summary

Page 3: LUBM - Introduction

LUBM: the Lehigh University Benchmark, a benchmark for OWL knowledge base systems.

Need: how does one choose an appropriate KBS for a large OWL application? Two basic requirements for such an application:

• An enormous amount of data, where scalability and efficiency become crucial.
• Sufficient reasoning capability to support the semantic requirements of the system.

We need Semantic Web data that cover a large range and commit to semantically rich ontologies. Note that increased reasoning capability means increased processing time and query response time.

Best approach: extensional queries over a large dataset that commits to a single ontology of moderate complexity and size - LUBM.

Page 4: LUBM for OWL

A university benchmark for OWL. LUBM design goals:

Support extensional queries. Extensional queries are queries about the instance data over ontologies, as opposed to intensional queries (queries about classes and properties). The majority of Semantic Web applications will want to use data to answer questions, and reasoning about subsumption will typically be a means to an end, not an end in itself. It is therefore important to have benchmarks that focus on this kind of query.

Page 5: LUBM for OWL

Arbitrary scaling of data. In order to evaluate the ability of systems to handle large data, we need to be able to vary the size of the data and see how the system scales.

Ontology of moderate size and complexity. Existing DL benchmarks have looked at reasoning with large and complex ontologies, while various RDF systems have been evaluated against various RDF schemas. It is important to have a benchmark that falls between these two extremes. Furthermore, since the focus is on data, the ontology should not be too large.

Page 6: LUBM for OWL

LUBM overview: the benchmark is based on an ontology for the university domain. Its test data are synthetically generated instance data over that ontology; they are random and repeatable and can be scaled to an arbitrary size. The benchmark offers fourteen test queries over the data, and it also provides a set of performance metrics used to evaluate the system.

Page 7: LUBM Benchmark

Benchmark ontology: Univ-Bench. Univ-Bench describes universities and departments and the activities that occur at them.

The ontology is expressed in OWL Lite, the simplest sublanguage of OWL.

The ontology currently defines 43 classes and 32 properties (including 25 object properties and 7 datatype properties).

It uses OWL Lite language features including inverseOf, TransitiveProperty, someValuesFrom restrictions, and intersectionOf; a small illustration follows.
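To make the listed constructs concrete, here is a minimal sketch using the Python rdflib library. The ex: namespace and the property names subOrganizationOf, memberOf, and member are illustrative stand-ins rather than the exact Univ-Bench terms; the someValuesFrom and intersectionOf restrictions are omitted for brevity.

```python
# Sketch of OWL Lite constructs of the kind Univ-Bench uses
# (illustrative names, not the actual ontology).
from rdflib import Graph, Namespace, RDF, OWL

EX = Namespace("http://example.org/univ#")
g = Graph()
g.bind("ex", EX)

# A transitive property: sub-organizations of sub-organizations
# are themselves sub-organizations.
g.add((EX.subOrganizationOf, RDF.type, OWL.TransitiveProperty))
# An inverse property pair: membership seen from both directions.
g.add((EX.memberOf, OWL.inverseOf, EX.member))

print(g.serialize(format="turtle"))
```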

Page 8: LUBM Benchmark

Data generation and OWL datasets: the test data of the LUBM are extensional data created over the Univ-Bench ontology.

Data generation is carried out by UBA (the Univ-Bench Artificial data generator), a tool developed for the benchmark. The generator features random and repeatable data generation, and instances of both classes and properties are randomly decided.

To make the data as realistic as possible, some restrictions are applied based on common sense and domain investigation. Example restrictions:

• "a minimum of 15 and a maximum of 25 departments in each university"
• "an undergraduate student/faculty ratio between 8 and 14 inclusive"
• "each graduate student takes at least 1 but at most 3 courses"

The generator identifies universities by assigning them zero-based indexes, i.e., the first university is named University0, and so on; a toy sketch of this scheme follows.
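Below is a toy sketch of this kind of random-but-repeatable generation; it is not the actual UBA implementation, only an illustration of seeding and of the department restriction quoted above.

```python
# Toy sketch of repeatable data generation in the spirit of UBA.
import random

def generate_university(index: int, seed: str = "uba") -> dict:
    # Deriving the RNG from a fixed seed plus the index makes the
    # output random yet repeatable across runs.
    rng = random.Random(f"{seed}-{index}")
    n_depts = rng.randint(15, 25)   # 15..25 departments per university
    return {
        "name": f"University{index}",   # zero-based naming, as above
        "departments": [f"Department{d}" for d in range(n_depts)],
    }

u = generate_university(0)
print(u["name"], "has", len(u["departments"]), "departments")
```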

Page 9: LUBM Benchmark

Test queries: the LUBM currently offers fourteen test queries, one more than when it was originally developed. They are written in SPARQL [28], the query language that is poised to become the standard for RDF. A sample query in this style is shown after the list below.

Factors taken into consideration:

Input size: measured as the proportion of the class instances involved in the query to the total class instances in the benchmark data.

Selectivity: measured as the estimated proportion of the class instances involved in the query that satisfy the query criteria. Whether the selectivity is high or low for a query may depend on the dataset used.

Complexity: the number of classes and properties involved in the query serves as an indication of complexity. Since no specific repository implementation is assumed, the real degree of complexity may vary across systems and schemata.

Assumed hierarchy information: whether information from the class hierarchy or property hierarchy is required to achieve the complete answer.

Assumed logical inference: whether logical inference is required to achieve the completeness of the answer. Features used in the test queries include subsumption (inference of an implicit subclass relationship), owl:TransitiveProperty, owl:inverseOf, and realization (inference of the most specific concepts that an individual is an instance of).
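The sample query mentioned above: a sketch, runnable with the Python rdflib library, of an extensional query in the LUBM style. The ub: namespace URI, the data file name, and the course IRI are assumptions for illustration, not the official benchmark artifacts.

```python
# Run a LUBM-style extensional query over one generated department file.
from rdflib import Graph

g = Graph()
g.parse("University0_0.owl", format="xml")  # assumed file name

q = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub:  <http://example.org/univ-bench.owl#>
SELECT ?x WHERE {
  ?x rdf:type ub:GraduateStudent .
  ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
}
"""
for row in g.query(q):
    print(row.x)
```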

Page 10: LUBM Benchmark

Performance metrics:

Load time: in a LUBM dataset, every university contains 15 to 25 departments, each described by a separate OWL file. These files are loaded into the target system incrementally. Load time is measured as the standalone elapsed time for storing the specified dataset into the system. This also counts the time spent on any processing of the ontology and source files, such as parsing and reasoning.

Repository size: the resulting size of the repository after loading the specified benchmark data into the system. Size is only measured for systems with persistent storage, and is calculated as the total size of all files that constitute the repository.

Query response time: measured following the process used in database benchmarks. To account for caching, each query is executed ten times consecutively and the average time is computed (see the sketch below).

Query completeness and soundness: the completeness and soundness of each system's query answers are also examined. The degree of completeness of a query answer is measured as the percentage of the entailed answers that are returned by the system. Note that the result set is required to contain unique answers.
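The averaging procedure for query response time is simple enough to show directly. A minimal sketch in Python, where run_query is a placeholder for whatever query API the system under test exposes:

```python
# Average elapsed time over ten consecutive executions of a query,
# as described for the query-response-time metric above.
import time

def average_response_time(run_query, query: str, runs: int = 10) -> float:
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query)   # execute against the system under test
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / runs
```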

Page 11: LUBM Benchmark

Benchmark architecture: the test module requests operations on the repository (open/close), launches the loading process, and issues queries and obtains results through an interface (shown in a figure on the original slide).

Target systems and test queries are defined in the KBS specification and query definition files.

Queries are translated to the target query language before being issued to the system, to reduce query response time.

Translated queries are fed to the tester through the query definition file. The tester reads each line from the query definition file and passes it to the system.

Page 12: LUBM Summary

Summary: in the LUBM, the Univ-Bench ontology models the university domain in the OWL language and offers the necessary features for evaluation purposes.

The OWL datasets are synthetically created over the ontology. The data generated are random and repeatable, and can scale to an arbitrary size.

Fourteen test queries are chosen to represent a variety of properties, including input size, selectivity, complexity, assumed hierarchy information, and assumed logical inference, amongst others.

A set of performance metrics is provided, which includes load time, repository size, query response time, query completeness and soundness, and a combined metric for evaluating overall query performance.

The LUBM is intended to be used to evaluate Semantic Web KBSs with respect to extensional queries over a large dataset that commits to a single realistic ontology.

Conclusion: the LUBM is not meant to be an overall Semantic Web KBS benchmark. It is a benchmark limited to the particular domain represented by the ontology it uses.

Page 13: SP2B Introduction

SP2B: a SPARQL performance benchmark.

Need: the SPARQL query language for RDF has recently reached W3C recommendation status. In response to this emerging standard, the database community is currently exploring efficient storage techniques for RDF data and evaluation strategies for SPARQL queries. A meaningful analysis and comparison of these approaches necessitates a comprehensive and universal benchmark platform.

The Lehigh University Benchmark (LUBM) was designed with a focus on the inference and reasoning capabilities of RDF engines. However, the SPARQL specification disregards the semantics of RDF and RDFS, i.e., it does not involve automated reasoning on top of RDFS constructs such as subclass and subproperty relations.

In this regard, LUBM does not constitute an adequate scenario for SPARQL performance evaluation. This is underlined by the fact that central SPARQL operators, such as UNION and OPTIONAL, are not addressed in LUBM (an example of both appears below).
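For readers unfamiliar with these operators, here is an illustrative query (runnable with the Python rdflib library) combining UNION and OPTIONAL; the data file name and the exact pattern are assumptions for illustration.

```python
# UNION merges alternative patterns; OPTIONAL has left-outer-join
# semantics, so ?mbox stays unbound when no mailbox is present.
from rdflib import Graph

g = Graph()
g.parse("data.ttl", format="turtle")  # any RDF document (assumed name)

q = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
SELECT ?name ?mbox WHERE {
  { ?doc dc:creator ?person } UNION { ?doc foaf:maker ?person }
  ?person foaf:name ?name .
  OPTIONAL { ?person foaf:mbox ?mbox }
}
"""
for row in g.query(q):
    print(row.name, row.mbox)
```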

Page 14: SP2B Introduction

SP2B overview: a language-specific benchmark framework, specifically designed to test the most common SPARQL constructs, operator constellations, and a broad range of RDF data access patterns.

SP2Bench aims at a comprehensive performance evaluation, rather than assessing the behavior of engines in an application-driven scenario. It makes it possible to assess the generality of optimization approaches and to compare them in a universal, application-independent setting.

Page 15: SP2B - SPARQL Performance Benchmark

Benchmarking context: the Barton Library benchmark [19] queries implement a user browsing session through the RDF Barton online catalog. By design, that benchmark is application-oriented.

All of its queries are encoded in SQL, assuming that the RDF data is stored in a relational DB. Due to missing language support for aggregation in SPARQL, most of these queries cannot be translated into SPARQL. On the other hand, central SPARQL features like left outer joins (the relational equivalent of the SPARQL operator OPTIONAL) and solution modifiers are missing from the benchmark.

In summary, the Barton benchmark offers only limited support for testing native SPARQL engines.

(The original slide also lists the benchmark queries.)

Page 16: SP2B - SPARQL Performance Benchmark

Design principles:

Relevant: test typical operations within the specific domains. This means the benchmark should not focus on correctness verification, but on common operator constellations that impose particular challenges.

Portable: the benchmark should be executable on different platforms.

Scalable: it should be possible to run the benchmark on both small and very large data sets. The data generator is deterministic, platform independent, and accurate w.r.t. the desired size of generated documents. Moreover, it is very efficient and gets by with a constant amount of main memory, and hence supports the generation of arbitrarily large RDF documents.

Understandable: it is important to keep queries simple and understandable. At the same time, they should leave room for diverse optimizations. In this regard, the queries are designed in such a way that they are amenable to a wide range of optimization strategies.

Page 17: SP2B

DBLP (Digital Bibliography & Library Project): DBLP is a computer science bibliography website hosted at Universität Trier, in Germany. It was originally a database and logic programming bibliography site, and has existed at least since the 1980s.

DBLP listed more than one million articles on computer science in March 2008. Journals tracked on this site include VLDB (a journal for very large databases), the IEEE Transactions, and the ACM Transactions. Conference proceedings papers are also tracked. It is mirrored at five sites across the Internet.

For his work on maintaining DBLP, Michael Ley received an award from the Association for Computing Machinery and the VLDB Endowment Special Recognition Award in 1997.

DBLP originally stood for DataBase systems and Logic Programming, but is now taken to stand for Digital Bibliography & Library Project.

Page 18: SP2B

DBLP RDF scheme in SP2B:

• An XML-to-RDF mapping of the original DBLP data set.
• However, since arbitrarily-sized documents must be generated, lists of first and last names, publishers, and random words are provided to the data generator.
• Conference and journal names are always of the form "Conference $i ($year)" and "Journal $i ($year)", where $i is a unique conference (respectively journal) number in the year $year (a tiny sketch of this scheme follows).
• Vocabulary is borrowed from FOAF, SWRC, and Dublin Core (DC) to describe persons and scientific resources.
• Additionally, a namespace bench is introduced, which defines DBLP-specific document classes such as bench:Book and bench:Article.
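The naming scheme from the list above, as a tiny illustrative sketch (not the SP2Bench generator code):

```python
# Venue names are "Conference $i ($year)" or "Journal $i ($year)",
# with $i unique within $year.
def venue_name(kind: str, i: int, year: int) -> str:
    return f"{kind} {i} ({year})"

print(venue_name("Conference", 1, 2005))  # Conference 1 (2005)
print(venue_name("Journal", 3, 2005))     # Journal 3 (2005)
```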

Page 19: SP2B

Data generator: data generation is incremental, i.e., small documents are always contained in larger documents.

The generator is implemented in C++ and offers two parameters, fixing either a triple count limit or the year up to which data will be generated. When the triple count limit is set, the generator makes sure to end up in a "consistent" state, e.g., when proceedings are written, their conference will also be included.

All random functions (which, for example, are used to assign the attributes according to Table I) are based on a fixed seed, which makes data generation deterministic (see the sketch below). Moreover, the implementation is platform-independent, which ensures that experimental results from different machines are comparable.
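The fixed-seed determinism mentioned above can be sketched as follows (in Python for brevity; the actual SP2Bench generator is C++, and the attribute logic here is invented for illustration):

```python
# A fixed seed makes the stream of random choices, and hence the
# generated data, identical on every run.
import random

FIXED_SEED = 42  # illustrative constant

def generate_triples(limit: int):
    rng = random.Random(FIXED_SEED)
    triples = []
    while len(triples) < limit:          # stop at the triple count limit
        year = rng.randint(1980, 2008)   # invented attribute assignment
        triples.append((f"ex:article{len(triples)}", "ex:year", year))
    return triples

assert generate_triples(5) == generate_triples(5)  # deterministic
```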

Page 20: SP2B - Summary

We have presented the SP2Bench performance benchmark for SPARQL, which constitutes the first methodical approach to testing the performance of SPARQL engines w.r.t. different operator constellations, RDF access paths, typical RDF constructs, and a variety of possible optimization approaches.

The data generator relies on a deep study of DBLP.

Although it is not possible to mirror all correlations found in the original DBLP data, many aspects are modeled in faithful detail, and the queries are designed in such a way that they build on exactly those aspects, which makes them realistic, understandable, and predictable.

Page 21: Barton Data Set - Introduction

What is the Barton Data Set? The dataset is taken from the publicly available Barton Libraries dataset. The data is provided by the Simile Project, which develops tools for library data management and interoperability.

The data contains records acquired from an RDF-formatted dump of the MIT Libraries Barton catalog, converted from raw data stored in an old library format standard called MARC (Machine-Readable Cataloging).

Because of the multiple sources the data was derived from and the diverse nature of the data that is cataloged, the structure of the data is quite irregular.

At the time of publication of the report, there are slightly more than 50 million triples in the dataset, with a total of 221 unique properties, of which the vast majority appear infrequently.

Of these properties, 82 (37%) are multi-valued, meaning that they appear more than once for a given subject; nonetheless, these properties account for most of the data (77% of the triples have a multi-valued property). A sketch of how such statistics can be computed follows.
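Below is a sketch of how such statistics could be computed from a triple dump; the tab-separated layout and file name are assumptions, not the actual Barton distribution format.

```python
# Count unique properties and detect multi-valued ones (a property
# that appears more than once for the same subject).
from collections import defaultdict

occurrences = defaultdict(int)   # (subject, property) -> count
properties, multi_valued = set(), set()

with open("barton_triples.tsv") as f:   # assumed file layout
    for line in f:
        subj, prop, _obj = line.rstrip("\n").split("\t", 2)
        properties.add(prop)
        occurrences[(subj, prop)] += 1
        if occurrences[(subj, prop)] > 1:
            multi_valued.add(prop)

print(len(properties), "unique properties,",
      len(multi_valued), "of them multi-valued")
```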

Page 22: Barton Data Set as RDF Benchmark

Barton dataset use as an RDF benchmark: the data set can be converted to RDF/XML or triple formats using suitable tools, and then used as a performance benchmark for a KBS.

Example use: "Scalable Semantic Web Data Management Using Vertical Partitioning", where the Barton data set was converted to triples and used as a performance benchmark to demonstrate the performance of vertically partitioned Semantic Web data (the idea is sketched below).
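A sketch of the vertical partitioning idea: instead of one giant (subject, property, object) table, keep one two-column (subject, object) table per property. SQLite is used here purely for illustration; the cited work used a column-oriented store.

```python
# Store each property in its own two-column table.
import sqlite3

def vertically_partition(triples, conn):
    for subj, prop, obj in triples:
        # Derive a safe table name from the property (illustrative only).
        table = "prop_" + "".join(c if c.isalnum() else "_" for c in prop)
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (subject TEXT, object TEXT)")
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (subj, obj))

conn = sqlite3.connect(":memory:")
vertically_partition([("s1", "title", "RDF Primer"),
                      ("s1", "creator", "W3C")], conn)
print(conn.execute("SELECT * FROM prop_title").fetchall())
```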

Page 23: Barton Dataset

Summary: the Barton data set is a huge data set of the MIT Libraries catalog that can be used as a performance benchmark for Semantic Web data systems.

The dataset provides a good demonstration of the relatively unstructured nature of Semantic Web data.

Page 24: Billion Triple Challenge - Introduction

What is the Billion Triple Challenge? Peter Mika (Yahoo!) and Jim Hendler (RPI) initiated the Billion Triples Challenge at the 7th International Semantic Web Conference.

They constructed the challenge of managing a huge amount of over one billion ill-structured facts, harvested from public sources such as Wikipedia and semantic home pages, and making this information and its relationships available for easy access and intuitive interaction by the lay user.

Page 25: Billion Triple Challenge - Solutions

The problem: managing a huge amount of over one billion ill-structured facts, harvested from public sources such as Wikipedia and semantic home pages, and making this information and its relationships available for easy access and intuitive interaction by the lay user.

General solution overview/requirements:

• Huge data and limited memory.
• Efficient data store: no redundancies.
• Efficient access: easy and fast (one common technique is sketched below).
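One common technique toward these requirements (not prescribed by the challenge itself) is dictionary encoding: store every IRI or literal once and represent triples as compact integer ID tuples, removing redundancy and shrinking memory use.

```python
# Minimal dictionary encoder: terms are stored once; triples become
# small integer tuples.
class Dictionary:
    def __init__(self):
        self._ids = {}     # term -> id
        self._terms = []   # id -> term

    def encode(self, term: str) -> int:
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id: int) -> str:
        return self._terms[term_id]

d = Dictionary()
triple = tuple(d.encode(t)
               for t in ("ex:Berlin", "ex:locatedIn", "ex:Germany"))
print(triple, "->", [d.decode(i) for i in triple])
```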

Page 26: Billion Triple Challenge

Semantics scales in the cloud: the University of Koblenz won the challenge with SemaPlorer, an application.

SemaPlorer is an easy-to-use application that allows end users to interactively explore and visualize a very large, mixed-quality, and semantically heterogeneous distributed semantic data set in real time. Its purpose is to let users acquaint themselves with a city, touristic area, or other area of interest.

By visualizing the data using a map, media, and different context views, it clearly goes beyond simple storage and retrieval of large numbers of triples. The interaction with the large data set is driven by the user.

SemaPlorer leverages different semantic data sources such as DBpedia, GeoNames, WordNet, and personal FOAF files. These make up a significant portion of the data provided for the Billion Triple Challenge.

More info at http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/Research/systeme/semap

Page 27: Summary

Test sets: huge data sets, either in RDF/triple or another format, that a KBS can use as an RDF store to check the performance of the system while loading and querying/accessing such huge data sets.

• LUBM and SP2B are more like benchmarking standards.
• The Barton Data Set is a huge library data set.
• The Billion Triple Challenge is a challenge to design and develop KBSs efficient enough to handle a billion triples.

Page 28: Thank You