Producing, Publishing and Consuming Linked Data: Three lessons from the Bio2RDF project



DESCRIPTION

How to produce RDF using Talend, how to publish it with Virtuoso, and how to consume Linked Data...

TRANSCRIPT

Page 1: Producing, Publishing and Consuming Linked Data: Three lessons from the Bio2RDF project

Lesson #1: RDFize data using ETL software like Talend.

Background

With the proliferation of new online databases, data integration continues to be one of the major unsolved problems for bioinformatics. In spite of initiatives like BioPAX [1], BioMart [2], and the integrated web resources of the EBI, KEGG and NCBI, the web of bioinformatics databases is still a web of independent data silos.

Since 2005, the aim of the Bio2RDF project has been to make popular public datasets available in RDF, the data description format of the growing Semantic Web. Initially, data from OMIM, KEGG and Entrez Gene, along with numerous other resources, was converted to RDF. Currently, 38 SPARQL endpoints are available from the Bio2RDF server [3].

In 2009, the Bio2RDF project was the primary source of bioinformatics data in the Linked Data cloud. Today, many organisations have started to publish their datasets or knowledge bases using the RDF/SPARQL standards. GO, UniProt and Reactome were early converts to publishing in RDF; most recently, PDBJ, KEGG and NCBO have started to publish their own data in this new semantic way. From the data integration perspective, projects like BioLOD [4] from the RIKEN Institute and Linked Life Data [5] from Ontotext have pushed the Semantic Web model close to production-quality service. The Linked Data cloud of bioinformatics is now growing rapidly [6]. The technology incubation phase is over.

One question data providers should ask themselves now is: how costly is it to produce and publish data in RDF according to this new paradigm? And, from the point of view of the bioinformatician consuming data: how useful can Semantic Web technologies be for building the data mashups needed to support specific knowledge discovery tasks and the needs of domain experts?

These are the questions we answer here by proposing methods for producing, publishing and consuming RDF data, and by sharing the lessons we have learnt while building Bio2RDF.

Producing RDF

RDF is all about triples: building triples, storing triples and querying triples. A triple follows the subject-predicate-object model; if you have used a key-value table before, you already know what triples are. A collection of triples defines a graph, a representation so generic that any data can be expressed with it. Every kind of data can be converted into triples from all common formats: HTML, XML, relational databases, column tables or key-value representations. Converting data to RDF is so central to building the Semantic Web that it has given rise to new verbs: to triplify, or to RDFize! Building the Bio2RDF rdfizers, we had to deal with all of these formats and sources.
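
As a minimal illustration (not Bio2RDF's actual rdfizer, and with hypothetical identifiers and property URIs), here is how a simple key-value record maps onto subject-predicate-object triples in N-Triples syntax, sketched in Python:

    # A key-value record about one resource; the URIs and values are
    # hypothetical, used only to illustrate the subject-predicate-object model.
    record_uri = "http://bio2rdf.org/example:0001"      # the subject
    record = {
        "http://example.org/vocab#label": "example label",        # predicate -> object
        "http://example.org/vocab#category": "example category",
    }

    # Each key-value pair becomes one triple: <subject> <predicate> "object" .
    for predicate, value in record.items():
        print(f'<{record_uri}> <{predicate}> "{value}" .')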

Lesson #1: Transforming data into RDF is an ETL (Extract, Transform, Load) task, and there are now free, professional-grade frameworks available for this purpose.

Talend [7] is a first-class ETL framework, based on Eclipse, that generates native Java code from a graphical representation of the data transformation workflow. Using this professional-quality software to RDFize data is much more productive than writing Java, Perl or PHP scripts, as we used to do in the past.

To build the namespace SPARQL endpoint at Bio2RDF [8], an RDF mashup composed of the GO, UniProt, LSRN, GenBank, MIRIAM and Bio2RDF namespace descriptions, we generated RDF from XML, HTML, key/value files, tabular files and an RDF dump. Using the Talend ETL framework made the programming work and quality testing far more efficient.
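
Bio2RDF's rdfizers are Talend workflows, but the underlying Extract-Transform-Load pattern can be sketched in a few lines of Python; the tab-delimited input file, its column layout and the URI pattern below are assumptions made for illustration only:

    import csv

    BASE = "http://bio2rdf.org/ns:"            # hypothetical URI pattern for namespaces
    DCTERMS = "http://purl.org/dc/terms/"

    with open("namespaces.tsv", newline="") as src, open("namespaces.nt", "w") as out:
        for identifier, title, homepage in csv.reader(src, delimiter="\t"):   # Extract
            subject = f"<{BASE}{identifier}>"
            out.write(f'{subject} <{DCTERMS}title> "{title}" .\n')            # Transform
            out.write(f'{subject} <{DCTERMS}source> <{homepage}> .\n')
    # Load: the resulting N-Triples file is then bulk-loaded into the triplestore.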

Publish on the Linked Data web

The inventor of HTML, Tim Berners-Lee, has also defined the rules by which the Semantic Web should be designed [9]:

1) Use URIs as names for things.
2) Use HTTP URIs so that people can look up those names.
3) When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4) Include links to other URIs, so that they can discover more things.
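
From the consumer side, rules 2 and 3 amount to simple HTTP content negotiation. A minimal sketch in Python follows; the Accept header value is standard, but whether a given server honours it depends on its configuration:

    from urllib.request import Request, urlopen

    uri = "http://bio2rdf.org/omim:602080"                      # an HTTP URI naming a thing
    request = Request(uri, headers={"Accept": "application/rdf+xml"})
    with urlopen(request) as response:                          # look up the name over HTTP
        rdf_document = response.read().decode("utf-8")          # useful information, in RDF
    print(rdf_document[:500])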

Building Bio2RDF, we have been early adopters of these rules. The DBpedia project, a version of Wikipedia available in RDF format and through one of the first major public SPARQL endpoints, is at the heart of the Linked Data cloud. It is built using the Virtuoso triplestore [10], a first-class piece of software that is free and open source.

Lesson #2: To publish Semantic Web data, choose a good triplestore and make a SPARQL endpoint publicly available on the Internet.

The Bio2RDF project has also depended on Virtuoso, and benefits from the innovations in each new version. Virtuoso not only offers a SPARQL endpoint for submitting queries based on the W3C standards; full-text search and a facet-browsing user interface are also available, so the RDF graph can be browsed, queried, searched and explored with a type-ahead completion service. All of this comes from one software product, directly out of the box.
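
For example, Virtuoso exposes its full-text index directly in SPARQL through the bif:contains extension. The query below is an illustrative sketch held as a plain string: it is Virtuoso-specific, assumes the endpoint's text index is enabled, and is not taken from Bio2RDF's documentation:

    # An illustrative Virtuoso full-text search query (bif:contains is a
    # Virtuoso-specific SPARQL extension).
    FULL_TEXT_QUERY = """
    SELECT ?subject ?literal
    WHERE {
      ?subject ?property ?literal .
      ?literal bif:contains "insulin" .
    }
    LIMIT 25
    """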

Sesame [11], 4store [12], Mulgara [13] and the other new projects emerging each year make publishing data over the web an affordable reality.

Consuming triples

Why should we start using Semantic Web data and technologies? Because building a database from public resources on the web is more efficient than the traditional way of creating a data warehouse. The Giant Global Graph (GGG) of the entire Semantic Web is the new datastore from which you can build your semantic mashup, with the tools of your choice.

To answer a high-level scientific question from data already available in RDF, you first need to build a specific triplestore that you will eventually be able to query and from which, hopefully, you will obtain the expected results. Building a specific database just to answer a specific question: this is what semantic data mashups are about.
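
One way to build such a question-specific store is to pull the relevant RDF subsets into a local graph and query them together. The sketch below uses the rdflib Python library (one possible tool among many, not prescribed by Bio2RDF) and placeholder file names standing in for dumps fetched from public sources:

    from rdflib import Graph

    graph = Graph()
    graph.parse("omim_subset.nt", format="nt")      # triples gathered from one source
    graph.parse("kegg_subset.nt", format="nt")      # triples gathered from another source

    # The combined graph is the mashup: one small, question-specific triplestore.
    results = graph.query("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
    for row in results:
        print(row.triples)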

Lesson #3: Semantic data sources available from SPARQL endpoints can be consumed in all kinds of ways to create mashups.

For example, ways of consuming RDF include: (i) SPARQL queries over REST, (ii) RDF graphs dereferenced by URI over HTTP, (iii) SOAP services returning RDF, or, better still, (iv) the new web services model proposed by the SADI framework [14]. Programming in Java, PHP, Ruby or Perl, using the RDF/XML, Turtle or JSON/RDF formats, is also possible, and the software needed gets better each year. It is a wild new world of open technologies to learn, use and benefit from.

The Bio2RDF project first offered an RDF graph that can be dereferenced by URIs of the form http://bio2rdf.org/omim:602080. An HTTP GET request returns the RDF version of a document from one of the databases we expose as RDF, in the format of your choice. Next, you can submit queries directly to one of our public SPARQL endpoints, such as http://namespace.bio2rdf.org/sparql. Programming a script or designing a workflow with software like Taverna or Talend, you can build your data mashup from the growing Semantic Web data sources in days, not weeks.
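
A minimal sketch of option (i), SPARQL over REST, using only the Python standard library; the query itself is a generic illustration, not one of Bio2RDF's documented examples:

    import json
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    ENDPOINT = "http://namespace.bio2rdf.org/sparql"
    QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

    # The SPARQL protocol allows the query to be sent as an HTTP GET parameter.
    url = ENDPOINT + "?" + urlencode({"query": QUERY})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request) as response:
        results = json.loads(response.read().decode("utf-8"))

    for binding in results["results"]["bindings"]:
        print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])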

To explore the possibilities offered by a triplestore, discover the Bio2RDF SPARQL endpoint about bioinformatics databases at http://namespace.bio2rdf.org/fct, submit SPARQL queries to its endpoint at http://namespace.bio2rdf.org/sparql and, if you are a SOAP services user, consume its web services described at http://namespace.bio2rdf.org/bio2rdf/SOAP/services.wsdl.

Discussion

Combining data from different sources is the main problem of data integration in bioinformatics. The Semantic Web community has addressed this problem for years, and the emerging Semantic Web technologies are now mature and ready to be used in production-scale systems. The Bio2RDF community thinks that the data integration problem in bioinformatics can be solved by applying existing Semantic Web practices. The bioinformatics community could benefit significantly from what is being developed now; in fact, our community has done a lot to show that the Semantic Web model has great potential for solving life science problems. By sharing our own Bio2RDF experience and these simple lessons we have learned, we hope you will give it a try in your next data integration project.

These are the instructions creating triples from the data flow.

Producing, Publishing and Consuming Linked Data

Three lessons from the Bio2RDF project

François Belleau ([email protected])

Acknowledgements

● Bio2RDF is a community project available at http://bio2rdf.org
● The community can be joined at https://groups.google.com/forum/?fromgroups#!forum/bio2rdf
● This work was done under the supervision of Dr Arnaud Droit, assistant professor and director of the Centre de Biologie Computationnelle du CRCHUQ at Laval University, where a mirror of Bio2RDF is hosted.
● Michel Dumontier, from the Dumontier Lab at Carleton University, also hosts a Bio2RDF server and currently leads the project.
● Thanks to all the members of the Bio2RDF community, and especially Marc-Alexandre Nolin and Peter Ansell, the initial developers.

Expose data as RDF using dereferenceable URIs, according to design rules #1 and #2.

Full-text search query results, with ranking based on the number of connections in the graph.

Lesson #2: To publish semantic data, use a triplestore like Virtuoso.


Discover concepts using type-ahead search.

Make a SPARQL endpoint public so queries can be submitted.

Here is a query used to discover the schema of an unknown triplestore.
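
The exact query shown on the poster is not reproduced in this transcript; a typical schema-discovery query of this kind lists the classes used in an unknown triplestore and counts their instances, for example:

    # An illustrative schema-discovery query, held as a plain string.
    SCHEMA_DISCOVERY_QUERY = """
    SELECT ?class (COUNT(?instance) AS ?count)
    WHERE { ?instance a ?class }
    GROUP BY ?class
    ORDER BY DESC(?count)
    """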

Lesson #3: Consume semantic data as you like, using HTTP GET, SOAP services or new tools designed to explore RDF data.

Using the RelFinder tool [15], it is possible to query RDF graphically and visualise the triplestore's graph.


Using the popular soapUI tool [16], you can consume Bio2RDF's SOAP services, which return triples in N-Triples format.

References

1) http://www.biopax.org/
2) http://www.biomart.org/
3) http://www.bio2rdf.org/
4) http://biolod.org/
5) http://linkedlifedata.com/
6) http://richard.cyganiak.de/2007/10/lod/
7) http://talend.com/
8) http://namespace.bio2rdf.org/sparql
9) http://www.w3.org/DesignIssues/LinkedData.html
10) http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
11) http://www.openrdf.org/
12) http://4store.org/
13) http://www.mulgara.org/
14) http://sadiframework.org
15) http://www.visualdataweb.org/relfinder.php
16) http://www.soapui.org/

This is the workflow producing triples from the GenBank HTML web page about external database references.
