
A context sensitive model for querying linked scientific data

Peter Ansell
Bachelor of Science/Bachelor of Business (Biomedical Science and IT)
Avondale College, December 2005

Bachelor of IT (Hons., IIA)
Queensland University of Technology, December 2006

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

November, 2011

Principal Supervisor: Professor Paul Roe
Associate Supervisor: A/Prof James Hogan

School of Information Technology
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Queensland, AUSTRALIA


© Copyright by Peter Ansell 2011. All Rights Reserved.

The author hereby grants permission to the Queensland University of Technology to reproduce and redistribute publicly paper and electronic copies of this thesis document in whole or in part.


Keywords: Semantic web, RDF, Distributed databases, Linked Data


Abstract

This thesis provides a query model suitable for context sensitive access to a wide range of distributed linked datasets which are available to scientists using the Internet. The model is designed based on scientific research standards which require scientists to provide replicable methods in their publications. Although there are query models available that provide limited replicability, they do not contextualise the process whereby different scientists select dataset locations based on their trust and physical location. In different contexts, scientists need to perform different data cleaning actions, independent of the overall query, and the model was designed to accommodate this function. The query model was implemented as a prototype web application and its features were verified through its use as the engine behind a major scientific data access site, Bio2RDF.org. The prototype showed that it was possible to have context sensitive behaviour for each of the three mirrors of Bio2RDF.org using a single set of configuration settings. The prototype provided executable query provenance that could be attached to scientific publications to fulfil replicability requirements. The model was designed to make it simple to independently interpret and execute the query provenance documents using context specific profiles, without modifying the original provenance documents. Experiments using the prototype as the data access tool in workflow management systems confirmed that the design of the model made it possible to replicate results in different contexts with minimal additions, and no deletions, to query provenance documents.


Contents

Acknowledgements

1 Introduction
  1.1 Scientific data
    1.1.1 Distributed data
    1.1.2 Science example
  1.2 Problems
    1.2.1 Data quality
    1.2.2 Data trust
    1.2.3 Context sensitivity and replication
  1.3 Research questions
  1.4 Thesis contributions
  1.5 Publications
  1.6 Research artifacts
  1.7 Thesis outline

2 Related work
  2.1 Overview
  2.2 Early Internet and World Wide Web
    2.2.1 Linked documents
  2.3 Dynamic web services
    2.3.1 SOAP Web Service based workflows
  2.4 Scientific data formats
  2.5 Semantic web
    2.5.1 Linked Data
    2.5.2 SPARQL
    2.5.3 Logic
  2.6 Conversion of scientific datasets to RDF
  2.7 Custom distributed scientific query applications
  2.8 Federated queries

3 Model
  3.1 Overview
  3.2 Query types
    3.2.1 Query groups
  3.3 Providers
    3.3.1 Provider groups
    3.3.2 Endpoint implementation independence
  3.4 Normalisation rules
    3.4.1 URI compatibility
  3.5 Namespaces
    3.5.1 Contextual namespace identification
  3.6 Model configurability
    3.6.1 Profiles
    3.6.2 Multi-dataset locations
  3.7 Formal model algorithm
    3.7.1 Formal model specification

4 Integration with scientific processes
  4.1 Overview
  4.2 Data exploration
    4.2.1 Annotating data
  4.3 Integrating Science and Medicine
  4.4 Case study: Isocarboxazid
    4.4.1 Use of model features
  4.5 Web Service integration
  4.6 Workflow integration
  4.7 Case Study: Workflow integration
    4.7.1 Use of model features
    4.7.2 Discussion
  4.8 Auditing semantic workflows
  4.9 Peer review
  4.10 Data-based publication and analysis

5 Prototype
  5.1 Overview
  5.2 Use on the Bio2RDF website
  5.3 Configuration schema
  5.4 Query type
    5.4.1 Template variables
    5.4.2 Static RDF statements
  5.5 Provider
  5.6 Namespace
  5.7 Normalisation rule
    5.7.1 Integration with other Linked Data
    5.7.2 Normalisation rule testing
  5.8 Profiles

6 Discussion
  6.1 Overview
  6.2 Model
    6.2.1 Context sensitivity
    6.2.2 Data quality
    6.2.3 Data trust
    6.2.4 Provenance
  6.3 Prototype
    6.3.1 Context sensitivity
    6.3.2 Data quality
    6.3.3 Data trust
    6.3.4 Provenance
  6.4 Distributed network issues
  6.5 Prototype statistics
  6.6 Comparison to other systems

7 Conclusion
  7.1 Critical reflection
    7.1.1 Model design review
    7.1.2 Prototype implementation review
    7.1.3 Configuration maintenance
    7.1.4 Trust metrics
    7.1.5 Replicable queries
    7.1.6 Prominent data quality issues
  7.2 Future research
  7.3 Future model and prototype extensions
  7.4 Scientific publication changes

A Glossary

B Common Ontologies


List of Figures

1.1 Example Scientific Method
1.2 Flow of information in living cells
1.3 Integrating scientific knowledge
1.4 Use of information by scientists
1.5 Storing scientific data
1.6 Data issues
1.7 Datasets related to flow of information in living cells
1.8 Different methods of referencing Uniprot in Entrez Gene dataset
1.9 Different methods of referencing Entrez Gene in Uniprot dataset
1.10 Different methods of referencing Entrez Gene and Uniprot in HGNC dataset
1.11 Indirect links between DailyMed, Drugbank, and KEGG
1.12 Direct links between DailyMed, Drugbank, and KEGG
1.13 Bio2RDF website

2.1 Linked Data Naive Querying
2.2 Non Symmetric Linked Data
2.3 Distributed data access versus local data silo
2.4 Heterogeneous datasets
2.5 Linked Data access methods
2.6 Using Linked Data to retrieve information

3.1 Comparison of model to federated RDF query models
3.2 Query parameters
3.3 Example: Search for Marplan in Drugbank
3.4 Search for all references to disease
3.5 Search for references to disease in a particular namespace
3.6 Search for references to disease using a new query type
3.7 Search for all references in a local curated dataset
3.8 Query groups
3.9 Providers
3.10 Provider groups
3.11 Normalisation rules
3.12 Single query template across homogeneous datasets

4.1 Medicine related RDF datasets
4.2 Links between datasets in Isocarboxazid case study
4.3 Integration of prototype with Semantic Web Pipes
4.4 Semantic tagging process
4.5 Peer review process
4.6 Scientific publication cycle

5.1 URL resolution using prototype
5.2 Simple system configuration in Turtle RDF file format
5.3 Optimising a query based on context
5.4 Public namespaces and private identifiers
5.5 Template parameters
5.6 Uses of static RDF statements
5.7 Context sensitive use of providers by prototype
5.8 Namespace overlap between configurations

6.1 Enforcing syntactic data quality
6.2 Different paging strategies

7.1 Overall solutions


List of Tables

2.1 Bio2RDF dataset sizes

3.1 Normalisation rule stages

5.1 Normalisation rule methods in prototype

6.1 Context change scenarios
6.2 Comparison of data quality abilities in different systems
6.3 Comparison of data trust capabilities of various systems
6.4 Bio2RDF prototype statistics


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature: Peter Ansell

Date:


Acknowledgements

I dedicate this thesis to my wife, Karina Ansell. She has been very supportive of my research throughout my candidature. I would also like to thank my supervisors for the effort they put into guiding me through the process.


Chapter 1

Introduction

There are many different ways of sharing information using current communications technology. A business may write a marketing letter to a customer, or they may email the information with a link to their website. A grandparent may send a birthday card to their grandchild, or they may write a note on a social networking website. A politician may make a speech on television, or they may self-publish a video on a video-sharing website. A scientist may publish an article in a paper based journal, while simultaneously publishing it on a website. In each case, the information needs to be shared with others to have an effect. In some cases, the information needs to be processed by humans, while in other cases computers must be able to process and integrate the data. In all cases, the data must be accessible by the interested parties.

In a social network a person may want to plan and share details about an upcoming event. For example, they may want to plan a menu based on the expected guests at a dinner party. If the social network made it possible for people to share their meal preferences with their friends, the host may be able to use this data to determine which dishes would be acceptable. Although the host could print out a list of meal preferences from the social networking site and compare them to each of their recipes, it would be more efficient for the host to automatically derive a list of recipes that match the guests’ criteria. For the process to work effectively the preferences on the social networking site must be matched with the ingredients on the recipe website.

Business and social networks do not require their information to be publicly available for it to be useful, especially as the privacy of information is paramount in attracting users. In the dinner party example, the guests do not need to publicly disclose preferences, and the recipe website does not require information about each guest individually. In order for the menu planning system to work, it needs the guests to include information about which ingredients and recipes they like and dislike. This information is then accessible to their social networked friends, who can use the recipe site without their friends having to be members.

Social networks and business-to-business networks rely heavily on data integration; for example, the foods on the social networking site need to be matched to the meal ingredients on the recipe site. However, in both cases there are a small number of types of data to represent. For example, social networks may only need to be able to represent people or agents, and networks or groups, while businesses may only need to exchange information about suppliers, customers and stock. These conditions may have contributed to the lack of a large set of interlinked data available about business or social networks, as the low level of complexity makes it possible for humans to examine and manually link datasets as needed. The dinner party example is not functional at this time, as there are currently no datasets available with information about recipe ingredients that have been used by social networks to create meal preference profiles. However, both social networks and businesses may be encouraged to publish more information in the future based on schemas such as FOAF [57] and GoodRelations [68], including the recently published LinkedOpenCommerce site (http://linkedopencommerce.com/).

Networks are also important for scientists as they provide a fast and effective way for them to publicly provide and exchange all of the relevant data for their publications with other scientists in order for their data and results to be peer reviewed and replicated [51]. They perform experiments using data from public and private datasets to derive new results using variations of the scientific method, possibly similar to those described by Kuhn [81], Crawford and Stucki [43], or Godfrey-Smith [54]. An example of a workflow describing one of the many scientific methods is shown in Figure 1.1.

In some sciences, such as biology and chemistry, there are a large number of different sources of curated data that contain links between datasets, as compared to other sciences where there may be a limited number of sources of data for scientists to integrate and crosscheck their experiments against. In cellular biology scientists need to integrate various types of data, including data about genetic material, genes, translated genes, proteins, regulatory networks, protein functions, and others, as shown in Figure 1.2. This data is generally distributed across a number of datasets that focus either on a particular disease or organism, such as cancer [120] or flies [150], or on a particular type of information, such as proteins [19].

These biological datasets contain data from many different scientists and it is important that it is accurately described and regularly updated. Scientists continually update these datasets using new data along with links to relevant data items in other datasets. For example, Figure 1.3 shows the changes necessary to reflect the new discovery that a gene found in humans is effectively identical to a gene in mice.

There are a wide range of ways that well linked data can be used by scientists, including the motivation and hypothesis for an experiment, during the design of the experiment, and in the publication of the results, as shown in Figure 1.4. For example, the scientist would use the fact that the gene was equivalent in humans and mice as the motivation, with the hypothesis that the gene may cause cancer in mice. The characteristics of the cancer, including the fact that it occurs in bones, would be used as part of the experiment design, and the results would contain the implication that the gene in mice causes cancer. The combined knowledge about the genes, the cancer, and the resulting implications would be used by peer reviewers to determine whether the resulting knowledge was publishable.


[Figure 1.1: Example Scientific Method. A flowchart in which a scientist reads published material and analyses previous experiments, forms a testable hypothesis, designs and performs an experiment, analyses the data, makes changes as required, forms the results into a publication, and submits it to a journal; the journal editor sends anonymous copies of the publication to peers, who review the work with reference to its validity and its agreement with previously published work; the editor then takes the peer reviews and decides whether to publish the work, and the published article feeds back into the cycle through citations and critical reviews by peers.]


[Figure 1.2: Flow of information in living cells. Original figure source: Kenzelmann, Rippe, and Mattick [2006], doi:10.1038/msb4100086]

As the amount of data grows, and more scientists contribute data about different types of concepts, it is typical for scientists to store data about concepts such as genes and diseases in different locations. Scientists are then able to access both datasets efficiently based on the types of data that they need for their research. The combined data is discoverable by scientists accessing multiple datasets, using the links contained in each dataset for correlation, as shown in Figure 1.5. In that example, the data needs to be integrated for future researchers to study the similar effects of the gene in both humans and mice. Scientists can use the link specifying that the genes are identical to imply that the gene may cause cancer in mice, before testing this hypothesis with an experiment.

Although most scientific publications only demonstrate the final set of steps that were used to determine the novel result, the process of determining goals and eliminating strategies which are not useful is important. The resulting publications would contain links to the places where the knowledge could be found. The information in the publication would make it possible for other scientists to reproduce the experiment, and provide evidence for the knowledge that the gene causes cancer in both humans and mice.

[Figure 1.3: Integrating scientific knowledge. A diagram contrasting the initial information, in which the human gene ABCC1 is known to cause bone cancer and the mouse gene GHFC is unconnected to it, with the integrated information, in which a newly added “identical genes” link between ABCC1 and GHFC is used to imply that GHFC may also cause bone cancer.]

Scientific peers may want to reproduce or extend an experiment using their own resources. These experiments are very similar to the structure of the peer’s experiments, but the data access and processing resources need to be changed to fit the scientist’s context. For example, they may have personally curated a copy of a dataset, and wish to substitute it with the publicly available copy that was originally used. This dataset may not have the same structure, and it may not be accessible using the same method as the public dataset. In the example, a peer may have a dataset that contains empirical data related to the expression of a gene in various mice and wish to use this as part of the experiment. If the curated dataset contains enough information to satisfy the goals of the experiment, the peer should be able to easily substitute the datasets to replicate the experiment.

1.1 Scientific data

There are a large number of publicly available scientific datasets that are useful in various disciplines. In disciplines such as physics, datasets are mostly made up of direct numeric observations, and there is little relationship between the raw data from different experiments. In other sciences, particularly those based around biology, most data describes non-numeric information such as gene sequences or relationships between proteins in a cell, and there is a clear relationship between the concepts in different datasets. In the case of biology, there is a clear incentive to integrate different types of data, compared to physics, where it is possible to perform isolated experiments and process the raw data without direct correlations to shared concepts such as curated gene networks.


[Figure 1.4: Use of information by scientists. A table relating the stages of the scientific method (exploration, hypothesis, experiment, analysis of data, conclusions, article writing, peer review, and publication) to example methods and sources at each stage, including databases such as PubMed, NCBI Gene, NCBI Taxon, and Diseasome; formats and tools such as NCBI XML, TSV, SOAP, RDF, BioMoby, Bio2RDF, Taverna, workflows, LaTeX, BibTeX, Word, and Endnote; and publication venues such as PLoS One, myExperiment, CPAN, CRAN, and SourceForge.]


[Figure 1.5: Storing scientific data. A diagram showing the gene and cancer information from Figure 1.3 stored across separate Human Genes, Mouse Genes, and Cancers datasets. Without links the information in each dataset remains isolated, while with gene links between the datasets the implied knowledge that the mouse gene may also cause bone cancer can be derived.]

In biology, an experiment relating to the functioning of a cell in one organism may share a number of conceptual similarities with another experiment examining the effects of a drug on a different organism. The similarities between the experiments are identifiable, and are commonly shared between scientists by including links in published datasets. For example, a current theory about how genetic information influences the behaviour of cells in living organisms is shown in Figure 1.2, illustrating a small number of data types that biologists are required to integrate when processing their data.

In situations where data is heavily tied to particular experiments there may not be a case for adding links to other datasets. In other cases the experimental results may only be relevant when they are interpreted in terms of shared concepts and links between datasets that can be identified separately from the experiment. If the links and concepts cannot easily be recognised, then the experimental results may not be interpreted fully. A scientist needs to be able to access data using links and recognise the meaning of the link.

Given the magnitude and complexity of the scientific data available, which includes a range of small and large datasets shown in Figure 4.1 and a range of concepts shown in Figure 1.2, scientists cannot possibly maintain it all in a single location. In practice, scientists distribute their data across different datasets, in most cases based on the type of information that the data represents. For example, some datasets contain information about genes, while others contain information about diseases such as cancer. Scientists typically use the World Wide Web to access these datasets, although there are other methods, including Grids [85] and custom data access systems such as LSIDs [41], that can be used to access distributed data.

1.1.1 Distributed data

Science is recognised to be fundamentally changing as a result of the electronic publication of datasets that supplement paper and electronic journal articles [71]. It is simple to publish data electronically, as the World Wide Web is essentially egalitarian, with very low barriers to entry. In comparison, academic journals, which are based on peer review and authoritarian principles, have higher barriers to entry, although they should not be designed to distinguish the source of an article based on its social standing. Although it is simple to publish datasets, it is comparatively hard to get the data recognised and linked by other datasets.

In practice, there are a limited number of scientists curating these distributed datasets, as most scientists are users rather than curators or submitters of data. In addition, the scientists using these distributed datasets may be assisted by research assistants, so the ideal “scientist” performing experiments or analysing results using different distributed datasets will likely be a group of collaborating researchers and assistants.

There are both technical and infrastructure problems that may make it hard for a new dataset to be linked to from existing scientific datasets. These issues range from outdated information to the size of the dataset and the way the data is made available for use by scientists.

Data may be outdated or not linked if it is difficult for the maintainer of a dataset to obtain or correlate the data from another dataset. If a dataset maintainer is not sure that a piece of data from another dataset is directly relevant, they are less likely to link to the data. As distributed data requires that datasets give labels to pieces of information, it is important that different datasets use compatible methods for identifying each item. Although most datasets allow other datasets to freely reuse their identifiers without copyright restrictions, there are some closed, commercial datasets that put restrictions on reuse of their identifiers. This limits the usefulness of these identifiers, as scientists cannot openly critique the closed datasets because the act of referring to the closed dataset identifier in an open dataset could be illegal (see http://lists.w3.org/Archives/Public/public-lod/2011Aug/0117.html). By comparison, these datasets freely link to other datasets, so if they were open, they would be valuable sources of data for the scientific community.

If there are no accurate, well-linked sources of data, a scientist may need to manually curate the available data and host it locally. In some cases scientists may have no issues trusting a dataset to be updated regularly and contain accurate links to other datasets, but there may be a contextual reason why they prefer to use data from a different location, including a third party middleware service or their own computing resources. The contextual reasons may include a more regularly operational data provider or a locally available copy of the data. Knowledge of these and other operational constraints is necessary for other scientists to verify and replicate the research in future.


1.1.2 Science example

The science of biology offers a large number of public linked datasets, in comparison to chemistry, where there are a large number of datasets but the majority are privately maintained and commercially licensed. Some specific issues surrounding data access in biology are shown in Figure 1.6, along with the general issues affecting scientific distributed data. Medical patient datasets, for example, may be encoded using an HL7 file format standard (http://www.hl7.org/implement/standards/index.cfm?ref=nav). These datasets, however, are generally private, and doctors may benefit from simple access to both biology and public medicine datasets for internal integration without widespread publication of the resulting documents. For example, datasets describing drugs, side effects and diseases are directly relevant, and would be very useful if they could be linked to and integrated easily [37].

[Figure 1.6: Data issues. A diagram summarising the main data access problems in different domains. Data in general: structured data is produced and used by different entities; data is physically distributed; data may be legally protected; data can be labelled; and data labels can be used by others. Business: trusted interfaces for business-to-business interaction; a limited number of operations (quote, purchase, offer for sale); an existing data labelling standard (EAN UCC-13). Social networks: privacy of shared information; information that is hard to legally integrate; the identity of individuals split across networks. Medicine: privacy of all patient information; a large number of unique patients that are not generalisable; different standards for labelling conditions; clear links to science data are desirable. Science: no standard for labelling and publishing data; a large number of data types (gene, protein, DNA, taxonomy, chemical elements, chemical compounds); datasets contain links to other datasets; a large number of public datasets. Biology: a large amount of data; genes, proteins, and regulatory networks that are not easy to generalise for all members of a species, making labels difficult to share; a small number of large datasets and a large number of small datasets; datasets that link and publish data in different ways.]


Figure 1.2 shows the major concepts currently recognised in the field of cellular biology, along with their relationships. The combined cycle describes how scientists link information about different parts of the genomic cycle, using references to other datasets containing linked concepts that are thought to be causally relevant. For example, as shown in Figure 1.7, the Entrez Gene dataset contains genomic information, and this genomic information can be translated to form proteins, which have relevant information in the Uniprot dataset, among others. The Uniprot dataset also attempts to compile relevant links to Entrez Gene, among other datasets, to allow scientists to identify the genes that are thought to be used to create particular proteins.

[Figure 1.7: Datasets related to flow of information in living cells. The concepts from Figure 1.2 annotated with the datasets that describe them, including Entrez Gene, Uniprot, Pfam, Reactome, CPath, Drugbank, and Sider. Original figure source: Kenzelmann, Rippe, and Mattick [2006], doi:10.1038/msb4100086]

These datasets are vital for scientists who require information from multiple sections of the biological cycle to complete their experiments. This thesis examines an example involving a scientist who is required to determine and accommodate for genetic causes of side effects from drugs. The published information necessary to complete these types of experiments is available in linked scientific datasets including DrugBank [144], HGNC (a dataset assigning a single textual symbol to each human gene) [125], NCBI Entrez Gene (a dataset with information about genes) [93], Uniprot (a dataset with information about proteins) [139], and Sider (a dataset with side effect information for drugs) [80].

At a conceptual level, this experiment should be relatively easy to perform using current methods such as web browsing or workflows. However, there are various reasons why it is difficult for scientists to perform the example experiment and publish the results, including:

• References different in each format and dataset

• Many file formats, including Genbank XML, Tab-separated-values and FASTA

• Lack of interlinking: one-way link from Sider to Drugbank

• Experimental replication methods vary according to the sources used

• Custom methods are generally hard to replicate in different contexts

Uniprot reference information available in Entrez Gene:

In HTML (source URL: http://www.ncbi.nlm.nih.gov/gene/4129):
  mRNA and Protein(s): UniProtKB/Swiss-Prot <a href="http://www.uniprot.org/entry/P27338">P27338</a>

In ASN.1 (source URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=asn.1&id=4129):
  type other, heading "UniProtKB", source, comment type other, source src db "UniProtKB/Swiss-Prot", tag str "P27338", anchor "P27338"

In XML (source URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=4129):
  <Dbtag><Dbtag_db>UniProtKB/Swiss-Prot</Dbtag_db><Dbtag_tag><Object-id><Object-id_str>P27338</Object-id_str></Object-id></Dbtag_tag></Dbtag>

Figure 1.8: Different methods of referencing Uniprot in Entrez Gene dataset

There are different methods of referencing between the Entrez Gene dataset and the Uniprot dataset. There are multiple file formats available for each item in the Entrez Gene dataset, with each format using a different method to reference the same Uniprot item, as shown in Figure 1.8. In a similar way, there are multiple file formats available for the equivalent Uniprot protein; however, each format also uses a different method to reference the same Entrez Gene item, as shown in Figure 1.9. Both of these datasets are curated and used by virtually every biologist studying the relationship between genes and proteins. The HGNC gene symbol that is equivalent to these items is represented using different file formats, each of which uses a different method of referencing external data links. In the case of Entrez Gene, HGNC also contains two separate properties, with very similar semantic meanings, that link to the same targets in Entrez Gene, as shown in Figure 1.10.

Although some datasets are not linked in either one or both directions, there may be ways to make the data useful. For instance, the DailyMed website contains links to DrugBank, but DrugBank does not link back to DailyMed directly. DrugBank does, however, link to KEGG [76], which links to DailyMed, as shown in Figure 1.11. There are links to PubChem [31] from DrugBank and KEGG; however, they are not identical, as KEGG links to a PubChem substance, while DrugBank links to both PubChem compounds and PubChem substances. In this example, described more fully in Section 4.4, a scientist or doctor may want to examine the effects of a drug in terms of its side effects and any genetic causes; however, this information is not simple to obtain given the links and data access methods that are available for distributed linked datasets.

In addition to the directly available data from these sites, there are restructured and annotated versions available from tertiary data providers including Bio2RDF [24], Chem2Bio2RDF [39], Neurocommons [117], and the LODD group [121].

Entrez Gene information available in Uniprot:

In RDF/XML (source URL: http://www.uniprot.org/uniprot/P27338.rdf):
  <rdfs:seeAlso rdf:resource="http://purl.uniprot.org/geneid/4129" />

In XML (source URL: http://www.uniprot.org/uniprot/P27338.xml):
  <dbReference type="GeneID" id="4129" />

In HTML (source URL: http://www.uniprot.org/uniprot/P27338.html):
  Genome annotation databases: GeneID: <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&term=4129">4129</a>

In text format (source URL: http://www.uniprot.org/uniprot/P27338.txt):
  DR GeneID; 4129; -.

In GFF (source URL: http://www.uniprot.org/uniprot/P27338.gff): no references available in this format.

In FASTA (source URL: http://www.uniprot.org/uniprot/P27338.fasta): no references available in this format.

Figure 1.9: Different methods of referencing Entrez Gene in Uniprot dataset
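To make the format differences concrete, the short Python sketch below (an illustration written for this discussion, not code from the thesis) extracts the Entrez Gene cross-reference from the UniProt XML record shown in Figure 1.9. It assumes the record is still served at the 2011-era URL and with the same schema, both of which may have changed.

    # Illustrative sketch: extract the Entrez Gene cross-reference from the
    # UniProt XML record shown in Figure 1.9. Assumes the 2011-era URL and
    # schema are still valid.
    import urllib.request
    import xml.etree.ElementTree as ET

    UNIPROT_NS = "{http://uniprot.org/uniprot}"

    def entrez_gene_ids(accession):
        """Return the Entrez Gene IDs cross-referenced by a UniProt record."""
        url = "https://www.uniprot.org/uniprot/%s.xml" % accession
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        # UniProt represents external links as <dbReference type="..." id="..."/>,
        # so the dataset name and identifier are attributes, not element text.
        return [ref.get("id")
                for ref in tree.iter(UNIPROT_NS + "dbReference")
                if ref.get("type") == "GeneID"]

    print(entrez_gene_ids("P27338"))  # expected: ['4129']

A parser written against the HTML or plain text formats would need entirely different logic to recover the same identifier, which is the point these figures illustrate.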

Uniprot and Entrez Gene reference information available in HGNC:

In HTML (source URL: http://www.genenames.org/data/hgnc_data.php?hgnc_id=6834):
  Entrez Gene ID 4129 <a href="http://view.ncbi.nlm.nih.gov/gene/4129">Gene</a>
  UniProt ID (mapped data supplied by UniProt) P27338 <a href="http://www.uniprot.org/uniprot/P27338">UniProt</a>

In Tab Separated Values format (source URL: http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?title=HGNC+output+data&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_pub_eg_id&col=md_eg_id&col=md_prot_id&status=Approved&status=Entry+Withdrawn&status_opt=2&level=pri&=on&where=gd_hgnc_id+%3D+%276834%27&order_by=gd_hgnc_id&limit=&format=text&submit=submit&.cgifields=&.cgifields=level&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag):
  HGNC ID | Approved Symbol | Approved Name | Status | Entrez Gene ID | Entrez Gene ID (mapped data supplied by NCBI) | UniProt ID (mapped data supplied by UniProt)
  6834 | MAOB | monoamine oxidase B | Approved | 4129 | 4129 | P27338

Figure 1.10: Different methods of referencing Entrez Gene and Uniprot in HGNC dataset


[Figure 1.11: Indirect links between DailyMed, Drugbank, and KEGG. A diagram tracing “contains link to”, “redirects to”, and “same item” relationships among records on the official sites, including the DailyMed record identified by both the setid AC387AA0-3F04-4865-A913-DB6ED6F4FDC5 and the id 6788, the DrugBank record DB01247 (also reachable through a search page for the secondary accession number APRD00701), the KEGG Drug record D02580, the Sider record for drug 3759, and the PubChem records cid=3759, sid=17396751, and sid=148916.]

These versions of the datasets contain more links to other datasets compared to the original data; however, the annotated information may not be as trusted as the original information that is directly accessible using the publisher’s website. If scientists rely on extra annotations provided by these tertiary sources, the experiments may not be simple to replicate using the original information, making it difficult for peer reviewers to verify the conclusions unless they use the annotated datasets. Although tertiary sources may not be as useful as primary sources, they provide a simpler understanding of the links between different items, as shown by replicating the example from Figure 1.11 using the annotated datasets in Figure 1.12.

Although it is simpler to visualise the information using the annotated versions, there are still difficulties. For example, the DailyMed annotated dataset provided by the LODD group does not utilise the same identifiers for items as the official version, although the official version uses two different methods itself. The DailyMed item identified by both “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” and “6788” in the official version is identified as “2892” in the LODD version, and there are no references to indicate that the LODD published data is equivalent to the DailyMed published data.
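What is missing is an explicit, machine-readable equivalence map between the two identifier schemes. The sketch below is hypothetical: the mappings are the ones implied by the example above, and they would have to be curated by hand, since neither DailyMed nor LODD publishes such correspondences.

    # Hypothetical equivalence map for the DailyMed item discussed above; no
    # such mapping is published, so each correspondence must be curated by hand.
    OFFICIAL_SETID_TO_ID = {
        "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5": "6788",  # two official forms, one item
    }
    LODD_TO_OFFICIAL_ID = {
        "2892": "6788",  # asserted equivalence between the LODD and official records
    }

    def official_id(identifier):
        """Resolve any known identifier form to the official DailyMed id."""
        return (OFFICIAL_SETID_TO_ID.get(identifier)
                or LODD_TO_OFFICIAL_ID.get(identifier)
                or identifier)  # otherwise assume it is already an official id

    print(official_id("2892"))  # -> '6788'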

[Figure 1.12: Direct links between DailyMed, Drugbank, and KEGG. A diagram showing the same records as Figure 1.11 linked directly through normalised identifiers: http://www4.wiwiss.fu-berlin.de/dailymed/page/drugs/2892, http://bio2rdf.org/dr:D02580, http://bio2rdf.org/drugbank_drugs:DB01247, http://www4.wiwiss.fu-berlin.de/sider/resource/drugs/3759, http://bio2rdf.org/pubchem:3759, http://bio2rdf.org/pubchem:17396751, and http://bio2rdf.org/pubchem:148916.]

A tertiary provider may accidentally refer to two different parts of a dataset as if they were one set. For example, the Bio2RDF datasets do not currently distinguish between the PubChem compounds dataset and the PubChem substances dataset. This error results in the use of identifiers in the examples above, such as http://bio2rdf.org/pubchem:3759, which may refer either to a compound, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3759, or to an unrelated substance, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=3759. In these cases, the two identifiers would be very difficult to disambiguate unless a statement clearly describes either a compound or a substance.
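The ambiguity can be made concrete with a small sketch. This is illustrative only; the pubchem_compound and pubchem_substance prefixes are hypothetical namespaces invented for the example, not identifiers that Bio2RDF defined at the time.

    # Illustrative sketch: a single shared "pubchem" namespace cannot be mapped
    # back to a unique source URL, while separate (hypothetical) namespaces can.
    def source_urls(bio2rdf_uri):
        """Map a Bio2RDF-style URI back to its candidate source URLs."""
        namespace, identifier = bio2rdf_uri.rsplit("/", 1)[1].split(":", 1)
        cid = "http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=" + identifier
        sid = "http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=" + identifier
        if namespace == "pubchem":
            return [cid, sid]  # ambiguous: a compound or an unrelated substance
        if namespace == "pubchem_compound":
            return [cid]
        if namespace == "pubchem_substance":
            return [sid]
        raise ValueError("unknown namespace: " + namespace)

    print(source_urls("http://bio2rdf.org/pubchem:3759"))           # two candidates
    print(source_urls("http://bio2rdf.org/pubchem_compound:3759"))  # unambiguous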

1.2 Problems

A number of concepts need to be defined to understand the motivating factors for this research. These concepts will be used to identify and solve problems that were highlighted using examples in Section 1.1.2.

Data cleaning: The process of verifying and modifying data so that it is useful and coherent, including useful references to related data. For example, this may include manipulating the Entrez and Uniprot datasets so that the links can be directly resolved to get access to information about the referenced data.

Data quality: A judgment about whether data is useful and coherent. For example, this may be the opinion of a scientist about whether the datasets were useful, along with which manipulations would be necessary to improve them.

Trust: A judgment by a scientist that they are comfortable using data from a particular source in the context of their current experiment. In the example, scientists need to be confident that the sources of data they are using will be both non-trivial and scientifically acceptable to their peers.

Provenance: The details of the retrieval method and locations of data items that were used in an experiment, as distinct from the provenance information related to the original author and curator of the data item [129]. The distributed nature of scientific data was illustrated by compartmentalising the logical data and links from Figure 1.3 to match the way it is physically stored, as shown in Figure 1.5. In terms of this example, the provenance of the asserted relationship between the mouse gene and the human cancer would include references to data located in each of the other datasets.

Context: The factors that are relevant to the way a scientist performs or replicates an experiment. These factors influence the way different scientists approach the same problems. For example, a scientist may publish a method using datasets distributed across many locations, while a peer may reproduce or extend the document solely using resources at their institution.

These factors are relevant to the way scientists work with data, as shown in Section 1.1. Scientists are not currently able to clean and query datasets to fix errors and produce accurate results using methods that are simple to replicate. The lack of replicability may result in lower data quality for published datasets due to the lack of reuse and verification. Although scientists can access and trust parts of a small number of large public datasets, they are not able to easily trust the large number of small datasets, or even all of a small selection of large datasets. Trust requires knowledge of the state of the dataset. In addition, scientists need to understand the usefulness of the data in reference to other related datasets. It is difficult to understand the usefulness of an approach while there are barriers to replicating existing research that uses similar linked datasets.

Some of the problems discussed here are more relevant to the use of workflows than to manually performing an experiment by hand. For example, it is more important that data is trusted for large scale workflows, as the intermediate steps are required to be correct for the final, published results to be useful. In comparison, workflows make it easy to generate query provenance, as the precedents for each value can be explicitly linked through the structure of the workflow. In manual processes, the query provenance needs to be compiled by the scientist alongside their results. Contextual factors can adversely affect both workflow and manual processes: workflows can allow for context variables as workflow inputs, and when they do not, replication becomes difficult. Similarly, manual processes are designed to be intimately replicated, making it possible to modify context variables at the desired steps.

1.2.1 Data quality

Data should be accessible to different scientists in a consistent manner so that it can be easily reused. The usefulness of data can be compromised by errors or inconsistencies at different levels. Errors may occur when the data does not conform to its declared schema; when dataset links are not consistent or normalised; when properties are not standardised; or when the data is factually incorrect. Some of these errors can be corrected when a query is performed, while others require correction before queries can be executed.

Data may be compromised by one or more fields that do not match the relevant schema definition, possibly due to changes in the use of the dataset after its initial creation [69]. In some of these cases, automated methods may be available to identify and remove this information, including community accepted vocabularies that define the expected syntactic and semantic types of linked resources. In other cases, such as relational databases, the structure of the database may need to be modified in order to solve the issue.

The decentralised nature of the set of linked scientific datasets makes dynamic verification of multiple datasets hard. Although a syntactic validator may be able to verify the completeness of a record, a semantic validator may rely on either a total record, or the total dataset, in order to decide whether the data is consistent. In many cases the results of a simple query may not return a complete record, as a scientist may only be interested in a subset of the fields available.

The usefulness of linked data lies in the use of accurate, easily recognisable links between datasets. Datasets can be published with direct links to other records, or they may be published using identifiers that can be used to name a record without directly linking to it. In many cases, the direct links contain an identifier that can be used to name a record without linking to it, as shown in a variety of examples in Section 1.1.2. In some cases there are two equivalent identifiers for a record. Figure 1.11 shows two ways of referencing the relevant DailyMed record using both “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” and “6788” as identifiers.

It may be necessary for scientists to recognise identifiers inside links, independent of the range of links that could be constructed, as the same identifier may be present in two otherwise different links. If two different identifiers can be used equivalently for a record, then normalising the identifier to a single form will not adversely affect the semantic meaning of the record. However, there are cases where the two identifiers denote two different records with data representing the same thing. In this case, the two identifiers are synonyms rather than equivalents, meaning that they need to be distinguished so that scientists retain the ability to describe both items [112]. This is particularly important when integrating data from two heterogeneous sources, where it may be important in future for the identifiers to remain separate so that their provenance can be determined.
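As an illustration, the following minimal sketch normalises two equivalent identifier forms, such as those in Figure 1.11, to a single preferred URI. The URI patterns and class names are hypothetical, chosen for the example rather than taken from any actual DailyMed URL scheme or from the prototype's normalisation rules.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierNormaliser {
    // Hypothetical pattern matching a link that embeds a numeric identifier.
    private static final Pattern NUMERIC_FORM =
            Pattern.compile("http://example\\.org/dailymed/drugs/(\\d+)");

    // Rewrite a matching link to a single preferred URI form; leave other
    // links unchanged so that non-equivalent identifiers are preserved.
    public static String normalise(String uri) {
        Matcher m = NUMERIC_FORM.matcher(uri);
        if (m.matches()) {
            return "http://example.org/dailymed:" + m.group(1);
        }
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(normalise("http://example.org/dailymed/drugs/6788"));
        // prints: http://example.org/dailymed:6788
    }
}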

The issue of variable data quality is more prominent when data is aggregated from different sources, as incorrect or inconsistent information is either very visible or causes large gaps in the results. Both of these outcomes make it difficult or impossible to trust the results. This is particularly relevant to the increasingly important discipline of data mining, where large databases are analysed to help scientists form opinions about apparent trends [112]. It is necessary in science predominantly because of the tendency of scientific datasets to not always remove entries for incorrect items, and to import references at one point in time without regularly verifying in future that the references are still semantically valid [69].

The use of multiple linked datasets may highlight existing data quality issues; however, links between distributed datasets may still be useful when the data is factually incorrect, as the link may be important in determining the underlying issue. In comparison to textual search engines, which rely on ever more sophisticated algorithms to guess the meaning of documents, scientists generate purpose built datasets and curate the links to other datasets to improve the quality of their data.

1.2.2 Data trust

Data trust is a complex concept, with one paper, Gil and Artz [52], identifying 19 factors that may affect the trust that can be put into a particular set of data. In the context of this research, data trust is evaluated at the level of datasets and the usefulness of different queries on each dataset. At a high level, different scientists determine whether they trust a dataset, and peers use this information to determine whether or not the dataset is trusted enough to be used as a source for queries. Although this information is implicitly visible in publications, a computer understandable annotation would allow scientists to efficiently perform queries across only their trusted datasets. If queries do not return the expected results, then the dataset that was responsible for the incorrect data could be explicitly annotated as untrusted for future queries, while past queries would remain replicable, though not necessarily trusted.
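A minimal sketch of such a computer understandable annotation is shown below, assuming a simple per-context profile that records untrusted datasets and filters the endpoints a query is sent to. The dataset labels and endpoint URLs are hypothetical, and the sketch is not the profile format defined by this thesis.

import java.util.*;

public class TrustProfile {
    private final Set<String> untrustedDatasets = new HashSet<>();

    // Annotate a dataset as untrusted in this context only.
    public void markUntrusted(String dataset) {
        untrustedDatasets.add(dataset);
    }

    // Select only the endpoints whose dataset is still trusted.
    public List<String> selectEndpoints(Map<String, String> datasetEndpoints) {
        List<String> trusted = new ArrayList<>();
        for (Map.Entry<String, String> e : datasetEndpoints.entrySet()) {
            if (!untrustedDatasets.contains(e.getKey())) {
                trusted.add(e.getValue());
            }
        }
        return trusted;
    }

    public static void main(String[] args) {
        TrustProfile profile = new TrustProfile();
        profile.markUntrusted("datasetB"); // returned incorrect data previously
        Map<String, String> endpoints = new LinkedHashMap<>();
        endpoints.put("datasetA", "http://example.org/sparql/a");
        endpoints.put("datasetB", "http://example.org/sparql/b");
        System.out.println(profile.selectEndpoints(endpoints));
        // prints: [http://example.org/sparql/a]
    }
}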

It is difficult to define a global trust level, due to evidence that knowledge is not only segmented, but its meaning varies according to context [64]. In business process modelling, where it is necessary to explicitly represent methods and knowledge related to the business activities, one paper, Bechky [21], concluded that within a business it is hard to represent knowledge accurately for a variety of reasons, including job specificity and training differences. Some of these may map to science due to its similarities with expert knowledge holders and segregated disciplines. Another paper, Rao and Osei-Bryson [113], focused on analysis of the data quality in internal business knowledge management systems, with the conclusion that there are a number of analytical dimensions that are necessary to formally trust any knowledge, with some overlap between these dimensions and the factors identified in Gil and Artz [52].

Trust in a set of data is related to data quality, as any trust metric requires that datasets contain meaningful statements. If a data trust metric is to extend past overall datasets to particular data items, there is a need to precisely designate a level of trust to data items based on general categories of trustworthiness. The continuous evolution of knowledge and data, including the different levels of trust given to scientific theories throughout their lifespan compared to raw data, makes it hard to define the exact meaning of all data items in a dataset in the context of a relatively static shared meaning for an item. The lack of exact meaning at this level, due to social and evolutionary issues, means that there is no automated general mechanism for detecting and resolving differences in opinion about meaning.

The focus of many current distributed scientific data access models is on the use of scientific concept ontologies as the basis for distributing queries [4, 15, 17, 25, 30, 53, 74, 75, 82, 100, 103, 107, 108, 117, 118, 133, 143]. These systems do not contain methods for describing the trust that scientists have in the dataset, including different levels of trust based on the context of each query, as ontology researchers assume that datasets will be semantically useful enough to be integrated transparently in all relevant experiments. The datasets that are available to these systems need to remain independent from a query in order for the system to be able to automatically decide how to distribute the query based on high level scientific concepts and properties. These systems assume that both the syntactic and semantic data quality will be high enough to support realistic and complex queries across multiple linked datasets without having scientists examine or filter the results at each stage of the query process.

If a dataset is well maintained and highly trusted within a domain, it will be more consistent and accurate, and the way the data is represented should match the common theories about what meaning the data implies. If the level of trust in a dataset is not high across a community, the various degrees of trust that cannot be described in current ontologies, such as uncertainty, will lead to differences in implementation. The use of different implementations will in turn lead to differences in the trust in the knowledge that is described using the resulting data. Data trust in science is gained within a particular historical context. A scientific theory may currently be accepted as part of the Real layer, in Critical Realist [96] terms, due to a large degree of evidence pointing to realistic causal mechanisms, but if the context changes in the future, the observations may be interpreted differently. Such a change in recognition does not mean that the world actually changed to suit the new theory; it does mean that a community of scientists will gradually accept the new theory and regard knowledge based on the old theory as possibly suspect.

Strassner et al. [137] attempted to utilise ontologies as the basis for managing the knowledge about a computer network, and to use this knowledge to automate the process of network management in different contexts. They found that a lack of sufficiently diverse datasets to experiment on, and the lack of congruence between different ontologies, made it difficult to utilise an ontology based approach as part of a network management system. In comparison, there are a large number of diverse scientific linked datasets, and there are issues relating to the use of incongruent ontologies, as the datasets have been modelled by different organisations, each of which has a different view on the meaning of the data.

As the number of scientific datasets grows, each scientist's knowledge about the suitability of any particular set of datasets will likely decrease, as no scientist can know exactly what each database contains. Not every piece of information is trustworthy, and even trustworthy datasets can contain inaccuracies, including out of date links to other datasets, or errors due to insufficient scientific evidence for some claims [35]. Scientists have to be free to reject sources of data that they feel are not accurate enough to give value to the investigations in their current context. However, there is no current method which allows them to do this using a distributed linked dataset query model. Current theoretical models include many different factors [52], but no method for scientists to integrate their knowledge of trust with their scientific experiments across linked datasets. Scientists can record and evaluate the data and process provenance of their workflows; however, this does not recognise the concept of trust in a dataset provided at a particular location [34, 95, 148].

Although some trust systems can be improved using community based ratings, these systems are still prone to trust issues based on the nature of the community. Scientists require the ability to act autonomously to discover new results as necessary, without being forced to confine themselves to the average of the current community opinion about a dataset or data item. They require a system that provides access to multiple datasets without requiring that all scientists utilise it completely for it to be useful, as this may prevent them from extending it locally as they require. In order to remain simple to manage, a data access model designed with this in mind cannot hope to systematically assign trust values to every item in every dataset that will match all scientists' opinions. At one extreme such a set of trust values would be globally populated automatically by an algorithm, while at the other it would be sparsely populated by a range of scientists. Neither of these outcomes enables scientists to contextually trust different datasets more accurately than a manual review of relevant publications.

An autonomous trust system does not require or imply that the data is only meaningful to some scientists. It recognises that the factual accuracy, along with the completeness of the records, contributes to each scientist's trust in the dataset and its links to other datasets. This does not imply that private investigations are more valid or useful than published results or shared community based opinions. It does imply that, as part of the exploratory scientific method, trusted datasets must be developed and refined in local scenarios before being published. In this way it matches the traditional scientific publication methodology, where results are locally generated and peer reviewed before being published. In order to be useful, a linked scientific data access model may need to provide for both pre- and post-publication data, and trust based selection of published data, to allow transitions when necessary.

Trust in the content of a document is distinct from trusting that the document is an unchanged representation of the original document. Typically, trust in the computer sciences, particularly in networking and security, is defined as the ability to distinguish between unchanged and changed representations of information. Although it may be useful to include this aspect in a trusted scientific data access model, it is not necessary, as scientists decide on their level of trust based on the overall content available from a particular data provider. In addition, a scientist may have different levels of trust for the content derived from a particular provider depending on their research context, requiring that, in two different contexts, the same representation be assigned two different levels of trust.


1.2.3 Context sensitivity and replication

Modern scientists need to be able to formulate and execute computer understandable queries to analyse the data that is available to them within their particular context. In terms of this research, context is defined as the factors that affect the way the query is executed, and not necessarily different contextual results. In previous models, such as SemRef [133], context sensitivity is defined as the process of reinterpreting a complex query in terms of a number of schema mappings. Scientists may wish to perform a completely different query in a different context, and the limitation to reinterpreting a complex structured query makes it impossible to provide alternative, structurally incompatible queries to match a given scientific question. It is instead necessary for a context sensitive query system to provide simple mappings between the scientist's query, given as a minimal set of parameters rather than a structured query, and any number of ways that the particular query is implemented on different datasets. Although this does not provide a semantically rich link between the scientist's query and the actual queries that are executed on the data providers, it provides flexibility that is necessary in a system which is designed to be used in different contexts to generate equivalent results.
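The following sketch illustrates this kind of mapping: the scientist supplies a query type and a minimal set of parameters, and each dataset registers its own template implementing that query type. All query types, dataset names, and templates are hypothetical, and the sketch is not the mapping syntax used by the model described later.

import java.util.*;

public class QueryTypeMapper {
    // query type -> (dataset -> template with ${param} placeholders)
    private final Map<String, Map<String, String>> templates = new HashMap<>();

    public void register(String queryType, String dataset, String template) {
        templates.computeIfAbsent(queryType, k -> new HashMap<>())
                 .put(dataset, template);
    }

    // Substitute the scientist's parameters into each dataset's template.
    public Map<String, String> resolve(String queryType, Map<String, String> params) {
        Map<String, String> queries = new HashMap<>();
        for (Map.Entry<String, String> e :
                templates.getOrDefault(queryType, Map.of()).entrySet()) {
            String query = e.getValue();
            for (Map.Entry<String, String> p : params.entrySet()) {
                query = query.replace("${" + p.getKey() + "}", p.getValue());
            }
            queries.put(e.getKey(), query);
        }
        return queries;
    }

    public static void main(String[] args) {
        QueryTypeMapper mapper = new QueryTypeMapper();
        mapper.register("geneById", "datasetA",
                "SELECT * WHERE { <http://a.example.org/gene/${id}> ?p ?o }");
        mapper.register("geneById", "datasetB",
                "SELECT * WHERE { ?s <http://b.example.org/hasGeneId> \"${id}\" }");
        System.out.println(mapper.resolve("geneById", Map.of("id", "4129")));
    }
}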

Context sensitivity in this research relates to both the scientist's requirements and the locations and representations of data. If scientists have the resources to locally host all of the datasets, then they are in a position to think about optimising the performance using a method such as that proposed by Souza [133]. However, in these cases, the scientist still has to determine their level of trust based on the intermediate results. This may not be possible if the entire question is answered using joins across datasets in a single database, without the scientist being able to verify the component steps. They could perform these checks if they use the datasets solely as a data access point, with higher level filters and joins on the results according to their context.

Although there are systems that may assist scientists in constructing queries across different datasets, the scientist needs to verify whether the results of each interdataset query match their expectations in order to develop trust in their results. The data quality, along with their prior level of trust in the datasets and the computer understandable meaning attached to the data, are important to whether the scientist accepts the query and results or chooses to formulate the query differently.

Some systems attempt to gather all possible sources for all queries into a single location; SRS (Sequence Retrieval System) [46] is one example. The aggregation of similar datasets into single locations provides for efficient queries and, assuming the datasets are not updated by the official providers too often, can be maintainable and trustworthy. In terms of this research, the minimum requirement for scientists to replicate queries is that the data items must be reliably identified independent of context.


1.3 Research questions

A list of research questions was created to highlight some of the current scientific data access problems described in Section 1.2.

1. What data quality issues are most prevalent for scientists working with multiple distributed datasets?

2. What data cleaning methods are suitable for scientists who need to produce results from distributed datasets that are easily replicable by other scientists?

3. What query model features are necessary for scientists to be able to identify trustworthy datasets and queries?

4. What query provenance documentation is necessary for scientists to perform queries across distributed datasets in a manner that can be replicated in a different context with a minimal amount of modification to the queries?

5. Can a distributed dataset query model be implemented, deployed, and used to effectively access linked scientific datasets?

1.4 Thesis contributions

This thesis introduces a new context sensitive model for querying distributed linked scientific datasets that addresses the problems of data cleaning, trust, and provenance discussed in Section 1.2. The model allows users to define contextual rules to automate the data cleaning process. It provides context sensitive profiles which define the trust that scientists have in datasets, queries, and the associated data cleaning methods. The model enables scientists to share methodologies through the publication and retrieval of their queries' provenance. The thesis describes the implementation of the model in a prototype web application and validates the model based on examples that were difficult to resolve without the use of the model and the prototype implementation.

The contributions will be evaluated by comparing the model and implementation features and methodology against other similar models in terms of their support for context-sensitive, clean, and documented access to heterogeneous, distributed, linked datasets. These features are necessary to provide support to current and future scientists who perform experiments and analyse data based on public datasets.

1.5 Publications

• Ansell. [2011]. Model and prototype for querying multiple linked scientific datasets. Future Generation Computer Systems. doi:10.1016/j.future.2010.08.016

• Ansell, Hogan, and Roe. [2010]. Customisable query resolution in biology and medicine. In Proceedings of the Fourth Australasian Workshop on Health Informatics and Knowledge Management (HIKM2010).


• Ansell. [2009]. Collaborative development of cross-database Bio2RDF queries. Presentation at EResearch Australasia 09.

• Ansell. [2008]. Bio2RDF: Providing named entity based search with a common biological database naming scheme. Presentation at BioSearch08 HCSNet Next-Generation Search Workshop on Search in Biomedical Information.

• Belleau, Ansell, Nolin, Idehen, and Dumontier. [2008]. Bio2RDF's SPARQL Point and Search Service for Life Science Linked Data. Poster at Bio Ontologies 2008 workshop.

Other publications:

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Enhancing BLAST Comprehension with SilverMap. Presentation at 2009 Microsoft eScience Workshop.

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Finding Friends Outside the Species: Making Sense of Large Scale BLAST Results with Silvermap. Presentation at EResearch Australasia 09.

• Rosemann, Recker, Flender, and Ansell. [2006]. Understanding context-awareness in business process design. Presentation at Australasian Conference on Information Systems.

1.6 Research artifacts

The research was made up of a model and a prototype implementation of the model. The model was designed to overcome the issues related to data access in science that were identified in Section 1.1. The prototype was then implemented to provide evidence for the usefulness of the model. The prototype was implemented using Java and JavaServer Pages (JSP), and contained approximately 35,000 lines of program code (http://www.ohloh.net/p/bio2rdf/analyses/latest). The prototype included a human-understandable HTML interface, as shown in Figure 1.13, along with a variety of computer-understandable RDF file formats including RDF/XML, Turtle, and RDFa. The 0.8.2 version of the prototype was downloaded around 300 times from SourceForge (http://sourceforge.net/projects/bio2rdf/files/bio2rdf-server/bio2rdf-0.8.2/).

The prototype was primarily tested on the Bio2RDF website. It provided a way to resolve queries easily across distributed linked datasets, including datasets using the normalised Bio2RDF resource URIs [24] and datasets using other methods for identifying data and links to other data. Bio2RDF is a project aimed at providing browsable, linked versions of approximately 40 biological and chemical datasets, along with datasets produced by other primary and tertiary sources where possible. The datasets produced by Bio2RDF can be colocated in a single database, although the datasets are all publicly hosted in separate query endpoints.
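As an illustration of how normalised Bio2RDF URIs can be consumed, the sketch below resolves one such URI into an RDF graph using the Apache Jena library. The specific URI, and the assumption that the service returns RDF for it, are illustrative only.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class Bio2RDFResolver {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Jena fetches the URI over HTTP, requesting an RDF serialisation,
        // and parses the response into an in-memory graph.
        model.read("http://bio2rdf.org/geneid:4129");
        System.out.println("Statements retrieved: " + model.size());
    }
}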



Figure 1.13: Bio2RDF website

The configuration information necessary for the prototype to be used as the engine for the Bio2RDF website was represented using approximately 27,000 RDF statements. A large number of datasets were linked back to their original web interfaces and to descriptions of the licenses defining the rights that users of particular datasets have. In total, 170 namespaces were configured to provide links back to the original web interface as part of the Bio2RDF query process, and 176 namespaces were configured to provide links back to a web page that describes the rights that users of the dataset have.

The Bio2RDF website was accessed 7,415,896 times during the testing period, and the prototype performed 35,694,219 successful queries and 1,078,354 unsuccessful queries on the set of distributed datasets to resolve the user queries. The website was accessed by at least 10,765 unique IP addresses, although two of the three mirrors were located behind reverse proxy systems that did not provide information about the IP address of the user accessing the website. The three instances of the prototype running on the mirrors were synchronised using a fourth instance of the software that provided the most up to date version of the Bio2RDF configuration information to each of the other mirrors at 12-hourly intervals.

1.7 Thesis outline

The thesis starts with an introduction to the problems that form the motivation for this research in Chapter 1. The related research areas are described in Chapter 2, starting from a broad historical perspective and narrowing down to related work that attempts to solve some of the same issues. Chapter 3 contains a description of the model that was created in order to make it simpler for scientists to deal with the issues described in Chapter 1. Chapter 4 describes how the model and prototype implementation can be integrated with scientific practices. The prototype that was created as an implementation of the model is described in Chapter 5. Chapter 6 contains a discussion of the issues that were addressed by the model and prototype, along with outstanding issues and a description of the methodology and philosophy that guided the research at a high level. A conclusion and a brief description of potential future research directions are given in Chapter 7.


Chapter 2

Related work

2.1 Overview

For most of scientific history, data and theories have been accumulated in personal notebooks and letters to fellow scientists. The advent of mechanical publishing in the Renaissance period allowed scientists to mass produce journals that detailed their discoveries. This provided a broader base on which to distribute information, making the discovery cycle shorter. However, data still needed to be processed manually, and publication restrictions made it hard to publish raw data along with results and theories. The use of electronic computers to store and process information gave scientists efficient ways to permanently store their data, allowing for future processing. However, electronic storage costs were initially prohibitive, and scientists generally did not share raw data due to these costs. The creation of a global network to electronically link computer systems provided the impetus for scientists to begin sharing large amounts of raw data, as the costs were finally reduced to economic levels. This chapter provides a short summary of a range of data access and query methods that have been proposed and implemented.

Networked computers, particularly those that make up the Internet, are now vital to the process of scientific research. The Internet provides access to information that scientists can retrieve as they require, with over 1000 databases in the biology discipline [50]. A scientist needs to request information from many different locations to process it using their available computing resources. There are many ways that scientists could request the information, making it difficult for other scientists to replicate their research based on their publications. They can publish computer understandable instructions detailing ways for other scientists to replicate their research, but other scientists need to be able to execute the instructions locally. This is difficult as the contextual environment differs between locations, including different operating systems, database engines, and dataset versions. An initial step towards easy replication by peers may be to use workflow management systems to integrate the queries and results into replicable bundles. Applications such as Taverna [105] make it possible to use workflows to access distributed data sources.

The methods that are most commonly used by scientific workflow management systems are Web Services, in particular SOAP XML Web Services. Although XML is a useful data format, it relies on users identifying the context that the data is given in, and documents from different locations may not be easy to integrate, as the XML model only defines the document structure and not the meaning of different parts of the document. Web Services can be useful as data access and processing tools, especially if scientists are regularly working with program code, so they are useful if the context is identifiable by the scientist. However, the use of Web Services does not make it simple to switch between the use of distributed services and local services, as users need to implement the computer code necessary to match the web service interface to their own data. The method of data access is not distinctly separated from either the queries or the data processing steps within workflows, so scientists need to know how to either reverse engineer the web service, modify a workflow to match their context if the web service is not available, or access the data locally using another method.

A scientist needs to be able to identify links between different pieces of data to develop scientifically significant theories across different datasets. The XML data format, for example, is used as the basis for Web Services and many other data formats. However, XML and other common scientific data formats do not contain a standard method for identifying data items, or links to other data items, in a way that is generically recognisable. In order to link between data items in different documents, a generic data format, RDF (Resource Description Framework), was developed. RDF is able to show links between different data items using Uniform Resource Identifiers (URIs) as identifiers for different data items. RDF statements use URIs and literals, such as strings and numbers, to provide links between relevant URIs and to state properties of data items, respectively. RDF is used as the basis for the Semantic Web, which is ideally characterised by the use of computer understandable data by computers to provide intelligent responses to questions.
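To make the structure of RDF statements concrete, the sketch below uses the Apache Jena library to build two statements about a gene: one attaching a literal property and one linking to another URI-identified item. All URIs and property names are hypothetical.

import org.apache.jena.rdf.model.*;

public class RdfStatementsExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource gene = model.createResource("http://example.org/gene/4129");
        Property name = model.createProperty("http://example.org/vocab/", "name");
        Property relatedTo = model.createProperty("http://example.org/vocab/", "relatedTo");

        // A literal property of the data item.
        gene.addProperty(name, "Monoamine oxidase B");
        // A link to another URI-identified data item.
        gene.addProperty(relatedTo,
                model.createResource("http://example.org/disease/neuritis"));

        // Serialise the graph in Turtle, one of several RDF syntaxes.
        model.write(System.out, "TURTLE");
    }
}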

The original focus of the Semantic Web community was to define the way that community agreed meanings for links, commonly known as ontologies or vocabularies, would be represented. This resulted in the development of OWL (Web Ontology Language) as a way of defining precisely what meaning could be implied from a statement given in RDF. Although this was useful, it did not fulfil the goals of the Semantic Web, as it is a format rather than a dataset, and the published datasets suffered from both data quality problems and a lack of contextual links between them. The W3C community recognised this as an issue, and Tim Berners-Lee formed a set of guidelines, known as Linked Data (http://www.w3.org/DesignIssues/LinkedData.html), describing a set of best practices for using URIs in RDF documents. Linked Data uses HTTP URIs to access further RDF documents containing useful information about the link.

According to the best practices given by the Linked Data community, URIs should be represented using the HTTP protocol, which is widely known and implemented, so many different people can get access to information about an item using the URI that was used to identify the item. When a Linked Data URI is resolved, the results should contain further relevant Linked Data URIs for contextually related items, as shown in Figure 2.1.



This provides a data access mechanism that can be used to browse between many different types of data, in a similar way to the HTML document web.
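A minimal sketch of the naive querying loop from Figure 2.1 is given below, assuming Apache Jena for retrieval and parsing; the starting URI is hypothetical, and the crawl is bounded so that it does not attempt to resolve the entire web of data.

import java.util.*;
import org.apache.jena.rdf.model.*;

public class NaiveLinkedDataCrawler {
    public static void main(String[] args) {
        Model local = ModelFactory.createDefaultModel();
        Deque<String> toVisit = new ArrayDeque<>(List.of("http://example.org/gene/4129"));
        Set<String> visited = new HashSet<>();
        int limit = 10; // bound the crawl for the example

        while (!toVisit.isEmpty() && visited.size() < limit) {
            String uri = toVisit.poll();
            if (!visited.add(uri)) continue;
            try {
                // Retrieve the Linked Data URI and store the statements locally.
                local.read(uri);
            } catch (Exception e) {
                continue; // skip unresolvable URIs
            }
            // Analyse the local database for references to other Linked Data.
            StmtIterator it = local.listStatements();
            while (it.hasNext()) {
                RDFNode object = it.next().getObject();
                if (object.isURIResource()) toVisit.add(object.asResource().getURI());
            }
        }
        // A query can now be performed on the local database.
        System.out.println("Local statements: " + local.size());
    }
}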

The Linked Data guidelines encourage scientists to publish results that contain useful contextual links, including links to data items in other scientific disciplines. Although some scientists wish there to be a single authoritative URI for each scientific data item, RDF is designed to allow unambiguous links between multiple representations of a data item using different properties. In addition, Linked Data does not specify that there needs to be a single URI for each data item. A scientist needs to be able to create alternative URIs so they can reinterpret data in terms of a different schema or new evidence, without requiring the relevant scientific community to accept and implement the change beforehand.

Figure 2.1: Linked Data Naive Querying. [Figure showing the naive query loop: retrieve a Linked Data URI; parse the RDF document and store it in a local database; analyse the document for references to other Linked Data; if there are new references, retrieve them; finally, perform the query on the local database.]

Despite its general usefulness, plain Linked Data is not an effective method for systematic querying of distributed datasets, particularly where links are not symmetric, as shown in Figure 2.2. Even when all Linked Data URIs are symmetric, many URIs need to be resolved and locally stored before there is enough information to make complex queries possible. In the traditional document web, textual searches on large sets of data are made possible using a subset of the documents on each site. Although this sparse search methodology is very useful for textual documents, it is not useful for very specific scientific queries on large, structured, linked scientific datasets.

The RDF community developed a query language named SPARQL (SPARQL Query Language for RDF; http://www.w3.org/TR/rdf-sparql-query/) to query RDF datasets using graph matching techniques. SPARQL provides a way to perform queries on a remote RDF dataset without first discovering and resolving all of the Linked Data URIs to RDF documents. However, it is important that resolvable Linked Data URIs are present, so that the data at each step is independently accessible outside of the context of a SPARQL query.
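The sketch below shows a simple SPARQL SELECT query executed against a remote endpoint using Apache Jena, without resolving individual Linked Data URIs first. The endpoint URL and vocabulary property are hypothetical.

import org.apache.jena.query.*;

public class SparqlQueryExample {
    public static void main(String[] args) {
        String query =
                "SELECT ?gene ?name WHERE { " +
                "  ?gene <http://example.org/vocab/name> ?name . " +
                "} LIMIT 10";
        // Send the query to a remote SPARQL endpoint and iterate the results.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://example.org/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("gene") + " " + row.get("name"));
            }
        }
    }
}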

SPARQL is useful for constructing complex graph based queries on distributed datasets; however, it was not designed to provide concurrent query access to multiple distributed datasets. In order for scientists to take advantage of publicly accessible scientific datasets, without requiring them to store all of the data locally, queries must be distributed across different locations.



Figure 2.2: Non Symmetric Linked Data. [Figure showing two Linked Data URIs for the same concept X: <http://firstorganisation.org/resource/concept/X> links to <http://otherorganisation.org/concept:X>, but no link exists in the reverse direction.]

Figure 2.3 shows the concept of data producers interlinking their datasets with other sources before offering the resulting data on the network, contrasted with the data silo approach, where all data is stored locally and the maintenance process is performed by a single organisation rather than by each of the data producers.

Federated SPARQL systems distribute queries across many datasets by splitting SPARQL queries up according to the way each dataset is configured [40]. An example of a dataset configuration syntax is VoiD (http://vocab.deri.ie/void/), which includes predicates and types, labels for overall datasets, and statistics about interlinks between datasets. The component queries are aggregated by the local SPARQL application before returning the results to the user. Although there is no current standard defining Federated SPARQL, a future standard may not be useful for scientists if it does not recognise the need for the dataset to be filtered or cleaned prior to, or in the act of, querying. In addition, the method and locations used to access distributed datasets are embedded directly into queries with current Federated SPARQL implementations (http://esw.w3.org/SPARQL/Extensions/Federation), in a similar way to web services and workflow management systems.
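As an illustration, the sketch below embeds dataset locations directly in the query text using the SERVICE extension referenced above, which was later standardised as SPARQL 1.1 Federated Query. The endpoint URLs and properties are hypothetical; note how the query cannot be moved to a different context without rewriting the embedded locations.

import org.apache.jena.query.*;

public class FederatedSparqlExample {
    public static void main(String[] args) {
        String query =
                "SELECT ?gene ?disease WHERE { " +
                "  SERVICE <http://example.org/sparql/genes> { " +
                "    ?gene <http://example.org/vocab/symbol> \"MAOB\" . " +
                "  } " +
                "  SERVICE <http://example.org/sparql/diseases> { " +
                "    ?disease <http://example.org/vocab/targetsGene> ?gene . " +
                "  } " +
                "}";
        // The local endpoint splits the query and forwards each SERVICE
        // block to the endpoint named in the query text.
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://example.org/sparql/local", query)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}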

Federated SPARQL strategies allow scientists to perform queries across distributed linked datasets, but the cost is that the datasets must all use a single URI to identify an item in each dataset, or they must all provide equivalency statements to all of the other URIs for an item. This requirement is not trivial, and no solution has been proposed that provides access to heterogeneous datasets where the providers do not agree on a single URI structure. Proposals for single global URIs for each dataset have not so far been implemented by multiple organisations, and even if they were implemented in the future, all datasets would have to follow them for the system to be effective.



Figure 2.3: Distributed data access versus local data silo. [Figure contrasting the two approaches: in data silo maintenance, a single organisation copies Source 1, Source 2, and Source 3 across the network, curates and interlinks the sources, and stores and accesses the data locally; in distributed dataset maintenance, each publisher curates and interlinks its own source with the others, and relevant data is accessed across the network.]


Although Linked Data URIs are useful, they provide a disincentive to single global URIs. If a URI needs to be resolved through a proxy provided by a single organisation, there are many different social and economic issues to deal with. This organisation may or may not continue to have funding in the future, and it may not respond promptly to change requests. Each of these issues reduces the usefulness that the single URI may otherwise provide.

There are non-SPARQL based models and implementations that allow scientists to make use of multiple datasets as part of their experiments. They are generally based on the concept of wrapping currently available SQL, SOAP Web Service, and distributed processing grids to provide for complex cross-dataset queries. The wrappers rely on a schema mapping between the references in the query and the resources that are available in different datasets. In some systems, distributed queries rely on the ability to map high level concepts from ontologies to the low level data items that are actually used in datasets. These types of mappings require the datasets to be factually correct according to the high level conceptual structure before the query can be completed, as there is no way to simply ask for data independent of the high level concepts. In these cases it may be possible to encode the low level data items directly as high level concepts, but this reduces the effectiveness of the high level concepts in other queries.

2.2 Early Internet and World Wide Web

Although it originated as ARPANET, a project sponsored by the US defence department, the distributed network of computers that is now known as the Internet has been used to transmit scientific information in primitive forms since it was created. Email, the initial communication medium, was created to enable people in different locations to share electronic letters, including attached electronic documents. By its nature, email is generally private, although many public email lists are used to communicate within communities. Initially, the lack of data storage space made it hard for scientists to create and curate large datasets, resulting in a bias towards communicating scientific theories rather than raw data.

The World Wide Web was created by scientists at CERN in Switzerland to share information using the Internet. It is based on the HTML (HyperText Markup Language; http://www.w3.org/TR/html4/) and HTTP (Hypertext Transfer Protocol; http://tools.ietf.org/html/rfc2616) specifications. The HTML specification enables document creators to specify layouts for their documents and include links to other documents. HTTP, and in some cases HTTPS and FTP, form the core transport across which HTML and other documents are served using the Internet. HTML links, generally HTTP URIs, are created without a specific purpose other than to provide a navigational tool.

HTML is regularly used by scientists to browse through data. HTML versions of scientific datasets display links to scientific data both within the dataset and in other datasets, using HTTP URLs.



These interfaces enable scientists to navigate between datasets using the links that the data producer determined were relevant.

With recent improvements in computing infrastructure, including very cheap storage space and large communication bandwidths, scientists are now able to collaborate in real time with many different peers using the Internet. The massive scale electronic sharing of data, which allows this collaboration, has been described as a “4th paradigm” for science [71]. The first three paradigms are generally recognised as being based on theory, empirical experimentation, and numerical simulation, respectively. Real-time collaboration and data sharing enable scientists to process information more efficiently, arguably resulting in shorter times between hypotheses, experiments, and publications.

2.2.1 Linked documents

Initially the Web was used to publish static documents containing hardcoded links between documents. To overcome the maintenance effort required to keep these static documents up to date, systems were created to enable queries on remote datasets without having to download the entire dataset. The results of these queries may be provided in many different formats, but there is no general method for identifying links between datasets, and in some cases no description of the meaning of the links that are given. In HTML, where links are commonly provided, the URL link function is not necessarily standardised between datasets, making it potentially hard to match a reference to references in an HTML page from a different data provider.

Linked documents, most commonly created using HTML, are popular due to the ease with which they can be produced. However, the semantic content of HTML documents is limited to structural knowledge about where a link is in the document and what the surrounding elements are. Linked documents clearly provide information about the existence of a relationship between documents, but they do not propose meanings for the relationship. They can be annotated with meaningful information, a popular example being the Dublin Core specifications (http://dublincore.org/). Dublin Core can be used to annotate documents with standard properties such as author and title. Textual search engines, supplemented with knowledge about the existence of links, make it possible to efficiently index the entire web, including annotations such as those provided by Dublin Core.
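The sketch below annotates a document with standard Dublin Core properties using Apache Jena's bundled DC vocabulary; the document URI is hypothetical.

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.DC;

public class DublinCoreExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource doc = model.createResource("http://example.org/papers/1");
        // Standard Dublin Core annotation properties.
        doc.addProperty(DC.title, "A context sensitive model for querying linked scientific data");
        doc.addProperty(DC.creator, "Peter Ansell");
        model.write(System.out, "TURTLE");
    }
}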

2.3 Dynamic web services

Dynamic processing services were created to avoid having to retrieve entire data files across the Internet before being able to determine if the data is useful. Initially these were created using ad hoc scripts that processed user inputs using a local datastore and returned the results. However, these services required each new service to be understood by humans and implemented using custom computer programs. Web services can be created that enable computers to understand the interface specification and programmatically access the service, without a human programmer having to understand each of the parameters and hand-code the interface.



Programmatic web services may be loosely or tightly specified. One example of a tight specification, where the interactions are specified in a tight contract, is the SOAP protocol (http://www.w3.org/TR/soap/). The SOAP protocol enables computers to directly obtain the interfaces that in prior cases needed to be understood by humans. Web services are used by scientists to avoid issues with finding data in other ways, as the data can be manipulated in program code with the SOAP interface specifying what to expect. Web services may be used just for data retrieval, but they are also commonly used as specific query interfaces that enable scientists to use remote computing resources to analyse their data. Although the SOAP specification allows multiple distributed endpoints to be used with a single Web Service, in most cases only a single data provider is given, with alternative ways to access that single provider. This causes issues with overloading of particular services and can in some cases stop scientists from processing their data if the service is unavailable.

The nature of SOAP web services, with a tightly bound XML data contract, enables them to be reverse engineered and implemented locally. However, this process requires the re-creation of the code that was used to get the data out of a database or processing application, which is not necessarily trivial. The use of semantic markup to describe and compose Web Services has had limited success, although the lack of success may be due to the nature of the ontologies rather than a defect in the model [99].

Although web processing services are useful, the data may be returned in a format that is not understandable by someone outside of the domain. In the case of SOAP, users are required to understand XML or, alternatively, write a piece of code to translate it to their desired format. In the case of other web processing services, the data may be returned in any of a number of data formats, some of which may be common to similar processing services. Although the sharing of specific data using these dynamic web services is useful, scientists currently spend a large amount of their time translating between the data formats required by the different services. The translation between different data formats also makes it hard to recognise references between data items without first understanding the domain and the different methods each domain uses to construct the references.

2.3.1 SOAP Web Service based workflows

Current scientific workflow management systems focus on data access using a combination of SOAP Web Service calls and locally created code. Although this is a useful way to access data reliably, as it includes fault aware SOAP calls, it creates a reliance on XML syntax as the basic unit of data interchange. The tightly defined XML specification, detailed in the SOAP contract, prevents multiple results from being merged into single result documents easily if the specification did not allow for this initially. Although there is some work being done to enhance the semantic structure of the XML returned by web services, it does not provide a clear way to integrate web services implemented by different organisations.



A discovery protocol such as the Simple Semantic Web Architecture and Protocol (SSWAP) provides a way of discovering the semantic content published in web services, but it does not provide a way for scientists to directly use the web services discovered by the protocol [102]. The ontology description of the inputs, operations, and outputs of web services is a useful addition to the WSDL description, which defines the structure of the data inputs and outputs, but it would need to integrate with the data and provide a mapping between data structures before it would be useful as a data access protocol.

Perhaps the biggest results in this area have been attained by the manually coded Web Service wrappers created by the SADI project [143]. Although it is useful, the mechanism for identifying which datasets are relevant to each provider is implemented on a case by case basis inside the wrappers. The scientist's high level query cannot be used in different contexts to integrate data that was not created using identical identifiers and the same properties. SADI uses a single query to determine which sub-queries are necessary. If future datasets follow the exact structure and the identifiers are not changed, then the single query can be adapted or reused to fit future data providers. The SADI model allows future datasets to be represented using different data structures, as long as they can be transformed using a manually coded wrapper or an OWL based ontology transformation, as the data structure given in the query is fundamental to the central query distribution algorithm. This restriction ties the scientist closely to the actual data providers that are used, which makes it difficult to share or extend queries in different contexts. The parts of the query that need to be modified may not be easy to identify, given that each query is executed as a combined workflow.

For example, although the datasets are equivalent, SADI would find it difficult to query across the data providers shown in Figure 2.4, as there is both variation in the references and variation in the data structures. In the example, some information for the results needs to come from each dataset, but there is no single way to resolve the query, as there are different ways of tracing from the drug to the fact that the gene is located on chromosome X, depending on whether the information about its gene sequence is required. In addition, property names are duplicated between datasets for different purposes. In the Drug dataset, the property "Target's gene" contains a two part reference to a dataset along with the identifier in the dataset, while in the Disease dataset, the property contains a symbol without a reference to a dataset, and the symbol is not identical to the identifier in the Drug dataset.

2.4 Scientific data formats

Scientific data that will be used by computer programs, as opposed to the HTML that is used by humans, is presented in many different formats. These formats are generally text formats, as binary specifications that are common for some programs are not transportable and usable in different scenarios, something which is important to scientists. Although in the past scientists created new file formats for each area, new scientific document formats are now generally standardised on XML as the basic format.


Figure 2.4: Heterogeneous datasets. [Figure showing the query "Get the name, location, genetic structure, and taxonomy information for genes and diseases that are relevant to the drug Marplan" resolved against four datasets: the Drug dataset (Name: Marplan; Targets disease: Neuritis; Targets gene: ncbi_gene, 4129); the Disease dataset (Name: Neuritis; Targets gene: hgnc, MAOB); the NCBI Gene dataset (Name: Monoamine oxidase B; Identifier: 4129; Gene symbol: hgnc, 6834; Taxonomy: 9606; Genetic sequence: AGGCT...; Location: Xp11.23); and the HGNC Gene dataset (Name: Monoamine oxidase B; Identifier: 6834; Gene symbol: MAOB; Taxonomy: Human; Chromosome: X). The useful results are: Monoamine oxidase B, X chromosome, AGGCT..., Human, Neuritis.]


The common use of XML means that documents are easily parseable and, if they follow a particular schema, they can be validated easily by programs.

The use of XML along with XML Schema (http://www.w3.org/TR/xmlschema-1/) makes it simple to construct complete documents and verify that they are syntactically valid. However, the strict nature of XML Schema verification does not allow extension of the document. In addition, XML does not contain a native method for describing links between documents. Although HTML contains a native method for linking between documents, it is not designed to describe the links in terms of recognisable properties.

In addition to XML based formats, scientists share documents using the ASN.1 standard (ISO/IEC 8825-1:2008; http://www.iso.org/iso/catalogue_detail.htm?csnumber=54011) and custom, domain specific data formats based on computer parseable syntax descriptions encoded using EBNF (ISO/IEC 14977:1996; http://www.iso.org/iso/catalogue_detail.htm?csnumber=26153). These documents may support links to different datasets, but they require knowledge of the way each link is specified to determine its destination, and the data item that the link refers to may not be directly resolvable using the link.

2.5 Semantic web

The Semantic Web was designed to give meaning to the masses of data that are represented using electronic documents. The term was defined using examples which focus on the ability of computerised agents to automatically reason across multi-discipline distributed datasets, without human intervention, to efficiently process large amounts of information and generate novel complex conclusions to simple questions [27]. The Semantic Web is commonly described using idealistic scenarios where data quality issues are non-existent or are assumed not to affect the outcome, data from different locations is generally trustworthy, and different data providers all map their data to shared vocabularies. However, the slow development of the necessary tools has shown that the achievable operational level of a global Semantic Web may be very limited. For instance, Lord et al. [91] found that, based on their experience, “[the] inappropriateness of automated service invocation and composition is likely to be found in other scientific or highly technical domains.”

A completely automated Semantic Web is still possible; however, it is more likely that human driven applications can derive novel benefits from the extra structure and conditions provided by semantically marked up, shared data. In comparison to agent based approaches, human data review incorporates curation and query validation processes at each step.

Although the Semantic Web (SW) may be a user navigable dataset which includes references to the current HTML web, its major features are independent of the way the web is used today. These features include a focus on computer understandability, and a common context between documents to use as the basis for interpreting any meaning that may be given.



A scientist needs to unambiguously describe the elements of a query using terminologies that are compatible with the datasets they have available, including some commonly used ontologies described in Appendix B. This process requires a way to apply rules to each of the terms in a query to determine whether they are recognisable and not ambiguous, and to convert them to fit the context of each dataset as necessary. This process is not trivial, and no general solution has been found, although some solutions focus on the use of Wikipedia as a central authority to disambiguate terms [18]. Unfortunately, Wikipedia only contains a subset of the terms needed by scientists, as it is defined by its community as a general knowledge reference rather than a scientific knowledge reference. Although new terms can be added to Wikipedia at any time, they can just as quickly be deleted, meaning that it cannot be relied on as a permanent reference.

The Semantic Web is currently being developed using the RDF (Resource Description Framework) specification for data interchange12. RDF is a graph-based data format that provides a way to contextually link information from different datasets using URI links and predicates, which are also represented as URIs. These predicate URIs can be defined in vocabularies. In comparison to domain-specific data formats and XML, RDF is designed to be an extensible model, so information from different locations can be meaningfully merged by computers.

RDF is based on URIs, of which the HTTP URLs commonly used by the document web are a subset. The RDF model specifies that the use of a URI in different statements implies a link between the statements. The URI should contain enough information to make it easy for someone to get information about the item if they understand the protocol used by the URI. In the past, scientists have been hesitant to create HTTP URIs due to perceived technical and social issues surrounding the HTTP protocol and the DNS system that HTTP is commonly used with [41]. Some RDF distributions, such as the OBO schema, recommend instead that scientists represent links in RDF using other mechanisms, such as key-value pairs with a namespace prefix in one RDF triple and the identifier in another triple. This use of RDF does not enable computers to automatically merge different RDF statements, as there are no shared URIs between the statements.
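
To make the merging behaviour concrete, the following sketch uses Python with the rdflib library; the vocabulary URI and property values are invented for illustration, while the gene URI follows the Bio2RDF pattern discussed later in this chapter. Two fragments that reuse the same HTTP URI collapse onto a single graph node when parsed together, which is exactly what the prefix-plus-identifier pattern prevents.

```python
from rdflib import Graph, URIRef

# Two hypothetical RDF fragments, published at different locations,
# that reuse the same HTTP URI for one gene record.
fragment_a = """
@prefix ex: <http://example.org/vocab#> .
<http://bio2rdf.org/geneid:4129> ex:note "statement from provider A" .
"""
fragment_b = """
@prefix ex: <http://example.org/vocab#> .
<http://bio2rdf.org/geneid:4129> ex:note "statement from provider B" .
"""

graph = Graph()
graph.parse(data=fragment_a, format="turtle")
graph.parse(data=fragment_b, format="turtle")

# The shared URI has become a single node, so both statements can be
# retrieved together without any extra mapping information.
gene = URIRef("http://bio2rdf.org/geneid:4129")
for predicate, obj in graph.predicate_objects(subject=gene):
    print(predicate, obj)
```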

2.5.1 Linked Data

The Semantic Web community has defined guidelines for creating resolvable, shared identifiers, in the form of HTTP URIs, under the banner “Linked Data”13. The guidelines codify a way to gain access to factual data across many different locations, without relying on locally available databases or custom protocols. Although its specifications are independent of many of the goals of the Semantic Web, Linked Data is designed to produce a web of contextually linked data that may then be incrementally improved to represent a semantically meaningful web of information.

12. http://www.w3.org/TR/rdf-syntax-grammar/
13. http://www.w3.org/DesignIssues/LinkedData.html


In comparison to linked documents, Linked Data provides the basis for publishing links between data items that are not solely identifiers for electronic documents. Using Linked Data, an HTTP link can be resolved to a document that can be processed and interpreted as factual data by computers without intervention by a human.

The Linked Data guidelines refer to the ability to use Uniform Resource Identifiers (URIs) to represent both an item and the way to get more information about the item. Although this in itself is not new, the Linked Data design recommends conventions that enable virtually all current computing environments to get access to the information. These conventions are centred around the use of the Hypertext Transfer Protocol (HTTP) and the Domain Name System (DNS), along with the reliance of these systems on TCP/IP (Transmission Control Protocol/Internet Protocol). The documents representing Linked Data are transferred in response to HTTP requests as files encoded in one of the available RDF syntaxes, selected using HTTP Content Negotiation (CN) headers.
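
As an illustration, the sketch below (Python with the requests library) resolves a Linked Data URI and uses an Accept header to ask for an RDF serialisation instead of HTML. The URI follows the Bio2RDF pattern used elsewhere in this thesis; the assumption that the server honours content negotiation is exactly the Linked Data convention being described.

```python
import requests

# Resolve a Linked Data URI, asking the server for RDF/XML rather
# than the default HTML, via HTTP Content Negotiation.
uri = "http://bio2rdf.org/geneid:4129"
response = requests.get(
    uri,
    headers={"Accept": "application/rdf+xml"},
    timeout=30,
)

print(response.status_code)
print(response.headers.get("Content-Type"))
# The body should now be an RDF document describing the item, ready
# to be handed to any standard RDF parser.
print(response.text[:300])
```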

The RDF specification is neutral about the interpretation of data, other than to say that URIs used in different parts of a document should be merged into a single node in an abstract graph representing the document. There are a large number of RDF parsers available, and the RDF specification is not proprietary, making RDF-based Linked Data accessible to a large number of environments currently using the linked document web. Typical web documents can also be integrated as Linked Data, with the most common method being RDFa14, which can be layered onto a traditional HTML document. The use of RDFa makes it possible to include computer-understandable data and visually marked up information in the same document, using HTTP URIs for semantic links to other documents.

Scientists can utilise Linked Data to publish their data with appropriately annotated semantic links to other datasets. Linked Data systems are not centralised, so there is no single authority to define when datasets can be linked to other datasets. This means that scientists can use Linked Data to publish their results with reference to either accepted scientific theories or new theories as necessary, and have the links recognised and merged automatically by others.

2.5.2 SPARQL

Although Linked Data is useful as an alternative to linked documents, it is not a useful replacement for dynamic web services. In terms of its use by scientists, Linked Data focuses on resolving identifiers to complete data records, as opposed to the filtered or aggregated results of queries. To query effectively using Linked Data alone, scientists would need to resolve all of the URIs that they know about and perform SPARQL queries on the resulting dataset. This is not efficient, as there may be millions of URIs published as part of each dataset, and there are no guarantees that the documents will not be updated in future.

Although Linked Data URIs are useful for random data access, they can also be used in published datasets to enable queries over multiple datasets using SPARQL.

14. http://www.w3.org/TR/rdfa-syntax/


SPARQL, a query language for RDF databases, is a graph-based matching language built on the RDF triple. The RDF semantics specify that co-occurrences of a URI can be represented as a single node on an abstract graph representing a set of RDF triples. In some ways SPARQL is similar to SQL, which is used to perform queries on relational databases, although SPARQL is not predicated on a set of tables with defined columns that determine which queries are valid. There are many publicly available SPARQL endpoints, including a large number of scientific datasets15.
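
A minimal sketch of SPARQL's graph-pattern matching, run here with Python's rdflib over a small in-memory graph; the data and vocabulary URIs are invented for illustration:

```python
from rdflib import Graph

graph = Graph()
graph.parse(data="""
@prefix ex: <http://example.org/vocab#> .
<http://example.org/record/1> ex:label "record one" ;
    ex:linkedTo <http://example.org/record/2> .
<http://example.org/record/2> ex:label "record two" .
""", format="turtle")

# The triple pattern plays the role that columns play in a relational
# query, but no table schema is needed: any matching triples bind.
results = graph.query("""
PREFIX ex: <http://example.org/vocab#>
SELECT ?record ?label
WHERE { ?record ex:label ?label . }
""")

for row in results:
    print(row.record, row.label)
```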

SPARQL queries are constructed based on the structures of the RDF statements in each dataset. In addition, SPARQL queries can be used to generate results without using an RDF database; for example, relational databases can be exposed as RDF datasets using SPARQL queries [29]. SPARQL queries are difficult to transport between different data providers unless the RDF statements in each location are identical. They rely on merging different triples into graphs using either URIs or Blank Nodes (which are locally referenceable RDF nodes). It is rare that different datasets use the same URIs to denote every data record and link.

In practice, different datasets use different URIs due to two main issues. Firstly, there is a lack of standardisation for URIs, as the original data provider may not define Linked Data HTTP URIs for their data records. Secondly, even if the original data provider defines Linked Data HTTP URIs, other entities may customise the data and wish to produce their own version using different URIs. For SPARQL queries it is very useful for all datasets to use the same URIs to identify an item, as it is otherwise difficult to transport queries between data providers. However, for simple data access it is more useful to use locally resolvable Linked Data HTTP URIs.
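
The transport problem can be made concrete with a small sketch: moving a query from one provider to another means rewriting the URIs embedded in the query text. The mapping below uses the Entrez Gene URIs discussed in Section 2.6; the helper itself, and the idea of a per-provider mapping table, are illustrative.

```python
import re

# Hypothetical mapping between the URI conventions of two providers
# that publish the same underlying records.
URI_REWRITES = {
    r"http://purl\.org/commons/record/ncbi_gene/(\w+)":
        r"http://bio2rdf.org/geneid:\1",
}

def transport_query(sparql_query: str) -> str:
    """Rewrite provider-specific URIs so the query matches the RDF
    statements of the target provider."""
    for pattern, replacement in URI_REWRITES.items():
        sparql_query = re.sub(pattern, replacement, sparql_query)
    return sparql_query

query = """SELECT ?p ?o
WHERE { <http://purl.org/commons/record/ncbi_gene/4129> ?p ?o . }"""
print(transport_query(query))
```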

2.5.3 Logic

RDF and Linked Data are gradually expanding and improving to provide a basis on which semantically complex queries can be answered using formal logic. The logic component specifies the way that RDF statements relate to each other and to queries on the resulting sets of RDF statements. In the context of RDF this requires the use of computer-understandable logic to identify the meaning of URIs and Blank Nodes based on the statements they appear in. The most popular and mature logic systems are currently created using Description Logics (DL), with the most popular language being the family of OWL (Web Ontology Language) variants. OWL-DL, a popular variant, is based on the theory of Description Logics, where the universe is represented using a set of facts, which are all assumed to be true for the purposes of the theory.

The theory is by design monotonic, meaning that the addition of new statements will not change current entailments, where entailments are additional statements that are added based on the set of logic rules, and which may themselves be statements under consideration. A non-monotonic theory would make it possible for extra statements to contradict, possibly override, or remove current entailments, something that would currently require one or more statements to be discarded before any reasoning could be performed consistently.
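
Monotonicity can be illustrated with a toy entailment rule. The sketch below computes the transitive closure of rdfs:subClassOf by hand (a real reasoner would apply the full OWL rule set); because the rule only ever adds statements, asserting new facts can never remove an existing entailment. The class hierarchy is invented.

```python
from rdflib import Graph
from rdflib.namespace import RDFS

graph = Graph()
graph.parse(data="""
@prefix ex: <http://example.org/vocab#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Enzyme rdfs:subClassOf ex:Protein .
ex:Protein rdfs:subClassOf ex:Molecule .
""", format="turtle")

def entail_subclasses(g: Graph) -> None:
    """Apply the rdfs:subClassOf transitivity rule until no new
    statements appear. The closure only ever grows."""
    changed = True
    while changed:
        changed = False
        for a, _, b in list(g.triples((None, RDFS.subClassOf, None))):
            for _, _, c in list(g.triples((b, RDFS.subClassOf, None))):
                if (a, RDFS.subClassOf, c) not in g:
                    g.add((a, RDFS.subClassOf, c))
                    changed = True

entail_subclasses(graph)
# ex:Enzyme rdfs:subClassOf ex:Molecule is now entailed. Adding more
# subclass statements can add entailments, but never retract this one.
print(len(graph))
```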

15. http://esw.w3.org/SparqlEndpoints


Given that scientists are forced to consider multiple theories as possibly valid before they are consistently proven valid using empirical experimentation, they should not have to apply a single logic to the entire Semantic Web (also termed by Tim Berners-Lee the Giant Global Graph (GGG)16). However, there are projects that seek to apply ontologies to all interlinked scientific datasets to facilitate queries on the GGG. Scientists require the ability to decide whether statements, or entire datasets, are irrelevant to their query, including the consequences of logic-based entailment.

2.6 Conversion of scientific datasets to RDF

RDF versions of scientific datasets have been created by projects such as Bio2RDF [24], Neurocommons [117], Flyweb [149], and Linked Open Drug Data (LODD) [121] to bootstrap the scientific Semantic Web, or at least the scientific Linked Data web. Where possible, the RDF documents produced by these organisations utilise HTTP URIs to link to RDF documents from other datasets as they are produced, including references to other organisations. This is useful, as it matches the basic Linked Data goals17, which are designed to ensure that data represented in RDF is accessible and contextually linked to related data.

In some cases, the same dataset is provided by two different organisations using two or more different URIs. For example, there is information about the NCBI Entrez Gene dataset in Neurocommons using the URI http://purl.org/commons/record/ncbi_gene/4129, and there is similar information in Bio2RDF, including the Neurocommons data where it is available, using the URI http://bio2rdf.org/geneid:4129. If the schemas used by these organisations are different, this forces scientists to accept one representation at the expense of the other when they create SPARQL queries.

The data cleaning process is a vital part of the RDF conversion process. All data is dirty in some way [70]. The most essential data cleaning process is the identification of identical resources in different datasets. In non-RDF datasets, identifiers are not typically URIs, so they require context in order to be unambiguously linked with other datasets. The lack of context means that it is not syntactically easy to verify whether two identifiers refer to the same record. The successful conversion of scientific datasets to RDF, with defined links between the datasets, makes it possible for computers to automatically merge information. Even if the data is not semantically valid, there may be unexpected but insightful results due to the implications of merges and rules that are applied to the combined data.
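
One common form of this cleaning can be sketched as a normalisation pass over parsed RDF: every known alias of a record is mapped onto a canonical URI before graphs are merged, so the merged graphs line up on shared nodes. The alias table reuses the Entrez Gene URIs from the example above; choosing the Bio2RDF form as canonical is an arbitrary illustrative choice, and the helper is not the prototype's actual normalisation rule machinery.

```python
from rdflib import Graph, URIRef

# The canonical form chosen here is the Bio2RDF URI; the Neurocommons
# URI for the same Entrez Gene record is treated as an alias of it.
ALIASES = {
    URIRef("http://purl.org/commons/record/ncbi_gene/4129"):
        URIRef("http://bio2rdf.org/geneid:4129"),
}

def normalise(graph: Graph) -> Graph:
    """Rewrite aliased URIs wherever they occur in a triple, so that
    merged graphs share one node per record."""
    clean = Graph()
    for s, p, o in graph:
        clean.add((ALIASES.get(s, s), ALIASES.get(p, p),
                   ALIASES.get(o, o)))
    return clean

g = Graph()
g.parse(data="""<http://purl.org/commons/record/ncbi_gene/4129>
    <http://example.org/vocab#note> "from Neurocommons" .""",
        format="turtle")
print(next(iter(normalise(g))))
```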

RDF conversion may incur a cost in relation to query performance and storage space, as an RDF version of a traditional relational database is likely to be larger than a relational database dump format such as SQL. However, the RDF version is functionally equivalent and can be automatically merged with other RDF datasets.

16. http://dig.csail.mit.edu/breadcrumbs/node/215
17. http://www.w3.org/DesignIssues/LinkedData.html


Compared to the equivalent relational database, an RDF database is typically larger, as RDF databases are typically very highly normalised, while relational database schemas tend to be designed with a tradeoff between normalisation and query efficiency. The size of the database limits the efficiency of queries, a factor which makes it impractical to merge a large number of datasets into a single RDF database. For scientific datasets the file size expansion is, from experience, anywhere from 3 to 10 times, depending on the ratio of plain text to URIs in the dataset.

2.7 Custom distributed scientific query applications

A number of systems have been developed for the single purpose of querying distributed datasets. Some of these systems are focused on a single domain, such as the Distributed Annotation System (DAS) [110], which provides access to distributed protein and gene annotations, while others are focused on a mix of domains related to a single topic, such as the range of datasets related to cancer research that are accessible using the caGrid infrastructure. Other systems, such as OGSA-DAI, are built on grid computing resources, while others, such as BioMart, are designed to provide simple methods of mirroring datasets locally, along with the ability for scientists to visually construct queries by picking resources from lists provided by the different datasets.

The Distributed Annotation System allows distributed, standardised querying of biological data. However, it does not use RDF, as it is focused on a very specific domain. The biological datasets are represented using discipline-specific interfaces and file formats which can be understood by most bioinformatics software. In comparison to an RDF-based query solution, the current DAS implementation suffers in that it requires software updates at all of the relevant sites to support any new classes of information. RDF-based systems can be extended without having multiple data models implemented in their software. RDF-based solutions enable users to extend an original record structure in a completely valid way using their own predicates, enabling customised configuration-based additions as well as the up-to-date annotation data that DAS is designed to provide.

A variety of different computing and storage grids have been set up in the past to allow scientists to access large-scale computing resources without having a supercomputer available locally. An example of a large scientific grid is caGrid, developed to enhance the transfer of information between cancer research sites [127]. It includes an example of a query system that requires a complex ontology to determine where queries and subqueries go. It is able to utilise SPARQL as a query language, but it translates queries directly to an internal query format, using ontologies to define mappings between different services.

caGrid does not allow users to perform these queries outside of the caGrid computing resources, as it requires the use of its internal query engine to execute queries. Although software to communicate with caGrid is provided for general use, it is highly specific and would be difficult to replicate in other contexts.


caGrid is able to provide provenance, based on the internal queries that are executed as part of the overall query, but its emphasis on ontologies as the sole mapping tool means that it cannot be used as a generic model for querying many datasources, irrespective of their syntactic data quality. The mappings are defined in a central registry, making it difficult for scientists to define their own mappings, or to integrate their own datasets with different, unpublished mappings.

OGSA-DAI is a general-purpose grid-based mapping facility, but it does not have a single standard data format, so it relies on services providing mappings between all possible formats, in a similar way to general workflow engines [59]. For scientists to utilise links between datasets in the resulting documents, the identifiers used by the datasets must be mapped to one another. This is not simple: the system uses a single binary unique identifier property that is presumed to be large enough to be suitable for the distributed system, but this property is not used in any other models, so mapping applications must rely on other properties to decide which data items in the system match linked data from other applications.

The OGSA-DAI model allows users to integrate different datasets, but these changes may not be replicable, as the record of the changes may not be visible in any of the provenance information that could be produced by the system. It contains some degree of user context sensitivity, through the use of profiles that decide what level of ontology support is required by different users, but this context sensitivity does not extend to the query or dataset components. In particular, users must completely understand the structure of the underlying datasets to successfully perform a query: the ontologies, the way the data is arranged, and its syntactic structure, including whether the data appears in lists or in single items.

The understanding of the generic underlying architecture does not necessarily include identifying links between datasets, or the concept of data transformations outside of the metadata that is provided by the service. The metadata is solely provided by services, although this may be extended through software extensions to support user-defined metadata. The overall focus of the system on a grid infrastructure, as opposed to an abstract set of linked datasets, makes it unsuitable for general use in local contexts where users may not be able to simulate the grid infrastructure. The model does not seem to be designed for heterogeneous systems where users want to integrate datasets from grid sources, local sources, and other non-grid-related sources.

A general-purpose scientific example of software that can be used to mirror datasets locally, and to perform queries both locally and on the foreign datasources, is BioMart [130]. It can be used to construct queries across different datasets, although there are no global references internally, so it requires users to know which fields map to other datasets, or to rely on the mapping provided by the dataset author. The datasets are curated by the authors, who then publish the resource and register it with the BioMart directory. The mapping language that is used to map queries between datasets does not provide the ability for users to customise the mappings that are used by the distributed datasets, or to utilise any dataset that is not available using either the BioMart format or a generic SQL database.


As the mappings are based on the knowledge of the scientists who have chosen to query datasets from the mart, the context of the links is hard to define outside of the BioMart software.

The query that is executed by the BioMart software can be used as part of a provenance record, as it contains references to the datasets that were used; however, it focuses solely on the name of each dataset, and there are no references to where the dataset might be found other than the internal BioMart conventions for searching for datasets with that name. The ability of the software to mirror datasets locally provides some degree of context, although the process relies on the scientist's ability to arrange for the data quality to be verified and corrected before they perform queries.

Some systems aim to merge and clean scientific data for use in local repositories [4, 70, 74, 141]. Some systems attempt to dynamically merge data into single local repositories according to users' instructions [72]. Some projects have managed to provide localised strategies using high-performance computing to handle the resulting large sets of scientific data [28, 60, 106]. However, it is impractical to expect every dataset to be copied into a single local database for complex queries by users without access to these high-performance computing facilities. These facilities incur an associated management overhead that is not practical for many researchers, particularly if the datasets are regularly updated.

The BIRN project, although able to query across distributed datasets, relies on a complete ontology mapping to distribute and perform queries [17]. It relies on databases using the relational model, with a single ontology type field to denote the type of each record. This is workable only because BIRN is focused on a single domain, biomedicine, where the ontologies map cleanly between datasets; it is not practical for large cross-discipline datasets. Arbitrary queries for items, where one does not know what properties will be available, are hard in a relational model due to the lack of recognition of links between datasets. In a similar way, [75] and [145] are useful systems, but they lack the ability to provide users with arbitrary relations, and therefore users must still understand the way the data is represented to distinguish links from other attributes. These models do not recognise differences in syntactic data quality, except for cases where mappings can be derived using the structure of the record, and no allowance is provided for semantic data quality normalisation methods.

The majority of systems that distribute SPARQL queries across a number of endpoints convert single SPARQL queries into multiple sub-queries, before joining and filtering the results to match the original query. These systems, broadly known as Federated SPARQL, generally require that users configure the system with specific knowledge of the properties and types used by each dataset [143]. In many cases they also assume that the URI for a particular item will be the same in all datasets to enable transparent, OWL (Web Ontology Language) inference-based, joined results [1, 85, 111], in a similar way to BIRN, although using the RDF model.

The SRS system [46] provides an integrated set of biological datasets, with a custom query language and internal addressing scheme. Although the internal identifiers are unambiguous in the context of the SRS system, they do not have a clear meaning when used in other contexts.


In comparison to the many formats offered by SRS, and the native document formats of particular scientific datasets, the use of RDF for both documents and query results provides a single method, normalised namespace-based URIs, to reference items from any of the involved datasets. SRS provides an internal query language that makes use of the localised database, giving it performance advantages over the similar distributed RDF datasets provided by Bio2RDF. The approximate number of RDF statements required to represent each of the 14 largest databases in the Bio2RDF project is shown in Table 2.1, illustrating the scale of the information currently provided in distributed RDF datasets.

In comparison to the variety of RDF datasets available, SRS requires that there be only a single internal identifier for each record, enabling efficient indexing and queries. Although SRS provides versions of each record in a range of commonly used formats, the central data structures are proprietary rather than open and standardised, so it is difficult to provide interoperability between sites. This reduces the ability of other scientists to replicate results if they do not have access to an institution with an SRS instance containing the same datasets.

The OpenFlyData model aims to abstract over multiple RDF databases by using them as sources for a large local repository, before directing queries at that virtual dataset. However, it does not aim to systematically normalise the information or direct queries at particular component datasets [150], and it does not contain a method for scientists to express the amount of trust that they personally have in each dataset.

Database                Approximate RDF statements
PDB                     14,000,000,000
Genbank                  5,000,000,000
Refseq                   2,600,000,000
Pubmed                   1,000,000,000
Uniprot Uniref             800,000,000
Uniprot Uniparc            710,000,000
Uniprot Uniprot            220,000,000
IProClass                  182,000,000
NCBI Entrez Geneid         156,000,000
Kegg Pathway                52,000,000
Biocyc                      34,000,000
Gene Ontology (GO)           7,400,000
Chebi                        5,000,000
NCBI Homologene              4,500,000

Table 2.1: Bio2RDF dataset sizes

2.8 Federated queries

Many systems attempt to automatically reduce queries to their components and distribute the partial queries across multiple locations.


These systems presume that a number of conditions will be satisfied by all of the accessible data providers, including data quality, semantic integrity, and complete results for queries. In science it is necessary to rely on datasets in multiple locations instead of loading all datasets locally. This is due to the size of the datasets, and the continual updating that occurs, including changes to links between datasets based on changes in knowledge.

In terms of this research into arbitrary multiple scientific dataset querying, the most common federated query systems are based on SPARQL queries that are split across a number of different RDF databases. Most systems also allow the translation of SPARQL queries to SQL queries [29, 40, 45]. Others are focused on a small number of cooperating organisations, such as BIRN [17] and caGrid [120], or on a single topic, such as DAS [110]. These systems require estimates of the number of results that will be returned by any particular endpoint to optimise the way the results are joined. These specialised systems also require a single authoritative schema for all users of the system, although there may be mappings between this schema and each dataset.

Some federated SPARQL systems also require query designers to insert the URLs of each of the relevant SPARQL endpoints into their queries by redefining the meaning of a SPARQL keyword, making complex and non-RDF datasets inaccessible, and introducing a direct dependency on the endpoint which reduces the ability of scientists to transport the query according to their context [147]. Federated SPARQL systems that focus on RDF query performance improvements may not require users to specify which predicates they are accessing [63].
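
For illustration, a query of the kind being criticised is sketched below as a Python string: the endpoint URL is written directly into the query text (here using the SERVICE keyword from SPARQL 1.1 federation, one mechanism of this kind), so the query cannot be moved to another context without editing the query itself. The vocabulary is a placeholder; the endpoint is the DrugBank endpoint used in the examples in Chapter 3.

```python
# A federated query with a hard-coded endpoint URL. Replicating it in
# a different context (e.g. against a local mirror) requires editing
# the query text itself, which is the dependency criticised above.
FEDERATED_QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?drug ?label
WHERE {
  SERVICE <http://www4.wiwiss.fu-berlin.de/drugbank/sparql> {
    ?drug ex:label ?label .
    FILTER(REGEX(?label, "Marplan"))
  }
}
"""
print(FEDERATED_QUERY)
```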

There are a number of patterns that are implemented by federated query systems, including Peer-to-Peer (P2P) data distribution, statistics-based query distribution, and the use of search engines to derive knowledge about the location of relevant data.

The first type of federated query relies on an even distribution of facts across the locations to efficiently process queries in parallel. An even distribution of facts is generally achieved by registering the peers with a single authority and distributing statements from a central location, although there may be alternatives that self-balance the system without the use of a single authority. In either case, the locations all need to accept that they are part of the single system, and they must all implement the same model to effectively distribute the information without knowledge about the nature of the information. This method is generally implemented in situations where queries need to be parallelised to be efficient, while the data quality is known, or not important to the results of the query. This method is not suitable for a large group of autonomous distributed datasets, as updates to any of the datasets will require that virtually every peer is updated.

The second type of federated query is designed to be used across autonomous, multi-location datasets. It requires knowledge about the nature of the facts in each dataset, and uses this knowledge to provide a mapping between the overall query and the dataset locations. The majority of federated SPARQL systems are designed around the basic concept of a high-level SPARQL query being mapped to one or more queries, including SPARQL, SQL, Web Services, and others.


As the system requires detailed statistics to be efficient, most research focuses on this area, with the data quality assumed to be very high and, in most cases, the semantic quality very good, resulting in simple schema mappings between different datasets based solely on the information in the overall query. These methods have difficulties with queries that include unknown predicates, as this aspect is used to make overall decisions about which locations will be relevant. The VoiD specification includes a basic description of prefixes that can be used to identify namespaces, but if the URI structure is complex, the unnormalised prefix alone is not useful in deciding which locations will be relevant [3, 40]. The VoiD specification assumes that all references to a dataset will use equivalent URIs, so any datasets that use their own URIs, perhaps to enable their users to resolve a slightly different representation of the document locally, will not work with the VoiD specification.
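
A sketch of the prefix-based endpoint selection that VoiD enables, and of where it breaks down. The dataset descriptions are reduced to plain dictionaries (standing in for void:uriSpace declarations), and the endpoint URLs are hypothetical:

```python
# Hypothetical VoiD-style descriptions: each dataset declares a URI
# prefix (as void:uriSpace does) and a SPARQL endpoint.
DATASETS = [
    {"uri_space": "http://bio2rdf.org/geneid:",
     "endpoint": "http://example.org/bio2rdf/sparql"},
    {"uri_space": "http://purl.org/commons/record/ncbi_gene/",
     "endpoint": "http://example.org/neurocommons/sparql"},
]

def relevant_endpoints(uri: str) -> list:
    """Select endpoints whose declared prefix matches the URI. This
    fails as soon as a provider uses its own URIs for the same
    records, which is the limitation noted above."""
    return [d["endpoint"] for d in DATASETS
            if uri.startswith(d["uri_space"])]

# Only the first dataset matches, even though both describe the record.
print(relevant_endpoints("http://bio2rdf.org/geneid:4129"))
```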

The third type of federated query is designed to be used against a central knowledge store. This store contains directions about where facts about particular resources can be found. The most common type of central knowledge store is a search engine. This strategy dramatically reduces the amount of information that a client requires, while reducing the time required to find resources, assuming that the search engine has complete coverage of the relevant datasets. In many cases, items can be directly resolved using their identifier, especially if the authority is recognised, and the URI can be used to discover queryable sources of information about the item. However, a search engine may provide more information than a custom mashup that is created and maintained by users, at the expense of the maintenance and running costs surrounding a very large database.

In contrast to federated query systems, a pure Linked Data approach is extremely inefficient in the long term, as it fails to appreciate the scale of the resources that are in use. It inevitably requires users to have alternative access to large datasets, for example a SPARQL endpoint, for any useful, efficient queries. Some systems may provide a mix of federated query strategies, including a search engine together with a basic knowledge of which datasets are located in each endpoint. These methods are shown in Figure 2.5. For example, the Linked Data methods used to resolve information about a given URI are shown in Figure 2.6, including the possibility of an arbitrary crawl depth based on the discovered URIs.

These federated systems, along with Linked Data and SPARQL queries, are very brittle. If one of the SPARQL endpoints or Linked Data URIs is unavailable, the system may break down. Scientific results need to be replicable by peers as easily as possible. In order to overcome this brittleness, an ideal system needs to provide a context sensitive query model that scientists can personally trust by virtue of their knowledge of the involved datasets and the way different representations of a dataset are related. An ideal scientific data access system cannot rely specifically on the use of specific properties or URIs. By decoupling the scientist's query from the data locations, an ideal system is free to negotiate with alternative endpoints that may provide equivalent information using alternative URIs or properties.


[Diagram: a Linked Data resolver obtains RDF (as HTML+RDFa, N3, or XML) either through a SPARQL query against a SPARQL endpoint and graph URI, by resolving the original HTTP URI, through a search engine, or through a service description listing HTTP URIs with known links.]

Figure 2.5: Linked Data access methods

It can negotiate based on the semantic meaning of the query, recognising multiple Linked Data URIs as equivalent, and different sets of properties as equivalent in the current context, without requiring the properties or URIs to always be equivalent.

[Diagram: to retrieve information about http://ns1.org/resource/id2, a Linked Data resolver can issue the SPARQL query DESCRIBE <http://ns1.org/resource/id2> against a mirror endpoint (http://ns1mirror.org/sparql), perform an HTTP GET on the original URI, perform an HTTP GET on an equivalent URI from another authority (http://ns2.org/name/id-2), consult a pre-crawled search engine or a service description with the URI prefix http://ns1.org/resource/, and crawl all discovered URIs to an arbitrary depth.]

Figure 2.6: Using Linked Data to retrieve information


Chapter 3

Model

3.1 Overview

Scientists face a number of difficulties when they attempt to access data from multiple linked scientific datasets, including data cleaning, data trust, and provenance-related issues, as described in Chapter 1. A model was designed to make it easy for scientists to access data using a set of queries across a set of relevant data providers, with design features to support data cleaning operations and trusted, replicable queries. The model addresses the trust, quality, context sensitivity, and replication issues discussed in Section 1.1.

The main design goal for the model is to provide for replicability using a mapping layer between a scientist's query and the actual queries that are executed. The mapping takes two parallel forms: one maps the scientist's query to a concrete query that can then be directly executed, and the other maps the notion of datasets to data locations. The scientist's query can be mapped to different languages and interfaces as necessary, while the textual strings ensure that more than one namespace can be inferred from any query, enabling future scientists to merge different uses of the model without affecting backwards compatibility.

The mapping layers ensure that no one system or organisation can be a sole point of failure for replicating a query, as long as data is freely and openly accessible. In addition, the queries are designed to be replicated based solely on textual provenance documents, in contrast to similar mapping systems that require coded mapping programs as common components in the mapping workflow. Coded, and particularly compiled, mapping programs would limit the ability to implement the model in future using different technologies.

Scientists can choose to modify the system to suit their current local context, without affecting the way other scientists perform the same query using public data providers. This makes it possible for scientists to have contextually applicable data access to the relevant linked scientific datasets across their research. Scientists do not have to publicise their lack of trust, or their opinion of the data quality in different datasets, to extend the system, as there is no global point of entry for the system. When a scientist is ready to publish their results, other scientists should be able to examine their queries, and the provenance of their results, as textual documents, which should lower the technological barrier to reuse compared to implementation-specific systems and global registries.



Datasets may not be trusted if they contain out-of-date information; they may not cleanly link to other datasets; or they may cleanly link to other datasets, and be up to date, but still not be the best choice for a scientist in their context. The model is aimed at providing scientists with more direct control over which data providers are accessed when queries are replicated, as current Linked Data and distributed SPARQL query models do not fully support this notion, as shown in Figure 2.6. The model attempts to provide a way for scientists to describe the nature of the distributed linked dataset queries that are performed in response to a set of their queries, with an emphasis on collaboration with other scientists and replication by scientists with access to different facilities containing similar data.

The model enables context sensitive, distributed linked data access by mapping a single user query to many different locations, denormalising the query based on the location, and normalising the results to form a single set of homogeneous results. It maps user queries to query types based on query matching expressions. This mapping generates a set of parameters for each query type, including a set of namespaces that are unique to the query type. For each of these parameter sets, a set of providers which implement the query type is chosen, with inclusion or exclusion based on the namespaces identified in the query parameters. The query parameters are then processed using normalisation rules assigned to the provider, in two stages: on the query parameters themselves, and on the query after the parameters have been inserted into the query template. The resulting query may optionally be compiled and transformed using abstract syntax transformation rules.

The query is then executed based on the type of the provider. For example, a SPARQL query could be submitted using an SQL interface or an HTTP interface, or a query could be a simple HTTP GET on the endpoint URL. The results are normalised for each query on each provider in two additional stages, before and after parsing to RDF triples. The RDF triples from all of the queries on all of the chosen providers are then aggregated into a single pool of RDF statements, with two further normalisation stages, before and after serialising the results to an RDF representation.
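
The stages just described can be summarised in a compact skeleton. Everything below is an illustrative reading of the model's data flow, not the prototype's code: the class names, the regular-expression query matching, the pass-through normalisation rules, and the stand-in execution step are all invented.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryType:
    name: str
    matcher: re.Pattern              # query matching expression
    template: str                    # parameterised query template

@dataclass
class Provider:
    endpoint: str
    query_types: set
    namespaces: set
    normalise_query: Callable = lambda q: q      # pre-execution stage
    normalise_results: Callable = lambda r: r    # post-execution stage

def answer(user_query: str, query_types, providers):
    """Skeleton of the pipeline: match the user query to query types,
    choose providers by query type and namespace, denormalise the
    query per provider, execute, and normalise the results."""
    pool = []
    for qt in query_types:
        match = qt.matcher.match(user_query)
        if match is None:
            continue
        params = match.groupdict()               # includes the namespace
        for p in providers:
            if qt.name not in p.query_types:
                continue
            if params.get("ns") not in p.namespaces:
                continue
            query = p.normalise_query(qt.template.format(**params))
            raw = f"results of [{query}] from {p.endpoint}"  # stand-in
            pool.append(p.normalise_results(raw))
    return pool                                   # aggregated results

search = QueryType(
    "searchByNamespace",
    re.compile(r"search:(?P<ns>\w+):(?P<term>\w+)"),
    'CONSTRUCT {{ ?s ?p ?o }} WHERE {{ ?s ?p ?o . '
    'FILTER(REGEX(?o, "{term}")) }}')
drugbank = Provider("http://www4.wiwiss.fu-berlin.de/drugbank/sparql",
                    {"searchByNamespace"}, {"drugbank_drugs"})
print(answer("search:drugbank_drugs:Marplan", [search], [drugbank]))
```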

A comparison of the basic components of the model to typical federated RDF-based query models is shown in Figure 3.1. The typical query models, described in Section 2.8, all fail to see data quality as an inherent issue, preferring to focus on semantic or RDF statement-level normalisation without allowing for any syntactic normalisation. The query model described here allows for a wide range of normalisation methods, along with the ability to transparently integrate data from providers which have not agreed on a particular URI model, whereas the typical query models all require that URIs be identical, or that all providers have access to, and have therefore agreed to, mappings between the different URIs.

In the example from Section 1.1.2, scientists need to access data from a variety of datasets, including genomic, drug, and chemical data.


Figure 3.1: Comparison of model to federated RDF query models


However, they are unable to easily distinguish between trusted data and untrusted data, as some datasets may be useful for some queries but not others, and the datasets are very large, including numerous links between datasets. They are unable to systematically alert other scientists to irregularities in the data quality, as the current data access and publishing methods do not provide ways for scientists to provide corrections or modifications as part of the data access model.

The model is designed around the process of scientists asking questions that can be answered using data from linked datasets. The question contains parameters, including the names of datasets and identifiers for relevant data items, along with other information that defines the nature of the query, as shown conceptually in Figure 3.2. For example, if the scientist in the example wanted to access information about a drug named Marplan in DrugBank, they could access the data using the steps shown in Figure 3.3. To distinguish between different parts of a dataset, each dataset in the model is given one or more namespaces based on the way the dataset has assigned identifiers to data items internally. The parameters, which in the example include the DrugBank Drugs namespace and the search term, “Marplan”, are matched against the known types of queries and data providers to determine which locations and services can provide answers to the question. These parameters are applied to templates for the query types and providers. The data providers may not all be derived from the DrugBank dataset, as shown by the inclusion of an equivalent search on the DailyMed dataset in addition to the DrugBank query.

Data providers are defined in terms of their query interfaces, the types of queries they are useful for, the namespaces they contain information about, and the data normalisation rules that are necessary. This information is used to match each provider to a user's query, before constructing queries for each provider based on the parameters, namespaces, and data normalisation rules. Scientists can configure different providers to choose the relevant data cleaning rules in the context of both queries and providers, an improvement on previous models that require global or location-specific data cleaning rules.

The model includes profiles that allow scientists to state their contextual preferences about which providers, query types, and rules are most useful to them. The selected providers, query types, and normalisation rules will be used to answer questions, even if previous scientists did not use any of the same methods to answer the question in the past. A profile can explicitly include or exclude any provider, query type, or normalisation rule, as well as specifying what behaviour to apply to elements that do not exactly match the profile. To allow more than one profile to be used consistently, profiles are ordered, so scientists can override specific parts of any other profile without having to recreate or edit the existing profile. Profiles are designed to allow scientists to customise the datasets and queries that they want to support without having to change provenance records that were created by previous scientists. This feature is unavailable in other models that allow scientists to answer questions using multiple distributed linked datasets.
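
A sketch of how ordered profiles might resolve inclusion decisions; the names and the exact include/exclude semantics here are illustrative simplifications of the model's profile mechanism:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """One layer of contextual preferences. Profiles are consulted in
    order, so earlier (more local) profiles override later ones."""
    include: set
    exclude: set
    default_include: bool = True   # behaviour for unmatched elements

def is_included(element: str, profiles: list) -> bool:
    for profile in profiles:
        if element in profile.exclude:
            return False
        if element in profile.include:
            return True
    # No profile mentioned the element: fall back to the last
    # profile's default behaviour, or include it if there are none.
    return profiles[-1].default_include if profiles else True

# A local profile overrides a published provenance record's choices
# without editing the published profile itself.
local = Profile(include={"sider_curated"}, exclude={"sider_public"})
published = Profile(include={"sider_public", "drugbank"}, exclude=set())
print(is_included("sider_public", [local, published]))    # False
print(is_included("sider_curated", [local, published]))   # True
print(is_included("drugbank", [local, published]))        # True
```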


[Diagram: a search for a drug in DrugBank. Data access parameters (query type = Search by Namespace, namespace = DrugBank Drugs, search term = Marplan) are matched to data providers, in this case two DrugBank data sources with search interfaces; the actual queries are executed and the results are integrated into a single set of resolved data.]

Figure 3.2: Query parameters

[Diagram: the same data access parameters (query type = Search by Namespace, namespace = DrugBank Drugs, search term = Marplan) are mapped to two actual queries: a SPARQL regex search on the DrugBank endpoint (http://www4.wiwiss.fu-berlin.de/drugbank/sparql) for any uses of the word Marplan, and a SPARQL regex search on the DailyMed endpoint (http://www4.wiwiss.fu-berlin.de/dailymed/sparql) for any uses of the word Marplan that have references to DrugBank; the results are then integrated.]

Figure 3.3: Example: Search for Marplan in DrugBank


Scientists can take advantage of the model to access a large range of linked datasets. They can then provide other scientists with computer-understandable details describing how they answered questions, making it simple to replicate the answers. This goal, generally known as process provenance, is supported by the model through the use of a minimal number of components, with direct links between items, such as the links between providers and query types. The model can be used to expose the provenance of any scientist's query consistently, because each of the components is declared using rules or relationships in the provenance record. This enables other scientists to feed the provenance record into their implementation of the model to replicate the query, either as it was originally executed, or with substitutions of providers, queries, and/or normalisation rules using profiles.

Complete provenance tracking requires a combination of the state of the datasets, the scientific question that was being answered, any normalisation or data cleaning processes that were applied to the data, and the places that the data was sourced from. The model is able to produce query provenance details containing the necessary features to replicate the data access, including substitutions and additions using profiles. The creation of annotations relating to the state of a dataset is difficult in general, as scientific datasets do not always contain identifiable revisions; however, if scientists have access to these annotations they can include them using query types and providers that complement their queries.

The model requires a single data format to integrate the results of queries from different information providers without ambiguity. This makes it possible to use and trust multiple data providers without requiring them to have the same, or even a compatible, data schema. Other distributed query models require a global schema, making it difficult to substitute data providers. The single data format, without a reliance on a single schema, enables scientists to easily integrate data from different locations, gradually modifying data quality rules as necessary, without initially having the perfect data quality and global schema that are assumed by the theory proposed in Lambrix and Jakoniene [83]. A single extensible data format enables scientists to include provenance annotations with the data, where most other provenance methods need to be separated into different data files [17, 35, 60, 74].

3.2 Query types

The model directs query parameters to the relevant data providers and normalisation rules using query types. For example, a scientist may initially need to find all references to a disease, as shown in Figure 3.4, while later only needing to find references from datasets that they trust, as shown in Figure 3.7, or that they are interested in, as shown in Figure 3.5 and Figure 3.6. In each case the scientist needs to describe the way each query would be structured for each dataset. If there are different methods of data access, different structures, or different datasets, they need to be encapsulated in separate query types to determine which providers, namespaces, and normalisation rules are relevant to each query type.



Each query type represents a query, independent of the actual providers that it may be applied to, even if it is designed to be specific to a particular provider. However, each query type is able to recognise the namespaces that it is targeted at without reference to providers. If a namespace is recognised, it may be used to restrict the set of providers that make the relevant query available in the context of that namespace. This structure makes it possible to extend or replace queries that other scientists have created for all namespaces, without changing the syntax. For example, a scientist may want to add annotations to the data description for a namespace. They do not have to modify any of the other query types that make it possible to get the base information in order to add their annotations, and they do not want their annotation query being used as a datasource for any other namespaces. In the model, they simply create a new query type that recognises the query parameters that are used for the base information query types, and restrict it to their namespace, as long as the namespace prefix is identifiable in the parameter set.

When other scientists wish to replicate the query in a different context, e.g., in a different laboratory or using a different database, they can create semantically equivalent queries that use another data access interface. They can reuse the generic parameters as part of their new query type, minimising the number of external changes that need to be made for an application to support the new context. The model allows scientists to easily substitute query types, without having to modify the original query types, using profiles. The profiles dynamically include and exclude query types without changes to provenance records, other than additions where necessary.

In the example shown in Figure 3.3, there was one set of parameters defined for the two query types, with the profile defining which of the query types was used in each situation. It was necessary to create two query types so that the scientist had a reference to use when defining which strategy they preferred. A use case for this would be to enable a scientist to transparently substitute their limited query into a public provenance record which originally used any and all possible datasets as sources for the query. They would need to do this to be sure that they could validate the results using datasets that they had personally curated.

In comparison with typical Linked Data and Federated SPARQL query models, as shown in Figure 3.1, the model introduces a new layer between an overall query and the data providers. This layer makes it possible for scientists to perform query-dependent normalisation, trust specific endpoints in relation to queries without having to specify a particular endpoint in their query, and reliably recreate the query using information in the provenance record. The contextual normalisation and trust would not be possible if the configured information were generically applied to all queries, as the replicated query would be linked back to datasets and endpoints as the basic model elements, instead of query types, which can contextually define namespaces and parameter relevance. These namespaces are more abstract than datasets in models such as VoiD, which include references to URI structures and endpoints as part of the core dataset description.


[Diagram: data access parameters (query type = Find references, data item namespace = diseasome_diseases, data item identifier = 1689) are used to search for any references to the disease across the DrugBank, DailyMed, Sider, and MediCare datasets, and the results are integrated into resolved data.]

Figure 3.4: Search for all references to disease


[Diagram: with the added parameter search namespace = sider_drugs, only the Sider endpoint (http://www4.wiwiss.fu-berlin.de/sider/sparql) receives the SPARQL reference search for the disease URI http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1689; the DrugBank, DailyMed, and MediCare endpoints are excluded because they do not contain the namespace sider_drugs.]

Figure 3.5: Search for references to disease in a particular namespace



[Diagram: the new query type “Find references (Only Sider Drugs)” keeps the original parameter set (data item namespace = diseasome_diseases, data item identifier = 1689), but only the Sider endpoint receives the SPARQL reference search; the DrugBank, DailyMed, and MediCare endpoints are not configured to handle the new query type.]

Figure 3.6: Search for references to disease using a new query type

Each query type defines a relevant set of parameters for each type of data access interface that may be used on a provider implementing the query type. If a provider with a new type of data access interface is included, current query type definitions can be updated to include a new parameterised template. These changes allow scientists to migrate between datasets gradually, perhaps while they test the trustworthiness or reliability of a new dataset or interface, without requiring them to change the way they use the model to access their data. This is important, as it provides a configuration-driven method of migration, where other methods require scientists to change the way they access the data interface to match a new dataset.
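
A sketch of per-interface parameterised templates: the same query type carries one template per data access interface, so adding a provider with a new interface is a configuration change rather than an application change. The template syntax and parameter names are invented; the prototype's own template format differs.

```python
# Hypothetical templates for one query type across two data access
# interfaces. The parameters come from the matched user query.
TEMPLATES = {
    "sparql": ('CONSTRUCT {{ ?s ?p ?o }} WHERE {{ ?s ?p ?o . '
               'FILTER(REGEX(?o, "{search_term}")) }}'),
    "http_get": "http://{host}/search/{namespace}/{search_term}",
}

params = {"namespace": "drugbank_drugs",
          "search_term": "Marplan",
          "host": "example.org"}

# Supporting a new interface means adding one template entry; the
# user's query and its parameters stay exactly the same.
for interface, template in TEMPLATES.items():
    print(interface, "->", template.format(**params))
```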

Query types and data providers can be assigned namespaces. Each of the assigned namespaces is recognised by the model based on parameters that the scientist uses in their query. In some cases the assigned namespace does not need to match the namespace in the query in order for the query type or data provider to be relevant. In the example shown in Figure 3.4, the scientist indicated that they wanted references to an item, including references from other datasets. This meant that although it was important to know the namespace and identifier for the item, any query type or provider could be relevant.

Similarly, if the scientist wanted to restrict their search for references to a specific namespace, they could add another parameter and specify that the parameter was a namespace that the model needed to use when planning which query types and data providers would be relevant, as shown in Figure 3.5.


If they wanted to avoid changing the query parameters, to ensure that the query remained compatible with their current processing tools, they could create a new query type with the same parameter set and only apply it to datasets containing the namespace, as shown in Figure 3.6.

Although a new query type makes the applications that use the model easier to maintain, the model itself is easier to maintain if namespaces are given as parameters, as the query may then be reusable in the context of other namespaces. It is important to note that the actual queries executed on the data providers did not need to change in the example shown in Figure 3.6; however, the different requirements made it necessary to use the same query in different query types.

Figure 3.7: Search for all references in a local curated dataset. (Diagram: a profile restricts the public DrugBank, Dailymed, Sider, and MediCare endpoints and instead includes a locally curated Sider provider at http://myuniversity.edu/sider_curated/sparql.)

3.2.1 Query groups

In some cases there are different ways to compose queries that answer the same question, depending on the dataset that is used to answer it. In some circumstances, only one of these queries needs to be performed for a scientist to get sufficient information to answer their question. However, together with normalisation rules, two queries may be able to generate the same results from two separate providers. In these cases a scientist may wish to group queries together to make their process more efficient, while still having redundancy in case a provider is unresponsive. Query types that are grouped together should have the same parameters, so that scientists do not have to know they are using different queries, unless they check the provenance.

To reduce the number of provider queries that are unnecessarily performed, the model includes an optional item that can be used to group query types into formal groups according to a scientist's knowledge that the query types are equivalent. The groups ensure that the same query is not performed on an equivalent provider more than once because it is known to produce a redundant set of results. The groups also ensure that the same query may only need to be executed on one of many providers, or provider groups.
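A sketch of this selection step is given below, assuming that equivalent query types are simply listed together and that one representative is chosen per group; the names and the random strategy are illustrative only.

import random

def select_query_types(query_groups, ungrouped, strategy=random.choice):
    # Pick a single representative from each group of equivalent query
    # types; query types outside any group are always included.
    selected = [strategy(list(group)) for group in query_groups if group]
    return selected + list(ungrouped)

groups = [["Bio2RDF URI", "Blue Obelisk DOI", "Dublin Core Identifier"]]
print(select_query_types(groups, ungrouped=["Find references"]))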

A query group must be treated as a transparent layer that only exists to exhibit the semantic equivalence of the query types it contains. A user can optimise the behaviour of the model by choosing all, or any, query types out of each group based on the local context. As shown in Figure 3.8, the grouped query types are semantically, but not structurally, equivalent: they answer the same question, but the query parameters need to be translated in four different ways to match the data provider interfaces. The example namespace, Digital Object Identifier (DOI), is inherently difficult to query because no single organisation maintains a complete copy of the dataset. There are currently only a small number of DOI registration agencies, and useful information about DOIs is available in many different domain specific datasets, so a variety of different types of queries is needed.

Figure 3.8: Query groups. (Diagram: the query group "Construct document from DOI" contains the Bio2RDF URI, Blue Obelisk DOI, and Dublin Core Identifier query types, each with a templated CONSTRUCT query for its SPARQL endpoint at uniprot.bio2rdf.org, cpath.bio2rdf.org, and pele.farmbio.uu.se/nmrshiftdb; the BioGUID REST interface, accessed with HTTP GET, needs no template.)

3.3 Providers

Scientists gain access to datasets using many different methods; for example, they may use HTTP web based interfaces, Perl scripts which access SQL databases, or SOAP XML based web services. The model accommodates different access methods using data providers that are dataset endpoints which support one or more query types. These providers may be implemented in different ways, but each provider is linked to query types, namespaces and normalisation rules. These links give the model information about how a query is going to be formed, which datasets are relevant, and what data quality issues exist. Each provider represents a single method of access to either the whole, or part of, a dataset, and different providers can utilise the same access method and location for a dataset depending on the scientist's context.

Each provider is associated with a set of data endpoints that can all be accessed in the same way. Each of the endpoints must handle all of the query types and namespaces that are given for the provider. In the example shown in Figure 3.9, there are three providers representing four endpoints, three actual queries, and two query types. Each of the endpoints connected to each provider can be used interchangeably, reducing the query load on redundant endpoints. The two endpoints connected to the Blue Obelisk DOI query type are not equivalent, as each provider can only have one templated SPARQL Graph URI, so they must be placed in a provider group before they can be considered to be equivalent.
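The following sketch shows the information attached to a provider, using the Bio2RDF mirrored CPath provider from Figure 3.9; the dataclass representation is illustrative, as the prototype stores this configuration as RDF.

from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    endpoints: list          # interchangeable URLs, same interface
    query_types: list        # query types every endpoint must handle
    namespaces: list         # namespaces every endpoint must handle
    graph_uri: str = ""      # at most one templated SPARQL Graph URI
    normalisation_rules: list = field(default_factory=list)

cpath = Provider(
    name="Bio2RDF Mirrored CPath DOI",
    endpoints=["http://quebec.cpath.bio2rdf.org/sparql",
               "http://cu.cpath.bio2rdf.org/sparql"],
    query_types=["Bio2RDF URI"],
    namespaces=["doi"],
    graph_uri="<http://bio2rdf.org/cpath>")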

Figure 3.9: Providers. (Diagram: the query group "Construct document from DOI" is resolved by three providers, Localhost Blue Obelisk DOI, FarmBio Nmrshiftdb Blue Obelisk DOI, and Bio2RDF Mirrored CPath DOI, which cover four SPARQL endpoints; the templated queries use sparqlGraphUriStart and sparqlGraphUriEnd markers that expand to GRAPH clauses on endpoints with a configured SPARQL Graph URI.)

3.3.1 Provider groups

Providers are thin encapsulation layers over endpoints. However, on their own they do not provide the elements necessary to encapsulate data sources that do not have identical interfaces. Although a simple solution is to implement two providers, this is not efficient, as both will be used in parallel for all of the relevant queries. A more useful solution is to group semantically equivalent providers, which lets the implementation know that the providers do not all have to be implemented in the same way to be equivalent in terms of the RDF triples that they will return in response to the same user queries.

In the example shown in Figure 3.10, two providers have equivalent SPARQL Graph URIs, so they do not need to be grouped, as the interface URL can be duplicated on a single provider. Two other providers match the same query and can be used equivalently, but have different SPARQL Graph URIs, so a provider group is necessary for the model to recognise that the two providers are equivalent.

Figure 3.10: Provider groups. (Diagram: the two CPath endpoints share the SPARQL Graph URI <http://bio2rdf.org/cpath>, so grouping is not applicable and both can be attached to a single provider; the FarmBio and localhost NMRShiftDB providers use different SPARQL Graph URIs, so they are grouped in the provider group "Blue Obelisk DOI NMRShiftDB".)

Provider groups do not functionally change the nature of a query unless users decide not to use all redundant, equivalent, providers as sources for a query. In this case, the provider groups would encapsulate a set of heterogeneous providers which are equivalent, and could be considered substitutes for any of the query types on all of the providers, based on the query groups that the query types are members of. Any of the providers could then be chosen at random, or using another strategy, to answer a given query.
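A sketch of this substitution behaviour follows, assuming an injected execute function and a provider group represented as a plain list; both names are illustrative.

import random

def resolve(query, provider_group, execute, use_all=False):
    # Equivalent providers are substitutes: query all of them, or a
    # single provider chosen at random (or by another strategy).
    providers = provider_group if use_all else [random.choice(provider_group)]
    results = []
    for provider in providers:
        results.extend(execute(query, provider))
    return results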

3.3.2 Endpoint implementation independence

The system does not require SPARQL endpoint access, as other resolution methods could be designed and implemented as part of the current provider model; however, the result of each query needs to be in RDF to be included with the results of other queries. In particular, if RDF wrappers are written for the Web Services that currently form the bulk of the bioinformatics processing services, they can be included as sources of information for queries that they are relevant to.

In some cases, there is a need to provide simple static mappings between data items and other known URIs. The number and scale of these mappings may make it more efficient to dynamically generate the mapping on request. In these cases, the mapping tool could be located in the model if the information content required for the mapping is provided in the query, or it can be created with a lightweight widget that takes the initial data item as input and provides the relevant RDF snippet as output. In Figure 3.8, there are four different providers: one is a direct HTTP URL that can be accessed using HTTP GET, while the others must be accessed using HTTP POST. The information available from each provider may be equivalent, but the endpoints must all be accessed differently in response to the same query.

3.4 Normalisation rules

Normalisation rules are transformations that allow scientists to define what changes are necessary to the results of queries from particular providers to form a clean set of results. Although the ability to assign transformations to the results of queries is not novel, in the context of this work it enables scientists to utilise simple alternatives to coding or arbitrary workflow technology in most cases, although the transformations themselves may be complex depending on the situation. Normalisation rules allow different scientists to express their opinions about data by changing the way it is represented to fit their context. The changes may relate simply to data quality, but they may also relate to trust, as untrusted elements can be removed or the weight of an assertion can be changed using the rules. The level of trust that scientists put into the results of queries could be increased by publishing these modifications along with the data, so as not to portray the information as being directly sourced from the original provider.

As well as allowing scientists to define and share their ideas about what changes need to be made to data sourced from particular providers, normalisation rules can be formulated to apply to more than one provider as applicable, making them available as reusable tools that may apply to many entities. This may reduce the complexity of dealing with a large number of datasets, if at least some of them are compatible, even if they do not match the normalised form that this model expects.

The data cleaning process is more difficult for science datasets than for business datasets. The most well known data cleaning methods involve business scenarios, where the conventions for what clean data looks like are well recognised. For example, it is easy to identify whether a customer record is complete if it contains a valid address and phone number, and the email address fits a simple pattern. In comparison, in science it is difficult to identify whether a gene record is complete, as it may be valid but not be linked directly to a known gene regulation network, or it may be valid but not have a protein sequence, as it is a newly discovered type of gene that is not used to generate proteins. The scientific data quality conditions rely on a meaningful, semantic understanding of the data, whereas data is assumed to be clean in business scenarios if it is syntactically correct. The normalisation rules used with the model need to support both syntactic and semantic normalisation, within the bounds of the amount of data that is accessible to the model.

In the example shown in Figure 3.11, there are three different DOI schemas: the BIBO, Blue Obelisk and PRISM vocabularies. The datasets in the example contain three different types of URIs (Bio2RDF, BioGUID and NMRShiftDB), so both semantic and syntactic normalisation rules are needed to integrate the results into a single homogeneous document. In the case of one property, the Dublin Core identifier, there are four alternatives that have not been normalised, to demonstrate the range of data that is available. The results of the query may include data from the four endpoints; however, in the example, there are no results from NMRShiftDB or the Bio2RDF endpoints for this DOI. The schema chosen by the scientist as the most applicable was the PRISM vocabulary, so the normalisation rules mapping the PRISM vocabulary to the BIBO and Blue Obelisk vocabularies are not shown. The preferred URI scheme was the normalised Bio2RDF URI, so the mapping rules are shown for NMRShiftDB and BioGUID to Bio2RDF. The normalisation rules were only applied to the results from BioGUID in the example, as it was the only provider that returned information, and it contained redundant information given using two different vocabularies.

Figure 3.11: Normalisation rules. (Diagram: for the query group "Construct document from DOI" with identifier 10.1080/0968768031000084163, URI normalisation rules replace http://pele.farmbio.uu.se/nmrshiftdb/?bibId= with http://bio2rdf.org/nmrshiftdb_bib: and http://bioguid.info/pmid: with http://bio2rdf.org/pubmed:, while a schema normalisation rule replaces http://purl.org/ontology/bibo/doi with http://prismstandard.org/namespaces/2.0/basic/doi.)

In this example the model is able to syntactically normalise the data, but without some further understanding of the way the DOI was meant to be represented and used, it cannot semantically verify that the data is accurate. Further rules, implemented after the data is imported into the system or after it is merged with other results, could contain the semantically valid conditions that could be used to remove or modify the results.

Normalisation rules are used at various stages of the query process, as highlighted in Table 3.1. Normalisation rules are applied at the level of data providers to provide the most contextual normalisation with the least disruption to generally used query types. For each query type on each provider, the normalisation rules are used on each of these separate streams. All normalisation rules from all streams may be applicable to the two final stages: after all results are merged into the final pool of results, and after these complete results are output to the desired file format.

Stage                                 Mode
1. Query variables                    Textual
2. Query template before parsing      Textual
3. Query template after parsing       In-Memory
4. Results before parsing             Textual
5. Results after parsing              In-Memory
6. Results after merging              In-Memory
7. Results after serialisation        Textual

Table 3.1: Normalisation rule stages

Initially, normalisation rules are used for each query type before template variables are substituted into the relevant template. After the template is ready, normalisation rules may be applicable to the entire template, both when the template is in textual form and after the template is parsed by the system to produce an in-memory representation of the query. This enables scientists to process the variables independently of the template, before processing the template to determine if it is consistent, if this process cannot be performed using the variables alone.

The normalisation rules are then used on the textual representation of the results, if this is available with the relevant communication method. The normalised text is then parsed into memory using the single data model that the system recognises, and normalisation rules are applied to the in-memory representation of the results before merging results from different streams. After the results are merged, normalisation rules are applied to the in-memory representation of the results before serialising the complete pool of results to an output format. In most cases, this output format will be the requested format, unless the request was not compatible with the single data model, in which case the pool of results must be serialised in a different way based on the in-memory representation, or a textual transformation on the output would be used.
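A sketch of how stage-tagged rules could be dispatched is given below; the stage names follow Table 3.1, while the rule representation and function name are illustrative.

STAGES = ["query_variables", "template_before_parsing",
          "template_after_parsing", "results_before_parsing",
          "results_after_parsing", "results_after_merging",
          "results_after_serialisation"]

def apply_stage(value, rules, stage):
    # Apply, in their configured order, every rule registered for the
    # given stage; rules for other stages are skipped.
    for rule in rules:
        if stage in rule["stages"]:
            value = rule["apply"](value)
    return value

pmid_rule = {"stages": {"results_before_parsing"},
             "apply": lambda text: text.replace(
                 "http://bioguid.info/pmid:", "http://bio2rdf.org/pubmed:")}
print(apply_stage("<http://bioguid.info/pmid:12930541>",
                  [pmid_rule], "results_before_parsing"))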

The same normalisation method would not be used for each stage. For example, the query variables are normalised using textual transformations, while the parsed in-memory results are normalised using algorithms based on the common data model. An implementation would need to support each of the different methods used with the normalisation rules attached to each relevant provider before a query could be successfully executed. There is no guarantee that leaving out a normalisation rule will produce a consistent or useful result, unless the scientist has directed in their profiles that the normalisation rule is not useful to them.

3.4.1 URI compatibility

Although normalisation rules are generic transformations, they are a useful tool when applied to the case of queries distributed across a set of linked datasets. Although many datasets contain references to other datasets, the format for references to the same dataset is not generally consistent across datasets. Even if a resolvable URI is given as a reference, the resolution point is contained in the reference, along with the context specific information that defines which item is being linked to. A naive approach to resolving the URI would be to use the protocol and endpoint that are given to find more information about the item. However, if the resolution point is not controlled by the user, any data quality measures using normalisation rules, or redundancy using multiple providers, may not be utilised. To solve this issue, normalised URIs can be created that scientists can easily resolve using resources that they control. If normalised URIs are given as a basic standard, they can be further utilised to change the references in documents to match the scientist's context, making it simple to continue using the model described here to maintain control over the way the data is represented and the way references are resolved. This enables scientists to work with data from many datasets, but choose where to source the information from according to their context.

Although this may tend to encourage an explosion of many different equivalent URIs, if the URIs are resolvable by other users, the new URIs may still be useful. If there is a standard format for the URIs, and a mapping between the local URIs and that set can be published as normalisation rules, scientists could recognise the link and normalise it before resolving it to useful information. This can be performed without continuously relying on one scientist for infrastructure, as long as a rough network is established to share information about providers and normalisation rules, along with any useful query types and namespace information.

A goal of this research is to make information recognisable and trustworthy, while simultaneously making it accessible in many different contexts. Along with social agreements about link and record representations, using this model to represent what data is available, and what the access methods are, makes it possible to decentralise infrastructure: even if there is still a global resolution point, its use can be mirrored in other circumstances to minimise interruptions to research. The model is able to do this without a new file format, access protocol or resolution interface.

By contrast, the LSID protocol was designed to solve data interoperability and access issues for the life sciences. However, it has not been extensively used. There are likely a range of reasons, but with respect to this research, it may not have been widely adopted due to its insistence on a completely new data access protocol, a range of resolution methods, and the lack of a single data format, although it insisted on the use of RDF for metadata.


The use of LSIDs would have enabled scientists to recognise the LSID protocol based on the URI, and to access original data format files using one of the supported interfaces. However, it did not include the ability to apply transformations to the information, as it insisted that data remain permanently static after its publication. In addition, scientists still needed to be able to process and supply data in different formats, making it difficult, or impossible, to merge data from different sources.

Federated SPARQL approaches require that the URIs used on one endpoint to represent a resource be exactly the same as the URIs used on other endpoints to represent that resource, as any other URI cannot be presumed to be equivalent without specific advice. The normalisation rules in this model are not explicitly linked to the namespace component because they are more relevant to providers than to namespaces, and the definition of whether a URI is equivalent depends on the provider rather than on the namespace definition. An example of this can be seen in Figure 3.12, where different providers have different names for the same item and the same query template can be used with all of the providers.

Figure 3.12: Single query template across homogeneous datasets. (Diagram: the query template CONSTRUCT { ?s ?p <$normalisedURI> . } WHERE { ?s ?p <$endpointSpecificURI> . } is instantiated for the DrugBank and Dailymed endpoints using the FU Berlin Diseasome URI, for the Sider and MediCare endpoints using <urn:diseasome:diseases:1689>, and for a company internal disease review endpoint, all producing references to the normalised URI <http://bio2rdf.org/diseasome_diseases:1689>.)

The use of normalisation rules enables the URIs used on each provider to be normalised before being returned to users, so scientists can recognise the URI, and the software can recognise where to perform further queries that relate to that namespace. In order not to remove the link to the previous context, normalisation rules can be used together with a static statement of equivalency back to the alternative URIs that were used on any of the endpoints, so scientists can have a list of URIs that relate to particular resources, although they may still focus on the normalised URI to make further querying with the model easier. A Federated SPARQL system could be designed using the configuration model, without references to query types or parameters, by recognising what normalisation rules and namespaces are applied to each known provider, removing the reliance on preconfigured query types for arbitrary SPARQL endpoints. Although the namespaces are designed to be recognised in the context of the query types, the model could be implemented with named standard parameters to provide efficient access to the places where namespaces are known to be located.

The model allows the definition of rules that remove particular statements from the results. These rules make it possible to reduce the amount of information as necessary without changing the interface for the provider.

3.5 Namespaces

The namespaces component is centred around identifying the ways that a dataset is structured. There are two main considerations in the decision to split a dataset into namespaces: the way that identifiers are used, and the way that properties are used.

In a traditional relational database, primary keys do not have to be given contexts, because they inherit their nature from the table they are located in. In non-relational datasets there is still a need to define unique identifiers for items; however, as the constraints are not given by the structure of the database, the identifiers must be assigned using some other method. This model does not seek to convert all datasets into a relationally compatible model, where each aspect is normalised and given a unique identifier set based on its data type, as there may be benefits to sharing an identifier set across different types of items in the same dataset. These benefits include improved accessibility stemming from a simpler method of searching the dataset without previously knowing its internal schema. Namespaces can be used to define the scope of an internal identifier, so scientists can know the context that the identifier was defined in, although the namespace may not have a clear semantic meaning if data items of different types are mixed in a single namespace.

The second consideration, relating to properties, stems from the requirement that the model support multiple datasets that do not have a single schema for all data items. Scientists cannot be assured that a given property will exist, although queries for an item may not cause errors. To specify which properties actually exist, namespaces may be used to link terminologies, such as ontologies, to datasets, specifically through particular providers of those datasets. If a provider is an interface for a particular property, then a namespace can be used to indicate that the provider is an interface for that property. This is possible even if the namespace is not being used to identify the data items using the property or the location of the property definition.

In some cases, the set of queries attached to a provider can be used to provide hints about which properties may be supported by a provider, particularly when the query type has been annotated as being relevant to a particular property via a known namespace.

The normalisation of URIs is based on the existence of identifiers that represent parts of datasets that have unique identifiers available inside them. These identifiers need to be easily translated into different forms without ambiguity. As there is not always a direct link between the normalisation rule component of the model and the namespaces, any links that are made to the normalisation rules that implement the translation are informative, rather than normative. Not all normalisation rules relate to namespaces and identifiers, and not all namespaces require normalisation rules. If the model required links between normalisation rules and namespaces, the coverage would not be complete, and the link may not be obvious outside of the context of providers.

The use of the authority property in the namespace definitions enables scientists to define the provenance of a namespace along with its use, as the namespace URI is used as the link by all other parts of the model. This provides scientists with a point of contact if they have queries about the definition or use of the namespace.

3.5.1 Contextual namespace identification

The model does not require that namespaces be identified globally. A particular identifier, generally associated with a single namespace, may have different meanings when applied to different queries, so it may be identified as one namespace in one query, and another namespace in another query. The contextual nature of namespaces makes them useful for integrating different datasets into the same system. Although it may be more applicable to normalisation rules, the ability to decide what the meaning of a namespace is for a particular query makes it possible to modify the common representation for a document to match the locally understood representation. This may include avoiding making large assertions about things that are not necessarily applicable to one dataset or query, even though they may be applicable to other datasets. The design of the model allows scientists to include only specific namespaces in both query types and providers, so scientists can decide that a namespace is only applicable to a set of queries without requiring reference to providers of information, or the namespace can be deemed only relevant to a particular provider.

The namespace mechanism is designed with the idea that references from the environment would be mapped into the model in one or more ways, allowing different scientists to assign the same external reference, such as a short prefix identifying a different reference across different contexts, to more than one namespace. This allows scientists to reuse other model based information without semantic conflicts with their own models. There are likely to be many situations where this support is necessary once the model is deployed independently across different areas by different authorities. To allow computational identification of conflicts, the namespace mechanism must contain a direct link to the authority that assigned the namespace. This allows scientists to identify cases where a single authority has assigned the same reference, such as a prefix, to more than one case, indicating that the authority did not have a clear purpose for the string. It also allows scientists to trust namespaces based on the authority, although if there are untrusted sources of data the reference to the authority may not carry the weight that a set of namespace definitions from a completely trusted authority would.

3.6 Model configurability

The system is designed so that semantically similar queries will be compatible with many different data site patterns. Public definitions of which queries are semantically similar, or relevant in a particular case, can be included or excluded, and local, private and efficient queries can be added or substituted using profiles. The namespace classification is designed so that it relies on functional keys, although a functional key may look like a URI. The functional key may match more than one namespace classification, enabling scientists to take advantage of currently published classifications in some cases, while concurrently defining their own internally recognised, and maintained, namespaces.

3.6.1 Profiles

The profile component was included to allow varying levels of flexibility with respect to the inclusion or exclusion of each of the profileable components, i.e., query types, providers and normalisation rules. Profiles make it simple for scientists to customise the local configuration by adding their own RDF configuration snippets to the set of RDF statements that are used to configure the system. The semantics surrounding profiles provide for input from both scientists and public configuration creators, although local scientists can always override the recommendations given by configuration creators. The profiles are also stacked, enabling scientists to define higher or lower level profiles to extend and filter the set of publicly published configurations. For a query to be executed on a particular provider, both the query and the provider must be included by one of the profiles. If this process results in the inclusion of the provider for that query, the profile list can also be used to determine which of the normalisation rules are used to transform the input and output.

If a lower level, more relevant, profile explicitly excludes or includes a component, then the higher level, less relevant, profiles are not processed for that component. If a profile specifies that it is able to implicitly accept query types, providers, or normalisation rules, then the default include or exclude behaviour specified by the author of the configuration is used to decide whether the component matches the profile; the same default applies if no profile matches at all. The default include or exclude behaviour was designed to allow a configuration editor to suggest to scientists using the configuration that they are not likely to find the component relevant, or that the component relates to an endpoint that they are not likely to be able to access from the public web.
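The layered decision described above can be sketched as follows, assuming each profile carries explicit include and exclude lists and an implicit-accept flag; the representation and names are illustrative, not the prototype's.

def is_included(name, default_include, profiles):
    # Walk profiles from most to least relevant; the first profile that
    # expresses a decision about the component wins.
    for profile in profiles:
        if name in profile["include"]:
            return True
        if name in profile["exclude"]:
            return False
        if profile["implicit_accept"] and default_include:
            return True
    return default_include  # no profile matched: fall back to the default

curated = {"include": ["sider_curated"], "exclude": ["sider_public"],
           "implicit_accept": False}
print(is_included("sider_public", True, [curated]))  # False: excluded
print(is_included("medicare", True, [curated]))      # True: default used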

The simplest possible configuration consists of a single query type and a single provider, as shown in Figure 5.2. The query type needs to be configured with a regular expression that matches scientist queries. The provider needs to be configured with both a reference to the query type, and an endpoint URL that can be used to resolve queries matching the query type. Although the example is trivial, in that the user's query is directly passed to another location, it provides an overview of the features that make up a configuration. One particular feature to be noted is the use of the profile directive to process profile exclude instructions first, and then include in all other cases. In this example, there are no profiles defined, resulting in the query type and the provider being included in the processing of queries that match the definitions.

Although profile decisions need to be binary to ensure replicability, the layering of different profiles makes it possible to have various levels of trust. A complex layered set of profiles may be confusing to users, but scientists can preview the included and excluded items independent of an actual query. This makes it possible to assert levels of trust before designing experiments.

If the set of configurations that the scientist uses for data access is refreshed periodically, possibly from an untrusted source, scientists would need to preview the changes, especially if any profiles have implicit includes and a configuration is not completely trusted. This may be a security issue, but if static configuration sources are used, then the profiles will have the effect of securing the system as opposed to making it more vulnerable. This security results from the choice of providers based on their relevancy and, if needed, from restricting all queries to a set of curated providers that will not change when a public configuration is updated, as may happen at regular intervals.

3.6.2 Multi-dataset locations

In the case where a particular query needs to operate on a provider where more than one dataset must be present, the model allows the query to specify which parameters are namespace parameters, along with a definition of how many of the namespace parameters need to match against namespaces on any eligible providers. This allows for the optimisation of queries, without relying on scientists to create new query types for each case where possible.
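A sketch of this eligibility test follows, assuming the namespace parameters and the minimum match count are passed in directly; names are illustrative.

def provider_eligible(query_namespaces, provider_namespaces,
                      minimum_matches):
    # Count how many of the query's namespace parameters the provider
    # hosts, and require at least the configured minimum.
    matches = sum(1 for ns in query_namespaces
                  if ns in provider_namespaces)
    return matches >= minimum_matches

# A join across two datasets requires both namespaces on one provider:
print(provider_eligible(["drugbank_drugs", "diseasome_diseases"],
                        {"drugbank_drugs", "diseasome_diseases",
                         "sider_drugs"}, minimum_matches=2))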

This mechanism allows scientists to create local data silos where they store copies of their most used datasets, while still being able to link to distributed SPARQL query endpoints in some cases. Scientists thereby gain the benefits of the distributed and aggregated datasets without losing the consistent linking and normalised documents that this model can provide.

The model allows groupings of both equivalent providers and endpoints, and allows scientists to contribute their information to a shared configuration. The shared configuration is not limited to a single authority, as the configurations for the model are designed to be shareable, and independent of the authority that created them originally, apart from their use of Linked Data URIs as identifiers for items in the configuration documents.


3.7 Formal model algorithm

The formal definition of how the model is designed to work is given in this section. It does not contain some features that were included in the prototype, such as choices based on redundancy, that are provided by provider and query groupings; however, these can be included by restricting the range of the loops over query groups (qg) and provider groups (pg). The use of query and provider groups is optional, to allow for the simplest mechanism of providing and publishing configuration information using the model. If a query type or a provider does not appear in any group, it should be included as long as it passes the tests for inclusion for the scientist query. A condensed sketch of the algorithm, in code, follows the listing.

1. Let u be the scientist query

2. Let r be the overall set of inputs given in u;

3. Let f be the ordered set of profiles that are enabled;

4. Let π be the pool of facts that will contain the results of the query;

5. Let Q be the set of query types in the global set of query types (GQ) that match u

6. Let QG be the set of query groups in the global set of query groups (QGS) that match u

7. Let P be the global set of providers

8. Let PG be the global set of provider groups

9. Let N be the global set of namespaces

10. Let GNR be the global set of normalisation rules to be used to modify queries and data resulting from queries

11. For each query group qg in QG

12. For each query type q in qg;

(a) Check that q matches the constraints of the set of profiles in f , and if it does not match then go to the next q

(b) If q matches the conditions in f ;

(c) Let np be the set of namespace parameters identified from r in the context of q

(d) Let nd be the set of sets of namespace definitions in N which can be mapped from each element of np

(e) Let nc be the set of namespace conditions for q

(f) If nd does not match q according to nc, then go to the next q


(g) If nd matches q according to nc;

(h) For each provider group pg in PG

(i) For each provider p in pg;

i. Check that p matches the constraints of the set of profiles in f , and if it does not match then go to the next p

ii. If p matches the conditions in f ;

iii. If p matches q according to nc and nd, including the condition that nd may not need to match p if this condition is included in nc

iv. Let s be the set of normalisation rules that are attached to p;

v. Substitute inputs from r in any templates in q, using rules that are defined in p and relevant to the query parameter denormalisation stage

vi. Substitute inputs from r in any templates in p, using rules that are defined in p and relevant to the query parameter denormalisation stage

vii. Parse q into an in-memory representation mq if necessary for the communication type of provider p

viii. Normalise mq based on any normalisation rules defined in p that are relevant to the parsed query normalisation stage

ix. Let α be the document that is returned from querying q on p in the context of r

x. Normalise α according to the rules defined for p that are relevant to the before results parsing stage

xi. Parse the normalised α and then normalise the facts based on the normalisation rules defined in p that are relevant to the parsed results before pool stage

xii. Let η be the set of results for this provider, and include the normalised results in η

xiii. Let β be the set of non-resolved facts found by parsing the static templates from q after substituting parameters from the combination of r, q, p, and nd

xiv. Normalise β according to s, and include the results in η

xv. Include η in π

13. Normalise statements from π which match the after pool results stage in the ordered set of normalisation rules relevant to all providers in P

14. Serialise facts in π

15. Normalise the serialised facts which match the after pool serialisation stage in the ordered set of normalisation rules relevant to all providers in P

16. Output the result of u as the serialised document
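The following condensed sketch shows the overall control flow of the algorithm. The helper functions (matches_profiles, namespaces_match, provider_matches, build_query, execute, parse, serialise, apply_rules) are assumed stand-ins for the model components described in this chapter, and the grouping-based redundancy choices and static templates are omitted.

def run_query(u, f, query_groups, provider_groups):
    pool = []                                  # pi, the pool of facts
    for qg in query_groups:                    # step 11
        for q in qg:                           # step 12
            if not matches_profiles(q, f) or not namespaces_match(q, u):
                continue                       # steps (a) and (f)
            for pg in provider_groups:         # step (h)
                for p in pg:                   # step (i)
                    if not matches_profiles(p, f):
                        continue               # step i
                    if not provider_matches(p, q, u):
                        continue               # step iii
                    query = build_query(q, p, u)        # steps v-viii
                    alpha = execute(query, p)           # step ix
                    alpha = apply_rules(alpha, p,
                                        "results_before_parsing")
                    facts = apply_rules(parse(alpha), p,
                                        "results_after_parsing")
                    pool.extend(facts)         # eta merged into pi
    pool = apply_rules(pool, None, "results_after_merging")   # step 13
    document = serialise(pool)                                # step 14
    return apply_rules(document, None,
                       "results_after_serialisation")         # steps 15-16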


3.7.1 Formal model specification

The model requires that the configuration of each item includes a number of mandatory features. The list of features required for each part is given in this section. This list of features may be extended by prototypes if they need information that is specific to the communication methods that they are using. For example, an HTTP based web application may require additional information for each query type to determine the way different user queries match based on their use of different HTTP methods such as HTTP GET and HTTP POST.

Query type

A query type should contain the following features:

• Input query parameters : The information from the user, including namespaces where required.

• Namespace distribution rules : That is, which combinations of namespaces are acceptable and, if this is not clear, how the namespace identifiers are given in the query.

• Whether namespaces are relevant to this query : If namespaces are not relevant, the namespace distribution parameters are ignored.

• Whether to include endpoints which do not fit the namespace distribution parameters but are marked as being general purpose.

• Query templates : The templates for the queries on different endpoint implementations should be given in the query type, with parameters in the template where available.

• Static templates : Templates that are relevant to the use of the query, but do not rely on information from the endpoint, other than the provider parameters. These templates represent facts that can be directly derived from the queries, and will not be subject to any normalisation rules that are specific to this query type and the provider it is executed with, with the exception of normalisation rules that are applied after the facts are included in the output pool.

Provider

A Provider should contain the following features:

• Namespaces : Which namespace combinations are available using this provider. This may be omitted in some cases if the provider may contain information about any namespace, but it is hard to pinpoint which ones it may actually contain at any point in time.

• Whether this provider is general purpose : This is used to match against any query types that are applied to this server and claim to be applicable to general purpose endpoints.


• Query types : Which query types are known to be relevant to this provider. If a system does not require preconfigured queries to function using the model, these may be omitted. An example of where they may be omitted is the use of the normalisation, namespace and provider portions in a federated SPARQL implementation.

• Endpoint communication type : The communication framework that is used for this provider. For example, HTTP GET, HTTP POST, etc.

• Endpoint query protocol : The method that is used to query the endpoint. For example, SPARQL, SOAP/XML, etc.

• Normalisation rules : Which rules are necessary to use with the given queries on the given namespaces for this endpoint.

Normalisation rule

A normalisation rule should contain at least one of the following features:

• Input rule : Used to map normalised information into whatever scheme the relevant endpoints require.

• Output rule : Used to map endpoint specific data back into normalised information.

• Type of rule : For example, regular expression matching or SPARQL.

• Relevant stages : When the rule is to be applied. For example, before query creation, after query creation, before results import to the normalised data model, after import to the normalised data model, or after serialisation to the output data format. SPARQL rules are only useful on results after they have been imported to the normalised data model, while regular expressions may be useful for any stage where the query or results are in textual form, and query optimisation functions are only relevant after the query has been created, rather than in the process of replacing templates in the query to form a full query that may be parsed before being executed, depending on the communication method of the relevant data provider. The normalisation rules which are applied to the results pool and the serialised results document are not relevant to the query type or provider that they were included with, so variables such as the provider location, which may be used in the earlier stages, will not be available for these rules.

• Order : Indicates where the rule will be applied in a given set of rules within a stage.

Normalisation rule test

The model has an optional testing component that is used to validate the usefulness of normalisation rules. The expected input and expected output are not defined in a rule; they are defined in separate test entries (a sketch follows the list), each of which contains:


• Expected input : The input to use.

• Expected output : The output that is expected.

• Set of rules : The set of rules that must be applied to the input to derive the expected output.
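A minimal sketch of running such a test follows, assuming rules are represented as callables; the PubMed example reuses the BioGUID rule from Figure 3.11.

def run_test(test):
    # A test passes when the configured rules map the expected input to
    # the expected output.
    value = test["expected_input"]
    for rule in test["rules"]:
        value = rule(value)
    return value == test["expected_output"]

pubmed_test = {
    "expected_input": "http://bioguid.info/pmid:12930541",
    "expected_output": "http://bio2rdf.org/pubmed:12930541",
    "rules": [lambda s: s.replace("http://bioguid.info/pmid:",
                                  "http://bio2rdf.org/pubmed:")],
}
print(run_test(pubmed_test))  # True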

Namespace

A Namespace should contain the following features (a sketch follows the list):

• The identifier used to designate this namespace : These items can be mapped to both this and other namespaces as applicable. These may include traditional prefixes, but also other items such as predicates and types that are available.

• The expected format for identifiers : A regular expression that can be used to identify syntactically invalid identifiers for this namespace.

• The preferred URI structure : A template containing the authority, namespace prefix, identifier, and any other variables or constants necessary to construct a URI for this namespace.

• The authority that is responsible for assigning the scientist level items to this namespace : This is in effect a mapping authority, although multiple authorities can be used concurrently, so there is no central authority.
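A sketch combining these features into a namespace definition is given below; the regular expression, URI template, and function name are illustrative.

import re
from string import Template

namespace = {
    "prefix": "doi",
    "identifier_format": re.compile(r"^10\.\d{4,9}/\S+$"),
    "preferred_uri": Template("http://bio2rdf.org/${prefix}:${identifier}"),
    "authority": "http://bio2rdf.org/",
}

def make_uri(ns, identifier):
    # Reject syntactically invalid identifiers before constructing the
    # preferred URI for this namespace.
    if not ns["identifier_format"].match(identifier):
        raise ValueError("syntactically invalid identifier: " + identifier)
    return ns["preferred_uri"].substitute(prefix=ns["prefix"],
                                          identifier=identifier)

print(make_uri(namespace, "10.1080/0968768031000084163"))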

Profile

A Profile should contain the following features:

• Explicitly included items : A list of items that should be used with this profile; the scientist should be told that they will be used if this is the first matching profile on their sorted profile list.

• Explicitly excluded items : A list of items that should not be used with this profile; the scientist should be told that they will not be used if this is the first matching profile on their sorted profile list.

• Default inclusion and exclusion preference : This is used to assign a default value to any items that do not specify inclusion or exclusion preferences, as the use of profiles is not mandatory.

• Implicit inclusion preference : Whether items in each category should be included if the result of the profile rules indicates that the item can be implicitly included.


Chapter 4

Integration with scientific processes

4.1 Overview

Scientists need to be able to concurrently use different linked datasets in their research. The model described in Chapter 3 enables scientists to query various linked datasets using a single replicable method. This chapter describes a variety of ways scientists can integrate the results of these queries with their programs and workflows. The integration relies on the model to provide data quality, data trust, and provenance support. Scientists are able to use the model to reuse and modify work produced by other scientists to fit their current context, even if they need to use different methods and datasets from those used by the original scientist.

The model allows scientists to explore linked data and integrate annotations relating to data they find interesting. In comparison to other techniques, the annotations do not need to modify the original dataset. The resulting RDF triples can be integrated into computer operated workflows, as the data resulting from the use of the model is mandated to be computer understandable and processable using a common query language, regardless of which discipline or location the data originated in.

A case study is given in Section 4.4 describing the ways that the model and prototype, described in Chapter 5, can be used in an exploratory medical and scientific scenario. The case highlights one way of using the model to publish links to results along with publications. The use of resolvable URLs as footnotes in Section 4.4 is one method of integrating computer resolvable data references into publications, although other methods have also been proposed in the data-based publication literature, including embedding references into markup [37, 58] and adding comments into bibliographic datasets [123].

Scientists need a variety of tools to analyse their data. One method that scientists use to manage their processing is to integrate data and queries using workflow management systems. If the workflow is able to process and filter the single data model that is mandated by the query model, the scientist can avoid or simplify many of the common data transformation steps that are required to use workflow management systems to process and channel data between different services. A case study is given in Section 4.7 using an RDF based workflow to demonstrate the advantages of the model and prototype compared to current workflow and data access systems.

In many cases it is difficult for a scientist to determine what the sources of data were in a large research project. Provenance information, including the queries and locations of data, is designed to be easily accessible using the model, as each of the query types is directly referenced in the relevant providers, and providers in turn reference normalisation rules and namespaces by name. These explicit links make it easy to generate and permanently store a portable provenance description. The provenance information provided by the model enables scientists to use provenance as part of their experiments. An example of how this information can be used to give evidence for the results of a scientific workflow is given in Section 4.7.1.

Scientific results, and the processes that were used to derive the results, need to be reviewed by peers before publication. These steps require other scientists to attempt to replicate the results, and challenge any conclusions based on their interpretation of the assumptions, data, methods, and results that were described by the publishing scientist. The model is useful for abstracting queries away from high level processing systems, but the high level processing systems are required to integrate the results of different queries. Scientific peers need to replicate and challenge the results in their own contexts as part of the scientific review process. An example of how this can occur is given in Section 4.7.1.

4.2 Data exploration

Although most scientific publications only demonstrate the final set of steps that were used to determine the novel result, the process of determining goals and eliminating unsuccessful strategies is important. Historically, the scientific method required the scientist to propose a testable hypothesis explaining an observable phenomenon. This initial hypothesis may have come from a previous direct observation or a previous study. Ananiadou and McNaught [6] note that, "hypothesis generation relies on background knowledge and is crucial in scientific discovery." The source of the initial hypothesis affects how data is collected and analysed. It may also affect how conclusions based on the results can be integrated into the public body of scientific knowledge [51]. Variations on this method in the current scientific environment may involve exploration through published datasets before settling on a hypothesis.

The model described in this thesis, focusing on replicating queries across linked scientific datasets, provides a useful ad hoc exploration tool. It is designed to provide consistent access to information regardless of the data access method. The model focuses on replication, so exploratory activities can be tracked and replicated. This provides a novel and powerful tool for scientists who rely on a range of linked scientific datasets for their work. The method used by the model is unique, as it provides an abstraction over the methods for distributing queries, using namespace prefixes and a mapping from scientists' queries to actual queries on datasets.

Other RDF based strategies rely on scientists writing complete SPARQL queries which are resolved dynamically using a workflow approach to provide access to the final results. These strategies may be simpler than writing workflows that rely on multiple different data formats, but they are still a barrier to adoption by the wider scientific community. They rely on datasets being normalised, and on scientists being able to identify a single location for each dataset to include it in their query. This makes it difficult to replicate queries in different contexts, as peers may need to have access to an equivalent set of RDF statements, and they may need to modify the query to use their alternative location. By contrast, the model provides the necessary provenance details to enable replication using substitution and exclusion, without removing anything from the provenance details.

An implementation could interactively provide query building features, using both known and ad hoc datasets as needed. The essential data exploration is related to the number of links between datasets and how valuable they are. It is useful to have the ability to continuously explore information, without first requiring an application to understand a separate interface for each dataset and data provider. Scientists then need to be able to send future queries to processing services based on the data they have previously retrieved from the linked datasets. These features are provided by the normalising of URIs in the prototype, which makes it possible to determine future locations based on the URIs from past results.

4.2.1 Annotating data

Scientists can create annotations that contain direct references to the data along with new information about the data. This enables them to keep track of interesting items throughout the exploration process. If these annotations are stored in a suitable place, they can be integrated into the description for each item using a query and provider that are targeted at the annotation data store. An annotation service was implemented and the model was used to access the annotations, as shown in Figure 4.4. It demonstrates that novel data can easily be included with published data for any item based on its URI. The annotations contained details about the author of the annotation, as the annotation prototype required scientists to login with their OpenID, and the OpenID identifying URI was included in the annotation.
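Such a query could be implemented as a query type whose provider is the annotation store. The following is a minimal sketch of what that query might look like, assuming a hypothetical annotation vocabulary; the ann: predicates are illustrative stand-ins rather than the vocabulary used by the prototype:

    # Retrieve annotations about a data item, here the MAOB gene record
    # used elsewhere in this chapter; ann:about, ann:author and ann:note
    # are hypothetical predicate names.
    PREFIX ann: <http://example.org/annotation#>
    CONSTRUCT {
      ?annotation ann:about <http://bio2rdf.org/geneid:4129> ;
                  ann:author ?openid ;
                  ann:note ?text .
    }
    WHERE {
      ?annotation ann:about <http://bio2rdf.org/geneid:4129> ;
                  ann:author ?openid ;
                  ann:note ?text .
    }

Because the CONSTRUCT form returns RDF statements, the annotations can be merged directly with the statements returned by the other providers for the same URI.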

Scientists can use tags to keep track of their level of trust in a dataset by annotating items in the dataset gradually, and then compiling reports about namespaces using the tag information. The reports can be selective, based on the annotations of trusted colleagues, and can be set up using new query types and profiles. These tags were used to annotate data records from the published Gene Ontology [16] and Uniprot Keywords [114] datasets, as well as an unpublished BioPatML dataset [92]. The tagging functionality was implemented in a web application that scientists can use to annotate any URIs 1.

The tag information could be directly integrated with the document that scientists resolve, if they know that they want to see annotations together with the original data in their context. In current workflows, every instance where the dataset was accessed would need to be subdivided into two steps, one for the original data and one for the annotated data. When used with normalisation rules from the model, annotations could be used to selectively remove case specific information, although other scientists would not receive this information until they resolved the annotation to RDF. If useful information was removed from the original dataset, annotations could be used to include it again.

1 http://www.mquter.qut.edu.au/RDFTag/NewTerm.aspx

4.3 Integrating Science and Medicine

The model is designed so that it can be easily customised by users. Extensions can range from additional sources and queries, to the removal of sources or queries for efficiency or other contextual reasons. The profiles in the model provide a simple way for scientists to select which sources of information they want to use without having to make choices about every published information source. In the context of Health Informatics, a hospital may want to utilise information from a drug information site, such as DrugBank and DailyMed, together with their private medical files.

The hospital could map references in their medical files to disease or drug datasets such as DrugBank or DailyMed and publish the resulting information in a single data model, or provide a compatible interface to their current system with a single reference data model. The data from their systems could then be integrated with the DrugBank information transparently, as the model requires that the data model be able to be used for transparent merging of data from arbitrary locations. The hospital could then create a mapping between the terminologies stored in their internal dataset and those used by DrugBank and related datasets, and use those to syntactically normalise their annotations. These mappings could be made using one of the available SQL to SPARQL converters, such as the Virtuoso RDF Views mechanism [45] or the D2RQ server [29].
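Once the internal records are exposed as RDF, the terminology mapping itself can be expressed as a query. The following sketch assumes a hypothetical internal vocabulary and drug code; only the DrugBank URI is taken from the datasets discussed in this chapter:

    # Map records that use an internal drug code to the corresponding
    # DrugBank URI; the hospital: prefix, the two predicates, and the
    # "ISO-0042" code are all hypothetical names for illustration.
    PREFIX hospital: <http://records.example-hospital.org/ns#>
    CONSTRUCT {
      ?record hospital:prescribedDrug <http://bio2rdf.org/drugbank_drugs:DB01247> .
    }
    WHERE {
      ?record hospital:internalDrugCode "ISO-0042" .
    }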

To distinguish private records from external public datasets, hospitals would create a namespace for their internal records, along with providers matching the internal addresses used for queries about their records. This novel private information can be transparently included in the model through the use of private provider configurations indicating the source of the information.

If the hospital then wanted to map a list of diseases into their files, they could find a source for disease descriptions, such as Diseasome [55], and either find existing links through DrugBank [144] and DailyMed 2, or they could use text mining to discover common disease names between their records and disease datasets [7]. For scientific research, resources like Diseasome are linked to bioinformatics datasets such as the NCBI Entrez Gene [93], PDB [26], PFAM [48], and OMIM [5] datasets. This linkage is defined explicitly using namespaces and identifiers in those namespaces. This enables the use of direct resolvable links and enables links to be discovered; for instance, there may be links from patients and clinical trials to genetic factors, implying that patients may be eligible for particular clinical trials. Patients could be directly linked to genes using the data model without reference to diseases, and new diseases could be described internally without requiring a prior outside publication.

2 http://dailymed.nlm.nih.gov/dailymed/about.cfm

Medical procedures and drugs inevitably have some side effects. A list of potential side effects is available in the Sider dataset [80], enabling patients and doctors to both have access to equal information about the potential side effects of a course of medication. Side effects that are discovered by the hospital could be recorded using references to the public Sider record, reducing the possibility that the effect would be missed in future cases.

The continuous use of a single data model enables scientists to transition between datasets using links without having to register a new file model for each dataset, although the single data model may have different file formats that are each associated with input and output from the model. For ease of reading, the URI links, which can be resolved to retrieve the relevant information used in the following case study, are footnoted. The links between the datasets were observed by resolving the HTTP URIs and finding relevant RDF statements inside the document that link to other URIs.

In comparison to other projects that aim to integrate Science and Medicine, such as BIRN [17], the model used here does not require each participating institution to publish their datasets using RDF, as long as the data is released using a license that allows RDF versions of the data to be created by others. This enables others to integrate datasets that may not be in the same field or may not be currently maintained by an organisation. The semantic quality of the resulting information may not be as high as in the specialised neuroscience datasets that BIRN was developed to support, as each of the organisations involved in BIRN is required to be active and maintained.

The clinical information that was integrated in BIRN consisted of public de-identified patient data, whereas this model allows hospitals to retrieve public information and integrate it with their private information without having to de-identify it before being able to distribute queries across the data. The information about how to federate queries does not need to be publicly published to be recognised by the model or prototype, although it does need to be published in other systems that rely on centralised authorities for dataset locations and query federation.

Private data is not ideal, as it hides semantically useful links from being reused. In some cases, notably hospitals, privacy laws in some countries require that personal health records be private unless explicitly released.

4.4 Case study: Isocarboxazid

This case study is given in the context of both Science and Medicine. It is founded around a drug known generically as "Isocarboxazid". This drug is also known by the brand name "Marplan". The aims of this case study are to discover potential relationships between this drug and patients, with reference to publications, genes, and proteins, which may possibly affect the course of their treatment. In cases where patients are known to have adverse reactions, or they do not respond positively to treatment, alternatives may be found by examining the usefulness of drugs that are designed for similar purposes. The doctor and patient could use the exploratory processes that are explained in the case study to determine whether current treatments are suited to the patient with respect to their history and genetic makeup. If necessary, the case study methods could also be used to find alternative treatments.

Figure 4.1: Medicine related RDF datasets

The case study utilises a range of datasets that are shown in Figure 4.1. These datasets are sourced from Bio2RDF, LODD, Neurocommons, and DBpedia. The original image for the LODD datasets map can be found online 3.

A portion of the case study, highlighting the links between items in the relevant datasets, can be seen in Figure 4.2. This case study includes URIs that can be resolved using the Bio2RDF website, which contains an implementation of the model. For the purposes of this case study, the relevant entry in DrugBank is known 4, although it could also be found using a search 5. According to this DrugBank record, Isocarboxazid is "[a]n MAO inhibitor that is effective in the treatment of major depression, dysthymic disorder, and atypical depression".

The DrugBank entry for Isocarboxazid contains links to the CAS (Chemical Abstracts Service) registry 6, which in turn contains links to the KEGG (Kyoto Encyclopedia of Genes and Genomes) Drug dataset 7. The link back to DrugBank from the KEGG Drug dataset and others in this case could also have been discovered using only the original DrugBank namespace and identifier 8. The brand name drug dataset, Dailymed, also contains a description for Marplan 9, which is linked from Sider 10 and the US MediCare dataset 11.

3 http://www4.wiwiss.fu-berlin.de/lodd/lodd-datasets_2009-08-06.png
4 http://bio2rdf.org/drugbank_drugs:DB01247
5 http://bio2rdf.org/searchns/drugbank/marplan
6 http://bio2rdf.org/cas:59-63-2
7 http://bio2rdf.org/dr:D02580
8 http://bio2rdf.org/links/drugbank_drugs:DB01247
9 http://bio2rdf.org/dailymed_drugs:2892

10 http://bio2rdf.org/sider_drugs:3759


Figure 4.2: Links between datasets in Isocarboxazid case study [a diagram linking the DrugBank, Dailymed, KEGG Drug, CAS, PubChem, MediCare, HGNC, PDB, PFAM, Diseasome, NCBI Entrez Gene, MeSH, Pubmed, and Sider records around Isocarboxazid/Marplan, its protein targets, and its side effects]


These alternative URIs could be used to identify more datasets with information about the drug.

The record for Isocarboxazid in the Sider dataset has a number of typical depression side-effects to watch for, but it also has a potential link to Neuritis 12, a symptom which is different to most of the other 39 side effects, which are more clearly depression related. Along with side effects, there are also known drug interactions available using the DrugBank dataset. An example of these is an indication of a possible adverse reaction between Isocarboxazid and Fluvoxamine 13. If Fluvoxamine was already being given to the patient, other drugs may need to be investigated as alternatives, to prevent the possibility of a more serious Neuritis side effect. DrugBank contains a simple categorisation system that might reveal useful alternative Antidepressants 14 in this case, such as Nortriptyline 15.

Dailymed contains a list of typically inactive ingredients in each brand-name drug, such as Lactose 16, which may factor into a decision to use one version of a drug over others. The DrugBank entry for Isocarboxazid also contains links to Diseasome, for example, Brunner syndrome 17, which are linked to the OMIM (Online Mendelian Inheritance in Man) entry for Monoamine oxidase A (MAOA) 18.

DrugBank also contains a list of biological targets that Isocarboxazid is known to affect 19. If Isocarboxazid was not suitable, drugs which also affect this gene, Monoamine oxidase B (MAOB) 20 21 22 23, the protein 24, or the protein family 25, might also cause a similar reaction. The negative link (in this case derived using text mining techniques) between the target gene, monoamine oxidase B 26, and Huntington's Disease 27 28, might influence a doctor's decision to give the drug to a patient with a history of Huntington's.

The location of the MAOB gene on the X chromosome in Humans might warrant an investigation into gender related issues related to the original drug, Isocarboxazid. The homologous MAOB genes in Mice 29 and Rats 30 are also located on chromosome X, indicating that they might be useful targets for non-Human trials studying gender related differences in the effects of the drug.

11 http://bio2rdf.org/medicare_drugs:13323
12 http://bio2rdf.org/sider_sideeffects:C0027813
13 http://bio2rdf.org/drugbank_druginteractions:DB00176_DB01247
14 http://bio2rdf.org/drugbank_drugcategory:antidepressants
15 http://bio2rdf.org/drugbank_drugs:DB00540
16 http://bio2rdf.org/dailymed_ingredient:lactose
17 http://bio2rdf.org/diseasome_diseases:1689
18 http://bio2rdf.org/omim:309850
19 http://bio2rdf.org/drugbank_targets:3939
20 http://bio2rdf.org/symbol:MAOB
21 http://bio2rdf.org/hgnc:6834
22 http://bio2rdf.org/geneid:4129
23 http://bio2rdf.org/mgi:96916
24 http://bio2rdf.org/uniprot:P27338
25 http://bio2rdf.org/pfam:PF01593
26 http://bio2rdf.org/geneid:4129
27 http://bio2rdf.org/mesh:Huntington_Disease
28 http://bio2rdf.org/mesh:D006816
29 http://bio2rdf.org/geneid:109731
30 http://bio2rdf.org/geneid:25750


The Human gene MAOA can be found in the Traditional Chinese Medicine (TCM) dataset [47] 31, as can MAOB 32, although there were no direct links from the Entrez Geneid dataset to the TCM dataset. TCM has a range of herbal remedies listed as being relevant to the MAOB gene 33, including Psoralea corylifolia 34. Psoralea corylifolia is also listed as being relevant to another gene, Superoxide dismutase 1 (SOD1) 35 36. SOD1 is known to be related to Amyotrophic Lateral Sclerosis 37 38, although the relationship back to Brunner syndrome and Isocarboxazid, if any, may only be exploratory given the range of datasets in between.

LinkedCT is an RDF version of the ClinicalTrials.gov website that was set up to register basic information about clinical trials [65]. It provides access to clinical information, and consequently is a rough guide to the level of testing that various treatments have had. The drug and disease datasets mentioned above link to individual clinical interventions in LinkedCT, enabling a path between the drugs, affected genes, and trials relating to the drugs. Although there are no direct links from LinkedCT to Marplan at the time of publication, a namespace based text search returns a list of potentially interesting items 39. An example of a result from this search is a trial 40 conducted by John S. March, MD, MPH 41 of Duke University School of Medicine and overseen by the US Government 42.

The trial references published articles, including one titled "The case for practical clinical trials in psychiatry" 43 44. These articles are linked to textual MeSH (Medical Subject Headings) terms such as "Psychiatry - methods" 45, indicating an area that the study may be related to. The trial is linked to specific primary outcomes and the frequency with which the outcomes were tested, giving information about the scientific methods in use 46.

Although LinkedCT is a useful resource, as with any other resource, there are difficulties with the data being complete and correct. As an example of possible issues with completeness in LinkedCT, recent studies about the use of ClinicalTrials.gov indicate that a reasonable percentage of clinical trials either do not publish results, do not register with ClinicalTrials.gov, or do not reference the ClinicalTrials record in publications resulting from the research [94, 116]. These issues may be reduced if people were required to register all drug trials and reference the entry in any publications.

31 http://bio2rdf.org/linksns/tcm_gene/geneid:4128
32 http://bio2rdf.org/linksns/tcm_gene/geneid:4129
33 http://bio2rdf.org/linksns/tcm_medicine/tcm_gene:MAOB
34 http://bio2rdf.org/tcm_medicine:Psoralea_corylifolia
35 http://bio2rdf.org/tcm_gene:SOD1
36 http://bio2rdf.org/geneid:6647
37 http://bio2rdf.org/mesh:Amyotrophic_Lateral_Sclerosis
38 http://bio2rdf.org/mesh:D000690
39 http://bio2rdf.org/searchns/linkedct_trials/marplan
40 http://bio2rdf.org/linkedct_trials:NCT00395213
41 http://bio2rdf.org/linkedct_overall_official:12333
42 http://bio2rdf.org/linkedct_oversight:2283
43 http://bio2rdf.org/linkedct_reference:22113
44 http://bio2rdf.org/pubmed:15863782
45 http://bio2rdf.org/mesh:D011570Q000379
46 http://bio2rdf.org/linkedct_primary_outcomes:55439


Doctors and patients do not have to know what the URI for a particular resource is, as there is search functionality available. This searching can either be focused on particular namespaces or it can be performed over the entire known set of RDF datasets, although the latter will inevitably be slower than a focused search, as some datasets are up to hundreds of gigabytes in size, representing billions of RDF statements. An example of this may be a search for "MAOB" 47, which reveals resources that were not included in this brief case study.

47 http://bio2rdf.org/search/MAOB

4.4.1 Use of model features

The case study about the drug "Isocarboxazid" in Section 4.4 utilises features from the model to access data from multiple heterogeneous linked datasets. The case study uses the query type feature to determine the nature of each component of the question, and maps the question to the relevant query types. The query types from the Bio2RDF configuration that were utilised in the case study range from basic resolution of a record from a dataset, to identifying records that were linked to an item in particular datasets, and searches over both the global group of datasets and specific datasets. These methods make it simple to replicate the case study using the URIs, although additions of new query types and providers may be necessary to utilise local or alternate copies of the relevant datasets.

Basic data resolution

The Bio2RDF normalised URI form can be used to get access to any single data record in Bio2RDF. The pattern is "http://bio2rdf.org/namespace:identifier", where "namespace" identifies part of a dataset that contains an item identified by "identifier". This query provides access to Bio2RDF as Linked Data. It is simple to implement using SPARQL, with either a CONSTRUCT query or a DESCRIBE query being useful. The SPARQL Basic Graph Pattern (BGP) that is used in this query is "<http://bio2rdf.org/namespace:identifier> ?p ?o .", although the URI may be modified using Normalisation Rules as appropriate to each data provider. The SPARQL pattern may not be necessary if the query is resolved using a non-SPARQL provider.
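For the DrugBank record used in this case study, the two SPARQL forms of this query would look similar to the following sketch:

    # CONSTRUCT form of the basic data resolution query
    CONSTRUCT {
      <http://bio2rdf.org/drugbank_drugs:DB01247> ?p ?o .
    }
    WHERE {
      <http://bio2rdf.org/drugbank_drugs:DB01247> ?p ?o .
    }

    # Equivalent DESCRIBE form; the exact statements returned are left
    # to the endpoint implementation
    DESCRIBE <http://bio2rdf.org/drugbank_drugs:DB01247>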

An example of a question in the case study that may not use SPARQL is the initial query. The process of resolving the URI "http://bio2rdf.org/drugbank_drugs:DB01247" may require the model to use the Linked Data implementation at the Free University of Berlin site, provided by the LODD project, or it may use the SPARQL endpoint. In this case, the namespace is "drugbank_drugs", and the identifier is "DB01247". According to the Bio2RDF configuration, the namespace prefix "drugbank_drugs" is identified as being part of the namespace "http://bio2rdf.org/ns:drugbank_drugs" by matching the namespace prefix against the preferred prefix that is part of the RDF description for the namespace. This namespace is attached to the provider "http://bio2rdf.org/provider:fuberlindrugbankdrugslinkeddata", which will attempt to retrieve data by resolving the Linked Data URI "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01247" to an RDF document.

The DrugBank provider in Bio2RDF contains links to the normalisation rules that are necessary to convert the URIs in the Free University of Berlin document to their Bio2RDF equivalents. This enables the referenced URIs to be resolved using the Bio2RDF system, which contains other redundant methods of resolving the URI besides the Linked Data method. In this case, the document is known to contain incorrectly formatted Bio2RDF URIs. This is a data quality issue that can be fixed using the normalisation rules method, in a similar way to how the useful, but not Bio2RDF, Linked Data URIs were changed.
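The effect of such a rule can be illustrated as a single SPARQL query, using functions from the draft SPARQL 1.1 specification; this is only a sketch of the transformation, as the prototype applies normalisation rules to the retrieved documents rather than rewriting queries in this way:

    # Rewrite object URIs from the Free University of Berlin form to the
    # normalised Bio2RDF form while copying the record
    CONSTRUCT {
      <http://bio2rdf.org/drugbank_drugs:DB01247> ?p ?normalised .
    }
    WHERE {
      <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01247> ?p ?o .
      BIND(IF(isIRI(?o) &&
              STRSTARTS(STR(?o), "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/"),
              IRI(CONCAT("http://bio2rdf.org/drugbank_drugs:",
                         STRAFTER(STR(?o), "/resource/drugs/"))),
              ?o) AS ?normalised)
    }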

There are some situations where it may be difficult to get a single document to completely represent a data record. If the only data providers return partial data records, such as Web Services that map the record to another dataset, then many Web Services may need to be resolved to compile the total data record into a single document. In some cases, the complete record may have parts of its representation derived from different datasets, or at least different locations. However, in many cases the record is available from a single location, so it is simpler to maintain and modify the record. In both cases, the model allows the data record to be compiled and presented as a single document to the scientist, including any necessary changes based on their context.

References in other datasets

Scientists are able to take advantage of references that dataset providers have determined are relevant. They are not easily able to discover new references, as there has previously been no way of asking for all of the references to a particular item. Although the Linked Data proposals are useful as a minimum standard, they do not allow scientists to perform these operations directly, as they only focus on direct resolution of basic data items. Query types were created to handle both global and namespace targeted searches, extending the usefulness of the Linked Data approach to include link and search services in addition to simple record resolution. These queries are useful for scientists who utilise datasets that do not always contain complementary links to other datasets, and are therefore a way to increase the usability and quality of these datasets. The omission of a reference may not imply that the data is of a low semantic quality, but in syntactic terms it is useful to know where an item is referenced in another dataset.

The global reference search in Bio2RDF has the pattern "http://bio2rdf.org/links/namespace:identifier", where "namespace" identifies part of a specific dataset that contains an item identified by "identifier", and the query will find references to that item. In the case study, there was a need at one point to find all references to a drug, without knowing exactly where the references might come from. The Bio2RDF URI http://bio2rdf.org/links/drugbank_drugs:DB01247 was used to find any items that contained references to the drug. Although the canonical form for the data item of interest is http://bio2rdf.org/drugbank_drugs:DB01247, the data normalisation rules will change this to the URI that is known to be used on each data provider. This process is not easy to automate, as it requires a mapping step. A general mapping that created multiple queries for every known dataset would be very inefficient for any datasets that had more than one or two known URI forms. In the case of Bio2RDF, the mapping process was manually curated, based on examinations of sample data records, to increase the efficiency of queries.

As the global reference search requires queries on all endpoints by design, it is not efficient over very large sets of data. A more efficient version was designed to take advantage of any knowledge about the namespace that a reference may be found in. This form has the pattern "http://bio2rdf.org/linksns/targetnamespace/namespace:identifier", where "namespace" identifies part of a specific dataset that contains an item identified by "identifier", and the query will find references to that item in data providers that are known to contain data items in the namespace "targetnamespace". This was used in many places in the case study where the scientist knew that there were links between datasets, and wanted to identify the targets efficiently.

The SPARQL BGP that is used in both cases is likely to be similar to the pattern "?s ?p <http://bio2rdf.org/namespace:identifier> .", although in some cases the predicate (second part of the triple) will be necessary, and the object (third part of the triple) will be "identifier" or something similar.
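As a concrete sketch, the global reference search for the case study drug could be implemented on a SPARQL provider as:

    # Find all statements that reference the DrugBank record, returning
    # them as RDF so they can be merged with other results
    CONSTRUCT {
      ?s ?p <http://bio2rdf.org/drugbank_drugs:DB01247> .
    }
    WHERE {
      ?s ?p <http://bio2rdf.org/drugbank_drugs:DB01247> .
    }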

The model maintainer may be able to specify which data providers are known to contain some links to a namespace, to assist scientists in identifying linked records without having to do either a global reference search, or know which namespaces to use for a targeted search. Using this knowledge, the model can be set up so that during the data resolution step, there is a targeted search for references in a select number of namespaces, to dynamically include these references in the basic data resolution. In some cases, such as with widely used taxonomies, it is not efficient to utilise this step, but in general it is useful and can provide references that scientists may not have expected. This functionality is vital for namespaces that may not be derived from single datasets, or where the datasets are not available for querying, as the semantic quality of the referenced data can be increased using a generic set of syntax rules.

An example of this is the "gene" namespace that was referenced in the case study. Each identifier in the gene namespace is a symbol that is used to identify a gene in a particular species. Although these are curated by the HGNC (HUGO Gene Nomenclature Committee) organisation for humans, the use of these symbols is more widely spread, including in the gene namespace, where the symbols are denoted as being human using the NCBI Taxonomy identifier "9606", in the form http://bio2rdf.org/gene:9606-symbol, where "symbol" is the name given to a human gene. An example of a mouse gene symbol is http://bio2rdf.org/gene:10090-ace, which is similar to the symbols that are available for the human version of the gene, http://bio2rdf.org/gene:9606-ace and http://bio2rdf.org/symbol:ACE. In cases where the identifiers are not standardised, or different namespaces have different standards for the same set of identifiers, there are likely to be data quality issues similar to this case. As URI paths are case sensitive, http://bio2rdf.org/symbol:ACE is different to http://bio2rdf.org/symbol:ace, and datasets using one version will not directly match datasets using the other version.

The use of RDF enables the addition of a statement linking the two identifiers. However, this is not a good solution, as it requires scientists to apply reasoning to the data while assuming that the data is of perfect semantic quality. In the prototype there is the possibility to apply normalisation rules to template variables, such as the "identifier" elements in the symbol namespace, that modify queries on particular data providers based on a previous examination of the data to know which convention a dataset follows.
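Where the convention of an endpoint has not yet been examined, a case insensitive filter can be used as a fallback, at the cost of efficiency; a minimal sketch of such a query:

    # Match the gene symbol regardless of the case convention used by
    # the endpoint; a normalisation rule on the "identifier" template
    # variable avoids the need for this once the convention is known
    SELECT ?s
    WHERE {
      ?s ?p ?o .
      FILTER(regex(STR(?s), "^http://bio2rdf\\.org/symbol:ace$", "i"))
    }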

In some cases, the RDF datasets do not contain URIs for particular referenced items. Dataset owners do this to avoid any impression that the URIs would have ongoing support from their organisation. Some organisations have best practices to publish references to other data items using a tuple, formed using one property for a namespace prefix, such as "namespace" above, and another property for the identifier, such as "identifier" above. This restricts the usefulness of the resulting data, as there is no simple mechanism for using this tuple to get more information about the item. Although the lack of URIs is simple to get around in many cases, it makes queries harder, as the relevant properties need to be targeted along with a textual literal, which may not be handled as efficiently as URIs.
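A query over such a dataset has to name the tuple properties explicitly; in the following sketch the ex:namespace and ex:identifier predicates are hypothetical stand-ins for whichever properties a given dataset actually uses:

    # Find items that reference a Pubmed record published only as a
    # namespace prefix and identifier tuple, not as a URI
    PREFIX ex: <http://example.org/ns#>
    SELECT ?item
    WHERE {
      ?item ex:namespace "pubmed" ;
            ex:identifier "15863782" .
    }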

In some cases, a Linked Data RDF provider may argue that the underlying dataset did not contain URI references to another dataset, so they do not have to provide the links. One case where this is prominent is the lack of references out from DBpedia to Bio2RDF, due to the fact that Bio2RDF is not a clear representative of the relevant data providers, although DBpedia is not a representative of the Wikipedia group in the first instance. In these cases, the model is useful in determining which DBpedia items reference a given Bio2RDF namespace and identifier, and using these for reference searches; however, there is a lack of support in the SPARQL query language with respect to creating URIs inside queries, so the resulting documents still contain the textual references.

In all of the cases where the dataset does not use a URI to denote a reference to another item, the query types feature in the model is utilised to make it possible to use multiple different queries on the dataset, without the scientist having to specify in their request that they want to use the alternative query types. As an example, these alternative queries are utilised for the inchi, inchikey 48, and doi 49 namespaces. These public namespaces are well defined, so there are not generally syntactic data quality issues; however, they do not have RDF Linked Data compatible URIs, so some datasets have resorted to using literals, along with a number of different predicates, for references to these namespaces.

48 http://www.iupac.org/inchi/
49 http://www.doi.org/about_the_doi.html

The use of a large number of different data providers for these public namespaces is difficult from a trust position, as it is not always easy to identify where statements originate from in RDF, unless one of the Named Graph extensions to RDF is utilised, such as TriG or TriX. In the case of pure RDF triples, it is not easy to identify after the fact what trust levels were relevant. In the case of high throughput utilisation of this information, it is not appropriate to rely on a scientist applying their trust policies to information after queries are performed. The use of profiles to restrict the set of data providers is the solution provided in the model. The profiles are designed so that they can be easily distributed to other users, so organisations can use the profiles to define their trust in different providers. However, profiles are user selected, so any profiles that appear in query provenance can be excluded during replication simply by not selecting them.

Text search

RDF URIs are useful, but they are not always easy to find. In particular, Linked Data does not specify any mechanism for finding initial URIs given a question. In some cases this may not be an issue, if identifiers are textual and a URI can easily be created with knowledge of the identifier. In many cases identifiers are privately defined by datasets and the identifier has no semantic meaning apart from its use as part of the dataset. In these cases it is useful to be able to search across different datasets. One solution to this may be to rely on a search engine, but there are questions about the range of a search engine, and the ability to restrict it to what is useful and trusted.

The majority of the data providers for text search queries were SPARQL endpoints. Although the official SPARQL 1.0 specification only supports one form of text search, regular expressions, some SPARQL endpoints were known to support additional forms of full text search that are more efficient, so searches on these endpoints were performed using a different query type that was based on the same query structure.

In order for scientists to utilise the system without having to know these details, the two forms of SPARQL search, along with other URL based search providers, were implemented to match against two standard URI patterns. The first pattern is for a text search across all datasets that have text search interfaces, http://bio2rdf.org/search/searchTerm, where "searchTerm" is the text that a scientist wants to search for. The second pattern, http://bio2rdf.org/searchns/targetNamespace/searchTerm, is used to target the searches at datasets which contain a particular namespace, in this case "targetNamespace". The "targetNamespace" is not guaranteed to be the only namespace that results are returned from, as the query and the interface may be generically applicable to any data in a provider. The process could be optimised further in cases where the interface allowed the query to only target data items that had URIs which matched the target namespace, but this process may not always be effective or efficient, as the URIs may not have a base structure that is easily or efficiently identifiable in a SPARQL query.
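The two SPARQL query types behind these patterns might look similar to the following sketches; the full text form is shown using the Virtuoso bif:contains extension as one example of the more efficient endpoint-specific interfaces, and is an assumption rather than the exact query used by the prototype:

    # Portable SPARQL 1.0 form, using a regular expression filter
    SELECT ?s ?label
    WHERE {
      ?s ?p ?label .
      FILTER(regex(?label, "marplan", "i"))
    }

    # Endpoint-specific full text form (Virtuoso bif:contains extension)
    SELECT ?s ?label
    WHERE {
      ?s ?p ?label .
      ?label bif:contains "marplan" .
    }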

An experiment to improve the accuracy in some cases was completed, but it was not efficient, and queries regularly timed out, returning no results. The main efficiency problem was related to the use of SPARQL 1.0. It is not optimal for searches that rely on the structure of URIs, as it relies on regular expressions, which are not efficient when applied to large amounts of text. For the same reason, the targeted reference search only relies on the namespace being part of the data provider definition, so results can be returned from any namespace that was present in the dataset that was queried. In small scenarios, where the number of triples is quite small 50, the identifier-independent part of the URI can be identified as belonging to a particular namespace.

50 1000 triples is small in this context and 5 million+ is large

The text search function demonstrates a novel application of the model as an entry point to Linked Data models. In terms of data quality, it is perhaps less effective than a single large index, where results can be ranked across a global set of data, as the prototype implementation of the model only allows ranking inside data providers. However, in terms of data trust, the sites that were utilised were able to be selected in a simple manner using profiles, so as to restrict the possibility that the search terms accidentally matched on an irrelevant topic.

The exact provenance for a search using the model is available, and can be used in combination with the results to validate the usefulness of the results to a particular question. If the results have data quality issues, a scientist can encode rules that can be used to clean the results to suit their purposes, and view the results of this in their work, along with which rules were applied in their provenance. If a new text search method needs to be created to replicate results, it can be included using profiles. Profiles can also be used to filter out of date query types. For example, when SPARQL 1.1 is standardised, the SPARQL 1.0 query types can be filtered from historical query provenance and replaced with efficient SPARQL 1.1 queries.

4.5 Web Service integration

The query type and provider sections of the model can be transparently extended in future to allow scientists to patch into web services, to use them directly as sources for queries. A similar ontology based architecture, SADI [143], annotates and wraps up web services to query them as part of Federated SPARQL queries; however, it makes some assumptions that make it unsuitable as a replicable, context-sensitive, linked scientific data query model.

The SADI method relies on service providers annotating their web services in terms of universally used predicates and classes. SADI also requires queries to be stated using known predicates for all parts of the query, with at least one known URI to start the search from, as the system is designed to replicate a workflow using a SPARQL query. This prevents queries for known predicates being executed without knowledge of any URIs, and it also prevents queries that have known URIs without scientists knowing what predicates exist for the URI. SADI is designed as an alternative to cumbersome workflow management systems. In this respect it is successful, as the combination of a starting URI and known predicates is all that is required for traditional web services to be invoked, as they are specifically designed for operations on known pieces of information. SADI is limited in this respect compared to arbitrary Federated SPARQL interpreters that allow any combination of known operations, known information, and/or unknown information and operations.



In comparison to the separate namespace and normalisation rules model described here, SADI requires wrapper designers to know beforehand what the structure is for the URIs that all datasets use, with the resulting transformations encoded into the program. In comparison, the normalisation rules part of this query model only requires configuration designers to know how to transform the URIs from a single endpoint to and from the normalised URI model, and scientists can extend that knowledge using their own rules. This reduces the amount of maintenance that is required by the system, as URIs that are passed between services are normalised to a known structure as part of the output step for prior services, before being converted to the URIs that are expected by the next services.

If a scientist is able to locally host a SADI service registry, then they are able to trust each of the records. However, the SADI model does not contain a method for trusting data providers in the context of particular queries. It is important that scientists be able to trust datasets differently based on their queries, so that all known information may be retrieved in some contexts, but in other contexts only trusted datasets may be queried. Although this may come at the expense of scientists defining new templates for queries, the model described in this thesis is able to provide a definite level of trust for each query based on the scientist's profiles.

The SADI design could be useful as a way to integrate documents derived from the model; however, the assumption that all queries can be resolved in a Linked Data fashion is inefficient, and there needs to be an allowance for more efficient queries that do not follow the cell-by-cell model. SADI technically follows a cell-by-cell data retrieval model to answer queries, although some queries may be executed intelligently behind the scenes in Web Services to reduce the overall time, making the system efficient in particular circumstances, but not based on the system design. The model and prototype can both be used to perform efficient queries, although resolution of individual Linked Data URIs is also possible if scientists are not aware of a way to make a query more efficient.

4.6 Workflow integration

Workflow management systems provide a useful way to integrate different parts of a scientific experiment. Although they may not be used as often as ad-hoc scripts, they can make it easier to replicate experiments in different contexts, and they make it easier for scientists to understand the way the results are derived. The model and prototype were not designed to be substitutes for a workflow management system, as they are designed to abstract out the context specific details such as trust and data quality. By abstracting out the contextual details, each query can be replicated in future using the model configuration given by the replicating user, based on the original published configuration given by the original publisher.

An evaluation of a selection of workflows on the popular workflow sharing website myExperiment.org revealed that the majority of these scientific workflows have large components that are concerned with parsing text, creating lists of identifiers, pushing those lists into processing services, and interpreting the results for insertion into other processing services. These workflows are unnecessarily complex due to the requirements in these steps to understand each output format and know how to translate these into the next input format. Hence they are not as viable for reuse in the general scientific community as a system that relies on a generic data format. In addition, each of the data access steps contains hardcoded references to the locations of the data providers. Even if the data locations are given as workflow parameters, the alternative data providers would need to subscribe to the same query method. The query format assumptions are encoded in the preceding steps of the workflow, and the data cleaning assumptions for the given data provider are encoded in the steps following the data access.

In the model, the query format assumptions and data cleaning assumptions are linked from the relevant providers. If another provider is available for a query, but the query format is different, a new query type can be created and linked to a new provider. This new provider can reuse the data cleaning rules that were linked to from the original provider, or new rules can be created for this provider. Each of these substitutions is controlled using profiles, so the entire configuration does not need to be recreated to replicate the workflow. In addition, if the user only wants to remove data cleaning rules, providers, or query types, they only need to create a new profile and combine it with the original configuration. All of these changes are independent of the workflow, as long as the location of the web application is specified as a workflow parameter so that future users can easily change it.

Scientists can use annotations and provenance in scientific workflows to provide novel conclusions that can be used to evaluate the workflow in terms of its scientific meaning and the trust in the data providers. The model provides query provenance, and scientists can create query types to insert data provenance information into the result set of a query. These provenance sources give a workflow system a clear understanding of the origin of the data that it is processing.

In the model, the RDF syntax is used to specify provenance details. This is useful because it enables concepts to be described using arbitrary terminologies, without requiring a change to the model standard to support different aspects of provenance. Scientists can also combine provenance and results RDF statements to explain the effect of workflow parameters on the workflow results.

There are domain specific biological ontologies available that aim to formalise hierarchies of concepts; however, these are currently not well integrated with workflow management systems [61, 104, 124, 146]. Researchers have effectively proven that it is practical to integrate semantic information into electronic laboratory books, such as those by Hughes et al. [73] and Pike and Gahegan [109]. There have also been attempts to formally describe the experimental process for various disciplines, such as geoscience and the microarray biology community [142]. There have been uses of simple dataset identifier information to provide enhancements to workflows, with the most common being PubMed and Gene Ontology overlaps, where PubMed articles are described in terms of the Gene Ontology [44, 86, 132]. However, these systems do not involve the context of the scientist in the process, making it hard for scientists to replicate the results if they need to extend them or process their provenance.

Scientists can interpret the provenance of their data along with their results by including trusted, clean, structured data from various locations along with provenance records. The model and prototype described in this thesis allow for this integration using query types, and provide process provenance records in RDF that can be used to configure the prototype to replicate the results.

4.7 Case study: Workflow integration

The model links into scientific workflow management systems more easily than current systems due to its reliance on a single data model, where documents at each step are parsed using a common toolset. Although many workflow management systems do not include native support for RDF as a data format, the Semantic Web Pipes program [87] is designed with RDF as its native format, including SPARQL as a query language inside the workflow. It was used to demonstrate ways to use the results of queries on the prototype to trigger other queries based on the criteria that are specified in the workflow. A future version of the model could expand it into a workflow engine, but for the prototype it was decided to use an external tool for workflow purposes. The following case study examines the usefulness of an RDF based workflow management system to process data using the prototype as the data access system.

The basic input pattern that was used for these Semantic Web Pipes workflows contained a set of parameters, including the base URL for the user's instance of the prototype web application. An implementation of the model was used to resolve a query to an RDF document, as shown in Figure 4.3. The RDF document was then parsed into an in-memory RDF model, and SPARQL queries were performed to filter the useful information that required further investigation. In some cases where workflows were chained together, the results of the SPARQL query were used to form parameters for further queries on the prototype. The results of these further queries were included in the results of the workflow, as RDF statements from different locations can be combined in memory and output to form single documents. In some cases, the initial RDF triples were used as part of the output statements for the workflow. This makes it possible to keep track of what information was used to derive the final output, as the RDF model provides the necessary generic structure to enable different documents to be merged without any specific rules.

These small workflows were then linked together using the Pipe Call mechanism in the Semantic Web Pipes program to create larger workflows. This structure makes it possible to remove internal dependencies in the model on the way each query is implemented, so data quality and data trust can be changed in the prototype without having to change the workflow where possible.

Figure 4.3: Integration of prototype with Semantic Web Pipes

These workflows were used to evaluate the effectiveness of the prototype as a data access tool specifically for biologists and chemists, as they focused on biological and chemical scientific datasets. The workflows were integrated with the aim of verifying the way that biological networks are annotated and curated using literature sources, and of further discovering new networks that may not have been annotated previously, but which share common sources of knowledge.

The case study uses the Ecocyc dataset as the basis for the biological network knowledge about the E. coli bacterium. Ecocyc contains references out to the Pubmed literature dataset to verify the source of the knowledge. The Pubmed literature dataset is also used to annotate the NCBI Entrez gene information database. The Entrez dataset in turn is referenced from a large number of other datasets, such as the Uniprot protein dataset and the Affymetrix microarray dataset. The microarray dataset can be used to identify which methods are available for experimentally verifying which of the proteins from the Ecocyc dataset appear in a biological sample. Along with the links to Entrez gene, the Affymetrix dataset also includes links to Uniprot. These links are useful, as they seem from the results of the workflow to be highly specific, although the actual level of trust could be given by a scientist, who could use either this set of links or the set of links from Entrez gene to further identify the items. The Ecocyc dataset is also annotated using the Uniprot dataset, which provides a method of verifying the consistency of the data that was discovered using the Ecocyc dataset originally. In the Ecocyc dataset, the Uniprot links are given from the functional elements, such as genes and polypeptides (i.e., proteins).

The workflow was used with different promoters, and the results were manually verified to indicate that the results were scientifically relevant. The results generally contained proteins which were known to be in the same family, as the workflow was structured to select these relationships in favour of more general relationships. However, the resulting proteins were known to be experimentally verifiable, and the biological networks could be considered in terms of these relationships. Although the results of this workflow may be trivial, they highlight the ease with which these investigations can be made compared to non-RDF workflows. In traditional workflows, each of the data items must be manually formatted to match the input for a service. The results must also be parsed using knowledge of the expected document format, including the need to parse link identifiers, or have a special service created to perform the dataset link analysis and give these identifiers as simple values, and any data normalisation processes must be created in a way that suits the data format.

In the previous case study, the links between data items were used for exploration, so it was suitable for a scientist to identify relevant links at each step of each case. In comparison, the use of the workflow in this case enables a scientist to repeat the experiment easily, with different inputs as necessary. The workflows in this case were created in a way that allowed scientists to create a workflow for a single purpose, and then use it in another workflow by linking to it. The relevant information from one step was processed into a list of parameters for the next step using an embedded SPARQL engine. The relevant inputs and results were then combined in the output as a single RDF document.

The model and prototype were used to enable the workflow to avoid issues such as data quality and remote SPARQL queries. These factors are difficult to replicate and maintain if they are embedded into the workflow. Each of the workflows includes a parameter that is used as the base URL for queries. This means that users do not have to utilise a globally accessible implementation of the model if they have one locally installed. This ensures that scientists can reliably replicate the workflows using their own context, if they have the datasets locally installed, with local endpoints and the local implementation used in their workflows as guided by their profile settings.

The workflow was used to verify the consistency of the statements about a particular promoter, known as “gadXp” 51, that is thought to be involved in several Ecocyc regulatory relationships. This promoter is annotated with a reference to a single Pubmed article, titled “Functional characterization and regulation of gadX, a gene encoding an AraC/XylS-like transcriptional activator of the Escherichia coli glutamic acid decarboxylase system”.

This article was used to identify 6 Entrez gene records which were annotated as being related to the article, using the query form “/linksns/geneid/pubmed:11976288” 52. The results of this query contained a set of RDF statements that indicated which genes were related to the article. These genes were matched against known sets of probes that could be experimentally verified with microarrays, using a query similar to “/linksns/affymetrix/geneid:NNNNNN”, where NNNNNN is one of the genes that were identified as being sourced from common literature.
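
As a minimal sketch of how such a query form can be resolved programmatically (assuming the Python rdflib library, the N3 format prefix described in Chapter 5, and a configurable base URL), the gene lookup step could be scripted as follows:

    import rdflib

    BASE_URL = "http://bio2rdf.org"  # swap for a local prototype instance

    def resolve(query_path):
        """Resolve a prototype query path and parse the N3 response."""
        graph = rdflib.Graph()
        graph.parse(f"{BASE_URL}/n3/{query_path}", format="n3")
        return graph

    # Genes annotated as related to the Pubmed article 11976288.
    genes = resolve("linksns/geneid/pubmed:11976288")

    # Collect any URIs in the geneid namespace, in whichever triple
    # position they appear.
    gene_uris = {term for triple in genes for term in triple
                 if str(term).startswith(f"{BASE_URL}/geneid:")}
    print(sorted(gene_uris))

Because the base URL is a parameter, the same fragment can be pointed at a local installation of the prototype, mirroring the base URL parameter used by the workflows.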

Each of the 8 microarray probes that were returned by this query was then examined specifically with reference to any Uniprot annotations it contained, to avoid transferring the entire documents across the network. This was performed using the query form “/linkstonamespace/uniprot/affymetrix:XXNNNNNN”, where each of the Affymetrix URIs was a microarray probe that matched from the Entrez gene set.

The 11 Uniprot links were then matched back against Ecocyc to determine which Ecocyc records were relevant. If the datasets were internally consistent, the resulting 8 Ecocyc records, and the Uniprot records they are linked to, should refer to similar proteins. In this example, the Ecocyc and Uniprot records all referred to proteins from the “glutamate decarboxylase” family, which gadXp is a member of.

51 http://bio2rdf.org/ecocyc:PM0-2441
52 http://bio2rdf.org/linksns/geneid/pubmed:11976288



Each of the Ecocyc and Uniprot records was then examined in the context of all of the known datasets to identify references that could be used by the scientist to verify the relevance of each record. This step was performed using the query form “/links/namespace:identifier”, where namespace was either ecocyc or uniprot, based on the set of proteins that were identified.
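
The consistency check itself reduces to set logic over the link sets. The following sketch uses the same assumed rdflib-based resolution; the record identifiers are hypothetical placeholders standing in for the records found in the earlier steps:

    import rdflib

    BASE_URL = "http://bio2rdf.org"  # swap for a local prototype instance

    def links_for(record):
        """Fetch all URIs mentioned in the /links result for a record."""
        graph = rdflib.Graph()
        graph.parse(f"{BASE_URL}/n3/links/{record}", format="n3")
        return {str(term) for triple in graph for term in triple
                if isinstance(term, rdflib.URIRef)}

    # Hypothetical placeholders for the records found in the earlier steps.
    ecocyc_records = ["ecocyc:EG10361", "ecocyc:EG10362"]
    uniprot_records = ["uniprot:P69908", "uniprot:P69910"]

    # The datasets are consistent if each Ecocyc record links to at least
    # one of the Uniprot records discovered through the Affymetrix probes.
    for record in ecocyc_records:
        linked = links_for(record)
        consistent = any(f"{BASE_URL}/{u}" in linked for u in uniprot_records)
        print(record, "consistent" if consistent else "no shared links")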

There may be cases where the Pubmed article was not specific to the promoter and contained a large number of irrelevant references to genes in the Entrez dataset. In these cases the workflow would need to be redesigned to verify the references using another strategy, such as one based on Uniprot references along with other datasets. Although the typical usage of the prototype is to resolve queries about single entities, there is no restriction on queries that combine multiple entities in a single query. Combining entities would reduce the time taken to resolve the query, as a large part of the resolution time for distributed queries on the Internet relates to the inherent latency in the communication channels.

4.7.1 Use of model features

The workflow case study uses the basic query operations outlined in Section 4.4.1, along with others described here, where they are useful for further optimising the way the scientist constructs and maintains the workflows related to their research.

References to other datasets

The URI model does not require that the leading part of a URI can be used to identify the namespace, although this assumption is used by other systems such as VoiD. This assumption is necessary if scientists want a simple method of determining which references in a data item are located in particular namespaces. This is an optimisation, but it is necessary to avoid having to resolve an entire document, and every reference in it. The optimisation is useful if the namespace can be identified from the leading part of the URI, and the URI can be normalised using this leading part.

The data normalisation rules implemented in the prototype can be used to identify links to other namespaces if a partial normalised URI, e.g., http://bio2rdf.org/namespace:, can be transformed to the prefix of the unnormalised URI. This is necessary, as the user does not know the identifier for any of the references that they are searching for.
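
A minimal sketch of this prefix transformation, assuming a hypothetical mapping from a normalised Bio2RDF prefix to an endpoint-specific prefix, might look like the following:

    # A sketch of prefix-based link identification, assuming a hypothetical
    # mapping from normalised Bio2RDF prefixes to endpoint-specific ones.
    PREFIX_MAP = {
        "http://bio2rdf.org/geneid:": "http://example.org/raw/gene/",
    }

    def namespace_of(uri):
        """Identify the namespace of a reference from its leading part."""
        for normalised in PREFIX_MAP:
            if uri.startswith(normalised):
                return normalised
        return None

    def unnormalise(uri):
        """Rewrite a normalised URI into the endpoint-specific form."""
        for normalised, raw in PREFIX_MAP.items():
            if uri.startswith(normalised):
                return raw + uri[len(normalised):]
        return uri

    print(namespace_of("http://bio2rdf.org/geneid:917300"))
    print(unnormalise("http://bio2rdf.org/geneid:917300"))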

In cases where a data item contains references that can be identified easily, this query is useful for limiting the amount of information that needs to be transferred. This was used in the workflow to limit the amount of information, as otherwise the complete Affymetrix records would have been required to determine their relevance to the workflow.

An alternative to this method is to recognise the predicate, and rely on the predicate to optimise the amount of information that needs to be transferred. This alternative can also be used with services that do not support partial URI queries, but do support queries that can be optimised based on the relationship between the record and its references.

Human readable label for data

The references in linked datasets are given as plain, anonymous identifiers. This is not useful for scientists, as they need a human readable label to recognise what the reference refers to. This functionality is necessary to avoid retrieving the entire documents describing each of the references in a document just to display the most informative interface for a scientist.

Although this query may be optimised to avoid transferring entire records, it is also possible to filter an entire record using deletion rules. This is particularly relevant if the linked dataset is only accessible using Linked Data HTTP URI resolution, and the results need to be restricted to simple labels for the purposes of the query.

In the workflow, this query was used to identify Affymetrix records, as the entire record was not required for any part of the workflow, and the other queries that were performed on the Affymetrix dataset did not return labels. Including human readable labels was nevertheless useful for scientists who wanted to manually verify the results.

In the Bio2RDF configuration, this query is implemented as “http://bio2rdf.org/label/namespace:identifier”, where “namespace:identifier” is the identifying portion from the normalised URI, “http://bio2rdf.org/namespace:identifier”. In the example, the label for the Affymetrix data item identified by “http://bio2rdf.org/affymetrix:gadB_b1493_at” is found by resolving the URL “http://bio2rdf.org/label/affymetrix:gadB_b1493_at”.
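
A usage sketch, again assuming rdflib and the N3 format prefix, shows how the label query result can be consumed:

    import rdflib
    from rdflib.namespace import RDFS

    # Resolve the label query for the Affymetrix probe used in the case
    # study, requesting the N3 format.
    graph = rdflib.Graph()
    graph.parse("http://bio2rdf.org/n3/label/affymetrix:gadB_b1493_at",
                format="n3")

    for subject, label in graph.subject_objects(RDFS.label):
        print(subject, "->", label)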

Provenance record integration

Scientists can use the provenance information in their workflow to support decisions based on which datasets and query types were used. There are many different cases where provenance information is useful and necessary, but in the context of distributed queries, the provenance record needs to contain at least the locations where the queries were performed, what the queries in each location were, and which datasets were potentially part of the results based on the queries and the data providers. The provenance record can then be used to determine the legitimacy of the query, including any trust and data quality metrics that can be created using annotations on the query types, data providers, and normalisation rules.

The prototype exposes the provenance record using the prefix “/queryplan/”. For example, the provenance for “http://bio2rdf.org/geneid:917300” can be found using the URL “http://bio2rdf.org/queryplan/geneid:917300”. In the workflows in this case study, these records were used to determine the source of statements, and the query types that were used, to provide information about the link between the Entrez gene and Uniprot datasets, including the version of each dataset that was used. If there are multiple versions in use, the information may not be trustworthy, as any deletions or fixes that were provided between the versions may not be visible. If the versions do not match the most current versions, the datasets may also not be trustworthy.

The RDF statements in the document were then filtered on the basis of a Dublin Core version property, “http://purl.org/dc/terms/version”, pointing to the release version of the dataset that was used. If data provenance information is to be included, it needs to be retrieved using a separate query type, as the provenance record is designed to rely only on information that was in the model configuration of the server that resolved the query, to provide replicability down to the level of dataset versions. In the context of the model, data provenance is informal information about how a particular fact was derived, rather than how it was accessed. Current provenance models focus on integrating data provenance chains into the data store; this is possible with the model, as it aims to be independent of datastores so that queries can be replicated in future even if a particular datastore is not available. Although data provenance is not the focus of this research, there are numerous studies available [34, 35, 95, 119, 149]. For this research, scientists need to choose which queries they want to record the provenance for, and retrieve, store, and process those provenance records before processing the information further.
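
The version filtering step can be sketched as follows, assuming rdflib, that the format prefix composes with the queryplan prefix as described in Chapter 5, and that elements of the query plan carry the Dublin Core version property named above:

    import rdflib

    VERSION = rdflib.URIRef("http://purl.org/dc/terms/version")

    # Fetch the provenance record (query plan) for a resolution, then list
    # the dataset versions declared by each element of the plan.
    plan = rdflib.Graph()
    plan.parse("http://bio2rdf.org/n3/queryplan/geneid:917300", format="n3")

    for subject, version in plan.subject_objects(VERSION):
        print(subject, "declares dataset version", version)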

4.7.2 Discussion

The workflow management system was used to process and integrate different queries on the prototype. The example workflows demonstrated the context-sensitive abilities of the model, which make it simple to personalise the data access operations, including the use of different data normalisation rules and locations. They also demonstrated the way the prototype could be used to make components of the overall query more efficient when the scientist was aware of the nature of the data that was required, without requiring them to be aware of this aspect before starting to design the workflow. The workflow was used to identify the provenance of some queries, although the size of the provenance records made it difficult to keep this information throughout the entire workflow.

The prototype made it simple to change the context by including a reference to its base URL in each workflow. In contrast to other methods, such as direct access to Web Services, the base URL makes it possible for scientists to change the location of the data, and to normalise the data in the model to match the structure of the statements that are used by the workflow. With direct access, scientists would have to change the workflow to match a new set of Web Services if there were multiple sources of data, and encode each data normalisation rule into each workflow that required access to the Web Service.

The datasets and query definitions that were used were not identified at every stage, as the provenance records were found to be very large in some cases where a large part of the configuration was applicable to each query. The overall configuration file for the prototype was in the range of 2-3 MB when represented using RDF/XML. A global links query, as was performed in the last step of the case study, returned provenance records that were about 500 KB in RDF/XML. The other provenance queries returned documents ranging from 12 KB, for the query plan for examining references to uniprot in affymetrix records, to 100 KB for uniprot and ecocyc record resolutions, as the Uniprot dataset contains a large number of links to other namespaces. The size of the provenance records for ecocyc records was increased by a number of automatically generated data providers that were created to insert links to HTML and other non-RDF representations of records. The size of the provenance record is proportional to the number of data providers, RDF normalisation rules, and namespaces that were relevant to each query.

The workflow, along with the HTTP URIs that form the identifiers for data in the workflow, was useful for linking the relevant services. It was useful for abstracting the details of the queries from the scientific experiment, so that the queries on the datasets could be changed, or optimised, without the scientist needing to change their workflow. A scientist could change the data providers, including new normalisation rules to fit the data quality expected by the workflows. They could redirect the location of the prototype that they used by changing a single parameter in the workflow, making it possible to change a public interface while providing the old interface for select purposes. The scientist could also choose to extend the workflow further by following the pattern used by the workflow to create new queries.

4.8 Auditing semantic workflows

Scientists can use the provenance information along with their results to audit the data sources, queries, and filters that were used to produce their results. The provenance model needs to be taken into account to validate the design of a workflow, in terms of whether the queries were successful and accurate. Although the provenance model does not directly provide a proof, it can be used along with review methods to validate the set of queries, and the datasets that were accessed. Assertions based purely on provenance models are separated in this research from assertions which investigate the scientific meaning which provided the motivation for the workflow.

Useful optimisation conclusions based on both the syntactic and semantic levels are available as part of the provenance model. However, review methods were not designed as part of the model, other than to provide a simple flag to confirm that the configuration for a query type or data provider had been curated. The provenance model is based on the RDF configuration model, and as such is designed to allow for extensions, including new relationships between items in the model, while keeping backward compatibility. Any new relationships that use Linked Data HTTP URIs in RDF would be naively navigable, making it possible for scientists to use future provenance records in a similar way to their processing of the current model. The current model provides URI links from Providers to Query Types, Namespaces, and Normalisation Rules; links from Query Types directly to Namespaces; and links from Profiles to Providers, Query Types and Normalisation Rules.

If a scientist did want to make workflow optimisations, they could use either of two different methods: one based on the number of observed statements about a specific item, and one based on a common semantic link between items. The first method is more suited to purely workflow based provenance, which attempts to streamline workflow executions based on knowledge about failures of specific processing tasks, the length of time taken to execute a specific workflow processor, or the amount of data which has to be transferred between different physical sites to accommodate grid or distributed processing. The second method relies on knowledge about the scientific tasks or data items which are being utilised by each of the workflow designs and executions which are under consideration by the auditing agent. The common link may refer to an antecedent item which was related to the current item by means of a key of some sort, as opposed to relating two items which were both processed by a given processor and were ontologically similar.

Even without reasoning about the meaning of information derived from the semantic review metadata, any data access methods embedded in the metadata can be utilised to provide a favourites or tagging system which can quickly identify widely trusted data providers and query types. This system may be integrated with other tagging systems, using the model to access multiple data providers that could provide tags for an item. In a scientific scenario where scientists are performing curation on data, for example gene identification in biology, the tagging could be used for both data and process annotations. An example of how this process may be designed can be seen in Figure 4.4.

With respect to the first set, it is possible that specific semantic tags can be accommodated within a provenance ontology, an example of which may be found in the myGrid provenance ontology 53. The myGrid ontology is not directly suitable, as it assumes that the workflow model being represented fits within the SCUFL model, as shown by its use of DataCollection elements, which form the basis for the distinction between single data elements and multiple data elements being the output from a SCUFL workflow. The most notable reason for not using the myGrid ontology, as is, for extended research into the topic, is that it forms a number of key provenance elements using strings, where the model is able to give unambiguous identifiers to each item, and allow these to be referenced using URIs.

In contrast to the idea that only one ontology is acceptable in each domain, multiple ontologies can exist, each referencing the other where applicable and sensible. This is particularly relevant where a process provider may have a description of their server published using an ontology which is compatible with one workflow system, but many parts of the description can also be represented the same way under another ontology. If this were the case, and someone published a structure outlining the relationships between the two ontologies, then a reasoner could present the common pieces of information for both, even though the scientist did not anticipate this. This information is also useful if the provider of the original server did not publish semantic descriptions of their process, requiring an external provider to publish the descriptions and keep them up to date. These external descriptions must then be relied upon to provide accurate information in relation to auditing a given provenance log. The model provides the basic elements that scientists can use to analyse a provenance log to determine the range of sources that were used.

53 http://www.mygrid.org.uk/ontology/provenance.owl



[Figure 4.4: Semantic tagging process. A scientist (http://mquter.qut.edu.au/user/scientistA) collects samples, selects an interesting region of a sequence, annotates the region with a tag, gives the region extended attributes, and gives the tag a meaning and extended semantic attributes, linking resources such as http://mquter.qut.edu.au/tags/xba2, http://semwiki.mquter.qut.edu.au/xba2, http://dbpedia.org/resource/xba2 and http://moat.mquter.qut.edu.au/tag/xba2, together with search, workflow and provenance URIs for each step.]



4.9 Peer review

Direct access to scientific data is increasingly being required by peer reviewers to verify the results given in proposed publications, as the data is not easily replicable for disciplines where the experiment relies on data from a particular animal or plant, for example. This peer review process, shown in Figure 4.5, is predicated on the independence of the peers, and in some cases the authors are unknown, although in many fields it may not be difficult for peers to identify the authors of a paper based on the style and references. The transfer of this data to the peers in a completely transparent process would require the data to be transferred to the publisher, with the peers receiving the information in an anonymised form from the publisher.

The model is useful for this process, as the namespace components of the URIs can be transparently redirected to the publisher's temporary server or to the peers' copy of the data, assuming that the publisher and peers require a detailed review of the as yet unpublished data. Although it is useful to provide temporary access to unpublished datasets, the dataset may eventually need to be published along with the publication. At this point, a final solution must be given to support the relevant queries, with a minimal number of modifications to the original processing methods.

[Figure 4.5: Peer review process. The peer reviewer reads the published material, analyses previous experiments, critiques the validity of the hypothesis and the design of the experiment, replicates the experiment if needed, and verifies the analysis process; the peer reviewers return their opinions to the journal editor, who decides whether to publish the work; the article is then published as part of a journal issue (electronic and/or paper), followed by citations and post-publication reviews by peers.]

In some disciplines, the dataset may be either too large to individually publish each item, or the publication may refer to evaluations of public data. In these cases, the publication may provide a set of HTTP URI links to documents which can be used to explore the information. This ability is not new, as DOIs have been used for this purpose already, but the ability to link directly into the queries, combined with the ability of both peer reviewers and future informal reviewers to get information about the component queries that were used, could be a valuable part of the publication chain in disciplines such as genomics. It would be most valuable if there is a known relationship between the items in the query results and other data items, as is encouraged in this thesis by the use of HTTP URIs and RDF documents.

The prototype described in Chapter 5 publishes the details and results of individual queries using HTTP URIs, for example, http://bio2rdf.org/myexp_workflow:656. Related HTTP URIs are available for data access provenance records, for example, http://bio2rdf.org/queryplan/myexp_workflow:656. Higher level provenance details are handled by other systems, such as a workflow management system, that are suited to the task. For example, the MyExperiment website provides RDF descriptions and HTTP URIs for each of the workflows and datasets that are uploaded to their site, including the original URI for the workflow referred to earlier in the paragraph, http://www.myexperiment.org/workflows/656.

4.10 Data-based publication and analysis

In order for computers to be utilised as part of the scientific process, data needs to be in computer readable forms [98]. A major source of communication for scientists is published articles. As these publications restrict the amount of information that can be included, scientists have developed text mining algorithms to interpret and extract the natural language data from published articles [38, 49, 90, 126, 134]. However, recently a few journals and conferences have started supporting the use of computer readable publications, which include links to related data, and in some cases include the meaning of the text and links [128]. The use of semantic markup on text and links to related data provides opportunities for scientists to automate otherwise manually organised processes. The model is designed to take advantage of these extensions to provide scientists with integrated methods of solving their scientific questions using the interactions shown in Figure 4.6.

Marking up data, and its use in automating processes, presents many challenges that were not present in the initial world wide web, where the emphasis was to make data computer readable and able to be distributed between locations. Making data readable simply requires a specific format based on a few simple categories of data. For instance, true and false facts from boolean logic, integers and real numbers from mathematical theory, and textual forms from natural language are sufficient as a basis for data readability. Other forms such as image and spatial data can be represented using these basic forms, although they will take more computational power to process and integrate.

The creation of semi-accessible data using plaintext metadata creates a quasi-semantic form which may be easier for scientists to personally understand, but does not ensure that a computer will be able to unambiguously utilise the data [2].



[Figure 4.6: Scientific publication cycle. A scientist collects data experimentally, uses publicly accessible external datasets and local datasets, reads and integrates previously published work, makes conclusions and writes them up for publication, links to datasets in publications, cites previously published material, collaborates with other scientists to review publications, and publishes papers in the area.]



This plaintext metadata may come in the form of dataset fields or file annotations; however, these are difficult to relate intuitively to the real world properties that motivated the initial hypothesis. The creation of structures that group and direct the associations between metadata elements is the final pre-requisite for making clear decisions. Such decisions may be based on the domain structures that the computer can access, possibly using structures typically associated with scientific ontologies [14, 20, 22, 32, 77, 79, 84, 101, 104, 135, 146]. Although there may be disagreements about the use of data, semantic markup fixes the original meaning while allowing the scientist to interpret the data using their own knowledge of the meaning of the term [97, 109].

Although a computer will then be able to utilise the data more effectively, the domain ontologies may still not be universally relevant, due to a lack of world knowledge introduced by their creators [122]. This lack is an inherent limitation to be overcome through cross-validation and evolution of the domain ontologies in an iterative manner, based on previous attempts to use the structures. The ability to gradually evolve, and normalise, the use of domain ontologies is provided through the context sensitive normalisation rules in the model. The ontologies that are used in different locations can be normalised where possible to give a simple basis on which to further utilise data, without having to deal with complex mapping issues in each processing application.

References to data items need to be recognised by scientists regardless of the format, to remove the barriers to querying data based publications. This may not be a simple exercise for the scientist, as there may be different URI formats that are used in different locations to identify an item. The model and prototype provide simple ways to represent data references, and to resolve references from multiple locations. The prototype requires the use of the RDF format to integrate data from different locations; however, RDF does not have to be presented in specialised RDF files. The RDFa format allows scientists to mark up existing XML-like documents with structured content. This allows a transparent integration with current web documents, which allows scientists to present and publish documents, and maintain structured data in the same document, with the allowance for structured links to other documents using Linked Data principles.

The use of the model with RDFa is implemented in the HTML results pages generated by the prototype. The documents contain valid RDFa; however, the dynamic nature of the documents generated by the model makes it hard to take advantage of some of the advanced syntactic features of RDFa. RDFa is designed to allow for efficient manual markup of XML documents using compulsory shorthand namespace declarations, and the dynamic generation of an entire automatically generated document reduces the usability of the information, as the namespaces must be artificially generated to fit with the RDFa design principles.

As part of the publication process, scientists are required to give details about the methods they used to generate their results. Using the model, they are able to publish both the process and data details using a combination of workflow file formats and links to the relevant provenance records. Using these methods, scientists can integrate less curated data and workflows, without requiring methods of re-interpreting some pieces of data using statistical or interpretative processes such as text mining or other forms of pattern analysis. It may be necessary for scientists to continue to use these methods in some cases, including for past publications, but the future use of semantic annotations on publications, in relation to data, processes, and results, will increase the scalability of the analysis processes.

Semantically marked up publications contain links between textual language and defined concepts, which were previously not available, and could not previously be used to validate or mediate the analysis process. These links can easily be generated using the model as the data access method, and the resulting documents can be simply integrated with other RDF documents using standard RDF graph merging semantics.

The results may need to be derived through the use of a program that is not simple to represent in workflow terminology, such as a Java, C# or C++ program. In these cases, the prototype could still be used to resolve queries on datasets, given the wide range of tools available to resolve and process HTTP requests and RDF documents. However, the results may not be as simple to replicate as when the programs are represented in other formats which can be easily transported between different scientists.




Chapter 5

Prototype

5.1 Overview

A prototype web application was created to demonstrate the scientific usefulness of the data access model described in Chapter 3. The prototype was implemented as a stateless, dynamic, configuration driven web application that relies on RDF as its common document model. It was configured to provide access to a large number of scientific datasets from different disciplines, including biology, medicine and chemistry. It follows the Linked Data principles in making it possible for scientists to construct URIs using the HTTP protocol, and to make the URIs resolvable to useful information, including relevant links to other Linked Data URIs that can be resolved using the prototype. This enables it to be integrated easily with the cases described in Chapter 4.

The prototype makes it possible for scientists to access alternative descriptions of published dataset records using their own criteria, including adding and deleting statements from the record. HTTP URIs that identify records in another scheme can be resolved using the prototype, with normalisation rules converting the URIs to locally resolvable versions. Scientists can then publish their own HTTP URIs that are resolved using their instance of the prototype. In terms of provenance, scientists are able to control the versions of the datasets that are used by the prototype, so they can then publish URIs that point to their instance and be able to personally ensure that the data in the resulting records will not materially change without changes to the configuration of the prototype. For example, a scientist may resolve http://localhost/prototype/namespace:identifier, and the prototype may be configured to interpret that URL as being equivalent to http://publicprototype.org/namespace:identifier. This functionality is necessary for scientists to be able to customise and trust their data, without having to publish their documents using the official globally authoritative URI for each record if it would not be resolved using their instance of the prototype.
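
A sketch of this equivalence, as a simple prefix rewrite using the two example URIs above:

    # A minimal sketch of the URI equivalence described above: a local
    # instance of the prototype treats its own URLs as equivalent to the
    # public ones.
    LOCAL_PREFIX = "http://localhost/prototype/"
    PUBLIC_PREFIX = "http://publicprototype.org/"

    def to_public(uri):
        """Interpret a locally resolved URL as the equivalent public URI."""
        if uri.startswith(LOCAL_PREFIX):
            return PUBLIC_PREFIX + uri[len(LOCAL_PREFIX):]
        return uri

    assert (to_public("http://localhost/prototype/namespace:identifier")
            == "http://publicprototype.org/namespace:identifier")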

The prototype was used extensively on the publicly accessible Bio2RDF website, allowing biologists and chemists to retrieve data from many different linked scientific datasets. The prototype software was released as Open Source Software on the Sourceforge site 1. Scientists can download this software and customise it with their own domain names to provide access to their local datasets without reference to the Bio2RDF datasets.

The prototype accepts and publishes its configuration files in all of the major RDF formats. This makes it possible for scientists to reuse both the Bio2RDF configuration, containing biology and chemistry datasets, and any other configuration sources as part of their implementation, so that they can avoid reconfiguring access to these public data providers. If a scientist has access to alternative data providers or query types compared to the Bio2RDF configuration, they can use the profile mechanism to selectively ignore the data providers that they are replacing, without first having to remove them from the configuration document.

Scientists are able to share the details of their query with other scientists using the RDF provenance document for the query. The prototype provides access to the provenance record for any query using URIs similar to http://localhost/prototype/queryplan/namespace:identifier. The provenance record contains all of the relevant configuration information that the prototype relied on to process the query. It enables other scientists to replicate the results without requiring any further configuration. This makes it possible to set up an instance of the prototype using provenance records as configuration sources, possibly without having access to the entire original configuration that was initially used. This ability provides for direct replication of published experiments wherever possible.

The prototype can be configured to access data using both internal and external access methods, depending on the context of the user. Scientists can use this functionality to provide limited access to their results, using the prototype as a proxy. Peers can replicate the experiment on the same datasets to review the results, without the scientist having to expose direct access to their organisation's entire database. The profiles, and the profile directives on each query type and provider, are used by peers to distinguish between the data providers that they can access, as opposed to those that the scientist used internally.

5.2 Use on the Bio2RDF website

The prototype was used as the engine for the Bio2RDF website, with an example shown in Figure 1.13. In this role, it was used to access both Bio2RDF and non-Bio2RDF hosted datasets. Queries on the Bio2RDF website are normalised using the Bio2RDF URI conventions, with the exception of equivalence references in results that link to non-Bio2RDF Linked Data URIs.

The prototype configuration created for the http://bio2rdf.org/ website contains references to 1603 namespaces, 103 query types, 386 providers and 166 normalisation rules. The Bio2RDF configuration provides links out from the RDF documents to the HTML Web for 142 of the namespaces.

1 http://sourceforge.net/projects/bio2rdf/



These providers contain URL templates that are used to create URLs for the official HTML version of all identifiers inside the namespace. The Bio2RDF namespaces can be resolved using 12921 different permutations of query types and providers. The Bio2RDF configuration can be found in N3 format at http://config.bio2rdf.org/admin/configuration/n3.

Figure 5.1 shows an example of the steps required for the prototype to retrieve a list of labels for the Gene Ontology (GO) [62] item with identifier “0000345”, known as “cytosolic DNA-directed RNA polymerase complex”. It uses the Bio2RDF configuration and illustrates the combination of a generic query type with a query type that is customised for the GO dataset. The queries are designed so that the generic query can be useful on any information provider, while the custom GO query will be restricted to providers that contain GO information, because it uses an RDF predicate that only GO datasets contain. If another dataset was available to retrieve labels for GO terms using a different query, then a custom query definition could be added in parallel to these two queries without any side effects. Further examples of how the prototype is used on the Bio2RDF website can be found in Section 4.4.1 and Section 4.7.1.

The prototype made it possible for the Bio2RDF website to easily include and extend data from other RDF based scientific data providers. This made it possible to perform advanced queries on datasets using normalised, namespace-based URIs to locate datasets and distribute queries.

The prototype implements some URI patterns that are unique to Bio2RDF, and these would need to be changed by modifying the URL rewriting configuration file inside the prototype. The query is interpreted by the prototype, and the actual query passed to the model excludes the file format, whether the query is requesting a query plan, and the page offset for the results. As these are prefixed in a known optional order, scientists can always consistently identify the different instructions to the prototype for each query, independent of their query types.
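
The instruction prefixes can therefore be stripped with a single anchored expression. In the sketch below, the format and queryplan tokens come from the examples in this chapter, while the page offset token is an illustrative guess:

    import re

    # Strip the optional prefixed instructions (format, query plan, page
    # offset) from a resolved path, leaving the actual query for the model.
    PATTERN = re.compile(
        r"^(?:(?P<format>rdfxml|n3|html)/)?"
        r"(?:(?P<queryplan>queryplan)/)?"
        r"(?:pageoffset(?P<offset>\d+)/)?"
        r"(?P<query>.+)$")

    match = PATTERN.match("n3/queryplan/label/go:0000345")
    print(match.group("format"))     # n3
    print(match.group("queryplan"))  # queryplan
    print(match.group("query"))      # label/go:0000345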

5.3 Configuration schema

The configuration documents for the prototype were created based on a schema that can be found at http://purl.org/queryall/. The schema gives the definitions for each of the schema URIs, along with whether an RDF object or a literal is expected for each property. A simple configuration that was created using the model can be seen in Figure 5.2.

In order for the configuration schema to evolve without interrupting past implementations, the prototype was built to understand different versions and to process them accordingly. The prototype schema went through 4 revisions as part of this research, with each implementation able to provide backwards compatible documents based on schemas in previous versions. In some revisions, forwards compatibility with new versions was affected.

Forwards compatibility is important in order for scientists to be able to replicate queries using current software where possible.



[Figure 5.1: URL resolution using prototype. The figure traces the resolution of http://bio2rdf.org/n3/label/go:0000345: the prototype splits the URL into the host (http://bio2rdf.org/), the response format (n3/, i.e., RDF N3) and the user query (label/go:0000345). Two query types match the regex ^label/([\w-]+):(.+): the GO-specific http://bio2rdf.org/query:labelsearchforgo, which is only useful for the namespace http://bio2rdf.org/ns:go, and the generic http://bio2rdf.org/query:labelsearch, which is useful for all namespaces. Both identify the namespace prefix "go" from matching group 1, select the provider http://bio2rdf.org/provider:mirroredobo after a namespace check, and query one of its endpoint URLs, the SPARQL endpoint http://cu.go.bio2rdf.org/sparql, using the SPARQL template from the query type. A normalisation rule (http://bio2rdf.org/rdfrule:ontologyhashtocolon) standardises colons instead of hashes in the /ns/ ontology terms, changing <http://bio2rdf.org/ns/go#> to <http://bio2rdf.org/ns/go:> in the results. The combined statements, giving rdfs:label "cytosolic DNA-directed RNA polymerase complex [go:0000345]", ns2:name and dc:title "cytosolic DNA-directed RNA polymerase complex", are parsed into an in-memory RDF database and returned to the user in the requested N3 format.]



@prefix query: <http://purl.org/queryall/query:> .
@prefix provider: <http://purl.org/queryall/provider:> .
@prefix profile: <http://purl.org/queryall/profile:> .
@prefix : <http://example.org/> .

:myquery a query:Query ;
    query:inputRegex "(.*)" ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

:myprovider a provider:Provider ;
    provider:resolutionStrategy provider:proxy ;
    provider:resolutionMethod provider:httpgeturl ;
    provider:isDefaultSource "true"^^<http://www.w3.org/2001/XMLSchema#boolean> ;
    provider:endpointUrl "http://myhost.org/$input_1" ;
    provider:includedInQuery :myquery ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

Figure 5.2: Simple system configuration in Turtle RDF file format

For example, in the first revision, the RDF Type definitions for a provider did not include which protocols it supported, as these were only given in the resolution method. However, the current version requires the RDF Type definitions in order to recognise the type before parsing the rest of the configuration, independent of the resolution method. In this case, a scientist would need to add the RDF Type to each of the providers before loading the configuration when they wished to replicate queries using that version. In general, any new features that are added to the model may affect forwards compatibility if the default values for the relevant features are not accurate for past implementations.

The configuration schema is defined using RDF so that the same query engine can be used to query the configuration as the one that parses and queries the scientific data. In addition, the use of RDF enables previous versions to ignore elements that they do not recognise without affecting the other statements. This enables past implementations to highlight statements that they did not recognise, to flag possible backwards compatibility issues with new configuration schemas, while enabling them to function using a best effort approach.
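
Because the configuration is plain RDF, it can be queried with the same machinery as the data. A sketch, assuming the configuration from Figure 5.2 has been saved locally as config.ttl and that rdflib's SPARQL support is available:

    import rdflib

    # Load the configuration document and query it like any other dataset.
    config = rdflib.Graph()
    config.parse("config.ttl", format="turtle")

    results = config.query("""
        PREFIX provider: <http://purl.org/queryall/provider:>
        SELECT ?provider ?endpoint
        WHERE {
            ?provider a provider:Provider ;
                      provider:endpointUrl ?endpoint .
        }""")
    for provider, endpoint in results:
        print(provider, endpoint)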

If the configuration is dynamically generated using the prototype, a particular configuration schema revision may be requested, and the prototype will make a best effort to translate the current model into something that will be useful to past versions. For example, the Bio2RDF mirrors all request their current implemented version from the configuration server, even if the configuration server supports a newer version. However, the use of RDF makes it possible to include new features in the returned documents without affecting old clients, which are required to flag and ignore unrecognised properties in configuration documents.

Backwards and forwards compatibility issues have several implications for scientists. If a query provenance record, including the relevant definitions from the overall configuration, is published as a file attachment to a publication, it will use the current configuration version. The query provenance record contains a property for each of the query bundles that were relevant to the query, indicating which schema version was used to generate that bundle. Scientists can use this information to accurately identify a complying implementation to execute the query provenance record against when they wish to replicate it. It is recommended that scientists publish the actual query provenance records, or at minimum the entire configuration file that was relevant to their research.

If the query provenance record is published using links to a live instance of the prototype, the links may not be active in the future, so replicability may be hampered. However, if the links are active, they may refer to either the current version of the configuration syntax, as the software may have been updated, or they may explicitly refer to the version that was active when the research was performed. The Bio2RDF prototype allows fetching of configuration information using past versions; for example, http://config.bio2rdf.org/admin/configuration/3/rdfxml will attempt to fetch the current configuration in RDF/XML format using version 3 of the configuration syntax, while http://config.bio2rdf.org/admin/configuration/rdfxml will attempt to retrieve the configuration using the latest version of the configuration syntax.

If possible, all three options should be simultaneously available to scientists when it comes time to publish research that uses the prototype, as all three have advantages, and together they reduce the number of disadvantages. For example, if they just publish the file, scientists do not have access to any new features that may have been implemented, or fixes to normalisation rules that may have been found necessary since the publication.

If they just publish the version specific configuration URL, they ensure that any future normalisation rule, provider and query type changes will be available, as long as the URL is resolvable. The version specific configuration URL should also limit the possibility for unknown future features to have a semantic impact. However, if it is not obvious how to change this URL to the current version, or to the version that the scientist has access to, it may not be as useful to them.

If they just publish the version independent URL, then future scientists may not know how to get access to the past version. Overall, the availability of all three options conclusively identifies all of the necessary information: the configuration syntax version that was used, in the RDF statements making up the query provenance record; the original configuration information; the current configuration information available in the original configuration version; and the current configuration information available in the current configuration version.

5.4 Query type

Query types are templated queries that contain the variables which are relevant to the query. The meaning of the variables is unique to each query type, as different query types may interpret the same question in different ways. Scientists can take advantage of this to create new compatible query types independent of other query types that may have previously been defined, and may still be used, in the context of some data providers. This has a direct effect on the ability of scientists to contextually normalise information, as the previous query interface may not have differentiated between factors that the scientist believes are relevant to their current context.



An example of a situation where a scientist may want to optimise a current query to fit their context can be found in Figure 5.3. Depending on the scientist's trust and data quality preferences expressed in their profiles, all of the query types and providers illustrated in Figure 5.3 may be used to resolve their query. This behaviour is reliant on the design of the pre-configured query type as a place where the namespace and data providers can be abstracted from the query. The abstraction makes it possible for the normalisation rules to be independent from the way the query is defined, apart from including markers in the query to determine where endpoint specific behaviour is required.

In some situations, a query may need to be modified to suit different data providers. This can be achieved, without changing the way the query is posed, by using a new query type that is relevant to the namespaces in the data provider and the original query variables.

[Figure 5.3: Optimising a query based on context. For a typical query on <http://myscientist.org/concept:A/101>, a widely available query type matches *:* (namespace:identifier) and identifies a useful endpoint for the namespace "concept"; for an optimised query on the same URI, a locally available concurrent query type matches concept:A/* and performs the query on the scientist's local database of concepts known to start with A/.]

The prototype implements query types as wrappers around templates that are either SPARQL queries or RDF/XML documents. These templates are used in conjunction with providers to include RDF statements in the results set. The wrappers require information derived from the input parameters. In the case of the prototype, the input parameters are derived by matching regular expressions against the application specific path section of the URL that was resolved by a scientist. For example, if the scientist attempted to resolve http://localhost/prototype/namespace:identifier, where “http://localhost/prototype/” was the path to the prototype application, then the input parameters would be derived by matching “namespace:identifier” against the regular expressions for each known query type. If a match occurred, then the query type would be utilised, and the matching groups from the regular expression would be used as parameters.

The prototype interprets each parameter as either public or private, and possibly as a namespace, according to the configuration of each query type, as shown in Figure 5.4. Namespace parameters are matched against prefixes in all of the known namespaces, to allow scientists to integrate their datasets with public datasets without having to previously agree on a single authoritative URI for the involved namespaces. The distinction between public and private is used to automatically identify a set of parameters that could be normalised separately from the public namespace prefixes. A public parameter may be normalised to lower case characters if a template requires this, for instance, where the scientist may not want private, internal parameters to be normalised. This distinction enables the scientist to control the data quality of the private identifiers separately from the public namespace prefixes, over which they have more control. In the prototype, matching groups default to being private, and not namespaces. In the example above, “namespace:identifier” may be matched against a query type using the regular expression “([\w-]+):(.+)”. The query type may define matching group number 1, “([\w-]+)”, to be a public parameter, and to be a namespace prefix. The second matching group, “(.+)”, defaults to being private and not a namespace.
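
A sketch of this matching behaviour, with the public and namespace designations represented as plain sets of group numbers:

    import re

    # Sketch of input matching as described above: group 1 is declared a
    # public namespace prefix, group 2 defaults to a private identifier.
    QUERY_REGEX = re.compile(r"([\w-]+):(.+)")
    PUBLIC_GROUPS = {1}      # group numbers declared public
    NAMESPACE_GROUPS = {1}   # group numbers declared to be namespaces

    match = QUERY_REGEX.match("namespace:identifier")
    for number, value in enumerate(match.groups(), start=1):
        visibility = "public" if number in PUBLIC_GROUPS else "private"
        role = "namespace prefix" if number in NAMESPACE_GROUPS else "plain"
        print(f"input_{number} = {value!r} ({visibility}, {role})")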

[Figure 5.4: Public namespaces and private identifiers. The resolved URL http://localhost/doi:10.1038/msb4100086 is matched against the Bio2RDF URI pattern ([\w-]+):(.+), giving two matching groups; only the first is declared public, and it is also declared to be a namespace. The prefix "doi" identifies the Bio2RDF DOI namespace (http://bio2rdf.org/ns:doi), which selects the BioGUID DOI resolver provider with the template URL http://bioguid.info/openurl.php?display=rdf&id=input_1:input_2, producing the replaced provider URL http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086.]

Scientists can extend the configuration to support new query methods by creating and using a new URI for the relevant parts of the query type, along with a new resolution method and resolution strategy on the compatible providers. This makes it possible to add a Web Service, for instance, where previously the query was resolved using a SPARQL endpoint, without formally agreeing on the interface in either case. Although this hides some of the semantic information about a query, it is important that scientists do not have to have a community agree on the semantic meaning of a query before they are able to use it. The use of the query in different scenarios may imply its meaning, but the data access model and prototype do not require the meaning to be established in order to use or share queries.

5.4.1 Template variables

Scientists need to be able to create generic query templates that include variables which are filled in when the query executes. This pattern is similar to the way resources are typically retrieved. The prototype, however, allows the variables to be defined in different ways for each of a set of query types that may be relevant to the query. In contrast to typical methods that provide either named parameters (e.g., HTTP GET parameters and Object Oriented Programming languages) or complex structured objects (e.g., SOAP XML), the prototype allows scientists to define a parameter based on the path of the URL that the scientist tried to resolve. For example, the scientist may resolve http://localhost/prototype/concept:Q/212, and the prototype would use "concept:Q/212" as its input. This input could match in different ways depending on the purpose of the query. The prototype uses regular expression matching groups to recognise information in queries. These groups are available as variables to be used in templates in various ways, allowing denormalised queries and result normalisation to take place.

Template variables, of the general form "$variablename", can be inserted in templates; if they match in the context of the query type and the URL being resolved, they will be replaced. In the example shown in Figure 5.5, the two query types each match the same user query. The matching process creates two input variables in the BioGUID query, and one input variable in the general query. In the general case, the template variable does not represent a namespace, while in the BioGUID case, the first input variable represents a public variable and is known to be the prefix for a namespace. The prototype is then able to use the first input variable both to choose a provider, based on the namespaces that have this prefix, and to replace the template variable in the URL using the input. In the general case, there is no namespace variable, so any default providers for the query type are chosen, and the template variable is replaced in the context of these providers.

The matching groups are available as template variables in different ways, depending on the level of normalisation required. If the second matching group in the previous example needs to be converted to upper case, the variable "$uppercase_input_2" could be included in the template. Variables also need to be encoded so that their content never interferes with the document syntax rules. For example, to properly encode the uppercased input variable used above, the template variable needs to be changed to "$urlEncoded_uppercase_input_2". This will replace all URI reserved characters with their "%XX" versions, where "XX" is the character code according to the relevant RFC (http://tools.ietf.org/html/rfc3986#section-2.1). In the prototype, the %XX escapes are always represented in uppercase, so ":" is encoded as "%3A" and not "%3a".

Figure 5.5: Template parameters. (Diagram, titled "Namespaces as parameters": the path section of the resolved URL http://localhost/doi:10.1038/msb4100086, namely doi:10.1038/msb4100086, matches two query types. An HTTP GET query type matches everything with the pattern (.+), recognises no namespaces, and is resolved by the default Bio2RDF mirror provider, which handles all namespaces to avoid having to know them all; its template URL http://bio2rdf.org/rdfxml/input_1 becomes http://bio2rdf.org/rdfxml/doi:10.1038/msb4100086, noting that input_1 differs for this provider compared to BioGUID DOI. The BioGUID DOI query type matches the Bio2RDF URI pattern ([\w-]+):(.+), with the namespace prefix as the first group; it handles the namespace http://bio2rdf.org/ns:doi and is resolved by the BioGUID provider, whose template URL http://bioguid.info/openurl.php?display=rdf&id=input_1:input_2 becomes http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086.)

In some cases data quality depends on the level of standardisation. In the case of URIs, some applications differ in the way they encode spaces: one set of applications encodes the space (" ") as the plus ("+") character, while others percent-encode it as "%20". The prototype allows for both cases; a URI can be encoded using the "+" character for spaces by changing "urlEncoded" in template variables to "plusUrlEncoded". For example, the variable given above could be changed to "$plusUrlEncoded_uppercase_input_2". If the dataset does not consistently use either method, the prototype may still be able to provide limited access to the data by using two different query types on the same provider to attempt access using both conventions.

For example, this template variable is used on the "dbpedia" namespace to retrieve information about resources starting with the prefix http://dbpedia.org/resource/. Some URIs starting with this prefix contain special characters in the name, such as http://dbpedia.org/resource/Category:People_associated_with_Birkbeck%2C_University_of_London, where the percent encoding is required to complete the name but interferes with the colon character after "Category". It is difficult to retrieve information from DBpedia if the item is a category and the name of the category contains a reserved character, as each case would require a normalisation rule to encode the reserved character. In the example above, the prototype will attempt to retrieve information about the non-percent-encoded URI, .../Category:People_associated_with_Birkbeck,_University_of_London. When the identifier from the dbpedia namespace is fully percent-encoded, the URL would be .../Category%3APeople_associated_with_Birkbeck%2C_University_of_London. Neither of these URIs will return information, because they do not exactly match anything in the RDF database that is hosting DBpedia.

The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.2) specifies that URIs are equivalent if they differ only by a percent-encoded element in a part of the URI where the reserved characters will not interfere with the syntax, in this case the path section of the URI. Here, it may be satisfactory to create another namespace, for example "dbpediacategory", and associate the relevant part of the "dbpedia" namespace with that namespace instead to avoid the issue; the URI would then be http://myexample.org/dbpediacategory:People_associated_with_Birkbeck%2C_University_of_London, which does not contain conflicting percent-encoded characters.

URLs in RDF/XML must also have the XML reserved characters encoded, as some of these characters are not encoded by URI encoding (http://www.w3.org/TR/REC-xml/#sec-predefined-ent). The template variable in the example using RDF/XML would be "$xmlEncoded_urlEncoded_uppercase_input_2". The order of conversion is right to left, so the input is converted to uppercase before it is URL encoded, and then the resulting string is XML encoded.
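As an illustration of this right-to-left composition, the following sketch applies the conversions in that order; the method names mirror the template variable components, but the implementation is an assumption (Java's URLEncoder targets form encoding, so spaces are adjusted to obtain the two variants described above):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class TemplateVariableEncoding {
        // "$uppercase_..." component
        static String uppercase(String s) {
            return s.toUpperCase();
        }

        // "$urlEncoded_..." component: percent encoding, spaces become %20
        static String urlEncoded(String s) throws UnsupportedEncodingException {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        }

        // "$plusUrlEncoded_..." component: percent encoding, spaces become +
        static String plusUrlEncoded(String s) throws UnsupportedEncodingException {
            return URLEncoder.encode(s, "UTF-8");
        }

        // "$xmlEncoded_..." component: escape the XML predefined entities
        static String xmlEncoded(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                    .replace("\"", "&quot;").replace("'", "&apos;");
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            String input2 = "10.1038/msb4100086";
            // $xmlEncoded_urlEncoded_uppercase_input_2, applied right to left
            System.out.println(xmlEncoded(urlEncoded(uppercase(input2))));
            // prints: 10.1038%2FMSB4100086
        }
    }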

A query type may define intermediate templates which take the input parameters and use them to create other templates. An example of this in the prototype is the ability to define, for each query, what a normalised URI would look like given the parameters. This gives scientists another variable to use in queries where they do not necessarily want to deal with each input individually, as that may make the template harder to port to other query types. The normalised URI in the previous example may be defined using "$defaultHostAddress$input_1$defaultSeparator$input_2". In the template, the expected public host is given, which may in some cases be resolvable to a Linked Data document. Along with the default separator, this template is used to derive the normalised URI from the abstract details, which may not be directly related to the URI that the scientist used to resolve the query with the prototype. This allows scientists to use the normalised URI matching the public URI scheme when they do not need to publish URIs that will resolve using their instance of the prototype. In many situations scientists may require others to resolve the document using their instance so that the queries can easily be replicated; however, both options are available.
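A sketch of how such an intermediate template might be expanded; the host and separator values are assumed configuration settings, not values prescribed by the prototype:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class NormalisedUriTemplate {
        public static void main(String[] args) {
            Map<String, String> variables = new LinkedHashMap<>();
            variables.put("$defaultHostAddress", "http://bio2rdf.org/"); // assumed setting
            variables.put("$defaultSeparator", ":");                     // assumed setting
            variables.put("$input_1", "doi");
            variables.put("$input_2", "10.1038/msb4100086");

            String template = "$defaultHostAddress$input_1$defaultSeparator$input_2";
            String normalisedUri = template;
            for (Map.Entry<String, String> entry : variables.entrySet()) {
                normalisedUri = normalisedUri.replace(entry.getKey(), entry.getValue());
            }
            System.out.println(normalisedUri);
            // prints: http://bio2rdf.org/doi:10.1038/msb4100086
        }
    }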

The ability of the prototype to recognise intermediate templates was necessary so that scientists can reliably normalise complete URIs to the appropriate form required for each endpoint, as the normalisation rules assume that they will be operating on the complete URI. For example, it is not possible to reliably normalise a URI such as http://myorganisation.org/concept:B/222 to its endpoint specific version, for example http://worldconcepts.org/concepts.html?scheme=B&id=222, without including the protocol and host along with the parameters that were interpreted by the query type. If only the parameters "concept:(A-D)/(\d+)" were included in the rule, then the normalisation phase might change http://worldconcepts.org/concepts.html?scheme=B&id=222 to a URL such as http://worldconcepts.org/concepts:B/222 without changing the host or protocol for the URI.
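The following sketch illustrates why anchoring on the complete URI matters; the patterns and hosts are the illustrative ones above, and the rule representation is hypothetical:

    public class CompleteUriNormalisation {
        public static void main(String[] args) {
            // De-normalisation: standard URI to endpoint specific URI.
            // Including the protocol and host makes the rewrite unambiguous.
            String standard = "http://myorganisation.org/concept:B/222";
            String endpointSpecific = standard.replaceAll(
                    "^http://myorganisation\\.org/concept:([A-D])/(\\d+)$",
                    "http://worldconcepts.org/concepts.html?scheme=$1&id=$2");
            System.out.println(endpointSpecific);

            // Normalisation: endpoint specific URI back to the standard URI.
            String normalised = endpointSpecific.replaceAll(
                    "^http://worldconcepts\\.org/concepts\\.html\\?scheme=([A-D])&id=(\\d+)$",
                    "http://myorganisation.org/concept:$1/$2");
            System.out.println(normalised);
        }
    }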

5.4.2 Static RDF statements

The prototype allows scientists to include pre-defined, static templates that can be used to insert information into a document in the context of a data provider and a user's query. Scientists can use this functionality to keep track of which URIs each endpoint used, along with their relationship to the URI used by the prototype. This makes it possible for scientists to describe the context of their document in relation to other documents, which may describe equivalent real world things in different ways. This is in contrast to typical recommendations that specify a single URI for each concept, so that others can simply query for descriptions of the item without knowing the context in which each alternative description is given. The RDF model is designed to provide ways of inter-relating concepts, including different descriptions of the same item. The prototype allows scientists to set up basic rules about the context surrounding their use of the URIs, and the RDF model, along with HTTP URIs, is used to implement this.
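A sketch of how a static RDF template might be instantiated; the predicate URI is a placeholder, and the template variables are the endpoint specific and normalised URI variables described later in this chapter:

    public class StaticRdfTemplate {
        public static void main(String[] args) {
            // Static N-Triples template attached to a query type; the
            // predicate URI is a placeholder, not the prototype's vocabulary.
            String template = "<$normalisedStandardUri> "
                    + "<http://example.org/vocab#hasAlternativeUri> "
                    + "<$endpointSpecificUri> .";

            String statement = template
                    .replace("$normalisedStandardUri",
                            "http://example.org/namespace:identifier")
                    .replace("$endpointSpecificUri",
                            "http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier");
            System.out.println(statement);
        }
    }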

Each of the alternative URIs may be a Linked Data URI that can be resolved to a subset of the data that was found using the normalised URI structure. Even traditional web URIs that do not resolve to RDF information can be useful. For example, a scientist's program may require a particular data format, and the traditional web URI can point to that location, as shown in Figure 5.6.

Figure 5.6: Uses of static RDF statements. (Diagram: the normalised URI http://example.org/namespace:identifier is linked by static statements to the Dublin Core identifier "namespace:identifier", to a scientific data format URL http://namespacefoundation.edu/cgi-bin/formatX.cgi?id=identifier, to an alternative Linked Data URI http://othersource.org/record/namespace2/identifier, and to an RDF data source URL from the provider endpoint, http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier.)


5.5 Provider

Datasets are available in different locations, and the method of accessing each dataset may be different. Using providers, scientists are able to implement multiple access methods across multiple locations for a range of query types. This makes it possible for scientists to change the method, location, and query that they use for data access without having to change their programs, as providers are contextually linked to the query types, which are linked to the overall query. The prototype makes it possible to use a range of methods for data access by substituting providers and query types, enabling scientists to replicate queries in their context without restrictions that may have been imposed on the original scientist.

Each dataset may need more than one data quality normalisation change in order for the results to be complete. Each provider can be configured with more than one normalisation rule, which makes it possible to layer normalisation rules. If scientists want to change the URIs that are given in a document in order to republish the results using the URI of their prototype, they can add a normalisation rule at a higher level than each of the other normalisation rules. This normalisation rule will change the normalised URI into one that points to the prototype.

In some cases, the scientist may not know exactly which set of data normalisations is required for a particular dataset. In these cases the scientist may need to have multiple, slightly different queries performed on the dataset as part of the overall query. Each provider can be used multiple times to respond to the same overall query, although it will only be used once in the context of a query type.

The prototype uses HTTP based methods, including HTTP GET, and an HTTP POST method for SPARQL endpoints. These make it possible to use a large range of data providers, including all of the Linked Data scientific data providers, and the SPARQL endpoints that have been made available as alternatives to the Linked Data interfaces of the scientific data providers. In Figure 5.7, there are two providers, which are used in different contexts. In one context, myexample.org uses the original data provider, with the associated normalisation rule; in the context of a scientist who has curated and normalised the dataset, the scientist could easily substitute their own provider and exclude the original data provider using their profile.

5.6 Namespace

The concept of a namespace encompasses both the representation of data and the origin of the datasets. Although its use in the implementation is limited to simple declarations that a feature, such as a part of a dataset, exists on the provider in the context of the queries it is defined to operate on, it could easily be extended to include statistics. This could provide for a generalised model where queries are ad hoc, and the parameters are directly defined by the scientist instead of in the context of each query type.

The prototype uses regular expression matching groups to identify the namespace sections in the scientist's query, and these are used to identify a set of namespaces. The matching groups are prefixes that are mapped to a set of URIs from the configuration settings for the prototype.


Figure 5.7: Context sensitive use of providers by prototype. (Diagram, titled "Providers in the Prototype": the untemplated user query http://myexample.org/namespace:identifier is resolved by myexample.org, under the common profile, using the original data provider, via HTTP GET from http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier; this provider requires the normalisation rule changing http://namespacefoundation.edu/ to http://myexample.org/. Scientist A's profile excludes the original data provider and instead resolves the query using a curated, normalised, local data provider, via a SPARQL DESCRIBE query sent by HTTP POST to http://localhost:1234/sparql.)


The current model and implementation require that scientist inputs are given as a single string. The use of a single arbitrary string enables semantically similar queries to choose the parts of the string that are relevant without affecting other queries. For example, one query type may take the input and send it as part of an HTTP GET query to an endpoint without investigating the meaning of the scientist's query, while other query types may require knowledge about the meaning of the query and only partially use the information in a query.

The use of indirect namespace references, using identical prefixes which are attached to multiple arbitrary URIs, makes it possible to integrate prototype configurations that could not be integrated if each namespace were directly named with a single URI. It is necessary to integrate different sources in a distributed query model, as there is no single authority able to name each of the namespaces authoritatively. A hypothetical example of where different configurations can be integrated seamlessly using the prototype is shown in Figure 5.8; it would be difficult to integrate them otherwise, as doing so would require changes to one of the configurations to make its namespace URIs match the other's. In addition, the ability to define aliases for namespaces makes it necessary to rely on the prefix instead of the URI, as there is not a single relationship between a URI and a prefix. The scientist needs to identify providers and query types in their contextual profiles as the basis for selecting which providers will actually be used to resolve the query, rather than namespaces, as it is ambiguous from their query which one should automatically be chosen.

5.7 Normalisation rule

A principal issue in the area of linked scientific datasets relates to the inability of scientists to query easily across different datasets, due to the inability of the relevant communities to agree on both the structure of a universal URI scheme and the mechanism by which those URIs will be resolved. This model avoids these issues by reducing URIs to their components, in this case a namespace prefix and a private identifier. These principal information components, which scientists have used in the past to embed non-resolvable references in their documents, are used to create a series of normalisation steps for each dataset, assuming a starting point that matches the scientist's preferred URI structure. This standard URI is then transformed using the normalisation rules as necessary, given the context of each provider, so that queries execute with the correct results, before the results are normalised so that scientists can understand their meaning without having to manually translate between URIs that are not relevant to their scientific questions.

Normalisation rules can be defined and applied to providers to access datasets that follow different conventions to those a scientist expects. Normalisation rules are initially applied to query templates using template variables, as well as to the query and its results at the other stages.

Figure 5.8: Namespace overlap between configurations. (Diagram, titled "Resolving namespaces using different configurations": the user query http://myexample.org/doi:identifier is resolved using two configuration sources. Configuration A contains namespace alpha, with preferred prefix "doi" and original producer http://www.doi.org/; Configuration B contains namespace beta, with preferred prefix "doid", the alias "doi", and original producer http://mydigitalorganisationid.example.org/. The prefix in the query matches both namespaces.)

The prototype implements the model's requirement for ordered normalisation rules with a staged rule application model. In the model, the query and the results go through different stages where normalisation can occur. For each stage from Table 3.1, the supported normalisation methods that the prototype implements are listed in Table 5.1. At each stage, the normalisation rules are sorted based on a global ordering variable, with the sort direction depending on whether the stage is designed for de-normalisation or normalisation. The reversal of the order priority in the later, normalisation, stages makes it possible to integrate the normalisation and de-normalisation rules for a single purpose into a single rule. This made it easier to manage the configuration information, as many rules would otherwise have needed to be split in two.

Stage                              Method                     Order Priority
1. Query variables                 Regular Expressions        Low to High
2. Query template before parsing   Regular Expressions        Low to High
3. Query template after parsing    None implemented           Low to High
4. Results before parsing          Regular Expressions, XSLT  High to Low
5. Results after parsing           SPARQL CONSTRUCT           High to Low
6. Results after merging           SPARQL CONSTRUCT           High to Low
7. Results after serialisation     Regular Expressions        High to Low

Table 5.1: Normalisation rule methods in prototype
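A sketch of the stage dependent ordering, assuming a simple rule object carrying the global ordering variable; the names are illustrative:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class StagedRuleOrdering {
        static class Rule {
            final String name;
            final int order; // global ordering variable from the configuration
            Rule(String name, int order) { this.name = name; this.order = order; }
        }

        // Stages 1 to 3 de-normalise the query: low orders run first.
        // Stages 4 to 7 normalise the results: high orders run first, so a
        // single rule can contain both halves of one round trip.
        static List<Rule> sortForStage(List<Rule> rules, int stage) {
            List<Rule> sorted = new ArrayList<>(rules);
            Comparator<Rule> byOrder = Comparator.comparingInt(rule -> rule.order);
            sorted.sort(stage <= 3 ? byOrder : byOrder.reversed());
            return sorted;
        }
    }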

In the prototype, URIs that should match what appears in an endpoint take the form of "$endpointSpecificUri". This template variable is derived by applying a set of regular expression normalisation rules to the standard URI template available in the query type, after substituting the relevant variables. The prototype uses this method to demonstrate the usefulness of high level normalisation rules over a set of distributed linked datasets. The endpoint specific URI template, and its normalised version, "$normalisedStandardUri", enable queries to be created that are not possible in other models.

The prototype initially supports three different types of rules: regular expression string replacements; SPARQL CONSTRUCT queries for selection or deletion of RDF triples; and XSLT transformations that transform XML results documents into RDF statements so they can be integrated with other RDF statements. Other types of rules can be created by implementing a subclass of the base normalisation rule class and creating properties to be used in configurations. Each rule needs to be serialisable to a set of RDF triples to enable replication based solely on the provenance record.
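The extension point might look like the following sketch; the class names and the RDF vocabulary are hypothetical rather than the prototype's actual API:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical base class: a rule transforms text at a given stage and can
    // serialise itself to RDF triples for inclusion in the provenance record.
    abstract class NormalisationRule {
        final int order; // global ordering variable
        NormalisationRule(int order) { this.order = order; }

        abstract String apply(int stage, String input);
        abstract List<String> toRdfTriples(); // e.g. N-Triples lines
    }

    class RegexNormalisationRule extends NormalisationRule {
        final String inputMatch, inputReplace;   // de-normalisation half
        final String outputMatch, outputReplace; // normalisation half

        RegexNormalisationRule(int order, String inputMatch, String inputReplace,
                String outputMatch, String outputReplace) {
            super(order);
            this.inputMatch = inputMatch;
            this.inputReplace = inputReplace;
            this.outputMatch = outputMatch;
            this.outputReplace = outputReplace;
        }

        @Override
        String apply(int stage, String input) {
            return stage <= 3 ? input.replaceAll(inputMatch, inputReplace)
                              : input.replaceAll(outputMatch, outputReplace);
        }

        @Override
        List<String> toRdfTriples() {
            return Arrays.asList(
                    "<http://example.org/rule/1> "
                            + "<http://example.org/vocab#inputMatch> \""
                            + inputMatch + "\" .");
        }
    }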

Regular expression rules contain equivalent normalisation and de-normalisation sub-rules. The de-normalisation part applies only to template variables in queries, while the normalisation part applies to entire result sets, although it is not applied to static RDF insertions, as these provide the essential references between normalised and non-normalised data. The same template variables that are applied to query templates are also available in static RDF insertions. This is important because, although normalisation is in most cases a useful syntactic step, it is also useful to highlight the diversity of URIs, where scientists may previously have been aware only of the normalised version and may not have correlated other URIs with their local version. The static RDF templates that are attached to query types are not normalised along with the results, although they will be normalised after the results are merged into the overall pool, as there is no way of knowing where individual RDF triples came from at that point.

For example, the variables used in a SPARQL query may be modified using a regular expression prior to being substituted into the templates inside the query, before the query is then processed and modified using SPARQL Basic Graph Pattern manipulation, although there is currently no standard rule syntax for this manipulation. Although the parsed template query normalisation stage is necessary in the model for completeness, it was not viewed as essential to the prototype in terms of data quality, context, or data trust. In each case, namespaces or profiles could be used to eliminate query types where the template would not be useful, rather than using normalisation rules to fix the query at this stage.

The SPARQL rules operate in one of two modes, either deleting or keeping matches. The major reason for this restriction is the lack of software support for executing the yet-to-be-standardised SPARQL Update Language (SPARUL) DELETE queries (http://www.w3.org/TR/2010/WD-sparql11-update-20100126/), which would give scientists more flexibility in the way deletions and retentions operate. Each of the different stages of a query is applicable to different types of rules. For instance, SPARQL transformations are not applicable to most queries, as they require parsed RDF triples to operate on. In comparison, regular expressions are not relevant to parsed RDF triples, so they are only relevant to queries, the unparsed results of queries, and the final results document.
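A sketch of the two modes, assuming the matched triples are obtained by executing a CONSTRUCT query such as the one below against the pooled results; the query and names are illustrative:

    import java.util.HashSet;
    import java.util.Set;

    public class SparqlRuleModes {
        // Illustrative CONSTRUCT query selecting the triples that a rule matches.
        static final String MATCH_QUERY =
                "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o . "
                + "FILTER(REGEX(STR(?s), \"^http://example.org/\")) }";

        // mode "keep": retain only the matched triples in the result pool;
        // mode "delete": remove the matched triples from the result pool.
        static <T> Set<T> applyMode(Set<T> pool, Set<T> matched, boolean keep) {
            Set<T> result = new HashSet<>(pool);
            if (keep) {
                result.retainAll(matched);
            } else {
                result.removeAll(matched);
            }
            return result;
        }
    }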

The RDF results from a query may be modified as textual documents before being imported into an internal RDF document, and they may be modified again after all of the results are pooled in a single RDF pool. In combination with the order attribute and the SPARQL keep or delete modes, normalisation rules can be implemented to perform any set of data transformations on a scientist's data.

In the absence of a standardised SPARQL Update Language (SPARUL), and given the lack of implementation of the draft in the OpenRDF Sesame software (http://www.openrdf.org/) used by the prototype, normalisation rules in the prototype are limited to regular expressions and matching SPARQL CONSTRUCT or RDF triple results. When SPARUL and other SPARQL 1.1 changes are standardised, including textual manipulation and URI creation, the prototype could easily be extended to support these features for arbitrary normalisation.

5.7.1 Integration with other Linked Data

It is a challenge to integrate and use Linked Data from various sources. Apart from operational issues, such as sites not responding or empty results due to silent SPARQL query failures, there are issues that scientists can consistently handle using the prototype, including URI conventions and known mistakes from individual datasets in the context of particular query types.

The prototype can be configured to use different URI conventions, including those matching the Bio2RDF normalised URI specification [24], and other proposals such as the Shared Names proposal, although Shared Names is not yet finalised.

The prototype implements HTTP URI patterns that are not handled or seen by the core model implementation. For example, the choice of file format for the results is made by prefixing the query with "/fileformat/". In comparison, other systems append the file format to the URI string, but this prevents their models from handling cases where the actual query ends in the extension, for example ".rdf", as there may be no way to distinguish the suffix from an actual query for the literal string ".rdf". The model is designed to provide access to data using an arbitrary range of queries, whereas the majority of Linked Data sites solely provide access to single records using a single URI structure. If all of the identifiers in a dataset are numeric, for example, a real query would never end in ".rdf", so there is no confusion about the meaning of the query.

In some cases it may be necessary to integrate normalisation rules created using different normalised URI schemes. The ordered normalisation rule implementation allows for this easily: a new normalisation rule is created with an order above any of the orders used by the included normalisation rules. This rule then consistently converts the results of all of the imported normalisation rules into the locally accepted normalisation standard.

It would be simpler for single configuration maintainers if every dataset were assigned a single set of normalisation rules to correct queries. However, this would prevent scientists from integrating their provenance and configurations with those of other scientists without manually combining the rules that were relevant to each dataset. The prototype enables scientists to use other prototypes' configuration documents and provenance documents as configuration sources.

This process is automatic as long as there are no URI collisions, where inconsistent definitions for query types and other items are provided by different sources. However, this should remain a social issue, as scientists should use URIs that are under their control if they expect their work to be consistent for other users. In cases where scientists wish to modify definitions for any model items, they should change the subject URI for the triples, and change any dependent items by changing the reference along with the subject URI for the dependent item. The original URIs then need to be added to the scientist's profile to ignore them and use the new items.

The migration method using new URIs and profiles enables scientists to attach their own data quality and trust rules to any providers. However, if they only wish to change data in the context of a particular query type on a particular provider, they need to create at least one new provider to include the new normalisation rule. If the query type is still used elsewhere, they cannot ignore it completely without redefining all of its uses. In this case, the original provider needs to be ignored using the profile, and a copy made that does not contain the query type.

The model allows different sets of normalisation rules to be used with the same query type on the same namespace on a single dataset. This situation requires two providers with the same query types and namespaces assigned to them, with only the normalisation rules differing. The Bio2RDF prototype used this pattern in cases where datasets contained inconsistent triples and it was not known beforehand which normalisation rules would be necessary for a particular query on a particular namespace.

5.7.2 Normalisation rule testing

The prototype includes tests that specify example inputs and expected outputs for normalisation rules. The tests include the requirement that normalisation rules with both input and output (i.e., de-normalisation and normalisation) components must be able to transform the results of the input (de-normalisation) rules back into a normalised version using the output rules. If the normalisation rule does not contain an input rule, or does not contain an output rule, the tests are only performed on the existing half of the rule.

Rule testing makes it possible for scientists both to validate the way a rule works on their examples, and to demonstrate to other users what the rule is intended to do, if that is not clear from the description and the rule itself. This independent validation is important to convince other scientists that rules are useful for data cleaning, as there is no single authority to curate rules. Although the model is completely decentralised by design, to allow for context sensitive changes, authorities could produce and curate rules, and scientists could use all or some of those rules based on their profiles.

5.8 Profiles

In the prototype, configuration information is serialisable to RDF, which can then be combined with configuration information from other sites to form a large pool of possible datasets and the associated infrastructure. This pool of information can be organised using profiles, so that scientists can exclude items that are not useful to them and optionally replace the functionality by including an alternative, perhaps a trusted dataset that has a known provenance and is hosted locally. The prototype was designed as a layered system, so that an instance of the prototype can explicitly state which profiles are more relevant than others.

The configuration files specify a global ordering number for each profile, and higher numbers take preference over lower numbers, so scientists can generally find a simple way to override all other profiles with their own. The profiles that will be used in a particular situation can be specified by the operator of the prototype system without reference to the order, which is obtained from each of the profiles. The global ordering makes sense because some profiles will contain instructions to always include anything that has not been restricted so far, and these profiles should be consistently visible, so that other users generate the same results when using the profile in their own set of profiles.
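A sketch of how an ordered include/exclude decision might be evaluated; the three valued decision and the names are assumptions made for illustration:

    import java.util.Comparator;
    import java.util.List;

    public class ProfileResolution {
        enum Decision { INCLUDE, EXCLUDE, NO_OPINION }

        interface Profile {
            int order();                     // global ordering number
            Decision decide(String itemUri); // opinion on a provider, query type or rule
        }

        // Higher order profiles are consulted first; the first explicit
        // INCLUDE or EXCLUDE wins, otherwise the item is included by default.
        static boolean isIncluded(List<Profile> profiles, String itemUri) {
            return profiles.stream()
                    .sorted(Comparator.comparingInt(Profile::order).reversed())
                    .map(profile -> profile.decide(itemUri))
                    .filter(decision -> decision != Decision.NO_OPINION)
                    .findFirst()
                    .orElse(Decision.INCLUDE) == Decision.INCLUDE;
        }
    }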


The overall implementation allows scientists to source statements from many different locations. Each of these locations should be trusted for the system to execute safely. Untrusted configurations may include queries that are so complex that they time out, or that accidentally cause denial of service attacks on data providers. Ideally, scientists should review queries before they are executed, although they may not be able to recognise where issues will occur without experimenting with the different ways the profiles work. An example of a situation where profiles may be accidentally misused is where there are multiple mirrors for a website, such as Bio2RDF, and the mirrors could hypothetically rely on each other for resolution of some queries. This could easily lead to circular dependencies, which the system could not detect without analysing the query plan for each query before executing it. The profiles are designed so that these potential dependencies can be removed by scientists.

The ordered profile system is useful in a range of scenarios. Its simplicity makes it valuable in many situations, as the ordering of the profiles relies only on knowledge of the desired effects and of the effects of each profile individually. This makes it possible for scientists who know the order in which the profiles are given to consistently determine which query types, providers, and normalisation rules will be used, without having to ask other users how their prototype was set up internally.

Although RDF provides a native facility for ordered lists, the implementation method makes it impossible to extend a list dynamically, as the last element of the list explicitly references a null value. It is important that future scientists be able to easily extend a list without having to modify the document containing the original definition. They can do this using numerically ordered profiles and normalisation rules. In addition, explicit ordering provides a degree of control over which processes can be performed in parallel, as items with the same order should not interfere with each other.


Chapter 6

Discussion

6.1 Overview

This chapter presents a discussion of the issues relevant to the design of the data access model described in Chapter 3; the use of the model by scientists, as described in Chapter 4; and the implementation of the prototype web application, described in Chapter 5. The model is designed to provide query access to multiple linked scientific datasets that are not necessarily cleanly or accurately linked. The model provides a set of concepts that enable scientists to normalise data quality, enforce data trust preferences, and replicate their queries in a simple way using provenance information. The prototype is a web application that implements the model to give scientists simple access to queries across linked datasets. This discussion includes an analysis of the advantages and disadvantages of the model and prototype with respect to the current problems that scientists face when querying these types of datasets.

Simple methods of querying across linked scientific datasets are important for scientists, particularly as the number of published datasets grows due to the use of the internet to efficiently distribute large amounts of data to different physical locations [67]. The model was evaluated based on the degree to which it supports the needs of scientists to access multiple scientific datasets as part of their research. The model provides useful features that are not possible in other similar data access models, including abstract mapping from queries to query types and normalisation rules, using namespaces where necessary. This enables the model to substitute different, semantically equivalent queries in different contexts, without requiring changes to the scientist's high level processing methods.

Scientists can use the model to transform data based on their knowledge of data quality issues across the range of datasets they require for their research, without requiring each query to determine which data quality issues are relevant to its operation. This makes it possible to perform multiple semantically unique queries on a provider, with each query requiring different data quality modifications. The model allows scientists to trust that the results of their query were derived from a trusted location using an appropriate query and normalisation standards. Scientists can evaluate the process-level provenance for their queries, including the exact queries that were performed on each data provider, and they are able to interpret this information in a simple way due to the object and reference based design of the model.

The prototype implementation was created to examine the practical benefits of the model in current scientific research. As part of this, case studies were examined, including a case integrating science and medicine in Section 4.4, and an evaluation of the usefulness of the prototype together with workflow management systems in Section 4.7. These case studies, together with experience using and configuring the prototype, revealed a number of advantages and disadvantages relating to the prototype and the initial model designs.

The prototype enables scientists to easily construct alternative views of scientific data, and to publish these views using HTTP URIs that other scientists can use if they have access to the server the prototype is running on. The prototype was successful in acting as a central point of access to numerous datasets, using typical Linked Data URIs for record access and atypical Linked Data URIs for performing efficient queries on datasets. The prototype was found to be a useful way of distributing configuration information, both in bulk form and in a condensed form as part of provenance records. This information was then able to be customised using a list of acceptable profiles that were statically configured for each instance of the prototype, making it possible for users to reliably trust information, as long as they already trusted the sites they obtained the overall configurations from. The prototype did not attempt to implement an access control system, so scientists would need to physically restrict access, or implement a proxy system that provides access control for each instance of the prototype.

For scientists using the system, it is important that data can be reliably transferred between locations on demand. This process may be disrupted by issues such as network instability, query latency, and disruption to the data providers due to maintenance or equipment failure, while there are inherent limitations, including bandwidth and download limits, that affect both clients and data providers. The model was designed to provide abstractions over both data providers and query types, to provide redundancy across both types of queries and data locations. The prototype was implemented with a customisable algorithm to decide when a data provider had failed enough times to warrant ignoring it and focusing on other redundant methods of resolving the query. The redundancy in the model, together with users specifying profile preferences, made it possible to reduce query latency by choosing physically close data providers and optimising the amount of information that needed to be transferred.

6.2 Model

The model is evaluated based on the degree to which it addresses the research questions outlined in Chapter 1, specifically context sensitivity, data quality, data trust, and provenance. The evaluation focuses on the advantages of the model design decisions in reference to the examples given in Chapter 1, with specific reference to the way the model can be used to replicate queries in different contexts.

6.2.1 Context sensitivity

Scientists work in many different contexts during their careers. These contexts include different universities and research groups at a high level, and many different data processing methods on different datasets. Some of these datasets may be sensitive, and the experiments may be restricted from public recognition, at least temporarily. Scientists use linked datasets in many different facets of their research, making it necessary to have a simple method of adjusting the way data access occurs to match their current context.

Scientists require that their experiments can be replicated in the future, with the expectation that the results will not be materially different from the original results. The model includes the ability to easily switch in or integrate new data providers or queries, along with the relevant normalisation rules. This means that even if the original data provider is unavailable, the overall experiment can still be replicated if the scientist can find an alternative method of retrieving the results. This is novel compared to alternatives such as workflows, because it recognises the difference between the data location, the queries on that data, and any changes that are required. Scientists can use this flexibility in the model to substitute components individually, according to their context, without requiring changes to the high level processing methods. The recognition of the context in which data is accessed, without requiring the use of a particular method, means that scientists can share their experimental workflows, and other scientists can replicate the results using their own infrastructure if necessary, with changes to the configuration of their prototype rather than workflow changes.

Scientific datasets may develop independently, making it difficult to integrate datasets even if they are based on the same domain. For example, a scientist may wish to integrate two drug related datasets, DailyMed and DrugBank, but may be unable to do so automatically because they do not contain the same data fields, although there are links between records in the two datasets. If a scientist requires the data from another set of results to be represented differently, to fit the way their processes expect, the model allows the scientist to modify the query provenance without affecting the way the original set of results was derived. The scientist can use normalisation rules to avoid having to contact the original data provider to modify their results. They would do this using a data provider that accesses the original set of results and modifies the data using rules, without performing any further queries. In the DailyMed/DrugBank example, the scientist would identify the common data fields based on their own judgement, and normalise data from one or both datasets using normalisation rules.

There are a number of common scenarios in which the model is useful for transitioning between different contexts and between different datasets, as shown in Table 6.1. The table shows the changes necessary to keep the status quo, so that scientists using the model for data access do not need to modify their processing systems where the model can be modified instead. In each case, a scientist would have to ensure that their profiles included the new providers, query types and normalisation rules, and excluded outdated providers, query types and normalisation rules. An important thing to note with respect to these scenarios is that it may never be necessary to produce a new query type in reaction to a change in a current dataset, as query types are designed to be generic where possible, to keep a stable interface between scientists and a large number of constantly updated datasets.

Scenario                   Definite changes                                 Possible changes
Data provider shuts down   New provider for replacement                     New query types and normalisation rules
Data structures change     New normalisation rules; add rules to providers  New query types
New scientific theory      New normalisation rules; add rules to providers  New providers and query types
New dataset                New providers                                    New normalisation rules and query types
Data identifiers change    New normalisation rules                          New providers

Table 6.1: Context change scenarios

The ability to group query types and providers based on a contextual, semantic understanding of functionality makes it possible for a scientist to easily find alternatives if their context makes it difficult to use a particular query type or provider. For example, a scientist may decide that the Bio2RDF published version of the UniProt dataset is easier to use than the official UniProt dataset, and they could link together the query types, providers and normalisation rules relevant to Bio2RDF and publish them together using a profile. Although this ability is useful, the model does not require scientists to specify why they did not find a particular query type or provider semantically useful. In each case, the scientist makes a decision about whether they want to use each item. Implementations of the model could include a range of inclusion metrics, such as those found in Mylonas et al. [101]. In contrast to Mylonas et al. [101] and other similar models, this model does not require that scientists and/or data providers adhere to a single global ontology. It does not even require them to choose a single ontology for all of their research, as they would need to do in those systems just to access data from a range of distributed datasets. If different scientists interpret the same query in different ways, the provenance records will highlight the differences to other scientists, and other scientists may choose providers based on these records for some of their own experiments.

The model is designed based on the assumption that the data within a namespace that can be found using a particular identifier will be consistent over time, although this may not be the case in many datasets. Although the model makes it possible for scientists to modify and change queries and data providers transparently to overcome changes, it does not make it simple for scientists to determine which queries and providers are relevant at the current time, compared to in the past. It is assumed that scientists will keep track of the provenance of previous experiments if they want a guarantee that the model will use the same sources in resolving their query, and scientists need to manually verify that the dataset was not modified.

The issue of temporal reasoning, where scientists simultaneously specify that different data providers are relevant at different points in time, could be included in the profiles system through the use of time constrained profiles, but it would also need to be included in the data providers if they include multiple versions of items. If a dataset changes the identifier for a data record, or deletes data describing a record, and the change cannot be reverted using normalisation rules, scientists would need to modify their processes to replicate their past experiments, something which is not ideal, but necessary in some circumstances.

For example, if a scientist discovered that a gene was not real, or that it was a replica of another gene, a genome dataset might choose to remove all of the information from the record describing the gene and replace references to the gene with references to the other gene. This would disrupt any current experiments that relied on the modified or deleted gene record containing meaningful information, and the scientists' workflows would need to be modified to allow for genes that have been replaced by references to new genes. Although this would not be a real world change, the fact that scientists changed their opinion is important to the way the system functions. Scientists use labels for their concepts, and a label needs to be used consistently by a scientist in order for their work to be as successful as possible.

In comparison to other models, this means that scientists are able to have independent opinions about the meaning of different pieces of scientific data. This is never possible if scientists are required to adhere to a single global ontology, as there is no way to recognise additions or deletions in experiments which are designed to produce new, novel results. There may be benefits in well developed disciplines for scientists to adopt a single set of properties, even if the objects or classes are still variable. For example, in the field of medicine, health care providers may adopt a single standard to ensure that there is no confusion between documents produced by, and distributed between, different doctors. However, doctors participating in medical trials may need to forego the benefits of this standardised data interpretation to test new theories using unconventional properties.

6.2.2 Data quality

Scientists need to reformat data to eliminate syntactic data quality mistakes, such as malformed properties or schemas, before queries across datasets will be consistent. This is a major issue for scientists, as these issues must be fixed before they can use higher level, semantic data quality procedures to verify that the information content is consistent. The model is designed to be useful before data quality issues have been resolved, as well as when the syntactic and semantic issues have been resolved using permanent dataset changes or normalisation rules.

Many syntactic data quality issues require human interaction, both to verify that there is an issue and to decide on the easiest way to fix it. Although this process is best done on a copy of the dataset that the scientist can modify, any modifications that the scientist can perform as part of a query can be accommodated by the model using normalisation rules on queries and results. The model makes it possible for scientists to then publish the rules along with their queries, so that other scientists can verify the results independently, avoiding any normalisation rules that do not suit them by using profiles as necessary, without having to redesign the query. If a scientist does need to republish the entire dataset locally, a custom profile can be used to isolate provider references to the original dataset and replace them with references to the cleaned dataset.

It is difficult to use automated mechanisms to diagnose semantic issues on distributed linked datasets, as the process may require information from multiple geographically distant locations, which is not efficient to retrieve if the relevant parts of the datasets are highly interlinked. The model succeeds in cases where the necessary information can be retrieved to enable the model to semantically clean data locally, or where queries can be constructed in ways that allow scientists to clean the data in the query. In general, the model relies on being able to construct queries that match the semantic meaning contained in the data provider, with normalisation occurring on the results. This is an issue for any distributed query methodology, as it is based on a fundamental distinction between locally available data, which can be processed using efficient methods, including in memory and on disk, and distributed data, which needs to be physically transported between locations before efficient localised methods can be used to verify that the data is semantically valid. If scientists require semantically valid, trusted data that is not efficient to process in a distributed scenario, they can set up a local data provider and co-locate the datasets to provide semantic data quality using efficient local processes.

In comparison to federated SPARQL systems, the model can be used to diagnose and fix semantic issues without requiring data to be provided using SPARQL in every location. The use of provider specific syntax rules to fix syntactic issues after a query is executed provides a more stable basis for semantic reasoning, as providers are free to change and migrate to new standards while queries that were applicable to past versions of the dataset may still be useful after syntactic normalisation. The model can be used to impose current standards on datasets regardless of the decision or timing of dataset publishers. For example, the query model was used throughout this project to impose the current Bio2RDF standards on each of the datasets, even though the Bio2RDF project itself did not always have the resources to physically change the datasets each time the best practices in the community changed.

The model may be used to easily link together data of differing qualities, potentially lowering the quality of the more useful data. This factor is inevitable in a model that allows users to merge information from different sources. Although it may be useful to enforce a single schema on the results, this would only enforce syntactic data quality, as shown in Figure 6.1. In the example, the syntax normalisation does not result in more useful data, and may even make it more difficult for scientists to process the data if they assume that a disease reference will be a link rather than just a textual label. The use of a single data format, without relying on a single schema, makes it possible for scientists to include their private annotations, which may not have been possible in other scenarios. If the user then republishes their annotations along with the original document, the quality of the combined document may be lower than that of the original data. This is a feature of the model, but the resulting issues must be dealt with at a higher level, through the creation of methods that can be used to segregate the data based on its origin and the desired query, such as RDF quads.

Many systems assume that the data quality issue can be solved in a context-independent way, relying on a single domain ontology to normalise all documents from every source to a common representation. However, the fact that an ontology is logically relevant to everything does not imply that it is syntactically relevant to everything. A number of limited scope systems, such as the BIRN network and the SADI network, have succeeded in integrating data from a range of biological sources, but they could not easily integrate annotations, or internal scientific studies, that violated one of the core ontological principles of the system. The model is designed to provide a neutral platform on which scientists can pick and choose their desired sources using providers, and their desired ontologies, based on the results of normalisation rules.

The cost of this neutral platform is that queries must be specified in terms of their goals, and not the query syntax, so the system cannot decompose complex queries into sets of sub queries unless users have first defined templates for the sub queries, with links from the appropriate providers to the templates. However, the neutral platform, particularly including the representation of normalisation rules in textual formats, provides simple ways for any scientist to choose their personally relevant set of queries, providers and rules. They can group these sets using profiles, and other users can include or omit the sets based on the identifier that was given to the profile, as the sketch below illustrates.
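For example, a profile grouping a set of trusted elements might be published in a form similar to the following Turtle sketch, where the vocabulary and URIs are illustrative assumptions rather than the prototype's actual schema.

    # Illustrative profile (vocabulary and URIs are assumptions): other
    # users include or omit the grouped elements by referring to the
    # profile identifier from their own configurations.
    @prefix profile: <http://example.org/querymodel/profile#> .

    <http://example.org/profile/mylab-trusted>
        a profile:Profile ;
        profile:includeProvider <http://example.org/provider/local-drugbank-mirror> ;
        profile:excludeProvider <http://example.org/provider/public-drugbank> ;
        profile:includeQueryType <http://example.org/query/find-references> ;
        profile:defaultVerdict "maybe" .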

Workflow management systems are designed to allow users to filter and clean data at every stage. The filtering and cleaning stages, however, are typically embedded directly in the workflow, making it difficult to distinguish between data processing steps and cleaning steps. This distinction is important if the cleaning steps are rendered unnecessary in a future context, but the data processing steps are still necessary to execute the workflow. In addition, the locations of the datasets are embedded in workflows, so if users independently clean and verify a local copy of a dataset, the workflow cannot easily be modified to both ignore the cleaning steps and fetch data from the local copy. The model is designed with this goal in mind, so it natively supports omission and addition of both cleaning steps (normalisation rules) and dataset locations (providers).

In the workflow integration described in Section 4.6, the prototype was used to produce workflows that focused on the semantic filtering process. The query provenance for each query was fetched when necessary by prefixing the query URI with "/queryplan/".
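For instance, if a workflow resolved the first of the two URIs below, the corresponding provenance document could be fetched from the second; the data item is taken from an example used later in this chapter.

    # Illustrative only: a query URI and the corresponding query
    # provenance (query plan) URI obtained by prefixing "/queryplan/".
    http://bio2rdf.org/geneid:4129
    http://bio2rdf.org/queryplan/geneid:4129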


Figure 6.1: Enforcing syntactic data quality. (The figure shows a syntax normalisation example: a "Find references" query for the data item with namespace diseasome_diseases and identifier 1689 searches the DrugBank and DailyMed datasets for any references to the disease. In the syntax normalised results, the DrugBank drug name "Marplan (Isocarboxazid)" carries a normalised disease reference, diseasome_diseases:1689, while the DailyMed generic drug name "Isocarboxazid" is linked to the disease only through the textual label "Brunner syndrome".)


6.2.3 Data trust

The model is useful for restricting queries to trusted sources of data. It allows scientists to restrict their experiments to a set of data providers, along with the query types and normalisation rules that they find useful. The trust mechanism is novel in the context of restricting queries on linked datasets based on scientific trust. This enables scientists to predefine trusted sources, and then perform queries that are known to be contextually trustworthy, as profiles are limited in their scope.

The trust factors include the ability to distinguish between trusted items and unreviewed, implicitly useful items, with both available depending on the overall level of trust that the scientist places in the system. It is not practical or useful for a scientist to apply trust factors to every dataset. However, at the query level, in the context of particular datasets, a scientist is able to eliminate untrusted items, allow explicitly trusted items, and make it known to other scientists that other items are as yet unreviewed, so that another scientist could either use them or not, depending on the context of their research. The range of undecided datasets may include datasets that are not linked to the scientist's discipline expertise, or are not curated to an overall level that is satisfactory to the scientist, or simply datasets that were not useful for the experiments currently in focus.

The profiles are useful for temporary exploration, and are necessary if scientists are to easily apply widely used query types to particular data providers without redefining the query types or providers. This may be common in an open situation where other scientists may have knowledge of more query types or data providers that are not reviewed according to the profile, but can be included or excluded based on the overall criteria that the other scientist applies. This aspect of data trust is reflected in the way the provenance is applied in future situations, as scientists need to decide what preference they are going to apply to the extra data providers and query types that they know about but do not find in a provenance record. Scientific experiments may be replicable using different query types; for instance, the original scientist may have used SQL queries as compared to the SPARQL queries that are currently available, but if both queries have the same semantic meaning they can be substituted.

If scientists were interested in determining trust in a social context, they could digitally sign and share their profiles, along with resolvable Linked Data URIs to indicate where to find the relevant data providers, query types and normalisation rules. The model and prototype were not evaluated at this level, as the model was designed to make it possible for scientists to define and share their personal trust preferences, rather than to decide at the community level whether something was trusted. The trust algorithm that was used for the prototype could be substituted with a community based algorithm that used numeric thresholds to represent the complex relationship between trust and the community preferences. In the trust system that was implemented for the prototype, the scientist only needed to define whether they wanted to use the item, rather than whether others should want to use the item. For instance, a review site such as that discussed by Heath and Motta [66] would make the model more flexible in social terms, but it would not change the fundamental methods used by the model to choose or ignore data providers, query types and normalisation rules based on the overall context of the current user.

The model may be extended easily to include a system of numeric thresholds, such as those used by many distributed trust and review models. The model relies on a verdict of either yes, maybe, or no for each item on each profile, and maybe decisions are passed down through a set of layered profiles, based on the scientist's trust, until a yes or no is returned, or none of the profiles match. A numeric threshold would require the scientist to arbitrarily assign an acceptable value to the maybe verdicts for each profile. The threshold could be defined as an average over a group of scientists, although there are debates about whether the wisdom of crowds [138] is more useful for trust purposes than the wisdom of known friends [131]. For example, the rating of known scientists may be more useful than the rating of a general crowd of scientists. The use of numeric thresholds would still follow the yes, maybe, and no system for each profile, in order for the system to allow for a complex definition of context sensitivity that could be directly replicated by others in the same way as the current model.

The profiles component of the model does not distinguish between model elements that are untrusted, those that are excluded because a scientist has a more efficient way to access the data provider, those where the implementation of the query type is thought to be incorrect, and those where the normalisation rule is not deemed necessary in a particular context, although it may be trusted in other contexts. The limited semantic information given by the profile is suitable for the model to consistently interpret the scientist's intentions, but it does not make the exact intentions clear to other scientists. An implementation of the model may provide different definitions for the semantic reasons behind a particular inclusion or exclusion, so that other scientists can reason more accurately about the information. The essential steps required to resolve queries using the model would be unchanged, as scientists would need to define exact criteria in the implementation regarding which parameters specified inclusion and exclusion, and what to default to if the set of parameters did not match these sets.

In the model, the statements from different data providers are merged into a pool before returning the complete set of results to the scientist. It is not possible to distinguish between statements from different locations using the current RDF specification without using reification or RDF Quads. Reification is not practical in any large scale scenario, as it turns single triples into sets of four triples, whereas RDF Quads implementations enlarge the results by only one extra URI per triple, adding a graph URI to the current model and using that URI to denote the location of the data provider for the statement. There is, however, limited support for RDF Quads in client applications, as the concept and the available formats (NQuads, TRiG, TRiX) have not yet been standardised. The prototype supports data available in Quads, with the graph URIs as the URIs of providers that were used to derive each of the triples in the graph.
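A minimal N-Quads sketch of this approach follows; the provider URI and the label value are illustrative assumptions.

    # Illustrative N-Quads statement: the fourth element is a graph URI
    # naming the provider that contributed the triple.
    <http://bio2rdf.org/geneid:4129> <http://www.w3.org/2000/01/rdf-schema#label> "example label" <http://example.org/provider/ncbi-geneid-endpoint> .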


6.2.4 Provenance

The model is designed to consistently distribute semantically similar queries across multiple datasets in a transparent manner, so the initial form, location and quality of the data do not need to be encoded into data processing scripts and tools. In doing so, it promotes the reuse and replication of the results in the future, as the query can be published as part of a process provenance record that contains information about the context of the query and what information is needed to replicate it. The model could also be used to track the provenance of the data, but that would require extra annotations on providers. This was not feasible, given that most datasets in science only contain simple data provenance annotations, such as the dates when records were last updated, and these dates are not useful for scientists in terms of allowing them to exactly replicate their experiments.

In the scientific workflow models that have previously explored the area of provenance, the location of the data and the filters and normalisation rules applied to the data are not separate parts of the model. An advantage of the way the provenance information is derived from the model is the ability to locate the original data without requiring the use of data normalisation rules. This enables scientists to work with a data model that other scientists may be more familiar with in some cases, compared to their preferred normalised version. This is useful for cases where scientists require the original data to be unchanged, or where the data is accessed in a different way to match their current collaboration. In the model, data normalisation is not a deciding factor in whether to use a provider, unlike systems that use single ontologies to distribute data, where the deciding factor is primarily whether the scientist's query matches the expected data normalisation rules.

In a similar way, the normalised data can be useful if it contains references, such as HTTP URIs, that can be resolved by other scientists to get access to information that is used as evidence in publications. The normalised information may be cumbersome to reproduce without a standard model for specifying what queries were used on which datasources and which changes were made to the information. Scientists can publish links to the underlying query plan, and peers can access this information and process it in a similar way to a workflow, but with the benefit of knowing where they can substitute their own datasources and which normalisation elements were used, in order to review the work as part of the peer review process.

The design choice of separating the providers of information from the queries that scientists may want to pose on arbitrary endpoints improves the way that other scientists can understand the methodology that was used to derive the answer. This is an important step in comparison to current workflow tools, which rely on scientists to guess which parts of a workflow are related to data access, and which parts are related to data cleaning and format transformation. It allows scientists to choose which data providers are used for queries, and allows them to change the mix of providers without interfering with the methods used to integrate the information with processing tools.

In terms of provenance, the model allows scientists to completely recreate their queries, assuming that the data is still accessible at some location. It does not specify the meaning of each query, as this is not relevant to the data access layer, and it does not specify the provenance of individual data items inside of data providers, as that is the responsibility of the dataset. In particular, it requires scientists to understand the way that the query parameters are linked to the overall question before the provenance can be interpreted, as this requires community negotiation to define how the parameters match the scientific question.

Other provenance systems focus on understanding the way that the linked datasets are related and what levels of trust can be placed in each data item, as opposed to focusing on what is required to reproduce the results of each query [149]. The provenance of the results does depend on this information, but scientists need to interpret the information that is available from the model, including any annotations on data providers, and any provenance information that is derived as part of the query.

The inclusion of data item provenance in the model was restricted by the requirement that the model be simple to create and understand. If the model included annotations about data items in the overall configuration, it would not be maintainable or distributable. It is the responsibility of the scientist to store the provenance information describing their queries, using the information that is generated by the model. This is consistent with the design of the model as a way for scientists to define the way their queries are distributed across their trusted datasets.

The final complete configuration of 27,487 RDF statements for the Bio2RDF website was serialised in RDF/XML format with a file size of 3,468 kilobytes. Storing provenance records would be relatively efficient if the redundant triples from the configuration that are duplicated between provenance records are not all stored multiple times. A provenance record also stays small because each of the elements, such as query types and providers, is linked using URIs, and it is easy to determine whether the information about a query type has already been included in the provenance record, to avoid adding it multiple times. In addition, the RDF model does not recognise multiple copies of a statement as being useful, so the number of statements included in the results can be optimised automatically.

In comparison with workflow provenance models, which attempt to exactly replicate the data processing actions using the same data access interface in the future, the model provides the ability for scientists to merge and change the data processing actions using substitution, based on the functions that match user query parameters. The parameter matching functions allow multiple query types to be used transparently. By contrast, workflows and SPARQL based projects such as SADI provide replicability by relying on a single input data structure producing a single output data structure using a single data location for each part of the processing cycle. The model makes it possible to integrate heterogeneous services at the data access level without having to change the processing steps, as is necessary in current models.

A provenance model for scientific data processing needs to allow scientists in the future to replicate and contrast results using independent data. Although RDF provides the basis for this replication, current data query models that are based on SPARQL do not support independent replication without human intervention. Extending these data access models to remove direct references to data locations and to the properties used in the inputs and results makes independent replication possible using configuration changes. The emphasis in the provenance literature on replication in the future ignores the evidence that data providers will eventually cease to exist or will not always be accessible.

There are tools such as WSsDAT [89] which are designed based on evidence that communications between data providers are variable in the current time, and may not last beyond the lifetime of the scientific grant related to the study. Although theoretically SPARQL based query models could generate the same results from different locations, in practice they rely on either hard coding this information into the data provider's service definition (see the SPARQL 1.1 Service Description draft specification, http://www.w3.org/TR/sparql11-service-description/), or hard coding this information into the middleware or the client's data processing code, as in SADI. Neither of these solutions is as flexible as a model that separates the translation and data normalisation steps from the definitions of queries. The model proposed in this thesis allows scientists to replicate their data processing in the future without having to rely on particular data providers, or face the daunting task of changing their processing code to support alternative data providers, when they could instead declaratively exchange a past provider for a current one with a change to the profile in use.

6.3 Prototype

The prototype was implemented as a proof of concept for the model that can be used to validate the advantages and disadvantages of the model, as described in Section 6.2. The evaluation of the prototype is based on its usefulness both as a privately deployed software package and as the engine behind the Bio2RDF website. Some features of the model, such as context sensitivity and data trust, apply mostly to private deployments of the prototype, due to the emphasis on scientists forming opinions about datasets. The other features apply equally to both public and private deployments, as they relate to the broader scenario of scientists easily getting access to data while still understanding the provenance of the process involved, and any data quality changes that were applied in the process.

6.3.1 Context sensitivity

The use of RDF as an internal model makes it possible for scientists to translate results from any query into other formats as necessary. The output format is able to identify links explicitly, so any reference fields in other formats can be mapped from the RDF document, and other fields can be mapped as necessary from the structures in the document.

The prototype provides access to the paging facility, which is included in the model and a subset of the queries, so that scientists can choose to fetch only particular parts of the result set at a time, although they can get access to all of the information by resolving the document for each page offset until they recognise that no new information is being returned. A scientist can use these functions to direct the model to return a maximum number of results from each provider. There is no recognition of relevance across the results returned from different providers, although it could be provided using the deletion rules to select the best results as necessary. This enables the prototype to support both gradual and complete resolution of queries, through the use of a number of data providers to represent the different stages of the query, in cases where the data provider allows paging to occur.
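As an illustration, paging might be exposed through a URI prefix, with the client repeating the request at increasing offsets until no new statements are returned; the "pageoffsetN" prefix shown here is an assumed naming convention, not one documented in this chapter.

    # Illustrative paging pattern: repeat with increasing offsets
    # until no new statements appear in the results.
    http://bio2rdf.org/pageoffset1/linksns/hgnc/geneid:4129
    http://bio2rdf.org/pageoffset2/linksns/hgnc/geneid:4129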

The prototype allows scientists to host the application in arbitrary locations, as long as scientists put in normalisation rules where necessary to change the structure of the URIs that are given in the results of each query. In the case of the HTML results pages, the location is automatically changed to match the way the scientist accessed the prototype, but this is not currently the case with the RDF representations, where it is viewed as more important to have standardised URIs so that the scientist can easily identify the concept they were querying for without reference to the location of the prototype. It is particularly relevant that scientists can replicate a public Linked Data website using the prototype, to avoid overburdening the public website with automated queries that can be performed by local software directly accessing local data.

The prototype allows scientists to extend public resources with any of their own resources, and keep their resources private as necessary. This enables scientists to extend what are traditionally very static Linked Data sources, without revealing their information or hosting the entire dataset locally. This context sensitivity is an important feature on its own, but it also makes it possible to solve the data quality and data trust issues in local situations, and to control the data provenance, rather than just obtain the information that an independent data provider publishes about a dataset. In this way it is a generic version of DAS, in which the data must be genome or protein records and other annotations are not accessible.

The prototype software allows scientists to individualise the configuration and behaviour of the software according to their contextual needs. It includes the ability to resolve public Linked Data URIs using the prototype, while supporting a simple method for extending the public representation of the record with RDF information directing scientists to other queries that are related to the item. However, this behaviour is not common, and scientists may not fully understand the difference between their local query and the public version, including how to give a reference to the local version of the data in their research, especially if the RDF retrieved by the general public using the published URI never includes their information.

The prototype requires that scientists utilise a local prototype installation to resolve documents according to their trust settings. This either requires scientists to change the URIs in the results to match the DNS name of their local resolver, or it requires them to publish the documents using the local prototype URIs. This is not a design deficiency of the prototype, as it is necessary to fully support local installations of the prototype software. It is, however, an issue that needs to be resolved using a social strategy to support the data quality and data trust features, as local results can be intentionally different from the results of the same query executed on a public service.

Although it is simpler for scientists to use commonly recognised Linked Data URIs for a particular item, this requires other scientists to know how to change the URIs in any of their documents to work with the scientist's prototype resolver. If they do not know to change the URIs, then resolving the Linked Data URI will only get them the statements that were available at the authority. The prototype is designed to allow scientists to install their own prototype and utilise the configurations from other scientists to derive their own context sensitive options for resolving the URI, including their own trust settings to determine which sources of information they trust in the context of each query.

Some recommendations for identifying data items, and for linking to items in other datasets, focus on standardising a single URI for each item in a dataset, and utilising that URI wherever the data item is referred to. A standard URI is useful for integrating datasets, but if there are contextual differences in the way the item is represented, including novel statements, then it is not appropriate to use the standard URI, as its trusted global meaning will be limited to statements that are published by the original authority. In order for the prototype to deliver the unique set of statements that a scientist believes to be trusted, it is more appropriate that the URI is changed, although the model and the prototype support both alternatives.

The Linked Data movement does not focus on supporting complex queries using HTTP URIs, although the principles of Linked Data apply to queries in a similar way as to raw data records. In terms of evaluating the contextual abilities of this prototype, there are no current recommendations about how to interlink data items with the queries that may be performed using the data item. For example, in the Bio2RDF configuration, there are many different queries that may be performed depending on the context that the scientist desires.

It is not practical to enumerate all of the query options in an RDF document that is resolved using the data item URI, as it would expand the size of the document and detract from the information content that is directly relevant to the item. An example of this is the resolution of the URI "http://bio2rdf.org/geneid:4129". This URI can be modified to suit a number of different operations, such as locating the HTML page that authoritatively defines this item, "http://bio2rdf.org/html/geneid:4129", locating the XML document that authoritatively defines this item, "http://bio2rdf.org/xml/geneid:4129", and searching for references to the item in another namespace, such as the HGNC namespace, "http://bio2rdf.org/linksns/hgnc/geneid:4129".
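The general pattern behind these operations can be summarised as follows, with the namespace and identifier values taken from the example above; the xml prefix is inferred from the surrounding text rather than quoted from it.

    # URI patterns for the operations discussed above:
    http://bio2rdf.org/geneid:4129                 # resolve the data item
    http://bio2rdf.org/html/geneid:4129            # authoritative HTML page
    http://bio2rdf.org/xml/geneid:4129             # authoritative XML document
    http://bio2rdf.org/linksns/hgnc/geneid:4129    # search the hgnc namespace for references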


If URIs that could be resolved to all of the possible queries for links in other namespaces were included, there would currently be at least 1622 extra statements, corresponding to a link search for each of the namespaces in the Bio2RDF configuration. These URIs can be usefully created by scientists without impairing basic data item resolution.

If a user requires these query options, extra statements could be included to identify the meaning of each URI. It is not necessarily evident to a computer that the URI "http://bio2rdf.org/linksns/hgnc/geneid:4129" will perform a query on the HGNC namespace. These extra statements would need to link the URI to the namespace, which in Bio2RDF is identified by the URI "http://bio2rdf.org/ns:hgnc". The computer could decide whether to resolve the namespace URI based on the predicates that were used in the extra RDF statements. It is not practical for Bio2RDF to include each of the possible query URIs in the document that is resolved using "http://bio2rdf.org/geneid:4129", although some are included, such as the location of the HTML page. In particular, there is a unique "linksns" URI that can be created dynamically for each namespace, resulting in a large number of RDF statements that would need to be created and transported but may never be used.

6.3.2 Data quality

The use of the prototype on the Bio2RDF website provided insights into the data quality issues that exist across the current linked scientific datasets. The issues have a range of causes, including a lack of agreement on a standard Linked Data URI to use for each data item, a range of URI syntax differences within and across datasets, and the inability to retrieve complete result sets from some servers, which cuts out possibly important results and makes it hard to consistently apply semantic normalisation rules to query results.

There is a large variation in the way references are denoted across the range of scientific data formats, including differing levels of support for links. These issues are not necessarily fixed by the use of RDF as a format for the presentation of these datasets, as the link representation method in the original data format cannot always be fixed by generic rules. The prototype was useful for fixing a number of the different link representations, although the lack of support for string manipulation in SPARQL made it difficult to support complex rules that would enable the prototype to dynamically create URIs out of properties where the currently available RDF representation did not specify a URI (see http://www.w3.org/2009/sparql/wiki/Feature:IriBuiltIn). The prototype was used to change references and data items that did not conform to the normalised data structures that were decided on by the Bio2RDF project. These references included URIs that were created by other organisations, which could simply be replaced with Bio2RDF URIs to provide scientists with a way to resolve further information within the Bio2RDF infrastructure. The model also provided the ability to define query templates, so that the normalised URI would be used in output, and the endpoint specific reference, whether a URI or another form, could be used in the query. It was successful in integrating datasets provided by projects varying from the Bio2RDF endpoints, to the LODD, FlyWeb, and the datasets provided by the Pharmaceutical Biosciences group at Uppsala University in Sweden.

The prototype enables scientists to understand the way the information is represented in each dataset relevant to their query, as the query can be completely planned without executing it. This makes it possible for scientists to understand each step of their overall query, without relying on a single method to incrementally perform a large query on any of the available datasets. A major data quality factor related to the execution of distributed queries is the way results can be obtained from each data provider. In some cases, the entire result set can be returned without limitation, but many data providers set limits on the number of results that can be returned from each query. This restriction has a direct impact on data quality, as the scientist must know what the limits are, and repeatedly perform as many incremental queries as necessary to make the query function properly. The prototype provides a way to incrementally page through results from queries until the scientist is satisfied. Some query types will return generic information that is not page dependent, and query types can be created to include a parameter indicating whether paging is relevant. For other query types that are page dependent, scientists can incrementally get results until the number of results is constant and less than the server is internally configured to support. This method is compared to other models in Figure 6.2, with other systems providing a variety of methods, ranging from returning a fixed sample of the top ranked results to always returning all results.

Figure 6.2: Different paging strategies. (The figure contrasts three approaches to a user query for information about a concept: the prototype's paging of query results; an automatic ranked results method that returns only the top results, which involves no paging, biases results towards items matching both query concepts, and may rely on a ranking method with no scientific justification; and returning all results, which incurs high bandwidth usage when there are a large number of results and may still hide issues with fixed size result sets.)

Some public data providers are known to place both incremental restrictions and overall restrictions on the number of results that can be related to a given query. This may be necessary, as the data provider may be provided voluntarily for minor use. There may also be overall restrictions on the size of the document returned by distributed queries. If the data provider is an HTTP server, the server has the option of returning a status code indicating that the server's bandwidth is restricted for some reason and the document cannot be returned. This restriction is in place for the public DBpedia provider, http://dbpedia.org, for example, although larger documents may be available if scientists enable GZIP compression on the results, something that is not required by the HTTP or Linked Data standards.

Each query needs to be executed in a way that minimises the chances of the size restriction being triggered, although it is not possible using current SPARQL standards to know how large a document may be. This may include constructing Linked Data URI resolutions as SPARQL queries with limits on the number of results that would be returned. The prototype avoids these issues by allowing scientists to pose multiple queries on the data provider, allowing them to iteratively get as many results as necessary. Scientists may find it necessary to create their own mirror of the data so that they can execute queries without the limitations that the public provider imposes. The prototype allows this, as the model is designed to make the data access methods transparent to the scientist's request, making it simple for scientists to override parts of the prototype configuration with their own context. For example, they could override the "wikipedia" namespace that DBpedia provides, so that queries about it were redirected to their mirror instead of the public endpoint.
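A sketch of such an override follows; the vocabulary, the namespace URI and the endpoint location are illustrative assumptions rather than the prototype's actual configuration schema.

    # Illustrative provider override: queries on the "wikipedia"
    # namespace are redirected to a local DBpedia mirror instead of
    # the public endpoint.
    @prefix qm: <http://example.org/querymodel#> .

    <http://example.org/provider/local-dbpedia-mirror>
        a qm:SparqlProvider ;
        qm:handlesNamespace <http://example.org/ns/wikipedia> ;
        qm:endpointUrl "http://localhost:8890/sparql" ;
        qm:includedInProfile <http://example.org/profile/mylab-trusted> .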

The prototype provides the ability for scientists to customise the quality of the data resulting from their queries. It is limited by several factors, including the lack of transformation functions in SPARQL, the lack of support in RDF and SPARQL for representing or normalising scientific units, and semantic ambiguity about results from different datasets. The use of RDF as the model to represent the knowledge provides a number of semantic advantages, as entities from different datasets can be merged, where other similar models do not allow or encourage universal references to items outside of a single dataset.

The prototype uses an in memory SPARQL implementation for high level data transformations. However, the current SPARQL 1.0 recommendation does not include support for many transformation operations. Notably, there is no way to create a URI out of one or more RDF Literal strings. This currently makes it hard to use string identifiers to normalise references in all documents to URIs if the original data producer did not use a URI, as string based regular expressions are not suitable for the process. This has an impact on the data quality normalisation rules that can be performed by the prototype, as URIs can only be created by inserting the value into a URI inside the original query, or into the statically inserted RDF. This only works for queries where the URI was known prior to the query being executed.
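For contrast, SPARQL 1.1 later standardised the functions that make this kind of normalisation possible at query time; the predicates in the following sketch are illustrative, not drawn from any particular dataset.

    # SPARQL 1.1 sketch (not available under the SPARQL 1.0
    # recommendation used by the prototype): build a URI from a
    # literal identifier at query time, assuming ?id is a plain
    # string literal.
    CONSTRUCT {
        ?record <http://example.org/vocab/reference> ?uri .
    }
    WHERE {
        ?record <http://example.org/vocab/geneIdentifier> ?id .
        BIND(IRI(CONCAT("http://bio2rdf.org/geneid:", ?id)) AS ?uri)
    }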

In terms of scientific data quality, it is important to know the units that are attached to particular numeric values. For example, it is important to be able to work with numbers, such as an experimental result of "3.35", while knowing the units, such as "mL", in which the number is represented. The RDF format has not been extended to include annotations for these units in a similar way to the native language annotation feature that is currently included in the RDF specification (see http://www.w3.org/DesignIssues/InterpretationProperties). This issue forms a large part of the normalisation process, as scientists expect data normalisation to result in a single unit being represented for each value. As units are not supported directly by the underlying data format, the prototype would need some other method of recognising the unit attached to a numeric value before a rule could convert between units, unless scientists are aware of the possible existence of unsuitable units.
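One workaround, shown in the following Turtle sketch, is to carry the unit as a custom datatype on the literal; the datatype and property URIs are illustrative assumptions, not part of any standard.

    # Illustrative workaround: encode the unit as a custom datatype.
    @prefix ex: <http://example.org/> .

    ex:measurement1 ex:hasValue "3.35"^^<http://example.org/units/mL> .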

The prototype implements a simple normalisation rule mechanism. It supports simple regular expression transformations, but does not utilise transformation approaches such as RIF or OWL reasoning. These transformations may be required in some cases, but the prototype demonstrates the practical value of the model in relation to linked scientific datasets using simple regular expressions for the majority of normalisations, with the ability to define SPARQL queries to perform more complex normalisation tasks. Some non-RDF transformation approaches, such as XQuery and XSLT, require a direct mapping between the dataset and the result formats. In these cases there is no opportunity to utilise a single intermediate model, such as RDF, whose statements can be unambiguously merged together in an unordered manner.

Many RDF datasets have avoided issues relating to the creation of URIs by extensively using Blank Nodes, which are unaddressable by design. This affects data quality, as scientists cannot reference, or normalise a reference to, a particular node, and there is no way, at least in the current RDF 1.0 specification, to consistently normalise Blank Nodes to URIs. The use of Blank Nodes also makes it difficult to optimise results, as an entire RDF graph may need to be transferred between locations to avoid elements that cannot be dereferenced in the future. For example, an optimised query using URIs as identifiers could return one triple out of a thousand triples in a dataset, while the same query on a dataset that uses Blank Nodes may need to transfer all one thousand triples so that the scientist could properly interpret the triple in terms of other information. If the graph uses URIs, the scientist can selectively determine which other triples are relevant to the results, using the URIs as independent identifiers. The prototype provides the ability to optimise queries and return normalised URIs that can be matched against any suitable dataset to determine where other relevant statements are located.

The use of RDF as the results format may highlight existing semantic data quality issues. These issues may occur when information is improperly linked to other information, and the use of computer understandable RDF enables a computerised system to discover that the link is inconsistent with other knowledge. In science, however, there is a constant evolution of knowledge, so the use of RDF to automatically discover inconsistent information relies on an up to date encyclopaedia of verified knowledge. The data quality of the encyclopaedia must be assumed to be perfect before it can be used with current, widely implemented reasoning techniques such as OWL [36], although there have been explorations into how to make imperfect sources of knowledge useful for the reasoning process [42]. Semantic data quality issues such as these are highlighted by the use of the model with syntax normalisation rules, as the sources of data can be integrated with ontologies that represent a high standard of curated knowledge, although there may not be enough information available to fix the issue without a higher level processing tool.

There may also be inherent record level issues with data when similar, but distinct, queries are executed on different datasources, with the multiple results being merged into the returned document. If there is likely to be ambiguity about the meaning of the information from different datasources, then ideally the queries should not be combined, as scientists would then need to recognise that multiple datasources were being used and accommodate this in the way they process the information, something that the model attempts to avoid. However, in practice, the different sets of results may both be valuable, and scientists would then need to notice the different structures used by the different datasources, customise their profiles to choose one query over the other, or allow for the different structures in their processing applications.

The example shown in Figure 5.1 illustrates a data quality issue, as there are multiple RDF predicates which may be used to describe a label or title for the item, and scientists would need to recognise this. In the example, the datasources each used a unique predicate for the name of an item in the Gene Ontology; some scientists may find the distinction useful, but others may create normalisation rules to convert the data to a standard predicate, and include those rules in their provenance records to highlight the difference to other scientists. In comparison to other projects, this allows scientists to specify their intent, and share this intent with other scientists, without having to embed the normalisation and access aspects into the other parts of their processing workflows.

In many cases, domain experts may not be experts in the methods of formally representing their knowledge in ontologies. Data created by a domain expert may be consistent, but it may not include many conditions that could be used to verify its consistency. In order to trust the data quality, other scientists can create new rules and apply them to the information. In other systems, this would require either low level programming code or a change to the globally recognised ontology framework, neither of which can be used in provenance records to distinguish between the original statements and the changes that were applied by others.

In order for the model to be used to verify the semantic data quality of a piece of data, the context of the item in conjunction with other items may be required. This process may require that a large range of data items are included in the output, so that rules can be processed. In practice, this level of rule based reasoning would require scientists to download entire datasets locally and preprocess the data to verify its consistency, before using it as an alternative to any of the publicly accessible versions of the dataset.

The use of RDF triples makes it difficult to attribute statements to particular data provenance histories, and therefore the data quality is harder to identify. However, it is possible if the scientist is willing to use different URIs for their annotations compared to the URI for the relevant data item. Different reasoning tools have different levels of reliance on the URI that is used for each statement about an item. In OWL, for instance, there is no link between a single URI and a distinct "Thing", as statements with different URIs can be merged transparently to form a set of facts about a single "Thing". If users do not rely on OWL reasoning, it is possible to distinguish between statements made by different scientists by examining the URI that is used. This is principally because RDF allocates a particular URI to the Subject position of an RDF triple, and users can create queries that rely on the URI being in the Subject position if they are not using reasoning tools. Reasoning tools such as OWL may discover an inverse relationship based on the URI of the Predicate in the RDF triple, and interpret the RDF triple differently, with the item URI being in the Object position. This makes it difficult to conclusively use the order of the triple to define which authority was responsible for the statement.
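The following Turtle sketch, with illustrative URIs, shows why subject position alone is unreliable: once an inverse property axiom is declared, a reasoner treats the two assertions as equivalent.

    # Illustrative: with the inverse axiom below, the two statements
    # are equivalent to a reasoner, so the position of ex:gene1 cannot
    # reliably indicate which authority asserted the statement.
    @prefix ex: <http://example.org/> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    ex:annotates owl:inverseOf ex:annotatedBy .

    ex:gene1 ex:annotatedBy ex:scientistA .
    ex:scientistA ex:annotates ex:gene1 .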

In some cases, the prototype may need to be configured to perform multiple queries on a single endpoint to deal with inconsistencies between the structure of the data in the endpoint and the normalised form. This is inefficient, but it is necessary in some cases to properly access data where it is not known which form the data will take in an endpoint.

The prototype partially supports RDF Quads, as there are already datasets, including Bio2RDF datasets, which utilise RDF Quads to provide provenance information about records. However, the lack of standardisation of the various RDF Quads file serialisations, and the resulting lack of interoperability between the various toolsets, made it necessary to restrict the prototype to RDF triples for the current time. Any moves to require RDF Quads should require a backwards compatible serialisation to RDF Triples, which is not yet available outside of the very verbose RDF Triple reification model.

A summary of the data quality support of a range of scientific data access systems is shown in Table 6.2. It describes a range of ways that different systems use to modify data, including rule based, workflow and custom code data normalisation. The prototype uses rules that are distributed across providers, as shown in Figure 3.1 with the context sensitive design model. The federated query model in the figure is representative of the general method used by the other rule based systems, as they rely on a single, complete decomposition of a complex query to determine which endpoints are relevant.

The prototype is fundamentally different from other systems in the way it separates queries from the locations where the queries are to be executed, preventing users from specifying locally useful rules as global rules for the query. When this is used along with the contextual definition of namespaces, it makes the overall query replicable in a number of locations, whereas the queries in other systems are localised to the structure and quality of the data as provided in the currently accessible location.

The configuration uses URI links between query types, providers and normalisation rules that make it possible for data quality rules to be discoverable. Communities of scientists can take advantage of this behaviour to migrate or replace definitions of rules in response to changes in either data sources or current standards for data representation. This benefit is available without having to change references to the data normalisation rules in any queries, due to the separation of rules from query types, or to the rules in any providers where the new queries do not require new normalisation rules. This ability does not necessarily affect the provenance of queries, as the provenance recording system has full access to the exact rules that were used to normalise data when the query is executed, providing for exact replication on data sources that did not materially change.

System               Advantages                      Disadvantages
Prototype            Rules, syntactic and semantic   –
SADI                 Rule based semantic             Custom code for syntactic
BioGUID              Clean linked identifiers        Custom code for each service
Taverna              Custom workflows                Hard to reuse and combine
Semantic Web Pipes   Custom workflows                Not transportable

Table 6.2: Comparison of data quality abilities in different systems

6.3.3 Data trust

The prototype demonstrated the practical considerations that were necessary to support data trust, including the open source distribution of the program, which enables scientists to create and modify configurations without having to negotiate with the community to get a central repository modified to suit their research. The configurations that were relevant to different scientists were published in different web locations, and scientists were able to configure the prototype to use their locally created configuration. These configurations were then used to set up the state of a server, so that the server knew about all of the providers, query types, and normalisation rules in the provided configurations. When a scientist performed a query on the prototype, the pool of configuration information was filtered using the profiles that the server was pre-configured to obey. The pool of configuration information allowed three servers in the Bio2RDF group to each choose the datasets and queries that they wanted to support, by assigning each server a different profile.

In comparison to methods that use endpoint provided metadata to describe service functionality, the prototype can use a combination of local files and publicly accessible web documents for configuration information. The VoiD specification presents a way of specifying the structure of data available in different SPARQL endpoints, but it does not include trust information [3]. VoiD documents are produced using RDF, so they can be stored locally and interpreted directly, unless they rely on the URL that they were derived from to provide context, something that is popular in some RDF publishing environments. A scientist could trust VoiD information in a similar way to the configuration information for the prototype if the RDF files were available and verified locally. However, in contrast to VoiD, the prototype configurations can be used to trust specific queries on specific data providers, as shown by the inclusion of a layer between providers of data and the query in the context-sensitive model in Figure 3.1.

If a trust mechanism were included explicitly in VoiD, it would not include the scientist's query as part of the context that was used to describe the trust. Trust, in VoiD, would need to be specified using the structure of the data as the basic reference for syntactic data trust, although semantic data trust could be recognised using a reference back to the original data provider. In comparison to the model described here, VoiD relies on a description of a direct link between datasets and the endpoints that can be used to access the datasets. In VoiD there is no direct concept of endpoints, so scientists cannot easily substitute their own endpoints, given a description of the dataset. Nor can they directly state their trust in a particular endpoint, whether that trust is based on service availability, data quality, or the other trust factors described in Gil and Artz [52] and Rao and Osei-Bryson [113].

The mechanism that is used to describe services in the recent SADI model does not allow transparent reviews of the changes that were made to data items, as many of the rules related to data normalisation are only visible in code; the extensive use of Web Services makes it simpler to encapsulate this layer in auto-generated programming code than to expose it as queries that can be studied by scientists. This means that a scientist needs to trust a datasource without knowing which method is being used to resolve their queries. This is possible, but not as advantageous as the alternatives, where scientists see all of the relevant information and can decide about it independently of the internal state of a server that is out of their control. In particular, SADI requires that scientists accept all data sources as semantically equal, and code is typically linked to a single data location, making it necessary to use a single syntax to describe each datasource in each location. The scientist must trust the results of the overall SPARQL query that is submitted to the SADI server, as the results are combined and filtered before being returned. In particular, any semantic rules that are applied to the data must match exactly, or the data will be discarded or an error returned to the user in lieu of giving them access to whatever data is available. This makes it particularly difficult for SADI to provide optimised queries where there is not enough data in the results of the query to satisfy the semantic rules for the classes of data that were returned, as an entire RDF record needs to be resolved to reliably determine whether the item described by the record fits into a particular semantic class.

The provenance information for a SADI query, if available, would require the system to evaluate the entire query again to accurately indicate which services were used, as the model behind SADI requires incremental execution of the query to come to a final solution. This may be similar to the requirements for a query executed using VoiD, as it contains information about both SPARQL predicates, which may not be specified in a query, and how to match the starting portion of a URI with a given dataset. In contrast to this, the prototype is designed to execute the query only in increments, so the information necessary to perform the query in the future can be obtained without performing the query beforehand. This ability is vital if scientists are to trust the way a particular query will execute before performing the query.

The prototype implements the data trust feature using user specified preferences to define which profiles are trusted and where the configuration information is located. This information is used to provide the context for all queries on the instance of the prototype. A scientist does not need to decide about this information for every query on the prototype, as they should have a basic level of trust in the way the query is going to be performed before executing the query. The scientist is able to make a final decision about the trust that they are going to put in the results when the query is complete and the provenance information is available. However, in the prototype the provenance information is derived using a different HTTP request. This means that the HTTP requested query provenance may not exactly match the data providers and/or query types that were used to derive the results of the previous query, due to random choices of providers, and network instability causing providers to be temporarily ignored. These issues are not present for internal Java based requests, where the provenance and the results can be returned from the same request.

The prototype does not insert information into the results about which provider and query type combinations were used to derive particular pieces of data. This means that scientists do not have a simple way of identifying which sources of data were responsible for an incorrect result if the identifiers for the data items do not contain this information. This is mostly due to the restrictions placed on the prototype by the RDF triple model with respect to annotations at the triple level. The RDF model provides a reification function, but it would expand the size of the results for a query by at least three hundred percent. An RDF quad format could be used to attach information about provider and query type combinations above the level of triples, and some quad syntaxes are relatively efficient; however, the RDF quad formats are not widely implemented or standardised, and changing would require all scientists to understand the quad syntaxes, as there is no backwards compatibility with the triple model.
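As an illustration of that expansion, reifying one result triple requires four additional statements before any provenance annotations are even attached. The following minimal sketch uses the Sesame API on which the prototype is built; the statement URI and the choice of triple are hypothetical examples, not the prototype's behaviour.

    import java.util.ArrayList;
    import java.util.List;

    import org.openrdf.model.Statement;
    import org.openrdf.model.URI;
    import org.openrdf.model.ValueFactory;
    import org.openrdf.model.impl.ValueFactoryImpl;
    import org.openrdf.model.vocabulary.RDF;

    public class ReificationExample {
        public static void main(String[] args) {
            ValueFactory vf = ValueFactoryImpl.getInstance();

            // The original result triple (URIs are illustrative examples).
            URI subj = vf.createURI("http://bio2rdf.org/geneid:4129");
            URI pred = vf.createURI("http://www.w3.org/2000/01/rdf-schema#label");
            URI stmt = vf.createURI("http://example.org/statement/1");

            // Four statements are needed to reify a single triple, before any
            // provider or query type annotations are attached to stmt.
            List<Statement> reified = new ArrayList<Statement>();
            reified.add(vf.createStatement(stmt, RDF.TYPE, RDF.STATEMENT));
            reified.add(vf.createStatement(stmt, RDF.SUBJECT, subj));
            reified.add(vf.createStatement(stmt, RDF.PREDICATE, pred));
            reified.add(vf.createStatement(stmt, RDF.OBJECT,
                    vf.createLiteral("monoamine oxidase B")));

            System.out.println(reified.size() + " extra triples for 1 original triple");
        }
    }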

The method of configuring the prototype using multiple documents as sources of configuration information relies on the assumption that the URIs for different elements, such as providers and profiles, are unique to a given document. Although this is not a realistic assumption in a broadly distributed scenario, the configuration sources were all trusted, and they were set up so that, within a particular document, it was easy to verify whether the URI for an element was accidentally duplicated. Another method would be needed to authenticate a web document before fully trusting the configuration information it provides.

The configuration documents could be authenticated by having scientists manually check through the contents of the file and then sign the particular file using a digital signature. This is not viable in dynamic scenarios, as there are legitimate reasons for making both minor and major changes to the configuration file without the scientist having to lose trust in the configuration information. The major risk is that an item will be defined in more than one document, causing unintentional changes to the provider, query type, or other object. A future implementation could allow users to specify a list of URIs representing model objects that could be trusted from particular configuration documents. This strategy would ensure that scientists have the opportunity to verify that their configuration information does not clash with any other configuration information.

System               Advantages                          Disadvantages
Prototype            Configured data trust               Multi-datasource results
SADI                 –                                   Assume all data is factual
BioGUID              Simple record translation           –
Taverna              Data tracing through workflow       Typically single locations
Semantic Web Pipes   Data tracing for single workflows

Table 6.3: Comparison of data trust capabilities of various systems

6.3.4 Provenance

The prototype can provide a detailed set of configuration information about the methods that would be used to resolve any query. This provenance record contains the information necessary for another implementation to completely replicate the query. However, the provenance information is not necessarily static. The model provides a unique way of ensuring that scientists can make changes to the provenance record without deleting the original data providers. It does this by loosely coupling query types with input parameters, and by linking from providers to the query types they support, so query types do not have to be updated to reflect the existence of new data providers. The prototype provides network resiliency across data providers, so that a single broken data provider will not result in partial loss of access to the data, even if different data providers use different query methods.

The provenance record for each query includes a plan for the query types and providers that would be used, even if the query was not executed. An implementation could generate the query plan link and insert it into the document, although many users may not require the query plan to use the document, so it was not inserted in the Bio2RDF documents by default. A complete solution would require a special application that knew how to interpret the URI into its namespace component along with any identifiers present in the item. It would use the namespace to obtain a list of query types and the relevant providers for those query types, using the namespace as the deciding factor. In the implementation, the provenance for a URI is retrieved by modifying the URI and resolving the modified URI.

The provenance record includes information about the profiles, normalisation rules, query types, and providers, in the context of the query. It contains a subset of the overall configuration, which has been processed using the chosen profiles to create the relevant queries. The prototype relies on template variables from queries, which are replaced with values in the provenance record. These changes mean that the provenance record is customised for the query that it was created for. In cases where the URI format does not change and the namespace stays the same, the query plan will contain most of the relevant providers and query types for a similar query. For example, both http://bio2rdf.org/queryplan/geneid:4129 and http://bio2rdf.org/queryplan/geneid:4128 contain the same providers and query types, as there is no query type that distinguishes between identifiers in the "geneid" namespace.
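A minimal sketch of how such a query plan URI could be derived from a normalised item URI follows; the prefix-insertion rule shown here is an assumption based on the example URIs above, not the prototype's actual routing code.

    public class QueryPlanUri {
        // Hypothetical helper: insert the "queryplan/" path segment after the
        // authority, so that http://bio2rdf.org/geneid:4129 becomes
        // http://bio2rdf.org/queryplan/geneid:4129.
        static String toQueryPlanUri(String itemUri) {
            int pathStart = itemUri.indexOf('/', "http://".length());
            if (pathStart < 0) {
                throw new IllegalArgumentException("No path in URI: " + itemUri);
            }
            return itemUri.substring(0, pathStart + 1) + "queryplan/"
                    + itemUri.substring(pathStart + 1);
        }

        public static void main(String[] args) {
            System.out.println(toQueryPlanUri("http://bio2rdf.org/geneid:4129"));
            // prints http://bio2rdf.org/queryplan/geneid:4129
        }
    }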

An alternative to modifying provenance records to derive new possible queries would be to generate further provenance records using the prototype after it has been configured using the provenance record as a configuration source. The scientist could then trust that the resulting queries were semantically accurate in terms of the query model.

The prototype did not provide a way to discover which queries would be useful for a particular normalised URI without resolving the URI. The document resolved for the normalised URI could have static insertions, which were used in the prototype to indicate which other queries may be applicable. For example, the document resolved at http://bio2rdf.org/geneid:4129 contains a reference to the URL http://bio2rdf.org/asn1/geneid:4129, and the predicate indicates that the URL could be used to find an ASN.1 formatted document containing the given record.

The provenance implementation in the prototype does not include data provenance. There are many different ways of denoting data provenance, and they could be used together with the prototype as long as the sections of the record can be segregated. This is an issue with RDF, as the different segments of the record cannot easily be segregated without extending the RDF model from triples to quads, where the extra context URI for each triple forms a group of statements that can then be given data provenance. The model emphasises data quality, which includes normalisation, and data trust, which includes proper annotation of statements with trust estimations. These conflicting factors make it difficult for both the prototype and the model to support provenance at a very fine level of granularity. The choice of which strategy to use is left to the scientist, although the prototype currently supports only RDF triples.

The strategy taken for the Bio2RDF website is to emphasise data quality over the location of the datasets and the provenance of individual RDF triples. The low level data strategy, where the data provenance of each statement is given, makes it possible to analyse the data in different ways, including the ability to construct integrated provenance and data quality queries. However, a normalised strategy that does not include all of the data provenance as part of the results of queries allows processing applications to focus on the scientific data without the extra processing complexity that is required if the RDF model includes the context reference.

The annotation of specific RDF triples is difficult, and some linked datasets require that scientists extend the original RDF triple model to a quad model by adding an extra Named Graph to each triple. This is possible using the model, but it is not encouraged in general, as it reduces the number of places where the RDF documents will be recognised: there is no recognised standard that defines RDF quad documents.

The data item provenance is useful as a justification of results in publications. However, if the data providers update the versions of the data items without keeping older data items available, the model is unable to recreate the queries, particularly if the data items are updated without changing the identifier. In part, this formed the basis for the LSID project [41] and implementations that used LSIDs, such as the myGrid project [148]. The social contract surrounding LSIDs required that the data resolved for each identifier be unchanged in future, so that past queries could be reproduced exactly. LSIDs failed to gain a large following outside of the biological science communities, and many initial adopters are instead moving to HTTP URI Linked Data based systems. However, the use of more liberally defined HTTP URIs does not mean that the emphasis on keeping data accessible needs to be lost.

The ability to reproduce large sets of queries exactly on large datasets is a basic requirement in the internet age; previously, textual details of a methodology were the only scientific requirement. The model provides the ability to reproduce the same query, with the publication containing the information required to submit the same queries to the same datasets. It is flexible enough to work with both constantly updated and static datasets. To reproduce the exact results, the data provider object in a query provenance document may be changed from the updated endpoint to the location of a static version that has not been updated since the query was executed.

The cost of continuously providing public query access to old copies of datasets may not be economical given the benefits of being able to exactly reproduce queries. This may be leading many data producers to avoid long term solutions such as Handles and DOIs, which rely on the data producer paying the costs of keeping at least the metadata about a record available in perpetuity. The model aims to promote a context sensitive approach that does not rely on an established authority to provide continuous access to the provenance or metadata about records. It is possible to publish provenance documents using schemes such as DOIs if data providers are willing to pay the costs associated with the continual storage of metadata about items by the global DOI authority.

6.4 Distributed network issues

The prototype was designed to be resilient to data access issues that could affect the replicability and context sensitivity of queries resolved using the model. Issues such as computer networks being unavailable, or sites being overloaded and failing to respond to a query, would have produced a noticeable effect on the results. The effect was reduced by monitoring and temporarily avoiding providers that were unresponsive. This monitoring was performed at the level of the provider, and at a lower level by collating the DNS names that were responsible for a large number of errors.

The prototype relies on simple random choices across the endpoint URLs given for each provider to balance the query load across all of the functioning endpoints, if they are not included in query groups or provider groups. In comparison to Federated SPARQL, where each of the services is defined as a single endpoint in the query, query groups and multiple providers for the same query types allow the system to be load balanced in a replicable manner. In addition, it was important for the Bio2RDF website that results were consistent across a large number of queries.

In a large scale multi-location query system, such as the set of scientific databases described in Table 2.1, a broken or overwhelmed data provider will reduce both the correctness and the speed of queries. The implementation includes the ability to detect errors based on the endpoint address and to avoid contacting an endpoint after a certain error threshold is exceeded. If there are backup data providers that can execute the same semantic query, they can be used to provide redundancy. Although this may affect the completeness of queries, it was viewed as a suitable tradeoff as long as the backup data providers are used for a query until a success is found. It avoids the very long timeouts that occur when sites do not instantly return errors, which would reduce query responsiveness for scientists. After a suitable period of time, the endpoints may be unblocked, although if they fail to respond again they would again be blocked.
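A minimal sketch of this error-threshold blacklisting strategy follows, assuming a simple per-endpoint counter and a fixed blacklisting period; the class and method names are illustrative, not the prototype's actual BlacklistController API.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class EndpointBlacklist {
        private final int errorThreshold;          // e.g. 5 consecutive errors
        private final long blacklistPeriodMillis;  // e.g. 10 to 60 minutes

        private final Map<String, Integer> errorCounts = new ConcurrentHashMap<String, Integer>();
        private final Map<String, Long> blacklistedUntil = new ConcurrentHashMap<String, Long>();

        public EndpointBlacklist(int errorThreshold, long blacklistPeriodMillis) {
            this.errorThreshold = errorThreshold;
            this.blacklistPeriodMillis = blacklistPeriodMillis;
        }

        // Record a failed query, and blacklist the endpoint once the threshold is hit.
        public synchronized void recordError(String endpointUrl) {
            int count = errorCounts.containsKey(endpointUrl) ? errorCounts.get(endpointUrl) : 0;
            count++;
            errorCounts.put(endpointUrl, count);
            if (count >= errorThreshold) {
                blacklistedUntil.put(endpointUrl, System.currentTimeMillis() + blacklistPeriodMillis);
            }
        }

        // A success clears the accumulated error count for the endpoint.
        public synchronized void recordSuccess(String endpointUrl) {
            errorCounts.remove(endpointUrl);
        }

        // Endpoints are unblocked automatically once the blacklisting period expires.
        public synchronized boolean isAvailable(String endpointUrl) {
            Long until = blacklistedUntil.get(endpointUrl);
            if (until == null) {
                return true;
            }
            if (System.currentTimeMillis() >= until) {
                blacklistedUntil.remove(endpointUrl);
                errorCounts.remove(endpointUrl);
                return true;
            }
            return false;
        }
    }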

The statistics gathering accumulated the total latency of queries made by the prototype on providers. The statistics derived from the Bio2RDF mirrors indicated that there were 298,807 queries with low latency errors (i.e., the time for a query to fail was greater than 0 and less than or equal to 300 milliseconds). By comparison, there were 328,857 queries with a total error latency of more than 300 ms. This indicates a roughly even split between cases where an endpoint had failed completely and quickly returned an error, and cases where it was working but busy and timed out before being able to complete a request. If an endpoint was blacklisted, it would appear a maximum number of times per blacklisting period in the statistics, making the statistics difficult to interpret. During the research, the blacklisting period varied between 10 minutes and 60 minutes.

The server timeout values across the Bio2RDF monitoring period ranged from 90 seconds at the beginning to 30 seconds at the end. A small number of queries, 284, had server timeout values of less than 5 seconds. A server timeout only occurred if the endpoint stopped responding while it was processing the query, so higher latencies were possible. In some cases, the query did not time out after 30 seconds, presumably because data was still being transferred. The highest total error latency for a query was 2,121,788 ms, for a query on November 10, 2009; it contained two queries that failed to respond correctly, with the overall query taking 1,201,622 ms to complete.

The highest response time for an overall query was 1,279,359 ms, for a query on November 4, 2009. It contained one failed query, which took 345,502 ms (345 seconds) to fail, while the 8 successful queries that made up the rest of the query took a combined total of 8,152,005 ms to complete. This indicated that there were queries that legitimately took a long time to complete. Nevertheless, many queries completed relatively quickly considering the network communication that was required: 6,662,019 overall queries completed within 3 seconds of the user requesting the data, and 7,022,205 queries completed within 5 seconds.


The Bio2RDF website and datasets were not hosted on high performance servers; however, unsuccessful queries still made up only 3 percent of the total, as shown in Table 6.4, indicating that the datasets were relatively accessible with the commodity hardware that was used. Two of the mirrors were limited to single servers with 4 core processors and between 1 and 4 GB of RAM, while the other mirror was spread across two dual-quad-core machines with 32 GB and 16 GB of RAM respectively. The requests from users to Bio2RDF were spread equally across all of the servers using DNS round robin techniques, so there was no specific load balancing to prefer the system with the higher specifications.

The current implementation lacks the ability to notify scientists in a systematic manner when particular parts of their query fail; in future, this information may be sent to scientists along with the results. This would enable decisions about whether to trust the results based on which queries failed to execute. Since the prototype was designed as a stateless web service, this information is not easily retrievable outside of the comments included in the results, and then only if the results use a format that supports textual comments. The query is not actually executed when the scientist requests the provenance for a query, so that avenue is not applicable.

Using RDF and SPARQL in the prototype, there is no way to estimate the number of results for each query, as SPARQL 1.0 does not contain a standard count facility, and the count cannot easily be derived from the query without statistics being available beforehand. Requiring each result set to be counted, either before or after the actual results are retrieved, would also incur a performance cost.

The model and prototype do not include support for specifying the statistics related to each dataset. In order to use the model as the basis for a federated query application, where a single query is submitted, planned, filtered, and joined across the data providers, the model would need to be extended to include descriptions of the statistics relevant to each namespace. This functionality was not included, as the model was designed to be relevant to the contextual needs of many different scientists, including those who do not have the internet resources available to process very large queries, as required by comparable large grid processing applications.

The prototype is efficient in part due to its assumption that queries can be executed in parallel, as the system is stateless, and the results can be normalised and included in a pool of statements without having to join or reduce the results from different queries until after they are resolved and normalised. This design does not favour queries that require large amounts of information from geographically distant datasets.

The prototype was implemented using Java. The Java security settings do not allow an unprivileged program to specify which IP address is going to be used for a particular HTTP request, as this requires access to a lower layer of the TCP/IP stack than the high level HTTP interface allows. The main consequence of this limitation is that the prototype is not able to determine where failures occur in cases where single DNS names are distributed across multiple IP addresses. The mirrored Bio2RDF SPARQL endpoint DNS entries were changed in response, to create a one-to-one relationship between DNS entries and IP addresses, which could then be used to reliably detect and avoid errors without affecting the functioning mirrors. For example, there are two IP addresses for the DNS entry "geneid.bio2rdf.org", but there is only one IP address for each of the location specific DNS entries, i.e., "cu.geneid.bio2rdf.org" and "quebec.geneid.bio2rdf.org". This may be a problem for the use of the prototype with other websites that rely on DNS level abstractions to avoid scientists having to know how many mirrors are available for a given endpoint.

Following a policy of having DNS entries map to single IP addresses enables an implementation to determine which endpoints are unresponsive. However, it restricts the ability of the scientist to choose which endpoints are geographically close. Each strategy produces the best responses under different conditions. The actual endpoint that was used can be reported to the scientist in the provenance record, with any alternatives noted as possible substitutes.

6.5 Prototype statistics

This research could not be evaluated using numeric comparisons, as the research questions focus on determining the social and technical consequences of different model design features on queries across distributed linked datasets. However, the use of the prototype on the Bio2RDF website was monitored, and the results are shown here to indicate the extent to which the model features were relevant to Bio2RDF.

The prototype was configured to match the conventions for the http://bio2rdf.org/ website. The configuration necessary for the Bio2RDF website contains 1,626 namespaces, 107 query types, 447 providers and 181 normalisation rules. In order to provide links out from the RDF documents to the rest of the web, 169 of the namespaces have providers that are able to create URLs for the official HTML versions of the dataset. The Bio2RDF namespaces can be queried through 13,614 different combinations of query types and providers.

The Bio2RDF website resolved 7,415,896 user queries between June 2009 and May 2010. The prototype was used with three similar profiles on three mirrors of the Bio2RDF website: two in Canada and one in Australia. The configurations were sourced from a dedicated instance of the prototype [4]. Profiles were used to provide access to each site's internally accessible datasets, while retaining redundant access to datasets in other locations. The reliability of the website was more important than its response latency, so redundant access to datasets at other locations increased the likelihood of obtaining results for a user's query.

[4] http://config.bio2rdf.org/admin/configuration/n3

Analysis of the Bio2RDF use of the prototype generated a number of statistics, including those shown in Table 6.4. These statistics revealed that the prototype was widely used, averaging 17 user queries per minute over 10 months. These queries generated 35,694,219 successful queries on providers, with 1,078,354 unsuccessful queries (3%). A large number of unsuccessful queries were traced to a small number of the large Bio2RDF databases that were not handling the query load effectively. There were periods where the entire dataset hosting facility at individual Bio2RDF locations was out of use due to hardware failures or maintenance, which caused the other mirrors to regularly retry these providers during the outage.

The statistics gathering system was not designed to provide complete information about every combination of query type and provider. In particular, there were at least 10,697,566 queries that were executed on providers that did not require external communication to complete, as the providers were designated as no-communication. These queries were not included in the successful or unsuccessful query counts. No-communication providers are fillers for templates that generate RDF statements to add to the resulting documents, so they did not provide information about the efficiency of the system. Of the 36,772,573 remote queries, 18,455,634 were performed on providers that were configured to perform SPARQL queries.

Each of the user queries required the prototype to execute an average of 4.96 queries over an average of 3.73 of the 447 providers. The 447 providers do not represent distinct endpoints, due to the way query types, namespaces, and normalisation rules are all contextually related to a provider in the model rather than to an endpoint. The large number of unique providers and queries reflects the difficulty of constructing alternative workflows or processes to contain these queries, along with the normalisation rules that are applied to each provider.

Each of the 447 providers could be removed using profiles and, if necessary, replaced using alternative definitions without needing to change the definitions of any query types, namespaces, or normalisation rules in the master Bio2RDF configuration. By contrast, current Federated SPARQL models expose endpoint URLs directly in queries. This creates a direct link between queries and endpoints, which means that queries cannot be replicated, based on query provenance, using any other endpoint without modifying the query provenance.

Federated SPARQL models may use object types to optimise queries. This strategy works for systems where everyone uses the same URI to designate a record, and all records in a namespace are represented using consistent object type URIs. These assumptions hold in an environment where every namespace is similar to a typical normalised relational database table, with the same properties for every record and numeric primary keys for each record. However, as the Bio2RDF case demonstrated, primary keys are not always numeric, so they may require syntactic normalisation. In addition, there may be multiple object types in a single namespace, even if the object types are in reality equivalent but represented using different URIs or schemas.

In Bio2RDF, there were a few namespaces that clearly highlighted issues with the relational-database-like assumptions used by Federated SPARQL implementations. For example, the Wikipedia dataset has been republished in RDF form by the DBpedia group. The record identifiers in DBpedia are derived from the titles of articles on Wikipedia, so references to the equivalent Wikipedia URL should be treated equally to DBpedia. In the Bio2RDF normalised URI scheme, the main namespace prefix was "wikipedia", with an alternate prefix of "dbpedia", as the dataset is commonly referred to in the Semantic Web. In the prototype, two different URIs, http://bio2rdf.org/wikipedia:Test and http://bio2rdf.org/dbpedia:Test, would resolve these documents, although the resulting documents both contained http://bio2rdf.org/wikipedia:Test as the subject of their RDF triples.
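A minimal sketch of this preferred/alternate prefix normalisation follows, assuming a simple alias table; the mapping reflects only the "wikipedia"/"dbpedia" example above, and the class and method names are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    public class NamespaceNormaliser {
        // Alternate prefix -> preferred prefix (hypothetical alias table).
        private final Map<String, String> preferredPrefixes = new HashMap<String, String>();

        public NamespaceNormaliser() {
            preferredPrefixes.put("dbpedia", "wikipedia");
        }

        // Rewrite http://bio2rdf.org/dbpedia:Test to http://bio2rdf.org/wikipedia:Test,
        // leaving URIs that already use the preferred prefix untouched.
        public String normalise(String uri) {
            String base = "http://bio2rdf.org/";
            if (!uri.startsWith(base)) {
                return uri;
            }
            String rest = uri.substring(base.length());
            int colon = rest.indexOf(':');
            if (colon < 0) {
                return uri;
            }
            String prefix = rest.substring(0, colon);
            String preferred = preferredPrefixes.get(prefix);
            return preferred == null ? uri : base + preferred + rest.substring(colon);
        }

        public static void main(String[] args) {
            NamespaceNormaliser n = new NamespaceNormaliser();
            System.out.println(n.normalise("http://bio2rdf.org/dbpedia:Test"));
            // prints http://bio2rdf.org/wikipedia:Test
        }
    }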

In Federated SPARQL, the whole community would need to decide on a single structure for URIs, or on a single object type for each namespace. In Bio2RDF, different structures were supported by adding a new namespace and applying it, with any corresponding normalisation rules, to the differing providers. For example, the prototype was used by Bio2RDF to access and normalise the URIs in the Chembl dataset, available at Uppsala University in Sweden [5]. There are a number of namespaces inside this dataset, and a set of namespaces was created to suit the structure.

[5] http://rdf.farmbio.uu.se/chembl/sparql

The prototype was also experimentally used to create a Linked Data infrastructure for the original URIs. In order to do this, alternative namespaces needed to be created for the same dataset, as the namespace given in the original URIs did not contain the "chembl_" prefix that Bio2RDF used to designate a namespace as being part of the Chembl dataset.

It would not be possible to create these alternative Linked Data access points using Federated SPARQL, as Federated SPARQL requires single URIs, so each query could only contain one of the many published URIs. This would make it impossible for anyone except the original publisher to customise the data, and it is necessary to customise both the source and the structure of data to satisfy the data quality, data trust, and context sensitivity research questions for this thesis.

The 447 providers were mapped to 1,626 namespaces. The majority of these namespaces represented ontologies published by OBO, converted to Bio2RDF URIs and located in a single Bio2RDF SPARQL endpoint. In the OBO case, it would be very difficult to separate the different ontologies based on object types, as the OWL vocabulary is universally used to describe the objects in all of the namespaces. The only alternative method for segregating the dataset into namespaces is to search for a given prefix in each URI, as each of the ontologies contains a different URI prefix.

In SPARQL 1.0, it is necessary to use regular expressions to query for a particular prefix in a URI. This is very inefficient and not necessarily simple. To avoid this, optimised SPARQL queries were constructed to use the Virtuoso free text search extension. In Federated SPARQL, these extensions would need to be located in the query, making it difficult to replicate sets of these queries on another system, as they would all need to be modified. In Bio2RDF, however, the equivalent regular expression query was created and used on non-Virtuoso endpoints, without changing the query that users refer to.
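As an illustration of substituting endpoint specific templates for the same user query, the sketch below pairs a portable regular expression filter with a Virtuoso free text variant, shown here as Java string constants; the label predicate, search term, and Virtuoso form are illustrative assumptions, and the two queries are only approximately equivalent, since bif:contains applies word level text indexing rather than substring matching.

    public class PrefixQueries {
        static final String PREFIXES =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> ";

        // Portable SPARQL 1.0 form: correct on any endpoint, but forces a
        // scan of all matching literals, so it is very inefficient.
        static final String REGEX_QUERY = PREFIXES
            + "SELECT ?s WHERE { ?s rdfs:label ?label . "
            + "FILTER(REGEX(?label, \"oxidase\", \"i\")) }";

        // Virtuoso-specific form: uses the indexed free text extension
        // instead of a regular expression scan (illustrative only).
        static final String VIRTUOSO_QUERY = PREFIXES
            + "SELECT ?s WHERE { ?s rdfs:label ?label . "
            + "?label bif:contains \"oxidase\" }";

        public static void main(String[] args) {
            System.out.println(REGEX_QUERY);
            System.out.println(VIRTUOSO_QUERY);
        }
    }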

In the case of OBO, these query and namespace features enabled Bio2RDF to create a small number of providers to represent similar queries across all of the OBO ontologies.

There were at least 10,765 unique users of the Bio2RDF website during the statistics gathering period, although two of the mirrors did not recognise distinct users by their IP addresses, as a result of the web applications being hosted in a reverse proxy configuration.

Statistic                                                           Number
Total no. of resolutions                                            7,415,896
No. of resolutions within 3 seconds of request                      6,662,019
No. of resolutions within 5 seconds of request                      7,022,205
Average no. of resolutions per minute over 10 months                17.16
Average latency of queries (ms)                                     1,532
Resolutions with low latency errors (1-300 ms in total)             298,807
Resolutions with high latency errors (more than 300 ms in total)    328,857
Total number of queries on external endpoints                       36,772,573
No. of successful provider queries by prototype                     35,694,219
No. of unsuccessful provider queries by prototype                   1,078,354
Average no. of queries for each resolution                          4.96
Average no. of endpoints for each resolution                        3.73
Number of unique users by IP address                                10,765
Current number of providers                                         447

Table 6.4: Bio2RDF prototype statistics

The prototype software was released as open source software and had an aggregate of 2,685 downloads from the SourceForge site [6]. The most recent version of the software released on SourceForge, 0.8.2, had been downloaded around 300 times in its compiled form and around 30 times in its source form at the time of writing.

[6] http://sourceforge.net/projects/bio2rdf/files/

The normalisation rules in the Bio2RDF case provided a way to make the normalised Bio2RDF URI useful in cases where other data providers used their own URIs. This is necessary for data providers that independently publish their own Linked Data versions of datasets using their own DNS authority as the resolving entity for the URI. For example, the NCBI Entrez Gene data item about "monoamine oxidase B", with the identifier "4129", has at least three identifying URIs, including http://purl.org/commons/record/ncbi_gene/4129, http://bio2rdf.org/geneid:4129, and the NCBI URI http://www.ncbi.nlm.nih.gov/gene/4129. The normalisation rules made it possible to retrieve data using any of these URIs and integrate the data into a single document that can be interpreted reliably using a single URI as the reference for the item. In addition, the 150+ providers that were designed to provide links to the traditional document web did not require normalisation, as the identifiers could be placed directly into the links.

In the case of Bio2RDF, there was a relatively small number of normalisation rules compared to the number of providers, because in practice there was a limited number of locations from which each dataset was available, and the semantic structure of the data did not need to be normalised between providers in most cases. The majority of rules defined URI transformations so that data from different locations could be directly integrated based on RDF graph merging rules.
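A minimal sketch of such a URI transformation rule as a regular expression rewrite follows, using the Entrez Gene URIs above; the specific patterns are illustrative, not the rule definitions used in the Bio2RDF configuration.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class UriNormalisationRules {
        // Pattern -> replacement pairs rewriting provider specific URIs to the
        // normalised Bio2RDF form (hypothetical rules for the geneid namespace).
        private static final Map<String, String> RULES = new LinkedHashMap<String, String>();
        static {
            RULES.put("http://purl\\.org/commons/record/ncbi_gene/(\\d+)",
                      "http://bio2rdf.org/geneid:$1");
            RULES.put("http://www\\.ncbi\\.nlm\\.nih\\.gov/gene/(\\d+)",
                      "http://bio2rdf.org/geneid:$1");
        }

        // Apply each rule to the serialised RDF results from a provider so that
        // documents from different locations merge on a single subject URI.
        static String normalise(String rdfDocument) {
            String result = rdfDocument;
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                result = result.replaceAll(rule.getKey(), rule.getValue());
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(normalise("<http://purl.org/commons/record/ncbi_gene/4129>"));
            // prints <http://bio2rdf.org/geneid:4129>
        }
    }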


In comparison, the SADI system requires that data normalisation be written in code, as opposed to being configured as a set of rules. It must encode each of these URIs and predicates into the relevant code and decide beforehand what the authoritative URI will be. This makes it simpler to deploy services to a single repository, but makes it very difficult for users to change the authoritative URIs and predicates to match their own unique context. Having said that, the SADI system provides SPARQL based access to Web Services using its mapping language, so it is useful for integrating RDF methods with the current pool of scientific Web Services.

SADI was initially built on a selection of premapped BioMoby services, making it potentially able to access over 1,500 web services when the mapping is complete. In comparison, the Bio2RDF configuration built for the prototype relies mostly on RDF database access using direct SPARQL queries, making it possible to avoid performing multiple queries to get a single record from a single database, as is necessary with many BioMoby Web Services. The SADI system is, however, being gradually integrated with the Bio2RDF databases, although the normalisation rules that make the prototype successful across the large range of Bio2RDF and non-Bio2RDF RDF datasources will need to be integrated into the SADI system manually. In addition, SADI, like its SQL based predecessor described in Lambrix and Jakoniene [83], has not dealt with the issues of context sensitivity or replicability when dataset structures materially change, as the Bio2RDF datasources have done significantly in the last two years.

6.6 Comparison to other systems

There are a number of other systems that are designed using a model based on a single list of unique characteristics of each datasource, such as the properties and classes of information available in each location. These systems use this information to decompose a query into subqueries and join the results based on the original query. The prototype instead focuses on the flexible, content-agnostic RDF model as its basis for automatically joining all results. Although some other systems are designed to resolve SPARQL queries, they do not focus on RDF as the sole model, with many utilising the SPARQL Results Format [7] as the standard format for query results, in the same way that previous systems used SQL result rows.

[7] http://www.w3.org/TR/rdf-sparql-XMLres/

The SPARQL Results Format is a useful method of communicating the results of complex queries without having to convert the traditional relational row-based results processing algorithms into the fully flexible RDF model. However, it does not enable the system to directly integrate the results from two locations without knowing what each of the results means in terms of the overall query. If the query is not a typical query looking for a set of result rows, but is instead looking for a range of information about items, then the variables from different results will not join easily, as they are defined in the query as text and do not form RDF documents. However, any SPARQL SELECT query, which would otherwise return results using the SPARQL Results Format, can be rewritten as an equivalent SPARQL CONSTRUCT query that returns RDF triples.
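As an example of such a rewrite, the sketch below pairs a SELECT query with the equivalent CONSTRUCT form; the record URI is the Entrez Gene example used elsewhere in this chapter, and the Java constants are purely illustrative.

    public class SelectToConstruct {
        // A typical SELECT query, returning rows in the SPARQL Results Format.
        static final String SELECT_QUERY =
            "SELECT ?p ?o WHERE { <http://bio2rdf.org/geneid:4129> ?p ?o }";

        // The equivalent CONSTRUCT query, returning the same information as
        // RDF triples that can be merged directly with results from other
        // endpoints using standard RDF graph merging.
        static final String CONSTRUCT_QUERY =
            "CONSTRUCT { <http://bio2rdf.org/geneid:4129> ?p ?o } "
            + "WHERE { <http://bio2rdf.org/geneid:4129> ?p ?o }";

        public static void main(String[] args) {
            System.out.println(SELECT_QUERY);
            System.out.println(CONSTRUCT_QUERY);
        }
    }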

The prototype can be integrated with other RDF based systems using SPARQL CONSTRUCT or DESCRIBE queries, URL resolution, or other methods as they are created in future. In all of the other frameworks, the distribution of queries across different locations relies on knowledge about the structure of the data that will be returned, as queries are planned based on the way that the data from different endpoints will be joined. In the model and prototype, the distribution is based on knowledge about the high level datasources that are accessible using each endpoint, with the possibility that all datasources will be available using a single default endpoint without needing to specify each of the datasources. These methods make it possible to construct complex, efficient queries without restricting users to particular types of queries, and without restricting the system to cases where it knows the complete details of the data accessible in each endpoint.

The prototype provides users with a way of integrating any of the RDF datasources that they have access to, including Web Services wrapped using SADI [143], distributed RDF databases wrapped using DARQ [111], and plain RDF documents that are accessible using URLs. This makes it possible for scientists to develop solutions in any infrastructure and for other scientists to directly examine or consume their results, whether or not the other scientists have a prior understanding of the ontologies used to describe those results. Although SADI provides direct access to RDF datasources using specifically designed code sources, its entire processing method relies on scientists knowing exactly which predicates were used to describe each dataset. This makes it necessary for each community to use a single set of predicates for any of the datasets that they wish to access with their query, or to produce code for the translation rules to make the SADI engine see a single set of predicates.

The prototype provides an environment where multiple sources of data from different contexts, such as open academic and closed business models, can be combined, and recombined in different ways by others, without having to publish limited access datasets completely in globally accessible locations. In DARQ, for example, the statistics relevant to each dataset must be accessible by each location to decide whether the query is possible, and if so, whether there is an efficient way to join the results. In the prototype, the relevant pieces of information about each dataset can be published, and if the dataset is not intended to be directly accessed, the prototype can be used to publish the set of operations that the public can perform on the dataset. In SADI, it is necessary to use a single service repository, as the query distribution plan is compiled in a single location using both the coded programs and the service details, making it difficult to transport current queries or integrate service repositories, even if a scientist can establish their own repository for their own private needs.

The prototype makes it possible to integrate different datasets by encoding the normalisation rules in a transportable configuration file and by making sure that namespaces are always recognised using an indirect method, so that there is not a single global definition of which URI is used to identify each namespace or dataset. In SADI, namespaces are either recognised using custom code for each URI structure, based on the prefix definitions in the Life Science Resource Name website [8], or they are recognised using the literal prefix string representing the namespace.

In comparison, the prototype makes it possible for scientists to internally redefine namespaces using the same prefix as an external namespace, while linking it to their own URI, making it simple to distinguish between their use of the prefix and other uses of it. It is not possible to use a prefix to name more than one namespace if one relies on Inverse Functional Properties to name an RDF object using a database name prefix and the identifier without a unique predicate URI, as the Gene Ontology recommends in its RDF/XML format guidelines [9]. The prototype encourages Linked Data style HTTP URI references to resources so that they can be accessed using the prototype's query methods: using the HTTP method if available, or using the entire HTTP URI as an identifier if another method is used in the future.

[8] http://lsrn.org/
[9] http://www.geneontology.org/GO.format.rdfxml.shtml


Chapter 7

Conclusion

The research presented in this thesis aims to make it possible for scientists to work with distributed linked scientific datasets, as described in Section 1.1. Current systems, described in Chapter 2, that attempt to access these datasets encounter problems including data quality, data trust, and provenance, which were defined and described in Section 1.2. These problems, shown in Figure 1.6, are inherent in current distributed data access systems.

Current systems assume that data quality is relatively high, that scientists understand and trust the datasets that are publicly available in various locations, and that other scientists will be able to replicate the entire result set using the exact method as published. In each case, scientists need to have control over the relevant methods and datasources to integrate, replicate, or extend the results in their particular context, and they must be able to do this with minimal effort. Solutions such as copying all datasets to a local site before normalising and storing them in a specialised database, or alternatively relying exclusively on public web services that may not be locally available or reliable, do not form a good long term solution for access to both local and public datasets.

The query model described in Chapter 3 enables scientists to integrate semantically similar queries on multiple datasources into a single, high quality, normalised output format without relying exclusively on a particular set of data providers. The query model distinguishes between user queries, templated query types, and data providers, making it possible to add and modify the way existing user queries are resolved by adding or modifying templated query types or data providers. In addition, users can create new normalisation rules to modify queries or the results of queries without requiring any community negotiation, as is necessary with the single global ontology systems that have been created in the past.

The query model was implemented in the prototype web application as described in Chapter 5. The prototype makes it possible for scientists to expose their queries and configuration information as HTTP URIs. The use of HTTP URIs enables scientists to use the information in different processing applications, while sharing the computer understandable query methods and results with other scientists to enable further scientific research.


The prototype is designed to be configured in a simple manner. A prototype installation can be configured from scratch, or public sources of configuration information can be included and selectively used based on the user's profile; an example of a public configuration is the Bio2RDF configuration [1]. Users can add and substitute new providers without requiring changes to current providers and queries. The use of profiles from the query model enables scientists to choose local sources in preference to non-local sources where they are available. Profiles allow scientists to ignore datasources, queries, or normalisation rules that are untrusted, without actually removing them from the published configuration files that they import into their prototype.

Namespaces, corresponding to sets of unique identifiers, can be assigned to data providers and queries to perform efficient queries using the prototype. Namespaces enable the prototype to distribute queries across providers that provide the same datasource, without relying on the method that was used to implement any of the providers, or on any declarations by the data producer about which datasets are present in a location.

The RDF based provenance and configuration information that is provided by the prototype can be processed together with the RDF data when it is necessary to explain the origin of the data along with the data cleaning and normalisation methods that were relevant. This enables scientists to integrate the prototype with workflow management systems, although there is not yet widespread support for RDF processing in workflow management systems. Chapter 4 described the integration of the model with medicine and science, and showed that it can be used for exploratory purposes in addition to regular workflow processing tasks.

The model can be used to access untrusted datasets, which is necessary for exploration and evaluation purposes. However, it is simultaneously possible to restrict trusted queries to only those datasources deemed necessary by a scientist. This reduces the number of datasets that they need to evaluate before trusting the results of a query.

The model is implemented as a prototype web application that can be used to perform complete queries on each data provider for a particular query and to normalise the results; the scientist is then responsible for performing further queries. In comparison to other systems, the prototype does not attempt to automatically optimise the amount of information being transferred, and it does not answer arbitrary SPARQL queries. It has native support for de-normalising queries to match data in particular endpoints and for normalising the results, something that other systems do not cater for, but which is important for consistent, context sensitive integration of distributed linked scientific datasets.

The use of the prototype in both private and public scenarios, including the public Bio2RDF.org website [2] and private uses of the open source software package [3], provides evidence for the success of the model in improving the ability of scientists to perform queries across distributed linked scientific datasets.

[1] http://config.bio2rdf.org/admin/configuration/n3
[2] http://bio2rdf.org/
[3] http://sourceforge.net/projects/bio2rdf/


The scientific case studies described in Chapter 4, together with the related systems described in Chapter 2, show that there are no other models or systems designed to enable scientists to independently access and normalise a wide variety of data from distributed scientific datasets. Other systems generally fail by assuming constant data quality and semantic meaning or uniform references, or by requiring scientists to have all datasets locally available.

The prototype was shown, in Section 4.7, to be simple to integrate with workflow management systems using HTTP URI resolution as the access method and RDF as the data model, with no normalisation steps necessary in the workflows. Workflows are ideally suited to filtering and parallelisation, and they were used in this way to link the results of queries using the prototype with further queries.

The discussion in Chapter 6 shows that the prototype is useful by virtue of the way it was used by the Bio2RDF project to integrate a large range of heterogeneous datasets published by different organisations. In addition, the Bio2RDF configuration was manipulated using profiles both to make the Bio2RDF mirrors work efficiently, and to let private users add and remove elements of the configuration using private configurations.

The model and prototype make it possible for scientists to understand and define the data quality, data trust, and provenance requirements related to their research by being able to perform queries on datasets, including understanding where datasets are located and exactly which syntactic and semantic conventions they use. In reference to the issues shown in Figure 1.6, the model and prototype provide practical support as illustrated in Figure 7.1.

The model and prototype could be improved in the future to include more features, such as linked queries and named parameters, that may be beneficial to scientists, as described in Section 7.3. In particular, the prototype could be extended to support different types of normalisation, including logic-based rules and query transformations.

7.1 Critical reflection

The model provided a simple way to align queries with datasets, and the prototype provided verification of the model's usefulness through its use as the manager for queries to the Bio2RDF website. The prototype phase influenced the design of the model, as the implementation of various features required changes to the model. In addition, the Bio2RDF datasets were simultaneously updated, requiring the configuration for the prototype to be maintained and updated out of step with the development and release cycle.

7.1.1 Model design review

The model provides a number of loosely linked elements, centralised around providers. Although they act as a central point of linkage, providers still have a many to many relationship with endpoints.


[Figure 7.1: Overall solutions. The figure maps the issues from Figure 1.6, grouped into data, medicine, science, and biology, to the corresponding model features, including Linked Data, profiles, preferred and alternative namespace labels, namespaces, normalisation rules, query types, data providers, static RDF insertions, the RDF data model, and HTTP.]


Each provider may have more than one, redundant, endpoint, while each endpoint may be associated with more than one provider, in the context of different namespaces and/or query types.

The entry points, the query types, provide a many to many relationship between user queries and namespaces. This enables the construction of arbitrary relationships between queries and namespaces, as namespaces can be recreated using different URIs to avoid collisions on the same prefix, and normalised using the preferred and alternate prefix structures. Although the prototype only implemented a single query to namespace mapping method, using Regular Expressions, new mapping methods could be created to transform namespaces using more intelligent mechanisms, including database lookups and SPARQL queries, to transform the input in arbitrary ways.
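A minimal sketch of the Regular Expression mapping method follows, extracting the namespace prefix and identifier from a user query path such as "geneid:4129"; the pattern is an illustrative assumption rather than the configured Bio2RDF expression.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class QueryNamespaceMapping {
        // Hypothetical input pattern for a query type: a namespace prefix,
        // a colon, and the remainder of the path as the identifier.
        private static final Pattern INPUT = Pattern.compile("^([\\w-]+):(.+)$");

        public static void main(String[] args) {
            Matcher m = INPUT.matcher("geneid:4129");
            if (m.matches()) {
                String namespace = m.group(1);  // selects query types and providers
                String identifier = m.group(2); // substituted into query templates
                System.out.println("namespace=" + namespace + " identifier=" + identifier);
            }
        }
    }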

In comparison to provider-based normalisation rules, this mechanism is not easy to share between query types, but in all of the Bio2RDF cases the preferred and alternate prefix feature of namespace entries was enough to normalise namespaces. In a future revision these rules could be defined separately, or normalisation rules could be integrated with both query types and providers.

The loosely linked query elements are very useful, particularly in the way query replication can be performed without having to modify hard links between queries and data sources. However, this looseness makes it difficult to determine a priori which elements will be used for a particular type of query without a concrete example, particularly in a large configuration such as that used for Bio2RDF. One example was the DOI (Digital Object Identifier) namespace, which was largely represented in RDF datasets using textual references. Since a query can define itself as being relevant to all namespaces, the list of namespace URIs attached to a query type does not immediately identify all of the queries that are relevant. This made it difficult to audit the DOI datasets without performing queries to see which datasets were available.

As namespaces needed to be independent of the required query model, to support arbitrary queries, a similar mechanism was created for providers: the default provider setting. In addition, to enable queries to require explicit annotation of providers with a namespace before sending a query to a provider, query types were able to specify that they were not compatible with default providers. This design choice was very useful according to the criteria for this thesis, but in the majority of cases these features were not required by Bio2RDF, and they made it slightly more difficult to verify that the configuration was acting as desired.

However, in cases where there was a definite query, it was always possible to determine the relevant query types, namespace entries, providers and normalisation rules without performing any network operations. In comparison, some federated SPARQL systems, such as SADI, cannot determine which datasources will be required for a query, as they need to continue to plan the query while they are executing it. In addition, it is impossible, by design, to get all properties of a data record from a SADI SPARQL query, as there are no unknown predicates in such a query. This inherent lack of knowledge about the extent of a single query makes it difficult to replicate using the resulting query provenance; the query needs to be executed again using the full SADI service registry to be correct.

In comparison, queries using the model described in this thesis are replicable solely using the query provenance information, provided that the content of the datasets does not change. This makes it practical to publish the query provenance for others to use in similar ways, even if they do not use the same query, or if they want to selectively add or remove providers to match their context. If the namespace parameters do not change, the query plan for the first query will be identical for future queries, eliminating the need for a central registry to plan similar queries after the first query.

Although it would have been possible to define parameters at the query group level, this would have had a negative effect on replicability and context sensitivity. If all queries in a query group were required to accept the same parameters, then general query types, which can answer a wide range of queries, would not be allowed. This would harm replicability where these general query types are necessary to provide redundant access to data. The model defines parameters at the level of the query type to provide for any possible mapping between a query and the parameters for that query. This makes it possible to pass queries directly through to lower levels, or to deconstruct them into parameter sets for direct resolution. If a query type inherited its parameter mappings from its query group, it might not be implementable in these situations without creating an entirely new query group, which would defeat the purpose of query groups as a semantic grouping of similarly purposed query types.

7.1.2 Prototype implementation review

The prototype was implemented using Java Servlets, along with the URLRewrite library to support HTTP URI queries in a lightweight manner, and the Sesame RDF library to support internal RDF transformations. This made it very flexible during development, as the URL structures were easy to modify using URLRewrite when necessary, and the Servlets did not need to know about the query method as they received a filtered path string. In addition, queries to the model are designed to be stateless, reducing the complexity of the application.

The processing code was implemented using a single Servlet. Each instance of the servlet used a single Settings class to maintain knowledge about the configuration files, and a similarly scoped BlacklistController class to maintain knowledge about failed queries on endpoints, and the frequency of client queries. The Settings class contained maps that made it possible to efficiently look up references between configuration objects using their RDF URIs, as objects were not tightly linked in memory using Java object references.
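A minimal sketch of this arrangement follows. The class and field names are illustrative assumptions rather than the prototype's actual API; the point is the single stateless servlet backed by shared, URI-keyed lookup maps.

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class QueryServlet extends HttpServlet {
        // Read-mostly configuration shared across requests: configuration
        // objects are looked up by their RDF URIs rather than being tightly
        // linked to each other by Java object references.
        private static final Map<String, Object> providersByUri = new ConcurrentHashMap<>();
        private static final Map<String, Object> queryTypesByUri = new ConcurrentHashMap<>();

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // URLRewrite has already filtered the URL, so the servlet only
            // sees a path string and never needs to know which URL pattern
            // the client actually used. No session state is kept.
            String queryPath = request.getPathInfo();
            // ... match queryPath against query types, select providers,
            // apply normalisation rules, and write the RDF response ...
        }
    }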

The servlet was able to process multiple concurrent queries per second. However, the basic design would be difficult to optimise if there were a range of different normalisation rules in the common queries. For example, the RDF document needs to be processed as a stream of characters before being imported into an abstract RDF store to be processed for each query. Then, these RDF triples are integrated with RDF triples from other query types to be processed as a pool of RDF triples, in order to perform semantic normalisations that were not possible with the results from each provider. Then, in order to perform textual normalisation, the pool of RDF triples needs to be serialised to an RDF string that could have an arbitrary number of normalisations performed on it. All of the string normalisation stages could require the entire string, for example, an XSLT transformation. This made it difficult to generate a smooth pipeline.
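The stages described above can be outlined as follows. The rule interfaces and stage boundaries are simplifications made for this sketch, using the Sesame (org.openrdf) API that the prototype was built on.

    import java.util.Collection;

    import org.openrdf.model.Model;
    import org.openrdf.model.impl.LinkedHashModel;

    interface TripleRule { Model apply(Model triples); }
    interface TextRule { String apply(String document); }

    final class ResultPipeline {
        String process(Collection<Model> providerResults,
                       Collection<TripleRule> perEndpointRules,
                       Collection<TripleRule> poolRules,
                       Collection<TextRule> documentRules) {
            Model pool = new LinkedHashModel();
            for (Model result : providerResults) {
                // Stage 1: rules applied to each endpoint's triples separately.
                for (TripleRule rule : perEndpointRules) {
                    result = rule.apply(result);
                }
                pool.addAll(result);
            }
            // Stage 2: semantic normalisation over the combined pool of triples.
            for (TripleRule rule : poolRules) {
                pool = rule.apply(pool);
            }
            // Stage 3: serialise, then apply textual rules; an XSLT-style rule
            // may need the whole serialised string, which is what prevents a
            // smooth streaming pipeline.
            String document = serialise(pool);
            for (TextRule rule : documentRules) {
                document = rule.apply(document);
            }
            return document;
        }

        private String serialise(Model pool) {
            // Placeholder: a real implementation would use Sesame's Rio
            // writers to serialise in the format the client requested.
            return pool.toString();
        }
    }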

In comparison, the BioMart transformation pipeline assumes that records are independent of each other, so that it can process them in a continuous stream, using ASCII control characters (\t for the end of a field and \n for the end of a record) in a byte stream to designate the start and end of records. The Bio2RDF case could not assume relational database style records with identical fields in each record, as search services, for example, may provide arbitrary results. It also required that the results could be normalised from each endpoint individually, to fix errors in particular datasets, and that results could be normalised across all endpoints for a query, to make it possible to perform semantic queries across datasets.
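The streaming assumption can be stated in a few lines: because '\n' always ends a record and '\t' always ends a field, a consumer can handle records one at a time without buffering the whole response. This sketch illustrates the BioMart-style record format only, not any actual BioMart client API.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.function.Consumer;

    final class RecordStream {
        // Processes one record at a time from a continuous stream, relying
        // on '\n' ending each record and '\t' ending each field.
        static void forEachRecord(Reader input, Consumer<String[]> handler) throws IOException {
            BufferedReader reader = new BufferedReader(input);
            String record;
            while ((record = reader.readLine()) != null) {
                handler.accept(record.split("\t", -1));
            }
        }
    }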

The servlet regularly handled documents containing up to 30,000 RDF triples, as this range was required to process queries on the RDF triples in the configuration, although most results were limited to 2000–5000 triples. In smaller datasets, records were very small if they were only sourced from their official dataset, but if they were highly linked to from other datasets, the results were much larger, resulting in a very large document. In other cases, records are relatively small only if internal links are not taken into account. For example, in the Wikipedia dataset, the inter-article links are so numerous that DBpedia was forced to omit the page links dataset from the default Linked Data URI, as it was causing lightweight Linked Data clients to crash due to the large number of triples. The Bio2RDF installation of the prototype was able to process and deliver the full Wikipedia records, including the page links, although the dbpedia.org SPARQL endpoint had a high traffic flow, so SPARQL queries sometimes failed to complete.

7.1.3 Configuration maintenance

The configuration defines the types of queries, data providers, normalisation rules, and namespace entries that are to be used to resolve queries. It is flexible, and when used with each user's set of profiles, it can be used differently based on the user's context, rather than on something that is built into the basic configuration.

In comparison to Federated SPARQL systems, it is not limited by the basic descriptions of the statements and classes in an endpoint. Normalisation rules can be written to reform the URIs, properties and classes that are the core of Federated SPARQL configurations. However, this flexibility requires more effort, as the Federated SPARQL descriptions can be completely generated without human input, and most are statically generated by producers rather than generated by users.


In order to automatically generate configuration information for the prototype, additional information is required, defining the rules and queries that would be used for each namespace. This registry would then be used together with Federated SPARQL descriptions to generate a complete configuration, which could then be selectively customised by scientists using profiles.

The majority of the configuration management tasks involve identifying changes to datasets and deciding how to either take advantage of the change, or isolate the change if it will interfere with current queries. For example, if a new data quality issue was identified for an endpoint, a new normalisation rule could be added to the data providers. If the data provider was previously thought to be consistently applicable to a range of endpoints, the data provider may need to be split into two data providers to isolate the data quality issue.

The RDF statements that make up the prototype configuration files can be validated using a schema. The prototype schema is generated dynamically by the prototype itself. The current version can be resolved from http://purl.org/queryall/, which in turn resolves the schema using an instance of the prototype. Past versions can be generated using past prototype versions, or by passing the current prototype the desired configuration version as part of the schema request.

7.1.4 Trust metrics

Trust metrics are used to give users an indication of the trustworthiness of items of unknown quality. They generally take the form of numeric scales: for example, there may be a 5 star hotel, or an 8 star movie. This model does not attempt to define trust metrics internally. In centralised registries it is appropriate to compile information from all users and generate ratings based on the wisdom of the crowds. However, the model is designed to be extensible based on unique contexts.

If a Web Service is given a trust metric based on its uptime, it may not be relevant to a user that has reimplemented the service exactly in their own location. Similarly, if a user perceives that a data provider is generating incorrect, or non-standard, data, they may rate the data provider lowly. However, the model provides the opportunity for others to generate wrappers to improve data quality, using normalisation rules for transformations. The wrappers are defined in a declarative manner, enabling their authors to publish the resulting rules for others to examine and reuse. If others reuse the wrapper, then effectively they are trusting the dataset. This contradicts the trust opinions given to the basic data provider interface, even though the data provider is providing the same information.

Instead, the trust in a data provider was defined based solely on whether the dataset was used or not. Although this definition provides clues about useful data providers, it does not reliably identify false negatives. That is, it does not correctly categorise providers that are not used, but would be trusted if there were no other contextual factors, such as the performance capacity of the provider or the physical distance to the provider.


The data trust research question, defined in Section 1.3, relies on scientists identifying the positive aspects of trust. Scientists are able to identify trusted datasets and queries based on examining published configurations that used the datasets or queries. They can then form their own opinions to generate further publications, which would increase the rank of a particular provider in the view of future peers.

7.1.5 Replicable queries

The model provides a framework for scientists to use when they wish to define and communicate details of their experiments to other scientists. The query provenance documents combine the replication details with the data cleaning methods that were thought to be necessary for the queries to be successful and useful. They also contain indications of positive trust ratings and the general data quality of each data provider.

The prototype can take query provenance details and execute the original query in the current scientist's context, with selections and deletions defined as extra RDF triples. That is, future scientists do not have to remove any triples from a query provenance document to replicate it, as any changes are additions, together with changes to the profiles used to replicate the queries.

If a new normalisation rule is needed, it can be added to the original provider by extending the original provider definition with an extra RDF triple. This extension can occur in another document if necessary, as the RDF triples are all effectively linked after they are loaded into an RDF triplestore by the prototype. In addition, non-RDF systems that scientists may currently use can be integrated using normalisation rules to transform the content to RDF, including XML using XSLT, and textual transformations of formats such as CSV [88], or they can be accessed using queries to conversion software such as BioMart [130] or D2R [29]. Any textual queries are supported by the model, including URLs and HTTP POST queries.
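As a concrete illustration of extending a provider from a separate document, the following sketch parses one additional RDF triple that attaches a new rule to an existing provider, using the Sesame Rio parser. The vocabulary and resource URIs are invented for the example; only the mechanism, loading extra triples alongside the original definition, reflects the text above.

    import java.io.StringReader;

    import org.openrdf.model.Model;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.Rio;

    final class ProviderExtension {
        // Returns a single extra triple which, once loaded into the same
        // triplestore as the original configuration document, links a new
        // normalisation rule to the unmodified original provider.
        static Model extraTriple() throws Exception {
            String turtle =
                "@prefix ex: <http://example.org/queryall/> .\n"
                + "ex:originalProvider ex:normalisationRule ex:newDateFixRule .\n";
            return Rio.parse(new StringReader(turtle), "http://example.org/", RDFFormat.TURTLE);
        }
    }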

If a query does not function properly in the scientist's context, they can ignore the query type using their profile, and create a replacement using a different URI. They can then link this new query type URI into the original provider.

7.1.6 Prominent data quality issues

The use of the prototype as the engine for the Bio2RDF website revealed a range of data quality issues, many of which were introduced in Section 1.2. These included data URIs that did not match between different locations, a range of predicates that were either not defined in ontologies, or defined to be redundant with other ontologies, and a lack of consistency in the use of some predicates between locations.

The issue of different URIs for the same data records was identified as a major issue by the Bio2RDF community. Although it was not an issue within the Bio2RDF datasets, it affected the ability to source extra links from other providers. The Bio2RDF resolver has a major requirement to be a single resolver for information about different data records, while including references to the original URIs after the normalised URI is resolved, whether they are resolvable or not. This requirement made it necessary to include both the original URI and the normalised URI in some cases, meaning that the URI normalisation could not be applied in a single step. Some statements needed to be included without being normalised together with the results from each endpoint. In order to allow for all possibilities, there were two stages included after these statements were inserted, one with access to RDF triples, and one with access to the serialised results in whatever format was requested. For example, http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001 was equivalent to http://bio2rdf.org/drugbank_drugs:DB00001, and when the original RDF triples were imported the URI needed to change, but after the equivalent RDF triples were inserted, the URIs needed to remain unchanged. This was achieved by applying the URI normalisation rules at the "after results import" stage, before the equivalence triples were inserted, with further normalisation performed in the "after results to pool" and "after results to document" stages.
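To make the staging concrete, a minimal sketch of such a URI rewrite follows. The class name, the stage constant, and the use of a plain string replacement are all illustrative assumptions; the prototype expressed such rules declaratively in RDF rather than as Java code.

    // Hypothetical sketch of the drugbank URI rewrite described above.
    // Applying it only at the "after results import" stage means that
    // equivalence triples inserted later keep the original URIs intact.
    final class DrugbankUriRule {
        // Illustrative stage name, not the prototype's actual identifier.
        static final String STAGE = "after results import";

        static String apply(String rdfDocument) {
            return rdfDocument.replaceAll(
                "http://www4\\.wiwiss\\.fu-berlin\\.de/drugbank/resource/drugs/",
                "http://bio2rdf.org/drugbank_drugs:");
        }
    }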

The Bio2RDF datasets did, however, have issues with inconsistent URIs over time for predicates and classes. These issues were solved over a period of time, as best practices were identified in the larger RDF community. However, the lack of resources for maintenance of the RDF datasets meant that the Bio2RDF resolver needed to recognise and normalise 5 different URI styles while the migrations of older datasets were performed.

For example, an ontology URI was originally written as http://bio2rdf.org/bio2rdf#property. This was acceptable, but not ideal, as the property would not be resolved, due to the fragment "#property" being stripped before HTTP resolution. In order to isolate these cases from other URI resolutions, the ontology URI standard was changed to http://bio2rdf.org/ns/bio2rdf#property. The base URL in this URI was designed to resolve to a full ontology document containing a definition for the property. However, this was not ideal in cases where the property needed to be resolved specifically, so it was changed to http://bio2rdf.org/ns/bio2rdf:property. This URI was completely resolvable and distinguishable from other URIs, but it was not consistent with some other ontologies that were defined using the normal Bio2RDF URI format, http://bio2rdf.org/namespace:identifier. In response to this, a new namespace was created for each Bio2RDF ontology, for example, http://bio2rdf.org/bio2rdf_resource:property.

At one stage another alternative was also used experimentally, http://ontology.bio2rdf.org/bio2rdf:property; however, the use of a single domain name for all URIs was too valuable to ignore. The normalisation rules for each of these styles were, in practice, applied in series to eventually arrive at the current URI structure, although they could all have been modified to point directly to the current URI format.
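Applied in series, the historical rewrites described above can be sketched as follows; the exact rule set and ordering used in production are assumptions here, reconstructed from the URI styles listed in the text.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of normalising the legacy Bio2RDF ontology URI styles in
    // series; each step maps one historical style forward to the next.
    final class LegacyOntologyUris {
        static String normalise(String uri) {
            Map<String, String> stepsInSeries = new LinkedHashMap<>();
            stepsInSeries.put("http://bio2rdf.org/bio2rdf#", "http://bio2rdf.org/ns/bio2rdf#");
            stepsInSeries.put("http://bio2rdf.org/ns/bio2rdf#", "http://bio2rdf.org/ns/bio2rdf:");
            stepsInSeries.put("http://bio2rdf.org/ns/bio2rdf:", "http://bio2rdf.org/bio2rdf_resource:");
            stepsInSeries.put("http://ontology.bio2rdf.org/bio2rdf:", "http://bio2rdf.org/bio2rdf_resource:");
            for (Map.Entry<String, String> step : stepsInSeries.entrySet()) {
                uri = uri.replace(step.getKey(), step.getValue());
            }
            // e.g. ".../bio2rdf#property" ends up as ".../bio2rdf_resource:property"
            return uri;
        }
    }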

In some cases, datasets define their own properties that overlap with current properties. Although this research did not aim to provide a general solution to ontology equivalence, the normalisation rules, particularly SPARQL rules, can be used to translate between different, but equivalent, sets of RDF triples. It is particularly important for the future of Linked Data that there are standard properties that can be relied on to be consistent. For example, the RDF Schema label property could be relied on to contain a human readable label for a URI in RDF, although its meaning was redefined by OWL2, and can hence no longer be used for non-ontology purposes in cases where OWL2 may be used.

One term in particular was identified as being inconsistently interpreted, possibly due to a lack of specification and best practice information. The Dublin Core "identifier" property is designed to provide a way to uniquely identify an item in cases where the URI may be different. For example, it may be used to provide the unique identifier from the dataset that the item was originally derived from; in the drugbank case above this would be "DB00001". However, it may also be used to represent the namespace prefixed identifier, for example, "drugbank_drugs:DB00001". The SADI designers favour another specification that matches their Federated SPARQL strategy: that of a blank node with a key value pair identifying the dataset using a URI in one triple and the identifier in another triple attached to the blank node.4 Although this functionality may be useful, it requires a single URI to identify the dataset, which requires everyone to accept a single URI for every dataset. In the Bio2RDF experience outlined here, that is unlikely ever to happen.

4 https://groups.google.com/group/sadi-discuss/msg/d3e4289082098428

7.2 Future research

This thesis produced a viable model for scientists to use in cases where other scientists will need to replicate the research using their own resources. It incorporates the necessary elements for dynamically cleaning data, in a declarative manner so that other scientists could replicate and extend the research in future. However, it left open research questions related to studying the behaviour of scientists using the model with respect to trust and data provenance.

The model provides the necessary functions for scientists to individually trust any or all of the major elements, including query types, providers and normalisation rules. The trusted elements would then be represented in the query provenance record that a model implementation could generate for each query. Scientists could substitute their own implementations for these elements without removing the original implementations from the original provenance records.

This thesis did not study the reactions of scientists to the complexity surrounding this functionality, as it enables arbitrary changes to both queries and data, in any sequence. For example, there were two flags implemented in the prototype to control the implementation of the profiles feature. The first defined whether an element should be included if it implicitly matched a profile, and the second defined whether an element should be included if it was not excluded by any profile and did not implicitly match any profile. Although these rules are simple to define, different combinations of these two flags may change the results of a query in a range of different ways.

These flags were used to simplify the maintenance of the Bio2RDF system, as they reduced the complexity of the resulting profiles. The first flag, "include when implicitly matching", was enabled, and the second flag, "include when no profiles match", was disabled. Future research could study the behaviour of the model using different combinations of these flags, in combination with the settings on each profile defining whether to implicitly include any query type or provider that advertised itself as being suitable for implicit inclusion.

Although they were mostly implemented to simplify maintenance, the flags, along with other flags inside named profiles, enable scientists to dynamically switch between different levels of trust without changing the other configuration information. A very low level of trust would set both of the overall flags to disabled and include a profile with all of its flags set to disabled. This level of trust would require scientists to specifically include all of the query types, providers and normalisation rules that they wish to use. In comparison, a very high level of trust may be represented by setting both of the overall flags to enabled, without using any profiles.
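The interaction of the two overall flags with explicit profile instructions can be summarised in a small decision function. The three-valued match result and the names below are assumptions made for this sketch; the prototype's actual logic covered more cases.

    // How an element's fate is decided under the two overall flags.
    enum ProfileMatch { EXPLICIT_INCLUDE, EXPLICIT_EXCLUDE, IMPLICIT_MATCH, NO_MATCH }

    final class ProfileDecision {
        static boolean include(ProfileMatch match,
                               boolean includeWhenImplicitlyMatching,
                               boolean includeWhenNoProfilesMatch) {
            switch (match) {
                case EXPLICIT_INCLUDE: return true;                          // explicit include wins
                case EXPLICIT_EXCLUDE: return false;                         // explicit exclude wins
                case IMPLICIT_MATCH:   return includeWhenImplicitlyMatching; // first overall flag
                default:               return includeWhenNoProfilesMatch;    // second overall flag
            }
        }
    }

Under this reading, the Bio2RDF settings described above correspond to include(..., true, false), while the very low and very high trust levels correspond to (false, false) with an all-excluding profile and (true, true) with no profiles, respectively.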

The query provenance records produced by the prototype included a subset of the complete configuration that the scientist had enabled at the time they executed the request. This may reduce the ability of scientists to change the trust levels when replicating queries using the provenance record. Setting broader levels of trust will only ever include the elements in the record, even if there were other possibilities in the prototype configuration. If, in the Bio2RDF example, the entire configuration were located in each provenance record, the provenance size would increase, depending on which elements matched a particular query. The size increase could be significant in terms of future processing of the query using the provenance record, and of storing and publishing the provenance record.

Future research could investigate the viability of this model, and the viability of a large scale combined data and query provenance model with respect to its usefulness and economic cost. For example, the query provenance, including the entire Bio2RDF configuration so that future scientists could fully experiment with novel profile and trust combinations, could be stored in a database. If it were stored using a single graph for each of the query provenance RDF records for the 7.4 million queries during the 11 month statistics gathering period, there would be in excess of 222 billion triples in the resulting database, as the configuration contained approximately 30 thousand triples. There were additional triples for each provenance record detailing the actual endpoints that were used and the exact queries that were used for each endpoint, which would increase the actual number of triples.
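The 222 billion figure follows directly from multiplying the two quantities above:

    7.4 \times 10^{6} \text{ provenance records} \times 3.0 \times 10^{4} \text{ triples per record} \approx 2.22 \times 10^{11} \text{ triples}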

For the evaluation of this thesis, a reduced set of this provenance was stored in a database, with the records only containing URIs pointing to the full query type and provider definitions that would be present in the full record. Most provenance research assumes that this cost will either be offset by the usefulness of the provenance in future, or be taken up by a new commercial provenance industry, perhaps modelled on the current advertising industry. A full examination of this would be required before recommending any large scale use of provenance information as a method of satisfying trust and replicability requirements. In comparison, the publication of a single large configuration, such as the Bio2RDF configuration, along with the methods used for each of the queries used to produce a given scientific conclusion, would enable flexible replicability using a range of trust levels, without incurring the permanent storage cost that is associated with permanent provenance, as 30 thousand triples can be stored and communicated relatively easily compared to the alternative of 222 billion.

In the Bio2RDF configuration there was a single profile for each of the three mirrors, along with a single default profile for other users of the prototype. These profiles were used to exclude items from the configuration based on knowledge about the location of the provider. Most legitimate changes in the context of the Bio2RDF mirrors were fixed by changing the master configuration, as they were bug fixes that improved the results of a query.

If a provider was no longer available, it was set to be excluded by default, without being removed. This made it possible to keep knowledge of the provider if, for example, it reappeared at a different location, without continuing to attempt to use the provider. If a dataset changed so that a normalisation rule was no longer required, the normalisation rule was either changed to be excluded by default, or removed from the provider, depending on whether it might have some use again in the future. Future queries against the Bio2RDF website would use the new configuration; however, past provenance records would still include their original elements, even if the query could not be easily replicated due to permanent disappearance of the provider, or a non-backwards-compatible data change.

Future research could examine the effects of modifications to the Bio2RDF master configuration, perhaps in conjunction with past research into migration of ontologies [136]. There are conflicting objectives influencing migration of configurations using the model. Firstly, if a public provider changes their data, then the URI for the provider would not be changed, or current profiles would be affected. The normalisation rules and namespaces applied to the provider would be added or deleted to match the real changes. However, if the old version of the provider was referenced in past provenance records, then those records could not be used together with the current configuration, or they could introduce conflicting normalisation rules with unknown combined effects. In this case it would be advantageous to change the URI of the provider; however, this would affect profiles that trusted the provider in both cases, as they may need to be changed to explicitly include the current URI to continue to have the same effect.

The research in this thesis does not examine ways to include and use identity information in configurations, other than to include it in namespaces to recognise the authority that defined a particular namespace. Future research could extend this thesis to examine the benefits of adding identity information, including individual scientists and the organisations they work for. One benefit might be to automatically identify the necessary citations for publications, based on the use of a particular group of datasets. Another might be to recognise scientific laboratory contributions, including the actual contributions from individual scientists in a team, which may currently only be identified as contributions from the authors on a resulting publication.


7.3 Future model and prototype extensions

The model may be extended in future to suit different scenarios, including those where complex novel queries are processed using information that is known about data providers. As part of this, it may be extended to include references to statistics as a way of optimising large multi-dataset queries. The prototype can be extended to provide different ways of creating templates and different types of normalisation rules to suit any of the current normalisation stages.

These changes would require the normalisation process to be intelligently applied to both the queries and the results of queries, although the way this may work is not immediately obvious. The independence of namespaces from normalisation rules may be an issue in automatically generating queries, as currently the prototype only includes an informative link between normalisation rules and any namespaces that they are related to. The rationale for this is that the namespaces make it possible to identify many different datasets using a single identifier, while the normalisation rules independently make location specific queries functional in some locations without any reference to other locations. Future research could examine this aspect, along with other semantic extensions to the model to facilitate semantic reasoning on the RDF triples that make up the configuration of the prototype.

The scientific community has endeavoured on a number of occasions to create schemes that require a single URI for all objects, regardless of how that would affect information retrieval using the URIs, as the gains for federated queries are thought to be more valuable than the losses for context dependent queries and URI resolution. Although the model encourages a normalised URI scheme for queries, to effectively translate queries and join results using the normalised data model across endpoints, it does not require it, so scientists can arbitrarily create rules and use URIs that manipulate queries and results outside of the normalised URI scheme.

The model could also be extended to recognise links between namespaces to indicate which namespaces are synonyms or subsets of each other. This could be used by scientists to identify data providers as being relevant, even if the provider did not indicate that the namespace was relevant. This would enable scientists to at least discover alternative data providers, even if the related queries and normalisation rules may not be directly applicable, as they may be specific to another normalised URI scheme.

The implementation of named parameters would need to ensure that if two services were implemented differently, they would still be used in parallel without having conflicting named parameters. Regular Expression query types are currently limited to numerically indexed parameters. If other query types wished to be backwards or cross compatible with Regular Expression query types, they would need to support similar parameters; for example, they could support the "input_NN" convention used by Regular Expressions.

Named parameters may restrict the way future scientists replicate queries if they have semantic connotations. For example, the two simplest named parameters, "namespace" and "identifier" from http://example.org/$$namespace$$:$$identifier$$, may conflict if "identifier" actually meant a namespace. For example, to get data about "namespace", one may use http://example.org/$$ns$$:$$namespace$$, which semantically connotes two namespaces, "ns" and "namespace", resulting in a possible semantic conflict with other query types.
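A sketch of the template expansion implied by these examples follows; the "$$name$$" delimiters come from the URIs above, while the class and method names are invented for illustration.

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class QueryTemplates {
        private static final Pattern PARAMETER = Pattern.compile("\\$\\$([A-Za-z0-9_]+)\\$\\$");

        // Replaces each $$name$$ token with the value bound to that name;
        // numerically indexed parameters simply use names such as "input_1".
        static String expand(String template, Map<String, String> parameters) {
            Matcher matcher = PARAMETER.matcher(template);
            StringBuffer result = new StringBuffer();
            while (matcher.find()) {
                String value = parameters.getOrDefault(matcher.group(1), "");
                matcher.appendReplacement(result, Matcher.quoteReplacement(value));
            }
            matcher.appendTail(result);
            return result.toString();
        }
    }

For example, expand("http://example.org/$$namespace$$:$$identifier$$", Map.of("namespace", "drugbank_drugs", "identifier", "DB00001")) yields http://example.org/drugbank_drugs:DB00001. Nothing in the expansion interprets the names semantically, which is why the "ns"/"namespace" clash described above can only be avoided by convention.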

The current model is designed as a way of accessing data for single step queries, as opposed to multi-step workflows. However, the complexity of queries can be arbitrarily defined according to the behaviour of the query on a provider, with ordered normalisation rules filtering the results. In future, this may be extended to include linked queries that directly reference other queries. The model for this behaviour would need to decide what information would be required for scientists to be able to create new sub-queries without having to change the main query to fit their context.

The current model assumes that the parameters are not directly related to a particular query interface, so that they can be interpreted differently based on the context of the user. This is similar to workflow management systems and programming languages, which both require particular query interfaces to be in place to enable different contextual implementations. This research aimed to explore a different method that removed a direct connection between the user's query and the way data is structured in physical data providers. Given that this freedom was based on the goal of context-sensitivity, the model may not be viable for use with linked queries. It may, however, be reasonable to modify the model to include explicit interfaces that query types implement, as the query implementation would be separated from the interface and the data providers. In effect, the model only requires that the separation occur so that each part can be freely substituted based on the scientist's context, without materially changing the results, if the scientist can determine an equivalent way to derive the same results.

The model and prototype can be used to promote the reuse of identifiers and the standardisation of URIs. Perhaps surprisingly, normalisation rules can be used to integrate systems that would not otherwise be linked using URIs. This is due to the fact that, at the point that the RDF statements from the provider are combined with other results, novel unnormalised RDF statements relating the normalised URIs to any other equivalent strings or URIs can also be inserted into the pool of RDF results. This provides a solution to the issue of URI sustainability, as in the future, any other normalised URI scheme can provide normalisation rules that map the future scheme back to current URIs if a particular organisation did not maintain or update their data using currently recognised URIs. Using this system, scientists can publish their data using URIs that have not yet been approved by a standards body with the knowledge that they can migrate in future without reducing the usefulness of their queries.

The prototype should be extended in future to allow static unnormalised RDF statements to derive information from the results of queries. This could be implemented by adding an optional SPARQL SELECT template to each query type that could be used to fill template values dynamically, based on the actual results returned from the provider, between the parsing and inclusion-in-the-pool normalisation stages. This method would provide further value to the model, as it allows scientists to derive partially normalised results without relying on variables being present in their query. This would make it possible to dynamically insert unnormalised links to URIs that are not transformable using textual methods. For example, this would allow scientists to dynamically derive the relationship between the two incompatible DailyMed reference schemes shown in Figure 1.11 (i.e., "AC387AA0-3F04-4865-A913-DB6ED6F4FDC5" is textually incompatible with "6788"), even if the rest of the result statements contained normalised URIs.

The system does not need to rely on Linked Data HTTP URIs and RDF. If another well known resolution method, with a suitable arbitrary data model, is developed in the future, the prototype could easily be changed to suit the situation, and the model may not be affected. The model does not rely on HTTP requests for input, so it does not need to follow other web conventions regarding the meaning of HTTP requests for different items. For example, the model could be designed to fit with 303 redirects, but it would be entirely arbitrary to require this for direct Java queries into the system, so it does not require it. The metadata which forms the basis for the extended 303 discussion can be transported across a different channel as necessary. It is up to the scientist using the data to keep the RDF triples relating to the document separate from the RDF triples relating to the document provenance, if they do not wish to process them together.

The model allows scientists to change where their data is provided without changing their URIs, or relying on entire communities accepting the new source as authoritative for a dataset. For example, some schemes that are not designed to be directly resolvable to information, such as ISBNs, may be resolvable using various methods now and in the future without the scientist changing either their workflows or their preferred URI structure. This eliminates the major issue that is limiting the success of distributed RDF based queries, without imposing a new order, such as LSIDs.

The prototype focuses on RDF as its authoritative communication syntax because of its simple extensibility and relatively well supported community. If there were another syntax that both allowed multiple information sources to be merged by machine into a useful combined document, without semantic conflict, and enabled universal references, such as those provided by URIs, it may be interchangeable with RDF as the single format of choice. The provision for multiple file formats for the RDF syntax may also be used in future to generate more concise representations of information, as the currently standardised file formats, RDF/XML and NTriples, are verbose and may prevent the transfer of large amounts of information as efficiently as a binary RDF format, for example.

The prototype is designed to be backwards compatible as far as possible with older style configurations, since the configuration syntax changed as the model and prototype developed. The configuration file is designed based on the features in the model, so other implementations can take the configuration file and process it using the model theory as the semantic reference. The prototype has the ability to recognise configuration files designed for other implementations, although it may not be able to use them if they require functionality that is not implemented in, or is incompatible with, the prototype.

The prototype was used to evaluate the practical methods for distributing trusted information about which queries are useful on different data providers, including references to the rules that scientists used with the data and the degree of trust in a data provider. However, the profiles mechanism is currently limited to include and exclude instructions; these do not directly define particular levels of trust, other than through the use of the profile by scientists in their work. It could be extended in future to provide a semantically rich reference to the meaning of a profile inclusion or exclusion. This would give scientists additional trust criteria when deciding whether a profile is applicable to them.

In this study, a constraint that was used for profile selection by the configuration maintainers was the physical location of a data provider. This constraint was not related to the main objectives of data quality, data trust and replicability, although it was related to the context sensitivity objective. The profiles were not actually annotated with the ideal physical location for users of the profile; this was deemed not to be vital to the research goals, and there were only three locations in the Bio2RDF case study. Although the prototype would still need to make a binary decision about whether to include or exclude an item, the scientist could have different operating criteria for the prototype based on the distance to the provider. This may be implemented in future for each instance of the prototype, or using additional parameters to each query. This may include the IP address of the request being used with geolocation techniques to determine the best location to redirect the user to, or the public IP address of the server being used to geolocate the closest data providers, based on the geographic locations that could be attached to each provider.

The model is agnostic to the nature of the normalisation rules, making it simple to implement rules for more methods than just Regular Expressions and SPARQL CONSTRUCT queries. In particular, the prototype could be extended in future to support logic-based rules, to either decide whether entire results were valid, or to eliminate inconsistent statements based on the logic rules. Validation is possible using the model currently, as a normalisation rule could simply remove all of the invalid result statements from a given provider so that they would not be included, or include RDF triples in the response indicating a warning for possibly invalid results. However, it could also be possible to have formal validation rules, without modifying the provider interface, as the rule interface and its implementation can be extended independently.

7.4 Scientific publication changes

The scientific method, particularly with regard to peer review, has developed to match its environment, with funding models favouring scientists that continually generate new results and publish in the most prestigious journals and conferences [43]. The growth of the Internet as a publishing medium is gradually changing the status quo that academic publishers have previously sustained. In some academic disciplines, notably maths and physics, free and open publication databases such as arXiv5 are gradually displacing commercially published journals as the most common communication channel.

In the case of arXiv, the operational costs are mostly paid for by the institutions that use the database the most.6 In the absence of a commercial profit motive, arXiv publishes articles using an instant publication system that is very low cost. Given that much of the funding is provided by governments, ideally the results of the research should be publicly available.7 However, the initial issues with open access publication, the perceived absence of a genuine business model for open access publishers, and the absence of an authentic open peer review system8 make it difficult to resolve underlying issues relating to data access and automated replication of analysis processes.

The model proposed in this thesis encourages scientists to link data, along with publishing the queries related to a publication. In doing so, it allows scientists to freely distribute their methodology, both attached to a publication and on its own, if they annotate the process description. Even if the underlying data is sensitive and private, the publication of the methodology in a computer understandable form allows for limited verification. Ideally, datasets should be published along with articles and computer understandable process models, to allow simple, complete replication. Both publishers and specialised data repositories such as Dryad [140]9 or DataCite [33]10 could fill this niche.

The current scientific culture makes it difficult to change the system, even though technologically there is no reason why a low cost electronic publication model could not be sustainable. Critical peer review systems on the internet tend to be anonymous, as are many traditional peer review methods. However, on the internet, peer review systems typically do not include verification of the peer's identity. In groups of scientists within a field this should not be an issue, though, and there is technology available to determine trust levels in an otherwise anonymous internet system, via PGP key signing as part of a "Web of Trust" [56]. In the case of this thesis, this encourages scientists to cross-verify publications that they have personally replicated using the published methodology, although possibly in a different context. However, there is currently no financial incentive to verify published research, as grants are mostly given for new research. Lowering the barriers to replication and continuous peer review will reduce the financial disincentive, although it remains to be seen whether it would be sufficient to change the current scientific publishing culture.

Current scientific culture is designed to encourage a different perception of career researchers, compared with professional scientists or engineers. This may be evidenced in the order that authors are named on publications, even if it was patently obvious to their peers and readers of the publication that they did not perform the bulk of the research. It may also be evidenced by the omission of lab assistants from publication credit completely, in cases where publications are monetarily valuable. In the context of this research, it may prevent the direct identification of query provenance, as the raw data contributions in a publication may be attributed to the lead author on the publication, which would make it difficult to investigate the actual collection methods if they did not actually collect the data.

Researchers require the publication in order to continue receiving grants, while the income of professional scientists is linked to an employment contract that does not include publications as a success criterion. In evolutionary terms, this design favours the recognition of the career researchers who analyse and integrate the results. It offers no competitive advantage to those who do not succeed or fail based on the results. In terms of data provenance, this results in the data being attributed to the scientist, which may be accurate, considering their role in cleaning the raw data. Data provenance may be more important than query provenance in terms of verification; however, in terms of replicability, query provenance may be more useful, as it is more clearly useful for machines. Data provenance is generally informative; however, it could be used by machines to select and verify versions of data items and authors, if a viable model and implementation are created in future.

5 http://arxiv.org/
6 http://arxiv.org/help/support/arxiv_busplan_July2011
7 http://www.researchresearch.com/index.php?option=com_news&template=rr_2col&view=article&articleId=1102230
8 http://www.guardian.co.uk/science/2011/sep/05/publish-perish-peer-review-science
9 http://datadryad.org/
10 http://www.datacite.org


Appendix A

Glossary

BGP A Basic Graph Pattern in SPARQL that defines the way the SPARQL query maps onto the underlying RDF graph.

Context Any environmental factor that affects the way a scientist wishes to perform their experiment. These may include location, time, network access to data items, or the replication of experiments with improved methodologies and a different view on the data quality of particular datasets.

Data item A set of properties and values reflecting some scientific data that are accessed as a group using a unique identifier from a linked dataset.

Data provider An element of the model proposed in this thesis to represent the normalisation rules and query types on particular namespaces that are relevant to a given location.

Link The use of an identifier for another data item in the properties attached to a data item in a dataset.

Linked dataset Any set of data that contains links to other data items using identifiers from the other datasets.

Namespace An element of the model proposed in this thesis to represent parts of linked datasets that contain unique identifiers, to make it possible to directly link to these items using only knowledge of the namespace and the identifier.

Normalisation rule An element of the model proposed in this thesis to represent any rules that are needed to transform data that is given to endpoints, and returned from endpoints.

Profiles An element of the model proposed in this thesis to represent the contextual wishes of a scientist using the system in a given context.

Query type An element of the model proposed in this thesis to represent a query template and associated information that is used to identify data providers that may be relevant to the query type.


RDF Resource Description Framework : A generic data model used to represent linked data, including direct links between datasources using URIs. It is used by the prototype to make it possible to directly integrate data from different data providers into a single set of results for each scientist's query.

Scientist Any member of a research team who is tasked with querying data and aggregating the results.

SPARQL SPARQL Query Language for RDF : A query language that matches graph patterns against sets of RDF triples to resolve a query.

URI Uniform Resource Identifier : A string of characters that are used to identify shared items on the internet. URIs are used as the standard syntax for identifying unique items in RDF and SPARQL.


Appendix B

Common Ontologies

There are a number of common ontologies that have been used to make RDF documents on the internet recognisable by different users. RDF documents containing terms from these ontologies are meaningful to different users through a shared understanding of the meaning surrounding each term.

RDF Resource Description Framework Syntax : The basic RDF ontology is represented in RDF, requiring user agents to understand its basic elements without reference to any other external specification.1

RDFS RDF Schema : Provides a basic set of terms which are required to describe RDF documents and to represent restrictions on items inside of RDF documents.2

OWL Web Ontology Language : Enables more complex assertions about data than are possible in RDF Schema. It is generally assumed to be the common language for new web ontologies, although some endeavour to represent theirs in RDFS because of its restricted format, which makes reasoning more reliable. The use of some limited subsets of OWL is guaranteed to provide semantically complete reasoning.3

DC Dublin Core Elements : The commonly accepted set of ontology terms for describing metadata about online resources such as documents. It is referred to as the legacy set of terms, with the dcterms namespace being the current recommendation, although there are many ontologies which still utilise this form, including the dcterms set, which attempts to reference these terms wherever applicable to provide backward compatibility.4

DCTERMS Dublin Core Terms : Provides a revised set of document metadata elements including the elements in the original DC set, with domain and range specifications to further define the allowed contextual use of the DC terms. The original DC ontology was intentionally designed to provide a set of terms which did not have side-effects, and could therefore be used more widely without prejudice, but as the Semantic Web developed this was found to be unnecessarily vague, with user agents desiring to validate and integrate ontologies in non-textual forms.5

1 http://www.w3.org/1999/02/22-rdf-syntax-ns
2 http://www.w3.org/2000/01/rdf-schema
3 http://www.w3.org/2002/07/owl
4 http://purl.org/dc/elements/1.1/

FOAF Friend Of A Friend : Provides an ontology which attempts to describe relationships between agents on the internet in social networking circles.6

WOT Web Of Trust : Provides an ontology which formalises a trust relationship based on distributed PGP keys and signatures. It can be integrated with the FOAF model by utilising the model of hashed mailbox addresses which is unique to the distributed FOAF community.7

SKOS Structured Knowledge Organisation System : Aims to be able to represent hierarchical and non-hierarchical semi-ordered thesauri and similar collections of literary terms. It is utilised by the dbpedia project to represent the knowledge which is encoded inside of Wikipedia, and hence has been practically verified to some extent.8

5 http://purl.org/dc/terms/
6 http://xmlns.com/foaf/0.1/
7 http://xmlns.com/wot/0.1/
8 http://www.w3.org/2004/02/skos/core


Bibliography

[1] Gergely Adamku and Heiner Stuckenschmidt. Implementation and evaluation of a distributed rdf storage and retrieval system. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), pages 393–396, Los Alamitos, CA, USA, 2005. IEEE Computer Society. ISBN 0-7695-2415-X. doi: 10.1109/WI.2005.73.

[2] Hend S. Al-Khalifa and Hugh C. Davis. Measuring the semantic value of folksonomies. In Innovations in Information Technology, 2006, pages 1–5, November 2006. doi: 10.1109/INNOVATIONS.2006.301880.

[3] K Alexander, R Cyganiak, M Hausenblas, and J Zhao. Describing linked datasets. Proceedings of the Workshop on Linked Data on the Web (LDOW2009), Madrid, Spain, 2009. URL http://hdl.handle.net/10379/543.

[4] R. Alonso-Calvo, V. Maojo, H. Billhardt, F. Martin-Sanchez, M. García-Remesal, and D. Pérez-Rey. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J. of Biomedical Informatics, 40(1):17–29, 2007. ISSN 1532-0464. doi: 10.1016/j.jbi.2006.02.014.

[5] Joanna Amberger, Carol A. Bocchini, Alan F. Scott, and Ada Hamosh. McKusick's online mendelian inheritance in man (omim). Nucleic Acids Research, 37(suppl 1):D793–D796, 2009. doi: 10.1093/nar/gkn665.

[6] Sophia Ananiadou and John McNaught, editors. Text mining for biology and biomedicine. Artech House, Boston, 2006.

[7] Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine. Computational Linguistics, 33(1):135–140, 2007. doi: 10.1162/coli.2007.33.1.135.

[8] Peter Ansell. Bio2rdf: Providing named entity based search with a common biological database naming scheme. In Proceedings of BioSearch08: HCSNet Next-Generation Search Workshop on Search in Biomedical Information, November 2008.

[9] Peter Ansell. Collaborative development of cross-database bio2rdf queries. In eResearch Australasia 2009, Novotel Sydney Manly Pacific, November 2009.


[10] Peter Ansell. Model and prototype for querying multiple linked scientific datasets. Future Generation Computer Systems, 27(3):329–333, March 2011. ISSN 0167-739X. doi: 10.1016/j.future.2010.08.016.

[11] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Hogan, Scott Mann, and Paul Roe. Finding friends outside the species: Making sense of large scale blast results with silvermap. In Proceedings of eResearch Australasia 2009, 2009.

[12] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Michael Hogan, Scott Mann, and Paul Roe. Enhancing blast comprehension with silvermap. In 2009 Microsoft eScience Workshop, October 2009.

[13] Peter Ansell, James Hogan, and Paul Roe. Customisable query resolution in biology and medicine. In Proceedings of the Fourth Australasian Workshop on Health Informatics and Knowledge Management - Volume 108, HIKM '10, pages 69–76, Darlinghurst, Australia, Australia, 2010. Australian Computer Society, Inc. ISBN 978-1-920682-89-7. URL http://portal.acm.org/citation.cfm?id=1862303.1862315.

[14] Mikel Egana Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and Robert D. Stevens. Understanding and using the meaning of statements in a bio-ontology: recasting the gene ontology in owl. BMC Bioinformatics, 8:57, 2007. doi: 10.1186/1471-2105-8-57.

[15] Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, and Craig A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal of Intelligent and Cooperative Information Systems, 2:127–158, 1993. URL http://www.isi.edu/info-agents/papers/arens93-ijicis.pdf.

[16] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nature Genet., 25:25–29, 2000. doi: 10.1038/75556.

[17] Vadim Astakhov, Amarnath Gupta, Simone Santini, and Jeffrey S. Grethe. Data integration in the biomedical informatics research network (birn). Data Integration in the Life Sciences, pages 317–320, 2005. doi: 10.1007/11530084_31.

[18] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. Lecture Notes in Computer Science, 4825:722, 2007.

[19] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O'Donovan, Nicole Redaschi, and Lai-Su L. Yeh. The universal protein resource (uniprot). Nucleic Acids Research, 33(suppl 1):D154–159, 2005. doi: 10.1093/nar/gki070.

[20] JB Bard and SY Rhee. Ontologies in biology: design, applications and future challenges. Nat Rev Genet, 5(3):213–222, 2004. doi: 10.1038/nrg1295.

[21] Beth A. Bechky. Sharing meaning across occupational communities: The transformation of understanding on a production floor. Organization Science, 14(3):312–330, 2003. ISSN 1047-7039. URL http://www.jstor.org/stable/4135139.

[22] S. Beco, B. Cantalupo, L. Giammarino, N. Matskanis, and M. Surridge. OWL-WS: a workflow ontology for dynamic grid service composition. In e-Science and Grid Computing, 2005. First International Conference on, 8 pp., 2005. doi: 10.1109/E-SCIENCE.2005.64.

[23] François Belleau, Peter Ansell, Marc-Alexandre Nolin, Kingsley Idehen, and Michel Dumontier. Bio2RDF's SPARQL point and search service for life science linked datasets. Poster at the 11th Annual Bio-Ontologies Meeting, 2008.

[24] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. doi: 10.1016/j.jbi.2008.03.004.

[25] S Bergamaschi, S Castano, and M Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Record, 28:54–59, 1999. doi: 10.1145/309844.309897.

[26] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000. doi: 10.1093/nar/28.1.235.

[27] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American Magazine, May 2001. URL http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21.

[28] A Birkland and G Yona. Biozon: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics, 7:70, 2006. doi: 10.1186/1471-2105-7-70.

[29] C. Bizer and A. Seaborne. D2RQ – treating non-RDF databases as virtual RDF graphs. In Proceedings of the 3rd International Semantic Web Conference (ISWC2004), 2004.

[30] U. Bojars, J.G. Breslin, A. Finn, and S. Decker. Using the semantic web for linking and reusing data across web 2.0 communities. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):21–28, February 2008. doi: 10.1016/j.websem.2007.11.010.

[31] Evan E. Bolton, Yanli Wang, Paul A. Thiessen, and Stephen H. Bryant. Chapter 12 PubChem: Integrated platform of small molecules and biological activities. In Ralph A. Wheeler and David C. Spellmeyer, editors, Annual Reports in Computational Chemistry, volume 4, pages 217–241. Elsevier, 2008. doi: 10.1016/S1574-1400(08)00012-1.

[32] Shawn Bowers and Bertram Ludäscher. An ontology-driven framework for data transformation in scientific workflows. In Data Integration in the Life Sciences, pages 1–16. Springer, 2004. doi: 10.1007/b96666.

[33] Jan Brase, Adam Farquhar, Angela Gastl, Herbert Gruttemeier, Maria Heijne, Alfred Heller, Arlette Piguet, Jeroen Rombouts, Mogens Sandfaer, and Irina Sens. Approach for a joint global registration agency for research data. Information Services and Use, 29(1):13–27, January 2009. doi: 10.3233/ISU-2009-0595.

[34] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 539–550, New York, NY, USA, 2006. ACM Press. ISBN 1-59593-434-0. doi: 10.1145/1142473.1142534.

[35] Eithon Cadag, Peter Tarczy-Hornoch, and Peter Myler. On the reachability of trustworthy information from integrated exploratory biological queries. Data Integration in the Life Sciences, pages 55–70, 2009. doi: 10.1007/978-3-642-02879-3_6.

[36] Diego Calvanese, Giuseppe Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati, and Marco Ruzzi. Using OWL in data integration. Semantic Web Information Management, pages 397–424, 2010. doi: 10.1007/978-3-642-04329-1_17.

[37] Michael N Cantor and Yves A Lussier. Putting data integration into practice: using biomedical terminologies to add structure to existing data sources. AMIA Annu Symp Proc, pages 125–129, 2003.

[38] Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, and Alberto Pascual-Montano. Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics, 7:41, 2006. doi: 10.1186/1471-2105-7-41.

[39] B Chen, X Dong, D Jiao, H Wang, Q Zhu, Y Ding, and D Wild. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11:255, 2010. doi: 10.1186/1471-2105-11-255.

[40] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud'hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to semantic web query federation in the life sciences. BMC Bioinformatics, 10(Suppl 10):S10, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-S10-S10.

[41] Tim Clark, Sean Martin, and Ted Liefeld. Globally distributed object identification for biological knowledgebases. Brief Bioinformatics, 5(1):59–70, 2004. doi: 10.1093/bib/5.1.59.

[42] Paulo C.G. Costa and Kathryn B. Laskey. PR-OWL: A framework for Bayesian ontologies. In Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS 2006), November 9–11, 2006, Baltimore, MD, USA. IOS Press. URL http://hdl.handle.net/1920/1734.

[43] Susan Crawford and Loretta Stucki. Peer review and the changing research record. Journal of the American Society for Information Science, 41(3):223–228, 1990. ISSN 1097-4571. doi: 10.1002/(SICI)1097-4571(199004)41:3<223::AID-ASI14>3.0.CO;2-3.

[44] Andreas Doms and Michael Schroeder. GoPubMed: exploring PubMed with the Gene Ontology. Nucl. Acids Res., 33(Supplement 2):W783–W786, 2005. doi: 10.1093/nar/gki470.

[45] Orri Erling and Ivan Mikhailov. RDF support in the Virtuoso DBMS. In Proceedings of the 1st Conference on Social Semantic Web (CSSW), pages 7–24. Springer, 2007. doi: 10.1007/978-3-642-02184-8_2.

[46] T. Etzold, H. Harris, and S. Beulah. SRS: An integration platform for databanks and analysis tools in bioinformatics. In Bioinformatics: Managing Scientific Data, chapter 5, pages 35–74. Elsevier, 2003.

[47] Y Fang, H Huang, H Chen, and H Juan. TCMGeneDIT: a database for associated traditional Chinese medicine, gene and disease information using text mining. BMC Complementary and Alternative Medicine, 8:58, 2008. doi: 10.1186/1472-6882-8-58.

[48] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E. Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bateman. The Pfam protein families database. Nucleic Acids Research, 38(suppl 1):D211–D222, 2010. doi: 10.1093/nar/gkp985.

[49] R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, and I. Roberts. Integrating text mining into distributed bioinformatics workflows: a web services implementation. In Services Computing, 2004 (SCC 2004), Proceedings, 2004 IEEE International Conference on, pages 145–152, 15–18 Sept. 2004. doi: 10.1109/SCC.2004.1358001.

[50] Michael Y. Galperin. The molecular biology database collection: 2008 update. Nucl. Acids Res., 36(suppl 1):D2–D4, 2008. doi: 10.1093/nar/gkm1037.

[51] Robert Gentleman. Reproducible research: A bioinformatics case study. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/1544-6115.1034.

[52] Yolanda Gil and Donovan Artz. Towards content trust of web resources. Web Semantics: Science, Services and Agents on the World Wide Web, 5(4):227–239, December 2007. doi: 10.1016/j.websem.2007.09.005.

[53] C. A. Goble, R. D. Stevens, G. Ng, S. Bechhofer, N. W. Paton, P. G. Baker, M. Peim, and A. Brass. Transparent access to multiple bioinformatics information sources. IBM Syst. J., 40(2):532–551, 2001. ISSN 0018-8670. doi: 10.1147/sj.402.0532.

[54] Peter Godfrey-Smith. Theory and Reality: An Introduction to the Philosophy of Science. University of Chicago Press, 2003.

[55] Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690, May 2007. doi: 10.1073/pnas.0701361104.

[56] Jennifer Golbeck. Weaving a web of trust. Science, 321(5896):1640–1641, 2008. doi: 10.1126/science.1163357.

[57] Mike Graves, Adam Constabaris, and Dan Brickley. FOAF: Connecting people on the semantic web. Cataloging & Classification Quarterly, 43(3):191–202, 2007. ISSN 0163-9374. doi: 10.1300/J104v43n03_10.

[58] Tom Gruber. Collective knowledge systems: Where the social web meets the semantic web. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):4–13, February 2008. doi: 10.1016/j.websem.2007.11.011.

[59] Miguel Esteban Gutiérrez, Isao Kojima, Said Mirza Pahlevi, Oscar Corcho, and Asunción Gómez-Pérez. Accessing RDF(S) data resources in service-based grid infrastructures. Concurrency and Computation: Practice and Experience, 21(8):1029–1051, 2009. doi: 10.1002/cpe.1409.

[60] LM Haas, PM Schwarz, and P Kodali. DiscoveryLink: a system for integrated access to life sciences data sources. IBM Syst. J., 40:489–511, 2001. URL http://portal.acm.org/citation.cfm?id=1017236.

[61] Carole D. Hafner and Natalya Fridman Noy. Ontological foundations for biology knowledge models. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pages 78–87. AAAI Press, 1996. ISBN 1-57735-002-2.

[62] MA Harris, J Clark, A Ireland, J Lomax, M Ashburner, R Foulger, K Eilbeck, S Lewis, B Marshall, and C Mungall. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, 32(1):D258–D261, 2004. doi: 10.1093/nar/gkh036.

[63] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. Lecture Notes in Computer Science, 4825:211, 2007. doi: 10.1007/978-3-540-76298-0_16.

[64] Olaf Hartig and Jun Zhao. Using web data provenance for quality assessment. In Proceedings of the Workshop on Semantic Web and Provenance Management at ISWC, 2009.

[65] O Hassanzadeh, A Kementsietsidis, L Lim, RJ Miller, and M Wang. LinkedCT: A linked data space for clinical trials. arXiv:0908.0567v1, August 2009. URL http://arxiv.org/abs/0908.0567.

[66] Tom Heath and Enrico Motta. Ease of interaction plus ease of integration: Combining web 2.0 and the semantic web in a reviewing site. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1):76–83, February 2008. doi: 10.1016/j.websem.2007.11.009.

[67] John J. Helly, T. Todd Elvins, Don Sutton, David Martinez, Scott E. Miller, Steward Pickett, and Aaron M. Ellison. Controlled publication of digital scientific data. Commun. ACM, 45(5):97–101, 2002. ISSN 0001-0782. doi: 10.1145/506218.506222.

[68] Martin Hepp. GoodRelations: An ontology for describing products and services offers on the web. In Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW2008), September 29 – October 3, 2008, Acitrezza, Italy, volume 5268, pages 332–347. Springer LNCS, 2008. doi: 10.1007/978-3-540-87696-0_29.

[69] Katherine G. Herbert and Jason T.L. Wang. Biological data cleaning: a case study. International Journal of Information Quality, 1(1):60–82, 2007. doi: 10.1504/IJIQ.2007.013376.

[70] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, January 1998. doi: 10.1023/A:1009761603038.

[71] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009. URL http://research.microsoft.com/en-us/collaboration/fourthparadigm/.

[72] Shahriyar Hossain and Hasan Jamil. A visual interface for on-the-fly biological database integration and workflow design using VizBuilder. Data Integration in the Life Sciences, pages 157–172, 2009. doi: 10.1007/978-3-642-02879-3_13.

[73] Gareth Hughes, Hugo Mills, David De Roure, Jeremy G. Frey, Luc Moreau, M. C. Schraefel, Graham Smith, and Ed Zaluska. The semantic smart laboratory: a system for supporting the chemical eScientist. Org. Biomol. Chem., 2:3284–3293, 2004. doi: 10.1039/B410075A.

[74] Zachary Ives. Data integration and exchange for scientific collaboration. Data Integration in the Life Sciences, pages 1–4, 2009. doi: 10.1007/978-3-642-02879-3_1.

[75] Hai Jin, Aobing Sun, Ran Zheng, Ruhan He, and Qin Zhang. Ontology-based semantic integration scheme for medical image grid. Cluster Computing and the Grid, IEEE International Symposium on, pages 127–134, 2007. doi: 10.1109/CCGRID.2007.77.

[76] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research, 38(suppl 1):D355–D360, 2010. doi: 10.1093/nar/gkp896.

[77] P. D. Karp. An ontology for biological function based on molecular interactions. Bioinformatics, 16(3):269–285, Mar 2000. doi: 10.1093/bioinformatics/16.3.269.

[78] Marc Kenzelmann, Karsten Rippe, and John S Mattick. RNA: Networks & imaging. Molecular Systems Biology, 2(44), 2006. doi: 10.1038/msb4100086.

[79] Jacob Köhler, Stephan Philippi, and Matthias Lange. SemeDa: ontology based semantic integration of biological databases. Bioinformatics, 19(18):2420–2427, Dec 2003. doi: 10.1093/bioinformatics/btg340.

[80] Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol, 6, January 2010. doi: 10.1038/msb.2009.98.

[81] Thomas Samuel Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, 3rd edition, 1996.

[82] Camille Laibe and Nicolas Le Novere. MIRIAM resources: tools to generate and resolve robust cross-references in systems biology. BMC Systems Biology, 1(1):58, 2007. ISSN 1752-0509. doi: 10.1186/1752-0509-1-58.

[83] P. Lambrix and V. Jakoniene. Towards transparent access to multiple biological databanks. In Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003 - Volume 19, pages 53–60. Australian Computer Society, Inc., Darlinghurst, Australia, 2003. URL http://portal.acm.org/citation.cfm?id=820189.820196.

[84] Ning Lan, Gaetano T Montelione, and Mark Gerstein. Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Current Opinion in Chemical Biology, 7(1):44–54, February 2003. doi: 10.1016/S1367-5931(02)00020-0.

[85] Andreas Langegger, Martin Blochl, and Wolfram Woss. Sharing data on the grid using ontologies and distributed SPARQL queries. In 18th International Conference on Database and Expert Systems Applications (DEXA '07), pages 450–454, 2007. doi: 10.1109/DEXA.2007.4312934.

[86] A. Lash, W.-J. Lee, and L. Raschid. A methodology to enhance the semantics of links between PubMed publications and markers in the human genome. In Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2005), pages 185–192, Minneapolis, Minnesota, USA, October 2005. doi: 10.1109/BIBE.2005.3.

[87] D. Le-Phuoc, A. Polleres, M. Hauswirth, G. Tummarello, and C. Morbidoni. Rapid prototyping of semantic mash-ups through semantic web pipes. In Proceedings of the 18th International Conference on World Wide Web, pages 581–590. ACM, New York, NY, USA, 2009. URL http://www2009.org/proceedings/pdf/p581.pdf.

[88] Timothy Lebo, John S. Erickson, Li Ding, Alvaro Graves, Gregory Todd Williams, Dominic DiFranzo, Xian Li, James Michaelis, Jin Guang Zheng, Johanna Flores, Zhenning Shangguan, Deborah L. McGuinness, and Jim Hendler. Producing and using linked open government data in the TWC LOGD portal (to appear). In David Wood, editor, Linking Government Data. Springer, New York, NY, 2011.

[89] Peter Li, Yuhui Chen, and Alexander Romanovsky. Measuring the dependability of web services for use in e-science experiments. In Dave Penkler, Manfred Reitenspiess, and Francis Tam, editors, Service Availability, volume 4328 of Lecture Notes in Computer Science, pages 193–205. Springer Berlin / Heidelberg, 2006. doi: 10.1007/11955498_14.

[90] Sarah Cohen-Boulakia and Frédérique Lisacek. Proteome informatics II: Bioinformatics for comparative proteomics. Proteomics, 6(20):5445–5466, 2006. doi: 10.1002/pmic.200600275.

[91] Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler, Duncan Hull, Carole Goble, and Lincoln Stein. Applying semantic web services to bioinformatics: Experiences gained, lessons learnt. In Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, editors, The Semantic Web – ISWC 2004, volume 3298 of Lecture Notes in Computer Science, pages 350–364. Springer Berlin / Heidelberg, 2004. doi: 10.1007/978-3-540-30475-3_25.

[92] Stefan Maetschke, Michael W. Towsey, and James M. Hogan. BioPatML: pattern sharing for the genomic sciences. In 2008 Microsoft eScience Workshop, 7–9 December 2008, University Place Conference Center & Hotel, Indianapolis, 2008. URL http://eprints.qut.edu.au/27327/.

[93] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., 35:D26–D31, 2007. doi: 10.1093/nar/gkl993.

[94] Sylvain Mathieu, Isabelle Boutron, David Moher, Douglas G. Altman, and Philippe Ravaud. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA, 302(9):977–984, 2009. doi: 10.1001/jama.2009.1242.

[95] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting causal graphs from an open provenance data model. Concurrency and Computation: Practice and Experience, 20(5):577–586, 2007. doi: 10.1002/cpe.1236.

[96] J. Mingers. Real-izing information systems: Critical realism as an underpinning philosophy for information systems. Information and Organization, 14(2):87–103, April 2004. doi: 10.1016/j.infoandorg.2003.06.001.

[97] Olivo Miotto, Tin Tan, and Vladimir Brusic. Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses. BMC Bioinformatics, 9(Suppl 1):S7, 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-S1-S7.

[98] Barend Mons. Which gene did you mean? BMC Bioinformatics, 6:142, 2005. doi: 10.1186/1471-2105-6-142.

[99] Michael Mrissa, Chirine Ghedira, Djamal Benslimane, Zakaria Maamar, Florian Rosenberg, and Schahram Dustdar. A context-based mediation approach to compose semantic web services. ACM Trans. Internet Technol., 8(1):4, 2007. ISSN 1533-5399. doi: 10.1145/1294148.1294152.

[100] C. J. Mungall, D. B. Emmert, and the FlyBase Consortium. A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics, 23:i337–i346, 2007. doi: 10.1093/bioinformatics/btm189.

[101] Ph. Mylonas, D. Vallet, P. Castells, M. Fernández, and Y. Avrithis. Personalized information retrieval based on context and ontological knowledge. The Knowledge Engineering Review, 23(Special Issue 01):73–100, 2008. doi: 10.1017/S0269888907001282.

[102] Rex Nelson, Shulamit Avraham, Randy Shoemaker, Gregory May, Doreen Ware, and Damian Gessler. Applications and methods utilizing the Simple Semantic Web Architecture and Protocol (SSWAP) for bioinformatics resource discovery and disparate data and service integration. BioData Mining, 3(1):3, 2010. ISSN 1756-0381. doi: 10.1186/1756-0381-3-3.

[103] Natalya F. Noy. Semantic integration: a survey of ontology-based approaches. SIGMOD Rec., 33(4):65–70, 2004. ISSN 0163-5808. doi: 10.1145/1041410.1041421.

[104] P. V. Ogren, K. B. Cohen, G. K. Acquaah-Mensah, J. Eberlein, and L. Hunter. The compositional structure of Gene Ontology terms. Pac Symp Biocomput, 9:214–225, 2004. URL http://psb.stanford.edu/psb-online/proceedings/psb04/ogren.pdf.

[105] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert D. Stevens, Anil Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10):1067–1100, 2006. ISSN 1532-0626. doi: 10.1002/cpe.993.

[106] A. Paschke. Rule Responder HCLS eScience infrastructure. In Proceedings of the 3rd International Conference on the Pragmatic Web: Innovating the Interactive Society, pages 59–67. ACM, New York, NY, USA, 2008. doi: 10.1145/1479190.1479199.

[107] C. Pasquier. Biological data integration using semantic web technologies. Biochimie, 90(4):584–594, 2008. doi: 10.1016/j.biochi.2008.02.007.

[108] Steve Pettifer, David Thorne, Philip McDermott, James Marsh, Alice Villeger, Douglas Kell, and Teresa Attwood. Visualising biological data: a semantic approach to tool and database integration. BMC Bioinformatics, 10(Suppl 6):S19, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-S6-S19.

[109] William Pike and Mark Gahegan. Beyond ontologies: Toward situated representations of scientific knowledge. International Journal of Human-Computer Studies, 65(7):674–688, July 2007. doi: 10.1016/j.ijhcs.2007.03.002.

[110] Andreas Prlić, Ewan Birney, Tony Cox, Thomas Down, Rob Finn, Stefan Gräf, David Jackson, Andreas Kähäri, Eugene Kulesha, Roger Pettett, James Smith, Jim Stalker, and Tim Hubbard. The distributed annotation system for integration of biological data. In Data Integration in the Life Sciences, pages 195–203, 2006. doi: 10.1007/11799511_17.

[111] B. Quilitz and U. Leser. Querying distributed RDF data sources with SPARQL. Lecture Notes in Computer Science, 5021:524, 2008. doi: 10.1007/978-3-540-68234-9_39.

[112] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23:2000, 2000. URL http://sites.computer.org/debull/A00DEC-CD.pdf.

[113] Lila Rao and Kweku-Muata Osei-Bryson. Towards defining dimensions of knowledge systems quality. Expert Systems with Applications, 33(2):368–378, 2007. ISSN 0957-4174. doi: 10.1016/j.eswa.2006.05.003.

[114] N. Redaschi. UniProt in RDF: Tackling data integration and distributed annotation with the semantic web. Nature Precedings, 2009. doi: 10.1038/npre.2009.3193.1.

[115] Michael Rosemann, Jan C. Recker, Christian Flender, and Peter D. Ansell. Understanding context-awareness in business process design. In 17th Australasian Conference on Information Systems, Adelaide, Australia, December 2006. URL http://eprints.qut.edu.au/6160/.

[116] Joseph S. Ross, Gregory K. Mulvey, Elizabeth M. Hines, Steven E. Nissen, and Harlan M. Krumholz. Trial publication after registration in ClinicalTrials.gov: A cross-sectional analysis. PLoS Med, 6(9):e1000144, September 2009. doi: 10.1371/journal.pmed.1000144.

[117] A. Ruttenberg, J.A. Rees, M. Samwald, and M.S. Marshall. Life sciences on the semantic web: the Neurocommons and beyond. Briefings in Bioinformatics, 10(2):193, 2009. doi: 10.1093/bib/bbp004.

[118] Satya S Sahoo, Olivier Bodenreider, Joni L Rutter, Karen J Skinner, and Amit P Sheth. An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence. Journal of Biomedical Informatics, Feb 2008. doi: 10.1016/j.jbi.2008.02.006.

[119] Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, and Amit Sheth. Provenance Context Entity (PaCE): Scalable provenance tracking for scientific RDF datasets. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM 2010), 2010. URL http://knoesis.wright.edu/library/download/ProvenanceTracking_PaCE.pdf.

[120] Joel Saltz, Scott Oster, Shannon Hastings, Stephen Langella, Tahsin Kurc, William Sanchez, Manav Kher, Arumani Manisundaram, Krishnakant Shanbhag, and Peter Covitz. caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics, 22(15):1910–1916, 2006. doi: 10.1093/bioinformatics/btl272.

[121] Matthias Samwald, Anja Jentzsch, Christopher Bouton, Claus Kallesoe, Egon Willighagen, Janos Hajagos, M Marshall, Eric Prud'hommeaux, Oktie Hassanzadeh, Elgar Pichler, and Susie Stephens. Linked open drug data for pharmaceutical research and development. Journal of Cheminformatics, 3(1):19, 2011. ISSN 1758-2946. doi: 10.1186/1758-2946-3-19.

[122] Marco Schorlemmer and Yannis Kalfoglou. Institutionalising ontology-based semantic integration. Applied Ontology, 3(3):131–150, 2008. ISSN 1570-5838. URL http://eprints.ecs.soton.ac.uk/18563/.

[123] Ronald Schroeter, Jane Hunter, and Andrew Newman. The Semantic Web: Research and Applications, volume 4519, chapter Annotating Relationships Between Multiple Mixed-Media Digital Objects by Extending Annotea, pages 533–548. Springer Berlin / Heidelberg, 2007. doi: 10.1007/978-3-540-72667-8_38.

[124] S. Schulze-Kremer. Adding semantics to genome databases: towards an ontology for molecular biology. Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, 5:272–275, 1997. URL https://www.aaai.org/Papers/ISMB/1997/ISMB97-042.pdf.

[125] Ruth L. Seal, Susan M. Gordon, Michael J. Lush, Mathew W. Wright, and Elspeth A. Bruford. genenames.org: the HGNC resources in 2011. Nucleic Acids Research, 2010. doi: 10.1093/nar/gkq892.

[126] Hagit Shatkay and Ronen Feldman. Mining the biomedical literature in the genomic era: an overview. J Comput Biol, 10(6):821–855, 2003. doi: 10.1089/106652703322756104.

[127] E. Patrick Shironoshita, Yves R Jean-Mary, Ray M Bradley, and Mansur R Kabuka. semCDI: A query formulation for semantic data integration in caBIG. J Am Med Inform Assoc, Apr 2008. doi: 10.1197/jamia.M2732.

[128] David Shotton. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2):85–94, April 2009. doi: 10.1087/2009202.

[129] Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005. doi: 10.1145/1084805.1084812.

[130] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and A. Kasprzyk. BioMart – biological queries made easy. BMC Genomics, 10(1):22, 2009. doi: 10.1186/1471-2164-10-22.

[131] Miriam Solomon. Groupthink versus the wisdom of crowds: The social epistemology of deliberation and dissent. The Southern Journal of Philosophy, 44(S1):28–42, 2006. doi: 10.1111/j.2041-6962.2006.tb00028.x.

[132] Young C. Song, Edward Kawas, Ben M. Good, Mark D. Wilkinson, and Scott J. Tebbutt. DataBiNS: a BioMoby-based data-mining workflow for biological pathways and non-synonymous SNPs. Bioinformatics, 23(6):780–782, Jan 2007. doi: 10.1093/bioinformatics/btl648.

[133] Damires Souza. Using Semantics to Enhance Query Reformulation in Dynamic Distributed Environments. PhD thesis, Universidade Federal de Pernambuco, April 2009. URL http://www.bdtd.ufpe.br/tedeSimplificado/tde_arquivos/22/TDE-2009-07-06T123119Z-6016/Publico/dysf.pdf.

[134] Irena Spasić and Sophia Ananiadou. Using automatically learnt verb selectional preferences for classification of biomedical terms. J Biomed Inform, 37(6):483–497, Dec 2004. doi: 10.1016/j.jbi.2004.08.002.

[135] Robert D. Stevens, C Goble, I Horrocks, and S Bechhofer. Building a bioinformatics ontology using OIL. IEEE Transactions on Information Technology in Biomedicine, 6:135–141, June 2002. ISSN 1089-7771. doi: 10.1109/TITB.2002.1006301.

[136] Ljiljana Stojanovic, Alexander Maedche, Boris Motik, and N. Stojanovic. User-driven ontology evolution management. In Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management EKAW, pages 53–62, Madrid, Spain, 2002. URL http://www.fzi.de/ipe/publikationen.php?id=804.

[137] John Strassner, Sven van der Meer, Declan O'Sullivan, and Simon Dobson. The use of context-aware policies and ontologies to facilitate business-aware network management. Journal of Network and Systems Management, 17(3):255–284, September 2009. doi: 10.1007/s10922-009-9126-4.

[138] James Surowiecki. The Wisdom of Crowds. Random House, New York, 2004. ISBN 0-385-72170-6.

[139] The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 38(suppl 1):D142–D148, 2010. doi: 10.1093/nar/gkp846.

[140] Todd J. Vision. Open data and the social contract of scientific publishing. BioScience, 60(5):330–331, 2010. ISSN 0006-3568. doi: 10.1525/bio.2010.60.5.2.

[141] Stephan Weise, Christian Colmsee, Eva Grafahrend-Belau, Björn Junker, Christian Klukas, Matthias Lange, Uwe Scholz, and Falk Schreiber. An integration and analysis pipeline for systems biology in crop plant metabolism. Data Integration in the Life Sciences, pages 196–203, 2009. doi: 10.1007/978-3-642-02879-3_16.

[142] Patricia L. Whetzel, Helen Parkinson, Helen C. Causton, Liju Fan, Jennifer Fostel, Gilberto Fragoso, Laurence Game, Mervi Heiskanen, Norman Morrison, Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor, Joseph White, and Christian J. Stoeckert Jr. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics, 22(7):866–873, 2006. doi: 10.1093/bioinformatics/btl005.

[143] Mark D Wilkinson, Benjamin Vandervalk, and Luke McCarthy. SADI semantic web services – 'cause you can't always get what you want! In Proceedings of the IEEE APSCC Workshop on Semantic Web Services in Practice (SWSIP 2009), pages 13–18, 2009. doi: 10.1109/APSCC.2009.5394148.

[144] David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research, 36(suppl 1):D901–D906, 2008. doi: 10.1093/nar/gkm958.

[145] Limsoon Wong. Technologies for integrating biological data. Brief Bioinform, 3(4):389–404, 2002. doi: 10.1093/bib/3.4.389.

[146] Alexander C. Yu. Methods in biomedical ontology. Journal of Biomedical Informatics, 39(3):252–266, June 2006. doi: 10.1016/j.jbi.2005.11.006.

[147] J. Zemánek, S. Schenk, and V. Svátek. Optimizing SPARQL queries over disparate RDF data sources through distributed semi-joins. In Proceedings of ISWC 2008 Poster and Demo Session, 2008. URL http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-401/iswc2008pd_submission_69.pdf.

[148] Jun Zhao, Carole Goble, Robert Stevens, and Daniele Turi. Mining Taverna's semantic web of provenance. Concurrency and Computation: Practice and Experience, Online Publication, 2007. doi: 10.1002/cpe.1231.

[149] Jun Zhao, Graham Klyne, and David Shotton. Provenance and linked data in biological data webs. In Linked Open Data Workshop at the 17th International World Wide Web Conference, 2008. URL http://data.semanticweb.org/workshop/LDOW/2008/paper/4.

[150] Jun Zhao, Alistair Miles, Graham Klyne, and David Shotton. OpenFlyData: The way to go for biological data integration. Data Integration in the Life Sciences, pages 47–54, 2009. doi: 10.1007/978-3-642-02879-3_5.